Improving Performance Through Horizontal Scaling
If a customer has found that the performance levels are acceptable, but wants to increase capacity by 25%, they could add another four 1 TB drives to each server, and will not generally experience performance degradation (i.e., each server would have sixteen 1 TB drives). (See Config. B, above.) Note that they do not need to upgrade to larger or more powerful hardware; they simply add 8 more inexpensive SATA drives.

On the other hand, if the customer is happy with 24 TB of capacity, but wants to double performance, they could distribute the drives among 4 servers rather than 2 (i.e., each server would have six 1 TB drives, rather than 12). Note that in this case, they are adding 2 more low-priced servers, and can simply redeploy existing drives. (See Config. C, above.)

If they want to both quadruple performance and quadruple capacity, they could distribute the drives among 8 servers (i.e., each server would have twelve 1 TB drives). (See Config. D, below.)
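The capacity and (to a first approximation) performance arithmetic behind these configurations can be sketched in a few lines. The drive counts per configuration are taken from the text above; the per-server throughput figure is a hypothetical placeholder, not a number from the paper. The point is only that capacity scales with total drives while aggregate performance scales roughly with server count.

```python
# Rough scale-out arithmetic for the configurations described above.
# Assumption: PER_SERVER_MBPS is illustrative only, not a measured value.

DRIVE_TB = 1              # 1 TB SATA drives
PER_SERVER_MBPS = 100     # hypothetical sustained throughput per server

configs = {
    "A (baseline)":         {"servers": 2, "drives_per_server": 12},
    "B (more capacity)":    {"servers": 2, "drives_per_server": 16},
    "C (2x performance)":   {"servers": 4, "drives_per_server": 6},
    "D (4x perf, 4x cap)":  {"servers": 8, "drives_per_server": 12},
}

for name, c in configs.items():
    capacity_tb = c["servers"] * c["drives_per_server"] * DRIVE_TB
    throughput = c["servers"] * PER_SERVER_MBPS   # scales with server count
    print(f"{name:20s} capacity={capacity_tb:3d} TB  aggregate~{throughput} MB/s")
```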
Increasing Capacity Through Horizontal Scaling

Note that by the time a solution has approximately 10 drives, the performance bottleneck has generally already moved to the network. (See Config. D, above.)

Increasing Data Integrity Through Replication
So, in order to maximize performance, we can upgrade from a 1 Gigabit Ethernet network to a 10 Gigabit Ethernet network. Note that performance in this example is more than 25x that which we saw in the baseline. This is evidenced by an increase in performance from 200 MB/s in the baseline configuration to 5,000 MB/s. (See Config. E, below.)
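A simple way to see why the bottleneck moves to the network, and why the 10 GbE upgrade helps, is to model aggregate throughput as the minimum of the disk and network limits on each server. The per-drive and per-link figures below are assumed, illustrative values, not measurements from the paper.

```python
# Aggregate throughput is capped by whichever is slower: the drives or the NIC.
# Assumed figures (illustrative only): ~100 MB/s per SATA drive,
# ~120 MB/s usable per 1 GbE link, ~1,200 MB/s usable per 10 GbE link.

def aggregate_mbps(servers, drives_per_server, nic_mbps, drive_mbps=100):
    per_server = min(drives_per_server * drive_mbps, nic_mbps)
    return servers * per_server

print(aggregate_mbps(2, 12, nic_mbps=120))    # 1 GbE: network-bound
print(aggregate_mbps(2, 12, nic_mbps=1200))   # 10 GbE: disk-bound (or close to it)
```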
As you will note, the power of the scale-out model is that both capacity and performance can scale linearly to meet requirements. It is not necessary to know what performance levels will be needed 2 or 3 years out. Instead, configurations can be easily adjusted as the need demands.

Improving Performance Through Faster Transfer Speeds with InfiniBand
Features

SOFTWARE ONLY
・Not dependent on the OS or on specific hardware
・Can also be deployed on Amazon Web Services or VMware

OPEN SOURCE
・AGPL license

COMPLETE STORAGE OS STACK
・Distributed memory management
・I/O scheduling
・Software RAID
・Self-healing
Features

USER SPACE (as opposed to KERNEL SPACE)
・Easy to install and upgrade
・No kernel expertise required; knowledge of C is enough

MODULAR, STACKABLE ARCHITECTURE
・Large numbers of large files
・Huge numbers of very small files
・Environments with cloud storage
・Various transport protocols
Features

DATA STORED IN NATIVE FORMATS
・Supports EXT3, EXT4, and XFS
・Shell scripting is possible thanks to POSIX compliance

NO METADATA WITH THE ELASTIC HASH ALGORITHM
・File locations are determined by the Elastic Hashing Algorithm
・i.e., no dedicated servers are needed to store metadata
・i.e., No Single Point Of Failure
Terminology

Brick: The unit of storage in glusterfs, expressed as a directory path on a server in the Trusted Storage Pool.
glusterd: The Gluster management daemon; it runs on every server in the Trusted Storage Pool.
Server / Client: Data is distributed across the Bricks of each Server. A Client accesses the data by mounting a Volume.
Trusted Storage Pool: A storage pool is a trusted network of storage servers.
Volume: A logical collection of Bricks; a Client mounts at the Volume level.
Volfile: The configuration file used by glusterfs processes: /etc/glusterd/vols/VOLNAME
Geo-Replication

Volume
・(Distributed) Striped Volume
・Geo-Replication

Replicate-Volume vs. Geo-Replication
・Replicate-Volume: mirroring across clusters; provides high-availability; synchronous replication
・Geo-Replication: mirroring across geographically dispersed clusters; backing up of data for disaster recovery; asynchronous replication
9.2.1. Exploring Geo-replication Deployment Scenarios

GlusterFS Geo-replication provides an incremental replication service over Local Area Networks (LANs), Wide Area Networks (WANs), and across the Internet. This section illustrates the most common deployment scenarios for GlusterFS Geo-replication, including the following:

Geo-replication over LAN
Geo-replication over WAN
Geo-replication over the Internet
Multi-site cascading Geo-replication

Geo-replication over LAN
You can configure GlusterFS Geo-replication to mirror data over a Local Area Network.

Geo-replication over WAN
You can configure GlusterFS Geo-replication to replicate data over a Wide Area Network.

Geo-replication over the Internet
You can configure GlusterFS Geo-replication to mirror data over the Internet.

Multi-site cascading Geo-replication
You can configure GlusterFS Geo-replication to mirror data in a cascading fashion across multiple sites.
9.2.2. GlusterFS Geo-replication Deployment Overview

Deploying GlusterFS Geo-replication involves the following steps:
1. Verify that GlusterFS v3.2 is installed and running on systems that will serve as masters, using the following command:
   # glusterfs --version
(Source: Gluster File System Administration Guide 3.2)
3. No Metadata Server Approach
CENTRALIZED METADATA SYSTEMS

Figure 4: Centralized Metadata Approach
3.2 DISTRIBUTED METADATA SYSTEMS

An alternative approach is to forego a centralized metadata server in favor of a distributed metadata approach. In this implementation, the index of location metadata is spread among a large number of storage systems. While this approach would appear on the surface to address the shortcomings of the centralized approach, it introduces an entirely new set of performance and availability issues.

1. Performance Overhead: Considerable performance overhead is introduced as the various distributed systems try to stay in sync with data via the use of various locking and synching mechanisms. Thus, most of the performance scaling issues that plague centralized metadata systems plague distributed metadata systems as well. Performance degrades as there is an increase in files, file operations, storage systems, disks, or the randomness of I/O operations. Performance similarly degrades as the average file size decreases. While some systems attempt to counterbalance these effects by creating dedicated solid state drives with high-performance internal networks for metadata, this approach can become prohibitively expensive.

2. Corruption Issues: Distributed metadata systems also face the potential for serious corruption issues that can impact the entire system. When metadata is stored in multiple locations, the requirement to maintain it synchronously also implies significant risk related to situations when the metadata is not properly kept in synch, or in the event it is actually damaged. The worst possible scenario involves apparently-successful updates to file data and metadata to separate locations, without correct synchronous maintenance of metadata, such that there is no longer perfect agreement among the multiple instances. Furthermore, the chances of a corrupted storage system increase exponentially with the number of systems. Thus, concurrency of metadata becomes a significant challenge.
Figure 5, below, illustrates a typical distributed metadata server implementation. It can be seen that this approach also results in considerable overhead processing for file access, and by design has built-in exposure to corruption scenarios. Here again we see a legacy approach to scale-out storage that is not congruent with the requirements of the modern data center or with the burgeoning migration to virtualization and cloud computing.
Figure 5: Decentralized Metadata Approach

3.3 AN ALGORITHMIC APPROACH (NO METADATA MODEL)

As we have seen so far, any system which separates data from location metadata introduces both performance and reliability concerns. Therefore, Gluster designed a system which does not separate metadata from data, and which does not rely on any separate metadata server, whether centralized or distributed. Instead, Gluster locates data algorithmically. Knowing nothing but the path name and file name, any storage system node and any client requiring read or write access to a file in a Gluster storage cluster performs a mathematical operation that calculates the file location. In other words, there is no need to separate location metadata from data, because the location can be determined independently. We call this the Elastic Hashing Algorithm, and it is key to many of the unique advantages of Gluster.

While a complete explanation of the Elastic Hashing Algorithm is beyond the scope of this document, the following is a simplified explanation that should illuminate some of its guiding principles.
ALGORITHMIC APPROACH (NO METADATA SERVER MODEL)

[Elastic Hashing Algorithm]
・Each operation locates metadata independently by running the algorithm, so it is fast.
・Locating metadata is largely unaffected by growth in data or in the number of servers, so performance scales nearly linearly.
・Inconsistencies caused by out-of-synch metadata are unlikely, yielding a safe distributed file system.
The benefits of the Elastic Hashing Algorithm are fourfold:
1. The algorithmic approach makes Gluster faster for each individual operation, because it calculates metadata using an algorithm, and that approach is faster than retrieving metadata from any storage media.
2. The algorithmic approach also means that Gluster is faster for large and growing individual systems, because there is never any contention for any single instance of metadata stored at only one location.
3. The algorithmic approach means Gluster is faster and achieves true linear scaling for distributed deployments, because each node is independent in its algorithmic handling of its own metadata, eliminating the need to synchronize metadata.
4. Most importantly, the algorithmic approach means that Gluster is safer in distributed deployments, because it eliminates all scenarios of risk which are derived from out-of-synch metadata (and that is arguably the most common source of significant risk to large bodies of distributed data).

To explain how the Elastic Hashing Algorithm works, we will examine each of the three words (algorithm, hashing, and elastic).

We are all familiar with an algorithmic approach to locating data. If a person goes into any office that stores physical documents in folders in filing cabinets, that person should be able to find the desired folder knowing only its name, because folders are filed in alphabetical order.
Similarly, one could implement an algorithmic approach to data storage that used a similar scheme to locate files. For example, in a ten-system cluster, one could assign files to disks alphabetically by name: files whose names fall in the earliest range of letters go to disk 1, the next range to disk 2, and so on up to disk 10. Figure 6, below, illustrates this concept.

Figure 6: Understanding EHA: Algorithm
(Location is determined by lexicographic order: "Acme" goes to Disk 1, "Do" goes to Disk 2.)
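A minimal sketch of this naive alphabetic scheme follows. The ten-disk split into letter ranges is my own illustrative choice, not taken from the figure; it simply reproduces the mapping noted above (Acme to Disk 1, Do to Disk 2).

```python
import string

NUM_DISKS = 10

def disk_for(filename, num_disks=NUM_DISKS):
    # Map the first letter (A..Z) onto contiguous ranges, one range per disk.
    first = filename[0].upper()
    index = string.ascii_uppercase.index(first)   # 0..25; raises for non-letters
    return index * num_disks // 26 + 1            # disk 1..num_disks

print(disk_for("Acme"))   # -> 1
print(disk_for("Do"))     # -> 2
```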
Because it is easy to calculate where a file is located, any client or storage system could locate a file based solely on its name. Because there is no need for a separate metadata store, the performance, scaling, and single point-of-failure issues are solved. Of course, an alphabetic algorithm would never work in practice. File names are not themselves unique, certain letters are far more common than others, we could easily get hotspots where a group of files with similar names are stored, etc.
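To make the hotspot concern concrete, here is a toy comparison, anticipating the hashing approach described in the next subsection. The sample file names are made up, and the four-disk split is illustrative; the point is only that a first-letter scheme clusters similarly named files on one disk, while a hash of the full name typically spreads them more evenly.

```python
import hashlib
from collections import Counter

# Made-up sample, deliberately skewed toward names starting with "A".
names = ["Acme.doc", "Adams.doc", "Allen.doc", "Archer.doc", "Anderson.doc",
         "Baker.doc", "Smith.doc", "Zhu.doc"]
NUM_DISKS = 4

def by_first_letter(name):
    return (ord(name[0].upper()) - ord("A")) * NUM_DISKS // 26

def by_hash(name):
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16) % NUM_DISKS

print("alphabetic:", Counter(by_first_letter(n) for n in names))  # clustered
print("hashed:    ", Counter(by_hash(n) for n in names))          # more even
```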
3.4 THE USE OF HASHING
To address some of the abovementioned shortcomings, you could use a hash-based algorithm. A hash is a mathematical function that converts a string of arbitrary length into a fixed-length value. People familiar with hash algorithms (e.g. the SHA-1 hashing function used in cryptography, or various URL shorteners like bit.ly) will know that hash functions are generally chosen for properties such as determinism (the same starting string will always result in the same ending hash) and uniformity (the ending results tend to be uniformly distributed mathematically). Gluster uses the Davies-Meyer hashing algorithm.

In the Gluster algorithmic approach, we take a given pathname/filename (which is unique in any directory tree) and run it through the hashing algorithm. Each pathname/filename results in a unique numerical result. For the sake of simplicity, one could imagine assigning all files whose hash ends in the number 1 to the first disk, all which end in the number 2 to the second disk, etc. Figure 7, below, illustrates this concept.
Figure 7: Understanding EHA: Hashing
(A hash-based algorithm keyed on pathname/filename; this example uses mod 10.)
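A hedged sketch of the hashing idea is below. It uses Python's standard hashlib rather than Gluster's actual Davies-Meyer implementation, and the mod-10 disk assignment mirrors the simplified figure, not the real on-disk layout.

```python
import hashlib

NUM_DISKS = 10

def disk_for(path, num_disks=NUM_DISKS):
    # Hash the full pathname/filename to a number. md5 stands in here for
    # Gluster's Davies-Meyer hash; only the principle is the same.
    digest = hashlib.md5(path.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_disks + 1   # disk 1..num_disks

for p in ["/reports/acme.doc", "/reports/q3.xls", "/music/song.mp3"]:
    print(p, "-> disk", disk_for(p))
```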
Of course, questions still arise. What if we add or delete physical disks? What if certain disks develop hotspots? To address these questions, we need the "elastic" part of the Elastic Hashing Algorithm.
3.5 MAKING IT ALL ELASTIC: PART I

In the real world, stuff happens. Disks fail, capacity is used up, files need to be redistributed, etc. Gluster addresses these challenges by:

1. Setting up a very large number of virtual volumes
2. Using the hashing algorithm to assign files to virtual volumes
3. Using a separate process to assign virtual volumes to multiple physical devices

Thus, when disks or nodes are added or deleted, the algorithm itself does not need to be changed. However, virtual volumes can be migrated or assigned to new physical locations as the need arises. Figure 8, below, illustrates the Gluster approach.
Figure 8: Understanding EHA: Elasticity
(The algorithm itself does not change; only the physical placement of virtual volumes changes.)

For most people, the preceding discussion should be sufficient for understanding the Elastic Hashing Algorithm. It oversimplifies in some respects for pedagogical purposes. (For example, each folder is actually assigned its own hash space.) Advanced discussion on Elastic Volume Management, Moving, or Renaming, and High Availability follows in the next section, Advanced Topics.
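A minimal sketch of the two-level mapping described above: files hash to one of many virtual volumes, and a separate, mutable table maps virtual volumes to physical nodes, so adding a node only updates that table and never changes the hash function. The counts and server names here are illustrative assumptions, not Gluster's actual internals.

```python
import hashlib

NUM_VIRTUAL_VOLUMES = 1000   # far more virtual volumes than physical nodes

def virtual_volume(path):
    # The hash -> virtual volume step never changes, even as hardware changes.
    digest = hashlib.md5(path.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_VIRTUAL_VOLUMES

# Separate mapping of virtual volumes to physical nodes (the "elastic" part).
nodes = ["server1", "server2"]
vvol_to_node = {v: nodes[v % len(nodes)] for v in range(NUM_VIRTUAL_VOLUMES)}

def locate(path):
    return vvol_to_node[virtual_volume(path)]

print(locate("/reports/acme.doc"))

# Adding a node: migrate some virtual volumes to it; the hash function is untouched.
nodes.append("server3")
for v in range(0, NUM_VIRTUAL_VOLUMES, 3):
    vvol_to_node[v] = "server3"

print(locate("/reports/acme.doc"))   # may now resolve to server3
```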