Flat Datacenter Storage


1 Flat Datacenter Storage
Edmund B. Nightingale, Jeremy Elson, Jinliang Fan, Owen Hofmann, Jon Howell, and Yutaka Suzue. Presented by Kuangyuan Chen and Qi Gao. EECS 582 – W16

2 Outline
Motivation
Overview
Distributed Metadata Management
Dynamic Work Allocation
Replication and Failure Recovery
Performance Evaluation

3 Motivation: move computation to data?
because of bandwidth shortage in the datacenter, e.g. MapReduce
locality constraints hinder efficient resource utilization: stragglers, retasking
As we saw in the MapReduce paper, network bandwidth is a scarce resource. This restriction leads to MapReduce's design of moving computation to data. However, this optimization is not free: we gain performance through locality at the expense of efficient resource utilization. One example is stragglers. If there is a slow machine, the entire job cannot complete until that machine finishes, while most other machines sit idle. The common solution is to re-execute the task on another machine, but because of the locality constraint, the data must be moved first before the task can be restarted there. This is expensive, yet worthwhile if network bandwidth is scarce. But what if datacenters are no longer short of network bandwidth?

4 Motivation: Clos network supports full bisection bandwidth
A Clos network can provide full bisection bandwidth, which makes datacenter bandwidth abundant. A typical Clos network looks like this: on the two sides are top-of-rack (TOR) routers, and each of them is connected to many spine routers in the middle. Network traffic can be balanced across the spine routers, thus providing full bisection bandwidth.

5 Flat Datacenter Storage
All compute nodes can access all data with equal throughput
Simple and easy to program
All data operations are remote
All machines have as much network bandwidth as disk bandwidth
Flat Datacenter Storage is built on the assumption that network bandwidth is abundant. In FDS, all compute nodes can access all data with equal throughput, so applications are written without any consideration of locality. More specifically, all data is stored on remote servers, and each machine has as much network bandwidth as disk bandwidth.

6 Overview
Logically Centralized Storage Array
Blob: byte sequence named with a GUID (e.g. blob 0x5fab97ffda5c7c00 is laid out as tract -1 holding the per-blob metadata, then tract 0, tract 1, ..., tract N)
Tract: unit of read and write, constant sized (e.g. 8 MB)
FDS API: e.g. CreateBlob(), WriteTract(), ReadTract(); asynchronous/non-blocking, can be issued in parallel
Basically, FDS provides the abstraction of a logically centralized storage array. Data is logically stored in blobs, each named by a globally unique identifier. A blob contains a variable number of tracts. Reads and writes are done in units of tracts, which are fixed-size (8 MB in this paper). Users interact with FDS through a set of APIs, and these APIs are non-blocking: they can be issued in parallel, and the underlying storage system processes the requests concurrently.
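To make the call style concrete, here is a minimal, hypothetical sketch of how a client might use the calls named above. The Python-style binding, the client handle, and the callback signature are assumptions for illustration; only the call names and the non-blocking behavior come from the slide.

```python
# Hypothetical sketch only: the slide names CreateBlob/WriteTract/ReadTract and says
# they are non-blocking; the binding, handle, and callback signature are assumed.
def write_then_read(client, guid, buffers, on_done):
    blob = client.CreateBlob(guid)                 # a blob is identified by a GUID
    for i, buf in enumerate(buffers):
        client.WriteTract(blob, i, buf, on_done)   # all writes issued in parallel
    for i in range(len(buffers)):
        client.ReadTract(blob, i, on_done)         # reads are likewise asynchronous
```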

7 Distributed Metadata Management
Tractserver: a process that manages a disk; lays out tracts on the disk directly using the raw disk interface
Tract Locator Table (TLT): a list of active tractservers
Tract_Locator = (Hash(GUID) + i) mod TLT_Length
deterministic, and produces uniform disk utilization
One key feature of FDS is its distributed metadata management, achieved by several cooperating components. A tractserver is a process residing with a disk. It manages the disk and services reads and writes from clients. Tracts on the disk are accessed by the tractserver directly through the raw disk interface. So if a client wants to read a tract, in addition to the tract number, the only thing it needs to know is the address of that specific tractserver. This information is kept in a data structure called the tract locator table. A tractserver is located by computing the tract locator, a hash of the blob's GUID plus the tract number. One thing to notice is that the process of finding a tractserver is deterministic, which means we don't have to consult a centralized metadata server on every read and write; this eliminates the bottleneck of a centralized metadata server. Moreover, the hash function randomizes data accesses across tractservers, improving disk utilization.
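A minimal sketch of the locator computation, assuming a SHA-1 based hash of the GUID (the slide does not specify which hash function FDS uses):

```python
import hashlib

def tract_locator(blob_guid: bytes, tract_index: int, tlt_length: int) -> int:
    """Map (blob GUID, tract index) to a TLT row: (Hash(GUID) + i) mod TLT_Length."""
    digest = hashlib.sha1(blob_guid).digest()        # hash choice is an assumption
    guid_hash = int.from_bytes(digest[:8], "big")
    return (guid_hash + tract_index) % tlt_length    # deterministic, no metadata lookup
```

Because the mapping is deterministic, any client holding a copy of the TLT can find the responsible tractserver without contacting the metadata server.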

8 Tract Locator Table
(Example TLT: each row holds a locator (row number), a version number, and the tractserver addresses of the replicas; columns Locator, Version, Disk 1, Disk 2, Disk 3.)
Tract_Locator = (Hash(GUID) + i) mod TLT_Length
Tractserver versioning for failure recovery
Here is an example of the tract locator table. We compute the row number using the hash function. Each entry contains the addresses of the tractservers that hold the tract; there may be multiple tractservers per entry, which serve as replicas. The version number in each entry is for failure recovery, which Qi will introduce later. As you can see, the TLT only keeps information about tractservers, so its size is relatively small, which makes the design scalable.

9 Distributed Metadata Management (cont.)
Metadata Server
creates the TLT by balancing across tractservers
distributes the TLT to clients
assigns version numbers to tractservers
in the critical path only at client startup
The tract locator table is created and managed by the metadata server, which distributes the TLT to clients when they start. Since each client has a copy of the TLT, and in normal operation this table does not change, subsequent read and write operations go directly to tractservers without involving the metadata server. The metadata is thus effectively distributed, and clients can fully utilize the network bandwidth.

10 Dynamic Work Allocation
mitigate stragglers
decouple data and computation
assign work to workers dynamically and at fine granularity
reduce dispersion to the time of a single work unit
Another important feature is dynamic work allocation. Since data and computation are now decoupled, retasking has very low cost, so work can be reassigned with much greater flexibility. FDS divides work into small units and assigns them to clients dynamically: a client receives its next unit only upon completing the previous one (see the sketch below). So even if there is a straggler, only a small number of work units run on it, and its effect is not significant. Next, Qi will talk about replication and failure recovery.
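A minimal sketch of this allocation scheme, assuming a central queue of work units and worker threads standing in for client machines (the structure and names are illustrative, not FDS code):

```python
import queue
import threading

def run_job(work_units, clients, execute):
    """Hand out one unit at a time: a client gets its next unit only after
    finishing the previous one, so a straggler holds at most one unit."""
    pending = queue.Queue()
    for unit in work_units:
        pending.put(unit)

    def worker(client):
        while True:
            try:
                unit = pending.get_nowait()
            except queue.Empty:
                return                      # no work left
            execute(client, unit)           # data is remote, so any client can run any unit

    threads = [threading.Thread(target=worker, args=(c,)) for c in clients]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```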

11 Replication When a disk fails, redundant copies of the lost data are used to restore the data to full replication.

12 Replication As long as the lost data tracts are restored somewhere in the system, we are good.

13 Replication
All disk pairs appear in the table, giving O(n^2) table size
When a disk fails, the lost data can be recovered using the rest of the disks in parallel
(Example TLT with columns Locator, Disk 1, Disk 2, Disk 3: because every pair of disks shares some row, every surviving disk holds a replica of some of the failed disk's data and can help rebuild it.)
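A sketch of why the table is O(n^2): with two-way replication, one way to build the TLT is to list every ordered pair of disks, so a failed disk shares a row with every other disk and recovery is spread across all of them. This is an illustrative construction under that assumption, not necessarily the exact layout used by FDS.

```python
from itertools import permutations

def build_pairwise_tlt(disks):
    """Two-way replicated TLT in which every ordered pair of disks appears once.
    With n disks this gives n*(n-1) rows, i.e. O(n^2) table size; when a disk
    fails, every surviving disk can help rebuild its lost tracts in parallel."""
    return [list(pair) for pair in permutations(disks, 2)]

# Example: 4 disks -> 12 rows; disk "A" shares a row with each other disk.
print(build_pairwise_tlt(["A", "B", "C", "D"]))
```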

14 Failure Recovery - Metadata Server
Increment the version number of each row in which the failed tractserver appears
Pick random tractservers to fill the empty spaces in the TLT
Send the updated TLT assignments to every server affected by the changes
Wait for each tractserver to ack the new TLT assignments, then begin giving out the new TLT to clients when queried
(Before/after TLT example: affected rows have their version incremented and the failed server's slot filled by a randomly chosen tractserver.)
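The four steps above, expressed as a small illustrative sketch. The TLTRow type, the replacement policy details, and the function names are assumptions made for the example.

```python
import random
from dataclasses import dataclass

@dataclass
class TLTRow:
    version: int
    disks: list            # tractserver addresses holding this row's replicas

def handle_tractserver_failure(tlt, failed, live_servers):
    """Sketch of the metadata server's reaction to a failed tractserver."""
    affected = []
    for row in tlt:
        if failed in row.disks:
            row.version += 1                                  # step 1: bump the row version
            candidates = [s for s in live_servers if s not in row.disks]
            row.disks[row.disks.index(failed)] = random.choice(candidates)  # step 2
            affected.append(row)
    return affected        # steps 3-4: send these rows to the affected servers,
                           # wait for their acks, then serve the new TLT to clients
```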

15 Failure Recovery - Tract Server
When a tractserver receives an assignment of a new entry in the TLT, it contacts the other replicas and begins copying previously written tracts.
(TLT example: the newly assigned tractserver pulls the tracts for its new rows from the existing replicas.)

16 Failure Recovery - Client
All client operations are tagged with the TLT entry's version number. A tractserver that sees a stale version rejects the request, and the client fetches an updated TLT from the metadata server before retrying.
Handled cases: single tractserver failure; multiple tractserver failures; metadata server failure; metadata server and tractserver failing concurrently.
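A small sketch of the version check implied here, assuming the tractserver rejects stale requests so the client knows to refresh its TLT (parameter and return values are illustrative):

```python
def serve_request(server_row_version, request_version, do_io):
    """Tractserver-side check: requests carry the TLT row version the client used.
    A stale version means the client's TLT predates a recovery; reject it so the
    client re-fetches the TLT from the metadata server and retries."""
    if request_version != server_row_version:
        return "stale_tlt"      # client should refresh its TLT and retry
    return do_io()
```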

17 Evaluation

18 Evaluation Failure recovery time decreases as the number of disks grows!

19 Evaluation Question: how much of the speed gain comes from full bisection bandwidth versus FDS itself?

20 Conclusion Flat storage provides simplicity for applications.
Deterministic data placement enables distributed metadata management. Without locality constraints, dynamic work allocation increases utilization. Failure recovery is highly scalable and fast.

21 Q&A

