Flat Datacenter Storage


Flat Datacenter Storage
Edmund B. Nightingale, Jeremy Elson, Jinliang Fan, Owen Hofmann, Jon Howell, and Yutaka Suzue
Presented by Kuangyuan Chen and Qi Gao, EECS 582 – W16

Outline
- Motivation
- Overview
- Distributed Metadata Management
- Dynamic Work Allocation
- Replication and Failure Recovery
- Performance Evaluation

Motivation
- Move computation to data? Because of bandwidth shortage in the datacenter (e.g., MapReduce)
- Locality constraints hinder efficient resource utilization: stragglers, retasking

As we saw in the MapReduce paper, network bandwidth is a scarce resource. That restriction leads to MapReduce's design of moving computation to the data. However, this optimization is not free: we gain performance through locality at the expense of efficient resource utilization. Stragglers are one example. If there is one slow machine, the entire job cannot complete until that machine finishes, while most of the other machines sit idle. The common solution is to re-execute the task on another machine, but because of the locality constraint the data has to be moved first before the task can restart there. That cost is worthwhile as long as network bandwidth is scarce. But what if datacenters are no longer short of network bandwidth?

Motivation
- A Clos network supports full bisection bandwidth

A Clos network can provide full bisection bandwidth, which makes datacenter bandwidth abundant. In a typical Clos network, the Top-of-Rack (ToR) routers sit at the edges, and each of them is connected to many spine routers in the middle. Network traffic can be balanced across the different spine routers, thus providing full bisection bandwidth.

[Diagram: Top-of-Rack (ToR) routers connected to spine routers]

Flat Datacenter Storage
- All compute nodes can access all data with equal throughput
- Simple and easy to program
- All data operations are remote
- Every machine has as much network bandwidth as disk bandwidth

Flat Datacenter Storage is built on the assumption that network bandwidth is abundant. In FDS, all compute nodes can access all data with equal throughput, so applications are written without any consideration of locality. More specifically, all data are stored on remote servers, and every machine has as much network bandwidth as it has disk bandwidth.

Overview
- Logically centralized storage array
- Blob: a byte sequence named by a GUID
- Tract: the unit of read and write; constant size (e.g., 8 MB)
- FDS API: e.g., CreateBlob(), WriteTract(), ReadTract(); asynchronous/non-blocking, so calls can be issued in parallel

[Diagram: blob 0x5fab97ff da5c7c00 laid out as 8 MB tracts: per-blob metadata (Tract -1), then Tract 0, Tract 1, ..., Tract N]

Basically, FDS provides the abstraction of a logically centralized storage array. Data are logically stored in blobs, each named by a globally unique identifier. A blob contains a variable number of tracts. Reads and writes are done in units of tracts, which are fixed size (8 MB in this paper). Applications interact with FDS through a set of APIs, and these APIs are non-blocking: they can be issued in parallel, and the underlying storage system processes the requests concurrently.
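To make the API shape concrete, here is a minimal sketch of how a client might use it. The fds_client object, its method names (create_blob, write_tract, flush), and the callback signature are hypothetical stand-ins built around the CreateBlob()/WriteTract()/ReadTract() calls named on the slide, not the real FDS library.

```python
# Hypothetical sketch of FDS-style client usage; the client object and its
# methods are invented for illustration. The point from the slide: tract
# operations are asynchronous, so many of them can be in flight at once.
import uuid

TRACT_SIZE = 8 * 1024 * 1024   # tracts are fixed size, e.g. 8 MB

def on_write_done(blob_guid, tract_num, status):
    # Completion callback, invoked when a tractserver acknowledges the write.
    print(f"blob {blob_guid} tract {tract_num}: {status}")

def store_blob(fds_client, data: bytes) -> uuid.UUID:
    blob_guid = uuid.uuid4()                   # blobs are named by a GUID
    fds_client.create_blob(blob_guid)          # CreateBlob()
    # Issue every 8 MB tract write without waiting for the previous one;
    # the library is free to send them to different tractservers in parallel.
    for offset in range(0, len(data), TRACT_SIZE):
        fds_client.write_tract(blob_guid,
                               offset // TRACT_SIZE,
                               data[offset:offset + TRACT_SIZE],
                               callback=on_write_done)   # WriteTract(), non-blocking
    fds_client.flush()                         # block until outstanding writes finish
    return blob_guid
```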

Distributed Metadata Management
- Tractserver: a process that manages a disk; lays out tracts on the disk directly through the raw disk interface
- Tract Locator Table (TLT): a list of active tractservers
- Tract_Locator = (Hash(GUID) + i) mod TLT_Length
- Deterministic, and produces uniform disk utilization

One key feature of FDS is its distributed metadata management, which several components cooperate to achieve. A tractserver is a process that resides with a disk; it manages the disk and services reads and writes from clients, accessing tracts directly through the raw disk interface. So if a client wants to read a tract, then in addition to the tract number, the only thing it needs to know is the address of that specific tractserver. That information is kept in a data structure called the Tract Locator Table. A tractserver is located by computing the tract locator, a hash-based function of the blob's GUID and the tract number. Note that the process of finding a tractserver is deterministic: clients do not have to consult a centralized metadata server on every read and write, which eliminates the bottleneck such a server would become. Moreover, the hash function randomizes data accesses across the different tractservers, improving disk utilization.
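The locator arithmetic on the slide translates directly into code. Below is a minimal sketch; the choice of SHA-256 as the hash and the toy three-replica table are illustrative assumptions, not the paper's exact implementation.

```python
import hashlib

def tract_locator(blob_guid: bytes, tract_num: int, tlt_length: int) -> int:
    # Tract_Locator = (Hash(GUID) + i) mod TLT_Length
    # Only the GUID is hashed; adding the tract number afterwards spreads the
    # consecutive tracts of one blob across consecutive TLT rows.
    guid_hash = int.from_bytes(hashlib.sha256(blob_guid).digest()[:8], "big")
    return (guid_hash + tract_num) % tlt_length

# The lookup is deterministic and purely local: given a cached copy of the TLT,
# a client finds the responsible tractservers without asking any metadata server.
tlt = [{"version": 0, "servers": ["A", "G", "H"]} for _ in range(5)]  # toy table
row = tract_locator(b"\x5f\xab\x97\xff\xda\x5c\x7c\x00", 3, len(tlt))
print(row, tlt[row]["servers"])
```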

Tract Locator Table

[Table: an example TLT with columns Locator (row), Version, Disk 1, Disk 2, Disk 3; each row lists a version number and the addresses of the tractservers holding that tract (e.g., A, G, H in one row and B, D, F in another), continuing down to row TLT_Length.]

Tract_Locator = (Hash(GUID) + i) mod TLT_Length
Tractserver versioning is used for failure recovery.

Here is an example of the tract locator table. We compute the row number using the hash function. Each entry contains the addresses of the tractservers that hold the tract; when an entry lists multiple tractservers, they act as replicas. The version number in each entry is for failure recovery, which Qi will introduce later. As you can see, the TLT only keeps information about tractservers, not individual tracts, so it is relatively small, which makes the design scalable.

Distributed Metadata Management (cont.)
Metadata server:
- creates the TLT by balancing among tractservers
- distributes the TLT to clients
- assigns version numbers to tractservers
- is on the critical path only at client startup

The tract locator table is created and managed by the metadata server, which distributes it to clients when they start. Since each client has a copy of the TLT, and in the normal case the table does not change, subsequent read and write operations go directly to tractservers without going through the metadata server. The metadata is thus essentially distributed, and clients can fully utilize the network bandwidth.
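A rough client-side sketch of that startup flow, reusing the tract_locator() helper from the sketch above; the metadata-server and tractserver RPCs (get_tlt, read) are invented names used only for illustration.

```python
class FdsClient:
    """Sketch only: the metadata server is contacted once, at startup, to fetch
    the TLT; afterwards every read or write is routed by a local table lookup
    plus a direct RPC to a tractserver."""

    def __init__(self, metadata_server):
        self.tlt = metadata_server.get_tlt()    # one-time step on the critical path

    def read_tract(self, blob_guid: bytes, tract_num: int) -> bytes:
        row = self.tlt[tract_locator(blob_guid, tract_num, len(self.tlt))]
        server = row["servers"][0]               # any replica can serve a read
        return server.read(blob_guid, tract_num)  # no metadata server involved
```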

Dynamic Work Allocation
- Mitigates stragglers
- Decouples data and computation
- Assigns work to workers dynamically and at fine granularity
- Reduces dispersion to the time of a single work unit

Another important feature is dynamic work allocation. Since data and computation are now decoupled, retasking has a very low cost, so work can be reassigned with much greater flexibility. FDS divides work into small units and assigns them to clients dynamically: a client receives a new unit only upon completing the previous one. So even if there is a straggler, only a small number of work units will run on it, and its effect on the overall job is not significant. Next, Qi will talk about replication and failure recovery.
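The effect of fine-grained, on-demand assignment is easy to see in a small simulation. The queue-based scheme below is a generic illustration of the idea in the notes, not FDS code: each worker pulls the next small unit only when it finishes the previous one, so a slow worker naturally ends up holding very few units.

```python
import queue
import threading
import time

def run_dynamic(work_units, worker_delays):
    """Simulate workers of different speeds pulling small units from a shared queue."""
    q = queue.Queue()
    for unit in work_units:
        q.put(unit)

    completed = []                       # (worker index, units finished)

    def worker_loop(idx, delay):
        done = 0
        while True:
            try:
                q.get_nowait()           # ask for the next unit on demand
            except queue.Empty:
                completed.append((idx, done))
                return
            time.sleep(delay)            # simulate processing one unit
            done += 1

    threads = [threading.Thread(target=worker_loop, args=(i, d))
               for i, d in enumerate(worker_delays)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(completed)

# Seven fast workers and one straggler: the straggler finishes only a handful
# of units, so the total completion time is barely affected.
print(run_dynamic(range(200), [0.001] * 7 + [0.01]))
```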

Replication
When a disk fails, redundant copies of the lost data are used to restore the data to full replication.

Replication
As long as the lost data tracts are restored somewhere in the system, we are good.

Replication
- All disk pairs appear in the table
- O(n^2) table size
- When a disk fails, the lost data can be recovered using all of the remaining disks in parallel

[Table: an example TLT with columns Locator, Disk 1, Disk 2, Disk 3, rows 1 through 1236, illustrating that every pair of disks appears together in some row.]
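A sketch of the all-pairs idea: with n disks there is a row for every pair, so the table grows as O(n^2), and when any one disk fails every other disk holds a replica of some of its data and can help rebuild it in parallel. The sketch below uses unordered pairs (n(n-1)/2 rows) and two-way replication as a simplification; the paper's full scheme extends each row for higher replication levels.

```python
import itertools

def build_all_pairs_tlt(disks):
    # One row per unordered pair of disks => O(n^2) rows.
    return [{"version": 0, "disks": [a, b]}
            for a, b in itertools.combinations(disks, 2)]

disks = [f"disk{i}" for i in range(6)]
tlt = build_all_pairs_tlt(disks)
print(len(tlt))                  # 15 rows for 6 disks: n*(n-1)/2 with unordered pairs

# If disk3 fails, every surviving disk shares a row with it, so all of them
# can take part in re-replicating its data at the same time.
helpers = {d for row in tlt if "disk3" in row["disks"]
             for d in row["disks"] if d != "disk3"}
print(sorted(helpers))           # ['disk0', 'disk1', 'disk2', 'disk4', 'disk5']
```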

Failure Recovery - Metadata Server
1. Increment the version number of each row in which the failed tractserver appears
2. Pick random tractservers to fill the empty spaces in the TLT
3. Send the updated TLT assignments to every server affected by the changes
4. Wait for each tractserver to acknowledge the new TLT assignments, and only then begin giving out the new TLT to clients when queried for it

[Diagram: before-and-after TLT snapshots in which the versions of the affected rows are incremented (e.g., 8 to 9, 17 to 18, 324 to 325, 456 to 457) and the failed tractserver's slots are filled with randomly chosen replacements.]
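The four steps listed above translate almost directly into code. The sketch below is a schematic of that sequence under invented data structures; it returns the assignments that would be pushed out in step 3, and the caller would wait for acks (step 4) before serving the new TLT to clients.

```python
import random

def handle_tractserver_failure(tlt, failed, live_tractservers):
    """Schematic of the metadata server's recovery steps (structures invented)."""
    assignments = []
    for row in tlt:
        if failed not in row["servers"]:
            continue
        row["version"] += 1                                      # step 1: bump version
        idx = row["servers"].index(failed)
        candidates = [s for s in live_tractservers if s not in row["servers"]]
        row["servers"][idx] = random.choice(candidates)          # step 2: random fill
        assignments += [(s, dict(row)) for s in row["servers"]]  # step 3: notify all
    return assignments

# Toy example: tractserver "B" fails in a three-row table.
tlt = [{"version": 8, "servers": ["A", "B", "C"]},
       {"version": 17, "servers": ["D", "E", "F"]},
       {"version": 43, "servers": ["B", "G", "H"]}]
for server, row in handle_tractserver_failure(tlt, "B",
                                              ["A", "C", "D", "E", "G", "H", "X"]):
    print(server, row["version"], row["servers"])
```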

Failure Recovery - Tract Server
When a tractserver receives an assignment of a new entry in the TLT, it contacts the other replicas and begins copying the previously written tracts.

[Diagram: the newly assigned tractservers pulling copies of the lost tracts from the surviving replicas listed in the updated TLT.]

Failure Recovery - Client
- All client operations are tagged with the version number of the TLT entry they were routed by
- Handled cases: a single tractserver failure, multiple tractserver failures, a metadata server failure, and concurrent metadata server and tractserver failures

[Diagram: message flow among the client, tractservers, and the metadata server during recovery.]
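A minimal sketch of the versioning handshake implied by this slide, reusing tract_locator() from earlier. The exception type and the server's read RPC are invented names: the client tags each request with the TLT row version it used, a tractserver that has moved to a newer version rejects the request, and the client refreshes its TLT from the metadata server and retries.

```python
class StaleTLTError(Exception):
    """Raised (in this sketch) when a request carries an outdated row version."""

def read_with_retry(client, metadata_server, blob_guid, tract_num, max_retries=3):
    for _ in range(max_retries):
        row = client.tlt[tract_locator(blob_guid, tract_num, len(client.tlt))]
        try:
            # Every operation carries the version of the TLT row that routed it.
            return row["servers"][0].read(blob_guid, tract_num,
                                          version=row["version"])
        except StaleTLTError:
            # The tractserver has a newer row version: refresh the TLT and retry.
            client.tlt = metadata_server.get_tlt()
    raise RuntimeError("read failed after repeated TLT refreshes")
```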

Evaluation

Evaluation
Failure recovery time decreases as more disks are added!

Evaluation
Question: how is the speed gain split between full bisection bandwidth and FDS itself?

Conclusion
- Flat storage provides simplicity for applications.
- Deterministic data placement enables distributed metadata management.
- Without locality constraints, dynamic work allocation increases utilization.
- Failure recovery is highly scalable and fast.

Q&A