Federated Array of Bricks
Y. Saito et al., HP Labs
CS 6464 – Presented by Avinash Kulkarni

Agenda
Motivation
Current approaches
FAB
– Design
– Protocols, implementation, optimizations
– Evaluation
SSDs in enterprise storage

Motivation
Enterprise storage devices offer high reliability and availability
They rely on custom hardware solutions to gain better performance
Redundant hardware configuration eliminates any single point of failure, but the design is still very centralized
Workload characteristics vary widely based on the client application
Advantages – good performance, reliability, failover capabilities
Disadvantages – expensive, high cost of development, lack of scalability
Can we build a distributed, reliable but cheap storage solution that is also scalable and strongly consistent?
Could we do this without altering clients in any way?

Have we come across some solutions in this class?
GFS – Google File System
– Cheap, scalable, reliable
– Optimized for workloads that are mostly reads and appends
– Requires clients to be aware of file system semantics
– Some aspects are centralized
Amazon Dynamo
– Cheap, scalable, reliable
– Optimized for key-value storage
– Eventual consistency; pushes conflict resolution onto clients
– Anti-entropy, sloppy quorum, and hinted handoff are needed to manage ‘churn’
So, neither fits all our needs

Enter FAB
Aims to provide a scalable, reliable, distributed storage solution that is also strongly consistent and cheap
Requires no modification to clients
Innovations
– Voting-based replication
– Dynamic quorum reconfiguration
Strongly consistent – favors safety over liveness
Not optimized for any particular workload pattern
Assumptions
– Churn is low?
– Fail-stop failures
– Non-Byzantine behavior

FAB Structure
Organized as a decentralized collection of bricks
Each brick is composed of a server attached to some amount of storage
All servers are connected by a standard Ethernet interface
Each server runs the same software stack, composed of
– an iSCSI front end,
– a core layer for metadata processing,
– a backend layer for disk access
Each server can take on two roles
– Coordinator
– Replica
Observation – buffer cache management is done by the backend layer on each individual server; there is no global cache

FAB Operation
Client sends a request to a coordinator
Coordinator locates the subset of bricks that store data for the request
Coordinator runs the replication or erasure-code protocol against that set of bricks, passing the request tuple to them
Bricks use the tuple provided to locate the data on the physical disks they contain
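
A minimal sketch of this request path; the class and method names (Request, Coordinator, lookup, run) are illustrative assumptions, not the paper's API:

```python
# Hypothetical sketch of the FAB request path; names are illustrative.
from dataclasses import dataclass

@dataclass
class Request:
    volume_id: int
    offset: int        # logical byte offset within the volume
    length: int
    is_write: bool
    data: bytes = b""

class Coordinator:
    def __init__(self, volume_layout, protocol):
        self.volume_layout = volume_layout   # maps logical offsets to seggroups
        self.protocol = protocol             # replication or erasure-code protocol

    def handle(self, req: Request):
        # 1. Locate the bricks that hold the requested range.
        seggroup = self.volume_layout.lookup(req.offset)
        # 2. Run the quorum protocol against those bricks, passing the request
        #    tuple; each brick uses its own disk map to find the data on its
        #    physical disks.
        return self.protocol.run(seggroup.bricks, req)
```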

Key Data Structures
Volume layout – maps a logical offset to a seggroup at segment granularity (256 MB)
Seggroup – describes the layout of a segment, including the bricks that store its data
Disk map – maps the logical offset in a request tuple to a physical offset on disk at page granularity (8 MB)
Timestamp table – holds information about recently modified blocks of data
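
A rough sketch of these structures as Python dataclasses; the granularities restate the slide, while the field names and lookup logic are assumptions for illustration:

```python
# Illustrative sketch of FAB's per-brick metadata structures.
from dataclasses import dataclass, field
from typing import Dict, List

SEGMENT_SIZE = 256 * 2**20   # 256 MB: volume-layout granularity
PAGE_SIZE = 8 * 2**20        # 8 MB: disk-map granularity

@dataclass
class Seggroup:
    bricks: List[str]        # bricks that store this segment's data
    redundancy: str          # e.g. "3-way replication" or "(3,5) Reed-Solomon"

@dataclass
class VolumeLayout:
    segments: Dict[int, Seggroup] = field(default_factory=dict)

    def lookup(self, offset: int) -> Seggroup:
        return self.segments[offset // SEGMENT_SIZE]

@dataclass
class DiskMap:
    pages: Dict[int, int] = field(default_factory=dict)  # logical page -> physical offset

    def physical_offset(self, logical_offset: int) -> int:
        return self.pages[logical_offset // PAGE_SIZE] + logical_offset % PAGE_SIZE

@dataclass
class TimestampEntry:        # kept only for recently modified 512-byte blocks
    ordT: int                # timestamp of the newest ongoing write
    valT: int                # timestamp of the last completed, durable write
```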

Data layout & Load balancing
All segments in a seggroup store data at the same replication/erasure-code level
Logical segments are mapped semi-randomly, preferring seggroups that are lightly utilized (see the sketch below)
Logical-block-to-physical-offset mapping on bricks is done randomly
How many seggroups per brick?
– The higher the number, the more evenly load can be spread across the system when a brick fails
– But as the ratio increases, reliability drops, because the chance of data loss from combination failures increases
– FAB uses 4 as a near-optimal number
Load balancing
– Status messages flow between replicas, exchanging information about load
– Gossip-based failure detection mechanism
– Not static like Dynamo, which relies on consistent hashing to distribute load
– No concept of virtual nodes to accommodate heterogeneity of storage-server capabilities
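
One way to picture the semi-random placement policy, as a sketch under the assumption that "lightly utilized" means lowest assigned fraction of capacity; the sampling policy is invented for illustration, FAB's real policy is more involved:

```python
# Sketch of semi-random segment placement favoring lightly utilized seggroups.
import random
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Seggroup:
    bricks: Tuple[str, ...]
    utilization: float       # fraction of the group's capacity already assigned

def place_segment(seggroups: List[Seggroup], sample_size: int = 3) -> Seggroup:
    # Sample a few candidate seggroups at random, then prefer the lightest one.
    candidates = random.sample(seggroups, min(sample_size, len(seggroups)))
    return min(candidates, key=lambda g: g.utilization)
```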

Design Decisions
Two approaches to redundancy
– Replication
– Erasure codes
Replication
– Each segment of data is replicated across some number of replicas
– Unlike Dynamo, no version numbers (vector timestamps) are assigned to copies of the data
– Instead, each replica maintains two timestamps for every 512-byte block it stores (see the sketch below):
ordT – timestamp of the newest ongoing write operation
valT – timestamp of the last completed write that has been written out to disk
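
A sketch of the per-block state on a replica and how it might accept the two phases of a write; the accept/reject rules here are a simplification of the paper's protocol:

```python
# Simplified replica-side state for one 512-byte block (illustrative only).
class BlockReplica:
    def __init__(self):
        self.ordT = 0          # timestamp of the newest ongoing write
        self.valT = 0          # timestamp of the last completed, durable write
        self.data = bytes(512)

    def handle_order(self, ts: int) -> bool:
        # Phase 1: accept the write's timestamp only if it is newer than
        # anything this replica has already seen.
        if ts <= max(self.ordT, self.valT):
            return False
        self.ordT = ts
        return True

    def handle_write(self, ts: int, data: bytes) -> bool:
        # Phase 2: apply the data and mark the write completed (valT = ordT).
        if ts < self.ordT or ts <= self.valT:
            return False       # a newer write has been ordered or applied
        self.data, self.valT = data, ts
        return True

    def state(self):
        return self.valT, self.ordT, self.data
```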

Design Decisions – Replication
Write proceeds in two phases
– Order phase: generate a new timestamp for the write and send it to the bricks; wait until a majority accept
– Write phase: replicas apply the data from the client and set valT equal to the ordT generated above
Read is usually one phase
– Except when an incomplete write is detected: the read must then discover the value with the newest timestamp from a majority, and write that value back to a majority with a timestamp greater than that of any previous write
Unlike Dynamo, a read always returns the value of the most recent write that was committed by a majority
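
A condensed sketch of the coordinator side, assuming replicas expose the handle_order / handle_write / state methods from the replica sketch above; the timestamp scheme, quorum collection, and failure handling are all simplified:

```python
# Coordinator side of the two-phase write and the read path (illustrative only).
import time

def majority(n: int) -> int:
    return n // 2 + 1

def coordinator_write(replicas, data: bytes) -> bool:
    # Order phase: pick a fresh timestamp and get a majority to accept it.
    ts = time.time_ns()   # stand-in; FAB derives timestamps from a clock plus a brick id
    if sum(r.handle_order(ts) for r in replicas) < majority(len(replicas)):
        return False
    # Write phase: ship the data; accepting replicas set valT = ordT.
    return sum(r.handle_write(ts, data) for r in replicas) >= majority(len(replicas))

def coordinator_read(replicas) -> bytes:
    # Fast path: collect states from a majority, return the newest committed value.
    states = [r.state() for r in replicas[: majority(len(replicas))]]
    val_t, ord_t, data = max(states, key=lambda s: s[0])
    if all(o <= val_t for _, o, _ in states):
        return data            # no incomplete write detected: one round trip
    # Slow path: an incomplete write was seen; write the newest value back to a
    # majority with a larger timestamp before returning it.
    coordinator_write(replicas, data)
    return data
```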

Design Decisions – Replication Protocol

Design Decisions – Erasure Coding
FAB supports (m, n) Reed-Solomon erasure coding
– Generate (n – m) parity blocks
– Reconstruct the original data blocks by reading any m of the n blocks
The coordinator must now collect m + ceil((n – m)/2) responses for any operation, so that the intersection of any two quorums contains at least m bricks
Write requests are written to a log during phase 2 instead of being flushed to disk right away
On detecting an incomplete write, a read scans the log to find the newest strip value that can be reconstructed
The log does not have to grow indefinitely: the coordinator can direct bricks to compress their logs once the response has been returned to the client
An asynchronous ‘Commit’ operation is sent to the bricks to direct them to flush logged writes to disk
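
A small worked check of that quorum size (the intersection argument restates the slide; the helper below is just the arithmetic):

```python
# Quorum size for (m, n) Reed-Solomon coding: any two quorums of this size
# overlap in at least m bricks, enough to reconstruct a stripe.
import math

def ec_quorum(m: int, n: int) -> int:
    # 2q - n >= m  <=>  q >= m + (n - m) / 2
    return m + math.ceil((n - m) / 2)

print(ec_quorum(3, 5))   # 4: two quorums of 4 out of 5 bricks share at least 3
print(ec_quorum(1, 3))   # 2: degenerates to a simple majority for 3-way replication
```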

Design Decisions – Erasure Coding
Some modifications to the replication layer are needed to implement erasure coding in FAB
– Addition of a log layer on replicas to eliminate strip corruption on incomplete writes
– Addition of an asynchronous commit operation at the end of a write to compress and garbage-collect the logs on replicas

Design Decisions – Optimizations
Timestamps
– One 24-byte timestamp for every 512-byte block is very expensive in terms of storage
– Recall that timestamps are needed only to distinguish between concurrent operations
– So the coordinator can ask bricks to garbage-collect timestamps once all replicas have committed the operation to disk
– Most requests span multiple blocks that share the same timestamp, so an ordered tree can store the mapping between timestamps and the data blocks they pertain to
Read operations
– Reading data from a full quorum is expensive in terms of bandwidth, so a read fetches only timestamps from the replicas while the data is read from a single replica (see the sketch below)
Coordinator failures are not handled by the system, as a simplification; clients can detect these and issue the same request to a new coordinator
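
A sketch of that read optimization, assuming hypothetical read_data / read_timestamps calls on the replicas; the fallback to the two-phase recovery path is only signalled, not implemented:

```python
# Read optimization: full data from one replica, timestamps from the rest.
def optimized_read(replicas):
    quorum = len(replicas) // 2 + 1
    primary, others = replicas[0], replicas[1:quorum]
    val_t, ord_t, data = primary.read_data()            # timestamps plus data
    stamps = [r.read_timestamps() for r in others]      # (valT, ordT) pairs only
    newest_ok = all(v <= val_t and o <= val_t for v, o in stamps) and ord_t <= val_t
    if newest_ok:
        return data    # primary holds the newest committed value, no write in flight
    raise RuntimeError("incomplete or concurrent write detected; run recovery path")
```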

Design Decisions – Reconfiguration
View agreement through dynamic voting in a three-round membership protocol
Need for a synchronization step after a new view is agreed upon

Evaluation – Application performance
Three different workloads – tar (bulk reads), untar (bulk writes), compile (reads and writes)

Evaluation – Scalability
Aggregate throughput and random read/write throughput

Evaluation – Scalability
Comparison of throughput with master-slave FAB

Evaluation – Handling changes
Comparison with master-slave FAB

SSDs in enterprise storage – An analysis of tradeoffs
Dushyanth Narayanan et al.

Are SSDs useful in the current enterprise environment?
Using SSDs in enterprise storage systems requires optimizing for performance, capacity, power, and reliability
Currently, SSDs offer a big advantage in random-access reads and in lower power consumption
But they are very expensive
– Orders of magnitude costlier than current disks (IDE/SCSI)
– Prices have been declining, and are predicted to fall further
– But are there situations in the current enterprise environment where SSDs could be used now?

Analysis of tradeoffs
Here’s a listing of device characteristics

Analysis of tradeoffs
The authors traced typical workloads in an enterprise environment and enumerated device characteristics

Analysis of tradeoffs
How do SSDs stack up against normal drives?

Analysis of tradeoffs
Can SSDs completely replace normal drives for the traced workloads, based on their capacity and random-read requirements? Not really.

Analysis of tradeoffs
How do SSDs measure up to normal drives on price? Not very well at present.

Analysis of tradeoffs
How about power? Can’t we save a lot of money thanks to the small power footprint of SSDs?
It turns out we don’t really save much at current energy prices (see the estimate below)
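
A back-of-the-envelope check with assumed numbers; the wattage gap and electricity price below are illustrative assumptions, not figures from the paper:

```python
# Rough annual power savings from replacing one enterprise disk with an SSD.
watts_saved = 10.0                 # assumed gap: disk ~12 W vs SSD ~2 W
hours_per_year = 24 * 365
price_per_kwh = 0.10               # assumed electricity price, dollars per kWh

savings = watts_saved * hours_per_year / 1000 * price_per_kwh
print(f"~${savings:.2f} per drive per year")   # ~$8.76, small next to the SSD price premium
```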

Analysis of tradeoffs
So, is there a place for SSDs in enterprise storage? Maybe
Recall the tiered storage model we already use with DRAM and normal drives: DRAM serves as a buffer for the underlying slow drives
Insight – SSDs could serve the same purpose! They are fast but cheaper than DRAM, and unlike DRAM they are persistent
We need to account for their poor random-write performance before we do this
– Use newer flash file systems that convert random writes to sequential log entries
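
A toy sketch of such a tier, treating the SSD as a persistent cache in front of the disk and turning its writes into an append-only log; class and method names are assumptions, not the paper's design:

```python
# Toy DRAM -> SSD -> disk hierarchy with log-structured SSD writes (illustrative).
class TieredStore:
    def __init__(self, dram_cache, ssd_log, disk):
        self.dram = dram_cache       # small, volatile, fastest (a dict-like cache)
        self.ssd = ssd_log           # persistent cache, written as an append-only log
        self.disk = disk             # large, cheap, slowest

    def read(self, block_id):
        if block_id in self.dram:
            return self.dram[block_id]
        data = self.ssd.get(block_id) or self.disk.read(block_id)
        self.dram[block_id] = data               # promote hot blocks on read
        return data

    def write(self, block_id, data):
        self.dram[block_id] = data
        self.ssd.append(block_id, data)          # sequential log write, not random I/O
        self.disk.write_back_async(block_id, data)   # destage to disk in the background
```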

Some discussion points
Each brick in FAB maintains its own cache – how does that affect performance?
How can FAB accommodate heterogeneous storage servers?
Any others?
