
RAMCloud: Scalable High-Performance Storage Entirely in DRAM
John Ousterhout, Stanford University
(with Nandu Jayakumar, Diego Ongaro, Mendel Rosenblum, Stephen Rumble, and Ryan Stutsman)

DRAM in Storage Systems
● DRAM usage limited/specialized
● Clumsy (consistency with backing store)
● Lost performance (cache misses, backing store)
(Timeline figure: UNIX buffer cache → main-memory databases → large file caches → web indexes entirely in DRAM → memcached → Facebook: 200 TB total data, 150 TB of it cache! → main-memory DBs, again)

RAMCloud
Harness the full performance potential of large-scale DRAM storage:
● General-purpose storage system
● All data always in DRAM (no cache misses)
● Durable and available (no backing store)
● Scale: 1000+ servers, 100+ TB
● Low latency: 5-10 µs remote access
Potential impact: enable a new class of applications

RAMCloud Overview
● Storage for datacenters
● 1000-10,000 commodity servers
● Tens of GB of DRAM per server
● All data always in RAM
● Durable and available
● Performance goals:
  ▪ High throughput: 1M ops/sec/server
  ▪ Low-latency access: 5-10 µs RPC
(Diagram: application servers accessing storage servers over a datacenter network)

Example Configurations
For $ K today:
● One year of Amazon customer orders
● One year of United flight reservations

                      Today      5-10 years
  # servers           2,000      ~4,000
  GB/server           24 GB      256 GB
  Total capacity      48 TB      1 PB
  Total server cost   $3.1M      $6M
  $/GB                $65        $6
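The missing server counts and the $/GB figures follow directly from the other rows; below is a quick check of the table's arithmetic (the 2,000 and ~4,000 server counts above are inferred this way, not taken from the transcript):

```python
# Sanity-check of the example-configuration arithmetic (decimal units: 1 TB = 1000 GB).
configs = {
    "today":      {"gb_per_server": 24,  "total_cost": 3.1e6, "capacity_gb": 48e3},
    "5-10 years": {"gb_per_server": 256, "total_cost": 6.0e6, "capacity_gb": 1e6},
}
for name, c in configs.items():
    servers = c["capacity_gb"] / c["gb_per_server"]
    cost_per_gb = c["total_cost"] / c["capacity_gb"]
    print(f"{name}: ~{servers:,.0f} servers, ${cost_per_gb:,.0f}/GB")
# today: ~2,000 servers, $65/GB
# 5-10 years: ~3,906 servers, $6/GB
```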

Why Does Latency Matter?
● Large-scale apps struggle with high latency
  ▪ Facebook: can only make a limited number of internal requests per page
  ▪ Random-access data rate has not scaled!
(Diagram: a traditional application keeps UI, application logic, and data structures on a single machine, with << 1 µs latency between them; a web application keeps UI and application logic on application servers and its data on storage servers, separated by 0.5-10 ms of datacenter latency)
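A rough way to see the problem: with dependent accesses, round-trip latency caps how many storage requests a page can make. The 0.5 ms and 5 µs figures come from these slides; the purely sequential access pattern is my simplifying assumption:

```python
# With dependent accesses (each request must finish before the next is issued),
# round-trip latency bounds the data-access rate of a single request thread.
for label, latency_s in [("today's datacenter storage access (~0.5 ms)", 0.5e-3),
                         ("RAMCloud goal (~5 us)", 5e-6)]:
    print(f"{label}: ~{1 / latency_s:,.0f} sequential accesses/second")
# today's datacenter storage access (~0.5 ms): ~2,000 sequential accesses/second
# RAMCloud goal (~5 us): ~200,000 sequential accesses/second
```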

MapReduce
● Sequential data access → high data access rate
● Not all applications fit this model
(Diagram: offline computation over data)

Goal: Scale and Latency
● Enable a new class of applications:
  ▪ Crowd-level collaboration
  ▪ Large-scale graph algorithms
  ▪ Real-time information-intensive applications
(Diagram: the same single-machine vs. datacenter picture as before, with RAMCloud cutting the application-to-storage latency from 0.5-10 ms to 5-10 µs)

RAMCloud Architecture
(Diagram: 1,000-100,000 application servers, each running an application linked against the RAMCloud client library, talk over the datacenter network to 1,000-10,000 storage servers; each storage server runs a master and a backup, and a single coordinator oversees the cluster)

Data Model
Tables containing objects:
  create(tableId, blob) => objectId, version
  read(tableId, objectId) => blob, version
  write(tableId, objectId, blob) => version
  cwrite(tableId, objectId, blob, version) => version   (only overwrites if the version matches)
  delete(tableId, objectId)
Each object: 64-bit identifier, 64-bit version, blob of up to 1 MB.
Richer model in the future: indexes? transactions? graphs?
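As a rough illustration of these semantics (not RAMCloud's actual client API; the class and method names here are invented), a conditional write amounts to a compare-and-swap on the object's version:

```python
class VersionMismatch(Exception):
    pass

class ToyTable:
    """In-process sketch of the slide's table/object model (illustrative only)."""
    def __init__(self):
        self._objects = {}        # objectId -> (blob, version)
        self._next_id = 1

    def create(self, blob):
        object_id, version = self._next_id, 1
        self._next_id += 1
        self._objects[object_id] = (blob, version)
        return object_id, version

    def read(self, object_id):
        return self._objects[object_id]            # (blob, version)

    def write(self, object_id, blob):
        _, version = self._objects.get(object_id, (None, 0))
        self._objects[object_id] = (blob, version + 1)
        return version + 1

    def cwrite(self, object_id, blob, expected_version):
        # Conditional write: succeeds only if the caller saw the latest version.
        _, current = self._objects[object_id]
        if current != expected_version:
            raise VersionMismatch(f"expected {expected_version}, found {current}")
        return self.write(object_id, blob)

    def delete(self, object_id):
        del self._objects[object_id]

# Example: lost-update protection with cwrite.
t = ToyTable()
oid, v = t.create(b"hello")
v2 = t.cwrite(oid, b"hello world", v)      # ok: version matches
# t.cwrite(oid, b"stale update", v)        # would raise VersionMismatch
```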

Durability and Availability
● Goals:
  ▪ No impact on performance
  ▪ Minimum cost and energy
● Keep replicas in DRAM of other servers?
  ▪ 3x system cost and energy
  ▪ Still have to handle power failures
  ▪ Replicas unnecessary for performance
● RAMCloud approach:
  ▪ 1 copy in DRAM
  ▪ Backup copies on disk/flash: durability ~ free!
● Issues to resolve:
  ▪ Synchronous disk I/Os during writes??
  ▪ Data unavailable after crashes??

Buffered Logging
● No disk I/O during write requests
● Master's memory is also log-structured
● Log cleaning ~ generational garbage collection
(Diagram: a write request arrives at the master, which appends to its in-memory log and updates its hash table; the new data is replicated to buffered segments in the DRAM of several backups, which write closed segments to disk later)
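A deliberately simplified sketch of this write path (the class names, the replication factor of 3, and the tiny segment size are my assumptions; the real system uses RPCs, 8 MB segments, and far more careful segment management):

```python
class Backup:
    """Buffers a segment in DRAM; writes it to disk only when the segment is closed."""
    def __init__(self):
        self.buffered = {}       # segment_id -> list of entries (in DRAM)
        self.on_disk = {}        # segment_id -> list of entries ("on disk")

    def append(self, segment_id, entry):
        self.buffered.setdefault(segment_id, []).append(entry)   # no disk I/O here

    def close_segment(self, segment_id):
        # Happens when a segment fills, off the critical path of client writes.
        self.on_disk[segment_id] = self.buffered.pop(segment_id)


class Master:
    """One copy of data in DRAM (log + hash table); replicas only in backups' buffers."""
    SEGMENT_ENTRIES = 4          # tiny segments, just for the example

    def __init__(self, backups, replicas=3):
        self.log = []            # in-memory, log-structured storage
        self.hash_table = {}     # key -> index into the log
        self.backups = backups[:replicas]
        self.segment_id = 0
        self.entries_in_segment = 0

    def write(self, key, value):
        entry = (key, value)
        self.log.append(entry)
        self.hash_table[key] = len(self.log) - 1
        for b in self.backups:                 # replicate into backup DRAM buffers
            b.append(self.segment_id, entry)
        self.entries_in_segment += 1
        if self.entries_in_segment == self.SEGMENT_ENTRIES:
            for b in self.backups:             # disk writes happen off the write path
                b.close_segment(self.segment_id)
            self.segment_id += 1
            self.entries_in_segment = 0

    def read(self, key):
        return self.log[self.hash_table[key]][1]
```

The point of the sketch is that a client write touches only DRAM (the master's log and the backups' buffers); disks see data only when whole segments are flushed.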

Crash Recovery
● Power failures: backups must guarantee durability of buffered data
  ▪ DIMMs with built-in flash backup?
  ▪ Per-server battery backups?
  ▪ Caches on enterprise disk controllers?
● Server crashes:
  ▪ Must replay the log to reconstruct data
  ▪ Meanwhile, data is unavailable
  ▪ Solution: fast crash recovery (1-2 seconds)
  ▪ If fast enough, failures will not be noticed
● Key to fast recovery: use the system's scale

Recovery, First Try
● Master chooses backups statically
  ▪ Each backup stores the entire log for the master
● Crash recovery:
  ▪ Choose a recovery master
  ▪ Backups read log data from disk
  ▪ Transfer logs to the recovery master
  ▪ Recovery master replays the log
● First bottleneck: disk bandwidth
  ▪ 64 GB / 3 backups / 100 MB/sec/disk ≈ 210 seconds
● Solution: more disks (and backups)
(Diagram: one recovery master reading from a handful of backups)
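Spelling out the disk-bandwidth estimate (all figures from the slide):

```python
log_bytes = 64e9          # 64 GB of data on the crashed master
backups = 3               # entire log replicated on 3 backups
disk_bw = 100e6           # 100 MB/s per disk, one disk per backup
print(f"{log_bytes / backups / disk_bw:.0f} seconds")   # -> 213 seconds (~210 s)
```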

Recovery, Second Try
● Scatter logs:
  ▪ Each log is divided into 8 MB segments
  ▪ Master chooses different backups for each segment (randomly)
  ▪ Segments scattered across all servers in the cluster
● Crash recovery:
  ▪ All backups read from disk in parallel
  ▪ Data transmitted over the network to the recovery master
(Diagram: one recovery master pulling segments from ~1000 backups)
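A minimal sketch of the scattering step (uniform random placement of one copy per segment; the real system also places replicas, accounts for disk speeds, and rejects unbalanced choices, all omitted here):

```python
import random
from collections import Counter

SEGMENT_BYTES = 8 * 10**6          # 8 MB segments (decimal units, as on the slide)

def scatter_segments(log_bytes, backup_ids, rng=random):
    """Assign each segment's primary copy (the one read at recovery) to a random backup."""
    num_segments = -(-int(log_bytes) // SEGMENT_BYTES)     # ceiling division
    return {seg: rng.choice(backup_ids) for seg in range(num_segments)}

placement = scatter_segments(64 * 10**9, list(range(1000)))
per_backup = Counter(placement.values())
print(f"{len(placement)} segments, ~{len(placement) / 1000:.1f} per backup on average")
# -> 8000 segments, ~8.0 per backup on average
```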

Scattered Logs, cont'd
● Disk no longer a bottleneck:
  ▪ 64 GB / 8 MB/segment / 1000 backups ≈ 8 segments/backup
  ▪ 100 ms/segment to read from disk
  ▪ 0.8 seconds to read all segments in parallel
● Second bottleneck: NIC on the recovery master
  ▪ 64 GB / 10 Gbits/second ≈ 60 seconds
  ▪ Recovery master CPU is also a bottleneck
● Solution: more recovery masters
  ▪ Spread the work over 100 recovery masters
  ▪ 64 GB / 10 Gbits/second / 100 masters ≈ 0.6 seconds
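The same estimates as explicit arithmetic (all inputs from the slide; the slide rounds the NIC numbers up slightly):

```python
log_bytes   = 64e9        # data on the crashed master
segment     = 8e6         # 8 MB segments
backups     = 1000
disk_read_s = 0.100       # time to read one segment from disk
nic_bps     = 10e9        # 10 Gbit/s recovery-master NIC

segments_per_backup = log_bytes / segment / backups
print(f"disk:  {segments_per_backup:.0f} segments/backup "
      f"-> {segments_per_backup * disk_read_s:.1f} s to read in parallel")
nic_one = log_bytes * 8 / nic_bps
print(f"NIC:   {nic_one:.0f} s through one recovery master, "
      f"{nic_one / 100:.2f} s spread over 100 masters")
# -> disk:  8 segments/backup -> 0.8 s to read in parallel
# -> NIC:   51 s through one recovery master, 0.51 s spread over 100 masters
#    (the slide rounds these up to ~60 s and ~0.6 s)
```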

Recovery, Third Try
● Divide each master's data into partitions
  ▪ Recover each partition on a separate recovery master
  ▪ Partitions based on tables and key ranges, not log segments
  ▪ Each backup divides its log data among the recovery masters
(Diagram: segments from the dead master's backups are split up and streamed to many recovery masters)
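A sketch of how one backup might split its log entries among recovery masters (the Partition record and its field names are my own simplification; in RAMCloud the coordinator chooses table/key-range partitions and backups divide segment data accordingly):

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Partition:
    recovery_master: str
    table_id: int
    first_key: int          # inclusive
    last_key: int           # inclusive

def divide_segment(entries: List[Tuple[int, int, bytes]],
                   partitions: List[Partition]) -> Dict[str, list]:
    """Split one backup's segment entries (table_id, key, value) among
    recovery masters according to table/key-range partitions."""
    out = {p.recovery_master: [] for p in partitions}
    for table_id, key, value in entries:
        for p in partitions:
            if p.table_id == table_id and p.first_key <= key <= p.last_key:
                out[p.recovery_master].append((table_id, key, value))
                break
    return out

# Example: two recovery masters split table 1's 64-bit key space in half.
parts = [Partition("rm-0", 1, 0, 2**63 - 1),
         Partition("rm-1", 1, 2**63, 2**64 - 1)]
segment = [(1, 42, b"a"), (1, 2**63 + 7, b"b")]
print({rm: len(es) for rm, es in divide_segment(segment, parts).items()})
# -> {'rm-0': 1, 'rm-1': 1}
```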

Other Research Issues
● Fast communication (RPC)
  ▪ New datacenter network protocol?
● Data model
● Concurrency, consistency, transactions
● Data distribution, scaling
● Multi-tenancy
● Client-server functional distribution
● Node architecture

Project Status
● Goal: build a production-quality implementation
● Started coding Spring 2010
● Major pieces coming together:
  ▪ RPC subsystem
    - Supports many different transport layers
    - Using Mellanox Infiniband for high performance
  ▪ Basic data model
  ▪ Simple cluster coordinator
  ▪ Fast recovery
● Performance (40-node cluster):
  ▪ Read small object: 5 µs
  ▪ Throughput: > 1M small reads/second/server

Single Recovery Master
(Graph: recovery throughput, in MB/sec, for a single recovery master)

Recovery Scalability
(Graph: recovery time as the configuration scales from 1 master, 6 backups, 6 disks, 600 MB up to 11 masters, 66 backups, 66 disks, 6.6 GB)

Conclusion
● Achieved low latency (at small scale)
● Not yet at large scale (but scalability is encouraging)
● Fast recovery:
  ▪ 1 second for memory sizes < 10 GB
  ▪ Scalability looks good
  ▪ Durable and available DRAM storage for the cost of volatile cache
● Many interesting problems left
● Goals:
  ▪ Harness the full performance potential of DRAM-based storage
  ▪ Enable new applications: intensive manipulation of large-scale data

Why Not a Caching Approach?
● Lost performance:
  ▪ 1% misses → 10x performance degradation
● Won't save much money:
  ▪ Already have to keep most information in memory
  ▪ Example: Facebook caches ~75% of its data size
● Availability gaps after crashes:
  ▪ System performance intolerable until the cache refills
  ▪ Facebook example: 2.5 hours to refill caches!
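The "1% misses → 10x" figure drops out of a simple expected-latency calculation; the ~10 µs DRAM access and ~10 ms disk access times are taken from elsewhere in this deck, the rest is arithmetic:

```python
dram_us, disk_us = 10, 10_000        # ~10 µs DRAM access vs ~10 ms disk access
for miss_rate in (0.0, 0.01, 0.05):
    avg = (1 - miss_rate) * dram_us + miss_rate * disk_us
    print(f"{miss_rate:.0%} misses -> average {avg:.0f} µs "
          f"({avg / dram_us:.0f}x slower than all-DRAM)")
# 0% misses -> average 10 µs (1x slower than all-DRAM)
# 1% misses -> average 110 µs (11x slower than all-DRAM)
# 5% misses -> average 510 µs (51x slower than all-DRAM)
```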

Data Model Rationale
How do we get the best application-level performance?
(Spectrum: lower-level APIs with less server functionality at one end, higher-level APIs with more server functionality at the other; a key-value store sits in between)
● Distributed shared memory:
  + Server implementation easy
  + Low-level performance good
  - APIs not convenient for applications
  - Lose performance to application-level synchronization
● Relational database:
  + Powerful facilities for apps
  + Best RDBMS performance
  - Simple cases pay the full RDBMS performance cost
  - More complexity in servers

RAMCloud Motivation: Technology
Disk access rate is not keeping up with capacity:
● Disks must become more archival
● More information must move to memory

                                     Mid-1980s   2009       Change
  Disk capacity                      30 MB       500 GB     16,667x
  Max. transfer rate                 2 MB/s      100 MB/s   50x
  Latency (seek & rotate)            20 ms       10 ms      2x
  Capacity/bandwidth (large blocks)  15 s        5,000 s    333x
  Capacity/bandwidth (1 KB blocks)   600 s       58 days    8,333x
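The two capacity/bandwidth rows can be re-derived from the rows above them; a quick check of the arithmetic:

```python
disks = {"mid-1980s": {"capacity_mb": 30,      "bw_mb_s": 2,   "latency_ms": 20},
         "2009":      {"capacity_mb": 500_000, "bw_mb_s": 100, "latency_ms": 10}}

for era, d in disks.items():
    scan_s = d["capacity_mb"] / d["bw_mb_s"]                        # read whole disk sequentially
    random_s = d["capacity_mb"] * 1000 * d["latency_ms"] / 1000     # one seek per 1 KB block
    print(f"{era}: full sequential scan {scan_s:,.0f} s; "
          f"1 KB random reads {random_s:,.0f} s (~{random_s / 86400:.1f} days)")
# mid-1980s: full sequential scan 15 s; 1 KB random reads 600 s (~0.0 days)
# 2009: full sequential scan 5,000 s; 1 KB random reads 5,000,000 s (~57.9 days)
```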