Presentation on theme: "CSci8211: Distributed Systems: RAMCloud 1 Distributed Shared Memory/Storage Case Study: RAMCloud Developed by Stanford Platform Lab  Key Idea: Scalable."— Presentation transcript:

1 CSci8211: Distributed Systems: RAMCloud
Distributed Shared Memory/Storage Case Study: RAMCloud
Developed by Stanford Platform Lab
 Key Idea: Scalable High-Performance Storage Entirely in DRAM
 Goals:
–Low latency: 5-10 µs round-trips (not milliseconds)
–High throughput: 1M operations/s per server
–Key-value storage scaling to 1000s of servers
–Single copy of each object in DRAM (no DRAM replicas)
–Entirely in DRAM -- disks/flash used only for backup
–Fast crash recovery (1-2 secs)

2 Why DRAM?

Disk technology trends:

  Metric                              Mid-1980's   2009       Change
  Disk capacity                       30 MB        500 GB     16,667x
  Max. transfer rate                  2 MB/s       100 MB/s   50x
  Latency (seek & rotate)             20 ms        10 ms      2x
  Capacity/bandwidth (large blocks)   15 s         5,000 s    333x
  Capacity/bandwidth (1KB blocks)     600 s        58 days    8,333x
  Jim Gray's Rule (1KB)               5 min.       30 hours   360x

Projected RAMCloud cluster:

  Metric              Today    5-10 years
  # servers           1,000    4,000
  GB/server           64 GB    256 GB
  Total capacity      64 TB    1 PB
  Total server cost   $4M      $6M
  $/GB                $60      $6
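The capacity/bandwidth rows above fall out of simple arithmetic: how long it takes to stream an entire disk at its maximum transfer rate. A small sketch (figures taken from the table; decimal MB/GB assumed):

```python
# Rough arithmetic behind the disk-trend table above.
MB = 1000 * 1000          # decimal megabyte, as disk vendors use
GB = 1000 * MB

def full_read_time_s(capacity_bytes, transfer_rate_bps):
    """Seconds to stream the whole disk at the max transfer rate."""
    return capacity_bytes / transfer_rate_bps

# Mid-1980s disk: 30 MB at 2 MB/s -> 15 s to read it all.
t_1985 = full_read_time_s(30 * MB, 2 * MB)

# 2009 disk: 500 GB at 100 MB/s -> 5000 s (~83 minutes).
t_2009 = full_read_time_s(500 * GB, 100 * MB)

print(t_1985, t_2009, t_2009 / t_1985)   # ratio ~333x worse
```

The point of the table: capacity grew far faster than bandwidth, so disks increasingly behave like tape, which motivates keeping the working set entirely in DRAM.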

3 DRAM in Storage Systems
 Traditionally, DRAM is largely used as a cache to secondary storage. Drawbacks:
–consistency with secondary storage must be maintained
–performance loss due to cache misses and backup operations

4 RAMCloud: Overall Architecture
 1,000 – 100,000 Application Servers, each running an Appl. + client Library
 1,000 – 10,000 Storage Servers (commodity servers, 64-256 GB DRAM each), each running a Master + Backup module
 Coordinator: manages cluster configuration
 Datacenter Network: high-speed networking, 5 µs round-trip, full bisection bandwidth

5 Data Model
Tables: a collection of objects
 Object: key-value pair (with version #)
–Identifier (≤64B), Version (64b), Blob (≤1MB)
–must be read/written in its entirety
–conditional write: only overwrite if version matches
 Operations:
create(tableId, blob) => objectId, version
read(tableId, objectId) => blob, version
write(tableId, objectId, blob) => version
cwrite(tableId, objectId, blob, version) => version
delete(tableId, objectId)
 Also: enumerate objects in a table; efficient multi-read, multi-write; atomic increment
 RAMCloud: optimized for small objects

6 Data Model (cont'd)
Same operations and object layout as above (create, read, write, cwrite, delete; enumeration; multi-read/multi-write; atomic increment)
 Richer data models can be defined on top:
–(secondary) indices
–transactions: atomic updates of multiple objects
–graphs
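The versioned key-value model above can be sketched as a tiny in-memory mock. This is purely illustrative (the class and its internals are assumptions, not RAMCloud's implementation); the operation names follow the slide's API, and cwrite shows the optimistic-concurrency use of version numbers:

```python
# Hypothetical in-memory mock of the RAMCloud data model:
# versioned objects with conditional overwrite (cwrite).
class MockRAMCloud:
    def __init__(self):
        self.tables = {}      # table_id -> {object_id: (blob, version)}
        self.next_id = {}     # table_id -> next object id to hand out

    def create(self, table_id, blob):
        tbl = self.tables.setdefault(table_id, {})
        oid = self.next_id.get(table_id, 0)
        self.next_id[table_id] = oid + 1
        tbl[oid] = (blob, 1)
        return oid, 1

    def read(self, table_id, object_id):
        return self.tables[table_id][object_id]       # (blob, version)

    def write(self, table_id, object_id, blob):
        _, v = self.tables[table_id].get(object_id, (None, 0))
        self.tables[table_id][object_id] = (blob, v + 1)
        return v + 1

    def cwrite(self, table_id, object_id, blob, version):
        """Overwrite only if the caller's version is still current."""
        _, v = self.tables[table_id][object_id]
        if v != version:
            raise ValueError(f"stale version: have {v}, caller sent {version}")
        return self.write(table_id, object_id, blob)
```

A caller would read an object, compute an update, and cwrite with the version it read; a concurrent writer bumps the version and the stale cwrite fails instead of silently clobbering.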

7 Coordinator
 Handles configuration-related issues:
–cluster membership
–distribution & placement of data among the servers
–not normally involved in common-case read/write operations
 Clients obtain the table map from the coordinator and cache it, e.g.:

  Table #   First Key   Last Key     Server
  12        0           2^64 - 1     192.168.0.1
  47        63,742      5,723,742    192.168.0.2
  ...

 The coordinator is backed up by a shadow coordinator
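The cached table map can be sketched as a simple range lookup on the client side (the entries mirror the example table above; a linear scan is used here for clarity, and the miss-handling behavior is an assumption consistent with the slide):

```python
# Sketch of a client-side cached tablet map: each entry maps a key
# range within a table to the storage server holding it.
tablet_map = [
    # (table_id, first_key, last_key, server)
    (12, 0, 2**64 - 1, "192.168.0.1"),
    (47, 63_742, 5_723_742, "192.168.0.2"),
]

def locate(table_id, key):
    """Return the server owning (table_id, key), or None on a map miss
    (on a miss the client would refresh its cache from the coordinator)."""
    for tid, first, last, server in tablet_map:
        if tid == table_id and first <= key <= last:
            return server
    return None
```

Because the map is cached, the coordinator stays off the common read/write path; only stale or missing entries send a client back to it.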

8 Per-Node Architecture
Each storage server consists of two components:
 Master module: manages main memory to store RAMCloud objects; handles reads/writes from clients
 Backup module: stores backup copies of data owned by masters on other servers, on local disk/flash

9 Log-Structured Memory/Storage
 Master memory (and backup storage) is organized as a log
–similar to log-structured file systems, but with simpler metadata
–a unified mechanism for both in-memory and on-disk data
 Log is immutable: it can only be appended to
–periodically "cleaned" for garbage collection by removing deleted objects
 Each master has its own log, divided into 8MB segments
–each segment is backed up on 2-3 servers
–each segment has a unique id & a log digest
 Master uses a different set of backup servers to replicate each segment (why? so that on recovery the segments can be read from many disks in parallel)
–each master's segment replicas are scattered across the entire cluster
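A minimal sketch of the segmented append-only log described above (the 8MB segment size comes from the slide; the data structures and the (segment, offset) addressing are illustrative assumptions, not RAMCloud's actual layout):

```python
# Sketch of a segmented, append-only log: entries go into the head
# segment until it fills, then the segment is sealed and a new one opened.
SEGMENT_SIZE = 8 * 1024 * 1024   # 8 MB, per the slide

class SegmentedLog:
    def __init__(self):
        self.segments = [[]]     # list of segments, each a list of entries
        self.used = 0            # bytes used in the current (head) segment

    def append(self, entry: bytes):
        if self.used + len(entry) > SEGMENT_SIZE:
            self.segments.append([])   # seal head segment, open a new one
            self.used = 0
        self.segments[-1].append(entry)
        self.used += len(entry)
        # (segment index, offset within segment) identifies the entry
        return len(self.segments) - 1, len(self.segments[-1]) - 1
```

Sealed segments are immutable, which is what makes it safe to replicate each one on a different set of backups and to clean them lazily.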

10 Read, Write Operations & Backup in RAMCloud: Ensuring Durability
 Each master keeps a separate hash table
–given table id and key, each object can be looked up quickly
–each live object has exactly one pointer stored in the hash table
 Read: given table id and key, look up the hash table
–as in GFS, the client talks to the coordinator first to get the table map
 Write: the master first appends the new object to its in-memory log, then forwards it to its backup servers
–each backup appends it to a non-volatile memory buffer and replies to the master
–only after receiving replies from all backups does the master respond to the client
–a backup segment is written to disk/flash only when it is complete
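The write path above can be sketched end to end: append in DRAM, update the hash table, replicate to backups, and acknowledge only when every backup has buffered the entry. Everything here is a simplified stand-in (plain method calls instead of RPCs, a Python list instead of a non-volatile buffer):

```python
# Sketch of the durable write path: master log + hash table + backup acks.
class Backup:
    def __init__(self):
        self.buffer = []             # stands in for the non-volatile buffer
    def store(self, entry):
        self.buffer.append(entry)
        return True                  # ack to the master

class Master:
    def __init__(self, backups):
        self.log = []                # in-DRAM log (flat here for brevity)
        self.hash_table = {}         # (table_id, key) -> log index
        self.backups = backups

    def write(self, table_id, key, blob):
        entry = (table_id, key, blob)
        self.log.append(entry)                           # 1. append in DRAM
        self.hash_table[(table_id, key)] = len(self.log) - 1
        acks = [b.store(entry) for b in self.backups]    # 2. replicate
        assert all(acks)             # 3. reply to client only when all ack
        return True

    def read(self, table_id, key):
        # Reads are served entirely from DRAM via the hash table.
        return self.log[self.hash_table[(table_id, key)]][2]
```

Note what is absent: no disk I/O anywhere on the write path, which is the point of buffered logging on the next slide.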

11 Write and Buffered Logging
 No disk I/O during write operations
 Log-structured: both the backup servers' disks and the master's memory
 In-memory log buffer at backup servers: non-volatile memory
 Log cleaning: reclaims memory & storage space

12 Master Crash & Fast Recovery
 Segments of the crashed master's log are scattered across backup servers throughout the cluster
–avoids a disk I/O bottleneck during recovery
 Employ multiple recovery masters (instead of one!)
–35 GB recovered in 1.6 s using 60 nodes (SOSP'11 paper)
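The parallelism above comes from dividing the crashed master's data among many recovery masters. A sketch of the partitioning step (round-robin here for simplicity; the real system balances partitions by size, and all names are illustrative):

```python
# Sketch: divide a crashed master's tablets among recovery masters so
# that replay load is spread across many nodes' disks and NICs.
def partition_tablets(tablets, recovery_masters):
    assignment = {m: [] for m in recovery_masters}
    for i, tablet in enumerate(tablets):
        owner = recovery_masters[i % len(recovery_masters)]
        assignment[owner].append(tablet)
    return assignment
```

Each recovery master then replays only its partition, reading segment replicas from many backups in parallel, which is how 35 GB can be recovered in under two seconds.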

13 Other Important Features
 Log metadata: objects in the log are "self-identifying", carrying table/key ids & version #
–segment log digest: lists which objects a segment contains
–tombstones: records appended to mark modified/deleted objects
 Log cleaning and garbage collection: two-level cleaning
–in memory: segment compaction with seglets
–log cleaner: selects a segment and decides whether to clean it by copying live objects to new "survivor" segments; notifies backup servers to reclaim storage space
 Non-volatile memory to survive power failures: either backup power to flush the in-memory log buffer, or new non-volatile memory (with a super-capacitor as backup power)
 Infiniband (vs. Ethernet) for reduced network latency
–hardware support for networking stacks; supports RDMA (remote direct memory access) with "zero-copy"
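The cleaning step above boils down to copying live objects out of a segment into a survivor segment and discarding the rest. A minimal sketch (the `is_live` predicate is an assumed stand-in for RAMCloud's hash-table check of whether an entry is still the current version and not tombstoned):

```python
# Sketch of segment cleaning: keep live entries, drop superseded or
# deleted (tombstoned) ones, and report how much space was reclaimed.
def clean_segment(segment, is_live):
    survivor = [entry for entry in segment if is_live(entry)]
    reclaimed = len(segment) - len(survivor)
    return survivor, reclaimed
```

Because cleaning rewrites only live data, segments with few live objects are the cheapest to clean, which is what drives the cleaner's segment-selection policy.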

14 Typical Datacenter Latency

15 RAMCloud Summary
 Goals: Availability and Durability, and Low Latency!
–the first two to be achieved with no or minimal impact on performance
–while also minimizing system cost and energy
 RAMCloud approach:
–1 copy in DRAM, replicas on disk/flash (contrast: 3 copies in DRAM)
–"cheat": rely on non-volatile memory buffers in backup servers
 Network latency becomes a critical issue!
–"cheat": use Infiniband with hardware networking support & RDMA
 ONOS uses RAMCloud for distributed network state management, but RAMCloud also relies on the network (in this case, Infiniband!)
–is the Infiniband network managed by ONOS or not? A "chicken & egg" problem?

