Pond: The OceanStore Prototype

Introduction Problem: the rising cost of storage management. Observations: universal connectivity via the Internet; $100-per-terabyte storage within three years. Solution: OceanStore.

OceanStore An Internet-scale, cooperative file system offering high durability and universal availability. A two-tier storage system: an upper tier of powerful servers and a lower tier of less powerful hosts.

OceanStore

More on OceanStore Unit of storage: the data object. Applications: e-mail, UNIX file system. Requirements for the object interface: information universally accessible; a balance between privacy and sharing; a simple and usable consistency model; data integrity.

OceanStore Assumptions The infrastructure is untrusted except in aggregate: most nodes are not faulty or malicious. The infrastructure is constantly changing: resources enter and exit the network without prior warning. The system is self-organizing, self-repairing, and self-tuning.

OceanStore Challenges An expressive storage interface; high durability on an untrusted and constantly changing base.

Data Model The view of the system that is presented to client applications

Storage Organization An OceanStore data object is roughly analogous to a file: an ordered sequence of read-only versions. Every version of every object is kept forever, so the store can double as a backup. An object contains metadata, data, and references to previous versions.

Storage Organization A data object's stream of versions is identified by an AGUID (active globally-unique identifier): a cryptographically secure hash of an application-specific name and the owner's public key, which prevents namespace collisions.
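As a rough illustration (not Pond's actual code), an AGUID can be pictured as a secure hash over the application-specific name concatenated with the owner's public key; the class name, method name, and choice of SHA-1 below are assumptions for the sketch.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.PublicKey;

// Hypothetical sketch: derive an AGUID by hashing the application-specific
// name together with the owner's public key, so two owners (or two names)
// can never collide on the same object identifier.
final class AguidSketch {
    static byte[] computeAguid(String appSpecificName, PublicKey ownerKey) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-1");
        digest.update(appSpecificName.getBytes(StandardCharsets.UTF_8));
        digest.update(ownerKey.getEncoded());
        return digest.digest();
    }
}
```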

Storage Organization Each version of a data object is stored in a B-tree-like data structure. Each block has a BGUID, a cryptographically secure hash of the block's content, and each version has a VGUID. Two versions may share blocks.
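Because a BGUID is just the hash of the block's content, any host can check a fetched block against the name it asked for. A minimal sketch, assuming SHA-1 and hypothetical helper names:

```java
import java.security.MessageDigest;

// Minimal sketch of self-verifying blocks: the BGUID is the hash of the
// block content, so a fetched block can be verified against its name.
final class BlockNaming {
    static byte[] bguid(byte[] blockContent) throws Exception {
        return MessageDigest.getInstance("SHA-1").digest(blockContent);
    }

    static boolean verify(byte[] expectedBguid, byte[] fetchedBlock) throws Exception {
        return MessageDigest.isEqual(expectedBguid, bguid(fetchedBlock));
    }
}
```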

Storage Organization

Application-Specific Consistency An update is the operation of adding a new version to the head of a version stream. Updates are applied atomically and are represented as an array of potential actions, each guarded by a predicate.

Application-Specific Consistency Example actions: replacing some bytes, appending new data to an object, truncating an object. Example predicates: checking the latest version number, comparing bytes.

Application-Specific Consistency To implement ACID-like semantics: check for readers and, if there are none, apply the update. To append to a mailbox: no checking is needed. There are no explicit locks or leases.
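A minimal sketch of this update model, using hypothetical types rather than Pond's real API: an update carries an ordered list of guarded actions, and the primary commits the first alternative whose guard holds against the current version.

```java
import java.util.List;

// Sketch of an update as an ordered list of predicate-guarded actions.
// All type names here are illustrative, not Pond's actual classes.
interface Version {}                                    // a read-only object version
interface Predicate { boolean holds(Version latest); }
interface Action    { Version apply(Version latest); }

final class GuardedAction {
    final Predicate guard;
    final Action action;
    GuardedAction(Predicate guard, Action action) { this.guard = guard; this.action = action; }
}

final class Update {
    final List<GuardedAction> alternatives;
    Update(List<GuardedAction> alternatives) { this.alternatives = alternatives; }

    // Applied atomically at the primary: the first alternative whose guard
    // holds is committed as the new head version; otherwise nothing changes.
    Version applyTo(Version latest) {
        for (GuardedAction ga : alternatives) {
            if (ga.guard.holds(latest)) return ga.action.apply(latest);
        }
        return latest;   // e.g. a "check for readers" guard failed, so the write aborts
    }
}
```

In this picture, the mailbox-append case from the slide is simply an action with an always-true guard, while the ACID-like case pairs a compare-version or check-for-readers guard with the write.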

Application-Specific Consistency Reads can also carry predicates. Examples: do not return data older than 30 seconds; only read data from a specific time frame.

System Architecture Unit of synchronization: data object Changes to different objects are independent

Virtualization through Tapestry Resources are virtual and not tied to particular hardware; a virtual resource is named by a GUID (globally unique identifier). Pond uses Tapestry, a decentralized object location and routing system: a scalable overlay network built on TCP/IP.

Virtualization through Tapestry GUIDs address both hosts and resources. Hosts publish the GUIDs of their resources in Tapestry; hosts can also unpublish GUIDs and leave the network.
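The publish/unpublish/locate pattern can be pictured with a toy, in-memory stand-in. This is not the real Tapestry API: Tapestry has no central map and instead routes each query through the overlay toward a nearby publisher.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Toy stand-in for the publish/locate pattern described on the slides.
final class ToyLocator {
    private final Map<String, Set<String>> guidToHosts = new HashMap<>();

    synchronized void publish(String guid, String host) {
        guidToHosts.computeIfAbsent(guid, g -> new HashSet<>()).add(host);
    }

    synchronized void unpublish(String guid, String host) {
        Set<String> hosts = guidToHosts.get(guid);
        if (hosts != null) {
            hosts.remove(host);
            if (hosts.isEmpty()) guidToHosts.remove(guid);
        }
    }

    // Tapestry would route to a nearby replica; here we just return any publisher.
    synchronized Optional<String> locate(String guid) {
        Set<String> hosts = guidToHosts.get(guid);
        return (hosts == null || hosts.isEmpty())
                ? Optional.empty()
                : Optional.of(hosts.iterator().next());
    }
}
```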

Replication and Consistency A data object is a sequence of read-only versions consisting of read-only blocks named by BGUIDs, so replicating them raises no consistency issues. The mapping from an AGUID to the latest VGUID may change, however, so Pond uses primary-copy replication.

Replication and Consistency The primary copy enforces access control and serializes concurrent updates.

Archival Storage Simple replication requires 2x storage to tolerate a single failure; erasure coding does much better. A block is divided into m fragments, and those m fragments are encoded into n > m fragments; any m of them can restore the original block.
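The contract the slide describes looks roughly like the interface below. This is a sketch with assumed method names; a real codec (e.g. a Reed-Solomon variant) would supply the implementations.

```java
// Sketch of the m-of-n erasure-coding contract described above.
interface ErasureCodec {
    // Encode one block into n fragments such that any m of them suffice.
    byte[][] encode(byte[] block, int m, int n);

    // Reconstruct the block from any m surviving fragments; fragmentIndices
    // says which of the original n positions each supplied fragment came from.
    byte[] decode(byte[][] fragments, int[] fragmentIndices, int m, int n);
}
```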

Caching of Data Objects Reconstructing a block from its erasure-coded fragments is expensive: it requires locating m fragments on m machines. Pond therefore uses whole-block caching for frequently-read objects.

Caching of Data Objects To read a block, look for a cached copy first. If none is available: find the block's fragments, decode them, and publish that this host now caches the block. Caching amortizes the cost of erasure encoding and decoding.
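Putting those steps together, a cached read might look like the sketch below. All helper types here are hypothetical; ErasureCodec and ToyLocator refer to the earlier sketches, not to Pond classes.

```java
import java.util.Optional;

// Hypothetical helper types for this sketch.
interface BlockCache { Optional<byte[]> get(String bguid); void put(String bguid, byte[] block); }

final class Fragments {                         // m fragments plus their positions in 0..n-1
    final byte[][] data;
    final int[] indices;
    Fragments(byte[][] data, int[] indices) { this.data = data; this.indices = indices; }
}

interface FragmentStore { Fragments fetchAny(String bguid, int m); }

final class CachedBlockReader {
    private final BlockCache cache;
    private final FragmentStore store;
    private final ErasureCodec codec;           // from the erasure-coding sketch above
    private final ToyLocator locator;           // from the Tapestry sketch above
    private final String localHost;

    CachedBlockReader(BlockCache cache, FragmentStore store, ErasureCodec codec,
                      ToyLocator locator, String localHost) {
        this.cache = cache; this.store = store; this.codec = codec;
        this.locator = locator; this.localHost = localHost;
    }

    byte[] read(String bguid, int m, int n) {
        Optional<byte[]> hit = cache.get(bguid);
        if (hit.isPresent()) return hit.get();                     // whole-block cache hit
        Fragments f = store.fetchAny(bguid, m);                    // locate any m fragments
        byte[] block = codec.decode(f.data, f.indices, m, n);      // erasure-decode the block
        cache.put(bguid, block);                                   // amortize the decode cost
        locator.publish(bguid, localHost);                         // advertise the cached copy
        return block;
    }
}
```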

Caching of Data Objects Updates are pushed to secondary replicas via an application-level multicast tree.

The Full Update Path Serialized updates are disseminated via the object's multicast tree; at the same time, updates are erasure-encoded and fragmented for long-term storage.

The Full Update Path

The Primary Replica The primary servers run a Byzantine agreement protocol, which requires that more than 2/3 of the participants are non-faulty; the number of messages required grows quadratically in the number of participants.

Public-Key Cryptography Public-key signatures are too expensive, so symmetric-key message authentication codes (MACs) are used instead; they are two to three orders of magnitude faster. The downside is that a MAC cannot prove the authenticity of a message to a third party, so MACs are used only within the inner ring, and public-key cryptography is used for the outer ring.
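For a sense of why MACs are cheap, the JDK sketch below computes a shared-key HMAC. The class name, the choice of HMAC-SHA1, and the key-distribution question are all outside what the slide states; this is only an illustration of symmetric-key authentication.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Sketch: a symmetric-key MAC over a message, as inner-ring servers might
// exchange among themselves; far cheaper than a public-key signature.
final class MacSketch {
    static byte[] hmacSha1(byte[] sharedKey, byte[] message) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(sharedKey, "HmacSHA1"));
        return mac.doFinal(message);
    }
}
```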

Proactive Threshold Signatures Byzantine agreement guarantees correctness only if no more than 1/3 of the servers fail during the life of the system. That is impractical for a long-lived system: servers need to be rebooted at regular intervals, yet the set of key holders is fixed.

Proactive Threshold Signatures Proactive threshold signatures allow more flexibility in choosing the membership of the inner ring: a single public key is paired with a number of private key shares, and each server uses its share to generate a signature share.

Proactive Threshold Signatures Any k signature shares may be combined to produce a full signature. To change the membership of an inner ring, the key shares are regenerated; there is no need to change the public key, so the change is transparent to secondary hosts.
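At the interface level, a k-of-n threshold scheme might look like the sketch below. The names are illustrative and the underlying cryptography is not shown; the point is only that shares combine into one signature that verifies under a single, unchanging public key.

```java
import java.util.List;

// Interface-level sketch of a k-of-n threshold signature scheme.
interface ThresholdSigner {
    // Each inner-ring server holds one private key share and signs with it.
    byte[] signShare(byte[] message, int myShareIndex);

    // Any k signature shares combine into one full signature that verifies
    // under the ring's single, unchanging public key.
    byte[] combine(List<byte[]> anyKShares);

    boolean verify(byte[] message, byte[] fullSignature, byte[] ringPublicKey);
}
```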

The Responsible Party Who chooses the inner ring? The responsible party: a server that publishes sets of failure-independent nodes, determined through offline measurement and analysis.

Software Architecture Java atop the Staged Event-Driven Architecture (SEDA). Each subsystem is implemented as a stage with its own state and thread pool; stages communicate through events. About 50,000 semicolons, written by five graduate students and many undergraduate interns.
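A SEDA stage pairs a private event queue with its own thread pool. The sketch below (hypothetical, not Pond's actual classes) shows the shape of the pattern: callers enqueue events, and the stage's workers drain them.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Minimal SEDA-style stage: events go into the stage's queue and are drained
// by the stage's own thread pool. Stages talk to each other only via events.
final class Stage<E> {
    private final BlockingQueue<E> events = new LinkedBlockingQueue<>();
    private final ExecutorService workers;

    Stage(int threads, Consumer<E> handler) {
        workers = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            workers.submit(() -> {
                try {
                    while (true) handler.accept(events.take());   // process one event at a time
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();           // allow clean shutdown
                }
            });
        }
    }

    void enqueue(E event) { events.add(event); }                  // stages communicate via events
}
```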

Software Architecture

Language Choice Java was chosen for speed of development: strongly typed, garbage collected, reduced debugging time, and support for events. Multithreaded Java code is also easy to port; Pond was ported to Windows 2000 in one week.

Language Choice Problems with Java: unpredictability introduced by garbage collection. Every thread in the system is halted while the garbage collector runs, so any ongoing operation stalls for roughly 100 milliseconds; this can add several seconds to requests that travel across machines.

Experimental Setup Two test beds. A local cluster of 42 machines at Berkeley, each with Pentium III CPUs, 1.5 GB of PC133 SDRAM, two 36 GB hard drives in RAID 0, a gigabit Ethernet adaptor, and Linux SMP.

Experimental Setup PlanetLab: roughly 100 nodes across roughly 40 sites, each a 1.2 GHz Pentium III with 1 GB of RAM, hosting up to roughly 1000 virtual nodes.

Storage Overhead For 32-choose-16 erasure encoding: 2.7x for data larger than 8 kB. For 64-choose-16 erasure encoding: 4.8x for data larger than 8 kB.
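A quick sanity check on these figures: the raw redundancy of an m-of-n code is n/m, so the measured overheads sit above the 2.0x and 4.0x floors; attributing the gap to per-fragment metadata and indexing is an assumption, not something stated on the slide.

```java
// Compare the raw redundancy ratio n/m of each code against the measured
// overheads reported above (the gap is presumably metadata; assumption).
public class OverheadCheck {
    public static void main(String[] args) {
        System.out.printf("16-of-32: raw %.1fx vs. measured 2.7x%n", 32.0 / 16.0);
        System.out.printf("16-of-64: raw %.1fx vs. measured 4.8x%n", 64.0 / 16.0);
    }
}
```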

The Latency Benchmark A single client submits updates of various sizes to a four-node inner ring. Metric: the time from just before the request is signed until the signature over the result is checked. The client writes 40 MB of data over 1000 updates, with 100 ms between updates.

The Latency Benchmark [Table: update latency (5th-percentile, median, and 95th-percentile times in ms) by key size and update size, for 4 kB and megabyte-scale updates.] Latency breakdown by phase (ms): Check 0.3, Serialize 6.1, Apply 1.5, Archive 4.5, Sign 77.8.

The Throughput Microbenchmark A number of clients submit updates of various sizes to disjoint objects on a four-node inner ring. The clients create their objects, synchronize themselves, and then update their objects as many times as possible for 100 seconds.

The Throughput Microbenchmark

Archive Retrieval Performance Populate the archive by submitting updates of various sizes to a four-node inner ring, then delete all copies of the data in its reconstructed form; a single client then submits reads.

Archive Retrieval Performance Throughput: 1.19 MB/s on PlanetLab and 2.59 MB/s on the local cluster. Latency: roughly 30-70 milliseconds.

The Stream Benchmark Ran 500 virtual nodes on PlanetLab, with the inner ring in the SF Bay Area and replicas clustered in the 7 largest PlanetLab sites. Updates are streamed to all replicas: one writer (the content creator) repeatedly appends to the data object, and the others read new versions as they arrive. The benchmark measures network resource consumption.

The Stream Benchmark

The Tag Benchmark Measures the latency of token passing: OceanStore is 2.2 times slower than TCP/IP.

The Andrew Benchmark A file-system benchmark: 4.6x slower than NFS in read-intensive phases and 7.3x slower in write-intensive phases.