1 OceanStore Global-Scale Persistent Storage Ying Lu CSCE496/896 Spring 2011

2 Give Credits Many slides are from John Kubiatowicz, University of California at Berkeley; I have modified them and added new slides.

3 Motivation Personal Information Mgmt is the Killer App –Not corporate processing but management, analysis, aggregation, dissemination, filtering for the individual –Automated extraction and organization of daily activities to assist people Information Technology as a Utility –Continuous service delivery, on a planetary scale, on top of a highly dynamic information base

4 OceanStore Context: Ubiquitous Computing Computing everywhere: –Desktop, Laptop, Palmtop, Cars, Cellphones –Shoes? Clothing? Walls? Connectivity everywhere: –Rapid growth of bandwidth in the interior of the net –Broadband to the home and office –Wireless technologies such as CDMA, satellite, laser Rise of the thin-client metaphor: –Services provided by interior of network –Incredibly thin clients on the leaves MEMS devices (sensors + CPU + wireless net in 1 mm³) Mobile society: people move and devices are disposable

5 What do we need for personal information management?

6 Questions about information: Where is persistent information stored? –20th-century tie between location and content outdated How is it protected? –Can disgruntled employee of ISP sell your secrets? –Can't trust anyone (how paranoid are you?) Can we make it indestructible? –Want our data to survive "the big one"! –Highly resistant to hackers (denial of service) –Wide-scale disaster recovery Is it hard to manage? –Worst failures are human-related –Want automatic (introspective) diagnosis and repair

7 First Observation: Want Utility Infrastructure Mark Weiser from Xerox: Transparent computing is the ultimate goal –Computers should disappear into the background In storage context: –Don't want to worry about backup, obsolescence –Need lots of resources to make data secure and highly available, BUT don't want to own them –Outsourcing of storage already very popular Pay monthly fee and your "data is out there" –Simple payment interface → one bill from one company

8 Second Observation: Need wide-scale deployment Many components with geographic separation –System not disabled by natural disasters –Can adapt to changes in demand and regional outages Wide-scale use and sharing also requires wide-scale deployment –Bandwidth increasing rapidly, but latency bounded by speed of light Handling many people with same system leads to economies of scale

9 OceanStore: Everyone's data, One big Utility "The data is just out there" Separate information from location –Locality is only an optimization (an important one!) –Wide-scale coding and replication for durability All information is globally identified –Unique identifiers are hashes over names & keys –Single uniform lookup interface –No centralized namespace required

10 Amusing back-of-the-envelope calculation (courtesy Bill Bolotsky, Microsoft) How many files in the OceanStore? –Assume 10^10 people in world –Say 10,000 files/person (very conservative?) –So 10^14 files in OceanStore! –If 1 gig files (not likely), get ~1 mole of bytes! Truly impressive number of elements… …but small relative to physical constants
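A minimal sketch of the arithmetic in Python (the figures are the slide's own assumptions; the mole comparison is order-of-magnitude):

```python
# Back-of-the-envelope figures from the slide.
people = 10**10            # assume 10^10 people in the world
files_per_person = 10**4   # 10,000 files/person (very conservative?)
total_files = people * files_per_person
print(f"files: {total_files:.0e}")                       # 1e+14

bytes_per_file = 10**9     # 1 gig files (not likely)
total_bytes = total_files * bytes_per_file               # 10^23 bytes
print(f"moles of bytes: {total_bytes / 6.022e23:.2f}")   # ~0.17, order of a mole
```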

11 Utility-based Infrastructure Service provided by confederation of companies –Monthly fee paid to one service provider –Companies buy and sell capacity from each other [Diagram: interconnected provider clouds (Pac Bell, Sprint, IBM, AT&T, Canadian OceanStore) jointly hosting the utility]

12 Outline Motivation Properties of the OceanStore Specific Technologies and approaches: –Naming and Data Location –Conflict resolution on encrypted data –Replication and Deep archival storage –Introspective computing for optimization and repair –Economic models Conclusion

13 Ubiquitous Devices → Ubiquitous Storage Consumers of data move, change from one device to another, work in cafes, cars, airplanes, the office, etc. Properties REQUIRED for OceanStore storage substrate: –Strong Security: data encrypted in the infrastructure; resistance to monitoring and denial of service attacks –Coherence: too much data for naïve users to keep coherent "by hand" –Automatic replica management and optimization: huge quantities of data cannot be managed manually –Simple and automatic recovery from disasters: probability of failure increases with size of system –Utility model: world-scale system requires cooperation across administrative boundaries

14 OceanStore Technologies I: Naming and Data Location Requirements: –System-level names should help to authenticate data –Route to nearby data without global communication –Don't inhibit rapid relocation of data OceanStore approach: Two-level search with embedded routing –Underlying namespace is flat and built from secure cryptographic hashes (160-bit SHA-1) –Search process combines quick, probabilistic search with slower guaranteed search
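A minimal sketch of the flat, self-certifying namespace (the name/key layout and separator are illustrative assumptions, not OceanStore's exact scheme): hashing a human-readable name together with the owner's public key yields a 160-bit GUID that helps authenticate the data it names.

```python
import hashlib

def make_guid(name: bytes, owner_pubkey: bytes) -> bytes:
    """160-bit GUID from a secure hash over name and owner key.
    (Illustrative field layout; only the 160-bit SHA-1 is from the slide.)"""
    return hashlib.sha1(name + b"|" + owner_pubkey).digest()  # 20 bytes = 160 bits

guid = make_guid(b"/users/ying/slides.ppt", b"<owner public key bytes>")
print(guid.hex())  # flat namespace: no centralized naming authority needed
```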

15 Universal Location Facility Takes a 160-bit unique identifier (GUID) and returns the nearest object that matches. [Diagram: a universal name is hashed to a name OID, which global object resolution maps to the object's root structure; the floating replica holds the active data, commit logs, and a checkpoint OID, while archived versions (version OIDs 1-3) are erasure-coded archival copies or snapshots, each also reachable through global object resolution.]

16 Routing Two-tiered approach: Fast probabilistic routing algorithm –Entities that are accessed frequently are likely to reside close to where they are being used (ensured by introspection) Slower, guaranteed hierarchical routing method Self-optimizing

17 Probabilistic Routing Algorithm Bloom filter on each node; attenuated Bloom filter on each directed edge. Self-optimizing on the depth of the attenuated Bloom filter array. [Diagram: a query for X (11010) is routed among nodes n1-n4 by consulting the per-edge filter arrays (1st, 2nd, 3rd levels).]
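A minimal sketch of the attenuated-Bloom-filter idea (the filter width, depth, and hashing are assumptions): each node summarizes its own objects in a Bloom filter, each outgoing edge carries an array of filters whose level i summarizes objects i hops down that edge, and a query follows the edge that matches at the shallowest level.

```python
import hashlib

M = 1024  # filter width in bits (assumed)
K = 3     # hash functions per key (assumed)

def bits(key: str):
    """K bit positions for a key, derived from one SHA-1 digest."""
    h = hashlib.sha1(key.encode()).digest()
    return [int.from_bytes(h[4 * i:4 * i + 4], "big") % M for i in range(K)]

def add(filt: set, key: str):
    filt.update(bits(key))          # Bloom filter modeled as a set of set bits

def maybe_has(filt: set, key: str) -> bool:
    return all(b in filt for b in bits(key))

def best_edge(edges: dict, key: str):
    """edges: neighbor name -> list of Bloom filters, where level i
    summarizes objects i hops down that edge. Pick the edge whose filter
    matches at the shallowest level (probably the nearest replica)."""
    best = None
    for name, levels in edges.items():
        for depth, filt in enumerate(levels):
            if maybe_has(filt, key):
                if best is None or depth < best[0]:
                    best = (depth, name)
                break
    return best  # None: fall back to the slower, guaranteed routing

f1, f2 = set(), set()
add(f1, "objA"); add(f2, "objB")
print(best_edge({"n2": [f1, set()], "n3": [set(), f2]}, "objB"))
# -> (1, 'n3'), barring Bloom-filter false positives
```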

18 Hierarchical Routing Algorithm Based on the Plaxton scheme Every server in the system is assigned a random node-ID Object's root –Each object is mapped to a single node whose node-ID matches the object's GUID in the most bits (starting from the least significant) Information about the GUID (such as location) is stored at its root
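A minimal sketch of that mapping rule (IDs as unsigned integers; tie-breaking among equally good matches is ignored): the root is the node sharing the longest run of low-order bits with the GUID.

```python
def shared_suffix_bits(a: int, b: int, width: int = 160) -> int:
    """Number of matching bits, counted from the least significant."""
    n = 0
    while n < width and (a >> n) & 1 == (b >> n) & 1:
        n += 1
    return n

def root_for(guid: int, node_ids: list) -> int:
    """The object's root: node whose ID matches the GUID in the most bits."""
    return max(node_ids, key=lambda nid: shared_suffix_bits(guid, nid))

nodes = [0b0110, 0b1011, 0b1101]
print(bin(root_for(0b0101, nodes)))  # 0b1101: shares the low bits '101'
```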

19 Constructing the Plaxton Mesh [Diagram: a node such as 0324 builds its neighbor table level by level; entries like x431 and x742 stand for any node whose ID ends in the given digits, matching one more suffix digit per level.]

20 Basic Plaxton Mesh Incremental suffix-based routing. [Diagram: a message for GUID 0x43FE is forwarded hop by hop (a-e), each hop matching one more trailing digit (...E, ..FE, .3FE, 43FE), through nodes such as 0x239E, 0x73FE, and 0x23FE to node 0x43FE; other node-IDs (0x1290, 0xABFE, 0x9990, ...) fill out the mesh.]
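A minimal sketch of one routing step (hex-digit IDs; the neighbor-table layout is an assumption): each hop forwards to a neighbor that fixes the next suffix digit of the destination.

```python
def next_hop(node_id: str, dest: str, table: dict) -> str:
    """One routing step: resolve the next suffix digit of dest.
    table maps (level, digit) -> a neighbor matching dest in `level`
    trailing digits and having `digit` in the next position."""
    level = 0  # trailing digits this node already shares with dest
    while level < len(dest) and node_id[-1 - level] == dest[-1 - level]:
        level += 1
    if level == len(dest):
        return node_id                        # this node is the destination
    return table[(level, dest[-1 - level])]

# A hypothetical node 239E routing toward 43FE shares one trailing digit
# ('E'), so it forwards to a neighbor ending in 'FE':
print(next_hop("239E", "43FE", {(1, 'F'): "23FE"}))  # 23FE
```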

21 Use of Plaxton Mesh: Randomization and Locality

22 OceanStore Enhancements of the Plaxton Mesh Documents have multiple roots (salted hash of GUID) Each node has multiple neighbor links Searches proceed along multiple paths –Tradeoff between reliability, performance and bandwidth? Dynamic node insertion and deletion algorithms –Continuous repair and incremental optimization of links The result is self-healing, self-optimizing, and self-configuring.

23 OceanStore Technologies II: Rapid Update in an Untrusted Infrastructure Requirements: –Scalable coherence mechanism which can operate directly on encrypted data without revealing information –Handle Byzantine failures –Rapid dissemination of committed information OceanStore Approach: –Operations-based interface using conflict resolution Modeled after Xerox Bayou → update packets include predicate/action pairs which operate on encrypted data –User signs updates and principal party signs commits –Committed data multicast to clients
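A minimal sketch of the operations-based interface (the predicate/action names and metadata layout are invented for illustration; Bayou's and OceanStore's actual operation sets differ): an update is a signed list of predicate/action clauses that a replica evaluates against its current state, in favorable cases without ever seeing plaintext.

```python
from dataclasses import dataclass

@dataclass
class Update:
    # Clauses are (predicate, action) pairs; the first clause whose
    # predicate holds against the replica's current state is applied.
    clauses: list
    signature: bytes = b"<user signature over the update>"  # user signs updates

def version_is(expected):                 # predicate: needs only metadata,
    return lambda obj: obj["version"] == expected   # never the plaintext

def replace_block(i, ciphertext):         # action: swap one encrypted block
    def act(obj):
        obj["blocks"][i] = ciphertext
        obj["version"] += 1
    return act

def apply_update(obj, upd: Update) -> bool:
    for pred, act in upd.clauses:
        if pred(obj):
            act(obj)
            return True
    return False                          # all predicates failed: update aborts

obj = {"version": 7, "blocks": [b"enc0", b"enc1"]}
upd = Update(clauses=[(version_is(7), replace_block(1, b"enc1-new"))])
print(apply_update(obj, upd), obj["version"])  # True 8
```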

24 Update Model Concurrent updates w/o wide-area locking → conflict resolution; updates need serialization (a master replica?) Role of primary tier of replicas –All updates submitted to primary tier of replicas, which chooses a final total order by following a Byzantine agreement protocol A secondary tier of replicas –The result of the updates is multicast down the dissemination tree to all the secondary replicas

25 Agreement Need agreement in DS: –Leader, commit, synchronize Distributed agreement algorithm: all non-faulty processes achieve consensus in a finite number of steps –Perfect processes, faulty channels: two-army problem –Faulty processes, perfect channels: Byzantine generals

26 Two-Army Problem [Figure: two allied armies must agree on when to attack using messengers over an unreliable channel; no finite exchange of acknowledgements guarantees agreement.]

27 Possible Consensus Agreement is possible in a synchronous DS [e.g., Lamport et al.] –Messages can be guaranteed to be delivered within a known, finite time –Byzantine Generals Problem In a synchronous DS we can distinguish a slow process from a crashed one

28 Byzantine Generals Problem [Figure: several generals' armies surrounding a target; the loyal generals must agree on a common plan despite traitors among them.]

29 Byzantine Generals - Example (1) The Byzantine generals problem for 3 loyal generals and 1 traitor. a) The generals announce the time to launch the attack (by messages marked by their ids). b) The vectors that each general assembles based on (a). c) The vectors that each general receives, where every general passes his vector from (b) to every other general.

30 Byzantine Generals - Example (2) The same as in the previous slide, except now with 2 loyal generals and 1 traitor.

31 Byzantine Generals Given three processes, if one fails, consensus is impossible Given N processes, if F processes fail, consensus is impossible if N ≤ 3F
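A toy Python sketch of the vector exchange from Example (1), with N = 4 and F = 1 (the smallest case satisfying N ≥ 3F + 1; the traitor's lie model is an assumption): each general announces a value, everyone else relays what it heard, and each loyal general decides by per-slot majority.

```python
import random
from collections import Counter

def majority(xs):
    val, cnt = Counter(xs).most_common(1)[0]
    return val if cnt > len(xs) // 2 else "UNKNOWN"

def interactive_consistency(values, traitor, rng):
    """values[i]: general i's value. One exchange per general: i announces,
    then everyone else relays what it heard from i; decide by majority."""
    n = len(values)
    lie = lambda: rng.choice([0, 5, 9])   # traitor says whatever it likes
    # said[i][j]: what i told j its value was.
    said = [[lie() if i == traitor else values[i] for j in range(n)]
            for i in range(n)]
    # relay[i][k][j]: what k told j that i had said.
    relay = [[[lie() if k == traitor else said[i][k] for j in range(n)]
              for k in range(n)] for i in range(n)]
    decisions = {}
    for j in (g for g in range(n) if g != traitor):
        vec = []
        for i in range(n):
            if i == j:
                vec.append(values[j])      # own value is known directly
                continue
            reports = [said[i][j]]         # i's direct claim to j ...
            reports += [relay[i][k][j] for k in range(n) if k not in (i, j)]
            vec.append(majority(reports))  # ... plus relays from everyone else
        decisions[j] = vec
    return decisions

print(interactive_consistency([1, 2, 3, 4], traitor=2, rng=random.Random(1)))
# The loyal generals (0, 1, 3) decide identical vectors: slots 0, 1, 3 hold
# the true values 1, 2, 4; slot 2 is the same for all (possibly "UNKNOWN").
```

With only 3 generals (N ≤ 3F) each loyal general would see one genuine and one forged report per slot, so majority fails, matching the impossibility bound above.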

32 Tentative Updates: Epidemic Dissemination

33 Committed Updates: Multicast Dissemination

34 Data Coding Model Two distinct forms of data: active and archival Active Data in Floating Replicas –Latest version of the object Archival Data in Erasure Coded Fragments –A permanent, read-only version of the object –During commit, previous version is coded with an erasure code and spread over 100s or 1000s of nodes –Advantage: any 1/2 or 1/4 of the fragments regenerates the data
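A toy sketch of the erasure-coding idea (polynomial evaluation over a small prime field, Reed-Solomon style; OceanStore's actual codes, field, and parameters differ): k data words become m fragments, and any k fragments reconstruct the data, so m = 4k gives the "any 1/4" behavior.

```python
P = 257  # small prime field; data bytes 0..255 are field elements

def encode(data, m):
    """k data words = coefficients of a degree-(k-1) polynomial; the m
    fragments are its values at x = 1..m. Any k of them pin it down."""
    k = len(data)
    assert k <= m <= P - 1
    return [(x, sum(c * pow(x, i, P) for i, c in enumerate(data)) % P)
            for x in range(1, m + 1)]

def polymul_linear(poly, x_l):
    """Multiply a coefficient list by (x - x_l) over GF(P)."""
    out = [0] * (len(poly) + 1)
    for i, c in enumerate(poly):
        out[i + 1] = (out[i + 1] + c) % P
        out[i] = (out[i] - x_l * c) % P
    return out

def decode(fragments, k):
    """Lagrange-interpolate the coefficients back from any k fragments."""
    pts = fragments[:k]
    data = [0] * k
    for x_j, y_j in pts:
        num, denom = [1], 1
        for x_l, _ in pts:
            if x_l != x_j:
                num = polymul_linear(num, x_l)
                denom = denom * (x_j - x_l) % P
        scale = y_j * pow(denom, P - 2, P) % P   # Fermat modular inverse
        for i in range(k):
            data[i] = (data[i] + scale * num[i]) % P
    return data

data = [72, 105, 33]             # k = 3 data words
frags = encode(data, m=12)       # rate 1/4: any 3 of the 12 fragments suffice
print(decode(frags[9:], k=3))    # [72, 105, 33]
```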

35 Floating Replica and Deep Archival Coding [Diagram: each floating replica holds a full copy of the active data, its version history (Ver1: 0x34243, Ver2: 0x49873, Ver3: ...), and conflict-resolution logs; deep-archival versions are kept as erasure-coded fragments scattered across the infrastructure.]

36 Proactive Self-Maintenance Continuous testing and repair of information –Slow sweep through all information to make sure there are sufficient erasure-coded fragments –Continuously reevaluate risk and redistribute data –Slow sweep and repair of metadata/search trees Continuous online self-testing of HW and SW –Detects flaky, failing, or buggy components via: fault injection: triggering hardware and software error handling paths to verify their integrity/existence stress testing: pushing HW/SW components past normal operating parameters scrubbing: periodic restoration of potentially "decaying" hardware or software state –Automates preventive maintenance
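A minimal sketch of the slow sweep (the storage interface, method names, and thresholds are all hypothetical): walk every archived object at low priority, count its reachable fragments, and re-disseminate when redundancy falls below a safety margin.

```python
import time

MIN_FRAGMENTS = 32      # hypothetical repair threshold
TARGET_FRAGMENTS = 64   # hypothetical redundancy target

def slow_sweep(store, pause=0.01):
    """One low-priority repair pass over all archived objects."""
    for guid in store.all_guids():
        if store.count_reachable_fragments(guid) < MIN_FRAGMENTS:
            # Rebuild from any sufficient subset of surviving fragments,
            # then spread fresh fragments over failure-independent nodes.
            data = store.reconstruct(guid)
            store.disseminate(guid, store.encode(data, TARGET_FRAGMENTS))
        time.sleep(pause)  # deliberately slow: repair must not starve clients

class Store:
    """Stand-in for the archival layer (all methods hypothetical)."""
    def all_guids(self): return []
    def count_reachable_fragments(self, guid): return 0
    def reconstruct(self, guid): return b""
    def encode(self, data, m): return [data] * m
    def disseminate(self, guid, frags): pass

slow_sweep(Store(), pause=0)
```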

37 OceanStore Technologies IV: Introspective Optimization Requirements: –Reasonable job on global-scale optimization problem Take advantage of locality whenever possible Sensitivity to limited storage and bandwidth at endpoints –Repair of data structures, increasing of redundancy –Stability in chaotic environment → Active Feedback OceanStore Approach: –Introspective monitoring and analysis of relationships to cluster information by relatedness –Time-series analysis of user and data motion –Rearrangement and replication in response to monitoring Clustered prefetching: fetch related objects Proactive prefetching: get data there before needed Rearrangement in response to overload and attack

38 Example: Client Introspection Client observer and optimizer components –Greedy agents working on behalf of the client Watches client activity/combines with historical info Performs clustering and time-series analysis Forwards results to infrastructure (privacy issues!) –Monitoring state of network to adapt behavior Typical Actions: –Cluster related files together –Prefetch files that will be needed soon –Create/destroy floating replicas
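A minimal sketch of the observer/optimizer pair (the first-order succession model and the threshold are invented for illustration): record the access stream, learn which object tends to follow which, and prefetch the likely successor once the pattern is strong enough.

```python
from collections import defaultdict, Counter

class Observer:
    """Watches the client's access stream; learns what follows what."""
    def __init__(self):
        self.follows = defaultdict(Counter)  # follows[a][b]: b came after a
        self.last = None

    def record(self, oid):
        if self.last is not None:
            self.follows[self.last][oid] += 1
        self.last = oid

class Optimizer:
    """Acts on the observer's model: clustered/proactive prefetching."""
    def __init__(self, observer, fetch):
        self.obs, self.fetch = observer, fetch

    def on_access(self, oid, min_count=3):
        self.obs.record(oid)
        succ = self.obs.follows[oid]
        if succ:
            nxt, cnt = succ.most_common(1)[0]
            if cnt >= min_count:
                self.fetch(nxt)              # prefetch before it is needed

opt = Optimizer(Observer(), fetch=lambda oid: print("prefetch", oid))
for oid in ["a", "b", "a", "b", "a", "b", "a"]:
    opt.on_access(oid)                       # with enough history: prefetch b
```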

39 OceanStore Conclusion The Time is now for a Universal Data Utility –Ubiquitous computing and connectivity are (almost) here! –Confederation of utility providers is the right model OceanStore holds all data, everywhere –Local storage is a cache on global storage –Provides security in an untrusted infrastructure Exploits economies of scale to: –Provide high availability and extreme survivability –Lower maintenance cost: self-diagnosis and repair Insensitivity to technology changes: just unplug one set of servers, plug in others