Pond: the OceanStore Prototype Sean Rhea, Patrick Eaton, Dennis Geels, Hakim Weatherspoon, Ben Zhao, and John Kubiatowicz University of California, Berkeley

The OceanStore “Vision” (diagram slide; labels: HotOS Attendee, me, Paul Hogan)

The Challenges Maintenance –Many components, many administrative domains –Constant change –Must be self-organizing –Must be self-maintaining –All resources virtualized—no physical names Security –High availability is a hacker’s target-rich environment –Must have end-to-end encryption –Must not place too much trust in any one host

Talk Outline Introduction System Overview –Tapestry –Erasure codes –Byzantine agreement –Putting it all together Implementation and Deployment Performance Results Conclusion

The Technologies: Tapestry Tapestry performs Distributed Object Location and Routing From any host, find a nearby… –replica of a data object Efficient –O(log N) location time, N = # of hosts in system Self-organizing, self-maintaining
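To make the O(log N) claim concrete, here is a minimal sketch of digit-by-digit routing in the Tapestry style (shown here as prefix matching): each hop forwards to a neighbor that shares one more digit with the destination ID. The class and routing-table layout are illustrative assumptions, not the actual Tapestry or Pond code.

```java
// Illustrative sketch of prefix routing: each hop fixes one more digit of the
// destination ID, so a lookup takes O(log N) hops for N nodes. Hypothetical
// class, not the prototype's Tapestry implementation. IDs are assumed to be
// hex strings of equal length.
import java.util.Map;

public class PrefixRouter {
    private final String localId;   // e.g. a 40-digit hex node ID
    // routingTable[p] maps "next digit after a length-p shared prefix" -> neighbor ID
    private final Map<Integer, Map<Character, String>> routingTable;

    public PrefixRouter(String localId,
                        Map<Integer, Map<Character, String>> routingTable) {
        this.localId = localId;
        this.routingTable = routingTable;
    }

    /** Length of the shared prefix between our ID and the destination ID. */
    private int sharedPrefix(String destId) {
        int p = 0;
        while (p < localId.length() && localId.charAt(p) == destId.charAt(p)) p++;
        return p;
    }

    /**
     * Choose the next hop: a neighbor that matches one more digit of destId
     * than we do. Returns null when no closer node is known (we are the
     * object's "root") or when the destination is reached.
     */
    public String nextHop(String destId) {
        int p = sharedPrefix(destId);
        if (p == destId.length()) return null;            // destination reached
        Map<Character, String> level = routingTable.get(p);
        return (level == null) ? null : level.get(destId.charAt(p));
    }
}
```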

The Technologies: Tapestry (cont.) (diagram slide; labels: HotOS Attendee, Paul Hogan)

The Technologies: Erasure Codes More durable than replication for the same space The technique: (diagram: blocks W, X, Y, Z transformed by an encoding function f and recovered by f⁻¹)
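As an illustration of the encode/decode idea (the f and f⁻¹ in the diagram), the sketch below uses a single XOR parity fragment, which can rebuild any one lost data fragment. This is a deliberate simplification: the real archive uses a stronger erasure code that produces many fragments and tolerates the loss of a large fraction of them.

```java
// Minimal sketch of the erasure-coding idea, NOT Pond's archival coder.
// One XOR parity fragment lets us rebuild any ONE lost data fragment; real
// codes generate n fragments from m blocks and survive losing any n - m.
public class XorParityDemo {
    /** Encode: return the XOR parity of equal-length data fragments. */
    static byte[] encodeParity(byte[][] fragments) {
        byte[] parity = new byte[fragments[0].length];
        for (byte[] f : fragments)
            for (int i = 0; i < parity.length; i++) parity[i] ^= f[i];
        return parity;
    }

    /** Decode: rebuild the single missing fragment from the survivors plus parity. */
    static byte[] recover(byte[][] fragments, byte[] parity, int missing) {
        byte[] rebuilt = parity.clone();
        for (int j = 0; j < fragments.length; j++)
            if (j != missing)
                for (int i = 0; i < rebuilt.length; i++) rebuilt[i] ^= fragments[j][i];
        return rebuilt;
    }

    public static void main(String[] args) {
        byte[][] data = { {1, 2, 3}, {4, 5, 6}, {7, 8, 9} };
        byte[] parity = encodeParity(data);
        byte[] lost = data[1].clone();
        byte[] rebuilt = recover(data, parity, 1);        // pretend fragment 1 was lost
        System.out.println(java.util.Arrays.equals(lost, rebuilt));  // prints true
    }
}
```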

The Technologies: Byzantine Agreement Guarantees all non-faulty replicas agree –Given N = 3f + 1 replicas, up to f may be faulty/corrupt Expensive –Requires O(N²) communication Combine with primary-copy replication –Small number participate in Byzantine agreement –Multicast results of decisions to remainder
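The fault-tolerance arithmetic on this slide is easy to check directly. The sketch below only computes the thresholds (it is not the agreement protocol itself); the example of four replicas mirrors the four primary sites used in the experiments later in the talk.

```java
// Back-of-the-envelope check of the slide's arithmetic, not a protocol.
// With N = 3f + 1 primary replicas, up to f may be faulty, and f + 1 matching
// answers suffice to trust a result (at least one must come from a correct replica).
public class ByzantineThresholds {
    static int maxFaulty(int n)      { return (n - 1) / 3; }       // f for N = 3f + 1
    static int proofThreshold(int n) { return maxFaulty(n) + 1; }  // f + 1 matching replies

    public static void main(String[] args) {
        int n = 4;  // e.g. four primary replicas, as in the Andrew benchmark setup below
        System.out.println("replicas=" + n
                + " tolerated faults=" + maxFaulty(n)
                + " matching replies needed=" + proofThreshold(n));
    }
}
```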

Putting it all together: the Path of a Write (diagram: HotOS Attendee and Other Researchers as clients, Primary Replicas, Archival Servers for durability, Secondary Replicas holding soft state)

Talk Outline Introduction System Overview Implementation and Deployment Performance Results Conclusion

Prototype Implementation All major subsystems operational –Self-organizing Tapestry base –Primary replicas use Byzantine agreement –Secondary replicas self-organize into multicast tree –Erasure-coding archive –Application interfaces: NFS, IMAP/SMTP, HTTP Event-driven architecture –Built on SEDA 280K lines of Java (J2SE v1.3) –JNI libraries for cryptography, erasure coding
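For readers unfamiliar with SEDA, the sketch below shows the general stage pattern the prototype is organized around: each subsystem consumes events from its own queue with a small thread pool and hands work to other stages by enqueueing new events. It is a generic modern-Java illustration, not the prototype's actual SEDA classes.

```java
// Hedged illustration of a SEDA-style "stage": an event queue plus a thread
// pool per subsystem. Generic sketch only, not Pond's event-driven code.
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

public class Stage<E> {
    private final BlockingQueue<E> queue = new LinkedBlockingQueue<>();
    private final ExecutorService workers;

    public Stage(int threads, Consumer<E> handler) {
        workers = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            workers.submit(() -> {
                try {
                    while (true) handler.accept(queue.take());  // block, then handle one event
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();         // shut this worker down cleanly
                }
            });
        }
    }

    /** Other stages hand off work here instead of making direct method calls. */
    public void enqueue(E event) { queue.add(event); }
}
```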

Deployment on PlanetLab –~100 hosts, ~40 sites –Shared .ssh/authorized_keys file Pond: up to 1000 virtual nodes –Using custom Perl scripts –5 minute startup Gives global scale for free

Talk Outline Introduction System Overview Implementation and Deployment Performance Results –Andrew Benchmark –Stream Benchmark Conclusion

Performance Results: Andrew Benchmark Built a loopback file server in Linux –Translates kernel NFS calls into OceanStore API Lets us run the Andrew File System Benchmark (diagram: Andrew Benchmark → Linux Kernel → Loopback Server → Pond Daemon → Network, via fwrite syscall, NFS Write, Pond API, and Msg to Primary)
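A rough sketch of the loopback translation described above follows. LoopbackNfsServer, PondSession, submitUpdate, and read are hypothetical names standing in for the OceanStore API; they are used only to show how an NFS WRITE or READ is re-expressed as an OceanStore operation, not the prototype's real interface.

```java
// Hedged sketch of the loopback idea: the kernel's NFS client sends RPCs to a
// local user-level server, which re-expresses each one as an OceanStore
// operation. All names here are hypothetical placeholders.
public class LoopbackNfsServer {
    interface PondSession {
        void submitUpdate(String objectGuid, long offset, byte[] data);
        byte[] read(String objectGuid, long offset, int length);
    }

    private final PondSession pond;
    public LoopbackNfsServer(PondSession pond) { this.pond = pond; }

    /** NFS WRITE: an fwrite() in the benchmark lands here, then at the primary replicas. */
    void nfsWrite(String fileGuid, long offset, byte[] data) {
        pond.submitUpdate(fileGuid, offset, data);   // becomes a signed update message
    }

    /** NFS READ: served from a nearby replica when the local cache is fresh enough. */
    byte[] nfsRead(String fileGuid, long offset, int count) {
        return pond.read(fileGuid, offset, count);
    }
}
```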

Performance Results: Andrew Benchmark Ran Andrew on Pond –Primary replicas at UCB, UW, Stanford, Intel Berkeley –Client at UCB –Control: NFS server at UW Pond faster on reads: 4.6x –Phases III and IV –Only contact primary when cache older than 30 seconds But slower on writes: 7.3x –Phases I, II, and V –Only 1024-bit keys are secure –512-bit keys show CPU cost (table: OceanStore vs. NFS times in milliseconds for phases I–V and total)

Closer Look: Write Cost Byzantine algorithm adapted from Castro & Liskov –Gives fault tolerance, security against compromise –Fast version uses symmetric cryptography Pond uses threshold signatures instead –Signature proves that f + 1 primary replicas agreed –Can be shared among secondary replicas –Can also change primaries w/o changing public key Big plus for maintenance costs –Results good for all time once signed –Replace faulty/compromised servers transparently
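The sketch below illustrates why the f + 1 property is useful: shares are collected until f + 1 distinct primary replicas have signed, then combined into a single proof any secondary replica can verify against one unchanging public key. SignatureShare and combine are placeholders, not a real threshold-cryptography API.

```java
// Conceptual sketch of collecting threshold-signature shares. The crypto
// itself is a placeholder; only the f + 1 counting logic is shown.
import java.util.HashMap;
import java.util.Map;

public class ThresholdCollector {
    record SignatureShare(int replicaId, byte[] shareBytes) {}

    private final int f;                                   // max faulty primaries tolerated
    private final Map<Integer, SignatureShare> shares = new HashMap<>();

    public ThresholdCollector(int f) { this.f = f; }

    /** Add a share; return a combined proof once f + 1 distinct replicas have signed. */
    public byte[] addShare(SignatureShare s) {
        shares.put(s.replicaId(), s);                      // de-duplicate by replica
        if (shares.size() >= f + 1) {
            return combine(shares.values().toArray(new SignatureShare[0]));
        }
        return null;                                       // not enough agreement yet
    }

    // Placeholder for the real threshold-signature combine step
    // (the 70+ ms signing cost discussed on the next slide).
    private byte[] combine(SignatureShare[] quorum) { return new byte[0]; }
}
```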

Closer Look: Write Cost (table: times in milliseconds for the phases Validate, Serialize, Apply, Archive, and Sign Result, for a 4 kB write and a 2 MB write) Small writes –Signature dominates –Threshold sigs. slow! –Takes 70+ ms to sign –Compare to 5 ms for regular sigs. Large writes –Encoding dominates –Archive cost per byte –Signature cost per write

Closer Look: Write Cost (run on cluster)

Closer Look: Write Cost Throughput in the wide area:
Primary location   Client location   Tput (MB/s)
Cluster            Cluster           2.59
Cluster            PlanetLab         1.22
Bay Area           PlanetLab         1.19 (archive on)
Wide Area Throughput –Not limited by signatures –Not limited by archive –Not limited by Byzantine process bandwidth use –Limited by bandwidth between clients and primary replicas

Talk Outline Introduction System Overview Implementation and Deployment Performance Results –Andrew Benchmark –Stream Benchmark Conclusion

Closer look: Dissemination Tree (diagram; components: Primary Replicas, Secondary Replicas, Archival Servers, and clients such as the HotOS Attendee and Other Researchers)

Closer look: Dissemination Tree Self-organizing application-level multicast tree –Connects all secondary replicas to primary ones –Shields primary replicas from request load –Saves bandwidth on consistency traffic Tree joining heuristic (“first-order” solution): –Connect to closest replica using Tapestry Take advantage of Tapestry’s locality properties –Should minimize use of long-distance links –A sort of poor man’s CDN
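A minimal sketch of that first-order join heuristic, assuming a hypothetical locateNearestReplica operation standing in for Tapestry's locate: a new secondary replica attaches to the closest existing replica it can find, falling back to a primary replica if none is known.

```java
// Hedged sketch of the "first-order" dissemination-tree join heuristic.
// The Locator interface is a stand-in for Tapestry's locate operation,
// not its real interface.
public class DisseminationTreeJoin {
    interface Locator {
        /** Returns the closest node already holding a replica of guid, or null. */
        String locateNearestReplica(String objectGuid);
    }

    private final Locator tapestry;
    private final String primaryAddress;

    public DisseminationTreeJoin(Locator tapestry, String primaryAddress) {
        this.tapestry = tapestry;
        this.primaryAddress = primaryAddress;
    }

    /** Pick a parent in the multicast tree; Tapestry's locality keeps the link short. */
    public String chooseParent(String objectGuid) {
        String nearby = tapestry.locateNearestReplica(objectGuid);
        return (nearby != null) ? nearby : primaryAddress;  // root the tree at a primary replica
    }
}
```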

Performance Results: Stream Benchmark Goal: measure efficiency of dissemination tree –Multicast tree between secondary replicas Ran 500 virtual nodes on PlanetLab –Primary replicas in SF Bay Area –Other replicas clustered in 7 largest PlanetLab sites Streams writes to all replicas –One content creator repeatedly appends to one object –Other replicas read new versions as they arrive –Measure network resource consumption

Performance Results: Stream Benchmark Dissemination tree uses network resources efficiently –Most bytes sent across local links as second tier grows Acceptable latency increase over broadcast (33%)

Related Work Distributed Storage –Traditional: AFS, CODA, Bayou –Peer-to-peer: PAST, CFS, Ivy Byzantine fault tolerant storage –Castro-Liskov, COCA, Fleet Threshold signatures –COCA, Fleet Erasure codes –Intermemory, Pasis, Mnemosyne, Free Haven Others –Publius, Freenet, Eternity Service, SUNDR

Conclusion OceanStore designed as a global-scale file system Design meets primary challenges –End-to-end encryption for privacy –Limited trust in any one host for integrity –Self-organizing and self-maintaining to increase usability Pond prototype functional –Threshold signatures more expensive than expected –Simple dissemination tree fairly effective –A good base for testing new ideas

More Information and Code Availability More OceanStore work –Overview: ASPLOS 2000 –Tapestry: SPAA 2002 More papers and code for Pond available at