Implementation and Deployment of a Large-scale Network Infrastructure
Ben Y. Zhao
L. Huang, S. Rhea, J. Stribling, A. D. Joseph, J. D. Kubiatowicz
EECS, UC Berkeley
HP Labs, November 2002

Next-generation Network Applications
- Scalability in applications
  - Processes/threads on a single-node server
  - Cluster (LAN): fast, reliable, practically unlimited communication
  - Next step: scaling to the wide area
- Complexities of global deployment
  - Network unreliability: BGP converges slowly, route redundancy goes unexploited
  - Lack of administrative control over components constrains protocol deployment (multicast, congestion control)
  - Management of large-scale resources/components: locate and utilize resources despite failures

Enabling Technology: DOLR (Decentralized Object Location and Routing)
[Figure: objects named by GUIDs (GUID1, GUID2); the DOLR layer routes messages addressed to a GUID to a nearby replica]

A Solution
- Decentralized Object Location and Routing (DOLR): a wide-area overlay application infrastructure
  - Self-organizing, scalable
  - Fault-tolerant routing and object location
  - Efficient (bandwidth, latency) data delivery
  - Extensible, supports application-specific protocols
- Recent work: Tapestry, Chord, CAN, Pastry, Kademlia, Viceroy, …

What is Tapestry?
- DOLR driving OceanStore global storage (Zhao, Kubiatowicz, Joseph et al. 2000)
- Network structure
  - Nodes assigned bit-sequence nodeIds from a large namespace, interpreted in some radix (e.g. 16)
  - Keys drawn from the same namespace
  - Each key dynamically maps to one unique live node: its root
- Base API (sketched below)
  - Publish / Unpublish (ObjectID)
  - RouteToNode (NodeID)
  - RouteToObject (ObjectID)
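A minimal sketch of this base API in Java, the deployment's own language (see Deployment Status below). The Guid class, the 160-bit width, and all method signatures are illustrative assumptions, not the published interface.

```java
import java.math.BigInteger;

/** Identifier drawn from the shared node/key namespace (assumed 160-bit). */
final class Guid {
    final BigInteger bits;            // interpreted as radix-16 digits for routing
    Guid(BigInteger bits) { this.bits = bits; }
}

/** The four base API calls named on the slide above. */
interface TapestryApi {
    void publish(Guid objectId);                     // advertise a locally stored object
    void unpublish(Guid objectId);                   // withdraw the advertisement
    void routeToNode(Guid nodeId, byte[] msg);       // deliver msg to the named node
    void routeToObject(Guid objectId, byte[] msg);   // deliver msg to a nearby replica
}
```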

Tapestry Mesh
- Incremental prefix-based routing
[Figure: example mesh of hex nodeIds (0x43FE, 0xEF31, 0xEFBA, 0x0921, 0xE932, 0xEF37, …), with each link matching one additional prefix digit toward the destination]

Routing Mesh
- Routing via local routing tables, based on incremental prefix matching
  - Example: 5324 routes to 0629 via 5324 → 0231 → 0667 → 0625 → 0629 (see the sketch below)
  - At the nth hop, the local node matches the destination in at least n digits (if any such node exists)
  - The ith entry in the nth-level routing table points to the nearest node matching prefix(local_ID, n) + i
- Properties
  - At most log_b(N) overlay hops between nodes
  - Routing table size: b · log_b(N)
  - Each entry keeps c−1 backups, for a total size of c · b · log_b(N)
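As a worked instance of those formulas: with radix b = 16 and N ≈ 10^6 nodes, a route takes at most log_16(10^6) ≈ 5 overlay hops, and a routing table holds 16 × 5 = 80 primary entries (240 with c = 3 candidates per entry). The per-hop decision might look like the following sketch, assuming fixed-length hex-string IDs and a toy map-based table; class and method names are illustrative.

```java
import java.util.Map;

class PrefixRouter {
    private final String localId;   // e.g. "5324"; fixed-length hex string
    // level -> (next digit -> nearest known node extending the prefix)
    private final Map<Integer, Map<Character, String>> table;

    PrefixRouter(String localId, Map<Integer, Map<Character, String>> table) {
        this.localId = localId;
        this.table = table;
    }

    /** Number of leading digits shared between the local ID and dest. */
    private int sharedPrefixLen(String dest) {
        int n = 0;
        while (n < localId.length() && localId.charAt(n) == dest.charAt(n)) n++;
        return n;
    }

    /** A node matching one more digit of dest, or null once we arrive. */
    String nextHop(String dest) {
        if (dest.equals(localId)) return null;    // message has arrived
        int level = sharedPrefixLen(dest);        // digits already matched
        Map<Character, String> row = table.get(level);
        return (row == null) ? null : row.get(dest.charAt(level));
    }
}
```

On the slide's example, node 0667 shares the prefix "06" with destination 0629, so it consults level 2 of its table and forwards to its entry for digit '2' (here 0625).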

Object Location: Randomization and Locality

Object Location
- Distribute replicas of object references
  - Only references, not the data itself (a level of indirection)
  - Place more of them closer to the object itself
- Publication: place object location pointers into the network, stored at the hops between the object and its root node
- Location: route a message toward the root from the client; redirect to the object once a location pointer is found
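A toy sketch of that publish/locate pattern. To stay self-contained it assumes the route from a node to an object's root is supplied as a list of node IDs (in the real system each hop comes from the prefix router); all names are illustrative.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LocationServiceSketch {
    // per-node pointer store: nodeId -> (objectId -> serverId)
    private final Map<String, Map<String, String>> pointers = new HashMap<>();

    /** Publish: drop a location pointer at every hop from server to root. */
    void publish(String objectId, String serverId, List<String> pathToRoot) {
        for (String nodeId : pathToRoot)
            pointers.computeIfAbsent(nodeId, k -> new HashMap<>())
                    .put(objectId, serverId);
    }

    /** Locate: walk toward the root, redirecting at the first pointer found. */
    String locate(String objectId, List<String> clientPathToRoot) {
        for (String nodeId : clientPathToRoot) {
            String server = pointers.getOrDefault(nodeId, Map.of()).get(objectId);
            if (server != null) return server;    // pointer found: redirect here
        }
        return null;                              // object was never published
    }
}
```

Because pointers sit on the path from the object toward its root, a client whose own path joins that path early finds a nearby replica without ever reaching the root.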

Node Insertion
- Inserting a new node N
  - Notify the need-to-know nodes of N; N fills null entries in their routing tables
  - Move locally rooted object references to N
  - Construct a locally optimal routing table for N
  - Notify nodes near N for optimization
- Two-phase node insertion (both phases sketched below)
  - Acknowledged multicast
  - Nearest neighbor approximation

Acknowledged Multicast
- Reach the need-to-know nodes of N (e.g. all nodes sharing the prefix 3111)
- Each adds N to its routing table
- Move root object references to N
[Figure: multicast tree rooted at gateway G, fanning out to the nodes sharing N's prefix]
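A rough sketch of the multicast phase, under the assumption that each node can enumerate routing-table neighbors whose IDs extend the current prefix by one more digit. The real protocol runs asynchronously across the network, with acknowledgements flowing back up the tree; here the recursion simply returning plays the role of the ack. All names are assumptions.

```java
import java.util.Set;

abstract class AckedMulticastSketch {
    /** Known nodes whose IDs extend `prefix` by exactly one more digit. */
    abstract Set<String> extendingNeighbors(String localNode, String prefix);

    abstract void addToRoutingTable(String localNode, String newNode);

    /** Reach every node sharing `prefix`; returning stands in for the ack. */
    void multicast(String localNode, String prefix, String newNode) {
        addToRoutingTable(localNode, newNode);
        for (String child : extendingNeighbors(localNode, prefix)) {
            String longerPrefix = child.substring(0, prefix.length() + 1);
            multicast(child, longerPrefix, newNode);   // recurse one digit deeper
        }
    }
}
```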

Nearest Neighbor
- N iterates, starting with list = need-to-know nodes and L = length of prefix(N, S)
  - Measure distances to the nodes in the list; use them to fill routing table level L
  - Trim to the k closest nodes; list = backpointers from that k-set; decrement L
  - Repeat until L == 0
[Figure: N probing its need-to-know nodes]
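A sketch of that iteration, with the distance probe, backpointer lookup, and table fill left abstract since they are network operations; all names are assumptions (and measureRtt results would be cached in practice rather than re-probed inside the sort).

```java
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

abstract class NearestNeighborJoin {
    abstract double measureRtt(String nodeId);                    // ping a candidate
    abstract Set<String> backpointers(String nodeId, int level);  // who points at it
    abstract void fillLevel(int level, List<String> byDistance);  // write one table row

    void join(Set<String> needToKnow, int startLevel, int k) {
        Set<String> candidates = needToKnow;
        for (int level = startLevel; level >= 0; level--) {
            List<String> byDistance = candidates.stream()
                    .sorted(Comparator.comparingDouble(this::measureRtt))
                    .collect(Collectors.toList());
            fillLevel(level, byDistance);                         // closest entries win
            Set<String> next = new HashSet<>();
            for (String n : byDistance.subList(0, Math.min(k, byDistance.size())))
                next.addAll(backpointers(n, level));              // widen the search
            candidates = next;                                    // then one level down
        }
    }
}
```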

Talk Outline
- Algorithms
- Architecture
  - Architectural components
  - Extensibility API
- Evaluation
- Ongoing Projects
- Conclusion

Single Tapestry Node
[Diagram: layered node architecture, top to bottom]
- Applications: decentralized file systems, application-level multicast, approximate text matching
- Application interface / upcall API
- Router (routing table & object pointer DB) and dynamic node management
- Network link management
- Transport protocols

Single Node Implementation
[Diagram: component stack inside a Java Virtual Machine running the SEDA event-driven framework]
- Applications ↔ Application Programming Interface: API calls (route to node/object), upcalls, enter/leave Tapestry
- Dynamic Tapestry: state maintenance, node insertion/deletion
- Core Router: routing
- Patchwork: fault detection via heartbeat messages
- Network Stage / Distance Map: link maintenance, node insertion/deletion messages, UDP pings

Message Routing
- Router: fast routing to nodes / objects
- On a new route message: if the application has registered an upcall, signal the application's upcall handler; otherwise forward to nextHop(h+1, G)
- On a new location message: if this node holds object pointers for the target, forward to nextHop(0, obj); otherwise forward to nextHop(h+1, G); signal the upcall handler if an upcall is registered
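That dispatch logic, reconstructed from the slide's flowchart as a Java sketch. nextHop(), the pointer lookup, and the upcall interface are assumed names; the stubs only mark where the real router acts.

```java
class RouterDispatchSketch {
    interface Upcalls { boolean registered(String guid); void signal(Object msg); }

    private final Upcalls app;
    RouterDispatchSketch(Upcalls app) { this.app = app; }

    /** New route message for GUID g, already matched in h digits. */
    void onRouteMsg(String g, int h, Object msg) {
        if (app.registered(g)) app.signal(msg);                // application intercepts
        else forward(nextHop(h + 1, g), msg);                  // match one more digit
    }

    /** New location message for object obj, already matched in h digits. */
    void onLocationMsg(String obj, int h, Object msg) {
        String server = pointerFor(obj);
        if (server != null) forward(nextHop(0, server), msg);  // redirect to object
        else forward(nextHop(h + 1, obj), msg);                // keep climbing to root
        if (app.registered(obj)) app.signal(msg);              // optional upcall
    }

    // assumed primitives, stubbed for the sketch:
    String nextHop(int level, String guid) { return guid; }
    String pointerFor(String obj) { return null; }
    void forward(String node, Object msg) { /* transmit on the overlay */ }
}
```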

Extensibility API
- deliver(G, Aid, Msg): invoked at the message destination; asynchronous, returns immediately
- forward(G, Aid, Msg): invoked at an intermediate hop in the route; no action taken by default, the application calls route()
- route(G, Aid, Msg, NextHopNodeSet): called by the application to request that the message be routed to the set of next-hop nodes
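The same three calls as Java interfaces, one for upcalls from the overlay and one for the application's downcall; the signatures are illustrative rather than the published ones.

```java
import java.util.Set;

/** Upcalls the overlay makes into the application. */
interface TapestryUpcall {
    /** Invoked at the message's destination; asynchronous, returns immediately. */
    void deliver(String guid, int appId, byte[] msg);

    /** Invoked at each intermediate hop; no action is taken by default, and
     *  the application continues the route by calling route(). */
    void forward(String guid, int appId, byte[] msg);
}

/** Downcall the application makes into the overlay. */
interface TapestryDowncall {
    /** Request that msg be routed on to the given set of next-hop nodes. */
    void route(String guid, int appId, byte[] msg, Set<String> nextHopNodes);
}
```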

Local Operations
- Accessors for Tapestry-maintained state:
  - nextHopSet = Llookup(G, num): accesses the routing table; returns up to num candidates for the next hop toward G
  - objReferenceSet = Lsearch(G, num): searches object references for G; returns up to num references for the object, sorted by increasing network distance
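The two accessors as an interface sketch (types are illustrative):

```java
import java.util.List;

interface TapestryLocalOps {
    /** Llookup: up to num candidate next hops toward guid, from the routing table. */
    List<String> llookup(String guid, int num);

    /** Lsearch: up to num object references for guid, sorted by increasing
     *  network distance from this node. */
    List<String> lsearch(String guid, int num);
}
```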

Deployment Status
- C simulator: packet-level simulation, scales up to 10,000 nodes
- Java implementation: … semicolons of Java, 270 class files
- Deployed on a local-area cluster (40 nodes)
- Deployed on the PlanetLab global network (~100 distributed nodes)

Talk Outline
- Algorithms
- Architecture
- Evaluation
  - Micro-benchmarks
  - Stable network performance
  - Single and parallel node insertion
- Ongoing Projects
- Conclusion

Micro-benchmark Methodology
[Diagram: sender → Tapestry → receiver over a LAN link, each with a control channel]
- Experiment run in a LAN over Gigabit Ethernet
- Sender sends messages at full speed
- Measure inter-arrival time for the last … msgs
  - first … msgs: remove cold-start effects
  - … msgs: remove network jitter effects
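A sketch of the receiver-side computation under those assumptions: record per-message arrival timestamps, drop a warm-up prefix, and average the remaining inter-arrival gaps. The warm-up count is left as a parameter because the slide's exact message counts did not survive transcription.

```java
class InterArrivalBench {
    /** Mean inter-arrival gap in nanoseconds, skipping the first `warmup` msgs. */
    static double meanGapNanos(long[] arrivalNanos, int warmup) {
        long total = 0;
        int gaps = 0;
        for (int i = warmup + 1; i < arrivalNanos.length; i++) {
            total += arrivalNanos[i] - arrivalNanos[i - 1];   // one gap
            gaps++;
        }
        return gaps == 0 ? Double.NaN : (double) total / gaps;
    }
    // throughput in msgs/sec is then 1e9 / meanGapNanos(...)
}
```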

Micro-benchmark Results
[Graph: per-message processing time vs. message size]
- Constant processing overhead of ~50 µs
- Latency dominated by byte copying
- For 5 KB messages, throughput ≈ 10,000 msgs/sec

Large Scale Methodology
- PlanetLab global network
  - 98 machines at 42 institutions across North America, Europe, and Australia (~60 machines utilized)
  - 1.26 GHz PIII (1 GB RAM), 1.8 GHz PIV (2 GB RAM)
  - North American machines (2/3) on Internet2
- Tapestry Java deployment
  - 6–7 nodes on each physical machine
  - IBM Java JDK 1.3.0
  - Node virtualization inside the JVM and SEDA
  - Scheduling between virtual nodes increases latency

Node to Node Routing
[Graph: routing stretch, bucketed by inter-node ping distance]
- Ratio of end-to-end routing latency to shortest ping distance between nodes
- All node pairs measured, placed into buckets
- Median = 31.5, 90th percentile = 135

Object Location
[Graph: location stretch, bucketed by client-object ping distance]
- Ratio of end-to-end latency for object location to shortest ping distance between client and object
- Each node publishes 10,000 objects; lookups performed on all objects
- 90th percentile = 158

Latency to Insert Node
[Graph: insertion latency vs. size of the existing network]
- Latency to dynamically insert a node into an existing Tapestry, as a function of the size of the existing network
- Humps due to the expected filling of each routing level

Bandwidth to Insert Node
[Graph: amortized insertion bandwidth vs. network size]
- Cost in bandwidth of dynamically inserting a node into the Tapestry, amortized over each node in the network
- Per-node bandwidth decreases with the size of the network

Parallel Insertion Latency
[Graph: insertion latency vs. ratio of insertion group size to network size]
- Latency to dynamically insert nodes in unison into an existing Tapestry of 200 nodes
- Shown as a function of insertion group size / network size
- 90th percentile = 55042

Results Summary
- Lessons learned
  - Node virtualization causes resource contention
  - Accurate network distances are hard to measure
- Efficiency verified
  - Message processing ≈ 50 µs; throughput ~10,000 msgs/s
  - Routing to nodes/objects within a small factor of optimal
- Algorithmic scalability
  - Single-node insertion latency/bandwidth scale sublinearly with network size
  - Parallel insertion scales linearly with group size

Talk Outline
- Algorithms
- Architecture
- Evaluation
- Ongoing Projects
  - P2P landmark routing: Brocade
  - Applications: Shuttle, Interweave, ATA
- Conclusion

State of the Art Routing
- High-dimensionality and coordinate-based P2P routing: Tapestry, Pastry, Chord, CAN, etc.
- Sub-linear storage and number of overlay hops per route
- Properties depend on a random distribution of names
- Optimized for uniform mesh-style networks

Reality
[Figure: P2P overlay spanning a transit-stub topology of AS-1, AS-2, and AS-3, with sender S and receiver R]
- Transit-stub topology, disparate resources per node
- Result: inefficient inter-domain routing (bandwidth, latency)

Landmark Routing on P2P: Brocade
- Exploit non-uniformity
- Minimize wide-area routing hops and bandwidth
- Secondary overlay on top of Tapestry
  - Select super-nodes by administrative domain
  - Divide the network into cover sets
  - Super-nodes form a secondary Tapestry
  - Advertise each cover set as local objects
- Brocade routes directly into the destination's local network, then resumes P2P routing (sketched below)
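A sketch of the routing decision at a supernode, following the slide's scheme: because cover sets are advertised as objects on the secondary overlay, routing "to the object destId" lands at the destination's supernode. The class, set, and helper names are assumptions.

```java
import java.util.Set;

class BrocadeRouterSketch {
    private final Set<String> localCoverSet;   // nodeIds in this admin domain

    BrocadeRouterSketch(Set<String> localCoverSet) {
        this.localCoverSet = localCoverSet;
    }

    void route(String destId, byte[] msg) {
        if (localCoverSet.contains(destId)) {
            tapestryRoute(destId, msg);        // destination is local: normal overlay
        } else {
            // The secondary overlay resolves destId to its supernode, which
            // then resumes ordinary Tapestry routing inside its own domain.
            brocadeRouteToObject(destId, msg);
        }
    }

    // assumed primitives, stubbed for the sketch:
    void tapestryRoute(String dest, byte[] msg) { }
    void brocadeRouteToObject(String dest, byte[] msg) { }
}
```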

Brocade Routing
[Figure: source S and destination D across AS-1, AS-2, and AS-3; the original route wanders across the P2P network, while the brocade route cuts through the brocade layer into D's domain]

Applications under Development
- OceanStore: global resilient file store
- Shuttle: decentralized P2P chat service; leverages Tapestry for fault-tolerant routing
- Interweave: keyword-searchable file-sharing utility; fully decentralized, exploits network locality
- Approximate Text Addressing: uses text fingerprinting to map similar documents to single IDs; killer app: a decentralized spam mail filter

For More Information
- Tapestry and related projects (and these slides): …
- OceanStore: …
- Related papers: …