1 A Scalable Information Management Middleware for Large Distributed Systems Praveen Yalagandula HP Labs, Palo Alto Mike Dahlin, The University of Texas at Austin

2 Trends Large wide-area networked systems  Enterprise networks: IBM – 170 countries, > employees  Computational Grids: NCSA Teragrid – 10 partners and growing, nodes per site  Sensor networks: Navy Automated Maintenance Environment – about 300 ships in the US Navy, 200,000 sensors in a destroyer [3eti.com]

9 Research Vision Wide-area Distributed Operating System Goals:  Ease building applications  Utilize resources efficiently [Figure: components of the wide-area distributed OS – Information Management, Monitoring, Data Management, Scheduling, Security]

10 Information Management Most large-scale distributed applications  Monitor, query, and react to changes in the system  Examples: job scheduling, system administration and management, service location, sensor monitoring and control, file location service, multicast service, naming and request routing, …… A general information management middleware  Eases design and development  Avoids repetition of the same task by different applications  Provides a framework to explore tradeoffs  Optimizes system performance

11 Contributions – SDIMS Meets key requirements  Scalability Scale with both nodes and information to be managed  Flexibility Enable applications to control the aggregation  Autonomy Enable administrators to control flow of information  Robustness Handle failures gracefully Scalable Distributed Information Management System

12 SDIMS in Brief Scalability  Hierarchical aggregation  Multiple aggregation trees Flexibility  Separate mechanism from policy API for applications to choose a policy  A self-tuning aggregation mechanism Autonomy  Preserve organizational structure in all aggregation trees Robustness  Default lazy re-aggregation upon failures  On-demand fast re-aggregation

13 Outline SDIMS: a general information management middleware Aggregation abstraction SDIMS Design  Scalability with machines and attributes  Flexibility to accommodate various applications  Autonomy to respect administrative structure  Robustness to failures Experimental results SDIMS in other projects Conclusions and future research directions

15 Attributes Information at machines  Machine status information  File information  Multicast subscription information  …… Example attribute table at a machine:
Attribute | Value
numUsers | 5
cpuLoad | 0.5
freeMem | 567 MB
totMem | 2 GB
fileFoo | yes
mcastSess1 | yes

16 Aggregation Function Defined for an attribute Given values for a set of nodes  Computes aggregate value Examples  Total users logged in the system Attribute – numUsers Aggregation function – summation
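As a rough illustration of this abstraction (not the actual SDIMS code), an aggregation function can be modeled as a pure function from the children's values to one aggregate value; the Java interface and class names below are hypothetical:

```java
import java.util.List;

// Sketch of the aggregation-function abstraction described above: given the
// values reported by a virtual node's children, produce one aggregate value.
interface AggregationFunction<V> {
    V aggregate(List<V> childValues);
}

// Example for the "numUsers" attribute: total number of users logged in.
class SumUsers implements AggregationFunction<Integer> {
    public Integer aggregate(List<Integer> childValues) {
        int total = 0;
        for (Integer v : childValues) {
            if (v != null) total += v;   // a child with no value contributes nothing
        }
        return total;
    }
}
```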

17 Aggregation Trees Aggregation tree  Physical machines are leaves  Each virtual node represents a logical group of machines (administrative domains, groups within domains) Aggregation function, f, for attribute A  Computes the aggregated value A_i for the level-i subtree: A_0 = locally stored value at the physical node (or NULL); A_i = f(A_{i-1}^0, A_{i-1}^1, …, A_{i-1}^k) for a virtual node with children 0 … k  Each virtual node is simulated by some machines [Figure: four leaves with values a, b, c, d; level-1 values f(a,b) and f(c,d); level-2 (root) value f(f(a,b), f(c,d))]
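To make the recursion A_i = f(A_{i-1}^0, …, A_{i-1}^k) concrete, here is a minimal bottom-up evaluation sketch (the node class and field names are assumptions for illustration, not SDIMS internals):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Minimal sketch: a virtual node is either a leaf holding a locally stored value
// (possibly null) or an internal node that aggregates its children's values with f.
class VirtualNode<V> {
    V localValue;                                   // A_0 at a physical (leaf) node
    List<VirtualNode<V>> children = new ArrayList<>();

    V aggregate(Function<List<V>, V> f) {
        if (children.isEmpty()) return localValue;  // A_0
        List<V> childValues = new ArrayList<>();
        for (VirtualNode<V> c : children)
            childValues.add(c.aggregate(f));        // A_{i-1} for each child subtree
        return f.apply(childValues);                // A_i = f(A_{i-1}^0, ..., A_{i-1}^k)
    }
}
```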

18 Example Queries Job scheduling system  Find the least loaded machine  Find a (nearby) machine with load < 0.5 File location system  Locate a (nearby) machine with file “foo”

19 Example – Machine Loads Attribute: “minLoad”  Value at a machine M with load L is (M, L)  Aggregation function: MIN_LOAD (set of tuples) [Figure: leaf values (A, 0.3), (B, 0.6), (C, 0.1), (D, 0.7); level-1 aggregates (A, 0.3) and (C, 0.1); root aggregate (C, 0.1)]

20 Example – Machine Loads [Same setup and figure] Query: Tell me the least loaded machine.

21 Example – Machine Loads [Same setup and figure] Query: Tell me a (nearby) machine with load < 0.5.
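A small sketch of a MIN_LOAD-style function under the abstraction above (illustrative only; the tuple type is an assumption, not the SDIMS API):

```java
import java.util.List;

// Illustrative MIN_LOAD aggregation: each value is a (machine, load) tuple and
// the aggregate for a subtree is the tuple with the smallest load.
record MachineLoad(String machine, double load) {}

class MinLoad {
    static MachineLoad aggregate(List<MachineLoad> childValues) {
        MachineLoad best = null;
        for (MachineLoad v : childValues) {
            if (v == null) continue;                          // child had no value
            if (best == null || v.load() < best.load()) best = v;
        }
        return best;   // e.g. leaves (A,0.3),(B,0.6),(C,0.1),(D,0.7) -> (C,0.1)
    }
}
```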

22 Example – File Location Attribute: “fileFoo”  Value at a machine with id machineId: machineId if file “Foo” exists on the machine, null otherwise  Aggregation function: SELECT_ONE (set of machine ids) [Figure: leaf values null, B, C (file “Foo” exists on machines B and C); level-1 aggregates B and C; root aggregate B]

23 Example – File Location [Same setup and figure] Query: Tell me a (nearby) machine with file “Foo”.
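And a corresponding sketch of a SELECT_ONE-style function (again illustrative): it keeps any one non-null machine id from its children, so a probe anywhere in the tree can name some machine holding the file.

```java
import java.util.List;

// Illustrative SELECT_ONE aggregation: propagate any one non-null machine id upward.
class SelectOne {
    static String aggregate(List<String> childMachineIds) {
        for (String id : childMachineIds) {
            if (id != null) return id;   // any machine known to hold the file
        }
        return null;                     // no machine in this subtree has the file
    }
}
```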

24 Outline SDIMS: a general information management middleware Aggregation abstraction SDIMS Design  Scalability with machines and attributes  Flexibility to accommodate various applications  Autonomy to respect administrative structure  Robustness to failures Experimental results SDIMS in other projects Conclusions and future research directions

25 Scalability To be a basic building block, SDIMS should support  Large number of machines (> 10^4) Enterprise and global-scale services  Applications with a large number of attributes (> 10^6) File location system  Each file is an attribute  Large number of attributes

26 Scalability Challenge Single tree for aggregation  Astrolabe, SOMO, Ganglia, etc.  Limited scalability with attributes  Example: file location [Figure: one aggregation tree; leaves hold {f1,f2}, {f2,f3}, {f4,f5}, {f6,f7}; internal nodes hold {f1,f2,f3} and {f4,f5,f6,f7}; the root must hold {f1,f2,…,f7}]

27 Scalability Challenge [Same figure] Our approach:  Automatically build multiple trees for aggregation  Aggregate different attributes along different trees

28 Building Aggregation Trees Leverage Distributed Hash Tables  A DHT can be viewed as multiple aggregation trees Distributed Hash Tables (DHT)  Supports hash table interfaces put (key, value): inserts value for key get (key): returns values associated with key  Buckets for keys distributed among machines  Several algorithms with different properties PRR, Pastry, Tapestry, CAN, CHORD, SkipNet, etc. Load-balancing, robustness, etc.

29 DHT – Overview  Machine IDs and keys: long bit vectors  Owner of a key = machine with ID closest to the key  Bit correction for routing  Each machine keeps O(log n) neighbors [Figure: routing of get(11111) toward the owner of key 11111]
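The following toy sketch (my illustration, not any particular DHT's API) shows the idea of bit-correction routing: each hop picks a neighbor whose ID matches one more leading bit of the key, so a lookup takes O(log n) hops.

```java
// Toy illustration of bit-correction (prefix) routing over b-bit IDs.
// Real DHTs (Pastry, Tapestry, Chord, ...) differ in details; this only shows the idea.
class BitRouting {
    static final int BITS = 5;

    // Number of leading bits on which two IDs agree.
    static int matchedPrefix(int a, int b) {
        for (int i = 0; i < BITS; i++) {
            int mask = 1 << (BITS - 1 - i);
            if ((a & mask) != (b & mask)) return i;
        }
        return BITS;
    }

    // One routing hop: among the known neighbors, choose one that extends the
    // matched prefix with the key; routing terminates at the key's owner.
    static int nextHop(int self, int key, int[] neighbors) {
        int best = self, bestMatch = matchedPrefix(self, key);
        for (int n : neighbors) {
            int m = matchedPrefix(n, key);
            if (m > bestMatch) { best = n; bestMatch = m; }
        }
        return best;   // == self when no neighbor is closer (self owns the key region)
    }
}
```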

30 DHT Trees as Aggregation Trees [Figure: the DHT routes toward key 11111 form a tree with virtual nodes 1xx, 11x, 111 at successive levels]

31 DHT Trees as Aggregation Trees [Same tree for key 11111] Mapping from virtual nodes to real machines

32 DHT Trees as Aggregation Trees [Figure: two different trees, one for key 11111 (virtual nodes 1xx, 11x, 111) and one for key 00010 (virtual nodes 0xx, 00x, 000)]

33 DHT Trees as Aggregation Trees [Same two trees] Aggregate different attributes along different trees: hash(“minLoad”) = 00010 → aggregate minLoad along the tree for key 00010
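A minimal sketch of this attribute-to-tree mapping (the hash choice and key width are assumptions for illustration): each attribute hashes to a DHT key, and the DHT routes toward that key form the aggregation tree used for that attribute.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative mapping from an attribute to the DHT key (and hence the tree)
// along which it is aggregated. Key width and hash choice are arbitrary here.
class AttributeToTree {
    static long treeKeyFor(String attribute) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1")
                    .digest(attribute.getBytes(StandardCharsets.UTF_8));
            long key = 0;
            for (int i = 0; i < 8; i++) key = (key << 8) | (digest[i] & 0xff);
            return key;   // the DHT routes toward this key form the aggregation tree
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("SHA-1 is always available", e);
        }
    }

    public static void main(String[] args) {
        // Different attributes map to different keys, hence different trees
        // rooted at different machines, which spreads load across the system.
        System.out.println(Long.toHexString(treeKeyFor("minLoad")));
        System.out.println(Long.toHexString(treeKeyFor("fileFoo")));
    }
}
```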

34 Scalability Challenge:  Scale with both machines and attributes Our approach  Build multiple aggregation trees Leverage well-studied DHT algorithms  Load-balancing  Self-organizing  Locality  Aggregate different attributes along different trees Aggregate attribute A along the tree for key = hash(A)

35 Outline SDIMS: a general information management middleware Aggregation abstraction SDIMS Design  Scalability with machines and attributes  Flexibility to accommodate various applications  Autonomy to respect administrative structure  Robustness to failures Experimental results SDIMS in other projects Conclusions and future research directions

36 Flexibility Challenge When to aggregate?  On reads? or on writes?  Attributes with different read-write ratios [Figure: spectrum of read-write ratios from #writes >> #reads (CPU load, total memory) to #reads >> #writes (file location); the best policy ranges from aggregate on reads, through partial aggregation on writes, to aggregate on writes; existing systems (Astrolabe, Ganglia, Sophia, MDS-2, DHT-based systems) each sit at a fixed point on this spectrum]

37 Flexibility Challenge [Same figure] Single framework – separate mechanism from policy  Allow applications to choose any policy  Provide a self-tuning mechanism

38 API Exposed to Applications Install: an aggregation function for an attribute  Function is propagated to all nodes  Arguments up and down specify an aggregation policy Update: the value of a particular attribute  Aggregation performed according to the chosen policy Probe: for an aggregated value at some level  If required, aggregation is done  Two modes: one-shot and continuous
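A hedged sketch of how an application might use this three-call API (the Java signatures below paraphrase the interface described on this slide and on the backup API slide; they are not the exact SDIMS classes):

```java
import java.util.List;
import java.util.function.Function;

// Paraphrased SDIMS-style interface: install an aggregation function for an
// attribute type with an up/down propagation policy, push updates, and probe.
interface Sdims {
    void install(String attrType, Function<List<Object>, Object> aggrFn, int up, int down);
    void update(String attrType, String attrName, Object value);
    Object probe(String attrType, String attrName, int level, boolean continuous);
}

class FileLocationApp {
    static void run(Sdims sdims, String myMachineId) {
        // Install once: a SELECT_ONE-style function, aggregated on writes
        // (Update-Up: up=all is represented here as Integer.MAX_VALUE, down=0).
        sdims.install("fileLocation",
                vals -> vals.stream().filter(v -> v != null).findFirst().orElse(null),
                Integer.MAX_VALUE, 0);

        // Each machine reports the files it stores.
        sdims.update("fileLocation", "fileFoo", myMachineId);

        // One-shot probe; a large level stands in for "root of the tree"
        // (levels count up from the leaves, leaf = 0).
        Object holder = sdims.probe("fileLocation", "fileFoo", Integer.MAX_VALUE, false);
        System.out.println("fileFoo is stored at: " + holder);
    }
}
```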

39 Flexibility Policy settings:  Update-Local: up=0, down=0  Update-Up: up=all, down=0  Update-All: up=all, down=all [Figure: how an update propagates through the tree under each policy]
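The up and down arguments bound how far a write is propagated; below is a minimal sketch of that mechanism (tree representation and method names are assumptions, not the SDIMS implementation). With up=0, down=0 it behaves like Update-Local; up=all, down=0 like Update-Up; up=all, down=all like Update-All.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of up/down propagation (not the SDIMS implementation): on a write at a
// leaf, recompute aggregates up to `up` levels toward the root, and push each
// fresh aggregate down `down` levels so later probes can be answered locally.
class PolicySketch {
    static class TreeNode {
        TreeNode parent;
        List<TreeNode> children = new ArrayList<>();
        double value;            // local value at a leaf, aggregate at internal nodes
        double cachedAggregate;  // aggregate cached from an ancestor (down propagation)

        void recomputeAggregate() {            // example aggregation: MIN over children
            double min = Double.MAX_VALUE;
            for (TreeNode c : children) min = Math.min(min, c.value);
            value = min;
        }
    }

    static void onUpdate(TreeNode leaf, int up, int down) {
        TreeNode node = leaf;
        for (int i = 0; i < up && node.parent != null; i++) {
            node = node.parent;
            node.recomputeAggregate();          // aggregate on write, up to `up` levels
            pushDown(node, node.value, down);   // replicate the fresh aggregate downward
        }
    }

    static void pushDown(TreeNode node, double aggregate, int levels) {
        if (levels <= 0) return;
        for (TreeNode child : node.children) {
            child.cachedAggregate = aggregate;
            pushDown(child, aggregate, levels - 1);
        }
    }
}
```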

43 Self-tuning Aggregation Some apps can forecast their read-write rates What about others?  Cannot or do not want to specify them  Spatial heterogeneity  Temporal heterogeneity Shruti: dynamically tunes aggregation  Keeps track of read and write patterns

44 Shruti – Dynamic Adaptation Update-Up policy: up=all, down=0 [figure with nodes R and A on the aggregation tree]

45 Shruti – Dynamic Adaptation Update-Up policy: up=all, down=0 [same figure] Lease-based mechanism: any updates are forwarded until the lease is relinquished

46 Shruti – In Brief On each node  Tracks updates and probes Both local and from neighbors  Sets and removes leases Grants a lease to a neighbor A  When it receives k probes from A while no updates happen Relinquishes a lease from a neighbor A  When it receives m updates from A while no probes happen
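A minimal sketch of this per-neighbor lease bookkeeping (the thresholds k and m and the counter-reset behavior are assumptions drawn from the slide, not from the Shruti implementation):

```java
import java.util.HashMap;
import java.util.Map;

// Per-neighbor lease bookkeeping as described on the slide (a sketch only):
// - Grant a lease TO neighbor A after k probes arrive from A with no intervening
//   updates; while granted, local updates are forwarded to A.
// - Relinquish a lease held FROM neighbor A after m forwarded updates arrive from
//   A with no local probes that needed them.
class ShrutiSketch {
    static final int K_PROBES = 4, M_UPDATES = 4;   // assumed threshold values

    static class NeighborState {
        int probesWithoutUpdate;   // probes received from A since our last update
        int updatesWithoutProbe;   // updates received from A since our last probe
        boolean leaseGrantedToA;   // we push our updates to A
        boolean leaseHeldFromA;    // A pushes its updates to us
    }

    final Map<String, NeighborState> state = new HashMap<>();

    private NeighborState of(String a) {
        return state.computeIfAbsent(a, x -> new NeighborState());
    }

    void onProbeFrom(String a) {               // A asked us for the aggregate
        NeighborState s = of(a);
        if (++s.probesWithoutUpdate >= K_PROBES) s.leaseGrantedToA = true;
    }
    void onLocalUpdate(String a) {             // our own value changed
        of(a).probesWithoutUpdate = 0;
    }
    void onUpdateFrom(String a) {              // A forwarded an update under a lease
        NeighborState s = of(a);
        if (++s.updatesWithoutProbe >= M_UPDATES) s.leaseHeldFromA = false;
    }
    void onLocalProbe(String a) {              // we actually needed A's value
        of(a).updatesWithoutProbe = 0;
    }
}
```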

47 Flexibility Challenge  Support applications with different read-write behavior Our approach  Separate mechanism from policy  Let applications specify an aggregation policy Up and Down knobs in Install interface  Provide a lease based self-tuning aggregation strategy

48 Outline SDIMS: a general information management middleware Aggregation abstraction SDIMS Design  Scalability with machines and attributes  Flexibility to accommodate various applications  Autonomy to respect administrative structure  Robustness to failures Experimental results SDIMS in other projects Conclusions and future research directions

49 Administrative Autonomy Systems span multiple administrative domains Allow a domain administrator to control information flow  Prevent an external observer from observing the domain's information  Prevent external failures from affecting operations within the domain Challenge  DHT trees might not conform to the administrative hierarchy [figure with machines A, B, C, D]

50 Administrative Autonomy Our approach: Autonomous DHTs Two properties  Path locality  Path convergence These ensure that virtual nodes aggregating a domain's data are hosted on machines in that domain [figure with machines A, B, C, D]

51 Autonomy – Example [Figure: four-level aggregation tree (L0–L3) over machines in cs.utexas.edu, ece.utexas.edu, and phy.utexas.edu, illustrating path locality and path convergence]

52 Autonomy – Challenge DHT trees might not conform  Example: DHT tree for key = 111 Autonomous DHT with two properties  Path Locality  Path Convergence [figure: tree levels L0–L3 within domain1]

53 Robustness In a large scale system, failures are common  Handle failures gracefully  Enable applications to trade off cost of adaptation, response latency, and consistency Techniques  Tree repair: leverage DHT self-organizing properties  Aggregated information repair: default lazy re-aggregation on failures, with on-demand fast re-aggregation
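One way to picture the two repair modes (purely illustrative; the actual SDIMS repair protocol is more involved): mark an aggregate stale when the tree changes, then recompute it lazily in the background or eagerly when a probe demands a fresh answer.

```java
// Illustrative repair modes for an aggregate after a tree change (not SDIMS code).
class RepairSketch {
    boolean stale;          // set when a child fails or the DHT reconfigures
    double aggregate;

    void onTreeChange() { stale = true; }            // lazy: just mark the value stale

    double probe(boolean onDemandFast) {
        if (stale && onDemandFast) recompute();      // fast path: repair before answering
        return aggregate;  // lazy path: background re-aggregation refreshes this later
    }

    void recompute() {
        // ... re-fetch child values and re-apply the aggregation function here ...
        stale = false;
    }
}
```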

54 Outline SDIMS: a general information management middleware Aggregation abstraction SDIMS Design  Scalability with machines and attributes  Flexibility to accommodate various applications  Autonomy to respect administrative structure  Robustness to failures Experimental results SDIMS in other projects Conclusions and future research directions

55 Evaluation SDIMS prototype  Built using the FreePastry DHT framework [Rice Univ.]  Three layers: Aggregation Management, Tree Topology Management, Autonomous DHT Methodology  Simulation Scalability and flexibility  Micro-benchmarks on real networks PlanetLab and the CS department network

56 Simulation Results – Scalability Small multicast sessions of size 8; node stress = amount of incoming and outgoing information [Plot: maximum node stress for Astrolabe (AS) and SDIMS at 256, 4096, and 65536 machines]

57 Simulation Results – Scalability [Same plot] Orders of magnitude difference in maximum node stress → better load balance

58 Simulation Results – Scalability [Same plot, annotated with decreasing max load vs. increasing max load]

59 Simulation Results – Scalability [Same plot with a centralized baseline (Central) added at 256, 4096, and 65536 machines]

60 Simulation Results – Flexibility Simulation with 4096 nodes; attributes with different up and down strategies [Plot: strategies Update-Local, Update-Up, Update-All, (up=5, down=0), and (up=all, down=5)]

61 Simulation Results – Flexibility [Same plot, with existing systems marked: Astrolabe, Ganglia, Sophia, MDS-2, DHT-based systems]

62 Simulation Results – Flexibility [Same plot] When writes dominate reads, Update-Local is best; when reads dominate writes, Update-All is best

63 Dynamic Adaptation Simulation with 512 nodes [Plot: average message count vs. read-to-write ratio for Update-All, Update-None, (up=3, down=0), (up=all, down=3), Update-Up, and Shruti]

64 Prototype Results CS department network: 180 machines; PlanetLab: 70 machines [Plots: latency (ms) of Update-All, Update-Up, and Update-Local on the department network and on PlanetLab]

65 Outline SDIMS: a general information management middleware Aggregation abstraction SDIMS Design  Scalability with machines and attributes  Flexibility to accommodate various applications  Autonomy to respect administrative structure  Robustness to failures Experimental results SDIMS in other projects Conclusions and future research directions

66 SDIMS in Other Projects PRACTI – a replication toolkit (Dahlin et al) Grid Services (TACC)  Resource Scheduling  Data management INSIGHT: Network Monitoring (Jain and Zhang) File location Service (IBM) Scalable Sensing Service (HP Labs)

67 PRACTI – A Replication Toolkit  Partial Replication: ability to replicate partial content  Arbitrary Consistency: allow several consistency policies  Topology Independence: allow communication between any two machines

68 PRACTI – A Replication Toolkit [Figure: prior systems (Coda, Sprite; Bayou, TACT; Ficus, Pangaea) each cover only part of the Partial Replication / Arbitrary Consistency / Topology Independence space; PRACTI covers all three]

69 PRACTI Design Core: mechanism Controller: policy  Notified of key events (read miss, update arrival, invalidation arrival, …)  Directs communication across cores [Figure: the controller informs/manages the core; the core serves read(), write(), delete() and exchanges invalidations and updates with other nodes]
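A hedged sketch of the controller/core split described above (interface and method names are hypothetical, not PRACTI's actual API): the core notifies the controller of key events, and the controller decides which peers the core should talk to.

```java
// Hypothetical interfaces sketching the PRACTI core/controller split described
// on this slide; names and signatures are illustrative, not the real toolkit API.
interface ReplicationCore {
    byte[] read(String objectId);
    void write(String objectId, byte[] data);
    void delete(String objectId);
    void subscribeInvalidationsFrom(String peer);   // controller-directed communication
    void fetchBodyFrom(String peer, String objectId);
}

interface ReplicationController {
    // The core notifies the controller of key events ...
    void onReadMiss(ReplicationCore core, String objectId);
    void onUpdateArrival(ReplicationCore core, String objectId, String fromPeer);
    void onInvalidationArrival(ReplicationCore core, String objectId, String fromPeer);
    // ... and the controller responds by directing communication across cores,
    // e.g. telling this core which peer to fetch a missing object body from.
}
```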

70 SDIMS Controller in PRACTI Read Miss: For locating a replica  Similar to “File Location System” example  But handles flash crowds Dissemination tree among requesting clients For Writes: Spanning trees among replicas  Multicast tree for spreading invalidations  Different trees for different objects

71 PRACTI – Grid Benchmark Three phases  Read input and programs  Compute (some pairwise reads)  Send results back to the server Performance improvement:  21% reduction in total time [figure: home server and a grid at school]

72 PRACTI Experience Aggregation abstraction and API generality  Construct multicast trees for pushing invalidations  Locate a replica on a local read miss Construct a tree in the case of flash crowds Performance benefits  Grid micro-benchmark: 21% improvement over manual tree construction Ease of implementation  Less than two weeks

73 Conclusions Research Vision  Ease design and development of distributed services SDIMS – an information management middleware  Scalability with both machines and attributes An order of magnitude lower maximum node stress  Flexibility in aggregation strategies Support for a wide range of applications  Autonomy  Robustness to failures

74 Future Directions Core SDIMS research  Composite queries  Resilience to temporary reconfigurations  Probe functions Other components of the wide-area distributed OS  Scheduling  Data management  Monitoring  … [Figure: components of the wide-area distributed OS – Information Management, Monitoring, Data Management, Scheduling, Security]

75 For more information:

76 SkipNet and Autonomy Constrained load balancing in SkipNet supports only single-level administrative domains One solution: maintain separate rings in different domains (ece.utexas.edu, cs.utexas.edu, phy.utexas.edu), but this does not form trees because of revisits

77 Load Balance Let  f = fraction of attributes a node is interested in  N = number of nodes in the system In a DHT, a node will have O(log N) indegree w.h.p.

78 Related Work Other aggregation systems  Astrolabe, SOMO, Dasis, IrisNet Single tree  Cone Aggregation tree changes with new updates  Ganglia, TAG, Sophia, and IBM Tivoli Monitoring System Database abstraction on DHTs  PIER and Gribble et al 2001  Support for “join” operation Can be leveraged for answering composite queries

79 Load Balance How many attributes does a node handle?  O(log N) levels  Few children at each level  Each node is interested in few (d) attributes Per-level attribute count: Level 0: d; Level 1: c·d/2; Level 2: c²·d/4; … Total = d·[1 + c/2 + c²/4 + …] = O(d·log N)
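Spelling the series out (a sketch consistent with the slide's sum, assuming c children per virtual node, d attributes of local interest, and the attribute set relevant to a node halving per level):

```latex
\text{attributes per node}
  \;=\; \sum_{i=0}^{O(\log N)} d\,\Big(\frac{c}{2}\Big)^{i}
  \;=\; d\Big[\,1 + \frac{c}{2} + \frac{c^{2}}{4} + \cdots\Big]
  \;=\; O(d \log N),
\qquad \text{since each term is } O(d) \text{ for } c \le 2
\text{ and there are } O(\log N) \text{ terms.}
```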

80 PRACTI – Approach Bayou-style log exchange, but allowing partial replication Two key ideas  Separate invalidations from updates → partial replication of data  Imprecise invalidations (a summary of a set of invalidations) → partial replication of metadata

81 PRACTI For reads – locate a replica on a read miss For writes – construct spanning trees among replicas  To propagate invalidations  To propagate updates

82 SDIMS is not yet another DHT system Typical DHT applications  Use the hash table put and get interfaces SDIMS instead exposes aggregation as a general abstraction

83 Autonomy [Plots: increase in path length, and number of path-convergence violations, for Pastry vs. the autonomous DHT (ADHT) at bf = 4, 16, 64, where bf = branching factor (nodes per domain)]  No path-convergence violations in the autonomous DHT

84 Autonomy [Same plots]  No path-convergence violations in the autonomous DHT  bf ↑ ⇒ tree height ↓; bf ↑ ⇒ #violations ↑ (for Pastry)

85 Robustness  Planet-Lab with 67 nodes Aggregation function: summation; Strategy: Update-Up Each node updates the attribute with value 10

86 Sparse attributes Attributes of interest to only a few nodes  Example: a file “foo” in the file location application  Key for scalability Challenge:  Aggregation abstraction – one function per attribute  Dilemma: a separate aggregation function for each attribute causes unnecessary storage and communication overheads; a single vector of values with one aggregation function defeats the DHT advantage

87 Sparse attributes [Same dilemma; first option illustrated]
Attribute | Function | Value
fileFoo | AggrFuncFileFoo | macID
fileBar | AggrFuncFileBar | macID
… | … | …

88 Sparse attributes [Same dilemma; second option illustrated]
Attribute | Function | Value
file | AggrFuncFileLoc | (“foo”, “bar”, ……)

89 Novel Aggregation Abstraction Separate attribute type from attribute name  Attribute = (attribute type, attribute name)  Example: type=“fileLocation”, name=“fileFoo” Define the aggregation function per type
Attribute table at a machine (Name: macA, IP addr: ):
Attr. Type | Attr. Name | Value
fileLocation | fileFoo |
fileLocation | fileBar |
MIN | cpuLoad | (macA, 0.3)
multicast | mcastSess1 | yes
Aggregation functions per type:
Attr. Type | Aggr Function
fileLocation | SELECT_ONE
MIN | MIN
multicast | MULTICAST
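A small sketch of how this type/name split might look to an application (illustrative names; it reuses the paraphrased interface from the sketch after the API slide, which is not the actual SDIMS class): one aggregation function is installed per type, and many attribute names share it.

```java
import java.util.List;
import java.util.function.Function;

// Illustrative use of the (attribute type, attribute name) split: one aggregation
// function is installed per type, and many names can share it.
class TypeNameSketch {
    interface Sdims {
        void install(String attrType, Function<List<Object>, Object> aggrFn, int up, int down);
        void update(String attrType, String attrName, Object value);
    }

    static void setUp(Sdims sdims, String myMachineId) {
        // Install once for the whole "fileLocation" type (SELECT_ONE-style).
        sdims.install("fileLocation",
                vals -> vals.stream().filter(v -> v != null).findFirst().orElse(null),
                Integer.MAX_VALUE, 0);

        // Every file becomes a cheap (type, name) attribute sharing that function.
        sdims.update("fileLocation", "fileFoo", myMachineId);
        sdims.update("fileLocation", "fileBar", myMachineId);
    }
}
```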

90 Example – File Location Attribute: “fileFoo”  Value at a machine with id machineId: machineId if file “Foo” exists on the machine, null otherwise  Aggregation function: SELECT_TWO (set of machine ids) [Figure: leaf values null, B, C; level-1 aggregates B and C; root aggregate {B, C}] Query: Tell me two machines with file “Foo”.

91 A Key Component Most large-scale distributed applications  Monitor, query, and react to changes in the system  Examples: information collection and management, system administration and management, service placement and location, sensor monitoring and control, distributed denial-of-service attack detection, file location service, multicast tree construction, naming and request routing, …  A fundamental building block

92 CS Department Micro-benchmark Experiment

93 API Exposed to Applications API between applications and SDIMS:  Install(attrType, function, up, down)  Update(attrType, attrName, value)  Probe(attrType, attrName, level, mode) Attribute table at the SDIMS leaf node (level = 0):
Type | Name | Value
MIN | minLoad | (A, 0.3)
fileLocation | fileFoo |
fileLocation | fileBar |
Aggregation functions:
Type | Function
min | MIN
fileLocation | SELECT-ONE