1 A Scalable Information Management Middleware for Large Distributed Systems. Praveen Yalagandula, HP Labs, Palo Alto; Mike Dahlin, The University of Texas at Austin
2 Trends: Large wide-area networked systems. Enterprise networks: IBM, 170 countries, > employees. Computational Grids: NCSA TeraGrid, 10 partners and growing, nodes per site. Sensor networks: Navy Automated Maintenance Environment, about 300 ships in the US Navy, 200,000 sensors in a destroyer [3eti.com].
9 Research Vision: a wide-area distributed operating system. Goals: ease building applications and utilize resources efficiently. Components: security, monitoring, data management, scheduling, and information management.
10 Information Management: most large-scale distributed applications monitor, query, and react to changes in the system. Examples: job scheduling, system administration and management, service location, sensor monitoring and control, file location service, multicast service, naming and request routing, and more. A general information management middleware eases design and development, avoids repetition of the same task by different applications, provides a framework to explore tradeoffs, and optimizes system performance.
11 Contributions – SDIMS (Scalable Distributed Information Management System) meets key requirements. Scalability: scale with both nodes and the information to be managed. Flexibility: enable applications to control the aggregation. Autonomy: enable administrators to control the flow of information. Robustness: handle failures gracefully.
12 SDIMS in Brief. Scalability: hierarchical aggregation, multiple aggregation trees. Flexibility: separate mechanism from policy, an API for applications to choose a policy, a self-tuning aggregation mechanism. Autonomy: preserve organizational structure in all aggregation trees. Robustness: lazy re-aggregation upon failures by default, on-demand fast re-aggregation.
13 Outline SDIMS: a general information management middleware Aggregation abstraction SDIMS Design Scalability with machines and attributes Flexibility to accommodate various applications Autonomy to respect administrative structure Robustness to failures Experimental results SDIMS in other projects Conclusions and future research directions
15 Attributes: information at machines, e.g., machine status information, file information, multicast subscription information, and so on. Example attribute-value table at one machine: numUsers = 5, cpuLoad = 0.5, freeMem = 567 MB, totMem = 2 GB, fileFoo = yes, mcastSess1 = yes.
16 Aggregation Function: defined for an attribute; given the values at a set of nodes, it computes an aggregate value. Example: the total number of users logged into the system, with attribute numUsers and aggregation function summation.
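To make the abstraction concrete, here is a minimal Java sketch (hypothetical names and types, not the SDIMS implementation) of an aggregation function, with summation over numUsers as in the example above:

    import java.util.List;

    // Hypothetical sketch, not the SDIMS code: an aggregation function reduces
    // the values reported by a set of child nodes to a single aggregate value.
    interface AggregationFunction<V> {
        V aggregate(List<V> childValues);
    }

    // Example for the numUsers attribute: summation of logged-in users.
    class SumUsers implements AggregationFunction<Integer> {
        public Integer aggregate(List<Integer> childValues) {
            int total = 0;
            for (Integer v : childValues) {
                if (v != null) total += v;   // a missing value contributes nothing
            }
            return total;
        }
    }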
17 Aggregation Trees: physical machines are the leaves; each virtual node represents a logical group of machines (administrative domains, groups within domains). The aggregation function f for attribute A computes the aggregated value A_i for a level-i subtree: A_0 is the locally stored value at the physical node (or NULL), and A_i = f(A_{i-1}^0, A_{i-1}^1, …, A_{i-1}^k) for a virtual node with k children. Each virtual node is simulated by some machines. [Figure: leaves a, b, c, d; level-1 aggregates f(a,b) and f(c,d); root aggregate f(f(a,b), f(c,d)).]
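The recursive definition above amounts to a bottom-up fold over the tree; the following illustrative sketch (hypothetical AggTreeNode type, not an SDIMS data structure) would compute f(f(a,b), f(c,d)) for the four-leaf example in the figure:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Function;

    // Hypothetical tree node: a leaf carries the locally stored value A_0;
    // a virtual node aggregates the values computed by its children.
    class AggTreeNode<V> {
        V localValue;                                  // A_0 at a leaf (may be null)
        List<AggTreeNode<V>> children = new ArrayList<>();

        // Bottom-up evaluation: A_i = f(A_{i-1}^0, A_{i-1}^1, ..., A_{i-1}^k).
        V aggregate(Function<List<V>, V> f) {
            if (children.isEmpty()) return localValue;
            List<V> childValues = new ArrayList<>();
            for (AggTreeNode<V> c : children) childValues.add(c.aggregate(f));
            return f.apply(childValues);
        }
    }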
18 Example Queries Job scheduling system Find the least loaded machine Find a (nearby) machine with load < 0.5 File location system Locate a (nearby) machine with file “foo”
19 Example – Machine Loads. Attribute: "minLoad". The value at a machine M with load L is (M, L). Aggregation function: MIN_LOAD(set of tuples). [Figure: leaves (A, 0.3), (B, 0.6), (C, 0.1), (D, 0.7); level-1 aggregates (A, 0.3) and (C, 0.1); root aggregate (C, 0.1).]
20 Example – Machine Loads. Query: Tell me the least loaded machine. [Figure: answered from the root aggregate, (C, 0.1).]
21 Example – Machine Loads. Query: Tell me a (nearby) machine with load < 0.5.
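MIN_LOAD itself is a small reduction; a sketch with hypothetical names (a recent Java version is assumed for the record syntax):

    import java.util.List;

    // Hypothetical sketch of MIN_LOAD, not the SDIMS code: each value is a
    // (machine, load) tuple; the aggregate keeps the tuple with the smallest load.
    record MachineLoad(String machine, double load) {}

    class MinLoad {
        static MachineLoad aggregate(List<MachineLoad> childValues) {
            MachineLoad best = null;
            for (MachineLoad v : childValues) {
                if (v == null) continue;               // a child with no value
                if (best == null || v.load() < best.load()) best = v;
            }
            return best;                               // e.g. (C, 0.1) at the root
        }
    }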
22 Example – File Location. Attribute: "fileFoo". The value at a machine with id machineId is machineId if file "Foo" exists on the machine, and null otherwise. Aggregation function: SELECT_ONE(set of machine ids). [Figure: subtrees containing B or C aggregate to a non-null machine id; the root aggregates to B.]
23 Example – File Location. Query: Tell me a (nearby) machine with file "Foo".
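SELECT_ONE is similarly small; a sketch with hypothetical names:

    import java.util.List;

    // Hypothetical sketch of SELECT_ONE for the fileFoo attribute, not the SDIMS
    // code: pick any non-null machine id reported by the children.
    class SelectOne {
        static String aggregate(List<String> childMachineIds) {
            for (String id : childMachineIds) {
                if (id != null) return id;   // some machine in this subtree has the file
            }
            return null;                     // no machine in this subtree has the file
        }
    }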
24 Outline SDIMS: a general information management middleware Aggregation abstraction SDIMS Design Scalability with machines and attributes Flexibility to accommodate various applications Autonomy to respect administrative structure Robustness to failures Experimental results SDIMS in other projects Conclusions and future research directions
25 Scalability: to be a basic building block, SDIMS should support a large number of machines (> 10^4), as in enterprise and global-scale services, and applications with a large number of attributes (> 10^6), e.g., a file location system in which each file is an attribute.
26 Scalability Challenge: a single tree for aggregation (Astrolabe, SOMO, Ganglia, etc.) has limited scalability with attributes. Example, file location: with leaves holding {f1, f2}, {f2, f3}, {f4, f5}, and {f6, f7}, the level-1 nodes must aggregate {f1, f2, f3} and {f4, f5, f6, f7}, and the root must aggregate all of f1, …, f7.
27 Scalability Challenge: instead, automatically build multiple trees for aggregation and aggregate different attributes along different trees.
28 Building Aggregation Trees: leverage Distributed Hash Tables (DHTs); a DHT can be viewed as multiple aggregation trees. A DHT supports hash-table interfaces: put(key, value) inserts a value for a key, get(key) returns the values associated with a key; the buckets for keys are distributed among machines. Several algorithms with different properties exist (PRR, Pastry, Tapestry, CAN, Chord, SkipNet, etc.), offering load balancing, robustness, and so on.
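For reference, the hash-table interface mentioned above can be written down as follows (a generic sketch, not the API of any particular DHT library):

    import java.util.Set;

    // Hypothetical rendering of the hash-table interface a DHT exports: a key is
    // owned by the machine whose id is closest to it, and that machine stores
    // the associated values.
    interface DistributedHashTable {
        void put(byte[] key, byte[] value);   // insert value under key
        Set<byte[]> get(byte[] key);          // return the values stored under key
    }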
29 DHT Overview: machine IDs and keys are long bit vectors; the owner of a key is the machine with the ID closest to the key; routing proceeds by bit correction; each machine keeps O(log n) neighbors. [Figure: routing a get for key 11111 toward its owner.]
30 DHT Trees as Aggregation Trees. [Figure: the routes from all machines toward key 11111 form a tree; internal virtual nodes correspond to successively longer matched prefixes (xx, 11x, 111, …).]
31 DHT Trees as Aggregation Trees: each virtual node in such a tree is mapped to (simulated by) a real machine. [Figure: tree for key 11111 with virtual nodes mapped onto machines.]
32 DHT Trees as Aggregation Trees: different keys induce different trees. [Figure: distinct trees for a key with prefix 111 and a key with prefix 000.]
33 DHT Trees as Aggregation Trees: aggregate different attributes along different trees; for example, if hash("minLoad") = 00010, minLoad is aggregated along the tree for key 00010.
34 Scalability. Challenge: scale with both machines and attributes. Our approach: build multiple aggregation trees by leveraging well-studied DHT algorithms (load balancing, self-organization, locality) and aggregate different attributes along different trees; attribute A is aggregated along the tree for key = hash(A).
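A sketch of the attribute-to-tree mapping (hypothetical helper; it assumes a SHA-1-sized key space as in Pastry/FreePastry, which may differ from the exact hashing SDIMS uses):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    // Hypothetical helper: the DHT key whose tree aggregates a given attribute.
    class AttributeKeys {
        static byte[] treeKeyFor(String attribute) {
            try {
                MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
                // e.g. treeKeyFor("minLoad") selects the tree rooted at the owner of
                // hash("minLoad"), so different attributes land on different trees.
                return sha1.digest(attribute.getBytes(StandardCharsets.UTF_8));
            } catch (NoSuchAlgorithmException e) {
                throw new AssertionError("SHA-1 is available in every standard JDK", e);
            }
        }
    }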
35 Outline SDIMS: a general information management middleware Aggregation abstraction SDIMS Design Scalability with machines and attributes Flexibility to accommodate various applications Autonomy to respect administrative structure Robustness to failures Experimental results SDIMS in other projects Conclusions and future research directions
36 Flexibility Challenge: when to aggregate, on reads or on writes? Attributes have different read-write ratios. [Figure: a read-write-ratio spectrum. Where writes far outnumber reads (e.g., CPU load, total memory), the best policy is to aggregate on reads; where reads far outnumber writes (e.g., file location), the best policy is to aggregate on writes; in between, partial aggregation on writes. Existing systems such as Astrolabe, Ganglia, Sophia, MDS-2, and DHT-based systems each sit at a fixed point on this spectrum.]
37 Flexibility Challenge: SDIMS instead provides a single framework that separates mechanism from policy, allows applications to choose any policy, and provides a self-tuning mechanism.
38 API Exposed to Applications. Install: an aggregation function for an attribute; the function is propagated to all nodes, and the arguments up and down specify an aggregation policy. Update: the value of a particular attribute; aggregation is performed according to the chosen policy. Probe: for an aggregated value at some level; if required, aggregation is done on demand; two modes, one-shot and continuous.
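One possible rendering of these three calls as a Java interface; the parameter and return types are assumptions for illustration, not the prototype's actual signatures:

    import java.util.List;
    import java.util.function.Consumer;
    import java.util.function.Function;

    // Hypothetical rendering of the SDIMS API.
    interface Sdims {
        enum ProbeMode { ONE_SHOT, CONTINUOUS }

        // Install an aggregation function for an attribute; up/down set the policy.
        void install(String attribute, Function<List<Object>, Object> function, int up, int down);

        // Update the local value; re-aggregation follows the installed policy.
        void update(String attribute, Object value);

        // Probe the aggregate at the given level; in CONTINUOUS mode the callback
        // fires again whenever the aggregate at that level changes.
        void probe(String attribute, int level, ProbeMode mode, Consumer<Object> callback);
    }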
39 Flexibility – policy settings: Update-Local (up=0, down=0), Update-Up (up=all, down=0), Update-All (up=all, down=all).
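The canned policies are just particular settings of the up and down arguments to Install; a self-contained sketch of that encoding (here 'all' is modeled as Integer.MAX_VALUE, which may differ from the real encoding):

    // Hypothetical encoding of the three canned policies as (up, down) pairs.
    class AggregationPolicy {
        static final int ALL = Integer.MAX_VALUE;

        final int up;    // how many levels an update is propagated up eagerly
        final int down;  // how many levels the new aggregate is pushed back down

        AggregationPolicy(int up, int down) { this.up = up; this.down = down; }

        static final AggregationPolicy UPDATE_LOCAL = new AggregationPolicy(0, 0);     // aggregate on reads
        static final AggregationPolicy UPDATE_UP    = new AggregationPolicy(ALL, 0);   // aggregate up on writes
        static final AggregationPolicy UPDATE_ALL   = new AggregationPolicy(ALL, ALL); // aggregate up and push down on writes
    }

With this encoding, Update-Local leaves aggregation to probe time, while Update-All keeps every level's aggregate fresh on every write.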
43 Self-tuning Aggregation: some applications can forecast their read-write rates, but others cannot or do not want to specify them, and rates vary across space and time (spatial and temporal heterogeneity). Shruti dynamically tunes aggregation by keeping track of read and write patterns.
44 Shruti – Dynamic Adaptation. [Figure: adaptation starting from the Update-Up policy (up=all, down=0), with one node repeatedly probing for an aggregate held at another.]
45 Shruti – Dynamic Adaptation: a lease-based mechanism; once a lease is granted, any updates are forwarded to the lease holder until the lease is relinquished.
46 Shruti – In Brief: each node tracks updates and probes, both local and from neighbors, and sets and removes leases accordingly. It grants a lease to a neighbor A when it receives k probes from A while no updates happen, and relinquishes a lease held from a neighbor A when it receives m updates from A while no probes happen.
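A rough sketch of the per-neighbor bookkeeping this implies; the thresholds, message names, and data structures below are assumptions, not the actual Shruti implementation:

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical per-node Shruti bookkeeping.
    // Grant: after K probes arrive from a neighbor with no intervening update,
    // grant it a lease and forward subsequent updates to it eagerly.
    // Relinquish: after M eager updates arrive from a neighbor under a lease we
    // hold, with no intervening probe of ours, give the lease back.
    class ShrutiTracker {
        static final int K = 3;   // assumed probe threshold
        static final int M = 3;   // assumed update threshold

        static class GrantState { int probesSinceUpdate; boolean granted; }
        static class HoldState  { int updatesSinceProbe; boolean held; }

        private final Map<String, GrantState> grantedTo = new HashMap<>();
        private final Map<String, HoldState>  heldFrom  = new HashMap<>();

        // A probe arrived here from `neighbor`: count toward granting it a lease.
        void onProbeFrom(String neighbor) {
            GrantState g = grantedTo.computeIfAbsent(neighbor, n -> new GrantState());
            if (!g.granted && ++g.probesSinceUpdate >= K) g.granted = true;
        }

        // A local update occurred: probes seen so far no longer count toward grants.
        void onLocalUpdate() {
            for (GrantState g : grantedTo.values()) g.probesSinceUpdate = 0;
        }

        // `neighbor` granted us a lease, so it now forwards its updates eagerly.
        void onLeaseGrantedBy(String neighbor) {
            heldFrom.computeIfAbsent(neighbor, n -> new HoldState()).held = true;
        }

        // An eager update arrived from `neighbor`: count toward relinquishing.
        void onUpdateFrom(String neighbor) {
            HoldState h = heldFrom.computeIfAbsent(neighbor, n -> new HoldState());
            if (h.held && ++h.updatesSinceProbe >= M) h.held = false;   // send relinquish
        }

        // We issued a probe toward `neighbor`: the held lease is still useful.
        void onProbeToward(String neighbor) {
            HoldState h = heldFrom.get(neighbor);
            if (h != null) h.updatesSinceProbe = 0;
        }

        // Should our updates be forwarded eagerly to `neighbor`?
        boolean forwardUpdatesTo(String neighbor) {
            GrantState g = grantedTo.get(neighbor);
            return g != null && g.granted;
        }
    }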
47 Flexibility. Challenge: support applications with different read-write behavior. Our approach: separate mechanism from policy, let applications specify an aggregation policy via the up and down knobs of the Install interface, and provide a lease-based self-tuning aggregation strategy.
48 Outline SDIMS: a general information management middleware Aggregation abstraction SDIMS Design Scalability with machines and attributes Flexibility to accommodate various applications Autonomy to respect administrative structure Robustness to failures Experimental results SDIMS in other projects Conclusions and future research directions
49 Administrative Autonomy: systems span multiple administrative domains. Allow a domain administrator to control information flow: prevent an external observer from observing the information and prevent external failures from affecting operations. Challenge: DHT trees might not conform to the administrative hierarchy. [Figure: a DHT tree over machines A, B, C, D whose paths cross domain boundaries.]
50 Administrative Autonomy. Our approach: Autonomous DHTs, with two properties, path locality and path convergence, that together ensure that the virtual nodes aggregating a domain's data are hosted on machines in that domain.
51 Autonomy – Example. [Figure: an aggregation tree over cs.utexas.edu, ece.utexas.edu, and phy.utexas.edu (levels L0–L3), illustrating path locality and path convergence.]
52 Autonomy – Challenge: DHT trees might not conform to the administrative hierarchy (example: the DHT tree for key = 111). The Autonomous DHT provides the two properties, path locality and path convergence. [Figure: tree levels L0–L3 within domain1.]
53 Robustness: in large-scale systems, failures are common; handle them gracefully and enable applications to trade off the cost of adaptation, response latency, and consistency. Techniques: tree repair, leveraging the DHT's self-organizing properties; aggregated-information repair, with lazy re-aggregation on failures by default and on-demand fast re-aggregation.
54 Outline SDIMS: a general information management middleware Aggregation abstraction SDIMS Design Scalability with machines and attributes Flexibility to accommodate various applications Autonomy to respect administrative structure Robustness to failures Experimental results SDIMS in other projects Conclusions and future research directions
55 Evaluation. SDIMS prototype: built using the FreePastry DHT framework [Rice Univ.], structured in three layers (Aggregation Management, Tree Topology Management, Autonomous DHT). Methodology: simulation for scalability and flexibility; micro-benchmarks on real networks (PlanetLab and the CS department).
56 Simulation Results - Scalability. Workload: small multicast sessions of size 8; node stress = amount of incoming and outgoing information. [Plot: maximum node stress vs. number of machines, with curves AS 256/4096/65536 and SDIMS 256/4096/65536.]
57 Simulation Results - Scalability: orders of magnitude difference in maximum node stress, i.e., much better load balance for SDIMS than for the single-tree (AS) scheme.
58 Simulation Results - Scalability: as the number of machines grows, the maximum load decreases for SDIMS and increases for AS.
59 Simulation Results - Scalability. [Plot additionally shows a centralized scheme: Central 256, Central 4096, Central 65536.]
60 Simulation Results - Flexibility. Simulation with 4096 nodes; attributes installed with different up and down strategies: Update-Local, Update-Up, Update-All, (up=5, down=0), and (up=all, down=5). [Plot: message cost of each strategy across read-to-write ratios.]
61 Simulation Results - Flexibility: existing systems (Astrolabe, Ganglia, Sophia, MDS-2, DHT-based systems) each correspond to one fixed strategy in this space.
62 Simulation Results - Flexibility: when writes dominate reads, Update-Local is best; when reads dominate writes, Update-All is best.
63 Dynamic Adaptation. Simulation with 512 nodes. [Plot: average message count vs. read-to-write ratio for Update-All, Update-None, Update-Up, (up=3, down=0), (up=all, down=3), and Shruti.]
64 Prototype Results. CS department: 180 machines; PlanetLab: 70 machines. [Plots: latency (ms) for Update-All, Update-Up, and Update-Local on the department network and on PlanetLab.]
65 Outline SDIMS: a general information management middleware Aggregation abstraction SDIMS Design Scalability with machines and attributes Flexibility to accommodate various applications Autonomy to respect administrative structure Robustness to failures Experimental results SDIMS in other projects Conclusions and future research directions
66 SDIMS in Other Projects: PRACTI, a replication toolkit (Dahlin et al.); Grid services at TACC (resource scheduling, data management); INSIGHT, network monitoring (Jain and Zhang); a file location service (IBM); a scalable sensing service (HP Labs).
67 PRACTI – A Replication Toolkit. Partial Replication: the ability to replicate partial content. Arbitrary Consistency: allow several consistency policies. Topology Independence: allow communication between any two machines.
68 PRACTI – A Replication Toolkit. [Figure: prior systems each cover only part of this space, e.g., Coda and Sprite; Bayou and TACT; Ficus and Pangaea; PRACTI covers partial replication, arbitrary consistency, and topology independence together.]
69 PRACTI Design. Core: mechanism. Controller: policy; it is notified of key events (read miss, update arrival, invalidation arrival, …) and directs communication across cores. [Figure: the controller sits beside the core, which serves read()/write()/delete() and exchanges invalidations and updates with other nodes.]
70 SDIMS Controller in PRACTI. On a read miss: locate a replica, similar to the file location example, but also handle flash crowds by building a dissemination tree among the requesting clients. For writes: build spanning trees among replicas, a multicast tree for spreading invalidations, with different trees for different objects.
71 PRACTI – Grid Benchmark. Three phases: read input and programs, compute (with some pairwise reads), and send results back to the server. Performance improvement: 21% reduction in total time. [Figure: home server and Grid at school.]
72 PRACTI Experience. Generality of the aggregation abstraction and API: construct multicast trees for pushing invalidations, locate a replica on a local read miss, construct a tree in the case of flash crowds. Performance benefits: 21% improvement over manual tree construction on the Grid micro-benchmark. Ease of implementation: less than two weeks.
73 Conclusions. Research vision: ease the design and development of distributed services. SDIMS, an information management middleware: scalability with both machines and attributes (an order of magnitude lower maximum node stress); flexibility in aggregation strategies (support for a wide range of applications); autonomy; robustness to failures.
74 Future Directions. Core SDIMS research: composite queries, resilience to temporary reconfigurations, probe functions. Other components of a wide-area distributed OS: scheduling, data management, monitoring, security, and more.
75 For more information:
76 SkipNet and Autonomy. Constrained load balancing in SkipNet handles only single-level administrative domains. One solution: maintain separate rings in different domains (ece.utexas.edu, cs.utexas.edu, phy.utexas.edu); however, this does not form trees because of revisits.
77 Load Balance. Let f = the fraction of attributes a node is interested in and N = the number of nodes in the system. In the DHT, a node will have O(log N) in-degree with high probability.
78 Related Work. Other aggregation systems: Astrolabe, SOMO, DASIS, IrisNet (single tree); Cone (the aggregation tree changes with new updates); Ganglia, TAG, Sophia, and IBM Tivoli Monitoring System. Database abstractions on DHTs: PIER and Gribble et al. 2001; their support for the "join" operation can be leveraged for answering composite queries.
79 Load Balance. How many attributes does a node handle? There are O(log N) levels, a few (c) children at each level, and each node is interested in only a few attributes, say d. Level 0: d. Level 1: c · (d/2) (= d for c = 2). Level 2: c^2 · (d/4) (= d for c = 2). … Total = d · [1 + c/2 + c^2/4 + …] = O(d · log N).
80 PRACTI – Approach: Bayou-style log exchange, but allowing partial replication. Two key ideas: separate invalidations from updates, which enables partial replication of data; and imprecise invalidations, summaries of sets of invalidations, which enable partial replication of metadata.
81 PRACTI. For reads: locate a replica on a read miss. For writes: construct a spanning tree among replicas to propagate invalidations and to propagate updates. [Figure: controller and core.]
82 SDIMS is not yet another DHT system: typical DHT applications use the put and get hash-table interfaces, whereas SDIMS exposes aggregation as a general abstraction.
83 Autonomy: the Autonomous DHT (ADHT) has no path-convergence violations, at the cost of an increase in path length. [Plots: path length and path-convergence violations for Pastry and ADHT with bf = 4, 16, 64, where bf is the branching factor, i.e., nodes per domain.]
84 Autonomy. [Plots: tree height vs. bf and number of violations vs. bf for Pastry and ADHT.]
85 Robustness: PlanetLab with 67 nodes; aggregation function: summation; strategy: Update-Up; each node updates the attribute with value 10.
86 Sparse attributes: attributes of interest to only a few nodes, e.g., a file "foo" in the file location application; handling them well is key for scalability. Challenge: the aggregation abstraction defines one function per attribute. Dilemma: a separate aggregation function with each attribute incurs unnecessary storage and communication overheads, while a vector of values under one aggregation function defeats the DHT advantage.
87 Sparse attributes. [Table, one-function-per-attribute option: attribute fileFoo with function AggrFuncFileFoo and value macID; attribute fileBar with function AggrFuncFileBar and value macID; and so on.]
88 Sparse attributes. [Table, one-function-for-all option: attribute file with function AggrFuncFileLoc and value ("foo", "bar", …).]
89 Novel Aggregation Abstraction: separate the attribute type from the attribute name; an attribute = (attribute type, attribute name), e.g., type = "fileLocation", name = "fileFoo". The aggregation function is defined per type. [Tables at machine macA: attributes (fileLocation, fileFoo), (fileLocation, fileBar), (MIN, cpuLoad) = (macA, 0.3), (multicast, mcastSess1) = yes; type-to-function table: fileLocation → SELECT_ONE, MIN → MIN, multicast → MULTICAST.]
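A sketch of how the type/name split might be stored at a node (hypothetical structures, not the SDIMS code): the aggregation function is installed once per type, and every (type, name) attribute of that type reuses it:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.function.Function;

    // Hypothetical rendering of the (type, name) attribute split.
    class AttributeStore {
        // One aggregation function per TYPE ...
        private final Map<String, Function<List<Object>, Object>> functionsByType = new HashMap<>();
        // ... shared by many (type, name) attributes, e.g. ("fileLocation", "fileFoo").
        private final Map<String, Map<String, Object>> valuesByTypeAndName = new HashMap<>();

        void installType(String type, Function<List<Object>, Object> f) {
            functionsByType.put(type, f);
        }

        void update(String type, String name, Object value) {
            valuesByTypeAndName.computeIfAbsent(type, t -> new HashMap<>()).put(name, value);
        }

        Function<List<Object>, Object> functionFor(String type) {
            return functionsByType.get(type);   // e.g. SELECT_ONE for "fileLocation"
        }
    }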
90 Example – File Location. Attribute: "fileFoo". The value at a machine with id machineId is machineId if file "Foo" exists on the machine, and null otherwise. Aggregation function: SELECT_TWO(set of machine ids). Query: Tell me two machines with file "Foo". [Figure: the root aggregates to {B, C}.]
91 A Key Component. Most large-scale distributed applications monitor, query, and react to changes in the system. Examples: system administration and management, service placement and location, sensor monitoring and control, distributed denial-of-service attack detection, file location service, multicast tree construction, naming and request routing, and more. Information collection and management is thus a fundamental building block.
92 CS Department Micro-benchmark Experiment
93 API Exposed to Applications. Calls: Install(attrType, function, up, down); Update(attrType, attrName, value); Probe(attrType, attrName, level, mode). [Figure, SDIMS at a leaf node (level = 0): attribute table with (MIN, minLoad) = (A, 0.3), (fileLocation, fileFoo), (fileLocation, fileBar); function table with min → MIN, fileLocation → SELECT-ONE.]