
1 1 A Scalable Information Management Middleware for Large Distributed Systems Praveen Yalagandula HP Labs, Palo Alto Mike Dahlin, The University of Texas at Austin

2 2 Trends Large wide-area networked systems  Enterprise networks IBM  170 countries  > 330,000 employees  Computational Grids NCSA TeraGrid  10 partners and growing  100-1000 nodes per site  Sensor networks Navy Automated Maintenance Environment  About 300 ships in the US Navy  200,000 sensors in a destroyer [3eti.com]

9 9 Research Vision A wide-area distributed operating system Goals:  Ease building applications  Utilize resources efficiently Components: Information Management, Monitoring, Data Management, Scheduling, Security, ……

10 10 Information Management Most large-scale distributed applications  Monitor, query, and react to changes in the system  Examples: job scheduling, system administration and management, service location, sensor monitoring and control, file location service, multicast service, naming and request routing, …… A general information management middleware  Eases design and development  Avoids repeating the same task in different applications  Provides a framework to explore tradeoffs  Optimizes system performance

11 11 Contributions – SDIMS (Scalable Distributed Information Management System) Meets key requirements  Scalability Scale with both nodes and information to be managed  Flexibility Enable applications to control the aggregation  Autonomy Enable administrators to control flow of information  Robustness Handle failures gracefully

12 12 SDIMS in Brief Scalability  Hierarchical aggregation  Multiple aggregation trees Flexibility  Separate mechanism from policy API for applications to choose a policy  A self-tuning aggregation mechanism Autonomy  Preserve organizational structure in all aggregation trees Robustness  Default lazy re-aggregation upon failures  On-demand fast re-aggregation

13 13 Outline SDIMS: a general information management middleware Aggregation abstraction SDIMS Design  Scalability with machines and attributes  Flexibility to accommodate various applications  Autonomy to respect administrative structure  Robustness to failures Experimental results SDIMS in other projects Conclusions and future research directions

15 15 Attributes Information at machines  Machine status information  File information  Multicast subscription information  …… Example attribute-value pairs at one machine:
Attribute     Value
numUsers      5
cpuLoad       0.5
freeMem       567MB
totMem        2GB
fileFoo       yes
mcastSess1    yes

16 16 Aggregation Function Defined for an attribute Given values from a set of nodes  Computes an aggregate value Example  Total number of users logged into the system Attribute – numUsers Aggregation function – summation
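To make this concrete, here is a minimal sketch (my illustration, not code from the talk) of an aggregation function as a simple reduction over the values reported by a virtual node's children.

```python
# Hypothetical aggregation function for the "numUsers" attribute:
# the aggregate value of a subtree is the sum of its children's values.
def sum_users(child_values):
    """Summation aggregation; missing (None) children are ignored."""
    return sum(v for v in child_values if v is not None)

# Three machines report 5, 2, and 0 logged-in users.
assert sum_users([5, 2, 0]) == 7
```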

17 17 Aggregation Trees Aggregation tree  Physical machines are leaves  Each virtual node represents a logical group of machines Administrative domains Groups within domains Aggregation function f for attribute A  Computes the aggregated value A_i for a level-i subtree A_0 = locally stored value at the physical node, or NULL A_i = f(A_{i-1,1}, A_{i-1,2}, …, A_{i-1,k}) for a virtual node with k children Each virtual node is simulated by some machines [Figure: machines a, b, c, d at level 0; level-1 aggregates f(a,b) and f(c,d); level-2 (root) aggregate f(f(a,b), f(c,d))]
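The recursive definition above can be sketched directly; the nested-tuple tree encoding below is my own shorthand, not an SDIMS data structure.

```python
# A_0 is the locally stored value at a physical machine (a leaf);
# A_i = f(child aggregates) at a virtual node. Leaves are plain values,
# virtual nodes are tuples of subtrees.
def aggregate(node, f):
    if not isinstance(node, tuple):                      # physical machine: A_0
        return node
    return f([aggregate(child, f) for child in node])    # virtual node: A_i

# The slide's tree over machines a, b, c, d with f = min:
tree = ((0.3, 0.6), (0.1, 0.7))
assert aggregate(tree, min) == 0.1                       # f(f(a,b), f(c,d))
```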

18 18 Example Queries Job scheduling system  Find the least loaded machine  Find a (nearby) machine with load < 0.5 File location system  Locate a (nearby) machine with file “foo”

19 19 Example – Machine Loads Attribute: "minLoad"  Value at a machine M with load L is (M, L) Aggregation function  MIN_LOAD (set of tuples) [Figure: aggregation tree with leaf values (A, 0.3), (B, 0.6), (C, 0.1), (D, 0.7); level-1 aggregates (A, 0.3) and (C, 0.1); root aggregate (C, 0.1)]

20 20 Example – Machine Loads (same attribute, aggregation function, and tree as above) Query: Tell me the least loaded machine.

21 21 Example – Machine Loads (same attribute, aggregation function, and tree as above) Query: Tell me a (nearby) machine with load < 0.5.
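A minimal sketch of the MIN_LOAD aggregation function described above (the function and variable names are mine):

```python
# MIN_LOAD: each value is a (machineId, load) tuple; the aggregate is the
# tuple with the smallest load, so the root names the least loaded machine.
def min_load(tuples):
    tuples = [t for t in tuples if t is not None]
    return min(tuples, key=lambda t: t[1]) if tuples else None

# Leaf values from the figure: the root aggregate identifies machine C.
assert min_load([("A", 0.3), ("B", 0.6), ("C", 0.1), ("D", 0.7)]) == ("C", 0.1)
```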

22 22 Example – File Location Attribute: "fileFoo"  Value at a machine with id machineId machineId if file "Foo" exists on the machine null otherwise Aggregation function  SELECT_ONE (set of machine ids) [Figure: aggregation tree where machines B and C hold file "Foo"; level-1 aggregates B and C; root aggregate B]

23 23 Example – File Location (same attribute, aggregation function, and tree as above) Query: Tell me a (nearby) machine with file "Foo".
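And a sketch of SELECT_ONE, again with hypothetical names: any one non-null machine id is a valid aggregate, which is what lets a probe stop at the lowest level that knows a nearby copy.

```python
# SELECT_ONE: children contribute a machine id if they (or their subtree)
# hold the file, or None otherwise; the aggregate is any one non-null id.
def select_one(machine_ids):
    return next((m for m in machine_ids if m is not None), None)

# From the figure: machines B and C hold "Foo"; the others do not.
assert select_one([None, "B", "C", None]) == "B"
assert select_one([None, None]) is None
```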

24 24 Outline SDIMS: a general information management middleware Aggregation abstraction SDIMS Design  Scalability with machines and attributes  Flexibility to accommodate various applications  Autonomy to respect administrative structure  Robustness to failures Experimental results SDIMS in other projects Conclusions and future research directions

25 25 Scalability To be a basic building block, SDIMS should support  Large number of machines (> 10^4) Enterprise and global-scale services  Applications with a large number of attributes (> 10^6) File location system  Each file is an attribute  Large number of attributes

26 26 Scalability Challenge Single tree for aggregation  Astrolabe, SOMO, Ganglia, etc.  Limited scalability with attributes  Example: file location [Figure: single aggregation tree whose leaves hold {f1,f2}, {f2,f3}, {f4,f5}, {f6,f7}; internal nodes hold {f1,f2,f3} and {f4,f5,f6,f7}; the root must hold {f1,f2,…,f7}]

27 27 Scalability Challenge Our approach  Automatically build multiple trees for aggregation  Aggregate different attributes along different trees

28 28 Building Aggregation Trees Leverage Distributed Hash Tables  A DHT can be viewed as multiple aggregation trees Distributed Hash Tables (DHT)  Supports hash table interfaces put (key, value): inserts value for key get (key): returns values associated with key  Buckets for keys distributed among machines  Several algorithms with different properties PRR, Pastry, Tapestry, CAN, CHORD, SkipNet, etc. Load-balancing, robustness, etc.

29 29 DHT - Overview  Machine IDs and keys: long bit vectors  Owner of a key = machine with ID closest to the key  Bit correction for routing  Each machine keeps O(log n) neighbors [Figure: get(11111) routed by bit correction across machines with IDs 10010, 10111, 11000, 11101, 01100, 01001, 00110, 00001 toward the owner of key 11111]
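A toy illustration of key ownership (not a real DHT implementation): the owner is the machine whose ID best matches the key, which is what repeated bit correction converges to.

```python
def owner(key, machine_ids):
    """Return the machine whose ID best matches the key: longest common
    prefix first, numeric closeness as a tie-breaker."""
    def affinity(mid):
        prefix = 0
        for a, b in zip(mid, key):
            if a != b:
                break
            prefix += 1
        return (prefix, -abs(int(mid, 2) - int(key, 2)))
    return max(machine_ids, key=affinity)

machines = ["10010", "10111", "11000", "11101", "01100", "01001", "00110", "00001"]
print(owner("11111", machines))   # 11101: the ID matching the prefixes 1, 11, 111
```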

30 30 DHT Trees as Aggregation Trees [Figure: the DHT routing paths from all machines (000, 001, …, 111) toward key 11111 form a tree with virtual nodes 1xx, 11x, and 111]

31 31 DHT Trees as Aggregation Trees Mapping from virtual nodes to real machines [Figure: same tree for key 11111, with each virtual node (1xx, 11x, 111) hosted on one of the real machines]

32 32 DHT Trees as Aggregation Trees [Figure: two different trees – the tree for key 11111 (virtual nodes 1xx, 11x, 111) and the tree for key 00010 (virtual nodes 0xx, 00x, 000)]

33 33 DHT Trees as Aggregation Trees Aggregate different attributes along different trees hash("minLoad") = 00010  aggregate minLoad along the tree for key 00010 [Figure: trees for keys 11111 and 00010 as above]

34 34 Scalability Challenge:  Scale with both machines and attributes Our approach  Build multiple aggregation trees Leverage well-studied DHT algorithms  Load-balancing  Self-organizing  Locality  Aggregate different attributes along different trees Aggregate attribute A along the tree for key = hash(A)
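The attribute-to-tree mapping can be sketched in a few lines; the hash function and key length below are arbitrary choices for illustration.

```python
import hashlib

def tree_key(attribute, bits=16):
    """Map an attribute name to a DHT key; the attribute is then aggregated
    along the aggregation tree rooted at that key's owner."""
    digest = int(hashlib.sha1(attribute.encode()).hexdigest(), 16)
    return format(digest, "0160b")[:bits]

print(tree_key("minLoad"))   # aggregate minLoad along the tree for this key
print(tree_key("fileFoo"))   # different attributes usually map to different trees
```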

35 35 Outline SDIMS: a general information management middleware Aggregation abstraction SDIMS Design  Scalability with machines and attributes  Flexibility to accommodate various applications  Autonomy to respect administrative structure  Robustness to failures Experimental results SDIMS in other projects Conclusions and future research directions

36 36 Flexibility Challenge When to aggregate?  On reads? Or on writes?  Attributes have different read-write ratios [Figure: a spectrum of read-write ratios and the best policy at each point – aggregate on reads when #writes >> #reads (e.g. CPU load, total memory), aggregate on writes when #reads >> #writes, and partial aggregation on writes in between (e.g. file location); existing systems (Astrolabe, Ganglia, Sophia, MDS-2, DHT-based systems) each implement a single fixed point on this spectrum]

37 37 Flexibility Challenge Single framework – separate mechanism from policy  Allow applications to choose any policy  Provide a self-tuning mechanism [Figure: same read-write spectrum as above]

38 38 API Exposed to Applications Install: an aggregation function for an attribute  Function is propagated to all nodes  Arguments up and down specify an aggregation policy Update: the value of a particular attribute  Aggregation performed according to the chosen policy Probe: for an aggregated value at some level  If required, aggregation is done  Two modes: one-shot and continuous
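A skeleton of this interface, with the signatures taken from the talk (the class itself is a hypothetical stand-in, not the SDIMS implementation):

```python
class SdimsNode:
    """Application-facing interface of one SDIMS node (sketch only)."""

    def install(self, attr_type, aggr_func, up, down):
        """Register an aggregation function for an attribute type; `up` and
        `down` choose the aggregation policy (how far updates are aggregated
        up the tree and how far new aggregates are pushed back down)."""
        raise NotImplementedError

    def update(self, attr_type, attr_name, value):
        """Write the local value of an attribute; re-aggregation happens
        according to the installed policy."""
        raise NotImplementedError

    def probe(self, attr_type, attr_name, level, mode="one-shot"):
        """Read the aggregate at `level` of this attribute's tree, performing
        any needed aggregation on demand; `mode` is 'one-shot' or 'continuous'."""
        raise NotImplementedError
```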

39 39 Flexibility Policy settings: Update-Local (up=0, down=0); Update-Up (up=all, down=0); Update-All (up=all, down=all)

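The toy below (my own sketch, not the SDIMS implementation) makes the up knob from the policy settings above concrete: a bounded update re-aggregates only a limited number of levels, and a later probe re-aggregates the remaining levels on demand.

```python
class Node:
    """Minimal aggregation-tree node; the aggregation function is min."""
    def __init__(self, children=None, value=None, f=min):
        self.children, self.value, self.f, self.parent = (children or []), value, f, None
        for c in self.children:
            c.parent = self
        self.aggregate = value                 # leaves start with their local value

    def update(self, value, up):
        """Write a new local value and re-aggregate `up` levels toward the root."""
        self.value = self.aggregate = value
        node, climbed = self.parent, 0
        while node is not None and (up == "all" or climbed < up):
            node.aggregate = node.f(c.aggregate for c in node.children)
            node, climbed = node.parent, climbed + 1

    def probe(self):
        """Lazy (on-read) re-aggregation of any levels a bounded update skipped."""
        if self.children:
            self.aggregate = self.f(c.probe() for c in self.children)
        return self.aggregate

leaves = [Node(value=v) for v in (0.3, 0.6, 0.1, 0.7)]
root = Node([Node(leaves[:2]), Node(leaves[2:])])
print(root.probe())                 # 0.1  -- initial aggregate over all leaves
leaves[3].update(0.05, up=1)        # re-aggregate only one level up
print(root.aggregate)               # 0.1  -- the root is stale under this policy
print(root.probe())                 # 0.05 -- a probe re-aggregates on demand
```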

43 43 Self-tuning Aggregation Some apps can forecast their read-write rates What about others?  Cannot or do not want to specify them  Spatial heterogeneity  Temporal heterogeneity Shruti: Dynamically tunes aggregation  Keeps track of read and write patterns

44 44 Shruti – Dynamic Adaptation Start from the Update-Up policy (up=all, down=0) [Figure: a node R repeatedly probes for an attribute that is updated at node A]

45 45 Shruti – Dynamic Adaptation Lease-based mechanism  Once a lease is granted, updates are forwarded until the lease is relinquished [Figure: with the lease in place, updates from A are forwarded down toward R]

46 46 Shruti – In Brief On each node  Tracks updates and probes Both local and from neighbors  Sets and removes leases Grants a lease to a neighbor A  When it receives k probes from A with no intervening updates Relinquishes a lease from a neighbor A  When it receives m updates from A with no intervening probes
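A sketch of this bookkeeping (thresholds, names, and the exact counter-reset behavior are my simplifications; only the grant and relinquish rules come from the slide):

```python
class LeaseState:
    """Per-attribute lease bookkeeping at one node (simplified sketch)."""
    def __init__(self, k=3, m=3):
        self.k, self.m = k, m
        self.probes_from = {}      # probes from each neighbor since the last update
        self.updates_from = {}     # updates from each neighbor since the last probe
        self.granted = set()       # neighbors we granted a lease to (we push updates to them)
        self.held = set()          # neighbors we hold a lease from (they push updates to us)

    def on_probe(self, neighbor):
        self.updates_from.clear()  # a probe breaks every "updates with no probes" streak
        self.probes_from[neighbor] = self.probes_from.get(neighbor, 0) + 1
        if self.probes_from[neighbor] >= self.k:
            self.granted.add(neighbor)          # k probes, no updates: grant a lease

    def on_update(self, neighbor):
        self.probes_from.clear()   # an update breaks every "probes with no updates" streak
        self.updates_from[neighbor] = self.updates_from.get(neighbor, 0) + 1
        if self.updates_from[neighbor] >= self.m and neighbor in self.held:
            self.held.discard(neighbor)         # m updates, no probes: relinquish the lease

s = LeaseState(k=2, m=2)
s.on_probe("A"); s.on_probe("A")
print("A" in s.granted)            # True: A's repeated probes earn it a lease
```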

47 47 Flexibility Challenge  Support applications with different read-write behavior Our approach  Separate mechanism from policy  Let applications specify an aggregation policy via the up and down knobs in the Install interface  Provide a lease-based self-tuning aggregation strategy

48 48 Outline SDIMS: a general information management middleware Aggregation abstraction SDIMS Design  Scalability with machines and attributes  Flexibility to accommodate various applications  Autonomy to respect administrative structure  Robustness to failures Experimental results SDIMS in other projects Conclusions and future research directions

49 49 Administrative Autonomy Systems span multiple administrative domains Allow a domain administrator to control information flow  Prevent an external observer from observing the domain's information  Prevent external failures from affecting the domain's operations Challenge  DHT trees might not conform to the domain hierarchy [Figure: machines A, B, C, D grouped into administrative domains]

50 50 Administrative Autonomy Our approach: Autonomous DHTs Two properties  Path locality  Path convergence Together these ensure that the virtual nodes aggregating a domain's data are hosted on machines within that domain [Figure: machines A, B, C, D grouped into domains]

51 51 Autonomy – Example [Figure: an autonomous DHT tree spanning cs.utexas.edu, ece.utexas.edu, and phy.utexas.edu, with levels L0–L3, satisfying path locality (routing within a domain stays inside the domain) and path convergence (routes from a domain's machines to the same key meet inside the domain)]

52 52 Autonomy – Challenge DHT trees might not conform  Example: DHT tree for key = 111 [Figure: Pastry tree for key 111 in which paths from machines in domain1 leave the domain before converging] Autonomous DHT with two properties  Path Locality  Path Convergence

53 53 Robustness Large-scale system  failures are common  Handle failures gracefully  Enable applications to trade off cost of adaptation, response latency, and consistency Techniques  Tree repair Leverage DHT self-organizing properties  Aggregated information repair Default lazy re-aggregation on failures On-demand fast re-aggregation

54 54 Outline SDIMS: a general information management middleware Aggregation abstraction SDIMS Design  Scalability with machines and attributes  Flexibility to accommodate various applications  Autonomy to respect administrative structure  Robustness to failures Experimental results SDIMS in other projects Conclusions and future research directions

55 55 Evaluation SDIMS prototype  Built using the FreePastry DHT framework [Rice Univ.]  Three layers: Aggregation Management, Aggregation Tree Topology Management, Autonomous DHT Methodology  Simulation Scalability and flexibility  Micro-benchmarks on real networks PlanetLab and the CS department network

56 56 Simulation Results - Scalability Small multicast sessions of size 8 Node stress = amount of incoming and outgoing information [Figure: maximum node stress for single-tree aggregation (AS) and SDIMS with 256, 4096, and 65536 machines]

57 57 Simulation Results - Scalability Orders of magnitude difference in maximum node stress → better load balance [Figure: same plot as above]

58 58 Simulation Results - Scalability [Figure: same plot, annotated with the trends "decreasing max load" and "increasing max load" for the two families of curves]

59 59 Simulation Results - Scalability [Figure: same plot with centralized aggregation (Central 256, 4096, 65536) added for comparison]

60 60 Simulation Results - Flexibility Simulation with 4096 nodes Attributes with different up and down strategies [Figure: results for Update-Local, Update-Up, Update-All, up=5/down=0, and up=all/down=5]

61 61 Simulation Results - Flexibility [Figure: same plot, with existing systems (Astrolabe, Ganglia, Sophia, MDS-2, DHT-based systems) placed at the fixed policies they implement]

62 62 Simulation Results - Flexibility When writes dominate reads, Update-Local is best; when reads dominate writes, Update-All is best [Figure: same plot as above]

63 63 Dynamic Adaptation Simulation with 512 nodes [Figure: average message count vs. read-to-write ratio for Update-All, Update-None, Update-Up, up=3/down=0, up=all/down=3, and Shruti]

64 64 Prototype Results CS department network: 180 machines PlanetLab: 70 machines [Figure: probe latency (ms) for Update-All, Update-Up, and Update-Local – department network on a 0–800 ms scale, PlanetLab on a 0–3500 ms scale]

65 65 Outline SDIMS: a general information management middleware Aggregation abstraction SDIMS Design  Scalability with machines and attributes  Flexibility to accommodate various applications  Autonomy to respect administrative structure  Robustness to failures Experimental results SDIMS in other projects Conclusions and future research directions

66 66 SDIMS in Other Projects PRACTI – a replication toolkit (Dahlin et al) Grid Services (TACC)  Resource Scheduling  Data management INSIGHT: Network Monitoring (Jain and Zhang) File location Service (IBM) Scalable Sensing Service (HP Labs)

67 67 PRACTI – A Replication Toolkit Partial Replication – ability to replicate partial content Arbitrary Consistency – allow several consistency policies Topology Independence – allow communication between any two machines

68 68 PRACTI – A Replication Toolkit [Figure: prior systems (e.g. Coda, Sprite; Bayou, TACT; Ficus, Pangaea) each provide only a subset of partial replication, arbitrary consistency, and topology independence; PRACTI provides all three]

69 69 PRACTI Design Core: mechanism Controller: policy  Notified of key events Read miss, update arrival, invalidation arrival, …  Directs communication across cores [Figure: the controller sits alongside a core and an information-management component; applications call read(), write(), delete(); invalidations and updates flow to and from other nodes]

70 70 SDIMS Controller in PRACTI Read Miss: For locating a replica  Similar to “File Location System” example  But handles flash crowds Dissemination tree among requesting clients For Writes: Spanning trees among replicas  Multicast tree for spreading invalidations  Different trees for different objects

71 71 PRACTI – Grid Benchmark Three phases  Read input and programs  Compute (some pairwise reads)  Results back to server Performance improvement:  21% reduction in total time [Figure: home server and Grid machines at school]

72 72 PRACTI Experience Aggregation abstraction and API generality  Construct multicast trees for pushing invalidations  Locate a replica on a local read miss Construct a tree in the case of flash crowds Performance benefits  Grid micro-benchmark: 21% improvement over manual tree construction Ease of implementation  Less than two weeks

73 73 Conclusions Research Vision  Ease design and development of distributed services SDIMS – an information management middleware  Scalability with both machines and attributes An order of magnitude lower maximum node stress  Flexibility in aggregation strategies Support for a wide range of applications  Autonomy  Robustness to failures

74 74 Future Directions Core SDIMS research  Composite queries  Resilience to temporary reconfigurations  Probe functions Other components of a wide-area distributed OS  Scheduling  Data management  Monitoring  … [Figure: research-vision diagram with Information Management, Monitoring, Data Management, Scheduling, Security]

75 75 For more information: http://www.cs.utexas.edu/users/ypraveen/sdims

76 76 SkipNet and Autonomy SkipNet provides constrained load balancing, but only for single-level administrative domains One solution: maintain separate rings in different domains (ece.utexas.edu, cs.utexas.edu, phy.utexas.edu)  But the resulting routes revisit domains and so do not form trees

77 77 Load Balance Let  f = fraction of attributes a node is interested in  N = number of nodes in the system In a DHT, a node will have O(log N) in-degree with high probability

78 78 Related Work Other aggregation systems  Astrolabe, SOMO, Dasis, IrisNet Single tree  Cone Aggregation tree changes with new updates  Ganglia, TAG, Sophia, and IBM Tivoli Monitoring System Database abstraction on DHTs  PIER and Gribble et al. 2001  Support for the "join" operation Can be leveraged for answering composite queries

79 79 Load Balance How many attributes does a node handle? O(log N) levels, few children (c) per virtual node, and each node is interested in only a few (d) attributes Level 0: d Level 1: c · d/2 Level 2: c² · d/4 … Total = d · [1 + c/2 + c²/4 + …] = O(d · log N), since with c ≤ 2 each term is at most d and there are O(log N) levels

80 80 PRACTI – Approach Bayou-style log exchange  But allow partial replication Two key ideas  Separate invalidations from updates → partial replication of data  Imprecise invalidations (summaries of sets of invalidations) → partial replication of metadata

81 81 PRACTI For reads – locate a replica on a read miss For writes – construct a spanning tree among replicas  To propagate invalidations  To propagate updates [Figure: controller and core components]

82 82 SDIMS is not yet another DHT system Typical DHT applications  Use the put and get hash-table interfaces SDIMS  Exposes aggregation as a general abstraction

83 83 Autonomy Increase in path length Path convergence violations  None in the autonomous DHT [Figure: increase in path length and number of path-convergence violations for Pastry vs. the autonomous DHT (ADHT) at bf = 4, 16, and 64, where bf is the branching factor (nodes per domain)]

84 84 Autonomy [Figure: same plots, annotated with how tree height and the number of violations change as bf grows]

85 85 Robustness  PlanetLab with 67 nodes Aggregation function: summation; strategy: Update-Up Each node updates the attribute with value 10

86 86 Sparse attributes Attributes of interest to only a few nodes  Example: a file "foo" in the file location application  Key for scalability Challenge:  The aggregation abstraction defines one function per attribute  Dilemma A separate aggregation function for each attribute → unnecessary storage and communication overheads A vector of values with one aggregation function → defeats the DHT advantage

87 87 Sparse attributes A separate aggregation function for each attribute causes unnecessary storage and communication overheads:
Attribute    Function           Value
fileFoo      AggrFuncFileFoo    macID
fileBar      AggrFuncFileBar    macID
……           ……                 ……

88 88 Sparse attributes A single vector-valued attribute with one aggregation function defeats the DHT advantage (everything maps to one tree):
Attribute    Function           Value
file         AggrFuncFileLoc    ("foo", "bar", ……)

89 89 Novel Aggregation Abstraction Separate attribute type from attribute name  Attribute = (attribute type, attribute name)  Example: type = "fileLocation", name = "fileFoo" Define the aggregation function per type Example state at machine macA (IP addr 1.1.1.1):
Attr. Type      Attr. Name    Value
fileLocation    fileFoo       1.1.1.1
fileLocation    fileBar       1.1.1.1
MIN             cpuLoad       (macA, 0.3)
multicast       mcastSess1    yes
Installed aggregation functions:
Attr. Type      Aggr. Function
fileLocation    SELECT_ONE
MIN             MIN
multicast       MULTICAST
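A small sketch of this split (the registry class and names are illustrative): one function is installed per attribute type, and any number of named attributes of that type reuse it.

```python
class AggregationRegistry:
    """Maps an attribute *type* to its aggregation function (sketch only)."""
    def __init__(self):
        self.functions = {}

    def install(self, attr_type, func):
        self.functions[attr_type] = func

    def aggregate(self, attr_type, attr_name, child_values):
        # attr_name selects the named attribute; the function depends only on the type.
        return self.functions[attr_type](child_values)

def select_one(values):
    return next((v for v in values if v is not None), None)

reg = AggregationRegistry()
reg.install("fileLocation", select_one)                  # installed once per type
print(reg.aggregate("fileLocation", "fileFoo", [None, "1.1.1.1"]))  # '1.1.1.1'
print(reg.aggregate("fileLocation", "fileBar", ["1.1.1.1", None]))  # same function reused
```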

90 90 Example – File Location Attribute: "fileFoo"  Value at a machine with id machineId machineId if file "Foo" exists on the machine null otherwise Aggregation function  SELECT_TWO (set of machine ids) Query: Tell me two machines with file "Foo". [Figure: same tree as before, with root aggregate {B, C}]

91 91 A Key Component Most large-scale distributed applications  Monitor, query, and react to changes in the system  Examples: information collection and management, system administration and management, service placement and location, sensor monitoring and control, distributed denial-of-service attack detection, file location service, multicast tree construction, naming and request routing, ………… A fundamental building block

92 92 CS Department Micro-benchmark Experiment

93 93 API Exposed to Applications Install (attrType, function, up, down) Update (attrType, attrName, value) Probe (attrType, attrName, level, mode) Example state in SDIMS at a leaf node (level = 0):
Type            Name       Value
MIN             minLoad    (A, 0.3)
fileLocation    fileFoo    1.1.1.1
fileLocation    fileBar    1.1.1.1
Installed aggregation functions:
Type            Function
min             MIN
fileLocation    SELECT-ONE

