MOAT: A Multi-Object Assignment Toolkit Haifeng Yu Intel Research Pittsburgh / CMU Joint work with: Phillip B. Gibbons Intel Research Pittsburgh.


Background
- Availability has become a principal design goal:
  - 0.1% improvement → $2M / year for Amazon and eBay [internetweek.com]
  - One major focus of 8 OSDI’04 papers (out of 27)
- Two orthogonal efforts:
  - Robustness of lower-level system components. Example: disk, individual machine, Internet routing
  - Higher-level redundancy. Example: data replication
- This talk focuses on higher-level redundancy

High Availability via Replication
- Large amounts of data accessed by many users:
  - Distributed file systems
  - Network monitoring (PIER, SDIMS, IRISLOG)
  - Index databases for search engines (Google, p2p)
  - Scientific / medical databases
- Data replicated across multiple machines
- Object: the unit of replication
  - File, file block, database table, database tuple, inverted index for a certain keyword

Multi-object Accesses
- Many accesses request multiple objects:
  - Compiling a project
  - Writing a paper in LaTeX
  - Asking for aggregates of network conditions
  - Searching for web pages containing multiple keywords
- Availability of a single object can be misleading:
  - An access requesting 1,000 objects can observe up to 1,000 times higher unavailability
- There is more subtlety...
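
The "up to 1,000 times" figure is what a union bound gives. The sketch below is only an illustration under an assumption the slide does not state: objects are unavailable independently, each with the same probability u (the value of u is hypothetical).

```python
def access_unavailability(u: float, n: int) -> float:
    """Unavailability of an access that needs all n objects,
    assuming each object is independently unavailable with probability u."""
    return 1.0 - (1.0 - u) ** n


def union_bound(u: float, n: int) -> float:
    """Upper bound n * u (capped at 1) on the same quantity."""
    return min(1.0, n * u)


if __name__ == "__main__":
    u = 1e-6  # hypothetical single-object unavailability
    for n in (1, 10, 1000):
        print(n, access_unavailability(u, n), union_bound(u, n))
```

For small u the exact value sits just below the bound, so the access is close to n times less available than a single object.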

A Simple Example
- Compile a small project with four files, each with two replicas: A, A, B, B, C, C, D, D
- Four machines fail independently with the same probability; each holds two files
- Which assignment gives better availability?
  [Figure: assignment 1 places {A, B}, {C, D}, {A, B}, {C, D} on the four machines; assignment 2 places {A, B}, {C, D}, {A, C}, {B, D}; one of the two is marked "Better"]
- Assignment matters because objects are now correlated

A Simple Example - Continued
- Suppose the user is happy even if only three objects are available (e.g., when computing an average)
  [Figure: the same two assignments as on the previous slide; one of the two is marked "Better"]
- Assignment makes a difference
  - Even when using the same machines (same amount of redundancy / resources)
  - The difference can easily be multiple nines
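
The two cases of the example can be checked by brute force. This is a minimal sketch, not the authors' code: the two machine layouts are one reading of the slide's figure, and the failure probability p = 0.1 is a hypothetical value chosen for illustration.

```python
from itertools import product

# Hypothetical reading of the two assignments in the example:
# each set lists the files held by one of the four machines.
ASSIGNMENT_1 = [{"A", "B"}, {"C", "D"}, {"A", "B"}, {"C", "D"}]
ASSIGNMENT_2 = [{"A", "B"}, {"C", "D"}, {"A", "C"}, {"B", "D"}]


def availability(assignment, p, t, files="ABCD"):
    """Exact Prob[at least t files reachable], enumerating machine up/down states."""
    total = 0.0
    for up in product([True, False], repeat=len(assignment)):
        prob = 1.0
        for is_up in up:
            prob *= (1.0 - p) if is_up else p
        reachable = set()
        for is_up, held in zip(up, assignment):
            if is_up:
                reachable |= held
        if sum(f in reachable for f in files) >= t:
            total += prob
    return total


if __name__ == "__main__":
    p = 0.1  # assumed per-machine failure probability
    for t in (4, 3):
        print(t, availability(ASSIGNMENT_1, p, t), availability(ASSIGNMENT_2, p, t))
```

Under this reading, the partitioned layout (assignment 1) comes out ahead when all four files are needed, and the mixed layout (assignment 2) comes out ahead when three of four suffice.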

Goal and Contributions
- MOAT (Multi-Object Assignment Toolkit):
  - Goal: high availability for multi-object accesses
  - Key issue: replica assignment
- Contributions:
  - First to observe the importance of replica assignment
  - Strong theoretical results regarding the best and worst assignments
  - Practical designs to approximate optimal assignments
  - MOAT toolkit implementation for replica assignments

Outline
- Motivation and MOAT contributions (done)
- System model and case studies of existing systems
- Theoretical results
- Designs for approximating optimal assignments
- Designs for mixed accesses
- Conclusions

Assumptions for This Talk
- Assume:
  - Replication (no erasure coding)
  - Crash failures (no Byzantine failures)
  - Eventual consistency (no quorum or voting)
  - Most of our results hold without these assumptions
- Assume the same replication degree for all objects
  - We have results for different replication degrees as well
- Talk to me if you are interested in the more complete story...

MOAT Architecture Overview
[Figure: layered architecture with the following labels]
- Applications: file system, network monitoring, p2p DB, search engine
- Data API: obj create / delete / read / write
- Control API: assignment policy
- MOAT: replication / repair / load balancing / naming / assignment
- Storage system: raw data on distributed machines or disks

System Model
- Basic system model:
  - N objects, each with k replicas
  - Load balanced among all machines
  - Machines fail independently with the same probability
- An assignment is a mapping replica → machine, for all N × k replicas
  [Figure: example assignment of replicas of A, B, C, D to machines]

Some Simple Assignments
- PTN: partition assignment
  - Used in most Coda deployments in practice [Satyanarayanan et al.’90]
  [Figure: PTN for k = 2: two machines each hold {A, B, C}, two machines each hold {D, E, F}, and so on]
- RAND: pick a random replica each time
  - Similar to Google File System [Ghemawat et al.’03]
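
The two assignments can be sketched as mappings from object to machine list. This is an illustrative sketch only: the function names, the round-robin grouping, and the use of uniform sampling for RAND are assumptions, and this RAND is load balanced only in expectation.

```python
import random


def ptn_assignment(num_objects, num_machines, k):
    """Partition assignment: machines are split into groups of k, and every
    object in a group is replicated on all k machines of that group."""
    if num_machines % k != 0:
        raise ValueError("this sketch assumes k divides the number of machines")
    num_groups = num_machines // k
    assignment = {}
    for obj in range(num_objects):
        group = obj % num_groups  # round-robin keeps groups balanced
        assignment[obj] = [group * k + i for i in range(k)]
    return assignment


def rand_assignment(num_objects, num_machines, k, seed=0):
    """Random assignment: each object gets k distinct machines chosen uniformly."""
    rng = random.Random(seed)
    return {obj: rng.sample(range(num_machines), k) for obj in range(num_objects)}
```

With PTN the number of distinct replica sets equals the number of groups; with RAND it is typically close to the number of objects, which is what drives the availability difference discussed later in the talk.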

Assignment in Chord [Stoica et al.’01]
- DHTs:
  - Hash the machine IP to get the machine id
- Assignment in Chord:
  - Sliding window
  - Neither PTN nor RAND
  [Figure: ring with machine ids 080, 090, 098, 101, 104, 120; hash(A) = 95; replicas of A, B, C on overlapping runs of consecutive machines]
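
A sliding-window placement can be sketched as follows. The slide only says "sliding window"; the sketch assumes the common successor-list scheme, where an object goes to the first machine at or after its hash and to the next k - 1 machines on the ring. The function name is hypothetical.

```python
from bisect import bisect_left


def chord_replicas(key, machine_ids, k):
    """Return the k machines that follow `key` clockwise on the ring.

    Assumes k <= number of machines.
    """
    ring = sorted(machine_ids)
    start = bisect_left(ring, key)  # index of first machine id >= key
    return [ring[(start + i) % len(ring)] for i in range(k)]


if __name__ == "__main__":
    ids = [101, 120, 104, 98, 90, 80]  # machine ids from the slide's figure
    print(chord_replicas(95, ids, 2))  # hash(A) = 95
```

Because neighbouring keys share most of their window, the replica sets overlap partially, which is why this layout is neither PTN nor RAND.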

Assignment in CAN [Ratnasamy et al.’01]
- Hash the object k times
  - CAN uses a similar approach
- Similar to RAND
  - But machines may hold slightly different numbers of objects
  [Figure: ring with machine ids 080, 090, 098, 101, 104, 120; hash1(A) = 95 places the first replica of A]

Assignment in CAN [Ratnasamy et al.’01] (continued)
  [Figure: same ring; hash2(A) = 119 places the second replica of A]

Assignment in CAN [Ratnasamy et al.’01] (continued)
  [Figure: same ring; hash1(B) = 84 and hash2(B) = 100 place the two replicas of B]
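
Hashing an object k times can be sketched like this. It is an illustration under assumptions: the k hash functions are derived by salting SHA-1 with the replica index, each hash is mapped to its successor on a one-dimensional ring as in the figure (CAN itself uses a multi-dimensional space), and `RING_SIZE` is a hypothetical id space.

```python
import hashlib
from bisect import bisect_left

RING_SIZE = 128  # hypothetical id space, chosen only to cover the ids in the figure


def salted_hash(name, i, ring_size=RING_SIZE):
    """i-th hash of an object name, derived by salting SHA-1 with the index i."""
    digest = hashlib.sha1(f"{i}:{name}".encode()).digest()
    return int.from_bytes(digest, "big") % ring_size


def successor(key, ring):
    """First machine id >= key on the sorted ring, wrapping around."""
    return ring[bisect_left(ring, key) % len(ring)]


def multi_hash_replicas(name, machine_ids, k):
    """Machine chosen by each of the k hashes (two hashes may land on one machine)."""
    ring = sorted(machine_ids)
    return [successor(salted_hash(name, i), ring) for i in range(k)]
```

With the figure's values, a hash of 95 maps to machine 098, 119 to 120, 84 to 090, and 100 to 101. Since each replica location is chosen independently, the result resembles RAND, and the per-machine load is only balanced in expectation.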

Which Assignment Should We Use?
- MOAT goal: improve availability of multi-object accesses
- If an access requests n (n ≤ N) objects, what if only x are available?
- Threshold-based success definition:
  - If x ≥ t, the user is happy → available
  - If x < t, confidence is too low → unavailable
- Availability of an access is defined as:
  - Prob[≥ t objects available out of n requested objects]
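
The threshold definition can be estimated by simulation for any assignment. This is a minimal Monte Carlo sketch, not the simulator used for the talk's plots; the function name and parameters are hypothetical, and it assumes an object is available whenever at least one machine holding a replica is up.

```python
import random


def estimate_availability(assignment, num_machines, p, requested, t,
                          trials=10_000, seed=0):
    """Monte Carlo estimate of Prob[at least t of the requested objects are available].

    assignment maps object -> list of machines holding a replica; an object is
    available if any of its machines is up. Machines fail independently with prob p.
    """
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        up = [rng.random() >= p for _ in range(num_machines)]
        available = sum(any(up[m] for m in assignment[obj]) for obj in requested)
        if available >= t:
            successes += 1
    return successes / trials
```

Calling it with `t = len(requested)` gives the strict case where all requested objects are needed; lowering t models accesses that tolerate missing objects.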

Examples of t
- t = n
  - File systems
  - Search for terrorist images in an image database
- t close to n
  - Query for the top-10 most-loaded machines on PlanetLab
- t not close to n
  - Sampling with confidence

Outline
- Motivation and MOAT contributions (done)
- System model and case studies of existing systems (done)
- Theoretical results
- Designs for approximating optimal assignments
- Designs for mixed accesses
- Conclusions

Formal Results
- For an access requesting all N objects
- Theorem: among all assignments, when t = N:
  - PTN is best (within a constant)
  - RAND is worst (within a constant)
  - The difference is about c-fold (c is the number of objects per machine)
- Theorem: among all assignments, when t = c + 1 < N:
  - PTN is worst
  - RAND is best (within a constant)
  - The difference is even larger
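
A back-of-the-envelope calculation shows where the factor c comes from when t = N. It is a heuristic for small p, not the proof behind the theorem. The symbols s (number of machines) and p (per-machine failure probability) are introduced here for illustration; c = Nk/s follows from load balancing N × k replicas over s machines.

```latex
\begin{align*}
U_{\mathrm{PTN}} &= 1 - \bigl(1 - p^{k}\bigr)^{s/k} \approx \frac{s}{k}\, p^{k}
  && \text{(some group of $k$ machines is entirely down)} \\
U_{\mathrm{RAND}} &\le N p^{k}
  && \text{(union bound over $N$ objects)} \\
\frac{N p^{k}}{(s/k)\, p^{k}} &= \frac{N k}{s} = c
\end{align*}
```

The union bound for RAND is close to tight when p is small and most objects have distinct replica sets, which gives the roughly c-fold gap.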

Numerical Examples (from Simulation)
[Figure: unavailability vs. threshold for RAND (CAN), PTN, and Chord; 40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2. Annotations: "c times difference if p is small, where c is # obj / machine" and "unavail of single obj"]

A Spectrum of Assignments
[Figure: unavailability vs. threshold for RAND (CAN) and PTN; 40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2]

More Formal Arguments
- The tradeoff is fundamental:
  - Impossible to achieve the best of both RAND and PTN
- The previous results are only for an access requesting all N objects
- Similar results hold for accesses requesting n (n ≤ N) objects
  - But each machine may not be filled to capacity:
    - For PTN, use as few machines as possible
    - For RAND, use as many machines as possible
- I have more... talk to me if you are interested

Access Requesting 500 Objects
[Figure: unavailability vs. threshold for RAND (CAN), PTN, and Chord; 40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2]

Outline
- Motivation and MOAT contributions (done)
- System model and case studies of existing systems (done)
- Theoretical results (done)
- Designs for approximating optimal assignments
- Designs for mixed accesses
- Conclusions

Design of Replica Assignment
- Trivial in a static / centralized environment
- Challenging in a dynamic environment:
  - We may not have global knowledge with many objects and many machines
- Basic solution: consistent hashing
  - But some re-design is necessary

Approximating RAND
- Multi-hash DHT:
  - Hash the object k times
  - As in CAN
  [Figure: ring with machine ids 080, 090, 098, 101, 104, 120; hash1(B) = 84 and hash2(B) = 100 place the two replicas of B]

Approximating PTN
- Chord does not achieve PTN
  [Figure: ring with machine ids 080, 090, 098, 101, 104, 120; hash(A) = 95; replicas of A, B, C on overlapping runs of consecutive machines]

Approximating PTN
- Chord does not achieve PTN
- Group DHT:
  - (Arbitrarily) partition machines into groups of size k
  [Figure: ring of groups with ids 090, 101, 120; hash(A) = 95; each group holds the same objects on all of its members]
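
The grouping idea can be sketched as a DHT whose ring members are groups rather than machines. This is an illustrative sketch, not MOAT's implementation: the function names are hypothetical, and how a group obtains its ring id is left as an input because the slide does not say.

```python
from bisect import bisect_left


def form_groups(machines, k):
    """Arbitrarily split machines into consecutive groups of size k.

    Leftover machines (fewer than k) are not placed in any group.
    """
    return [machines[i:i + k] for i in range(0, len(machines) - k + 1, k)]


def group_for_key(key, group_ids):
    """Successor lookup over group ids instead of individual machine ids."""
    ring = sorted(group_ids)
    return ring[bisect_left(ring, key) % len(ring)]


def replicas(key, groups_by_id):
    """All k members of the responsible group hold a replica of the object."""
    return groups_by_id[group_for_key(key, groups_by_id.keys())]
```

Since every object that maps to a group is stored on all k of its members, the replica sets are disjoint across groups, which is the PTN property that plain Chord lacks.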

Node Join and Leave in Group DHT
- Maintain r rendezvous points in the DHT
  - Diminishing Chord [Karger et al.’04] / ReDir [Karp et al.’04]
- A new node reports to a random rendezvous point
  - If a group can be formed, it joins the DHT
- Two options upon node leave:
  - Dismiss the group and delete it from the DHT
  - The group waits to recruit a new node
  - Groups use the rendezvous point to decide

Complexity Analysis
Metric            | Standard DHT | Group DHT
Routing state     | log N        | log N/k
Routing hops      | log N        | log N/k
Messages / Join   | (log N)^2    | log N/k + (log N/k)^2 / k
Messages / Leave  | (log N)^2    | log N/k + (log N/k)^2 / k
Obj moves / Join  | k/N          |
Obj moves / Leave | k/N          | 2k/N

Outline
- Motivation and MOAT contributions (done)
- System model and case studies of existing systems (done)
- Theoretical results (done)
- Designs for approximating optimal assignments (done)
- Designs for mixed accesses
- Conclusions

Mixture of Queries
- The previous design is only for a single access requesting all N objects
  - PTN if t is close to N
  - RAND if t is far from N
- But there are other accesses
  - Requesting n (n < N) objects with threshold t
- How does t change with n?
  - Infinite possibilities
  - We focus on 4 large categories

Four Application Scenarios
Scenario | Small accesses (small n) | Large accesses (large n)
File system | strict | strict
Computing aggregates | loose | loose
Network monitoring | strict (pinpoint problems) | loose (overview query)
Image database search | loose (resource retrieval of frequent objects, e.g., find clip art for a slide) | strict (non-existence test, e.g., exhaustive search for a terrorist)
Strict accesses: t ≈ n
Loose accesses: t < n

Loose for both small and large n
- Goal:
  - Approach RAND for both small and large n
- Design:
  - Multi-hash DHT
  [Figure: ring with machine ids 080, 090, 098, 101, 104, 120; hash1(B) = 84 and hash2(B) = 100 place the two replicas of B]

Loose for small n; Strict for large n
- Goal:
  - Approach RAND for small n
  - Approach PTN for large n
- Design:
  - Group DHT
  [Figure: ring of groups with ids 090, 101, 120, each group holding replicas of its objects on all members]

Strict for both small and large n
- Goal:
  - Approach PTN for both small and large n
  - Assume accesses are tree accesses
- Design:
  - Group DHT with item-balancing [Karger et al.’04]
  [Figure: ring of groups with ids 090, 101, 120; object A is assigned id 95]

Strict for small n; Loose for large n
- Goal:
  - Approach PTN for n < R
  - Approach RAND for n >> R
- Design:
  - Multi-hash DHT
  - But group objects into clusters of constant size R
  [Figure: ring with machine ids 080, 090, 098, 101, 104, 120; hash1(AB) = 84 and hash2(AB) = 100 place both replicas of the cluster {A, B}]
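
Clustering before hashing can be sketched as follows. The details are assumptions made for illustration: objects are numbered and consecutive indices form a cluster, the k hashes are salted SHA-1, and each hash is reduced directly to a machine index instead of going through a ring lookup.

```python
import hashlib


def cluster_of(obj_index, R):
    """Cluster id, assuming consecutive object indices are clustered together."""
    return obj_index // R


def cluster_machines(cluster_id, num_machines, k):
    """Hash the cluster id (not the object) k times to pick its machines."""
    chosen = []
    for i in range(k):
        digest = hashlib.sha1(f"{i}:{cluster_id}".encode()).digest()
        chosen.append(int.from_bytes(digest, "big") % num_machines)
    return chosen


def clustered_replicas(obj_index, R, num_machines, k):
    """Machines holding replicas of an object: those of its cluster."""
    return cluster_machines(cluster_of(obj_index, R), num_machines, k)
```

Objects in the same cluster always share machines, so a small access that stays inside one cluster behaves as under PTN, while a large access spans many independently placed clusters and behaves more like RAND.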

Simulation Results for Strict Accesses
[Figure: unavailability vs. number (n) of objects requested by an access; here an access needs all n objects to be successful; 400 machines, fail prob = 0.2, 40,000 objects, 4 replicas / object]

Simulation Results for Loose Accesses
[Figure: unavailability vs. number (n) of objects requested by an access; here an access needs only t = n - 150 objects to be successful; 400 machines, fail prob = 0.2, 40,000 objects, 4 replicas / object]

Current Status
- Waiting for paper deadlines
- Finishing the MOAT implementation
- Evaluation on IrisLog trace and file system traces

Related Work
- Multi-object accesses rarely addressed
  - CFS [Dabek et al.’01] focuses on individual file blocks
  - Chain replication [Renesse et al.’04] considers a single data object
  - A long list...
- Replica assignment largely ignored
  - Different DHTs (e.g., Chord, Pastry, CAN) use dramatically different replica assignments: effects not understood / studied
- Replica placement [Douceur et al.’01, Li et al.’99, Qiu et al.’01, Venkataramani et al.’01, Yu et al.’04] well studied:
  - Typically for machines in different locations in the network
  - Machines are heterogeneous
  - These approaches do not apply to replica assignment

Conclusions
- Availability is becoming a key design goal
- Multi-object access availability is dramatically different from single-object availability
- MOAT contributions:
  - First to observe the importance of replica assignment
  - Strong theoretical results regarding the best and worst assignments
  - Practical designs to approximate optimal assignments
  - MOAT toolkit implementation

My Other Recent Work
- Om [NSDI’04]:
  - Consistent and automatic replica regeneration
  - Regenerate from any single replica rather than a majority
- Signed quorum systems [PODC’04]:
  - Constant quorum size at the cost of a small probability of inconsistency
- Node failure characteristics in WAN [WORLDS’04]:
  - Answer subtle questions regarding real-world failure properties


Erasure Coding
- Encode the object into k fragments such that any m (m < k) out of the k fragments can reconstruct the object
- RAID techniques are special cases
- Replication is a special case where m = 1
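
The availability of a single m-out-of-k coded object is a binomial tail. The sketch below assumes each fragment sits on a different machine and that machines fail independently with probability p; the function name is hypothetical.

```python
from math import comb


def object_availability(k, m, p):
    """Prob[at least m of k fragments survive] when each fragment's machine
    fails independently with probability p."""
    return sum(comb(k, i) * (1 - p) ** i * p ** (k - i) for i in range(m, k + 1))


if __name__ == "__main__":
    p = 0.2  # failure probability used in the talk's simulations
    print(object_availability(4, 1, p))  # replication with 4 replicas (m = 1)
    print(object_availability(8, 4, p))  # hypothetical 4-out-of-8 code
```

With m = 1 the formula reduces to 1 - p^k, the familiar availability of a k-way replicated object.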

Example Revisited
- Need four files to compile:
  [Figure: the two assignments from the simple example, one marked "Better"]
- Can we treat A, B, C, D as a single object and use erasure coding, so that all files can be reconstructed from any 4 out of 8 fragments?
- Erasure coding is hard to apply across large amounts of data
  - Updating any portion of the data requires updating k - m + 1 fragments, roughly the size of the original data
  - We cannot use erasure coding across 1,000 files

Threshold Semantics and Erasure Coding
Threshold semantics | Erasure coding
Need t out of n objects to answer a query | Need m out of k fragments to reconstruct an object
t determined by application semantics | m determined at coding time
Result depends on which t objects are available | Same result regardless of which m fragments
A single object may be updated by itself | Modifying any portion of the object requires updating k - m + 1 fragments
In short, they are different, orthogonal concepts

Numerical Examples (from Simulation)
[Figure: unavailability vs. threshold for RAND (CAN), CRAND (10), CRAND (100), PTN, and Chord; 40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2. Annotation: "c times difference if p is small, where c is # obj / machine"]
