Published by Eduardo Burkinshaw. Modified over 10 years ago.
1
MOAT: A Multi-Object Assignment Toolkit
Haifeng Yu, Intel Research Pittsburgh / CMU
Joint work with: Phillip B. Gibbons, Intel Research Pittsburgh
2
Background
Availability has become a principal design goal:
  A 0.1% improvement means $2M / year for Amazon and eBay [internetweek.com]
  One major focus of 8 OSDI’04 papers (out of 27)
Two orthogonal efforts:
  Robustness of lower-level system components
    Example: disk, individual machine, Internet routing
  Higher-level redundancy
    Example: data replication
This talk focuses on higher-level redundancy
3
High Availability via Replication
Large amounts of data accessed by many users:
  Distributed file systems
  Network monitoring (PIER, SDIMS, IrisLog)
  Index databases for search engines (Google, p2p)
  Scientific / medical databases
Data is replicated across multiple machines
Object: the unit of replication
  File, file block, database table, database tuple, inverted index for a certain keyword
4
Multi-object Accesses
Many accesses request multiple objects:
  Compiling a project
  Writing a paper in LaTeX
  Asking for aggregates of network conditions
  Searching for web pages containing multiple keywords
Availability of a single object can be misleading:
  An access requesting 1,000 objects can observe up to 1,000 times higher unavailability
There is more subtlety...
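The "up to 1,000 times" figure follows from a union bound. As a sketch, assume the access needs all n requested objects and that each object is unavailable with the same probability u (the symbols u and E_i are introduced here for illustration and do not appear on the slide):

```latex
\[
\Pr[\text{access unavailable}]
  = \Pr\Big[\bigcup_{i=1}^{n} E_i\Big]
  \le \sum_{i=1}^{n} \Pr[E_i]
  = n\,u
\]
```

Here E_i is the event that object i is unavailable. The bound is approached when the events E_i rarely overlap, so with n = 1,000 the access can be close to 1,000 times less available than a single object.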
5
A Simple Example
Compile a small project with four files, each file having two replicas: A, A, B, B, C, C, D, D
Four machines fail independently with the same probability, and each holds two files
Which assignment gives better availability?
[Figure: two candidate assignments of the replicas to the four machines, (A B, C D, A B, C D) or (A B, C D, A C, B D), with one marked "Better"]
Assignment matters because objects are now correlated
6
A Simple Example - Continued
Suppose the user is happy even if only three objects are available (e.g., when computing an average)
[Figure: the same two candidate assignments, (A B, C D, A B, C D) or (A B, C D, A C, B D), with one marked "Better"]
Assignment makes a difference
  Even if we are using the same machines (same amount of redundancy / resources)
  The difference can easily be multiple nines
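The two example slides can be checked by brute force. The sketch below enumerates all 16 up/down patterns of the four machines. It reads the two pictured assignments as (A B, C D, A B, C D) and (A B, C D, A C, B D); that reading, the names `paired` and `mixed`, and the failure probability p = 0.1 are assumptions made for illustration.

```python
from itertools import product


def availability(assignment, t, p):
    """Exact probability that at least t distinct objects stay reachable.

    assignment: list of sets, one per machine, naming the objects it holds.
    t: number of objects the access needs.
    p: independent failure probability of each machine.
    """
    total = 0.0
    # Enumerate every up/down pattern of the machines.
    for pattern in product([True, False], repeat=len(assignment)):
        prob = 1.0
        alive = set()
        for machine_up, held in zip(pattern, assignment):
            prob *= (1 - p) if machine_up else p
            if machine_up:
                alive |= held
        if len(alive) >= t:
            total += prob
    return total


# Hypothetical encodings of the two pictured assignments.
paired = [{"A", "B"}, {"C", "D"}, {"A", "B"}, {"C", "D"}]
mixed = [{"A", "B"}, {"C", "D"}, {"A", "C"}, {"B", "D"}]

for t in (4, 3):
    print(t, round(availability(paired, t, 0.1), 4),
          round(availability(mixed, t, 0.1), 4))
```

Under these assumptions the paired layout is better when all four files are needed (0.9801 versus 0.9639), and the mixed layout is better when any three suffice (0.9963 versus 0.9801).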
7
Goal and Contributions
MOAT (Multi-Object Assignment Toolkit):
  Goal: high availability for multi-object accesses
  Key issue: replica assignment
Contributions:
  First to observe the importance of replica assignment
  Strong theoretical results regarding best and worst assignments
  Practical designs to approximate optimal assignments
  MOAT toolkit implementation for replica assignments
8
Outline
  Motivation and MOAT contributions
  System model and case studies of existing systems
  Theoretical results
  Designs for approximating optimal assignments
  Designs for mixed accesses
  Conclusions
9
Assumptions for This Talk
Assume:
  Replication (no erasure coding)
  Crash failures (no Byzantine failures)
  Eventual consistency (no quorum or voting)
  Most of our results hold without these assumptions
Assume the same replication degree for all objects
  We have results for different replication degrees as well
Talk to me if you are interested in the more complete story...
10
MOAT Architecture Overview
[Figure: applications (file system, network monitoring, p2p DB, search engine) sit on top of MOAT, which handles replication / repair / load balancing / naming / assignment over a storage system holding raw data on distributed machines or disks]
Data API: obj create / delete / read / write
Control API: assignment policy
11
System Model
Basic system model:
  N objects, each with k replicas
  Load balancing among all machines
  Machines fail independently with the same probability
An assignment is a mapping replica → machine, for all N × k replicas
[Figure: example assignment of the replicas of A, B, C, D to four machines]
12
Some Simple Assignments
PTN: partition assignment
  Used in most Coda deployments [Satyanarayanan et al.’90]
  [Figure: for k = 2, machines holding (A B C), (D E F), (A B C), (D E F), ...]
RAND: random assignment, where each replica is placed on a randomly chosen machine
  Similar to the Google File System [Ghemawat et al.’03]
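A minimal sketch of the two assignments named above, assuming machines are given as a list and each object maps to the list of machines holding its replicas. The function names, the contiguous split of objects across groups, and the fixed seed are illustrative choices, and this RAND variant does not enforce the load balancing that the system model assumes.

```python
import random


def ptn_assignment(objects, machines, k):
    """PTN: split machines into groups of k; each object lives on one whole group."""
    groups = [machines[i:i + k] for i in range(0, len(machines) - k + 1, k)]
    placement = {}
    for index, obj in enumerate(objects):
        # Contiguous blocks of objects share a group, as in the slide's picture.
        group = groups[index * len(groups) // len(objects)]
        placement[obj] = list(group)
    return placement


def rand_assignment(objects, machines, k, seed=0):
    """RAND: each object gets k distinct machines chosen uniformly at random."""
    rng = random.Random(seed)
    return {obj: rng.sample(machines, k) for obj in objects}


objects = ["A", "B", "C", "D", "E", "F"]
machines = ["m1", "m2", "m3", "m4"]
print(ptn_assignment(objects, machines, 2))
print(rand_assignment(objects, machines, 2))
```

With six objects, four machines, and k = 2, the PTN sketch puts A, B, C on one pair of machines and D, E, F on the other, matching the picture for k = 2.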
13
Assignment in Chord [Stoica et al.’01]
DHTs: hash the machine IP to get the machine id
Assignment in Chord: sliding window
Neither PTN nor RAND
[Figure: ring of machines with ids 080, 090, 098, 101, 104, 120; hash(A) = 95; replicas of A, B, and C placed on consecutive machines]
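The sliding-window placement can be sketched as follows, assuming the common scheme in which an object's replicas go to the first k machines whose ids follow the object's hash on the ring. The function name `chord_replicas` and the choice of k are illustrative; the machine ids and hash(A) = 95 come from the slide.

```python
from bisect import bisect_left


def chord_replicas(key, machine_ids, k):
    """Return the k machines that succeed `key` on the ring, wrapping around."""
    ring = sorted(machine_ids)
    # First machine whose id is >= key; wrap to the start if none exists.
    start = bisect_left(ring, key) % len(ring)
    return [ring[(start + i) % len(ring)] for i in range(k)]


ring_ids = [101, 120, 104, 98, 90, 80]
print(chord_replicas(95, ring_ids, 3))  # hash(A) = 95 -> [98, 101, 104]
```

Objects with nearby hashes share most, but not all, of their machines, which is why this placement is neither PTN nor RAND.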
14
Assignment in CAN [Ratnasamy et al.’01]
Hash the object k times
  CAN uses a similar approach
Similar to RAND
  But machines may hold slightly different numbers of objects
[Figure: ring of machines with ids 080, 090, 098, 101, 104, 120; hash1(A) = 95 places the first replica of A]
15
Assignment in CAN [Ratnasamy et al.’01]
Hash the object k times
  CAN uses a similar approach
Similar to RAND
  But machines may hold slightly different numbers of objects
[Figure: same ring; hash2(A) = 119 places the second replica of A]
16
Assignment in CAN [Ratnasamy et al.’01]
Hash the object k times
  CAN uses a similar approach
Similar to RAND
  But machines may hold slightly different numbers of objects
[Figure: same ring with both replicas of A placed; hash1(B) = 84 and hash2(B) = 100 place the two replicas of B]
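Hashing an object k times can be sketched on the same ring picture. The code below assumes each hash value is served by the first machine id at or after it, and builds the k hash functions by salting SHA-256 with the replica index. The helper names, the salting scheme, and the id space of 128 are assumptions for illustration, not CAN's actual mechanism.

```python
import hashlib
from bisect import bisect_left


def successor(point, ring):
    """First machine id >= point on a sorted ring, wrapping around."""
    return ring[bisect_left(ring, point) % len(ring)]


def hash_i(obj, i, space=128):
    """Hypothetical i-th hash function: SHA-256 salted with the replica index."""
    digest = hashlib.sha256(f"{i}:{obj}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % space


def multi_hash_replicas(obj, ring, k, space=128):
    """One machine per hash function; two hashes may land on the same machine."""
    return [successor(hash_i(obj, i, space), ring) for i in range(k)]


ring = sorted([101, 120, 104, 98, 90, 80])
# Hash values taken from the slides:
print(successor(95, ring), successor(119, ring))  # replicas of A
print(successor(84, ring), successor(100, ring))  # replicas of B
print(multi_hash_replicas("A", ring, 2))
```

Because each replica is placed independently, the result behaves like RAND, and machines owning larger arcs of the ring receive more objects.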
17
Which assignment should we use?
MOAT goal: improve availability of multi-object accesses
If an access requests n (n ≤ N) objects, what if only x are available?
Threshold-based success definition:
  If x ≥ t, the user is happy → available
  If x < t, confidence is too low → unavailable
Availability for an access is defined as:
  Prob[at least t objects available out of n requested objects]
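The threshold rule translates directly into a predicate. The sketch below assumes a placement given as a dictionary from object to the machines holding its replicas, and a set of currently failed machines; the function name and the sample data are hypothetical.

```python
def access_succeeds(placement, failed, requested, t):
    """True if at least t of the requested objects have a live replica."""
    available = sum(
        1 for obj in requested
        if any(machine not in failed for machine in placement[obj])
    )
    return available >= t


placement = {"A": ["m1", "m3"], "B": ["m1", "m4"],
             "C": ["m2", "m3"], "D": ["m2", "m4"]}
failed = {"m1", "m3"}  # object A has lost both replicas
print(access_succeeds(placement, failed, ["A", "B", "C", "D"], t=4))  # False
print(access_succeeds(placement, failed, ["A", "B", "C", "D"], t=3))  # True
```

Averaging this predicate over the distribution of machine failures gives the access availability defined above.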
18
Examples of t
t = n
  File systems
  Search for terrorist images in an image database
t close to n
  Query for the top-10 most-loaded machines on PlanetLab
t not close to n
  Sampling with confidence
19
Outline
  Motivation and MOAT contributions
  System model and case studies of existing systems
  Theoretical results
  Designs for approximating optimal assignments
  Designs for mixed accesses
  Conclusions
20
Formal Results
For an access requesting N objects
Theorem: Among all assignments, when t = N:
  PTN is best (within a constant)
  RAND is worst (within a constant)
  The difference is about c-fold (c is # obj / machine)
Theorem: Among all assignments, when t = c+1 < N:
  PTN is worst
  RAND is best (within a constant)
  The difference is even larger
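A first-order calculation gives some intuition for the c-fold gap when t = N. It is a back-of-the-envelope sketch, not the proof behind the theorem. Assume s machines (s divisible by k), failure probability p per machine, and c = Nk/s objects per machine; the symbols s, U_PTN, and U_RAND are introduced here for illustration.

```latex
\[
U_{\mathrm{PTN}} = 1 - \left(1 - p^{k}\right)^{s/k} \approx \frac{s}{k}\,p^{k},
\qquad
U_{\mathrm{RAND}} \lesssim N\,p^{k},
\qquad
\frac{U_{\mathrm{RAND}}}{U_{\mathrm{PTN}}} \approx \frac{N k}{s} = c
\]
```

Under PTN the access fails only if one of the s/k machine groups fails entirely. Under RAND the objects occupy up to N distinct replica sets, and for sufficiently small p the union bound N p^k is nearly tight, so the ratio of the two is about c.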
21
Numerical Examples (from Simulation)
40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2
[Figure: unavailability versus threshold for RAND (CAN), PTN, and Chord; annotations mark the unavailability of a single object and a c-times difference when p is small, where c is # obj / machine]
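A few reference values for these parameters can be computed in closed form. The sketch below is not the simulation behind the plot; it only evaluates the single-object unavailability, the objects-per-machine constant c, and the PTN unavailability when every object is required (t = N), assuming machines split evenly into groups of k. The function names are hypothetical.

```python
def single_object_unavailability(p, k):
    """An object is lost only if all k of its replicas are on failed machines."""
    return p ** k


def objects_per_machine(n_objects, k, machines):
    """The constant c from the slides, under perfect load balancing."""
    return n_objects * k / machines


def ptn_unavailability_all_objects(p, k, machines):
    """PTN with t = N: the access fails if any whole group of k machines fails."""
    groups = machines // k
    return 1 - (1 - p ** k) ** groups


print(single_object_unavailability(0.2, 4))         # about 0.0016
print(objects_per_machine(40_000, 4, 400))          # 400.0
print(ptn_unavailability_all_objects(0.2, 4, 400))  # about 0.148
```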
22
A Spectrum of Assignments
40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2
[Figure: unavailability versus threshold, with RAND (CAN) and PTN at the two ends of the spectrum]
23
More Formal Arguments
Tradeoff is fundamental:
  Impossible to achieve the best of both RAND and PTN
Previous results only cover an access requesting all N objects
Similar results hold for accesses requesting n (n ≤ N) objects
  But each machine may not be filled to capacity:
  For PTN, use as few machines as possible
  For RAND, use as many machines as possible
I have more... talk to me if you are interested
24
Access Requesting 500 Objects
40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2
[Figure: unavailability versus threshold for RAND (CAN), PTN, and Chord]
25
Outline
  Motivation and MOAT contributions
  System model and case studies of existing systems
  Theoretical results
  Designs for approximating optimal assignments
  Designs for mixed accesses
  Conclusions
26
Design of Replica Assignment
Trivial in a static / centralized environment
Challenging in a dynamic environment:
  We may not have global knowledge with many objects and many machines
Basic solution: consistent hashing
  But some re-design is necessary
27
Approximating RAND
Multi-hash DHT: hash the object k times
  As in CAN
[Figure: ring of machines with ids 080, 090, 098, 101, 104, 120; both replicas of A placed; hash1(B) = 84 and hash2(B) = 100 place the two replicas of B]
28
Approximating PTN
Chord does not achieve PTN
[Figure: Chord ring of machines with ids 080, 090, 098, 101, 104, 120; hash(A) = 95; replicas of A, B, and C on overlapping sets of machines]
29
Approximating PTN
Chord does not achieve PTN
Group DHT: (arbitrarily) group machines into groups of size k
[Figure: ring of groups with ids 090, 101, 120; hash(A) = 95; each group holds full copies of its objects, such as A, B, C]
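The Group DHT idea can be sketched as a ring whose nodes are groups rather than single machines. The code below assumes each group has one id on the ring and that an object belongs to the first group id at or after its hash. The helper names, the machine names, and the way group ids are supplied are assumptions for illustration; the ids 090, 101, 120 and hash(A) = 95 come from the slide.

```python
from bisect import bisect_left


def form_groups(machines, k):
    """Arbitrarily chunk machines into groups of size k (leftovers stay ungrouped)."""
    return [tuple(machines[i:i + k]) for i in range(0, len(machines) - k + 1, k)]


def group_for(key, group_ring):
    """group_ring maps a group id on the ring to the tuple of machines in it."""
    ids = sorted(group_ring)
    group_id = ids[bisect_left(ids, key) % len(ids)]
    return group_ring[group_id]


machines = ["m1", "m2", "m3", "m4", "m5", "m6"]
groups = form_groups(machines, 2)
group_ring = dict(zip([90, 101, 120], groups))
print(group_for(95, group_ring))  # hash(A) = 95 -> the group with id 101
```

Every object in a group is replicated on all k of its machines, so objects either share all of their machines or none, which is the PTN property.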
30
Node Join and Leave in Group DHT
Maintain r rendezvous points in the DHT
  Diminishing Chord [Karger et al.’04] / ReDir [Karp et al.’04]
A new node reports to a random rendezvous point
  If a group can be formed, it joins the DHT
Two options upon node leave:
  Dismiss the group and delete it from the DHT
  The group waits to recruit a new node
  Groups use the rendezvous point to decide
31
Complexity Analysis
Metric | Standard DHT | Group DHT
Routing state | log N | log N/k
Routing hops | log N | log N/k
Messages / Join | (log N)^2 | log N/k + (log N/k)^2 / k
Messages / Leave | (log N)^2 | log N/k + (log N/k)^2 / k
Obj moves / Join | k/N |
Obj moves / Leave | k/N | 2k/N
32
Outline
  Motivation and MOAT contributions
  System model and case studies of existing systems
  Theoretical results
  Designs for approximating optimal assignments
  Designs for mixed accesses
  Conclusions
33
Mixture of Queries
Previous design only covers a single access requesting all N objects
  PTN if t is close to N
  RAND if t is far from N
But there are other accesses
  Requesting n (n < N) objects with threshold t
How does t change with n?
  Infinite possibilities
  We focus on 4 large categories
34
Four Application Scenarios
Scenarios, for small accesses (small n) and large accesses (large n):
  File system: strict
  Computing aggregates: loose
  Network monitoring: strict for small accesses (pinpoint problems); loose for large accesses (overview query)
  Image database search: loose for small accesses (resource retrieval of frequent objects, e.g., find clip art for a slide); strict for large accesses (non-existence test, e.g., exhaustive search for a terrorist)
Strict accesses: t ≈ n
Loose accesses: t < n
35
Loose for both small and large n
Goal: approach RAND for both small and large n
Design: Multi-hash DHT
[Figure: ring of machines with ids 080, 090, 098, 101, 104, 120; both replicas of A placed; hash1(B) = 84 and hash2(B) = 100 place the two replicas of B]
36
Loose for small n; Strict for large n
Goal:
  Approach RAND for small n
  Approach PTN for large n
Design: Group DHT
[Figure: ring of groups with ids 090, 101, 120, each group holding full copies of its objects]
37
Strict for both small and large n
Goal: approach PTN for both small and large n
Assume accesses are tree accesses
Design: Group DHT with item-balancing [Karger et al.’04]
[Figure: ring of groups with ids 090, 101, 120; object A placed at position 95]
38
Strict for small n; Loose for large n
Goal:
  Approach PTN for n < R
  Approach RAND for n >> R
Design: Multi-hash DHT
  But cluster objects into clusters of constant size R
[Figure: ring of machines with ids 080, 090, 098, 101, 104, 120; the cluster AB is hashed as a unit, with hash1(AB) = 84 and hash2(AB) = 100]
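Clustering changes only what gets hashed: the cluster rather than the individual object. The sketch below assumes objects are numbered and that R consecutive objects form a cluster; that clustering rule, the salted SHA-256 hash functions, the id space of 128, and all function names are assumptions for illustration.

```python
import hashlib
from bisect import bisect_left


def successor(point, ring):
    """First machine id >= point on a sorted ring, wrapping around."""
    return ring[bisect_left(ring, point) % len(ring)]


def hash_i(name, i, space=128):
    """Hypothetical i-th hash function: SHA-256 salted with the replica index."""
    digest = hashlib.sha256(f"{i}:{name}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % space


def clustered_replicas(object_index, cluster_size, ring, k, space=128):
    """Hash the object's cluster k times, so a cluster's objects move together."""
    cluster = object_index // cluster_size
    return [successor(hash_i(f"cluster-{cluster}", i, space), ring)
            for i in range(k)]


ring = sorted([101, 120, 104, 98, 90, 80])
# With R = 2, objects 0 and 1 form one cluster and share their machines.
print(clustered_replicas(0, 2, ring, 2), clustered_replicas(1, 2, ring, 2))
print(clustered_replicas(2, 2, ring, 2))
```

An access that stays within one cluster sees a PTN-like placement, while an access spanning many clusters sees them spread as under RAND.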
39
Simulation Results for Strict Accesses
400 machines, fail prob = 0.2, 40,000 obj, 4 replicas / obj
Here an access needs all n objects to be successful
[Figure: unavailability versus number (n) of objects requested by an access]
40
Simulation Results for Loose Accesses
400 machines, fail prob = 0.2, 40,000 obj, 4 replicas / obj
Here an access needs only t = n - 150 objects to be successful
[Figure: unavailability versus number (n) of objects requested by an access]
41
Current Status
Waiting for paper deadlines
Finishing implementing MOAT
Evaluation on IrisLog trace and file system traces
42
Related Work
Multi-object accesses are rarely addressed
  CFS [Dabek et al.’01] focuses on individual file blocks
  Chain replication [Renesse et al.’04] considers a single data object
  A long list...
Replica assignment is largely ignored
  Different DHTs (e.g., Chord, Pastry, CAN) use dramatically different replica assignments:
  Effects not understood / studied
Replica placement [Douceur et al.’01, Li et al.’99, Qiu et al.’01, Venkataramani et al.’01, Yu et al.’04] is well studied:
  Typically for machines in different locations in the network
  Machines are heterogeneous
  These approaches do not apply to replica assignment
43
Conclusions
Availability is becoming a key design goal
Multi-object access availability is dramatically different from single-object availability
MOAT contributions:
  First to observe the importance of replica assignment
  Strong theoretical results regarding the best and worst assignments
  Practical designs to approximate optimal assignments
  MOAT toolkit implementation
44
My Other Recent Work
Om [NSDI’04]: consistent and automatic replica regeneration
  Regenerate from any single replica rather than a majority
Signed quorum systems [PODC’04]:
  Constant quorum size at the cost of a small probability of inconsistency
Node failure characteristics in WAN [WORLDS’04]:
  Answer subtle questions regarding real-world failure properties
46
Erasure Coding
Encode the object into k fragments such that any m (m < k) out of the k fragments can reconstruct the object
RAID techniques are special cases
Replication is a special case where m = 1
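Under the talk's independent-failure model, the availability of a single m-out-of-k coded object is a binomial tail. The sketch below assumes each of the k fragments sits on a different machine that fails with probability p; the function name is hypothetical.

```python
from math import comb


def object_availability(k, m, p):
    """Probability that at least m of k fragments are on live machines."""
    return sum(
        comb(k, alive) * (1 - p) ** alive * p ** (k - alive)
        for alive in range(m, k + 1)
    )


# Replication is the m = 1 case: the object survives unless all k copies fail.
print(object_availability(4, 1, 0.2))  # equals 1 - 0.2**4, about 0.9984
print(object_availability(8, 4, 0.2))  # 4-out-of-8 coding
```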
47
Example Revisited
Need four files to compile:
[Figure: the two candidate assignments from the simple example, (A B, C D, A B, C D) or (A B, C D, A C, B D), with one marked "Better"]
Can we treat A, B, C, D as a single obj and use erasure coding?
  So that all files can be reconstructed from any 4 out of 8 fragments
Erasure coding is hard to apply across large amounts of data
  Updating any portion of the data requires updating k - m + 1 fragments
  [Figure label: the size of original data]
  We cannot use erasure coding across 1,000 files
48
Threshold Semantics and Erasure Coding
Threshold Semantics | Erasure Coding
need t out of n objects to answer query | need m out of k fragments to reconstruct object
t determined by app semantics | m determined at coding time
result dependent on which t objects | same result regardless of which m fragments
may update single object by itself | modification to any portion of the object needs to update k-m+1 fragments
In short, they are different, orthogonal concepts
49
Numerical Examples (from Simulation)
40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2
[Figure: unavailability versus threshold for RAND (CAN), CRAND (10), CRAND (100), PTN, and Chord; an annotation marks a c-times difference when p is small, where c is # obj / machine]