1 High-Availability in Scalable Distributed Data Structures
W. Litwin (Witold.Litwin@dauphine.fr)

2 Plan
- What are SDDSs?
- High-availability SDDSs
- LH* with scalable availability
- Conclusion

3 Multicomputers
- A collection of loosely coupled computers
  - common and/or preexisting hardware
  - shared-nothing architecture
  - message passing through a high-speed net (Mb/s range)
- Network multicomputers
  - use general-purpose nets
    - LANs: Fast Ethernet, Token Ring, SCI, FDDI, Myrinet, ATM...
  - NCSA cluster: 512 NTs on Myrinet by the end of 1998
- Switched multicomputers
  - use a bus, or a switch
  - IBM-SP2, Parsytec...

4 Network multicomputer [figure: clients and servers connected by a network]

5 Why multicomputers?
- Unbeatable price/performance ratio
  - much cheaper and more powerful than supercomputers
    - especially the network multicomputers
  - 1500 WSs at HPL with 500+ GB of RAM & TBs of disks
- Computing power
  - file size, access and processing times, throughput...
- For more pros & cons:
  - IBM SP2 and GPFS literature
  - Tanenbaum: "Distributed Operating Systems", Prentice Hall, 1995
  - NOW project (UC Berkeley)
  - Bill Gates at Microsoft Scalability Day, May 1997
  - www.microsoft.com White Papers from Business Syst. Div.
  - Report to the President, President's Inf. Techn. Adv. Comm., Aug 98

6 Why SDDSs
- Multicomputers need data structures and file systems
- Trivial extensions of traditional structures are not best:
  - hot-spots
  - scalability
  - parallel queries
  - distributed and autonomous clients
  - distributed RAM & distance to data

7 What is an SDDS?
- Data are structured
  - records with keys / objects with an OID
  - more semantics than in the Unix flat-file model
  - abstraction popular with applications
  - allows for parallel scans and function shipping
- Data are on servers
  - always available for access
- Overflowing servers split into new servers
  - appended to the file without informing the clients
- Queries come from multiple autonomous clients
  - available for access only on their initiative
  - no synchronous updates on the clients
- There is no centralized directory for access computations

8 What is an SDDS? (cont.)
- Clients can make addressing errors
  - clients have a more or less adequate image of the actual file structure
- Servers are able to forward the queries to the correct address
  - perhaps in several messages
- Servers may send Image Adjustment Messages (IAMs)
  - clients do not make the same error twice
- See the SDDS talk for more on it
  - 192.134.119.81/witold.html
- Or the LH* ACM-TODS paper (Dec. 96)

9-18 An SDDS [animation: clients and servers; the file grows through splits under inserts; one frame shows an IAM sent back to a client]

19-24 Known SDDSs [classification diagram, built up over several slides]
- DS Classics
- Hash (1993): LH*, DDH, Breitbart & al., Breitbart & Vingralek
  - High availability: LH*m, LH*g
  - Security: LH*s
  - Scalable (s-)availability: LH*SA, LH*RS
- 1-d tree: RP*, Kroll & Widmayer
- m-d trees: k-RP*, dPi-tree, Nardelli-tree

25 LH* (a classic)
- Allows for primary-key (OID) based hash files
  - generalizes the LH addressing schema
    - variants used in Netscape products, LH-Server, Unify, Frontpage, IIS, MsExchange...
- Typical load factor 70-90%
- In practice, at most 2 forwarding messages
  - regardless of the size of the file
- In general, 1 message per insert and 2 messages per key search on average
- 4 messages in the worst case
- Search time of 1 ms (10 Mb/s net), 150 µs (100 Mb/s net) and 30 µs (Gb/s net)
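
The LH addressing rule that LH* generalizes can be sketched as follows. This is a minimal client-side sketch, assuming a single initial bucket and the usual LH image parameters (file level i, split pointer n); the real LH* client additionally relies on server-side verification, forwarding and IAMs, which are not shown.

```python
def lh_address(key_hash: int, i: int, n: int) -> int:
    """Classic LH addressing with the (possibly outdated) image (i, n):
    h_i(c) = c mod 2**i; buckets below the split pointer n have already
    been split and therefore use h_{i+1}."""
    a = key_hash % (2 ** i)
    if a < n:                      # bucket a was already split
        a = key_hash % (2 ** (i + 1))
    return a

# Example: a client whose image is (i=3, n=2) sends the query for
# key_hash=10 to bucket lh_address(10, 3, 2) = 10 mod 8 = 2.
# If the image is outdated, the server forwards the query and the
# client later receives an IAM refreshing (i, n).
```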

26 High-availability LH* schemes
- In a large multicomputer, it is unlikely that all servers are up
- Consider that the probability that a bucket is up is 99%
  - the bucket is unavailable 3 days per year
- If one stores every key in only 1 bucket
  - the case of typical SDDSs, LH* included
- Then the file reliability, i.e. the probability that an n-bucket file is entirely up, is:
  - 37% for n = 100
  - 0% for n = 1000
- Acceptable for yourself?

27 High-availability LH* schemes (cont.)
- Using 2 buckets to store a key, one may expect a reliability of:
  - 99% for n = 100
  - 91% for n = 1000
- High-availability files
  - make data available despite the unavailability of some servers
    - RAIDx, LSA, EvenOdd, DATUM...
- High-availability SDDSs
  - make sense
  - are the only way to make large SDDS files reliable
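
The figures on this and the previous slide follow from a simple independence model. A quick sketch reproducing them, assuming each bucket is up independently with probability 0.99 and that the 2-bucket variant loses a record only when both of its buckets are down:

```python
p_up = 0.99                      # probability a single bucket is up
p_down = 1 - p_up

for n in (100, 1000):
    # every record in exactly one bucket: all n buckets must be up
    single = p_up ** n
    # every record in 2 buckets: lost only if both copies are down
    two_copies = (1 - p_down ** 2) ** n
    print(f"n={n}: single-copy {single:.2f}, two-copy {two_copies:.2f}")

# prints roughly 0.37 / 0.99 for n = 100 and 0.00 / 0.90 for n = 1000,
# matching the 37%, 0%, 99% and 91% figures quoted on the slides
```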

28 Known high-availability LH* schemes
- Known high-availability LH* schemes keep data available under:
  - any single-server failure (1-availability)
  - any n-server failure
    - n fixed or scalable (n-availability or scalable availability)
  - some n'-server failures, n' > n
- Three principles for high-availability SDDS schemes are known:
  - mirroring (LH*m)
    - storage doubles; 1-availability
  - striping (LH*s)
    - affects parallel queries; 1-availability
  - grouping (LH*g, LH*SA, LH*RS)

29 Scalable availability
- n-availability
  - availability of all data despite the simultaneous unavailability of up to any n buckets
    - e.g., RAIDx, EvenOdd, RAV, DATUM...
- Reliability
  - probability P that all the records are available
- Problem
  - for every choice of n, P → 0 when the file scales
- Solution
  - scalable availability: n grows with the file size, to regulate P
  - constraint: the growth has to be incremental

30 LH*sa file (Litwin, Menon, Risch)
- An LH* file with data buckets for data records
  - provided in addition with the availability level i in each bucket
- One or more parity files with parity buckets for parity records
  - added when the file scales
  - with every bucket 0 mirroring the LH*sa file state data (i, n)
- A family of grouping functions groups the data buckets into groups of size k > 1 such that:
  - every two buckets in the same level-i group are in different level-i' groups for every i' ≠ i
  - (one possible family with this property is sketched below)
- There is one parity bucket per data-bucket group
- Within a parity bucket, a parity record is maintained for up to k data records with the same rank
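
One family with the stated property can be built by deleting one base-k digit of the bucket number: two distinct buckets sharing a level-l group then differ only in digit l, so they fall into different groups at every other level. This is an illustrative sketch of such a family only; the exact grouping functions of the LH*SA paper may differ in their details.

```python
def group(m: int, level: int, k: int) -> tuple:
    """Level-`level` group of data bucket m (levels numbered from 1):
    drop digit number `level` of m written in base k. Buckets sharing
    a level-l group differ only in digit l, hence belong to different
    groups at every level l' != l."""
    digits = []
    while m:
        digits.append(m % k)
        m //= k
    while len(digits) < level:     # pad so the requested digit exists
        digits.append(0)
    del digits[level - 1]
    return tuple(digits)

# k = 2: buckets 2 and 3 share their level-1 group but not their level-2 group
print(group(2, 1, 2), group(3, 1, 2))   # (1,) (1,)
print(group(2, 2, 2), group(3, 2, 2))   # (0,) (1,)
```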

31 LH*sa file expansion (k = 2) [figure]

32 Scalable availability (basic schema)
- Up to k data buckets, use 1st-level grouping
  - so there will be only one parity bucket
- Then, also start 2nd-level grouping
- When the file exceeds k^2 buckets, start 3rd-level grouping
- When the file exceeds k^3 buckets, start 4th-level grouping
- Etc.
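
Under this basic schema the number of grouping levels, and with it the availability level, grows roughly with the base-k logarithm of the file size. A small sketch of the rule exactly as stated on the slide (the paper's thresholds may be phrased slightly differently):

```python
def grouping_levels(num_buckets: int, k: int) -> int:
    """Number of grouping levels in use: the (l+1)-th level starts
    once the file exceeds k**l data buckets."""
    levels = 1
    while num_buckets > k ** levels:
        levels += 1
    return levels

for m in (1, 4, 5, 16, 17, 64, 65):
    print(m, grouping_levels(m, k=4))
# k = 4: 1..4 buckets -> 1 level, 5..16 -> 2, 17..64 -> 3, 65.. -> 4
```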

33 LH*sa groups [figure]

34 LH*sa file expansion (k = 2) [figure]

35 LH*sa recovery
- Bucket or record unavailability is detected
  - by the client during a search or update
  - by a forwarding server
- The coordinator is alerted to perform the recovery
  - to bypass the unavailable bucket
  - or to restore the record on the fly
  - or to restore the bucket in a spare
- The recovered record is delivered to the client

36 LH*sa bucket & record recovery
- Try the 1st-level group for the unavailable bucket m
- If other buckets are found unavailable in this group
  - try to recover each of them using 2nd-level groups
- And so on...
- Finally, come back to recover bucket m
  - see the paper for the full algorithms
- For an I-available file, it is sometimes possible to recover a record even when more than I buckets in a group are unavailable
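
A very schematic sketch of the recursion described on this slide, showing the control flow only. The group membership, the parity buckets and the actual reconstruction are abstracted into assumed helpers (`other_members`, `rebuild_from_group`); this is not the paper's full algorithm.

```python
def recover_bucket(m, level, available, other_members, rebuild_from_group, max_level):
    """Recover unavailable bucket m using its level-`level` group.
    Any other unavailable bucket of that group is first recovered
    through the next grouping level, then m itself is rebuilt."""
    if level > max_level:
        raise RuntimeError("more failures than the availability level covers")
    for b in other_members(m, level):      # the other data buckets of the group
        if b not in available:
            recover_bucket(b, level + 1, available,
                           other_members, rebuild_from_group, max_level)
    rebuild_from_group(m, level)           # e.g. XOR of the survivors + parity record
    available.add(m)
```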

37 LH*sa normal recovery [figure]

38 Good-case recovery (I = 3) [figure]

39 Scalability analysis: search performance
- Search
  - usually the same cost as for LH*
    - including the parallel queries
    - 2 messages per key search
  - in degraded mode:
    - usually O(k): the record reconstruction cost using 1st-level parity
    - worst case: O((I+1)k)

40 Insert performance
- Usually: (I+1) or (I+2) messages
  - (I+1) is the best possible value for any I-availability schema
- In degraded mode:
  - about the same if the unavailable bucket can be bypassed
  - add the bucket recovery cost otherwise
    - the client cost is only a few messages to deliver the record to the coordinator

41 Split performance
- The LH* split cost, which is O(b/2)
  - b is the bucket capacity
  - one message per record
- Plus usually O(Ib) messages to parity buckets
  - to recompute (XOR) parity bits, since usually all records get new ranks
- Plus O(b) messages when a new bucket is created

42 Storage performance
- Storage overhead cost C_s = S' / S
  - S': storage for the parity files
  - S: storage for the data buckets
    - practically, the LH* file storage cost
- C_s depends on the file availability level I reached
- To build a new level I + 1:
  - C_s starts from the lower bound L_I = I/k
    - for file size M = k^I
    - the best possible value for any I-availability schema
  - increases towards an upper bound U_{I+1} ≈ O(1/2 + I/k)
    - as long as new splits add parity buckets
  - decreases towards L_{I+1} afterwards

43 Example [figure]

44 Reliability
- Probability P that all records are available to the application
  - all the data buckets are available
  - or every record can be recovered
    - there are at most I failed buckets in an LH*sa group
- Depends on
  - the failure probability p of each site
  - the group size k
  - the file size M
- The reliability of the basic LH*sa schema is termed uncontrolled

45 Uncontrolled reliability of LH*sa [graph]

46 Controlled reliability
- Keeps the reliability above or close to a given threshold through
  - delaying or accelerating the availability-level growth
  - or gracefully changing the group size k
- Necessary for higher values of p
  - the case of less reliable sites
    - a frequent situation on network multicomputers
- May improve performance for small p's
- Several schemes are possible

47 Controlled reliability with fixed group size [graph: p = 0.2, k = 4, T = 0.8]

48 Controlled reliability with variable group size [graph: p = 0.01, T = 0.95]

49 LH*RS (Litwin & Schwarz)
- Single grouping function
  - buckets 1, 2, 3, 4 form one group; 5, 6, 7, 8 the next; ...
- Multiple parity buckets per group
- Scalable availability
  - 1 parity bucket per group until 2^i1 buckets
  - then, at each split, add a 2nd parity bucket to each existing group, or create 2 parity buckets for new groups, until 2^i2 buckets
  - etc.
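
The target number of parity buckets per group is thus a step function of the file size. A small sketch of that rule, assuming the thresholds 2^i1 < 2^i2 < ... are given as a list; in the actual scheme the extra parity buckets are rolled out progressively at splits, while this sketch only returns the target level.

```python
def parity_buckets_per_group(num_buckets: int, thresholds: list) -> int:
    """Target number of parity buckets per LH*RS group for a file of
    `num_buckets` data buckets: one more parity bucket each time the
    file outgrows the next threshold 2**i1, 2**i2, ..."""
    return 1 + sum(1 for t in thresholds if num_buckets > t)

# e.g. with i1 = 4 and i2 = 7 (illustrative values, not from the paper):
for m in (10, 16, 17, 128, 129):
    print(m, parity_buckets_per_group(m, [2**4, 2**7]))
# -> 1, 1, 2, 2, 3 parity buckets per group
```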

50-52 LH*RS file expansion [figures: successive expansion steps]

53 LH*RS parity calculus
- Choose GF(2^l)
  - typically GF(16) or GF(256)
- Create the k x n generator matrix G
  - using elementary transformations of an extended Vandermonde matrix of GF elements
  - k is the record group size
  - n = 2^l is the max segment size (data and parity records)
  - G = [I | P]
  - I denotes the identity matrix
- Each record is a sequence of symbols from GF(2^l)
- The k symbols with the same offset in the records of a group become the (horizontal) information vector U
- The matrix multiplication U · G gives the codeword vector; its last (n - k) symbols are the parity symbols

54 LH*RS parity calculus (cont.)
- The parity calculus is distributed over the parity buckets
  - each parity column is held at one bucket
- Parity is calculated only for the existing data and parity buckets
  - at each insert, delete and update
- Adding new parity buckets does not change the existing parity records

55 Example: GF(4)
- Addition: XOR
- Multiplication: direct table, or log/antilog tables
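
A minimal sketch of this GF(4) arithmetic in code, using log/antilog tables over the generator a (with a^2 = a + 1), plus a tiny systematic encoding in the U · [I | P] style of the previous slides. The parity matrix P below is an illustrative placeholder, not the Vandermonde-derived matrix of the actual LH*RS scheme.

```python
# GF(4) = {0, 1, 2, 3}; 2 stands for the generator a, 3 for a + 1 (a^2 = a + 1).
LOG = {1: 0, 2: 1, 3: 2}          # log base a
ANTILOG = [1, 2, 3]               # a^0, a^1, a^2

def gf_add(x, y):                 # addition (and subtraction) is XOR
    return x ^ y

def gf_mul(x, y):                 # multiplication via the log/antilog tables
    if x == 0 or y == 0:
        return 0
    return ANTILOG[(LOG[x] + LOG[y]) % 3]

def encode(u, p):
    """Systematic encoding: codeword = [u | u . P] for an information
    vector u of k symbols and a k x (n-k) parity matrix P over GF(4)."""
    parity = [0] * len(p[0])
    for i, ui in enumerate(u):
        for j, pij in enumerate(p[i]):
            parity[j] = gf_add(parity[j], gf_mul(ui, pij))
    return list(u) + parity

# Illustrative k = 2 group with 2 parity symbols (P is made up for the demo):
P = [[1, 2],
     [1, 3]]
print(encode([1, 1], P))   # -> [1, 1, 0, 1]
```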

56-58 Encoding [figures: the records of a group and the resulting codewords, built up symbol by symbol]

59 LH*RS recovery calculus
- Performed when at most n - k buckets are unavailable among the data and the parity buckets of a group:
  - choose k available buckets
  - form the k x k submatrix H of G from the corresponding columns
  - invert this matrix into H^-1
  - multiply the horizontal vector S of the available symbols with the same offset by H^-1
  - the result contains the recovered data and/or parity symbols
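
A small end-to-end sketch of this pick-columns / invert / multiply recovery over GF(4), repeating the GF(4) helpers so the snippet runs standalone. The toy k = 2 generator matrix is invented for the demo and reuses the placeholder P from the encoding sketch; the real scheme derives G from an extended Vandermonde matrix.

```python
from functools import reduce

LOG = {1: 0, 2: 1, 3: 2}                 # GF(4) log/antilog tables (a^2 = a + 1)
ANTILOG = [1, 2, 3]

def gf_add(x, y): return x ^ y           # addition = XOR

def gf_mul(x, y):
    if x == 0 or y == 0: return 0
    return ANTILOG[(LOG[x] + LOG[y]) % 3]

def gf_inv(x): return ANTILOG[(3 - LOG[x]) % 3]

def mat_inv(h):
    """Gauss-Jordan inversion of a k x k matrix over GF(4)."""
    k = len(h)
    a = [row[:] + [int(i == j) for j in range(k)] for i, row in enumerate(h)]
    for col in range(k):
        piv = next(r for r in range(col, k) if a[r][col])   # nonzero pivot
        a[col], a[piv] = a[piv], a[col]
        f = gf_inv(a[col][col])
        a[col] = [gf_mul(f, x) for x in a[col]]
        for r in range(k):
            if r != col and a[r][col]:
                g = a[r][col]
                a[r] = [gf_add(x, gf_mul(g, y)) for x, y in zip(a[r], a[col])]
    return [row[k:] for row in a]

def vec_mat(v, m):
    """Row vector times matrix over GF(4)."""
    return [reduce(gf_add, (gf_mul(v[i], m[i][j]) for i in range(len(v))), 0)
            for j in range(len(m[0]))]

# Toy systematic generator matrix G = [I | P]: k = 2 data + 2 parity symbols.
G = [[1, 0, 1, 2],
     [0, 1, 1, 3]]
U = [2, 3]                               # information vector at one offset
C = vec_mat(U, G)                        # full codeword: [2, 3, 1, 1]

# Suppose both data buckets are unavailable; only columns 2 and 3 survive.
avail = [2, 3]
H = [[G[i][j] for j in avail] for i in range(2)]   # submatrix of those columns
S = [C[j] for j in avail]                          # surviving symbols
print(vec_mat(S, mat_inv(H)))                      # -> [2, 3], U recovered
```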

60-63 Example: recovery [figures: the available buckets of a group, then the recovered symbols / buckets]

64 Conclusion
- High availability is an important property of an SDDS
- Its design should preserve scalability, parallelism & reliability
- Schemes using record grouping seem the most appropriate

65 Future work
- Performance analysis of LH*RS
- Implementation of the high-availability SDDSs
  - LH*RS is now being implemented at CERIA by Mattias Ljungström
- High-availability variants of other known SDDSs

66 End
Witold Litwin (witold.litwin@dauphine.fr)
Thank you for your attention
