1 High-Availability in Scalable Distributed Data Structures W. Litwin Witold.Litwin@dauphine.fr
2 Plan n What are SDDSs? n High-Availability SDDSs n LH* with scalable availability n Conclusion
3 Multicomputers n A collection of loosely coupled computers –common and/or preexisting hardware –share-nothing architecture –message passing through a high-speed net (Mb/s) n Network multicomputers –use general purpose nets »LANs: Fast Ethernet, Token Ring, SCI, FDDI, Myrinet, ATM… –NCSA cluster: 512 NTs on Myrinet by the end of 1998 n Switched multicomputers –use a bus or a switch –IBM-SP2, Parsytec...
4 Client Server Network multicomputer
5 Why multicomputers? n Unbeatable price-performance ratio –Much cheaper and more powerful than supercomputers »especially the network multicomputers –1500 WSs at HPL with 500+ GB of RAM & TBs of disks n Computing power –file size, access and processing times, throughput... n For more pros & cons: –IBM SP2 and GPFS literature –Tanenbaum: "Distributed Operating Systems", Prentice Hall, 1995 –NOW project (UC Berkeley) –Bill Gates at Microsoft Scalability Day, May 1997 –www.microsoft.com White Papers from the Business Syst. Div. –Report to the President, President's Inf. Techn. Adv. Comm., Aug 98
6 Why SDDSs n Multicomputers need data structures and file systems n Trivial extensions of traditional structures are not best: hot-spots, scalability, parallel queries, distributed and autonomous clients, distributed RAM & distance to data
7 What is an SDDS? + Data are structured + records with keys / objects with an OID + more semantics than in the Unix flat-file model + abstraction popular with applications + allows for parallel scans + function shipping + Data are on servers –always available for access + Overflowing servers split into new servers –appended to the file without informing the clients + Queries come from multiple autonomous clients –available for access only on their initiative »no synchronous updates on the clients + There is no centralized directory for access computations
8 What is an SDDS? + Clients can make addressing errors »Clients have a more or less adequate image of the actual file structure + Servers are able to forward the queries to the correct address –perhaps in several messages + Servers may send Image Adjustment Messages (IAMs) »Clients do not make the same error twice n See the SDDS talk for more on it –192.134.119.81/witold.html n Or the LH* ACM-TODS paper (Dec. 96)
9–18 An SDDS (animation frames): Clients and Servers; the file grows through splits under inserts; a client addressing error triggers an IAM
19–24 Known SDDSs (built up across several slides) n DS Classics n Hash SDDS (1993): LH*, DDH, Breitbart & al n 1-d tree: RP*, Kroll & Widmayer, Breitbart & Vingralek n m-d trees: k-RP*, dPi-tree, Nardelli-tree n H-Avail.: LH*m, LH*g n Security: LH*s n s-availability: LH*SA, LH*RS
25 LH* (A classic) n Allows for primary key (OID) based hash files –generalizes the LH addressing schema »variants used in Netscape products, LH-Server, Unify, Frontpage, IIS, MsExchange... n Typical load factor 70 - 90 % n In practice, at most 2 forwarding messages –regardless of the size of the file n In general, 1 m/insert and 2 m/search on the average n 4 messages in the worst case n Search time of 1 ms (10 Mb/s net), of 150 µs (100 Mb/s net) and of 30 µs (Gb/s net)
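The addressing rule that LH* generalizes is that of linear hashing (LH). Below is a minimal client-side sketch in Python, assuming a file that started with one bucket; an LH* client applies it to its own image (i, n) of the file state, which may lag behind reality, and the forwarding plus IAM mechanism corrects that. The function name is illustrative, not from the talk.

```python
# Sketch of the linear-hashing address computation that LH* generalizes.
# An LH* client applies it to its own image (i, n) of the file state,
# which may be outdated; servers then forward the query and send an IAM.

def lh_address(key, i, n):
    """Bucket address of `key` for file level i and split pointer n."""
    a = key % (2 ** i)               # h_i(key)
    if a < n:                        # bucket a has already split at this level
        a = key % (2 ** (i + 1))     # h_{i+1}(key)
    return a

# With level i = 3 and split pointer n = 2, buckets 0 and 1 have split:
assert lh_address(17, 3, 2) == 1     # 17 mod 8 = 1 < 2, so use 17 mod 16 = 1
assert lh_address(9, 3, 2) == 9      # 9 mod 8 = 1 < 2, so use 9 mod 16 = 9
assert lh_address(20, 3, 2) == 4     # 20 mod 8 = 4 >= 2, stays with h_3
```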
26 High-availability LH* schemes n In a large multicomputer, it is unlikely that all servers are up n Consider the probability that a bucket is up to be 99 % –the bucket is unavailable about 3 days per year n If one stores every key in only 1 bucket –case of typical SDDSs, LH* included n Then the file reliability, i.e. the probability that an n-bucket file is entirely up, is: »37 % for n = 100 »0 % for n = 1000 n Acceptable to you?
27 High-availability LH* schemes n Using 2 buckets to store a key, one may expect a reliability of: –99 % for n = 100 –91 % for n = 1000 n High-availability files –make data available despite the unavailability of some servers »RAIDx, LSA, EvenOdd, DATUM... n High-availability SDDSs –make sense –are the only way to reliable large SDDS files
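A quick check of the reliability figures on the last two slides, assuming independent failures and a per-bucket availability of 99 %; the two-copy case treats a record as lost only when both buckets holding it are down.

```python
# Reproduces the reliability figures quoted above (37 % / ~0 % for one
# copy, 99 % / 91 % for two copies), assuming independent bucket failures.

def reliability_single_copy(n, p_up=0.99):
    """The file is up only if every one of its n buckets is up."""
    return p_up ** n

def reliability_two_copies(n, p_up=0.99):
    """A record is lost only if both buckets holding it are down."""
    p_down = 1.0 - p_up
    return (1.0 - p_down ** 2) ** n

for n in (100, 1000):
    print(n, round(reliability_single_copy(n), 3), round(reliability_two_copies(n), 3))
# 100  0.366  0.99
# 1000 0.0    0.905
```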
28 Known high-availability LH* schemes n Known high-availability LH* schemes keep data available under: –any single-server failure (1-availability) –any n-server failure »n fixed or scalable (n-availability or scalable availability) –some n'-server failures; n' > n n Three principles for high-availability SDDS schemes are known –mirroring (LH*m) »storage doubles; 1-availability –striping (LH*s) »affects parallel queries; 1-availability –grouping (LH*g, LH*SA, LH*RS)
29 Scalable Availability n n-availability –availability of all data despite the simultaneous unavailability of up to any n buckets »E.g., RAIDx, EvenOdd, RAV, DATUM... n Reliability –probability P that all the records are available n Problem –For every fixed choice of n, P → 0 when the file scales n Solution –Scalable availability »n grows with the file size, to regulate P –Constraint »the growth has to be incremental
30 LH*sa file (Litwin, Menon, Risch) n An LH* file with data buckets for data records –provided in addition with the availability level i in each bucket n One or more parity files with parity buckets for parity records –added when the file scales –with every bucket 0 mirroring the LH*sa file state data (i, n) n A family of grouping functions groups data buckets into groups of size k > 1 such that: –every two buckets in the same group i are in different groups i' ≠ i n There is one parity bucket per data bucket group n Within a parity bucket, a parity record is maintained for up to k data records with the same rank
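The slide only states the key property of the grouping family (two buckets share at most one group across levels). Below is a minimal sketch of one construction with that property, assuming level-i groups are formed by dropping the i-th base-k digit of the bucket number; this is an illustration, not necessarily the exact functions of the LH*sa paper.

```python
# One grouping family with the stated property: the level-i group of a
# bucket is its base-k address with digit i dropped. Two distinct buckets
# in the same level-i group differ only in digit i, so they cannot share
# a group at any other level i' != i.

def base_k_digits(bucket, k, width):
    """Base-k digits of a bucket number, least significant first."""
    return [(bucket // k ** j) % k for j in range(width)]

def group(bucket, level, k=4, width=4):
    """Level-`level` group identifier of `bucket` (levels start at 1)."""
    digits = base_k_digits(bucket, k, width)
    del digits[level - 1]      # drop the digit that varies inside the group
    return tuple(digits)

assert group(0, 1) == group(3, 1)      # buckets 0..3 form a level-1 group
assert group(0, 2) == group(12, 2)     # buckets 0, 4, 8, 12 form a level-2 group
assert group(1, 1) != group(4, 1)      # 1 and 4 are in different level-1 groups
```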
31 LH*sa File Expansion (k = 2)
32 Scalable Availability (basic schema) n Up to k data buckets, use 1-st level grouping –so there will be only one parity bucket n Then, start also 2-nd level grouping n When the file exceeds k^2 buckets, start 3-rd level grouping n When the file exceeds k^3 buckets, start 4-th level grouping n Etc.
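A tiny helper reflecting this schedule, as a sketch: 1 grouping level while the file has up to k data buckets, 2 levels up to k², 3 up to k³, and so on; the exact trigger points used in the paper may differ slightly at the boundaries.

```python
def active_grouping_levels(num_buckets, k=4):
    """Grouping levels in use under the basic schema sketched above:
    1 level up to k data buckets, 2 up to k^2, 3 up to k^3, ..."""
    level, bound = 1, k
    while num_buckets > bound:
        level += 1
        bound *= k
    return level

assert active_grouping_levels(4) == 1
assert active_grouping_levels(5) == 2      # file just exceeded k
assert active_grouping_levels(16) == 2
assert active_grouping_levels(17) == 3     # file just exceeded k^2
```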
33 LH*sa groups
34 LH*sa File Expansion (k = 2)
35 LH*sa Recovery n Bucket or record unavailability is detected –by the client during a search or update –by a forwarding server n The coordinator is alerted to perform the recovery –to bypass the unavailable bucket –or to restore the record on the fly –or to restore the bucket on a spare n The recovered record is delivered to the client
36 LH*sa Bucket & Record Recovery n Try the 1-st level group of the unavailable bucket m n If other buckets are found unavailable in this group –try to recover each of them using their 2-nd level groups n And so on… n Finally, come back to recover bucket m –See the paper for the full algorithms n For an I-available file, it is sometimes possible to recover a record even when more than I buckets in a group are unavailable
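A very simplified sketch of the recursive idea above; real LH*sa recovery reconstructs records rank by rank from the group's parity bucket, while here a bucket is simply marked available again once it is the only missing member of one of its groups. All names and the `groups` map are illustrative, not the paper's.

```python
# Simplified sketch of the recursive recovery described above. A level-i
# group can rebuild one missing member from the group's parity bucket;
# extra missing members are first repaired one level higher.

def recover_bucket(m, level, available, groups, max_level):
    """Try to make bucket m available again, starting at grouping `level`."""
    if available[m]:
        return True
    if level > max_level:
        return False                 # more failures than this level structure tolerates
    peers = [b for b in groups[level][m] if b != m]
    for peer in peers:               # repair the other missing members one level up
        if not available[peer] and not recover_bucket(peer, level + 1, available, groups, max_level):
            return False
    # m is now the only unavailable member of its level-`level` group:
    # rebuild it from the peers and the group's parity bucket.
    available[m] = True
    return True
```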
37 LH*sa normal recovery
38 Good case recovery (I = 3)
39 Scalability Analysis: Search Performance n Search –usually the same cost as for LH* »including the parallel queries »2 messages per key search –in degraded mode: »usually O(k) –record reconstruction cost using 1-st level parity »worst case: O((I+1) k)
40 Insert Performance n Usually: (I+1) or (I+2) messages –(I+1) is the best possible value for any I-availability schema n In degraded mode: –about the same if the unavailable bucket can be bypassed –add the bucket recovery cost otherwise »the client cost is only a few messages to deliver the record to the coordinator
41 Split Performance n LH* split cost of O(b/2) –b is the bucket capacity –one message per record n Plus usually O(Ib) messages to the parity buckets –to recompute (XOR) parity bits, since usually all records get new ranks n Plus O(b) messages when a new bucket is created
42 Storage Performance n Storage overhead cost Cs = S'/S –S' - storage for the parity files –S - storage for the data buckets »practically, the LH* file storage cost n Cs depends on the file availability level I reached n While building up level I+1: –Cs starts from the lower bound L_I = I/k »for file size M = k^I »the best possible value for any I-availability schema –increases towards an upper bound U_(I+1) = O(½ + I/k) »as long as new splits add parity buckets –decreases towards L_(I+1) afterwards
43 Example
44 Reliability n Probability P that all records are available to the application –all the data buckets are available –or every record can be recovered »at most I buckets are failed in any LH*sa group n Depends on –the failure probability p of each site –the group size k –the file size M n The reliability of the basic LH*sa schema is termed uncontrolled
45 LH*sa Uncontrolled Reliability
46 Controlled Reliability n To keep the reliability above, or close to, a given threshold through –delaying or accelerating the availability level growth –or gracefully changing the group size k n Necessary for higher values of p –case of less reliable sites »a frequent situation on network multicomputers n May improve performance for small p's n Several schemes are possible
47 Controlled Reliability with Fixed Group Size p = 0.2 k = 4 T = 0.8
48 Controlled Reliability with Variable Group Size p = 0.01 T = 0.95
49 LH*RS (Litwin & Schwarz) n Single grouping function –1234, 5678… n Multiple parity buckets per group n Scalable availability –1 parity bucket per group until 2^i1 buckets –Then, at each split, add a 2nd parity bucket to each existing group, or create 2 parity buckets for new groups, until 2^i2 buckets –etc.
50–52 LH*RS File Expansion (animation frames)
53 LH*RS Parity Calculus n Choose GF(2^l) –typically GF(16) or GF(256) n Create the k x n generator matrix G –using elementary transformations of an extended Vandermonde matrix of GF elements –k is the record group size –n = 2^l is the max segment size (data and parity records) –G = [I | P] –I denotes the identity matrix n Each record is a sequence of symbols from GF(2^l) n The k symbols with the same offset in the records of a group form the (horizontal) information vector U n The matrix multiplication U G provides the codeword vector, whose last (n - k) symbols are the parity symbols
54 LH*RS Parity Calculus n Parity calculus is distributed to the parity buckets –each column is at one bucket n Parity is calculated only for existing data and parity buckets –at each insert, delete and update n Adding new parity buckets does not change existing parity records
55 Example: GF(4) n Addition: XOR n Multiplication: direct table or log / antilog tables
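A runnable version of this GF(4) example, writing the field elements as the integers 0..3: addition is bitwise XOR, and multiplication goes through log / antilog tables for the generator α = 2. The later encoding and recovery sketches reuse these helpers.

```python
# GF(4) = {0, 1, 2, 3}: addition is XOR; multiplication and inversion use
# log / antilog tables over the generator alpha = 2 (2^0=1, 2^1=2, 2^2=3).

ANTILOG = [1, 2, 3]            # alpha^0, alpha^1, alpha^2
LOG = {1: 0, 2: 1, 3: 2}       # inverse table (0 has no logarithm)

def gf_add(a, b):
    return a ^ b               # addition and subtraction coincide in GF(2^l)

def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return ANTILOG[(LOG[a] + LOG[b]) % 3]

def gf_inv(a):
    return ANTILOG[(-LOG[a]) % 3]

assert gf_mul(2, 2) == 3 and gf_mul(2, 3) == 1 and gf_mul(3, 3) == 2
assert all(gf_mul(a, gf_inv(a)) == 1 for a in (1, 2, 3))
```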
56–58 Encoding (animation frames): the records of a group and the codewords computed from them
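A small encoding sketch continuing the GF(4) helpers above, for a group of k = 2 data buckets and n = 4 codeword positions (so 2 parity buckets). The talk derives the parity part P of G = [I | P] from an extended Vandermonde matrix; to keep the example short this sketch uses a Cauchy matrix instead, another standard way to obtain a systematic MDS generator, so the concrete numbers are illustrative rather than the talk's.

```python
# Systematic generator G = [I | P] over GF(4), k = 2 data columns and
# 2 parity columns. P is a Cauchy matrix P[i][j] = 1 / (x_i + y_j) with
# all points distinct, which guarantees that any 2 columns of G are
# linearly independent (the MDS property used later for recovery).
X, Y = (0, 1), (2, 3)
P = [[gf_inv(gf_add(x, y)) for y in Y] for x in X]      # [[3, 2], [2, 3]]
G = [[1, 0] + P[0], [0, 1] + P[1]]                      # 2 x 4

def encode(u):
    """Codeword u * G for an information vector u of k = 2 GF(4) symbols."""
    return [gf_add(gf_mul(u[0], G[0][c]), gf_mul(u[1], G[1][c])) for c in range(4)]

# The k symbols with the same offset in a group's data records form u;
# the first 2 codeword symbols are the data, the last 2 go to parity buckets.
assert encode([1, 0]) == [1, 0, 3, 2]
assert encode([2, 3]) == [2, 3, 0, 1]
```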
59 LH*RS Recovery Calculus n Performed when at most n - k buckets are unavailable among the data and the parity buckets of a group: n Choose k available buckets n Form the submatrix H of G from the corresponding columns n Invert this matrix into the matrix H^-1 n Multiply the horizontal vector S of available symbols with the same offset by H^-1 n The result contains the recovered data and/or parity symbols
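Continuing the k = 2 sketch: with at most n - k = 2 positions lost, take the 2 surviving columns of G, invert that 2 x 2 submatrix H (a closed form suffices for 2 x 2 over GF(2^l)), and multiply the surviving symbols by H^-1 to get the information vector back; lost parity symbols can then be re-encoded. Again a sketch built on the Cauchy-based G above, not the talk's exact matrices.

```python
# Recovery for the k = 2 example: any 2 surviving codeword positions
# determine the information vector u, since every 2 x 2 submatrix of G
# is invertible.

def inv2x2(h):
    """Inverse of [[a, b], [c, d]] over GF(2^l): (1/det) * [[d, b], [c, a]]
    (the minus signs of the usual formula vanish in characteristic 2)."""
    a, b, c, d = h[0][0], h[0][1], h[1][0], h[1][1]
    det_inv = gf_inv(gf_add(gf_mul(a, d), gf_mul(b, c)))
    return [[gf_mul(det_inv, d), gf_mul(det_inv, b)],
            [gf_mul(det_inv, c), gf_mul(det_inv, a)]]

def recover(surviving):
    """`surviving` maps 2 codeword positions to their symbols; returns u."""
    cols = sorted(surviving)
    h = [[G[r][c] for c in cols] for r in range(2)]     # submatrix H of G
    h_inv = inv2x2(h)
    s = [surviving[c] for c in cols]                    # vector S of available symbols
    return [gf_add(gf_mul(s[0], h_inv[0][r]), gf_mul(s[1], h_inv[1][r])) for r in range(2)]

# Lose both data buckets of the codeword [2, 3, 0, 1] from the encoding
# sketch: the two parity symbols alone restore the data symbols 2 and 3.
assert recover({2: 0, 3: 1}) == [2, 3]
```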
60–63 Recovery example (animation frames): buckets, then the recovered symbols / buckets
64 Conclusion n High-availability is an important property of an SDDS n Its design should preserve scalability, parallelism & reliability n Schemes using record grouping seem the most appropriate
65 Future Work n Performance analysis of LH*RS n Implementation of any of the high-availability SDDSs –LH*RS is now implemented at CERIA by Mattias Ljungström n High-availability variants of other known SDDSs
66 End Witold Litwin witold.litwin@dauphine.fr Thank you for your attention