1 High-Availability in Scalable Distributed Data Structures W. Litwin

2 Plan
- What are SDDSs?
- High-availability SDDSs
- LH* with scalable availability
- Conclusion

3 Multicomputers
- A collection of loosely coupled computers
  - common and/or preexisting hardware
  - shared-nothing architecture
  - message passing through a high-speed net (… Mb/s)
- Network multicomputers
  - use general-purpose nets
    - LANs: Fast Ethernet, Token Ring, SCI, FDDI, Myrinet, ATM…
  - NCSA cluster: 512 NTs on Myrinet by the end of 1998
- Switched multicomputers
  - use a bus, or a switch
  - IBM-SP2, Parsytec...

4 Network multicomputer (figure: clients and servers)

5 Why multicomputers?
- Unbeatable price-performance ratio
  - much cheaper and more powerful than supercomputers
    - especially the network multicomputers
  - 1500 WSs at HPL with 500+ GB of RAM & TBs of disks
- Computing power
  - file size, access and processing times, throughput...
- For more pros & cons:
  - IBM SP2 and GPFS literature
  - Tanenbaum: "Distributed Operating Systems", Prentice Hall, 1995
  - NOW project (UC Berkeley)
  - Bill Gates at Microsoft Scalability Day, May 1997
  - White Papers from Business Syst. Div.
  - Report to the President, President's Inf. Techn. Adv. Comm., Aug 98

6 Why SDDSs
- Multicomputers need data structures and file systems
- Trivial extensions of traditional structures are not best
  - hot-spots
  - scalability
  - parallel queries
  - distributed and autonomous clients
  - distributed RAM & distance to data

7 What is an SDDS?
- Data are structured
  - records with keys; objects with an OID
  - more semantics than in the Unix flat-file model
  - abstraction popular with applications
  - allows for parallel scans and function shipping
- Data are on servers
  - always available for access
- Overflowing servers split into new servers
  - appended to the file without informing the clients
- Queries come from multiple autonomous clients
  - available for access only on their initiative
    - no synchronous updates on the clients
- There is no centralized directory for access computations

8 What is an SDDS? (continued)
- Clients can make addressing errors
  - clients have a more or less adequate image of the actual file structure
- Servers are able to forward the queries to the correct address
  - perhaps in several messages
- Servers may send Image Adjustment Messages (IAMs)
  - clients do not make the same error twice
- See the SDDS talk for more on it
  - /witold.html
- Or the LH* ACM-TODS paper (Dec. 96)

9-13 An SDDS (figures): clients and servers; the file grows through splits under inserts

14-18 An SDDS (figures): clients addressing the file; one frame shows an IAM sent back to a client

19-24 Known SDDSs (taxonomy, built up over several slides)
- DS Classics
- SDDS (1993)
  - Hash: LH*, DDH, Breitbart & al
  - 1-d tree: RP*, Kroll & Widmayer, Breitbart & Vingralek
  - m-d trees: k-RP*, dPi-tree, Nardelli-tree
  - High availability: LH*m, LH*g
  - s-availability: LH*SA, LH*RS
  - Security: LH*s

25 LH* (a classic)
- Allows for primary-key (OID) based hash files
  - generalizes the LH addressing schema
    - variants used in Netscape products, LH-Server, Unify, Frontpage, IIS, MsExchange...
- Typical load factor: … %
- In practice, at most 2 forwarding messages
  - regardless of the size of the file
- In general, 1 message per insert and 2 messages per key search on average
- 4 messages in the worst case
- Search time of 1 ms (10 Mb/s net), 150 µs (100 Mb/s net) and 30 µs (Gb/s net)
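A minimal sketch of the client-side addressing behind these numbers, in Python (the names are illustrative and the image-adjustment rule is the one from the LH* ACM-TODS paper cited above, so treat this as a sketch rather than the reference implementation):

```python
def lh_address(key: int, i: int, n: int) -> int:
    """Linear-hashing address: h_i(key) = key mod 2^i, corrected by the split pointer n."""
    a = key % (2 ** i)
    if a < n:                      # bucket a has already split in this round
        a = key % (2 ** (i + 1))
    return a

class LHStarClient:
    """An LH* client keeps a possibly outdated image (i', n') of the file state."""
    def __init__(self):
        self.i, self.n = 0, 0      # a new client assumes a one-bucket file

    def address(self, key: int) -> int:
        return lh_address(key, self.i, self.n)

    def apply_iam(self, j: int, a: int) -> None:
        """Image Adjustment Message: level j of the forwarding bucket a."""
        if j > self.i:
            self.i, self.n = j - 1, a + 1
        if self.n >= 2 ** self.i:
            self.n, self.i = 0, self.i + 1
```

For example, a fresh client addresses every key to bucket 0; after the first IAM (say j = 1 from bucket a = 0) its image becomes (i' = 1, n' = 0), and it never repeats that particular error.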

26 High-availability LH* schemes
- In a large multicomputer, it is unlikely that all servers are up
- Consider a probability of 99 % that a bucket is up
  - the bucket is unavailable 3 days per year
- If one stores every key in only 1 bucket
  - the case of typical SDDSs, LH* included
- Then the file reliability, i.e., the probability that an n-bucket file is entirely up, is:
  - 37 % for n = 100
  - 0 % for n = 1000
- Acceptable for yourself?

27 High-availability LH* schemes
- Using 2 buckets to store a key, one may expect a reliability of:
  - 99 % for n = 100
  - 91 % for n = 1000
- High-availability files
  - make data available despite the unavailability of some servers
    - RAIDx, LSA, EvenOdd, DATUM...
- High-availability SDDSs
  - make sense
  - are the only way to obtain reliable large SDDS files
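These figures can be reproduced with a two-line model (assuming independent bucket failures; the two-copy case counts a record as lost only when both of its buckets are down):

```python
p_up = 0.99                                      # probability that one bucket is up
for n in (100, 1000):
    one_copy = p_up ** n                         # every key stored in exactly one bucket
    two_copies = (1 - (1 - p_up) ** 2) ** n      # lost only if both copies are down
    print(f"n={n}: one copy {one_copy:.0%}, two copies {two_copies:.1%}")
# n=100 : one copy 37%, two copies 99.0%
# n=1000: one copy  0%, two copies 90.5% (the slide rounds this to 91%)
```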

28 Known high-availability LH* schemes
- Known high-availability LH* schemes keep data available under:
  - any single-server failure (1-availability)
  - any n-server failure
    - n fixed or scalable (n-availability or scalable availability)
  - some n'-server failures, n' > n
- Three principles for high-availability SDDS schemes are known
  - mirroring (LH*m)
    - storage doubles; 1-availability
  - striping (LH*s)
    - affects parallel queries; 1-availability
  - grouping (LH*g, LH*SA, LH*RS)

29 Scalable availability
- n-availability
  - availability of all data despite the simultaneous unavailability of up to any n buckets
    - e.g., RAIDx, EvenOdd, RAV, DATUM...
- Reliability
  - probability P that all the records are available
- Problem
  - for every choice of n, P → 0 when the file scales
- Solution
  - scalable availability
    - n grows with the file size, to regulate P
  - constraint: the growth has to be incremental

30 LH*sa file (Litwin, Menon, Risch)
- An LH* file with data buckets for data records
  - provided in addition with the availability level i in each bucket
- One or more parity files with parity buckets for parity records
  - added when the file scales
  - with bucket 0 of every parity file mirroring the LH*sa file state data (i, n)
- A family of grouping functions groups data buckets into groups of size k > 1 such that:
  - every two buckets in the same group at level i are in different groups at every level i' ≠ i
- There is one parity bucket per data bucket group
- Within a parity bucket, a parity record is maintained for up to k data records with the same rank
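One construction of such a family of grouping functions (a sketch only; the exact functions are given in the LH*sa paper, but this one satisfies the property stated above): write the bucket number in base k and let the level-l group be identified by the digit string with digit l-1 dropped, so two buckets in the same level-l group differ in exactly that digit and therefore fall into different groups at every other level.

```python
def digits_base_k(a: int, k: int, width: int) -> list[int]:
    """Base-k digits of bucket number a, least significant digit first."""
    return [(a // k ** j) % k for j in range(width)]

def group_id(a: int, k: int, level: int, width: int) -> tuple[int, ...]:
    """Level-`level` group of bucket a: its base-k digits with digit (level-1) dropped."""
    d = digits_base_k(a, k, width)
    return tuple(d[:level - 1] + d[level:])

# Example with k = 2 and buckets 0..7 (width = 3):
#   level 1 groups: [0,1] [2,3] [4,5] [6,7]   (buckets differing in digit 0)
#   level 2 groups: [0,2] [1,3] [4,6] [5,7]   (buckets differing in digit 1)
#   level 3 groups: [0,4] [1,5] [2,6] [3,7]   (buckets differing in digit 2)
for level in (1, 2, 3):
    groups: dict[tuple[int, ...], list[int]] = {}
    for a in range(8):
        groups.setdefault(group_id(a, 2, level, 3), []).append(a)
    print(level, sorted(groups.values()))
```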

31 LH*sa File Expansion (k = 2)

32 Scalable availability (basic schema)
- Up to k data buckets, use 1st-level grouping
  - so there will be only one parity bucket
- Then, start also 2nd-level grouping
- When the file exceeds k² buckets, start 3rd-level grouping
- When the file exceeds k³ buckets, start 4th-level grouping
- Etc.
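The schedule above boils down to a tiny rule: level-l grouping is in use once the file exceeds k^(l-1) data buckets. A one-function sketch (integer arithmetic, to avoid floating-point logarithms):

```python
def availability_levels(n_buckets: int, k: int) -> int:
    """Number of grouping levels in use for a file of n_buckets data buckets:
    the smallest l such that n_buckets <= k**l."""
    level, capacity = 1, k
    while n_buckets > capacity:
        level += 1
        capacity *= k
    return level

# k = 4: up to 4 buckets -> 1 level, up to 16 -> 2 levels, up to 64 -> 3 levels, ...
```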

33 LH*sa groups

34 LH*sa File Expansion (k = 2)

35 LH*sa recovery
- Bucket or record unavailability is detected
  - by the client, during a search or update
  - or by a forwarding server
- The coordinator is alerted to perform the recovery
  - to bypass the unavailable bucket
  - or to restore the record on the fly
  - or to restore the bucket in a spare
- The recovered record is delivered to the client

36 LH*sa bucket & record recovery
- Try the 1st-level group of the unavailable bucket m
- If other buckets are found unavailable in this group
  - try to recover each of them using their 2nd-level groups
- And so on...
- Finally, come back and recover bucket m
  - see the paper for the full algorithms
- For an I-available file, it is sometimes possible to recover a record even when more than I buckets in a group are unavailable
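The recursive idea can be sketched in a few lines of Python (illustrative only: bucket contents are modelled as integers, parity at every level is assumed to be plain XOR over the group, `group(bucket, level)` is assumed to return the other members of that level's group plus its parity bucket, and `read(bucket)` returns the content or None when the bucket is unavailable):

```python
def recover(bucket, level, max_level, group, read):
    """Rebuild `bucket` by XOR-ing the available members of its level-`level` group,
    recovering any further unavailable member through the next grouping level."""
    if level > max_level:
        raise RuntimeError(f"too many failures: bucket {bucket!r} is not recoverable")
    result = 0
    for member in group(bucket, level):
        data = read(member)
        if data is None:                       # this member is unavailable too:
            data = recover(member, level + 1,  # recover it via its next-level group
                           max_level, group, read)
        result ^= data                         # XOR of parity + remaining members
    return result
```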

37 LH*sa normal recovery

38 Good case recovery (I = 3)

39 Scalability analysis: search performance
- Search
  - usually the same cost as for LH*
    - including the parallel queries
    - 2 messages per key search
  - in degraded mode:
    - usually O(k): the record reconstruction cost using 1st-level parity
    - worst case: O((I+1) k)

40 Insert performance
- Usually: (I+1) or (I+2) messages
  - (I+1) is the best possible value for any I-availability schema
- In degraded mode:
  - about the same, if the unavailable bucket can be bypassed
  - add the bucket recovery cost otherwise
    - the client cost is only a few messages to deliver the record to the coordinator

41 Split performance
- LH* split cost of O(b/2)
  - b is the bucket capacity
  - one message per record
- Plus usually O(Ib) messages to the parity buckets
  - to recompute (XOR) the parity bits, since usually all records get new ranks
- Plus O(b) messages when a new bucket is created

42 Storage performance
- Storage overhead cost C_s = S' / S
  - S': storage for the parity files
  - S: storage for the data buckets
    - practically, the LH* file storage cost
- C_s depends on the file availability level I reached
- To build a new level I + 1:
  - C_s starts from the lower bound L_I = I / k
    - for file size M = k^I
    - the best possible value for any I-availability schema
  - increases towards an upper bound U_{I+1} = O(1/2 + I/k)
    - as long as new splits add parity buckets
  - decreases towards L_{I+1} afterwards
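Plugging a few sample group sizes into the slide's own bounds gives a feel for the overhead range while a new level is being built:

```python
# L_I = I / k (reached at file size M = k**I); U_(I+1) ~ 1/2 + I/k while new
# splits still add parity buckets.
for k in (4, 8, 16):
    for I in (1, 2, 3):
        print(f"k={k:2} I={I}: L_I = {I / k:.2f}, U_(I+1) ~ {0.5 + I / k:.2f}")
# e.g. k=16, I=1: the overhead stays between ~6% and ~56% while building 2-availability.
```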

43 Example (figure)

44 Reliability
- Probability P that all records are available to the application
  - all the data buckets are available
  - or every record can be recovered
    - at most I buckets have failed in each LH*sa group
- Depends on
  - the failure probability p of each site
  - the group size k
  - the file size M
- The reliability of the basic LH*sa schema is termed uncontrolled
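A deliberately simplified model of this quantity (treating each group as k data plus I parity buckets that survives iff at most I of them fail, and the file as the product over its groups; the exact LH*sa analysis with its multi-level groups is in the paper, so take this only as a sketch):

```python
from math import comb

def group_reliability(p: float, k: int, I: int) -> float:
    """Probability that a group of k data + I parity buckets has at most I failures."""
    g = k + I
    return sum(comb(g, f) * p**f * (1 - p)**(g - f) for f in range(I + 1))

def file_reliability(p: float, k: int, I: int, n_buckets: int) -> float:
    n_groups = -(-n_buckets // k)              # ceiling division
    return group_reliability(p, k, I) ** n_groups

print(file_reliability(p=0.01, k=4, I=1, n_buckets=100))   # ~0.98
```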

45 Uncontrolled reliability of LH*sa (figure)

46 Controlled reliability
- Keeps the reliability above, or close to, a given threshold through
  - delaying or accelerating the availability-level growth
  - or gracefully changing the group size k
- Necessary for higher values of p
  - the case of less reliable sites
    - a frequent situation on network multicomputers
- May improve performance for small p's
- Several schemes are possible

47 Controlled reliability with fixed group size (figure; p = 0.2, k = 4, T = 0.8)

48 Controlled reliability with variable group size (figure; p = 0.01, T = 0.95)

49 LH*RS (Litwin & Schwarz)
- Single grouping function
  - groups 1234, 5678…
- Multiple parity buckets per group
- Scalable availability
  - 1 parity bucket per group until 2^i1 buckets
  - then, at each split, add a 2nd parity bucket to each existing group, or create 2 parity buckets for new groups, until 2^i2 buckets
  - etc.
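The resulting schedule is easy to state as code (a sketch: the exponents i_1 < i_2 < ... are file-design parameters that the slide leaves open):

```python
def parity_buckets_per_group(n_buckets: int, thresholds: list[int]) -> int:
    """Target number of parity buckets per group for a file of n_buckets data
    buckets, given exponents i_1 < i_2 < ...: one more parity bucket each time
    the file grows past 2**i_t buckets."""
    parity = 1
    for i_t in thresholds:
        if n_buckets > 2 ** i_t:
            parity += 1
    return parity

# e.g. with i_1 = 7 and i_2 = 14: 1 parity bucket per group up to 128 data
# buckets, 2 up to 16384, 3 beyond.
print(parity_buckets_per_group(1000, [7, 14]))   # -> 2
```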

50-52 LH*RS file expansion (figures)

53 LH*RS parity calculus
- Choose GF(2^l)
  - typically GF(16) or GF(256)
- Create the k x n generator matrix G
  - using elementary transformations of an extended Vandermonde matrix of GF elements
  - k is the record group size
  - n = 2^l is the max segment size (data and parity records)
  - G = [I | P], where I denotes the identity matrix
- Each record is a sequence of symbols from GF(2^l)
- The k symbols with the same offset in the records of a group form the (horizontal) information vector U
- The matrix multiplication U G provides the codeword vector, whose last (n - k) symbols are the parity symbols

54 LH*RS parity calculus (continued)
- The parity calculus is distributed to the parity buckets
  - each column is at one bucket
- Parity is calculated only for the existing data and parity buckets
  - at each insert, delete and update
- Adding new parity buckets does not change the existing parity records

55 Example: GF(4)
- Addition: XOR
- Multiplication: direct table, or log / antilog tables
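The GF(4) tables themselves are small enough to write out (elements 0..3; multiplication via the direct table or, equivalently, through log/antilog tables with generator 2):

```python
GF4_MUL = [                         # direct multiplication table
    [0, 0, 0, 0],
    [0, 1, 2, 3],
    [0, 2, 3, 1],
    [0, 3, 1, 2],
]
GF4_LOG = {1: 0, 2: 1, 3: 2}        # discrete logarithms, base 2
GF4_ANTILOG = [1, 2, 3]             # 2**0, 2**1, 2**2 in GF(4)

def gf4_add(a: int, b: int) -> int:
    return a ^ b                    # addition (and subtraction) is XOR

def gf4_mul(a: int, b: int) -> int:
    if a == 0 or b == 0:
        return 0
    return GF4_ANTILOG[(GF4_LOG[a] + GF4_LOG[b]) % 3]

# The two ways of multiplying agree:
assert all(gf4_mul(a, b) == GF4_MUL[a][b] for a in range(4) for b in range(4))
```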

56-58 Encoding (figures): the records of a group and the codewords computed from them
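A toy version of the encoding shown in those figures, over GF(4) with k = 2 records per group (a sketch: the real LH*RS generator matrix is derived from an extended Vandermonde matrix as on slide 53; this hand-picked G = [I | P] merely has the required property that any k of its columns are linearly independent):

```python
GF4_MUL = [[0, 0, 0, 0], [0, 1, 2, 3], [0, 2, 3, 1], [0, 3, 1, 2]]

def mul(a: int, b: int) -> int: return GF4_MUL[a][b]
def add(a: int, b: int) -> int: return a ^ b

K, N = 2, 4
G = [[1, 0, 1, 1],       # G = [ I | P ]: the first K columns are the identity,
     [0, 1, 1, 2]]       # so the first K codeword symbols are the data itself

def encode(u: list[int]) -> list[int]:
    """Codeword U*G for an information vector u of K symbols with the same offset."""
    codeword = []
    for col in range(N):
        s = 0
        for row in range(K):
            s = add(s, mul(u[row], G[row][col]))
        codeword.append(s)
    return codeword

# Two data symbols (one per record of the group, same offset) -> 4-symbol codeword;
# the last N - K symbols go to the parity buckets.
print(encode([3, 1]))    # -> [3, 1, 2, 1]
```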

59 LH*RS recovery calculus
- Performed when at most n - k buckets are unavailable among the data and parity buckets of a group:
- Choose k available buckets
- Form the submatrix H of G from the corresponding columns
- Invert this matrix into H^-1
- Multiply the horizontal vector S of the available symbols with the same offset by H^-1
- The result contains the recovered data and/or parity symbols
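Continuing the toy GF(4), k = 2 setting from the encoding sketch, the recovery calculus looks as follows (again a sketch: pick k available codeword positions, invert the corresponding k x k submatrix H of G, and multiply the surviving symbols by H^-1 to rebuild the information vector, from which any lost symbol can be re-encoded):

```python
GF4_MUL = [[0, 0, 0, 0], [0, 1, 2, 3], [0, 2, 3, 1], [0, 3, 1, 2]]
GF4_INV = {1: 1, 2: 3, 3: 2}                       # multiplicative inverses in GF(4)

def mul(a: int, b: int) -> int: return GF4_MUL[a][b]
def add(a: int, b: int) -> int: return a ^ b       # addition = subtraction = XOR

G = [[1, 0, 1, 1],
     [0, 1, 1, 2]]                                 # same generator as in the encoding sketch

def invert_2x2(h):
    """Inverse of a 2x2 matrix over GF(4)."""
    det = add(mul(h[0][0], h[1][1]), mul(h[0][1], h[1][0]))
    d = GF4_INV[det]                               # det != 0 for any 2 columns of G
    return [[mul(d, h[1][1]), mul(d, h[0][1])],
            [mul(d, h[1][0]), mul(d, h[0][0])]]

def recover(available: dict[int, int]) -> list[int]:
    """available maps k = 2 surviving codeword positions to their symbols."""
    cols = sorted(available)
    h = [[G[r][c] for c in cols] for r in range(2)]    # submatrix H of G
    h_inv = invert_2x2(h)
    s = [available[c] for c in cols]                   # vector S of surviving symbols
    return [add(mul(s[0], h_inv[0][j]),                # U = S * H^-1
                mul(s[1], h_inv[1][j])) for j in range(2)]

# The codeword for U = [3, 1] was [3, 1, 2, 1]; suppose only positions 2 and 3
# (the two parity buckets) survive:
print(recover({2: 2, 3: 1}))                       # -> [3, 1]: the data symbols are rebuilt
```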

60-63 Example & recovery (figures): the buckets of a group and the recovered symbols / buckets

64 Conclusion
- High availability is an important property of an SDDS
- Its design should preserve scalability, parallelism & reliability
- Schemes using record grouping seem the most appropriate

65 Future work
- Performance analysis of LH*RS
- Implementation of any of the high-availability SDDSs
  - LH*RS is now implemented at CERIA by Mattias Ljungström
- High-availability variants of other known SDDSs

66 End
Witold Litwin
Thank you for your attention
