Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure. Rim Moussa, Thomas J.E. Schwarz. Workshop in Distributed Data & Structures, July 2004
2 Objective: LH*RS design, implementation & performance measurements. Factors of interest: parity overhead, recovery performance.
3 Overview: 1. Motivation 2. Highly-available schemes 3. LH*RS 4. Architectural design 5. Hardware testbed 6. File creation 7. High availability 8. Recovery 9. Conclusion 10. Future work. (The protocol sections give a scenario description, then performance results.)
4 Motivation
Information volume grows by 30% per year; disk access & CPUs are bottlenecks; failures are frequent & costly.
Average hourly financial impact of downtime (source: Contingency Planning Research, 1996):
- Brokerage (retail) operations (Financial): $6.45 million
- Credit card sales authorization (Financial): $2.6 million
- Airline reservation centers (Transportation): $89,500
- Cellular (new) service activation (Communication): $41,000
5 Requirements: need for highly-available networked data storage systems, combining scalability, high throughput & high availability.
6 Scalable & Distributed Data Structure: dynamic file growth. [Diagram: clients send inserts over the network to the data buckets (DBs); an overloaded DB tells the Coordinator "I'm overloaded!", the Coordinator replies "You split!", and records are transferred to a new bucket.]
7 SDDS (Ctnd.): no centralized directory access. [Diagram: a client sends a query to the bucket computed from its local file image; a wrongly addressed query is forwarded by the buckets, and the client receives an Image Adjustment Message.]
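The addressing principle behind these two slides can be sketched as follows: a minimal illustration of classic LH* client-side addressing with image adjustment, assuming the standard hash family h_i(key) = key mod (N * 2^i). It is not the authors' code, and the server-side rule is simplified.

```python
# Minimal sketch of LH* client addressing (assumption: standard LH* scheme).
N = 1  # number of buckets at level 0 (illustrative)

def h(level, key):
    """Linear-hashing family: h_level(key) = key mod (N * 2^level)."""
    return key % (N * 2 ** level)

def client_address(key, img_level, img_split):
    """Address a query with the client's possibly outdated file image."""
    a = h(img_level, key)
    if a < img_split:              # that bucket has already split
        a = h(img_level + 1, key)
    return a

def server_check(key, addr, level, split):
    """Server-side recomputation with the true file state (simplified).
    Returns None if the client hit the right bucket, else the forwarding
    target; a forward also triggers an Image Adjustment Message."""
    a = h(level, key)
    if a < split:
        a = h(level + 1, key)
    return None if a == addr else a
```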
8 Solutions towards High Availability
Data replication: (+) good response time, since mirrors are queried; (-) high storage cost (x n for n replicas)
Parity calculus: erasure-resilient codes, evaluated with regard to:
- Coding rate (parity volume / data volume)
- Update penalty
- Group size used for data reconstruction
- Complexity of coding & decoding
9 Fault-Tolerant Schemes
1 server failure: simple XOR parity calculus: RAID systems [Patterson et al., 88], the SDDS LH*g [Litwin et al., 96]
More than 1 server failure:
- Binary linear codes [Hellerstein et al., 94]
- Array codes, tolerating just 2 failures: EVENODD [Blaum et al., 94], X-code [Xu et al., 99], RDP scheme [Corbett et al., 04]
- Reed-Solomon codes, tolerating a large number of failures: IDA [Rabin, 89], RAID X [White, 91], FEC [Blomer et al., 95], tutorial [Plank, 97], LH*RS [Litwin & Schwarz, 00]
A Highly Available & Distributed Data Structure: LH*RS [Litwin & Schwarz, 00] [Litwin, Moussa & Schwarz, sub.]
11 LH*RS: an SDDS. Data distribution scheme based on Linear Hashing: LH*LH [Karlsson et al., 96], applied to the key field. Parity calculus: Reed-Solomon codes [Reed & Solomon, 60]. Result: scalability, high throughput, high availability.
12 LH*RS File Structure. Data buckets store data records: (key, data field); a record inserted into a bucket gets a rank r. Parity buckets store parity records: (rank r, [key list], parity field), one per group of data records sharing the same rank.
Architectural Design of LH*RS
14 Communication
Use of TCP/IP (better performance & reliability than UDP for bulk transfers): new PB creation, large update transfer (DB split), bucket recovery
Use of UDP (speed): individual insert/update/delete/search queries, record recovery, service and control messages
15 Bucket Architecture. [Diagram: network-facing ports: multicast listening port, send & receive UDP ports with message queues (message processing), TCP/IP port, process buffer; flow-control window with free zones, sending credit, messages waiting for ack, not-yet-acked messages; threads: multicast listening & working threads, ack-management thread, UDP & TCP listening threads, working threads 1..n.]
16 Architectural Design: enhancements to SDDS-2000 [Bennour, 00] [Diène, 01]
TCP/IP connection handler: TCP/IP connections use passive OPEN (RFC 793 [ISI, 81]; TCP/IP implementation under the Windows 2000 Server OS [McDonald & Barkley, 00])
Bucket architecture, flow control & acknowledgement management: principle of "sending credit + message conservation until delivery" [Jacobson, 88] [Diène, 01]; see the sketch below
Example, recovery of 1 DB: SDDS-2000 architecture 6.7 s, new architecture 2.6 s, a 60% improvement (hardware config.: 733 MHz machines, 100 Mbps network)
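The flow-control principle can be illustrated with a hedged, minimal sketch of "sending credit + message conservation until delivery"; class and field names are assumptions, not the authors' implementation. The sender spends one credit per in-flight message, keeps every message until it is acknowledged, and resends on time-out.

```python
import socket, time

class CreditSender:
    """Sketch of a sending-credit window over UDP (illustrative only)."""

    def __init__(self, sock: socket.socket, dest, credit: int = 5):
        self.sock, self.dest = sock, dest
        self.credit = credit          # messages allowed in flight
        self.pending = {}             # seq -> (payload, last send time)
        self.next_seq = 0

    def can_send(self) -> bool:
        return self.credit > 0

    def send(self, payload: bytes):
        assert self.can_send(), "window exhausted: wait for acks"
        seq, self.next_seq = self.next_seq, self.next_seq + 1
        self.sock.sendto(seq.to_bytes(4, "big") + payload, self.dest)
        self.pending[seq] = (payload, time.time())  # conserve until delivery
        self.credit -= 1

    def on_ack(self, seq: int):
        if self.pending.pop(seq, None) is not None:
            self.credit += 1          # delivery confirmed: credit restored

    def resend_expired(self, timeout: float = 0.2):
        now = time.time()
        for seq, (payload, sent_at) in list(self.pending.items()):
            if now - sent_at > timeout:     # not acked in time: resend
                self.sock.sendto(seq.to_bytes(4, "big") + payload, self.dest)
                self.pending[seq] = (payload, now)
```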
17 Architectural Design (Ctnd.): multicast component
Before: a pre-defined & static table of DBs and PBs
Now: a dynamic structure, updated when new/spare buckets (DBs/PBs) are added, probed through multicast. [Diagram: the Coordinator probes the Blank DBs multicast group and the Blank PBs multicast group.]
18 Hardware Testbed
5 machines (Pentium IV 1.8 GHz, 512 MB RAM); Ethernet network, max bandwidth 1 Gbps; operating system: Windows 2000 Server
Tested configuration: 1 client, a group of 4 data buckets, k parity buckets with k ∈ {0, 1, 2}
LH*RS File Creation
20 File Creation
Client operation: each insert/update/delete of a data record is propagated to the parity buckets.
Data bucket split:
- PBs of the splitting DB: for the records that remain, N deletes (from the old rank) & N inserts (at the new rank); for the records that move, N deletes
- PBs of the new DB: N inserts (the moved records)
All updates are gathered in the same buffer and transferred (TCP/IP) simultaneously to the respective parity buckets of the splitting DB & the new DB; see the sketch below.
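A hedged sketch of how a split's parity updates could be gathered; the record/rank model and names are simplifying assumptions, not the authors' buffer format.

```python
def split_updates(bucket: dict, moved_keys: set):
    """bucket: {rank: (key, data)} of the splitting DB;
    moved_keys: keys that rehash to the new bucket.
    Returns one update list per parity-bucket group, each to be
    shipped as a single TCP/IP buffer."""
    old_group, new_group = [], []
    stay_rank = move_rank = 0                 # ranks are reassigned densely
    for rank in sorted(bucket):
        key, data = bucket[rank]
        if key in moved_keys:
            old_group.append(("delete", rank, key))
            new_group.append(("insert", move_rank, key, data))
            move_rank += 1
        else:
            if rank != stay_rank:             # remaining record changes rank
                old_group.append(("delete", rank, key))
                old_group.append(("insert", stay_rank, key, data))
            stay_rank += 1
    return old_group, new_group
```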
21 File Creation Perf.
Experimental set-up: file of data records, 1 data record = 104 B; client sending credit = 1 and 5.
PB overhead: from k = 0 to k = 1, performance degradation of 20%; from k = 1 to k = 2, performance degradation of 8%.
22 File Creation Perf.
Experimental set-up: file of data records, 1 data record = 104 B; client sending credit = 1 and 5.
PB overhead: from k = 0 to k = 1, performance degradation of 37%; from k = 1 to k = 2, performance degradation of 10%.
LH*RS Parity Bucket Creation
24 PB Creation Scenario: searching for a new PB. The Coordinator multicasts to the PBs connected to the Blank PBs multicast group: "Wanna join group g? [Sender: your entity #]".
25 PB Creation Scenario: waiting for replies. Each candidate PB answers "I would", starts UDP listening, TCP listening & its working threads, and waits for confirmation; if the time-out elapses, it cancels all.
26 PB Creation Scenario: PB selection. The Coordinator sends "You are selected" to the chosen PB, which disconnects from the Blank PBs multicast group; the other candidates receive a cancellation.
27 PB Creation Scenario: auto-creation, query phase. The new PB asks each DB of its group: "Send me your contents!"
28 PB Creation Scenario: auto-creation, encoding phase. Each DB of the group sends the requested buffer; the new PB processes the buffers (encoding).
29 PB Creation Perf.: XOR encoding. Experimental set-up: bucket contents = 0.625 x bucket size; file size = 2.5 x bucket size records. Processing time is about 74% of total time. [Table: total, processing & communication times (sec) and encoding rate (MB/sec) per bucket size.]
30 PB Creation Perf.: RS encoding. Experimental set-up: bucket contents = 0.625 x bucket size; file size = 2.5 x bucket size records. Processing time is about 74% of total time. [Table: total, processing & communication times (sec) and encoding rate (MB/sec) per bucket size.]
31 PB Creation Perf.: XOR encoding vs. RS encoding (bucket size = 50,000). XOR encoding rate: 0.66 MB/sec; RS encoding rate: … MB/sec. XOR provides a performance gain of 5% in processing time (0.02% in total time); see the encoding sketch below.
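To make the comparison concrete, here is a hedged sketch of both parity calculi, using GF[2^8] log/antilog tables for brevity (the parity-calculus slide also discusses GF[2^16]); coefficients and names are illustrative assumptions, not the authors' code.

```python
# Build GF(2^8) log/antilog tables (primitive polynomial 0x11D).
EXP, LOG = [0] * 512, [0] * 256
x = 1
for i in range(255):
    EXP[i], LOG[x] = x, i
    x = (x << 1) ^ (0x11D if x & 0x80 else 0)
for i in range(255, 512):
    EXP[i] = EXP[i - 255]           # doubled so log-sums need no mod

def gf_mul(a: int, b: int) -> int:
    """GF multiply: two log lookups, one add, one antilog lookup."""
    return 0 if 0 in (a, b) else EXP[LOG[a] + LOG[b]]

def xor_parity(segments):
    """XOR calculus: byte-wise XOR of the group's data segments."""
    out = bytearray(len(segments[0]))
    for seg in segments:
        for i, b in enumerate(seg):
            out[i] ^= b
    return bytes(out)

def rs_parity(segments, coeffs):
    """RS calculus: GF-linear combination sum_j c_j * d_j per byte.
    With coeffs all equal to 1 this degenerates to xor_parity."""
    out = bytearray(len(segments[0]))
    for c, seg in zip(coeffs, segments):
        for i, b in enumerate(seg):
            out[i] ^= gf_mul(c, b)
    return bytes(out)
```

The extra table lookups per byte in rs_parity are what the measured XOR gain in processing time reflects.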
LH*RS Bucket Recovery
33 Buckets' Recovery: failure detection. The Coordinator polls the data buckets and parity buckets: "Are you alive?"
34 Buckets' Recovery: waiting for replies. The surviving buckets answer "I am alive!".
35 Buckets' Recovery: searching for 2 spare DBs. The Coordinator multicasts to the DBs connected to the Blank DBs multicast group: "Wanna be a spare DB? [Sender: your entity #]".
36 Buckets' Recovery: waiting for replies. Each candidate DB answers "I would", starts UDP listening, TCP listening & its working threads, and waits for confirmation; if the time-out elapses, it cancels all.
37 Buckets' Recovery: spare DBs selection. The Coordinator sends "You are selected" to the chosen DBs, which disconnect from the Blank DBs multicast group; the other candidates receive a cancellation.
38 Buckets' Recovery: recovery manager determination. The Coordinator designates one of the parity buckets as recovery manager: "Recover buckets [spares …]".
39 Buckets' Recovery: query phase. The recovery manager asks the alive buckets participating in the recovery (data & parity buckets): "Send me records of rank in [r, r+slice-1]".
40 Buckets' Recovery: reconstruction phase. The alive buckets send the requested buffers; the recovery manager runs the decoding process and ships the recovered records to the spare DBs. A sketch of the decoding step follows.
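A hedged sketch of erasure decoding by matrix inversion over GF[2^8] (same field and tables as the encoding sketch after slide 31; the GF[2^16] case is analogous). The conventions are simplifying assumptions, not the authors' exact code: h_rows is the square submatrix of the generator restricted to the surviving buckets, so the group's data symbols are H^-1 * y.

```python
# GF(2^8) tables, built as in the encoding sketch.
EXP, LOG = [0] * 512, [0] * 256
x = 1
for i in range(255):
    EXP[i], LOG[x] = x, i
    x = (x << 1) ^ (0x11D if x & 0x80 else 0)
for i in range(255, 512):
    EXP[i] = EXP[i - 255]

def mul(a, b):
    return 0 if 0 in (a, b) else EXP[LOG[a] + LOG[b]]

def inv(a):
    assert a != 0, "0 has no inverse"
    return EXP[255 - LOG[a]]

def mat_inv(m):
    """Gauss-Jordan inversion over GF(2^8); subtraction is XOR."""
    n = len(m)
    aug = [row[:] + [int(i == j) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        piv = next(r for r in range(col, n) if aug[r][col])  # MDS => exists
        aug[col], aug[piv] = aug[piv], aug[col]
        f = inv(aug[col][col])
        aug[col] = [mul(f, v) for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                g = aug[r][col]
                aug[r] = [v ^ mul(g, w) for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def decode(h_rows, received):
    """received: one symbol per surviving bucket at a given offset.
    Returns all data symbols of the group at that offset; the lost
    buckets' symbols are read off from the result."""
    h_inv = mat_inv(h_rows)
    out = []
    for row in h_inv:
        s = 0
        for c, y in zip(row, received):
            s ^= mul(c, y)
        out.append(s)
    return out
```

Pre-computing H^-1 once per recovery (and the logs of its coefficients, per the parity-calculus slide) amortizes the inversion across every record slice.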
41 DBs Recovery Perf.: XOR decoding. Experimental set-up: file of … recs; bucket of … recs (… MB). Varying the slice from 4% to 100% of the bucket contents, the total time doesn't vary much: about 0.72 sec. [Table: total, processing & communication times (sec) per slice.]
42 DBs Recovery Perf.: RS decoding. Experimental set-up: file of … recs; bucket of … recs (… MB). Varying the slice from 4% to 100% of the bucket contents, the total time doesn't vary much: about 0.85 sec. [Table: total, processing & communication times (sec) per slice.]
43 DBs Recovery Perf.: XOR decoding vs. RS decoding. 1 DB recovery time with XOR: 0.72 sec; with RS: 0.85 sec. XOR provides a performance gain of 15% in total time.
44 DBs Recovery Perf.: recovery of 2 DBs (RS). Experimental set-up: file of … recs; bucket of … recs (… MB). Varying the slice from 4% to 100% of the bucket contents, the total time doesn't vary much: about 1.2 sec. [Table: total, processing & communication times (sec) per slice.]
45 DBs Recovery Perf.: recovery of 3 DBs (RS). Experimental set-up: file of … recs; bucket of … recs (… MB). Varying the slice from 4% to 100% of the bucket contents, the total time doesn't vary much: about 1.6 sec. [Table: total, processing & communication times (sec) per slice.]
46 Perf. Summary of Bucket Recovery
1 DB (3.125 MB) in 0.7 sec (XOR): 4.46 MB/sec
1 DB (3.125 MB) in 0.85 sec (RS): 3.65 MB/sec
2 DBs (6.250 MB) in 1.2 sec (RS): 5.21 MB/sec
3 DBs (9.375 MB) in 1.6 sec (RS): 5.86 MB/sec
47 Conclusion
The conducted experiments show the impact on performance of the encoding/decoding optimizations and of the enhanced bucket architecture: good recovery performance. Finally, we improved the processing time of the RS decoding process by 4% to 8%; 1 DB is recovered in half a second.
48 Conclusion
LH*RS: a mature implementation, after many optimization iterations; the only SDDS with scalable availability.
49 Future Work
A better strategy for propagating parity updates to the PBs; investigation of faster encoding/decoding processes.
50 References
[Patterson et al., 88] D. A. Patterson, G. Gibson & R. H. Katz, A Case for Redundant Arrays of Inexpensive Disks, Proc. of ACM SIGMOD Conf., June 1988.
[ISI, 81] Information Sciences Institute, RFC 793: Transmission Control Protocol (TCP) Specification, Sept. 1981.
[McDonald & Barkley, 00] D. MacDonald, W. Barkley, MS Windows 2000 TCP/IP Implementation Details, 2000.
[Jacobson, 88] V. Jacobson, M. J. Karels, Congestion Avoidance and Control, Computer Communication Review, Vol. 18, No. 4, 1988.
[Xu et al., 99] L. Xu & J. Bruck, X-Code: MDS Array Codes with Optimal Encoding, IEEE Trans. on Information Theory, 45(1), 1999.
[Corbett et al., 04] P. Corbett, B. English, A. Goel, T. Grcanac, S. Kleiman, J. Leong, S. Sankar, Row-Diagonal Parity for Double Disk Failure Correction, Proc. of the 3rd USENIX Conf. on File and Storage Technologies, April 2004.
[Rabin, 89] M. O. Rabin, Efficient Dispersal of Information for Security, Load Balancing and Fault Tolerance, Journal of the ACM, Vol. 36, No. 2, April 1989.
[White, 91] P. E. White, RAID X tackles design problems with existing design RAID schemes, ECC Technologies, 1991.
[Blomer et al., 95] J. Blomer, M. Kalfane, R. Karp, M. Karpinski, M. Luby & D. Zuckerman, An XOR-Based Erasure-Resilient Coding Scheme, ICSI Tech. Rep., 1995.
51 References (Ctnd.)
[Litwin & Schwarz, 00] W. Litwin & T. Schwarz, LH*RS: A High-Availability Scalable Distributed Data Structure using Reed Solomon Codes, Proceedings of ACM SIGMOD 2000.
[Karlsson et al., 96] J. Karlsson, W. Litwin & T. Risch, LH*LH: A Scalable High Performance Data Structure for Switched Multicomputers, EDBT '96, Springer Verlag.
[Reed & Solomon, 60] I. Reed & G. Solomon, Polynomial Codes over Certain Finite Fields, Journal of the Society for Industrial and Applied Mathematics, 1960.
[Plank, 97] J. S. Plank, A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems, Software: Practice and Experience, 27(9), Sept. 1997.
[Diène, 01] A. W. Diène, Contribution à la Gestion de Structures de Données Distribuées et Scalables, PhD thesis, Université Paris Dauphine, Nov. 2001.
[Bennour, 00] F. Sahli Bennour, Contribution à la Gestion de Structures de Données Distribuées et Scalables, PhD thesis, Université Paris Dauphine, June 2000.
[Moussa] More references: …
End
53 Parity Calculus
Galois field: GF[2^8], 1 symbol is 1 byte || GF[2^16], 1 symbol is 2 bytes
(+) GF[2^16] vs. GF[2^8]: halves the number of symbols, and consequently the number of operations in the field
(-) Larger multiplication-table sizes
New generator matrix:
- 1st column of '1's: the 1st parity bucket executes XOR calculus instead of RS calculus, a performance gain in encoding of 20%
- 1st line of '1's: each PB executes XOR calculus for any update from the 1st DB of any group, a performance gain of 4% (measured for PB creation)
Encoding & decoding hints (see the sketch below):
- Encoding: log pre-calculus of the P matrix coefficients, an improvement of 3.5%
- Decoding: log pre-calculus of the H^-1 matrix coefficients and the b vector for multiple-bucket recovery, an improvement from 4% to 8%
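A hedged sketch of these two tricks, in GF[2^8] for brevity (the slide's GF[2^16] case is identical in structure). The matrix layout and helper names are illustrative assumptions, not the authors' code.

```python
# GF(2^8) log/antilog tables (primitive polynomial 0x11D).
EXP, LOG = [0] * 512, [0] * 256
x = 1
for i in range(255):
    EXP[i], LOG[x] = x, i
    x = (x << 1) ^ (0x11D if x & 0x80 else 0)
for i in range(255, 512):
    EXP[i] = EXP[i - 255]

def parity_matrix(m: int, k: int):
    """m x k parity matrix P: column j holds the coefficients that
    parity bucket j applies to the m data buckets. First column all
    '1's (1st PB does pure XOR) and first line all '1's (updates from
    the 1st DB are XOR at every PB); other entries are illustrative."""
    return [[1] * k] + [
        [1] + [EXP[(i * j) % 255] for j in range(1, k)]
        for i in range(1, m)
    ]

def precompute_logs(coeffs):
    """Log pre-calculus: store log(c) once per coefficient, so each
    per-byte multiply is one table add instead of two log lookups."""
    return [None if c == 0 else LOG[c] for c in coeffs]

def apply_update(parity_seg: bytearray, delta_seg: bytes, log_c):
    """XOR a GF-scaled record delta into a parity segment."""
    if log_c is None:                  # zero coefficient: nothing to do
        return
    if log_c == 0:                     # log(1) == 0: coefficient 1, pure XOR
        for i, d in enumerate(delta_seg):
            parity_seg[i] ^= d
    else:
        for i, d in enumerate(delta_seg):
            if d:
                parity_seg[i] ^= EXP[log_c + LOG[d]]
```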