1 SDDS-2000: A Prototype System for Scalable Distributed Data Structures on Windows 2000
Witold Litwin, Witold.Litwin@dauphine.fr
2 Plan
- What are SDDSs?
- Where are we in 2002?
– LH*: Scalable Distributed Hash Partitioning
– Scalable Distributed Range Partitioning: RP*
– High Availability: LH*RS & RP*RS
– DBMS coupling: AMOS-SDDS & SD-AMOS
- Architecture of SDDS-2000
- Experimental performance results
- Conclusion
– Future work
3 What is an SDDS
- A new type of data structure
– specifically for multicomputers
- Designed for data-intensive files:
– horizontal scalability to very large sizes
» larger than any single-site file
– parallel and distributed processing
» especially in (distributed) RAM
– record access time better than for any disk file
– 100-300 μs usually under Windows 2000 (100 Mb/s net, 700 MHz CPU, 100 B - 1 KB records)
– access by multiple autonomous clients
4 Killer apps
- Any traditional application using a large hash, B-tree, or k-d file
– access time to a RAM SDDS record is about 100 times faster than to a disk record
- Network storage servers (SAN and NAS)
- DBMSs
- WEB servers
- Video servers
- Real-time systems
- High-performance computing
5 Multicomputers
- A collection of loosely coupled computers
– common and/or preexisting hardware
– share-nothing architecture
– message passing through a high-speed net (Mb/s)
- Network multicomputers
– use general-purpose nets
» LANs: Ethernet, Token Ring, Fast Ethernet, SCI, FDDI...
» WANs: ATM...
- Switched multicomputers
– use a bus or a switch
» e.g., IBM SP2, Parsytec
6 Typical Network Multicomputer (diagram: clients and servers spread over interconnected network segments)
7 Why multicomputers?
- Potentially unbeatable price-performance ratio
– much cheaper and more powerful than supercomputers
» 1500 WSs at HPL with 500+ GB of RAM & TBs of disks
- Potential computing power
– file size
– access and processing time
– throughput
- For more pros & cons:
– Bill Gates at Microsoft Scalability Day
– NOW project (UC Berkeley)
– Tanenbaum: "Distributed Operating Systems", Prentice Hall, 1995
– www.microsoft.com White Papers from the Business Systems Div.
8 Why SDDSs
- Multicomputers need data structures and file systems
- Trivial extensions of traditional structures are not the best:
– hot-spots
– scalability
– parallel queries
– distributed and autonomous clients
– distributed RAM & distance to data
9 Distance to data (Jim Gray)
– RAM: 100 ns
– distant RAM (gigabit net): 1 μs
– distant RAM (Ethernet): 100 μs
– local disk: 10 ms
10 Distance to Data (Jim Gray), on a human scale (diagram): the same latencies (RAM, distant RAM over a gigabit net, distant RAM over Ethernet, local disk at 10 ms) translate roughly into 1 minute, 10 minutes, 2 hours, and 8 days; the local disk is as far away as the Moon.
11 Scalability Dimensions (Client view) (chart): operation time in distant RAM and # operations/s versus data size and # of servers and clients; scale-up is linear in the ideal case, sub-linear in the usual case.
12 Scalability Dimensions (Client view, SDDS-specific) (chart): operation time versus data size for a single computer (local RAM, then cache & disk, then tapes/juke-box), a cluster computer (fixed number of servers), and a multicomputer with an SDDS; the SDDS keeps the scale-up near-linear where the others degrade sub-linearly.
13 Scalability Dimensions (Client view) (chart): # operations/s versus # servers; speed-up is linear in the ideal case, sub-linear in the usual case.
14 What is an SDDS?
+ Queries come from multiple autonomous clients
+ Data are on servers
+ Data are structured
+ records with keys, objects with OIDs
+ more semantics than in the Unix flat-file model
+ abstraction most popular with applications
+ parallel scans & function shipping
+ Overflowing servers split into new servers
15-18 An SDDS (animation): clients on one side, servers on the other; the file grows through splits under inserts, spreading over more and more servers.
19 SDDS: Addressing Principles
+ SDDS clients:
+ are not informed about the splits
+ do not access any centralized directory for record address computations
+ each has a more or less adequate private image of the actual file structure
+ can make addressing errors
+ sending queries or records to incorrect servers
+ searching for a record that was moved elsewhere by splits
+ sending a record that should be elsewhere for the same reason
20
20, Servers are able to forward the queries to the correct address –perhaps in several messages + Servers may send Image Adjustment Messages »Clients do not make same error twice Servers supports parallel scans Sent out by multicast or unicast With deterministic or probabilistic termination n See the SDDS talk & papers for more –ceria.dauphine.fr/witold.html n Or the LH* ACM-TODS paper (Dec. 96) What is an SDDS ? SDDS : Addressing Principles
21-25 An SDDS: client access (animation): a client sends a request computed from its image, possibly to an outdated bucket; the servers forward it to the correct bucket, which answers and sends an IAM back to the client.
26 Known SDDSs (family tree, from 1993): DS classics lead to hash schemes (LH*, DDH, Breitbart & al.), 1-d tree schemes (RP*, Kroll & Widmayer, Breitbart & Vingralek), m-d tree schemes (k-RP*, dPi-tree, Nardelli-tree), s-availability schemes (LH*SA, LH*RS), disk-oriented SDLSA, LH*m and LH*g, and security-oriented variants. Bibliography: http://ceria.dauphine/SDDS-bibliograhie.html
27 LH* (a classic)
- Scalable distributed hash partitioning
– transparent for the applications
» unlike the current static schemes (e.g., DB2)
– generalizes the LH addressing schema
» used in many products: SQL Server, IIS, MS Exchange, FrontPage, Netscape suites, Berkeley DB Library, LH-Server, Unify...
- Typical load factor 70-90%
- In practice, at most 2 forwarding messages
– regardless of the size of the file
- In general, 1 message/insert and 2 messages/search on average
- 4 messages in the worst case
- Several variants are known
– LH*LH is the most studied
28 Overview of LH
- Extensible hash algorithm
– widely used, e.g.:
» Netscape browser (100M copies)
» LH-Server by AR (700K copies sold)
» MS FrontPage, Exchange, IIS
– taught in most DB and DS classes
– the address space expands
» to avoid overflows & access-performance deterioration
- The file has buckets with capacity b >> 1
- Hash by division h_i : c -> c mod 2^i * N provides the address h_i(c) of key c (see the sketch below)
- Buckets split through the replacement of h_i with h_(i+1), i = 0, 1, ...
- On average, b/2 keys move to the new bucket
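To make the hash-by-division family concrete, here is a minimal C++ sketch (our own naming, not the SDDS-2000 API), assuming N = 1 initial bucket as in the slides that follow:

```cpp
#include <cstdint>

// h_i(c) = c mod (2^i * N): the LH hash-by-division family.
// Replacing h_i by h_(i+1) at a split sends roughly half of a
// bucket's keys to the new bucket, since one more key bit is used.
constexpr uint64_t N = 1;   // initial number of buckets (assumption)

uint64_t h(unsigned i, uint64_t c) {
    return c % ((1ULL << i) * N);
}
```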
29 Overview of LH (cont.)
- Basically, a split occurs when some bucket m overflows
- One splits bucket n, pointed to by the split pointer n (a split sketch follows below)
– usually m ≠ n
- n evolves: 0; 0,1; 0,1,..,2; 0,1,..,3; 0,..,7; ...; 0,..,2^i * N; 0,..
- One consequence: no index
– an index is characteristic of the other EH (extensible hashing) schemes
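A hedged sketch of one split step as described above (our own in-memory layout, not the SDDS-2000 bucket format): bucket n is rehashed with h_(i+1), keys that no longer map to n move to the new bucket n + 2^i * N, and the split pointer advances, bumping i when a full turn completes.

```cpp
#include <cstdint>
#include <vector>

struct LHFile {
    static constexpr uint64_t N = 1;             // initial bucket count (assumption)
    std::vector<std::vector<uint64_t>> buckets;  // keys of each bucket
    unsigned i = 0;                              // file level
    uint64_t n = 0;                              // split pointer

    static uint64_t h(unsigned lvl, uint64_t c) { return c % ((1ULL << lvl) * N); }

    void split() {                               // split bucket n with h_(i+1)
        buckets.emplace_back();                  // new bucket gets address n + 2^i * N
        std::vector<uint64_t> keep;
        for (uint64_t c : buckets[n])
            (h(i + 1, c) == n ? keep : buckets.back()).push_back(c);
        buckets[n] = std::move(keep);
        if (++n == (1ULL << i) * N) { n = 0; ++i; }  // pointer evolution 0 .. 2^i*N - 1
    }
};
```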
30-37 LH File Evolution (animation). Initially one bucket (N = 1, b = 4, i = 0, n = 0) holds keys 35, 12, 7, 15, 24 under h_0. The overflow triggers a split with h_1: bucket 0 keeps 12, 24 and the new bucket 1 gets 35, 7, 15 (i = 1). Inserts of 32, 58 (bucket 0) and 21, 11 (bucket 1) follow; bucket 0 then splits under h_2 into bucket 0 (32, 12, 24) and bucket 2 (58). After 33 is inserted, bucket 1 splits under h_2 into bucket 1 (33, 21) and bucket 3 (11, 35, 7, 15); the split pointer returns to n = 0 and i becomes 2.
38 LH File Evolution (cont.)
- Etc.: one starts h_3, then h_4, ...
- The file can expand as much as needed
– without ever too many overflows
39 Addressing Algorithm
a <- h_i(c) ;
if n = 0 then exit ;
else if a < n then a <- h_(i+1)(c) ;
end
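The same rule in C++ (a sketch with our own naming, not the SDDS-2000 interface):

```cpp
#include <cstdint>

// LH addressing: compute a = h_i(c); if a falls before the split
// pointer n, the bucket has already split, so recompute with h_(i+1).
uint64_t lh_address(uint64_t c, unsigned i, uint64_t n, uint64_t N = 1) {
    uint64_t a = c % ((1ULL << i) * N);          // a <- h_i(c)
    if (n != 0 && a < n)
        a = c % ((1ULL << (i + 1)) * N);         // a <- h_(i+1)(c)
    return a;
}
```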
40 LH*
- Property of LH:
– given j = i or j = i + 1, key c is in bucket m iff h_j(c) = m
» verify it yourself
- Ideas for LH*:
– the LH addressing rule = the global rule for the LH* file
– every bucket is at a server
– the bucket level j is kept in the bucket header
– check the LH property when the key comes from a client
41-45 LH*: file structure and split (animation). The servers hold buckets 0..9; buckets 0, 1, 8, 9 have level j = 4 and buckets 2..7 have j = 3. The coordinator keeps n = 2, i = 3; two clients hold private images (n' = 0, i' = 0) and (n' = 3, i' = 2). A split of bucket 2 creates bucket 10 with j = 4, raises bucket 2's level to 4, and advances the coordinator to n = 3.
46-47 LH* Addressing Schema
- Client:
– computes the LH address m of c using its image
– sends c to bucket m
- Server:
– server a receiving key c (a = m in particular) computes (sketch below):
a' := h_j(c) ;
if a' = a then accept c ;
else a'' := h_(j-1)(c) ;
if a'' > a and a'' < a' then a' := a'' ;
send c to bucket a' ;
- Simple? See [LNS93] for the (long) proof.
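A sketch of the server-side check-and-forward step above, under the same naming assumptions as the earlier LH sketches (an illustration, not the SDDS-2000 source):

```cpp
#include <cstdint>

uint64_t hj(unsigned j, uint64_t c, uint64_t N = 1) { return c % ((1ULL << j) * N); }

// Server a, with bucket level j (j >= 1 assumed), receives key c.
// Returns the bucket that should get c next; returning a means "accept".
uint64_t lhstar_forward(uint64_t a, unsigned j, uint64_t c) {
    uint64_t a1 = hj(j, c);
    if (a1 == a) return a;                  // the key is ours: accept it
    uint64_t a2 = hj(j - 1, c);
    if (a2 > a && a2 < a1) a1 = a2;         // at most one more hop is ever needed
    return a1;                              // forward c to bucket a1
}
```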
48 Client Image Adjustment
- The IAM consists of the address a where the client sent c and of j(a)
– i' is the presumed i in the client's image
– n' is the presumed value of pointer n in the client's image
– initially, i' = n' = 0
if j > i' then i' := j - 1, n' := a + 1 ;
if n' >= 2^i' then n' := 0, i' := i' + 1 ;
- The algorithm guarantees that the client image is within the file [LNS93]
– if there are no file contractions (merges)
(a C++ sketch follows below)
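A sketch of this image adjustment on the client (our naming; the 2^i' test assumes N = 1 as in the examples above):

```cpp
#include <cstdint>

struct ClientImage { unsigned i = 0; uint64_t n = 0; };   // i', n' (initially 0)

// The IAM carries the address a the key was sent to and that bucket's level j.
void apply_iam(ClientImage& img, uint64_t a, unsigned j) {
    if (j > img.i) {                       // the image was behind the file
        img.i = j - 1;
        img.n = a + 1;
    }
    if (img.n >= (1ULL << img.i)) {        // assumes N = 1 initial bucket
        img.n = 0;
        ++img.i;
    }
}
```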
49-55 LH*: addressing (animation). A client whose image is (n' = 0, i' = 0) sends key 15, and later key 9, to bucket 0; the servers check the keys against their levels j and forward them to the correct buckets, and the IAMs adjust the client's image to (n' = 1, i' = 3).
56 Result
- The distributed file can grow to even the whole Internet, so that:
– every insert and search is done in at most four messages (IAM included)
– in general, an insert is done in one message and a search in two messages
– proof in [LNS 93]
57-60 Measured performance (charts): 10,000 inserts, global cost and the client's cost; inserts by two clients.
61 Parallel Queries
- A query Q for all buckets of file F, with independent local executions
– every bucket should get Q exactly once
- The basis for function shipping
– fundamental for high-performance DBMS applications
- Send mode:
– multicast
» not always possible or convenient
– unicast
» the client may not know all the servers
» servers have to forward the query
– how?
62 LH* Algorithm for Parallel Queries (unicast)
- The client sends Q to every bucket a in its image
- The message carrying Q has a message level j':
– initially j' = i' if a >= n', else j' = i' + 1
– bucket a (of level j) copies Q to all its children using (sketch below):
while j' < j do
j' := j' + 1 ;
forward (Q, j') to bucket a + 2^(j'-1) ;
endwhile
- Prove it!
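A sketch of the propagation loop (the forward callback is a placeholder for the actual messaging layer, which is not shown here):

```cpp
#include <cstdint>
#include <functional>

// Bucket a, of level j, received query Q with message level j_msg.
// It covers its "children" a + 2^(j'-1), incrementing j' at each forward.
void propagate(uint64_t a, unsigned j, unsigned j_msg,
               const std::function<void(uint64_t, unsigned)>& forward) {
    while (j_msg < j) {
        ++j_msg;
        forward(a + (1ULL << (j_msg - 1)), j_msg);   // forward (Q, j') to bucket a + 2^(j'-1)
    }
}
```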
63 Termination of Parallel Query (multicast or unicast)
- How does client C know that the last reply came?
- Deterministic solution (expensive):
– every bucket sends its j, its m, and the selected records if any
» m is its (logical) address
– the client terminates when it has every m fulfilling the condition:
» m = 0, 1, ..., 2^i + n, where i = min(j) over the replies and n = min(m) among the replies with j = i + 1
64 Termination of Parallel Query (multicast or unicast)
- Probabilistic termination (may need less messaging):
– all and only the buckets with selected records reply
– after each reply, C reinitialises a time-out T
– C terminates when T expires
- The practical choice of T is network- and query-dependent
– e.g., 5 times the average Ethernet retry time
» 1-2 ms?
– experiments needed
- Which termination is finally more useful in practice?
– an open problem
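A minimal sketch of the probabilistic termination loop, assuming a hypothetical receive(T) callback that blocks up to T for the next reply (not the SDDS-2000 API):

```cpp
#include <chrono>
#include <functional>
#include <optional>

struct Reply { /* selected records ... */ };

// receive(T) returns the next reply, or nothing once T expires with no reply.
void collect_replies(std::chrono::milliseconds T,
                     const std::function<std::optional<Reply>(std::chrono::milliseconds)>& receive) {
    while (auto r = receive(T)) {
        (void)r;   // process the selected records here
        // getting a reply re-arms the time-out: the next receive(T) waits a full T again
    }
    // T expired with no further reply: the scan is considered terminated
}
```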
65 LH* variants
- With/without load (factor) control
- With/without the (split) coordinator
– the former was discussed
– the latter is a token-passing schema
» the bucket with the token is the next to split
– it splits if an insert occurs and file overload is guessed
– several algorithms exist for the decision
– cascading splits are used
- See the talk on LH* at http://ceria.dauphine.fr/
66 RP* schemes
- Produce scalable distributed 1-d ordered files
– for range search
- Each bucket (server) has the unique range of keys it may contain
- The ranges partition the key space
- The ranges evolve dynamically through splits
– transparently for the application
- Use RAM m-ary trees at each server
– like B-trees
– optimized for RP* split efficiency
(a client-side sketch follows below)
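As an illustration of range partitioning on the client side, here is a hedged sketch (our own data layout, not the SDDS-2000 image format) that maps a key to the bucket whose range contains it:

```cpp
#include <map>
#include <string>

struct RangeImage {
    // maps the lower bound of each bucket's half-open key range to a bucket/server id
    std::map<std::string, int> lower_bound_to_bucket;

    int bucket_for(const std::string& key) const {
        auto it = lower_bound_to_bucket.upper_bound(key);  // first lower bound > key
        --it;                                              // the range containing key
        return it->second;
    }
};

// Usage sketch: ranges (-inf,"g") -> 0, ["g","p") -> 1, ["p",+inf) -> 2
// RangeImage img{{{"", 0}, {"g", 1}, {"p", 2}}};
// img.bucket_for("litwin");  // -> 1
```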
67 Current PDBMS technology (e.g., Non-Stop SQL)
- Static range partitioning
- Done manually by the DBA
- Requires good skills
- Not scalable
68 RP* schemes (diagram)
69 High-availability SDDS schemes
- Data remain available despite:
– any single-server failure & most two-server failures
– or any failure of up to n servers
– and some catastrophic failures
- n scales with the file size
– to offset the reliability decline that would otherwise occur
- Three principles for high-availability SDDS schemes are currently known:
– mirroring (LH*m)
– striping (LH*s)
– grouping (LH*g, LH*SA, LH*RS, RP*RS)
- They realize different performance trade-offs
70 LH*RS: Record Groups
- LH*RS records
– LH* data records & parity records
- Records with the same rank r in the bucket group form a record group
- Each record group gets n parity records
– computed using Reed-Solomon erasure-correcting codes
» additions and multiplications in Galois Fields
» see the Sigmod 2000 paper on the Web site for details
- r is the common key of these records
- Each group supports unavailability of up to n of its members
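For intuition only, the sketch below groups the records of rank r across the data buckets of a group and computes a single parity record. Real LH*RS uses Reed-Solomon codes over a Galois field (see the Sigmod 2000 paper); the plain XOR parity used here is a stand-in that tolerates only one unavailable bucket:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Parity record of rank r = XOR of the r-th record of every data bucket in the group.
std::string parity_for_rank(const std::vector<std::vector<std::string>>& group, size_t r) {
    std::string p;
    for (const auto& bucket : group) {
        if (r >= bucket.size()) continue;                      // bucket has no record of rank r
        if (p.size() < bucket[r].size()) p.resize(bucket[r].size(), '\0');
        for (size_t b = 0; b < bucket[r].size(); ++b) p[b] ^= bucket[r][b];
    }
    return p;
}
```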
71 LH*RS Record Groups (diagram: data records and their parity records)
72 LH*RS: Parity Management
- An insert of a data record with rank r creates or, usually, updates the parity records r
- An update of a data record with rank r updates the parity records r
- A split recreates parity records
– a data record usually changes its rank after the split
73 LH*RS Scalable availability
- Create 1 parity bucket per group until M = 2^(i1) buckets
- Then, at each split:
– add a 2nd parity bucket to each existing group
– create 2 parity buckets for new groups until 2^(i2) buckets
- etc.
74-78 LH*RS Scalable availability (animation: parity buckets being added as the file grows)
79-80 SDDS-2000: global architecture (diagram): applications run against the SDDS-2000 client, which talks to SDDS-2000 servers (SDDS data servers) over the network using UDP & TCP.
81 SDDS-2000: Client Architecture (RP*c)
- 2 modules: Send Module and Receive Module
- Multithreaded architecture
– threads: SendRequest, ReceiveRequest, AnalyzeResponse 1..4, GetRequest, ReturnResponse
- Synchronization queues
- Client images
- Flow control
82 SDDS-2000: Server Architecture (RP*c)
- Multithreaded architecture with synchronization queues
- Listen Thread for incoming requests
- SendAck Thread for flow control
- Work Threads for request processing, response sendout, and request forwarding
- UDP for shorter messages (< 64 KB), TCP/IP for longer data exchanges
- Several buckets of different SDDS files per server
(a minimal threading sketch follows below)
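A minimal, self-contained sketch of this thread layout (not the SDDS-2000 source): a listen thread pushes incoming requests into a synchronization queue, and work threads pop and process them; the real server would additionally send responses and acknowledgements and forward mis-addressed requests.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

struct Request { std::string payload; };

class SyncQueue {                      // one of the "synchronization queues"
    std::queue<Request> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void push(Request r) { { std::lock_guard<std::mutex> l(m_); q_.push(std::move(r)); } cv_.notify_one(); }
    Request pop() {
        std::unique_lock<std::mutex> l(m_);
        cv_.wait(l, [&]{ return !q_.empty(); });
        Request r = std::move(q_.front()); q_.pop(); return r;
    }
};

int main() {
    SyncQueue work;
    std::thread listener([&]{           // ListenThread: would read UDP datagrams (< 64 KB)
        for (int i = 0; i < 4; ++i) work.push({"request " + std::to_string(i)});
    });
    std::vector<std::thread> workers;
    for (int w = 0; w < 2; ++w)
        workers.emplace_back([&]{       // WorkThread: process, then respond or forward
            for (int i = 0; i < 2; ++i) (void)work.pop();
        });
    listener.join();
    for (auto& t : workers) t.join();
}
```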
83 AMOS-SDDS Architecture
- For database queries
– especially parallel scans
- Couples SDDS-2000 and AMOS-II:
– a RAM OR-DBMS
– AMOSQL declarative query language
» can be embedded into C and Lisp
– call-level interface (callin)
– external procedures (functions) (callout)
- See the AMOS-II talk & papers for more: http://www.dis.uu.se/~udbl/
84 AMOS-SDDS Architecture (cont.)
- SDDS is used as the distributed RAM storage manager
– the RP* scheme for scalable distributed range partitioning
– like in a B-tree, records are lexicographically ordered by their keys
– supports range queries efficiently
- AMOS-II provides a fast SDDS-based RAM OR-DBMS
- The callout capability realizes the AMOSQL object-relational feature usually called external or foreign functions
85-86 AMOS-SDDS Architecture (diagrams: AMOS-SDDS scalable distributed query processing)
87 AMOS-SDDS Server Query Processing
- E-strategy
– data stay external to AMOS
» within the SDDS bucket
– custom foreign functions perform the query
- I-strategy
– data are imported on the fly into AMOS-II
» perhaps with local index creation
– good for joins
– AMOS performs the query
- Which strategy is preferable?
– a good question
88 SD-AMOS
- The server storage manager is a full-scale DBMS
– AMOS-II in our case, since it is a RAM DBMS
– could be any DBMS
- SDDS-2000 provides the scalable distributed partitioning schema
- The server DBMS performs the splits
– when and how???
- The client manages scalable query decomposition & execution
– easier said than done
- The whole system generalizes PDBMS technology
– which offers static partitioning only
89 Scalability Analysis
- Theoretical
– to validate an SDDS
» see the papers
– to get an idea of system performance
» limited validity
- Experimental
– more accurate validation of design issues
– a practical necessity
– costs orders of magnitude more time and money
90 Experimental Configuration
- 6 machines: 700 MHz Pentium III
- 100 Mb/s Ethernet
- 150-byte records
91 LH* file creation (Ph.D. thesis of F. Bennour, 2000)
- Bucket size b = 5.000; flow control on
- Chart: time (ms) versus # of inserts and # of buckets, with and without splits
- LH* scalability is confirmed
- Performance is bound by the client's processing speed
92 LH* key search
- Chart: time (ms) versus file size in records and # of servers (2 to 5), for the actual client image and for a new client image with IAMs
- Performance is bound by the client's processing speed
93-95 LH*RS Experimental Performance (preliminary results)
- Charts: insert time during file creation (moving average) and total file creation time, versus the number of records
- Normal key search
– unaffected by the parity calculus
» about 0.3 ms per key search
- Degraded key search
– about 2 ms for the application
» 1.1 ms (k = 4) for the record recovery
» 1 ms for the client time-out and the coordinator action
- Bucket recovery at the spare
– 0.3 ms per record (k = 4)
99 Performance Analysis (RP*): Experimental Environment
- Six Pentium III 700 MHz machines
– Windows 2000
– 128 MB RAM, extended later to 256 MB RAM
– 100 Mb/s Ethernet
- Messages
– 180 bytes: 80 for the header, 100 for the record
– keys are random integers within some interval
– flow control: sliding window of 10 messages
- Index
– capacity of an internal node: 80 index elements
– capacity of a leaf: 100 records
100 Performance Analysis: File Creation
- Bucket capacity: 50.000 records
- 150.000 random inserts by a single client
- With flow control (FC) or without
- Charts: file creation time and average insert time
101 Discussion
- Creation time is almost linearly scalable
- Flow control is quite expensive
– losses without it were negligible
- Both schemes perform almost equally well
– RP*c slightly better, as one could expect
- Insert time about 30 times faster than for a disk file
- Insert time appears bound by the client speed
102 Performance Analysis: File Creation
- File created by 120.000 random inserts by 2 clients
- Without flow control
- Charts: file creation by two clients (total time and per insert); comparative file creation time by one or two clients
103 Discussion
- Performance improves
- Insert times appear bound by a server's speed
- More clients would not improve the performance of a server
104 Performance Analysis: Split Time
- Chart: split times versus bucket capacity
105 Discussion
- Roughly linear scalability as a function of bucket size
- Larger buckets are more efficient
- Splitting is very efficient
– reaching as little as 40 μs per record
106 Performance Analysis: Key Search
- A single client sends 100.000 successful random search requests
- Flow control: the client sends at most 10 requests without a reply
- Chart: search time (ms)
107 Performance Analysis: Key Search
- Same experiment: a single client sends 100.000 successful random search requests, with at most 10 outstanding requests
- Charts: total search time and search time per record
108 Discussion
- Single-search time is about 30 times faster than for a disk file
– about 350 μs per search
- Search throughput is more than 65 times faster than that of a disk file
– about 145 μs per search
- RP*N again appears surprisingly efficient with respect to RP*c for more buckets
109 Performance Analysis: Range Query
- Deterministic termination
- Parallel scan of the entire file, with all 100.000 records sent to the client
- Charts: range query total time and range query time per record
110 Discussion
- Range search also appears very efficient
– reaching 10 μs per record delivered
- More servers should further improve the efficiency
– the curves do not become flat yet
111 Range Query: Parallel Execution Strategies (study by M. Tsangou (Master's thesis) & Prof. Samba, U. Dakar)
- Chart: response time (ms) versus # of servers
– Sc. 3: 1 server at a time
– Sc. 1, 2: all servers together
– Sc. 1: single connection request per server
112 File Size Limits
- Bucket capacity: 751K records, 196 MB
- Number of inserts: 3M
- Flow control (FC) is necessary to limit the input queue at each server
113 File Size Limits
- Bucket capacity: 751K records, 196 MB
- Number of inserts: 3M
- GA: Global Average; MA: Moving Average
114 Related Works: Comparative Analysis (table; one competing result is marked as suspicious)
115 AMOS-SDDS
- Benchmark data
– table Pers (SS#, Name, City)
– size 20.000 to 300.000 tuples
– 50 cities
– random distribution
- Benchmark queries
– Join: SS# and Name of persons in the same city
» nested loop or local index
– Count & Join: count the couples in the same city
» to determine the result transfer time to the client
– Count(*) on Pers, Max(SS#) from Pers
- Measures
– scale-up & speed-up
– comparison to AMOS-II alone
116 Join: best time per tuple (chart)
- 20.000 tuples in Pers; 3,990,070 tuples produced
- AMOS-II alone: 13.5 (nested loop), 2.25 (index lookup); values on the slide: 14.4, 2.4, 1.6
117 Join & Count: best time per tuple, I-strategy (chart)
- 20.000 tuples in Pers; 3,990,070 tuples produced
- AMOS-II alone: 13.5 (nested loop), 2.25 (index lookup); values on the slide: 1.8, 1.0
118 Join: Speed-up, I-strategy (chart)
- 20.000 tuples in Pers; 3,990,070 tuples produced
- AMOS-II alone: 13.5 (nested loop), 2.25 (index lookup)
119 Count: Speed-up (chart)
- 100.000 tuples in Pers
- E-strategy wins
- AMOS-II alone: 280 ms; value on the slide: 341
120 Join Scale-up Performance
- The file scales to 300.000 tuples
- Spreading from 1 to 15 AMOS-SDDS servers
– transparently for the application!
- 3 servers per machine
– the poor man's configuration has only 5 server machines
- Results are extrapolated to 1 server per machine
– basically, the CPU component of the elapsed time is divided by 3
121 Join: Elapsed Time Scale-up (chart: AMOS-SDDS, I-strategy with index-lookup join)
122 Join: Elapsed Time Scale-up
File size: 20 000 | 60 000 | 100 000 | 160 000 | 200 000 | 240 000 | 300 000
# SDDS servers: 1 | 3 | 5 | 8 | 10 | 12 | 15
Result size: 3 990 070 | 35 970 600 | 99 951 670 | 255 924 270 | 399 906 670 | 575 889 600 | 899 865 000
Actual Join Time (s): 61 | 301 | 684 | 1817 | 2555 | 3901 | 5564
Join & Count (s): 51 | 185 | 335 | 986 | 1270 | 2022 | 2624
Join w. extrapolation (s): 61 | 301 | 684 | 1324 | 1920 | 2553 | 3815
Join & Count w. extrapolation (s): 51 | 185 | 335 | 498 | 640 | 681 | 881
AMOS-II (s): 46 | 430 | 1201 | 3106 | 4824 | 6979 | 10 933
Index creation time (s): 3 | 29 | 87 | 230 | 360 | 520 | 819
123 Join: Time per Tuple Scale-up (chart)
- Join with Count is flat!
- Better scalability than any current P-DBMS could provide
124 SD-AMOS: File Creation (chart: insert time (ms), global and moving averages, versus # of inserts and # of servers; b = 4.000)
- Flat & unexpectedly fast insert time
125 SD-AMOS: Large File Creation (chart: insert time (ms), global and moving averages, versus # of inserts and # of servers; b = 40.000)
- The flat & fast insert time remains
126 SD-AMOS: Large File Search (time per record)
- File of 300K records
- The client's maximum processing speed is reached
127 SD-AMOS: Very Large File Creation
- Bucket size: 750K records
- Max file size: 3M records
- Record size: 100 B
128 Conclusion
- SDDS-2000: a prototype SDDS manager for a Windows multicomputer
– several variants of LH* and RP*
– high availability
– scalable distributed database query processing: AMOS-SDDS & SD-AMOS
129 Conclusion
- The experimental performance of the SDDS schemes appears in line with the expectations
– record search & insert times in the range of a fraction of a millisecond
– about 30 to 100 times faster than disk-file access performance
– about ideal (linear) scalability, including the query processing
- The results prove the overall efficiency of the SDDS-2000 system
130 Current & Future Work
- SDDS-2000 implementation
- High availability through RS codes
– CERIA & U. Santa Clara (Prof. Th. Schwarz)
- Disk-based high-availability SDDSs
– CERIA & IBM Almaden (J. Menon)
- Parallel queries
– U. Dakar (Prof. S. Ndiaye) & CERIA
- Concurrency & transactions
– U. Dakar (Prof. T. Seck)
- Overall performance analysis
- SD-AMOS & SD-DBMS in general
– CERIA & U. Uppsala (Prof. T. Risch) & U. Dakar
- SD-SQL-Server?
- The extent depends basically on available funding
131 Credits
- SDDS-2000 implementation
– CERIA Ph.D. students
» F. Bennour (now post-doc) (SDDS-2000; LH*)
» A. Wan Diene, Y. Ndiaye (SDDS-2000 RP* & AMOS-SDDS & SD-AMOS)
» R. Moussa (RS parity subsystem)
– Master's student theses
» at CERIA and cooperating universities
– see the CERIA Web page: ceria.dauphine.fr
- Partial support for SDDS-2000 research
– HPL, IBM Research, Microsoft Research
132 Problems & Exercises
- Install SDDS-2000 and experiment with the interactive application. Comment on the experience in a few pages.
- Create your favorite application using the .h interfaces provided with the package.
- Comment in a few pages on the LH*RS goal and the way it works, on the basis of the Sigmod paper and the WDAS-2002 paper by Rim Moussa.
- Comment in a few pages on the strategies for scalable distributed hash joins according to the paper by D. Schneider & al. Your own?
- Can you propose how to deal with theta-joins?
- You should split a table under one SQL Server into two tables on two SQL Servers. You wish to generate RP*-like range partitioning on the key attribute(s), using standard SQL queries as much as possible.
– How can you find the record with the median (middle) key?
– What are the catalog tables where you can find the table size, the associated check constraints, indexes, triggers, stored procedures... to move to the new node?
– How will you move the records that should leave for the new node?
- Idem for a stored function under AMOS
- Idem for Oracle
- Idem for DB2
133 END
Thank you for your attention.
Witold Litwin, litwin@dauphine.fr, wlitwin@cs.berkeley.edu