Ten Thousand SQLs Kalmesh Nyamagoudar 2010MCS3494
CONTENTS
- Example
- Definitions
- Algorithm
- CN Generation
- CN Evaluation
  - Sequential Algorithm
  - CLP : Naïve
  - CLP : New
  - OLP
  - DLP
- Performance Studies
BANKS Model : Steiner Trees
[Figure: two trees, one over Author1, Author2 and Paper1, the other over Author1, Author2 and Paper2]
DISCOVER Model
Schema: AUTHOR(TID, NAME), WRITES(TID, AID, PID), PAPER(TID, NAME), CITE(TID, PID1, PID2)
Joining Network Of Tuples:
- Author1 ⋈ (Author1: Paper1) ⋈ Paper1 ⋈ (Author2: Paper1) ⋈ Author2
- Author1 ⋈ (Author1: Paper2) ⋈ Paper2 ⋈ (Author2: Paper2) ⋈ Author2
Joining Network Of Tuple Sets:
- Author^{Author1} ⋈ Writes{} ⋈ Paper{} ⋈ Writes{} ⋈ Author^{Author2}
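Each joining network of tuple sets corresponds to one SQL statement. As a minimal sketch, the network Author^{Author1} ⋈ Writes{} ⋈ Paper{} ⋈ Writes{} ⋈ Author^{Author2} can be evaluated with Python's sqlite3 module. The sample rows, the exact-match keyword test and the helper name evaluate_cn are assumptions made for illustration, and the "contains no keyword" restriction on the free tuple sets Writes{} and Paper{} is not enforced here.

```python
import sqlite3

# Toy instance of the slide's schema; the rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE AUTHOR (TID INTEGER PRIMARY KEY, NAME TEXT);
CREATE TABLE PAPER  (TID INTEGER PRIMARY KEY, NAME TEXT);
CREATE TABLE WRITES (TID INTEGER PRIMARY KEY, AID INTEGER, PID INTEGER);
INSERT INTO AUTHOR VALUES (1, 'Author1'), (2, 'Author2'), (3, 'Author3');
INSERT INTO PAPER  VALUES (1, 'Paper1'), (2, 'Paper2'), (3, 'Paper3');
INSERT INTO WRITES VALUES (1, 1, 1), (2, 2, 1), (3, 1, 2),
                          (4, 2, 2), (5, 3, 3), (6, 1, 3);
""")

# One SQL statement for the candidate network
# Author^{kw1} JOIN Writes{} JOIN Paper{} JOIN Writes{} JOIN Author^{kw2}.
CN_SQL = """
SELECT a1.NAME, p.NAME, a2.NAME
FROM AUTHOR a1
JOIN WRITES w1 ON w1.AID = a1.TID
JOIN PAPER  p  ON p.TID  = w1.PID
JOIN WRITES w2 ON w2.PID = p.TID
JOIN AUTHOR a2 ON a2.TID = w2.AID
WHERE a1.NAME = ? AND a2.NAME = ?
ORDER BY p.NAME
"""


def evaluate_cn(keyword1, keyword2):
    """Hypothetical helper: run the CN's SQL for two author keywords."""
    return conn.execute(CN_SQL, (keyword1, keyword2)).fetchall()


results = evaluate_cn("Author1", "Author2")
print(results)
```

On this toy data the query returns the two joining networks of tuples from the slide, one through Paper1 and one through Paper2. Paper3 is skipped because Author2 did not write it.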
Background : DISCOVER October 13, 2011
Background : DISCOVER - Schema Graph (TPC-H)
Background : DISCOVER - Example Data (Source : Discover[3])
Background : DISCOVER - Query: "Smith, Miller" (Source : Discover[3])
Joining Network Of Tuples:
SIZE  RESULT
2     O1 ⋈ C1 ⋈ O2
4     O1 ⋈ C1 ⋈ N1 ⋈ C2 ⋈ O3
Background : DISCOVER - Joining Network Of Tuple Sets (Source : Discover[2])
Background : DISCOVER - Candidate Networks Generation
- Complete : every possible MTJNT (minimal total joining network of tuples) is produced by a candidate network output by the algorithm
- Minimal : the algorithm does not produce any redundant candidate networks
- Tmax : maximum number of tuple sets in a CN
Example:
- ORDERS^{Smith} ⋈ CUSTOMER{} ⋈ ORDERS^{Miller}
- ORDERS^{Smith} ⋈ CUSTOMER{} ⋈ ORDERS^{Miller} ⋈ CUSTOMER{}
- ORDERS^{Smith} ⋈ CUSTOMER{} ⋈ ORDERS{}
- ORDERS^{Smith} ⋈ LINEITEM{} ⋈ ORDERS^{Miller}
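To make the completeness and minimality conditions concrete, here is a hedged sketch that tests a path-shaped network of tuple sets against a simplified reading of the pruning rules in DISCOVER[2]. The function name is_candidate_network, the list-of-pairs encoding and the restriction to paths are assumptions for illustration. The foreign keys are the TPC-H ones: ORDERS references CUSTOMER, and LINEITEM references ORDERS.

```python
# Foreign keys in the TPC-H fragment of the example: (referencing, referenced).
FOREIGN_KEYS = {("ORDERS", "CUSTOMER"), ("LINEITEM", "ORDERS")}


def is_candidate_network(path, keywords, t_max):
    """Check a path-shaped network of tuple sets.

    path: list of (relation, keyword); keyword None marks a free tuple set R{}.
    """
    if not path or len(path) > t_max:
        return False  # empty, or more than Tmax tuple sets
    for (r1, _), (r2, _) in zip(path, path[1:]):
        if (r1, r2) not in FOREIGN_KEYS and (r2, r1) not in FOREIGN_KEYS:
            return False  # neighbours are not adjacent in the schema graph
    if {kw for _, kw in path if kw is not None} != set(keywords):
        return False  # not total: some keyword is missing
    if path[0][1] is None or path[-1][1] is None:
        return False  # a free tuple set as a leaf can be dropped: not minimal
    for (left, _), (mid, _), (right, _) in zip(path, path[1:], path[2:]):
        if left == right and (mid, left) in FOREIGN_KEYS:
            # One `mid` tuple references exactly one `left` tuple, so both
            # neighbours would have to be the same tuple.
            return False
    return True


examples = [
    [("ORDERS", "Smith"), ("CUSTOMER", None), ("ORDERS", "Miller")],
    [("ORDERS", "Smith"), ("CUSTOMER", None), ("ORDERS", "Miller"), ("CUSTOMER", None)],
    [("ORDERS", "Smith"), ("CUSTOMER", None), ("ORDERS", None)],
    [("ORDERS", "Smith"), ("LINEITEM", None), ("ORDERS", "Miller")],
]
verdicts = [is_candidate_network(p, {"Smith", "Miller"}, t_max=4) for p in examples]
print(verdicts)
```

Under these simplified rules only the first example network is accepted: the second ends in a free tuple set, the third does not cover "Miller", and the fourth would need one LINEITEM tuple to reference two different ORDERS tuples.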
CN Generation (Source : Discover[2])
CN Evaluation
Sequential Algorithm : Example (Source : TTS[1])
Dataset : DBLP
Schema: AUTHOR(TID, NAME), WRITE(TID, AID, PID), PAPER(TID, NAME), CITE(TID, PID1, PID2)
CN Evaluation : state-of-the-art sequential algorithm
Sequential Algorithm : Execution Graph (Source : TTS[1])
New Solution
- Use of multi-core architecture
- Why not existing parallel multi-query processing?
  - Large number of queries
  - Large sharing between queries
  - Large intermediate results
- What do we need on multi-core architectures?
  - CNs in the same core share the most computational cost
  - CNs in different cores share the least computational cost
  - Handle high workload skew
  - Handle errors caused by estimation adaptively
CN Level Parallelism (CLP) : Straightforward Approach (Source : TTS[1])
- Largest first rule : the CN with the largest cost is distributed first, to the partition with the least workload
- Select the core : O(n)
- Final Cost : max(cost of each core) = 1949
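The straightforward approach can be sketched as a greedy partitioning. This is a minimal illustration, not the code from TTS[1]: the function name partition_largest_first and the sample costs are hypothetical, and reading "select the core : O(n)" as a linear scan over the per-core loads is an assumption about what n denotes.

```python
def partition_largest_first(cn_costs, num_cores):
    """cn_costs: {cn_name: estimated cost}. Returns (assignment, final_cost)."""
    loads = [0] * num_cores
    assignment = [[] for _ in range(num_cores)]
    # Largest first: visit CNs in decreasing order of estimated cost.
    for name, cost in sorted(cn_costs.items(), key=lambda kv: -kv[1]):
        # Linear scan for the partition with the least workload.
        core = min(range(num_cores), key=loads.__getitem__)
        assignment[core].append(name)
        loads[core] += cost
    # Final cost: the most loaded core determines the elapsed time.
    return assignment, max(loads)


# Hypothetical costs, not the figures behind the slide's 1949.
costs = {"CN1": 7, "CN2": 5, "CN3": 4, "CN4": 3, "CN5": 1}
assignment, final_cost = partition_largest_first(costs, num_cores=2)
print(assignment, final_cost)
```

This version ignores sub-expressions shared between CNs, so two CNs placed on different cores may repeat the same joins.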
CLP : Sharing-Aware CN Partitioning
- Which CN to distribute first? The one with the largest not-shared (extra) cost.
- To which partition? The one with maximum sharing, if that does not destroy the workload balance.
- Total cost for a partition = cost after sharing sub-expressions among all CNs in that partition
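The two questions above can be turned into a small greedy sketch. It is a simplified illustration under stated assumptions, not the algorithm from TTS[1]: each CN is modeled as a set of sub-expression ids, a partition's cost is the summed cost of its distinct sub-expressions, and the hypothetical slack parameter stands in for the "does not destroy the workload balance" test.

```python
def partition_sharing_aware(cns, op_cost, num_cores, slack=2.0):
    """cns: {cn_name: set of sub-expression ids}; op_cost: {id: cost}."""
    parts = [set() for _ in range(num_cores)]  # sub-expressions per core
    loads = [0] * num_cores                    # cost after sharing, per core
    assignment = [[] for _ in range(num_cores)]
    remaining = dict(cns)

    def extra(ops, core):
        # Cost of the sub-expressions that `core` does not already compute.
        return sum(op_cost[o] for o in ops - parts[core])

    while remaining:
        # Distribute first the CN whose cheapest placement is most expensive
        # (ties broken by name to keep the result deterministic).
        name = max(
            remaining,
            key=lambda n: (min(extra(remaining[n], c) for c in range(num_cores)), n),
        )
        ops = remaining.pop(name)
        new_loads = [loads[c] + extra(ops, c) for c in range(num_cores)]
        best_share = min(range(num_cores), key=lambda c: extra(ops, c))
        least_loaded = min(range(num_cores), key=lambda c: new_loads[c])
        # Prefer maximum sharing unless it unbalances the workload (assumed test).
        if new_loads[best_share] <= slack * new_loads[least_loaded]:
            core = best_share
        else:
            core = least_loaded
        loads[core] = new_loads[core]
        parts[core] |= ops
        assignment[core].append(name)
    return assignment, loads


# Hypothetical sub-expressions and costs.
op_cost = {"x": 5, "y": 4, "z": 3, "w": 2}
cns = {"CN1": {"x", "y"}, "CN2": {"x", "z"}, "CN3": {"w"}, "CN4": {"y", "w"}}
assignment, loads = partition_sharing_aware(cns, op_cost, num_cores=2)
print(assignment, loads)
```

In this toy run CN1 and CN2 land on the same core because they share sub-expression x, while CN4 is pushed to the other core once the first one becomes too loaded.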
[Figure sequence (5 steps): nodes labeled A, P, W and C being distributed across Core 1, Core 2 and Core 3; labels: CNMinCost, MaxHeap, Non-Exec Graph of Core 3]
CLP : Sharing-Aware CN Partitioning (Source : TTS[1])
CLP : Sharing-Aware CN Partitioning : Initialization (Source : TTS[1])
CLP : Error Accumulation (Source : TTS[1])
Operator Level Parallelism (OLP) (Source : TTS[1])
OLP : Overcoming Error Accumulation
OLP : Overcoming Accumulated Cost (Source : TTS[1])
Data Level Parallelism (DLP)
- Each operation in the execution graph GE can be performed on multiple cores
- Uses the operation level parallelism if there is no workload skew
- Partitions data adaptively each time before workload skew happens
- Which node to partition? The most costly node, if it is dominant
- When to merge the sub-results? At the final phase
Data Level Parallelism (Source : TTS[1])
[Figure: execution graph spread over Core 1, Core 2 and Core 3]
Data Level Parallelism (Source : TTS[1])
- Select the child node to be partitioned
- Divide the tuples of the child node
- Make copies of the selected child node and all its father nodes
- Add the corresponding edges
- Re-estimate
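The partition-then-merge idea can be illustrated with a small sketch. It is an assumption-laden toy, not the procedure from TTS[1]: it splits the input once and statically (no adaptive re-estimation), the names hash_join and partitioned_join and the sample rows are hypothetical, and Python threads are used only to show the structure, since CPython threads do not give real CPU parallelism.

```python
from concurrent.futures import ThreadPoolExecutor


def hash_join(left, right, left_key, right_key):
    """Join two lists of dict rows on left[left_key] == right[right_key]."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    return [(l, r) for l in left for r in index.get(l[left_key], [])]


def partitioned_join(left, right, left_key, right_key, num_cores):
    # Divide the tuples of the selected child node (here `left`) round-robin.
    chunks = [left[i::num_cores] for i in range(num_cores)]
    with ThreadPoolExecutor(max_workers=num_cores) as pool:
        # Each worker runs its own copy of the join on one chunk.
        futures = [
            pool.submit(hash_join, chunk, right, left_key, right_key)
            for chunk in chunks
        ]
        parts = [f.result() for f in futures]
    # Merge the sub-results at the final phase.
    return [pair for part in parts for pair in part]


writes = [{"AID": 1, "PID": 1}, {"AID": 2, "PID": 1},
          {"AID": 1, "PID": 2}, {"AID": 3, "PID": 3}]
papers = [{"TID": 1, "NAME": "Paper1"}, {"TID": 2, "NAME": "Paper2"}]

single = {(w["AID"], p["NAME"]) for w, p in hash_join(writes, papers, "PID", "TID")}
merged = {(w["AID"], p["NAME"])
          for w, p in partitioned_join(writes, papers, "PID", "TID", num_cores=3)}
print(merged == single)
```

Because every chunk is joined against the full other input, the union of the sub-results equals the single-core result; only the order of the output tuples may differ.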
Performance Studies (Source : TTS[1])
References
1. Lu Qin, Jeffrey Xu Yu, Lijun Chang. Ten Thousand SQLs: Parallel Keyword Queries Computing. Proceedings of the VLDB Endowment, Volume 3, Issue 1-2, September 2010, Singapore.
2. Vagelis Hristidis, Yannis Papakonstantinou. DISCOVER: Keyword Search in Relational Databases. VLDB '02: Proceedings of the 28th International Conference on Very Large Data Bases, Hong Kong.
3. [PPT] DISCOVER: Keyword Search in Relational Databases.