A Scalable, Predictable Join Operator for Highly Concurrent Data Warehouses George Candea (EPFL & Aster Data) Neoklis Polyzotis (UC Santa Cruz) Radek Vingralek.

1 A Scalable, Predictable Join Operator for Highly Concurrent Data Warehouses George Candea (EPFL & Aster Data) Neoklis Polyzotis (UC Santa Cruz) Radek Vingralek (Aster Data)

2 Highly Concurrent Data Warehouses
Data analytics is a core service of any DW.
High query concurrency is becoming important.
At the same time, customers need predictability.
– Requirement of an actual customer: Increasing concurrency from one query to 40 should not increase latency by more than 6x.

3 Shortcoming of Existing Systems
DWs employ the query-at-a-time model.
– Each query executes as a separate physical plan.
Result: Concurrent plans contend for resources.
This creates a situation of “workload fear”.

4 Our Contribution: CJOIN
A novel physical operator for star queries.
– Star queries arise frequently in ad-hoc analytics.
Main ideas:
– A single physical plan for all concurrent queries.
– The plan is always “on”.
– Deep work sharing: I/O, join processing, storage.

5 Outline
Preliminaries
The CJOIN operator
Experimental study
Conclusions

6 Setting
We assume a star-schema DW.
We target the class of star queries.
Goal: Execute concurrent star queries efficiently.
– Low latency.
– Graceful scale-up.

7 Further Assumptions
Fact table is too large to fit in main memory.
Dimension tables are “small”.
– Example from TPC-DS: 2.5GB of dimension data for 1TB warehouse.
Indices and materialized views may exist.
Workload is volatile.

8 Outline
Preliminaries
The CJOIN operator
Experimental study
Conclusions

9 Design Overview
[Figure: components labeled Query Stream, Optimizer, Star Queries, Other Queries, CJOIN (Preprocessor, Filter, ..., Filter, Distributor), and Conventional Query Processor.]

10 Running Example
Schema: Fact Table F (measure m), Dimension X, Dimension Y.
Queries:
Q1: select COUNT(*) from F join X join Y where φ1(X) and ψ1(Y)
Q2: select SUM(F.m) from F join Y where ψ2(Y)
(Slide annotation on Q2: join X and TRUE(X).)

11 The CJOIN Operator
[Figure: a Continuous Scan of Fact Table F feeds the pipeline Preprocessor, Filter, Filter, Distributor; the outputs are COUNT (Q1) and SUM (Q2).]

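12 The CJOIN Operator
[Figure: the same pipeline with one Filter per dimension. Dimension X (region selected by Q1) backs Hash Table X; Dimension Y (regions Q1 ∧ ¬Q2, ¬Q1 ∧ Q2, Q1 ∧ Q2) backs Hash Table Y. Each hash-table entry carries a bit vector with one bit per query (Q1 Q2). Hash Table X: 11, and 01 for the entry marked *. Hash Table Y: 10, 01, 11, and 00 for the entry marked *. Query Start markers on the Continuous Scan of Fact Table F: Q1: a, Q2: b. Outputs: COUNT (Q1), SUM (Q2).]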

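13 Processing Fact Tuples
[Figure: fact tuples from the Continuous Scan of Fact Table F enter the Preprocessor carrying a bit vector with one bit per query (Q1 Q2), pass through the Filters backed by Hash Table X and Hash Table Y, and reach the Distributor, which feeds COUNT (Q1) and SUM (Q2). Hash Table X: 11, and 01 for the entry marked *. Hash Table Y: 10, 01, 11, and 00 for the entry marked *. Query Start markers: Q1: a, Q2: b.]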
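
The bit-vector filtering pictured on this slide can be illustrated with a minimal Python sketch. It assumes each Filter ANDs the tuple's bit vector with the vector stored for the joining dimension key (or with the "*" default when the key is absent) and that tuples whose vector becomes zero are dropped. The names `Filter` and `run_pipeline`, the column names, and all keys and measure values are hypothetical; only the bit-vector values come from the slide.

```python
from dataclasses import dataclass

Q1, Q2 = 0b10, 0b01   # one bit per query, written in the slide's "Q1 Q2" order


@dataclass
class Filter:
    fk: str        # foreign-key column probed in the fact tuple (hypothetical name)
    table: dict    # dimension key -> bit vector of queries that accept that key
    default: int   # bit vector for keys missing from the table (the "*" entry)

    def apply(self, fact, bits):
        # AND the tuple's bit vector with the vector of the joining dimension key.
        return bits & self.table.get(fact[self.fk], self.default)


def run_pipeline(facts, filters, active):
    """Push every fact tuple through all filters, then distribute to aggregates."""
    count_q1, sum_q2 = 0, 0
    for fact in facts:
        bits = active                  # Preprocessor: tuple is relevant to all queries
        for f in filters:
            bits = f.apply(fact, bits)
            if bits == 0:              # no query wants this tuple any more: drop it
                break
        if bits & Q1:                  # Distributor: route to COUNT for Q1
            count_q1 += 1
        if bits & Q2:                  # Distributor: route to SUM(F.m) for Q2
            sum_q2 += fact["m"]
    return count_q1, sum_q2


# Hash tables with the bit vectors shown on the slide; the keys are made up.
filter_x = Filter("x", {1: 0b11}, default=0b01)                        # Q2 ignores X
filter_y = Filter("y", {10: 0b10, 11: 0b01, 12: 0b11}, default=0b00)

FACTS = [
    {"x": 1, "y": 12, "m": 5},   # accepted by Q1 and Q2
    {"x": 2, "y": 11, "m": 7},   # accepted by Q2 only
    {"x": 2, "y": 10, "m": 3},   # dropped at the Y filter
    {"x": 1, "y": 10, "m": 4},   # accepted by Q1 only
    {"x": 1, "y": 99, "m": 9},   # Y key not in the table: dropped
]

RESULT = run_pipeline(FACTS, [filter_x, filter_y], Q1 | Q2)
print(RESULT)
```

Each fact tuple is read once and costs one hash probe per filter regardless of how many queries are registered, which is the kind of sharing the slides describe.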

14 Registering New Queries
[Figure: a new query Q3 arrives while Q1 and Q2 are running; the hash tables gain a Q3 bit column. The individual bit-vector values of the animation are not recoverable from the transcript.]
Q3: select AVG(F.m) from F join X where φ3(X) (slide annotation: join Y and TRUE(Y))
Query shown against Dimension X: select * from X where φ3(X)

15 Registering New Queries
[Figure: Q3 (select AVG(F.m) from F join X where φ3(X), annotated join Y and TRUE(Y)) is registered. Bit vectors now have three bits (Q1 Q2 Q3); the Continuous Scan carries a new Query Start marker Q3: c, labeled "Begin Q3", next to Q1: a and Q2: b; the Distributor feeds AVG (Q3) alongside COUNT (Q1) and SUM (Q2).]
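
A minimal Python sketch of registration as these two slides suggest it: the new query receives its own bit, that bit is set on the dimension tuples its predicate selects, and a start marker is recorded at the current scan position. It assumes that a dimension the query does not constrain (the TRUE predicate) gets the bit on every entry and on the "*" default. `DimensionFilter`, `register_query`, the keys, and the scan position are hypothetical.

```python
Q1, Q2, Q3 = 0b100, 0b010, 0b001   # written in "Q1 Q2 Q3" order


class DimensionFilter:
    def __init__(self, table, default):
        self.table = dict(table)   # dimension key -> bit vector
        self.default = default     # vector for keys absent from the table ("*")

    def register(self, qbit, selected_keys=None):
        """Set qbit for a new query; selected_keys=None means a TRUE predicate."""
        if selected_keys is None:
            # The query accepts every tuple of this dimension.
            self.default |= qbit
            for key in self.table:
                self.table[key] |= qbit
            return
        for key in selected_keys:
            # Keys not stored yet start from the default vector of older queries.
            self.table[key] = self.table.get(key, self.default) | qbit


def register_query(qbit, filters, predicates, scan_pos, start_marks):
    """predicates maps a dimension name to the keys its predicate selected."""
    for name, flt in filters.items():
        flt.register(qbit, predicates.get(name))
    start_marks[qbit] = scan_pos   # the query ends when the scan wraps back here


filters = {
    "X": DimensionFilter({1: 0b110}, default=0b010),
    "Y": DimensionFilter({10: 0b100, 11: 0b010, 12: 0b110}, default=0b000),
}
start_marks = {}

# Pretend "select * from X where φ3(X)" returned the X keys 1 and 2.
register_query(Q3, filters, {"X": [1, 2]}, scan_pos=4200, start_marks=start_marks)

print({k: format(v, "03b") for k, v in filters["X"].table.items()})
print({k: format(v, "03b") for k, v in filters["Y"].table.items()})
```

In this sketch registration touches only the dimension hash tables, not the fact table, which is what allows a query to join a scan that is already running.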

16 Properties of CJOIN Processing
CJOIN enables a deep form of work sharing:
– Join computation.
– Tuple storage.
– I/O.
Computational cost per tuple is low.
– Hence, CJOIN can sustain a high I/O throughput.
Predictable query latency.
– Continuous scan can provide a progress indicator.
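
One way to read the progress-indicator remark, as a hedged Python sketch: if a query is complete once the circular scan returns to the position where the query started, the fraction of the fact table covered since that position serves as a progress estimate. The function name `scan_progress` and its position-based interface are assumptions made for illustration, not something the slides specify.

```python
def scan_progress(start_pos: int, current_pos: int, table_size: int) -> float:
    """Fraction of the fact table the circular scan has covered since start_pos.

    Positions are tuple (or page) offsets in [0, table_size). The value is 0.0
    both at query start and at exact completion, so a real system would need to
    track completion separately.
    """
    return ((current_pos - start_pos) % table_size) / table_size


# A query that started at offset 700 of a 1000-tuple table, with the scan
# having wrapped around to offset 200, has seen half of the table.
HALF = scan_progress(700, 200, 1000)
print(HALF)
```

Because the estimate depends only on scan position, it is the same for every query that started at the same marker, independent of how many other queries are running.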

17 Other Details (in the paper)
Run-time optimization of Filter ordering.
Updates.
Implementation on multi-core systems.
Extensions:
– Column stores.
– Fact table partitioning.
– Galaxy schemata.
[Figure: Preprocessor, Filter x n, Distributor.]

18 Outline
Preliminaries
The CJOIN operator
Experimental study
Conclusions

19 Experimental Methodology
Systems:
– CJOIN prototype on top of Postgres.
– Postgres with shared scans enabled.
– Commercial system X.
We use the Star Schema Benchmark (SSB).
– Scale factor = 100 (100GB of data).
– Workload comprises parameterized SSB queries.
Hardware:
– Quad-core Intel Xeon.
– 8GB of shared RAM.
– RAID-5 array of four 15K RPM SAS disks.

20 Effect of Concurrency
Throughput increases with more concurrent queries.

21 Response Time Predictability
Query latency is predictable; no more workload fear.

22 Influence of Data Scale
CJOIN is effective even for small data sets.
Concurrency level: 128

23 Related Work
Materialized views [R+95, HRU96].
Multiple-query optimization [T88].
Work sharing:
– Staged DBs [HSA05].
– Scan sharing [F94, Z+07, Q+08].
– Aggregation [CR07].
BLINK [R+08].
Streaming database systems [M+02, B+04].

24 Conclusions
High query concurrency is crucial for DWs.
Query-at-a-time leads to poor performance.
Our solution: CJOIN.
– Target: Class of star queries.
– Deep work sharing: I/O, join, tuple storage.
– Efficient realization on multi-core architectures.
Experiments show an order of magnitude improvement over a commercial system.

25 THANK YOU!
http://people.epfl.ch/george.candea
http://www.cs.ucsc.edu/~alkis
http://www.asterdata.com

