Scalable Approximate Query Processing Florin Rusu
Data Explosion
– Data storage advancements
  – Price / capacity ($70 / 1 TB)
– Human-generated data
  – Web 2.0 & social networking: user data
  – Communication: network & web logs (eBay: 50 TB / day), Call Detail Records (CDRs)
– Scientific experiments
  – LHC (Large Hadron Collider)
  – SKA (Square Kilometer Array): 1 EB (10^18 bytes) / day
  – Sensor networks
04/19/2010
Large-Scale Data Analytics
– Traditional DB (OLTP)
  – Multi-user transaction processing
  – Optimized for specific workloads (views, indexes, …)
– Analytic processing (OLAP)
  – Data cubes: aggregate at different hierarchical levels; pre-defined aggregates, not flexible
  – Shared-nothing architectures (MPP)
    – Startups: Netezza, Greenplum, AsterData, Vertica, …
    – Parallel databases on clusters of computers
    – Storage layer (row store, column store, hybrid)
    – Compression
Interactive Data Analysis & Exploration
– Ad-hoc queries
– Compute statistical aggregates over all data
– Example: web log analysis
  – Documents (URL, Content)
  – UserVisits (IP, URL, Date, Duration)
  – “How much time did users spend searching for cars during the period May – July 2009?”
SELECT SUM(UV.Duration)
FROM Documents D, UserVisits UV
WHERE D.URL = UV.DocURL AND D.Content contains ‘car’ AND UV.Date between [ , ]
Roadmap
– Database query execution
– System design & implementation
  – DataBaseOnline (DBO)
– Approximation methods (theoretical analysis & practical implementation)
  – Sampling
  – Sketches
  – Sketches over samples
Query Execution
[Figure: example tables Documents (URL, Content) and UserVisits (IP, URL, Date, Duration), with the query plan σ(D), σ(UV) → ⋈ → Σ]
– Selections pushed down
– Sort-merge join
– Aggregate
Selection
[Figure: table scans over Documents and UserVisits feeding σ(D) and σ(UV)]
– Storage manager
– One thread for each table scan
– Project out unused columns
– Tuples are pipelined into the join
Documents after selection (URL): A, B, C, E, F, G, I, J
UserVisits after selection and projection (URL, Duration): (A, 45), (B, 60), (J, 30), (D, 90), (F, 15), (G, 10), (E, 20), (E, 35), (B, 25), (J, 35), (I, 25), (D, 40), (C, 50), (H, 75), (G, 90), (F, 5)
Sort-Merge Join – Sort Phase
– Sort tuples on join attribute
– Write sorted runs to disk
– Buffer space: UV(8)
Documents (URL), sorted: A, B, C, E, F, G, I, J
UserVisits Run 1 (URL, Duration): input (A, 45), (B, 60), (J, 30), (D, 90), (F, 15), (G, 10), (E, 20), (E, 35); sorted (A, 45), (B, 60), (D, 90), (E, 20), (E, 35), (F, 15), (G, 10), (J, 30)
UserVisits Run 2 (URL, Duration): input (B, 25), (J, 35), (I, 25), (D, 40), (C, 50), (H, 75), (G, 90), (F, 5); sorted (B, 25), (C, 50), (D, 40), (F, 5), (G, 90), (H, 75), (I, 25), (J, 35)
Sort-Merge Join – Merge Phase
[Figure: two snapshots of merging the sorted UserVisits runs with the sorted Documents URLs. In the first, (A, 45) matches URL A and Duration 45 is output; in the second, the merge has reached URL E.]
Aggregation
– Update the sum as tuples are produced
[Figure: the running sum goes from 0 to 45 after the first join result]
Final Result
– SUM(UV.Duration) = 445
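The execution walked through on the preceding slides can be reproduced on the example data. The following is a minimal sketch, assuming the tuples as transcribed from the slides; the function names `sorted_runs` and `merge_join_sum` are hypothetical, and everything is kept in memory, whereas a real engine would write the sorted runs to disk.

```python
import heapq

# Example data transcribed from the slides.
# Documents that survive the selection on Content (join key only).
docs = ["A", "B", "C", "E", "F", "G", "I", "J"]
# UserVisits after selection and projection: (URL, Duration).
visits = [("A", 45), ("B", 60), ("J", 30), ("D", 90), ("F", 15), ("G", 10),
          ("E", 20), ("E", 35), ("B", 25), ("J", 35), ("I", 25), ("D", 40),
          ("C", 50), ("H", 75), ("G", 90), ("F", 5)]


def sorted_runs(tuples, buffer_size):
    """Sort phase: cut the input into buffer-sized chunks and sort each chunk."""
    return [sorted(tuples[i:i + buffer_size])
            for i in range(0, len(tuples), buffer_size)]


def merge_join_sum(doc_urls, runs):
    """Merge phase plus aggregation: merge runs, join on URL, keep a running SUM."""
    doc_iter = iter(sorted(doc_urls))
    doc = next(doc_iter, None)
    total = 0
    for url, duration in heapq.merge(*runs):
        # Advance the Documents side until it catches up with the current URL.
        while doc is not None and doc < url:
            doc = next(doc_iter, None)
        if doc == url:
            total += duration  # update the sum as join tuples are produced
    return total


runs = sorted_runs(visits, buffer_size=8)  # buffer space UV(8), as on the slide
total = merge_join_sum(docs, runs)
print(total)  # 445
```

With a buffer of 8 tuples the two runs match the ones shown in the sort phase, and the sum equals the final result on the slide.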
What is the problem?
– TPC-H benchmark results (price / performance) at 10 TB scale
  – 928 hard disks (90 TB total storage capacity)
  – 16 × quad-core processors
  – 512 GB RAM
  – $1.5 million
– Load time: 55 hours
– Q1: linear scan over one table with aggregates on top
  – 1 query: 19 minutes
  – 9 queries: 3 hours (linear scaling)
Approximate Query Processing
[Figure: query result over time, contrasting traditional query processing with a result estimate and its confidence bounds]
SELECT SUM f(r1, r2, …, rn)
FROM R1 AS r1, R2 AS r2, …, Rn AS rn
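DBO System Architecture [Rusu et al. 2008]
[Figure: components are the DB Engine (query plan σ(UV), σ(D) → ⋈ → Σ, producing the query result), the Levelwise Step Controller, the In-Memory Join over UV' and D', and the Estimation Module, which outputs the approximate answer (result and confidence bounds)]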
Sampling [Dobra, Jermaine, Rusu & Xu 2009]
[Figure: example tables Documents and UserVisits with the query plan σ(D), σ(UV) → ⋈ → Σ]
– Control, coordinate & schedule data flow between operators
– Embed randomness in each operator
Sampling – Selection
– Data in random order
– Assign random timestamp to tuples
– Controller schedules data flow between operators
Documents (URL, Content, timestamp; X = filtered out): (J, car, 68), (F, car, 220), (C, car, 312), (D, phone, X), (A, car, 389), (B, car, 447), (G, car, 515), (H, PC, X), (I, car, 695), (E, car, 799)
[Figure: animation of selected tuples flowing into the In-Memory Join in timestamp order, with the estimate updated at each step]
– 50% input: 360; [-328, 1048] with 95% probability
– In-Memory Join capacity (10 tuples) exceeded: eliminate tuples such that variance is minimized
– 74% input: 258; [-293, 808] with 95% probability
– All input: 448; [3, 892] with 95% probability
Sampling Estimation – Intermediate Levels
– Query result estimator & variance estimator computed from result tuples found by In-Memory Join
– Confidence bounds derived with the Central Limit Theorem
– Solve optimization problem to keep bounds stable when tuples are deleted from In-Memory Join
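As an illustration of how Central Limit Theorem bounds are attached to a sampling estimate, here is a simplified single-table sketch. It is not the DBO estimator, which works on join results and solves an optimization problem when tuples are dropped. It assumes a uniform sample without replacement, and it encodes the example as one value per UserVisits tuple: the Duration if the tuple joins with a 'car' document, 0 otherwise. The function name and this encoding are assumptions made for illustration.

```python
import math
import random


def sum_estimate_with_bounds(sample, population_size, z=1.96):
    """Estimate SUM over a table from a uniform sample drawn without replacement.

    Returns (estimate, lower, upper); z = 1.96 gives roughly 95% bounds
    when the CLT approximation holds.
    """
    n = len(sample)
    mean = sum(sample) / n
    # Unbiased sample variance of the per-tuple values.
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    estimate = population_size * mean
    # Standard error of the scaled-up sum, with finite population correction.
    fpc = 1 - n / population_size
    std_err = population_size * math.sqrt(var / n * fpc)
    return estimate, estimate - z * std_err, estimate + z * std_err


# Assumed encoding of the example: Duration if the visit joins, else 0.
values = [45, 60, 30, 0, 15, 10, 20, 35, 25, 35, 25, 0, 50, 0, 90, 5]

rng = random.Random(0)
half = rng.sample(values, 8)  # 50% of the input
est, lo, hi = sum_estimate_with_bounds(half, len(values))
full_est, full_lo, full_hi = sum_estimate_with_bounds(values, len(values))
print(est, lo, hi)
print(full_est, full_lo, full_hi)
```

In this simplified setting the finite population correction shrinks the interval as more of the table is read.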
Sampling – Join (Sort)
– Sort tuples on random function of join attribute
Documents (URL, hash): Run 1 input (J, 888), (F, 67), (C, 489), (A, 227), sorted (F, 67), (A, 227), (C, 489), (J, 888); Run 2 input (B, 987), (G, 51), (I, 342), (E, 739), sorted (G, 51), (I, 342), (E, 739), (B, 987)
UserVisits (URL, Duration, hash): Run 1 sorted (D, 90, 43), (G, 10, 51), (F, 15, 67), (A, 45, 227), (E, 20, 739), (E, 35, 739), (J, 30, 888), (B, 60, 987); Run 2 sorted (D, 40, 43), (G, 90, 51), (F, 5, 67), (H, 75, 150), (I, 25, 342), (C, 50, 489), (J, 35, 888), (B, 25, 987)
Sampling – Join (Merge)
[Figure: three snapshots of merging the runs in hash order; tuples with the same hash value, e.g. (G, 10, 51) and (G, 90, 51), reach the In-Memory Join together]
– …% input: 468; [194, 741] with 95% probability
Sampling Estimation – Upper Level
– Bernoulli sampling with probability given by domain fraction seen so far
– Consolidate tuples generated by same join key
– Solve optimization problem to minimize variance across levels
  – Keep confidence bounds stable
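The first two bullets can be illustrated with the hash values from the sort slides. This is a hedged sketch, not the DBO estimator: it assumes the hash range is [0, 1000), treats every join key whose hash falls below a threshold as "seen", and scales the partial sum by the fraction of the hash domain covered. The variance optimization across levels is not shown, and the names `upper_level_estimate`, `HASH_RANGE` and `key_hash` are hypothetical.

```python
# Hash value per join key, transcribed from the slides.
# The range [0, 1000) is an assumption made for this sketch.
HASH_RANGE = 1000
key_hash = {"A": 227, "B": 987, "C": 489, "D": 43, "E": 739,
            "F": 67, "G": 51, "H": 150, "I": 342, "J": 888}

docs = ["A", "B", "C", "E", "F", "G", "I", "J"]
visits = [("A", 45), ("B", 60), ("J", 30), ("D", 90), ("F", 15), ("G", 10),
          ("E", 20), ("E", 35), ("B", 25), ("J", 35), ("I", 25), ("D", 40),
          ("C", 50), ("H", 75), ("G", 90), ("F", 5)]


def upper_level_estimate(doc_urls, visit_tuples, hash_threshold):
    """Join only keys with hash below the threshold, then scale up by 1/p."""
    # Fraction of the hash domain seen so far = Bernoulli inclusion probability.
    p = hash_threshold / HASH_RANGE
    seen_docs = {u for u in doc_urls if key_hash[u] < hash_threshold}
    # Consolidate all tuples generated by the same join key into one value.
    per_key = {}
    for url, duration in visit_tuples:
        if url in seen_docs:
            per_key[url] = per_key.get(url, 0) + duration
    return sum(per_key.values()) / p, per_key


half_estimate, half_keys = upper_level_estimate(docs, visits, 500)
full_estimate, _ = upper_level_estimate(docs, visits, HASH_RANGE)
print(half_estimate)  # 480.0
print(full_estimate)  # 445.0
```

Because all tuples with the same key share one hash value, a key is either entirely in the sample or entirely out of it, which is why the per-key consolidation is done before the estimate is formed.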
Contributions
– Design & implement DBO, first online analytical processing engine
  – Provide estimates & confidence bounds throughout entire query execution
  – SELECT-PROJECT-JOIN (SPJ) & GROUP BY queries over any number of relations
– Design & analyze fastest convergent estimation method for online aggregation
  – Statistics & optimization techniques
Sketches
– Build sketches on join attribute while data is read from disk
– Use attributes in aggregate
[Figure: animation of sketch counters S1, S2, … being updated for Documents and UserVisits as tuples over the URL domain A to J are read; the final frame shows S1 230, S2 490]
– Estimate: …; [-416, 876] with 95% probability
Sketches Estimation
– Two random processes
  – Bucket selection
  – Sign
– Sketch update
– Estimator
– Confidence bounds
  – Multiple independent sketches
  – Chebyshev & Chernoff inequalities (worst-case)
  – Median Central Limit Theorem, Student-t distribution (statistics)
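The two random processes, the update, and the estimator can be sketched in the style of a Fast-AGMS sketch. This is a minimal illustration under stated assumptions: SHA-256 stands in for the bucket and sign generators (it is not the generator used in the talk), the class and function names are hypothetical, the Documents side is sketched with weight 1 and the UserVisits side with weight Duration, and the median over independent copies is used as the combined estimate. Confidence bounds are not computed.

```python
import hashlib
import statistics


def _hash(key, seed, salt):
    """Stand-in hash function (assumption): 64 bits derived from SHA-256."""
    digest = hashlib.sha256(f"{seed}:{salt}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


class FastAGMSSketch:
    def __init__(self, buckets, seed):
        self.buckets = buckets
        self.seed = seed
        self.counters = [0] * buckets

    def update(self, key, weight=1):
        # Random process 1: bucket selection.
        bucket = _hash(key, self.seed, "bucket") % self.buckets
        # Random process 2: sign (+1 or -1).
        sign = 1 if _hash(key, self.seed, "sign") & 1 else -1
        self.counters[bucket] += sign * weight

    def join_estimate(self, other):
        # Estimator: inner product of the two counter arrays.
        return sum(a * b for a, b in zip(self.counters, other.counters))


def estimate_join_sum(doc_urls, visit_tuples, buckets=64, copies=5):
    """Median of several independent sketch estimates of SUM(Duration) over the join."""
    estimates = []
    for seed in range(copies):
        # Both relations must share the same seed so that keys collide consistently.
        s_d = FastAGMSSketch(buckets, seed)
        s_uv = FastAGMSSketch(buckets, seed)
        for url in doc_urls:
            s_d.update(url)
        for url, duration in visit_tuples:
            s_uv.update(url, duration)
        estimates.append(s_d.join_estimate(s_uv))
    return statistics.median(estimates)


single_key = estimate_join_sum(["A"], [("A", 45), ("A", 5)])
print(single_key)  # 50
```

When only one key is present the signs cancel out and the estimate is exact; with many keys, bucket collisions contribute error, which is what the confidence bounds quantify.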
Pseudo-Random Number Generators [Rusu & Dobra 2006, 2007b]
– Detailed comparison of generating schemes
  – Abstract algebra (orthogonal arrays, vector spaces, prime & extension fields)
    – Degree of independence as function of seed size
    – Fast range-summable
  – Empirical evaluation
    – Generating time is a few processor cycles
– Identify EH3 as generator for sketches
  – Lowest possible degree of independence
  – 7.3 ns to generate single number
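For readers unfamiliar with EH3, the following sketch shows the construction as it is commonly stated: the sign for index i is (-1) raised to S0 XOR (S · i) XOR h(i), where S · i is the inner product over GF(2) and h(i) XORs together the OR of adjacent bit pairs of i. The bit ordering, the seed layout, and the function name `eh3` are assumptions made here; the slide gives no formula.

```python
def eh3(seed_bit, seed, i, n_bits):
    """Return +1 or -1 for index i under a (1 + n_bits)-bit seed (seed_bit, seed)."""
    # Linear part: parity of the bitwise AND is the GF(2) inner product S · i.
    linear = bin(seed & i).count("1") & 1
    # Nonlinear part h(i): XOR over adjacent bit pairs of (i_k OR i_{k+1}).
    h = 0
    for k in range(0, n_bits, 2):
        h ^= ((i >> k) | (i >> (k + 1))) & 1
    return -1 if (seed_bit ^ linear ^ h) else 1


# Exhaustive check over all seeds for a 4-bit domain: any two distinct
# indices are uncorrelated when averaged over the seed space.
n = 4
seeds = [(b, s) for b in (0, 1) for s in range(1 << n)]
correlations = [
    sum(eh3(b, s, i, n) * eh3(b, s, j, n) for b, s in seeds)
    for i in range(1 << n) for j in range(i + 1, 1 << n)
]
print(all(c == 0 for c in correlations))  # True
```

The generator needs only a few bitwise operations per value, which is consistent with the small generation cost reported on the slide.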
Statistical Analysis [Rusu & Dobra 2007a, 2008]
– Detailed comparison of sketch estimators
  – Same accuracy (worst-case analysis)
  – Statistical analysis
    – Distribution (probability density function)
    – Higher frequency moments (kurtosis)
    – Confidence bounds
  – Empirical evaluation
    – Data skew, correlation, memory usage, update time
– Identify Fast-AGMS as most reliable scheme
  – Accurate over entire range of data
  – Small memory footprint, fast update time
Sketches over Samples [Rusu & Dobra 2009]
– Data is random on disk
– Build sketches on join attribute while data is read from disk
– Use attributes in aggregate
– Provide estimates at any point
[Figure: Documents in random order (J, F, C, D, A, B, G, H, I, E) and UserVisits, with sketches S1, S2, … built over the portion read so far]
– …% input: 100; [-2382, 2582] with 95% probability
Sketches over Samples – Estimation
– Define estimator over two completely different random processes & analyze statistically
  – Sampling: random partition, tuple domain
  – Sketches: random projection, frequency domain
  – Consider correlation between multiple sketches that share same sample
  – Moment generating functions
– Generic analysis independent of sampling process
  – Bernoulli sampling
  – Sampling without replacement
  – Sampling with replacement
Sketches over Samples – Analysis
Var[sketch over samples] = Var[samples] + Var[sketch] + Var[interaction]
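A minimal sketch of how the two random processes compose, for the Bernoulli sampling case only: each relation is sampled independently with probability p or q, only the sampled tuples are sketched, and the sketch estimate is scaled by 1 / (p · q). SHA-256 again stands in for the sketch generators, all function names are hypothetical, and the variance terms in the decomposition above are not computed.

```python
import hashlib
import random


def _h(key, seed, salt):
    """Stand-in hash function (assumption): 64 bits derived from SHA-256."""
    d = hashlib.sha256(f"{seed}:{salt}:{key}".encode()).digest()
    return int.from_bytes(d[:8], "big")


def sketch(tuples, buckets, seed):
    """Sketch of (key, weight) tuples: random bucket plus random sign per key."""
    counters = [0] * buckets
    for key, weight in tuples:
        sign = 1 if _h(key, seed, "sign") & 1 else -1
        counters[_h(key, seed, "bucket") % buckets] += sign * weight
    return counters


def bernoulli_sample(tuples, p, rng):
    """Keep each tuple independently with probability p."""
    return [t for t in tuples if rng.random() < p]


def sketch_over_sample_estimate(r, s, p, q, buckets=64, seed=0, rng=None):
    """Sketch independent Bernoulli samples of r and s, then scale by 1 / (p * q)."""
    rng = rng or random.Random(0)
    sk_r = sketch(bernoulli_sample(r, p, rng), buckets, seed)
    sk_s = sketch(bernoulli_sample(s, q, rng), buckets, seed)
    raw = sum(a * b for a, b in zip(sk_r, sk_s))
    # Sampling shrinks each side's expected frequencies by p and q respectively.
    return raw / (p * q)


docs = [(u, 1) for u in ["A", "B", "C", "E", "F", "G", "I", "J"]]
visits = [("A", 45), ("B", 60), ("J", 30), ("D", 90), ("F", 15), ("G", 10),
          ("E", 20), ("E", 35), ("B", 25), ("J", 35), ("I", 25), ("D", 40),
          ("C", 50), ("H", 75), ("G", 90), ("F", 5)]

partial = sketch_over_sample_estimate(docs, visits, p=0.5, q=0.5)
exact_case = sketch_over_sample_estimate([("A", 1)], [("A", 45), ("A", 5)], 1.0, 1.0)
print(partial, exact_case)
```

With p = q = 1 the sampling step keeps everything and the estimator reduces to a plain sketch estimate; smaller probabilities trade accuracy for less data read.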
Conclusions
– Data explosion
  – Cheap, high-capacity storage
  – Current processing technology is too expensive for the performance it provides
– Framework for online analytical processing
  – DBO system architecture
    – Embed randomization into data processing
    – Provide estimates and bounds at any time
  – Approximation methods
    – Sampling: most flexible
    – Sketches: single pass
    – Sketches over samples: fastest
Future Work
– Short term
  – Define & design query optimization for DBO
  – Extend DBO to other types of queries and with other approximation techniques (end-biased samples, histograms, …)
  – Generalize sketches to multiple relations
  – Find optimal amount of data to sketch
  – Fully integrate sketches into DBO system
– Medium term
  – Develop data aggregation & approximation techniques for other types of architectures
    – Multicore processors, GPUs
    – Distributed processing (Map-Reduce, Hadoop, …)
– Long term
  – Design & build scalable analytic processing system
    – Aggregation & approximation
Publications
– A. Dobra, C. Jermaine, F. Rusu, F. Xu. Turbo-Charging Estimate Convergence in DBO. In VLDB.
– F. Rusu and A. Dobra. Sketching Sampled Data Streams. In ICDE.
– F. Rusu et al. The DBO Database System. In SIGMOD 2008 (demo).
– F. Rusu and A. Dobra. Sketches for Size of Join Estimation. In TODS, vol. 33, no. 3.
– F. Rusu and A. Dobra. Pseudo-Random Number Generation for Sketch-Based Estimations. In TODS, vol. 32, no. 2.
– F. Rusu and A. Dobra. Statistical Analysis of Sketch Estimators. In SIGMOD.
– F. Rusu and A. Dobra. Fast Range-Summable Random Variables for Efficient Aggregate Estimation. In SIGMOD.