Published by Kristin Thompson. Modified over 8 years ago.
1
Scalable Approximate Query Processing
Florin Rusu
2
Data Explosion
Data storage advancements
– Price / capacity ($70 / 1 TB)
Human generated
– Web 2.0 & social networking
    User data
– Communication
    Network & web logs (eBay – 50 TB / day)
    Call Detail Records (CDRs)
Scientific experiments
– LHC (Large Hadron Collider)
– SKA (Square Kilometer Array) – 1 EB (10^18) / day
– Sensor networks
04/19/2010
3
Large-Scale Data Analytics
Traditional DB (OLTP)
– Multi-user transaction processing
– Optimized for specific workloads (views, indexes, …)
Analytic processing (OLAP)
– Data cubes
    Aggregate at different hierarchical levels
    Pre-defined aggregates, not flexible
– Shared-nothing architectures (MPP)
    Startups: Netezza, Greenplum, AsterData, Vertica, …
    Parallel databases on clusters of computers
    Storage layer (row store, column store, hybrid)
    Compression
4
Interactive Data Analysis & Exploration
Ad-hoc queries
Compute statistical aggregates over all data
Example: web log analysis
– Documents (URL, Content)
– UserVisits (IP, URL, Date, Duration)
– “How much time did users spend searching for cars during the period May – July 2009?”

SELECT SUM(UV.Duration)
FROM Documents D, UserVisits UV
WHERE D.URL = UV.DocURL
  AND D.Content contains 'car'
  AND UV.Date between [05-01-09, 07-31-09]
5
Roadmap
Database query execution
System design & implementation
– DataBaseOnline (DBO)
Approximation methods (theoretical analysis & practical implementation)
– Sampling
– Sketches
– Sketches over samples
6
Query Execution

Documents
URL  Content
A    car
B
C
D    phone
E    car
F
G
H    PC
I    car
J

UserVisits
IP  URL  Date      Duration
1   A    05-30-09  45
1   B    06-01-09  60
1   J    06-01-09  30
1   D    05-15-09  90
1   I    04-28-09  35
2   A    04-30-09  60
2   F    06-15-09  15
2   G    06-13-09  10
2   E    06-01-09  20
2   E    07-10-09  35
3   C    04-28-09  25
3   B    05-23-09  25
3   J    05-29-09  35
3   I    06-13-09  25
3   D    06-09-09  40
4   C    07-30-09  50
4   H    05-14-09  75
4   H    08-02-09  65
4   G    07-23-09  90
4   F    06-16-09  5

Query plan: σ(UV) and σ(D) feed ⋈, which feeds Σ
– Selections pushed down
– Sort-Merge Join
– Aggregate
(query as on slide 4)
7
Selection
(Documents and UserVisits tables, query and plan as on slide 6)
Storage manager
– One thread for each table scan
– Project out unused columns
8
Selection
Tuples are pipelined into join
σ(D) output (URL): A B C E F G I J
σ(UV) output (URL, Duration): A 45, B 60, J 30, D 90, F 15, G 10, E 20, E 35, B 25, J 35, I 25, D 40, C 50, H 75, G 90, F 5
(query and plan as on slide 6)
9
Sort-Merge Join – Sort Phase
Sort tuples on join attribute
Write sorted runs to disk
Buffer space: UV(8)
Documents run (URL): A B C E F G I J
UserVisits input 1 (URL, Duration): A 45, B 60, J 30, D 90, F 15, G 10, E 20, E 35
UserVisits Run 1 (sorted): A 45, B 60, D 90, E 20, E 35, F 15, G 10, J 30
UserVisits input 2 (URL, Duration): B 25, J 35, I 25, D 40, C 50, H 75, G 90, F 5
UserVisits Run 2 (sorted): B 25, C 50, D 40, F 5, G 90, H 75, I 25, J 35
(query and plan as on slide 6)
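The sort phase on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the runs are kept in memory as lists instead of being written to disk; the function name `sorted_runs` is hypothetical, and the tuples are the selected UserVisits tuples from the slide.

```python
from itertools import islice

# Selected UserVisits tuples (URL, Duration) in arrival order, copied from the slide.
selected = [("A", 45), ("B", 60), ("J", 30), ("D", 90), ("F", 15), ("G", 10),
            ("E", 20), ("E", 35), ("B", 25), ("J", 35), ("I", 25), ("D", 40),
            ("C", 50), ("H", 75), ("G", 90), ("F", 5)]


def sorted_runs(tuples, buffer_size):
    """Cut the input into buffer-sized chunks and sort each chunk on the join key."""
    it = iter(tuples)
    runs = []
    while True:
        chunk = list(islice(it, buffer_size))
        if not chunk:
            return runs
        # sorted() is stable, so tuples with equal URLs keep their arrival order.
        runs.append(sorted(chunk, key=lambda t: t[0]))


# Buffer space UV(8): two runs of eight tuples each.
runs = sorted_runs(selected, buffer_size=8)
for number, run in enumerate(runs, start=1):
    print("Run", number, run)
```

With a buffer of eight tuples the two lists reproduce Run 1 and Run 2 as listed on the slide.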
10
Sort-Merge Join – Merge Phase
Documents run, remaining (URL): B C E F G I J
UserVisits Run 1, remaining: D 90, E 20, E 35, F 15, G 10, J 30
UserVisits Run 2, remaining: C 50, D 40, F 5, G 90, H 75, I 25, J 35
Merged UserVisits: A 45, then B 25, B 60
Join: Documents A with UserVisits A 45; output Duration 45
(query and plan as on slide 6)
11
Sort-Merge Join – Merge Phase
Documents run, remaining (URL): F G I J
UserVisits Run 1, remaining: F 15, G 10, J 30
UserVisits Run 2, remaining: G 90, H 75, I 25, J 35
Merged UserVisits: D 40, D 90, then E 20, E 35, F 5
Join: current Documents URL is E
(query and plan as on slide 6)
12
Aggregation
Update the sum as tuples are produced
Join output (Duration): 45
Running sum: 0 → 45
(query and plan as on slide 6)
13
Final Result
Join output (Duration): 45 25 60 50 20 35 15 5 10 90 25 30 35
SUM(Duration) = 445
(query and plan as on slide 6)
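The whole plan (selection, sort-merge join, running aggregate) can be replayed on the example data. The sketch below is an illustration under stated assumptions, not the engine's code: the Content values of B, C, F, G and J are taken from the later sampling slides, where they appear as "car"; runs stay in memory; and `merge_join_sum` is a hypothetical helper that relies on URL being unique in Documents.

```python
import heapq
from datetime import date, datetime

# Documents (URL -> Content). B, C, F, G, J are "car" on the sampling slides.
documents = {"A": "car", "B": "car", "C": "car", "D": "phone", "E": "car",
             "F": "car", "G": "car", "H": "PC", "I": "car", "J": "car"}

# UserVisits (URL, Date, Duration); the IP column is projected out.
user_visits = [
    ("A", "05-30-09", 45), ("B", "06-01-09", 60), ("J", "06-01-09", 30),
    ("D", "05-15-09", 90), ("I", "04-28-09", 35), ("A", "04-30-09", 60),
    ("F", "06-15-09", 15), ("G", "06-13-09", 10), ("E", "06-01-09", 20),
    ("E", "07-10-09", 35), ("C", "04-28-09", 25), ("B", "05-23-09", 25),
    ("J", "05-29-09", 35), ("I", "06-13-09", 25), ("D", "06-09-09", 40),
    ("C", "07-30-09", 50), ("H", "05-14-09", 75), ("H", "08-02-09", 65),
    ("G", "07-23-09", 90), ("F", "06-16-09", 5),
]


def merge_join_sum(doc_urls, visits):
    """Merge two URL-sorted inputs and sum the durations of matching visits."""
    total = 0
    docs = iter(doc_urls)
    doc = next(docs, None)
    for url, duration in visits:
        # Advance the Documents side until it catches up with the visit URL.
        while doc is not None and doc < url:
            doc = next(docs, None)
        if doc == url:
            total += duration  # update the sum as each join tuple is produced
    return total


def parse(text):
    return datetime.strptime(text, "%m-%d-%y").date()


lo, hi = date(2009, 5, 1), date(2009, 7, 31)
sel_docs = sorted(u for u, content in documents.items() if "car" in content)
sel_uv = [(u, dur) for u, d, dur in user_visits if lo <= parse(d) <= hi]

# Sort phase with a buffer of 8 tuples, then merge the runs on URL.
runs = [sorted(sel_uv[i:i + 8], key=lambda t: t[0]) for i in range(0, len(sel_uv), 8)]
merged = heapq.merge(*runs, key=lambda t: t[0])

result = merge_join_sum(sel_docs, merged)
print(result)  # 445, the final result on the slide
```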
14
Roadmap
Database query execution
System design & implementation
– DataBaseOnline (DBO)
Approximation methods (theoretical analysis & practical implementation)
– Sampling
– Sketches
– Sketches over samples
15
What is the problem?
TPC-H benchmark results (price / performance)
– 10 TB scale
    928 hard disks (90 TB total storage capacity)
    16 × quad-core processors
    512 GB RAM
    $1.5 million
– Load time: 55 hours
– Q1: linear scan over one table with aggregates on top
    1 query: 19 minutes
    9 queries: 3 hours (linear scaling)
16
Approximate Query Processing
[Figure: axes Time and Query result; series: Traditional query processing, Result estimate, Confidence bounds]
SELECT SUM f(r_1 r_2 … r_n)
FROM R_1 as r_1, R_2 as r_2, …, R_n as r_n
17
DBO System Architecture [Rusu et al. 2008]
[Figure: architecture diagram with numbered steps 1 to 7. Labels: Query; DB Engine with plan σ(UV), σ(D), ⋈, Σ; Levelwise Step Controller; In-Memory Join over UV' and D'; Estimation Module; Approximate answer (Result, Confidence bounds); Result]
18
Roadmap
Database query execution
System design & implementation
– DataBaseOnline (DBO)
Approximation methods (theoretical analysis & practical implementation)
– Sampling
– Sketches
– Sketches over samples
19
Sampling [Dobra, Jermaine, Rusu & Xu 2009]
(Documents and UserVisits tables, query and plan as on slide 6)
Control, coordinate & schedule data flow between operators
Embed randomness in each operator
20
Sampling – Selection
Data in random order
Assign random timestamp to tuples
Controller schedules data flow between operators

Documents (X = rejected by the selection)
URL  Content  Timestamp
J    car      68
F    car      220
C    car      312
D    phone    X
A    car      389
B    car      447
G    car      515
H    PC       X
I    car      695
E    car      799

UserVisits (X = rejected by the selection)
IP  URL  Date      Duration  Timestamp
1   A    05-30-09  45        70
1   B    06-01-09  60        140
1   J    06-01-09  30        185
1   D    05-15-09  90        252
1   I    04-28-09  35        X
2   A    04-30-09  60        X
2   F    06-15-09  15        358
2   G    06-13-09  10        409
2   E    06-01-09  20        476
2   E    07-10-09  35        495
3   C    04-28-09  25        X
3   B    05-23-09  25        722
3   J    05-29-09  35        739
3   I    06-13-09  25        745
3   D    06-09-09  40        791
4   C    07-30-09  50        798
4   H    05-14-09  75        837
4   H    08-02-09  65        X
4   G    07-23-09  90        953
4   F    06-16-09  5         973

σ(D) output window (URL, timestamp): J 68, F 220, C 312, A 389
σ(UV) output window (URL, Duration, timestamp): A 45 70, B 60 140, J 30 185, D 90 252
In-Memory Join: Documents J
(query and plan as on slide 6)
21
Sampling – Selection
(tables with random timestamps, labels, query and plan as on slide 20)
σ(D) output window (URL, timestamp): F 220, C 312, A 389, B 447
σ(UV) output window (URL, Duration, timestamp): B 60 140, J 30 185, D 90 252, F 15 358
In-Memory Join: Documents J; UserVisits A 45
22
Sampling – Selection
(tables with random timestamps, labels, query and plan as on slide 20)
σ(D) output window (URL, timestamp): F 220, C 312, A 389, B 447
σ(UV) output window (URL, Duration, timestamp): D 90 252, F 15 358, G 10 409, E 20 476
In-Memory Join: Documents J; UserVisits J 30
23
Sampling – Selection
(tables with random timestamps, labels, query and plan as on slide 20)
σ(D) output window (URL, timestamp): G 515, I 695, E 799
σ(UV) output window (URL, Duration, timestamp): B 25 722, J 35 739, I 25 745, D 40 791
In-Memory Join: Documents J F C A B; UserVisits J 30, F 15
50% input: 360; [-328, 1048] with 95% probability
24
Sampling – Selection
(tables with random timestamps, labels, query and plan as on slide 20)
σ(D) output window (URL, timestamp): E 799
σ(UV) output window (URL, Duration, timestamp): I 25 745, D 40 791, C 50 798, H 75 837
In-Memory Join: Documents J F C A B G I; UserVisits J 30, F 15, B 25, J 35
In-Memory Join capacity (10 tuples) exceeded!
Eliminate tuples such that variance is minimized.
25
Sampling – Selection
(tables with random timestamps, labels, query and plan as on slide 20)
σ(D) output window (URL, timestamp): E 799
σ(UV) output window (URL, Duration, timestamp): I 25 745, D 40 791, C 50 798, H 75 837
In-Memory Join: Documents J A B G; UserVisits J 30, B 25, J 35
74% input: 258; [-293, 808] with 95% probability
26
Sampling – Selection
(tables with random timestamps, labels, query and plan as on slide 20)
σ(D) and σ(UV) output windows: empty
In-Memory Join: Documents J A B G E; UserVisits J 30, B 25, J 35, G 90
All input: 448; [3, 892] with 95% probability
27
Sampling Estimation – Intermediate Levels
Query result estimator & variance estimator computed from result tuples found by In-Memory Join
Confidence bounds derived with Central Limit Theorem
Solve optimization problem to keep bounds stable when tuples are deleted from In-Memory Join
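To make the idea of an estimator with Central Limit Theorem bounds concrete, here is a textbook single-table version in Python. It is not the DBO join estimator, which the slides do not spell out; it assumes a uniform sample without replacement from a known population size, and the function name `estimate_sum` is hypothetical. The values are the 13 join-result durations from slide 13.

```python
import math
import random
import statistics

# The 13 durations produced by the join on slide 13 (their exact sum is 445).
values = [45, 25, 60, 50, 20, 35, 15, 5, 10, 90, 25, 30, 35]


def estimate_sum(sample, population_size, z=1.96):
    """Scale-up SUM estimate with CLT-based bounds (z = 1.96 for about 95%)."""
    n = len(sample)
    mean = statistics.fmean(sample)
    s2 = statistics.variance(sample) if n > 1 else 0.0
    # Finite population correction: the variance vanishes once all data is seen.
    fpc = 1.0 - n / population_size
    estimate = population_size * mean
    variance = population_size ** 2 * fpc * s2 / n
    half_width = z * math.sqrt(variance)
    return estimate, (estimate - half_width, estimate + half_width)


rng = random.Random(0)  # fixed seed so the run is repeatable
sample = rng.sample(values, 6)
partial_estimate, partial_bounds = estimate_sum(sample, len(values))
full_estimate, full_bounds = estimate_sum(values, len(values))
print(partial_estimate, partial_bounds)
print(full_estimate, full_bounds)
```

Once the sample covers the whole population the bounds collapse onto the exact sum, which mirrors how the online estimates on the slides tighten as more input is processed.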
28
Sampling – Join (Sort)
Sort tuples on random function of join attribute
Documents (URL, random value): J 888, F 67, C 489, A 227, B 987, G 51, I 342, E 739
– Run 1 (sorted): F 67, A 227, C 489, J 888
– Run 2 (sorted): G 51, I 342, E 739, B 987
UserVisits (URL, Duration, random value): A 45 227, B 60 987, J 30 888, D 90 43, F 15 67, G 10 51, E 20 739, E 35 739, B 25 987, J 35 888, I 25 342, D 40 43, C 50 489, H 75 150, G 90 51, F 5 67
– Run 1 (sorted): D 90 43, G 10 51, F 15 67, A 45 227, E 20 739, E 35 739, J 30 888, B 60 987
– Run 2 (sorted): D 40 43, G 90 51, F 5 67, H 75 150, I 25 342, C 50 489, J 35 888, B 25 987
(query and plan as on slide 6)
29
Sampling – Join (Merge)
UserVisits Run 1, remaining: G 10 51, F 15 67, A 45 227, E 20 739, E 35 739, J 30 888, B 60 987
UserVisits Run 2, remaining: G 90 51, F 5 67, H 75 150, I 25 342, C 50 489, J 35 888, B 25 987
Documents Run 1, remaining: F 67, A 227, C 489, J 888
Documents Run 2, remaining: G 51, I 342, E 739, B 987
Merged at random value 51: Documents G; UserVisits G 10, G 90
In-Memory Join output (Duration): 10, 90
Running sum: 0 → 100 (at random value 51)
(query and plan as on slide 6)
30
Sampling – Join (Merge)
UserVisits Run 1, remaining: E 20 739, E 35 739, J 30 888, B 60 987
UserVisits Run 2, remaining: C 50 489, J 35 888, B 25 987
Documents Run 1, remaining: C 489, J 888
Documents Run 2, remaining: E 739, B 987
Merged at random value 489: Documents C; UserVisits C 50
In-Memory Join output (Duration): 50
Running sum: 240 (at random value 489)
50% input: 468; [194, 741] with 95% probability
(query and plan as on slide 6)
31
Sampling – Join (Merge)
UserVisits Run 1, remaining: B 60 987
UserVisits Run 2, remaining: B 25 987
Documents Run 1, remaining: empty
Documents Run 2, remaining: B 987
Merged at random value 987: Documents B; UserVisits B 25, B 60
In-Memory Join output (Duration): 25, 60
Running sum: 445 (at random value 987)
(query and plan as on slide 6)
32
Sampling Estimation – Upper Level
Bernoulli sampling with probability given by domain fraction seen so far
Consolidate tuples generated by same join key
Solve optimization problem to minimize variance across levels
– Keep confidence bounds stable
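The effect of merging in the order of a random function of the join key can be replayed with the values from the merge slides. The Python sketch below consolidates join tuples per key, walks the keys in random-value order, and scales the partial sum by the fraction of the domain seen so far. Two things are assumptions: the range of the random function (`HASH_RANGE = 1000` is not stated on the slides), and the scale-up itself, which is a naive Bernoulli-style estimate and not the variance-minimizing DBO estimator, so it is not expected to reproduce the estimates quoted on the slides.

```python
# Random value assigned to each join key, copied from the sort slide.
key_hash = {"D": 43, "G": 51, "F": 67, "H": 150, "A": 227,
            "I": 342, "C": 489, "E": 739, "J": 888, "B": 987}
HASH_RANGE = 1000  # assumption: the slides do not give the range of the random function

documents = {"A", "B", "C", "E", "F", "G", "I", "J"}  # URLs that pass the 'car' selection
visits = [("A", 45), ("B", 60), ("J", 30), ("D", 90), ("F", 15), ("G", 10),
          ("E", 20), ("E", 35), ("B", 25), ("J", 35), ("I", 25), ("D", 40),
          ("C", 50), ("H", 75), ("G", 90), ("F", 5)]


def running_estimates(documents, visits):
    """Return (key, partial sum, naive scaled estimate) in random-value order."""
    per_key = {}
    for url, duration in visits:
        if url in documents:
            # Consolidate all join tuples generated by the same join key.
            per_key[url] = per_key.get(url, 0) + duration
    partial = 0
    rows = []
    for url in sorted(per_key, key=key_hash.get):
        partial += per_key[url]
        fraction_seen = key_hash[url] / HASH_RANGE  # domain fraction seen so far
        rows.append((url, partial, partial / fraction_seen))
    return rows


rows = running_estimates(documents, visits)
for url, partial, estimate in rows:
    print(url, partial, round(estimate, 1))
```

The partial sums pass through 100 at key G, 240 at key C and 445 at key B, the same running sums shown on the three merge slides.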
33
Contributions
Design & implement DBO, first online analytical processing engine
– Provide estimates & confidence bounds throughout entire query execution
– SELECT-PROJECT-JOIN (SPJ) & GROUP BY queries over any number of relations
Design & analyze fastest convergent estimation method for online aggregation
– Statistics & optimization techniques
34
Roadmap
Database query execution
System design & implementation
– DataBaseOnline (DBO)
Approximation methods (theoretical analysis & practical implementation)
– Sampling
– Sketches
– Sketches over samples
35
Sketches
(Documents and UserVisits tables, query and plan as on slide 6)
Build sketches on join attribute while data is read from disk
Use attributes in aggregate
36
Sketches
(Documents and UserVisits tables and query as on slide 6)
Parameters of sketch S1:
URL     A B C D E F G H I J
Sign    + - - - - + + + - -
Bucket  1 2 3 1 1 2 2 3 3 3
Documents tuple A: bucket 1, sign +, add 1
S1 of Documents (buckets 1, 2, 3): 0 0 0 → 1 0 0
37
Sketches
(Documents and UserVisits tables and query as on slide 6)
Parameters of sketch S1:
URL     A B C D E F G H I J
Sign    + - - - - + + + - -
Bucket  1 2 3 1 1 2 2 3 3 3
S1 of Documents (buckets 1, 2, 3): 1 0 0
UserVisits tuple A 45: bucket 1, sign +, add 45
S1 of UserVisits (buckets 1, 2, 3): 0 0 0 → 45 0 0
38
Sketches
(Documents and UserVisits tables and query as on slide 6)
Parameters of sketch S1:
URL     A B C D E F G H I J
Sign    + - - - - + + + - -
Bucket  1 2 3 1 1 2 2 3 3 3
S1 of Documents (buckets 1, 2, 3): 0 1 -3
S1 of UserVisits (buckets 1, 2, 3): -140 35 -65
Estimate from S1: 230
39
Sketches
(Documents and UserVisits tables and query as on slide 6)

Signs
URL  A B C D E F G H I J
S1   + - - - - + + + - -
S2   + - + - + - + - + -
S3   - - - + + - + + - +

Buckets
URL  A B C D E F G H I J
S1   1 2 3 1 1 2 2 3 3 3
S2   3 3 2 1 2 1 2 1 3 2
S3   1 1 2 1 3 1 3 2 3 2

Sketches of Documents (buckets 1, 2, 3)
S1    0    1   -3
S2   -1    2    1
S3   -3    0    1

Sketches of UserVisits (buckets 1, 2, 3)
S1  -140   35  -65
S2  -225  140  -15
S3   -20   90  130

Estimates: S1 = 230, S2 = 490, S3 = 190
230; [-416, 876] with 95% probability
40
Sketches Estimation
Two random processes
– Bucket selection
– Sign
Sketch update
Estimator
Confidence bounds
– Multiple independent sketches
– Chebyshev & Chernoff inequalities (worst-case)
– Median Central Limit Theorem, Student-t distribution (statistics)
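The sketch update and estimator can be replayed on the running example. The Python sketch below is a small bucketed-sign sketch in the style of Fast-AGMS, using the per-URL bucket and sign tables exactly as printed on slide 39 instead of hash functions and a pseudo-random generator; the function names are hypothetical, and taking the median of the three estimates is one possible way to combine them.

```python
from statistics import median

URLS = "ABCDEFGHIJ"
# Bucket (1 to 3) and sign of every URL in sketches S1, S2, S3, copied from slide 39.
BUCKETS = ["1231122333", "3321212132", "1121313232"]
SIGNS = ["+----+++--", "+-+-+-+-+-", "---++-++-+"]

# Tuples that pass the selections: Documents count 1 each, UserVisits carry Duration.
docs = [(url, 1) for url in "ABCEFGIJ"]
visits = [("A", 45), ("B", 60), ("J", 30), ("D", 90), ("F", 15), ("G", 10),
          ("E", 20), ("E", 35), ("B", 25), ("J", 35), ("I", 25), ("D", 40),
          ("C", 50), ("H", 75), ("G", 90), ("F", 5)]


def build_sketch(tuples, buckets, signs, n_buckets=3):
    """Sketch update: add sign(url) * weight to the bucket selected for url."""
    sketch = [0] * n_buckets
    for url, weight in tuples:
        i = URLS.index(url)
        sign = 1 if signs[i] == "+" else -1
        sketch[int(buckets[i]) - 1] += sign * weight
    return sketch


def estimate(sketch_a, sketch_b):
    """Estimator: inner product of the two sketches, bucket by bucket."""
    return sum(a * b for a, b in zip(sketch_a, sketch_b))


estimates = [estimate(build_sketch(docs, b, s), build_sketch(visits, b, s))
             for b, s in zip(BUCKETS, SIGNS)]
print(estimates)          # [230, 490, 190]
print(median(estimates))  # 230
```

The three values agree with the per-sketch estimates on slide 39, and their median is the 230 reported there, against an exact answer of 445.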
41
Pseudo-Random Number Generators [Rusu & Dobra 2006, 2007b]
Detailed comparison of generating schemes
– Abstract algebra (orthogonal arrays, vector spaces, prime & extension fields)
    Degree of independence as function of seed size
    Fast range-summable
– Empirical evaluation
    Generating time is a few processor cycles
Identify EH3 as generator for sketches
– Lowest possible degree of independence
– 7.3 ns to generate single number
42
Statistical Analysis [Rusu & Dobra 2007a, 2008]
Detailed comparison of sketch estimators
– Same accuracy (worst-case analysis)
– Statistical analysis
    Distribution (probability density function)
    Higher frequency moments (kurtosis)
    Confidence bounds
– Empirical evaluation
    Data skew, correlation, memory usage, update time
Identify Fast-AGMS as most reliable scheme
– Accurate over entire range of data
– Small memory footprint, fast update time
43
Roadmap
Database query execution
System design & implementation
– DataBaseOnline (DBO)
Approximation methods (theoretical analysis & practical implementation)
– Sampling
– Sketches
– Sketches over samples
44
Sketches over Samples [Rusu & Dobra 2009]
Data is random on disk
Build sketches on join attribute while data is read from disk
Use attributes in aggregate
Provide estimates at any point
Documents in random order (URL, Content): J car, F, C, D phone, A car, B, G, H PC, I car, E
(UserVisits table, query and plan as on slide 6)
45
Sketches over Samples
(sign and bucket tables as on slide 39; Documents order and UserVisits table as on slide 44)

Sketches of Documents after 50% of the input (buckets 1, 2, 3)
S1    1    1   -2
S2   -1    0    1
S3   -2    0    0

Sketches of UserVisits after 50% of the input (buckets 1, 2, 3)
S1  -100  -35  -30
S2  -105   35  -15
S3   -30   30   65

Estimates: S1 = -300, S2 = 360, S3 = 240
50% input: 100; [-2382, 2582] with 95% probability
46
Sketches over Samples – Estimation
Define estimator over two completely different random processes & analyze statistically
– Sampling – random partition, tuple domain
– Sketches – random projection, frequency domain
– Consider correlation between multiple sketches that share same sample
– Moment generating functions
Generic analysis independent of sampling process
– Bernoulli sampling
– Sampling without replacement
– Sampling with replacement
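Combining the two random processes can be illustrated on the slide 45 example: sketch only the prefix of each table that has been read, then scale the sketch estimate by the inverse of the two sampling fractions. The Python sketch below assumes the on-disk order shown on slides 20 and 44 and uses the bucket and sign tables of slide 39; the scaling factor and the function names are illustrative assumptions, and the pass flags encode which tuples survive the selections (the X marks on slide 20).

```python
from statistics import fmean

URLS = "ABCDEFGHIJ"
BUCKETS = ["1231122333", "3321212132", "1121313232"]  # slide 39
SIGNS = ["+----+++--", "+-+-+-+-+-", "---++-++-+"]    # slide 39

# Scan order on disk; the flag says whether the tuple passes its selection.
docs_scan = [("J", True), ("F", True), ("C", True), ("D", False), ("A", True),
             ("B", True), ("G", True), ("H", False), ("I", True), ("E", True)]
uv_scan = [("A", 45, True), ("B", 60, True), ("J", 30, True), ("D", 90, True),
           ("I", 35, False), ("A", 60, False), ("F", 15, True), ("G", 10, True),
           ("E", 20, True), ("E", 35, True), ("C", 25, False), ("B", 25, True),
           ("J", 35, True), ("I", 25, True), ("D", 40, True), ("C", 50, True),
           ("H", 75, True), ("H", 65, False), ("G", 90, True), ("F", 5, True)]


def build_sketch(tuples, buckets, signs, n_buckets=3):
    """Add sign(url) * weight to the bucket selected for url."""
    sketch = [0] * n_buckets
    for url, weight in tuples:
        i = URLS.index(url)
        sketch[int(buckets[i]) - 1] += (1 if signs[i] == "+" else -1) * weight
    return sketch


def sketch_over_sample(n_docs, n_visits):
    """Sketch the first n_docs / n_visits tuples read and scale up the estimates."""
    docs = [(u, 1) for u, ok in docs_scan[:n_docs] if ok]
    visits = [(u, d) for u, d, ok in uv_scan[:n_visits] if ok]
    # Inverse of the product of the two sampling fractions.
    scale = (len(docs_scan) / n_docs) * (len(uv_scan) / n_visits)
    estimates = []
    for b, s in zip(BUCKETS, SIGNS):
        sk_docs = build_sketch(docs, b, s)
        sk_visits = build_sketch(visits, b, s)
        estimates.append(scale * sum(x * y for x, y in zip(sk_docs, sk_visits)))
    return estimates


half = sketch_over_sample(5, 10)    # 50% of each table
full = sketch_over_sample(10, 20)   # all input
print(half, fmean(half))
print(full)
```

At 50% of the input the three estimates come out as -300, 360 and 240, matching slide 45, and their mean is the reported 100. With all input the sample is the whole table and only the sketch error remains.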
47
Sketches over Samples – Analysis
Var[sketch over samples] = Var[samples] + Var[sketch] + Var[interaction]
48
Conclusions
Data explosion
– Cheap, high-capacity storage
– Current processing technology is too expensive for the performance it provides
Framework for online analytical processing
– DBO system architecture
    Embed randomization into data processing
    Provide estimates and bounds at any time
– Approximation methods
    Sampling – most flexible
    Sketches – single pass
    Sketches over samples – fastest
49
Future Work
Short term
– Define & design query optimization for DBO
– Extend DBO to other types of queries and with other approximation techniques (end-biased samples, histograms, …)
– Generalize sketches to multiple relations
– Find optimal amount of data to sketch
– Fully integrate sketches into DBO system
Medium term
– Develop data aggregation & approximation techniques for other types of architectures
    Multicore processors, GPUs
    Distributed processing (Map-Reduce, Hadoop, …)
Long term
– Design & build scalable analytic processing system
    Aggregation & approximation
50
Publications
A. Dobra, C. Jermaine, F. Rusu, F. Xu – Turbo-Charging Estimate Convergence in DBO. In VLDB 2009.
F. Rusu and A. Dobra – Sketching Sampled Data Streams. In ICDE 2009.
F. Rusu et al. – The DBO Database System. In SIGMOD 2008 (demo).
F. Rusu and A. Dobra – Sketches for Size of Join Estimation. In TODS, vol. 33, no. 3, 2008.
F. Rusu and A. Dobra – Pseudo-Random Number Generation for Sketch-Based Estimations. In TODS, vol. 32, no. 2, 2007.
F. Rusu and A. Dobra – Statistical Analysis of Sketch Estimators. In SIGMOD 2007.
F. Rusu and A. Dobra – Fast Range-Summable Random Variables for Efficient Aggregate Estimation. In SIGMOD 2006.