Scalable Approximate Query Processing Florin Rusu.

Presentation transcript:

Scalable Approximate Query Processing Florin Rusu

Data Explosion
Data storage advancements
– Price / capacity ($70 / 1 TB)
Human generated
– Web 2.0 & social networking
    User data
– Communication
    Network & web logs (eBay – 50 TB / day)
    Call Detail Records (CDRs)
Scientific experiments
– LHC (Large Hadron Collider)
– SKA (Square Kilometer Array) – 1 EB (10^18) / day
– Sensor networks
04/19/2010

Large-Scale Data Analytics
Traditional DB (OLTP)
– Multi-user transaction processing
– Optimized for specific workloads (views, indexes, …)
Analytic processing (OLAP)
– Data cubes
    Aggregate at different hierarchical levels
    Pre-defined aggregates, not flexible
– Shared-nothing architectures (MPP)
    Startups: Netezza, Greenplum, AsterData, Vertica, …
    Parallel databases on clusters of computers
    Storage layer (row store, column store, hybrid)
    Compression

Interactive Data Analysis & Exploration
Ad-hoc queries
Compute statistical aggregates over all data
Example: web log analysis
– Documents (URL, Content)
– UserVisits (IP, URL, Date, Duration)
– “How much time did users spend searching for cars during the period May – July 2009?”
SELECT SUM(UV.Duration)
FROM Documents D, UserVisits UV
WHERE D.URL = UV.DocURL
  AND D.Content contains ‘car’
  AND UV.Date between [ , ]

Roadmap
Database query execution
System design & implementation
– DataBaseOnline (DBO)
Approximation methods (theoretical analysis & practical implementation)
– Sampling
– Sketches
– Sketches over samples

Query Execution
Query plan: σ UV and σ D feed ⋈, which feeds Σ
– Selections push down
– Sort-Merge Join
– Aggregate
[Figure: the Documents (URL, Content) and UserVisits (IP, URL, Date, Duration) tables, URLs A to J, with the query plan drawn next to the SELECT SUM(UV.Duration) query]

Selection
Storage manager
One thread for each table scan
Project out unused columns
[Figure: the Documents and UserVisits tables scanned by σ D and σ UV]

Selection
Tuples are pipelined into join
Documents after σ D (URL): A, B, C, E, F, G, I, J
UserVisits after σ UV (URL, Duration): A 45, B 60, J 30, D 90, F 15, G 10, E 20, E 35, B 25, J 35, I 25, D 40, C 50, H 75, G 90, F 5

Sort-Merge Join – Sort Phase
Sort tuples on join attribute
Write sorted runs to disk
Buffer space: UV(8)
Documents (URL), sorted: A, B, C, E, F, G, I, J
UserVisits (URL, Duration)
– Run 1 input: A 45, B 60, J 30, D 90, F 15, G 10, E 20, E 35
– Run 1 sorted: A 45, B 60, D 90, E 20, E 35, F 15, G 10, J 30
– Run 2 input: B 25, J 35, I 25, D 40, C 50, H 75, G 90, F 5
– Run 2 sorted: B 25, C 50, D 40, F 5, G 90, H 75, I 25, J 35
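The run generation on this slide can be sketched in a few lines of Python. This is an illustration only, not DBO code: the name `make_sorted_runs` is hypothetical, the runs stay in memory instead of being written to disk, and the tuples are the (URL, Duration) pairs shown on the slide.

```python
from typing import List, Tuple

Row = Tuple[str, int]  # (URL, Duration)

# UserVisits tuples that survive the selection, in arrival order (from the slide).
USER_VISITS: List[Row] = [
    ("A", 45), ("B", 60), ("J", 30), ("D", 90),
    ("F", 15), ("G", 10), ("E", 20), ("E", 35),
    ("B", 25), ("J", 35), ("I", 25), ("D", 40),
    ("C", 50), ("H", 75), ("G", 90), ("F", 5),
]


def make_sorted_runs(rows: List[Row], buffer_size: int) -> List[List[Row]]:
    """Cut the input into buffer-sized chunks and sort each chunk on the join key."""
    runs = []
    for start in range(0, len(rows), buffer_size):
        chunk = rows[start:start + buffer_size]
        # sorted() is stable, so tuples with the same URL keep their arrival order.
        runs.append(sorted(chunk, key=lambda row: row[0]))
    return runs


# Buffer space UV(8): eight tuples fit in memory, so sixteen tuples give two runs.
runs = make_sorted_runs(USER_VISITS, buffer_size=8)
for number, run in enumerate(runs, start=1):
    print(f"Run {number}:", " ".join(f"{url}{duration}" for url, duration in run))
```

With a buffer of eight tuples the two runs come out in the same order as Run 1 and Run 2 on the slide.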

Sort-Merge Join – Merge Phase
[Figure, first slide: merge in progress. UserVisits Run 1 remaining: D 90, E 20, E 35, F 15, G 10, J 30. Run 2 remaining: C 50, D 40, F 5, G 90, H 75, I 25, J 35. Merged UserVisits: B 25, B 60. Documents remaining: B, C, E, F, G, I, J. Tuple A 45 meets Documents URL A and Duration 45 is passed to Σ.]
[Figure, second slide: Run 1 remaining: F 15, G 10, J 30. Run 2 remaining: G 90, H 75, I 25, J 35. Merged UserVisits: E 20, E 35, F 5, with D 40 and D 90 already consumed. Current Documents URL: E. Documents remaining: F, G, I, J.]

Aggregation
Update the sum as tuples are produced
[Figure: the running sum goes from 0 to 45 when the join tuple with Duration 45 arrives]

Final Result
Duration: 445
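The merge phase and the running aggregate can be replayed on the slide data. The sketch below is illustrative rather than the system's implementation: `merge_join` is a hypothetical helper, the runs are plain Python lists, and the Documents side is assumed to hold each URL once.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[str, int]  # (URL, Duration)

# Sorted inputs as they appear on the sort-phase slide.
DOCUMENT_URLS: List[str] = ["A", "B", "C", "E", "F", "G", "I", "J"]
RUN_1: List[Row] = [("A", 45), ("B", 60), ("D", 90), ("E", 20),
                    ("E", 35), ("F", 15), ("G", 10), ("J", 30)]
RUN_2: List[Row] = [("B", 25), ("C", 50), ("D", 40), ("F", 5),
                    ("G", 90), ("H", 75), ("I", 25), ("J", 35)]


def merge_join(doc_urls: Iterable[str], visits: Iterable[Row]) -> Iterator[int]:
    """Yield the Duration of every visit whose URL also appears in doc_urls.

    Both inputs must be sorted on URL; doc_urls must not contain duplicates.
    """
    docs = iter(doc_urls)
    current = next(docs, None)
    for url, duration in visits:
        # Advance the Documents cursor until it reaches or passes the visit URL.
        while current is not None and current < url:
            current = next(docs, None)
        if current is None:
            return  # Documents exhausted, no further matches possible
        if current == url:
            yield duration


# Merge the two sorted runs on URL, then join and aggregate in one pipelined pass.
visits_sorted = heapq.merge(RUN_1, RUN_2, key=lambda row: row[0])
running_sum = 0
for duration in merge_join(DOCUMENT_URLS, visits_sorted):
    running_sum += duration  # update the sum as tuples are produced
print(running_sum)  # prints 445
```

The visits for URLs D and H find no partner in Documents, which is why the total is 445 rather than the 650 obtained by summing every Duration.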

What is the problem?
TPC-H benchmark results (price / performance)
– 10 TB scale
    928 hard-disks (90 TB total storage capacity)
    16 × quad-core processors
    512 GB RAM
    $1.5 million
– Load time: 55 hours
– Q1: linear scan over one table with aggregates on top
    1 query: 19 minutes
    9 queries: 3 hours (linear scaling)

Approximate Query Processing
[Figure: query result plotted against time; labels: Traditional query processing, Result estimate, Confidence bounds]
SELECT SUM f(r1 r2 … rn)
FROM R1 as r1, R2 as r2, …, Rn as rn

DBO System Architecture [Rusu et al. 2008]
[Figure: architecture diagram. Components: DB Engine running the plan σ UV, σ D, ⋈, Σ (Query in, Result out); Levelwise Step Controller; In-Memory Join (UV' ⋈ D'); Estimation Module producing Result and Confidence bounds as the Approximate answer]

Sampling [Dobra, Jermaine, Rusu & Xu 2009]
Control, coordinate & schedule data flow between operators
Embed randomness in each operator
[Figure: the Documents and UserVisits tables with the query plan σ UV, σ D, ⋈, Σ]

Sampling – Selection
Data in random order
Assign random timestamp to tuples
Controller schedules data flow between operators
[Figure sequence over seven slides: both tables are scanned in timestamp order and the surviving tuples flow into the In-Memory Join; X marks tuples rejected by a selection]
Documents (URL, Content, timestamp): J car 68, F car 220, C car 312, D phone X, A car 389, B car 447, G car 515, H PC X, I car 695, E car 799
UserVisits (URL, Duration, timestamp) as shown: A 45 70, B 60 140, J 30 185, D 90 252, F 15 358, G 10 409, E 20 476, B 25 722, J 35 739, I 25 745, D 40 791, C 50 798, H 75 837
Estimates reported as the scan progresses:
– 50% input: 360; [-328, 1048] 95% probability
– Exceed In-Memory Join capacity (10 tuples)! Eliminate tuples such that variance is minimized.
– 74% input: 258; [-293, 808] 95% probability
– All input: 448; [3, 892] 95% probability

Sampling Estimation – Intermediate Levels
Query result estimator & variance estimator computed from result tuples found by In-Memory Join
Confidence bounds derived with Central Limit Theorem
Solve optimization problem to keep bounds stable when tuples are deleted from In-Memory Join
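To make the Central Limit Theorem step concrete, here is a deliberately simplified sketch. It is not DBO's join estimator: it assumes a single stream of per-tuple contributions scanned in uniformly random order, so that any prefix is a simple random sample without replacement. The function name `estimate_sum`, the constant `Z_95`, and the use of the slide's scan order as the random order are assumptions made for illustration.

```python
import math
from typing import List, Tuple

CAR_URLS = {"A", "B", "C", "E", "F", "G", "I", "J"}
USER_VISITS = [
    ("A", 45), ("B", 60), ("J", 30), ("D", 90),
    ("F", 15), ("G", 10), ("E", 20), ("E", 35),
    ("B", 25), ("J", 35), ("I", 25), ("D", 40),
    ("C", 50), ("H", 75), ("G", 90), ("F", 5),
]
# Contribution of each visit to SUM(UV.Duration): its Duration if the URL
# passes the Documents selection, 0 otherwise.
CONTRIBUTIONS: List[int] = [d if u in CAR_URLS else 0 for u, d in USER_VISITS]

Z_95 = 1.96  # two-sided 95% quantile of the standard normal distribution


def estimate_sum(seen: List[int], population_size: int) -> Tuple[float, float, float]:
    """Return (estimate, lower bound, upper bound) for the population total."""
    n = len(seen)
    mean = sum(seen) / n
    estimate = population_size * mean
    if n < 2:
        return estimate, float("-inf"), float("inf")
    sample_variance = sum((x - mean) ** 2 for x in seen) / (n - 1)
    # Finite population correction: the bounds collapse once everything is seen.
    correction = 1 - n / population_size
    std_error = population_size * math.sqrt(sample_variance / n * correction)
    return estimate, estimate - Z_95 * std_error, estimate + Z_95 * std_error


for n in (4, 8, 12, 16):
    est, low, high = estimate_sum(CONTRIBUTIONS[:n], len(CONTRIBUTIONS))
    print(f"{n:2d} tuples: {est:7.1f}  [{low:7.1f}, {high:7.1f}]")
```

With sixteen values the normal approximation is crude, so the intervals are only indicative; the point is the shape of the computation (estimator, variance estimator, bounds), which narrows to the exact answer when the whole input has been scanned.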

Sampling – Join (Sort)
Sort tuples on random function of join attribute
Documents (URL, random value): J 888, F 67, C 489, A 227, B 987, G 51, I 342, E 739
– Run 1: F 67, A 227, C 489, J 888
– Run 2: G 51, I 342, E 739, B 987
UserVisits (URL, Duration, random value): A 45 227, B 60 987, J 30 888, D 90 43, F 15 67, G 10 51, E 20 739, E 35 739, B 25 987, J 35 888, I 25 342, D 40 43, C 50 489, H 75 150, G 90 51, F 5 67
– Run 1: D 90 43, G 10 51, F 15 67, A 45 227, E 20 739, E 35 739, J 30 888, B 60 987
– Run 2: D 40 43, G 90 51, F 5 67, H 75 150, I 25 342, C 50 489, J 35 888, B 25 987

Sampling – Join (Merge)
[Figure sequence over three slides: the sorted runs of both relations are merged in order of the random value; tuples with the same join key meet in the In-Memory Join and the running sum is updated. The first slide joins key G (51), the second is at keys C (489) and E (739), the third joins key B (987).]
Estimate shown on the second slide: % input: 468; [194, 741] 95% probability

Sampling Estimation – Upper Level
Bernoulli sampling with probability given by domain fraction seen so far
Consolidate tuples generated by same join key
Solve optimization problem to minimize variance across levels
– Keep confidence bounds stable
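The idea of sorting both relations on a random function of the join key, then treating the keys seen so far as a Bernoulli sample of the key domain, can be sketched as follows. The random values are the ones shown on the Sampling – Join (Sort) slide; the range [0, 1000) of the random function, the name `estimate_at`, and the simple scale-up by the inverse fraction are assumptions for illustration and are not claimed to reproduce the estimates printed on the slides.

```python
from typing import Dict, List, Tuple

# Random value attached to each join key (read off the slide).
RANDOM_VALUE: Dict[str, int] = {
    "D": 43, "G": 51, "F": 67, "H": 150, "A": 227,
    "I": 342, "C": 489, "E": 739, "J": 888, "B": 987,
}
DOMAIN_SIZE = 1000  # assumption: the slides do not state the range of the random function

CAR_URLS = {"A", "B", "C", "E", "F", "G", "I", "J"}
USER_VISITS: List[Tuple[str, int]] = [
    ("A", 45), ("B", 60), ("J", 30), ("D", 90),
    ("F", 15), ("G", 10), ("E", 20), ("E", 35),
    ("B", 25), ("J", 35), ("I", 25), ("D", 40),
    ("C", 50), ("H", 75), ("G", 90), ("F", 5),
]


def estimate_at(threshold: int) -> Tuple[int, float]:
    """Join every key whose random value is below threshold and scale the partial sum.

    A key falls below the threshold with probability threshold / DOMAIN_SIZE,
    so dividing by that fraction gives an unbiased estimate of the full sum.
    """
    partial = sum(
        duration
        for url, duration in USER_VISITS
        if url in CAR_URLS and RANDOM_VALUE[url] < threshold
    )
    fraction_seen = threshold / DOMAIN_SIZE
    return partial, partial / fraction_seen


for threshold in (250, 490, 750, 1000):
    partial, estimate = estimate_at(threshold)
    print(f"values < {threshold:4d}: partial sum {partial:3d}, estimate {estimate:6.1f}")
```

Because both relations use the same random function, all tuples for a given key appear at the same point of the merge, so a key is either fully joined or not yet seen.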

Contributions
Design & implement DBO, first online analytical processing engine
– Provide estimates & confidence bounds throughout entire query execution
– SELECT-PROJECT-JOIN (SPJ) & GROUP BY queries over any number of relations
Design & analyze fastest convergent estimation method for online aggregation
– Statistics & optimization techniques

Sketches
Build sketches on join attribute while data is read from disk
Use attributes in aggregate
[Figure: the Documents and UserVisits tables with the query plan σ UV, σ D, ⋈, Σ]

Sketches
[Figure sequence over four slides: one sketch per relation, each made of counters S1, S2, S3 over the URL domain A to J, is updated as tuples are read, starting from all-zero counters. The last slide shows "S1 230", "S2 490" and the bounds [-416, 876] at 95% probability.]

Sketches Estimation
Two random processes
– Bucket selection
– Sign
Sketch update
Estimator
Confidence bounds
– Multiple independent sketches
– Chebyshev & Chernoff inequalities (worst-case)
– Median Central Limit Theorem, Student-t distribution (statistics)
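The two random processes (bucket selection and sign), the update rule, and a median-based estimator can be put together in a small hash-based sketch in the style of Fast-AGMS. This is a minimal sketch under stated assumptions: the class name `FastAGMS`, the function `estimate_join`, and the use of SHA-256 as a stand-in for the pseudo-random generators discussed in the talk are all choices made for illustration.

```python
import hashlib
from statistics import median
from typing import List


class FastAGMS:
    """Several independent rows of counters; each key hits one bucket per row."""

    def __init__(self, rows: int, buckets: int, seed: int = 0) -> None:
        self.rows, self.buckets, self.seed = rows, buckets, seed
        self.counters: List[List[int]] = [[0] * buckets for _ in range(rows)]

    def _hash(self, kind: str, row: int, key: str) -> int:
        # SHA-256 stands in for a proper limited-independence generator.
        digest = hashlib.sha256(f"{self.seed}|{kind}|{row}|{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def update(self, key: str, weight: int = 1) -> None:
        for row in range(self.rows):
            bucket = self._hash("bucket", row, key) % self.buckets   # bucket selection
            sign = 1 if self._hash("sign", row, key) & 1 else -1     # random sign
            self.counters[row][bucket] += sign * weight


def estimate_join(left: FastAGMS, right: FastAGMS) -> float:
    """Median over rows of the bucket-wise inner product of the two sketches.

    Both sketches must be built with the same rows, buckets and seed.
    """
    per_row = [
        sum(a * b for a, b in zip(left_row, right_row))
        for left_row, right_row in zip(left.counters, right.counters)
    ]
    return median(per_row)


# Sketch Documents with weight 1 and UserVisits with weight Duration, so the
# join-size estimate approximates SUM(UV.Duration) over the join (exact: 445).
docs, visits = FastAGMS(5, 16, seed=7), FastAGMS(5, 16, seed=7)
for url in ["A", "B", "C", "E", "F", "G", "I", "J"]:
    docs.update(url)
for url, duration in [("A", 45), ("B", 60), ("J", 30), ("D", 90), ("F", 15),
                      ("G", 10), ("E", 20), ("E", 35), ("B", 25), ("J", 35),
                      ("I", 25), ("D", 40), ("C", 50), ("H", 75), ("G", 90), ("F", 5)]:
    visits.update(url, duration)
print(estimate_join(docs, visits))
```

Each row gives an independent estimate; taking the median across rows is one of the ways listed on the slide to turn several sketches into a single answer with bounds.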

Pseudo-Random Number Generators [Rusu & Dobra 2006, 2007b]
Detailed comparison of generating schemes
– Abstract algebra (orthogonal arrays, vector spaces, prime & extension fields)
    Degree of independence as function of seed size
    Fast range-summable
– Empirical evaluation
    Generating time is few processor cycles
Identify EH3 as generator for sketches
– Lowest possible degree of independence
– 7.3 ns to generate single number
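For readers unfamiliar with EH3, the following sketch shows the general shape of an extended-Hamming style ±1 generator: a seed bit, a dot product between a seed vector and the index over GF(2), and a nonlinear term that XORs together the ORs of adjacent bit pairs. The slide does not spell out the definition, so treat this formulation, the function names, and the 32-bit default as assumptions to be checked against the cited papers.

```python
def parity(x: int) -> int:
    """Return 1 if x has an odd number of set bits, else 0."""
    return bin(x).count("1") & 1


def eh3(seed_bit: int, seed_vector: int, i: int, n_bits: int = 32) -> int:
    """Map index i to +1 or -1 from an (n_bits + 1)-bit seed; n_bits must be even."""
    mask = (1 << n_bits) - 1
    i &= mask
    # Linear part: dot product of the seed vector and the index over GF(2).
    linear = parity(seed_vector & i)
    # Nonlinear part: (i0 OR i1) XOR (i2 OR i3) XOR ...
    even_positions = int("01" * (n_bits // 2), 2)  # 0b0101...01
    nonlinear = parity((i | (i >> 1)) & even_positions)
    bit = (seed_bit ^ linear ^ nonlinear) & 1
    return 1 - 2 * bit  # bit 0 -> +1, bit 1 -> -1


# A few values for one arbitrary seed.
print([eh3(1, 0b1011, i) for i in range(8)])
```

The whole computation is a handful of bitwise operations, which is consistent with the slide's remark that generating a number takes only a few processor cycles.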

Statistical Analysis [Rusu & Dobra 2007a, 2008]
Detailed comparison of sketch estimators
– Same accuracy (worst-case analysis)
– Statistical analysis
    Distribution (probability density function)
    Higher frequency moments (kurtosis)
    Confidence bounds
– Empirical evaluation
    Data skew, correlation, memory usage, update time
Identify Fast-AGMS as most reliable scheme
– Accurate over entire range of data
– Small memory footprint, fast update time

Sketches over Samples [Rusu & Dobra 2009]
Data is random on disk
Build sketches on join attribute while data is read from disk
Use attributes in aggregate
Provide estimates at any point
[Figure: the Documents table in random order (J, F, C, D, A, B, G, H, I, E) and the UserVisits table, with the query plan σ UV, σ D, ⋈, Σ]

Sketches over Samples
[Figure: sketches S1, S2, S3 over the URL domain A to J, built for both relations from the part of the randomly ordered input read so far]
% input: 100; [-2382, 2582] 95% probability

Sketches over Samples – Estimation
Define estimator over two completely different random processes & analyze statistically
– Sampling – random partition, tuple domain
– Sketches – random projection, frequency domain
– Consider correlation between multiple sketches that share same sample
– Moment generating functions
Generic analysis independent of sampling process
– Bernoulli sampling
– Sampling without replacement
– Sampling with replacement
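The combination of the two random processes can be illustrated with the Bernoulli case: sample each relation, sketch only the sampled tuples, and scale the sketch estimate by the inverse sampling probabilities. The code below uses basic AGMS counters averaged over several seeds; the function names, the SHA-256 based sign function, and the parameters `p` and `q` are hypothetical and chosen for illustration.

```python
import hashlib
import random
from typing import List, Sequence, Tuple

Item = Tuple[str, int]  # (join key, weight)


def sign(seed: int, key: str) -> int:
    """Pseudo-random +1/-1 for a key; SHA-256 stands in for a real generator."""
    digest = hashlib.sha256(f"{seed}|{key}".encode()).digest()
    return 1 if digest[0] & 1 else -1


def agms_counter(stream: Sequence[Item], seed: int) -> int:
    """Basic AGMS sketch: one counter holding the signed sum of the weights."""
    return sum(sign(seed, key) * weight for key, weight in stream)


def bernoulli_sample(stream: Sequence[Item], p: float, rng: random.Random) -> List[Item]:
    """Keep each tuple independently with probability p."""
    return [item for item in stream if rng.random() < p]


def sketch_over_samples(left: Sequence[Item], right: Sequence[Item],
                        p: float, q: float,
                        seeds: Sequence[int], rng: random.Random) -> float:
    left_sample = bernoulli_sample(left, p, rng)
    right_sample = bernoulli_sample(right, q, rng)
    # All sketches share the same sample, hence the correlation noted on the slide.
    products = [agms_counter(left_sample, s) * agms_counter(right_sample, s)
                for s in seeds]
    # Each sampled frequency shrinks by p (or q) in expectation, so undo both.
    return sum(products) / len(products) / (p * q)


documents = [(url, 1) for url in ["A", "B", "C", "E", "F", "G", "I", "J"]]
user_visits = [("A", 45), ("B", 60), ("J", 30), ("D", 90), ("F", 15), ("G", 10),
               ("E", 20), ("E", 35), ("B", 25), ("J", 35), ("I", 25), ("D", 40),
               ("C", 50), ("H", 75), ("G", 90), ("F", 5)]
estimate = sketch_over_samples(documents, user_visits, p=0.5, q=0.5,
                               seeds=range(200), rng=random.Random(42))
print(estimate)  # a noisy estimate of the exact answer 445
```

On sixteen tuples the estimate is very noisy, in line with the wide bounds shown on the slides; the sketch is meant to show where the sampling and sketching steps sit, not to demonstrate accuracy.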

Sketches over Samples – Analysis
Var[sketch over samples] = Var[samples] + Var[sketch] + Var[interaction]

Conclusions
Data explosion
– Cheap, high-capacity storage
– Current processing technology is too expensive for the performance it provides
Framework for online analytical processing
– DBO system architecture
    Embed randomization into data processing
    Provide estimates and bounds at any time
– Approximation methods
    Sampling – most flexible
    Sketches – single pass
    Sketches over samples – fastest

Future Work
Short term
– Define & design query optimization for DBO
– Extend DBO to other types of queries and with other approximation techniques (end-biased samples, histograms, …)
– Generalize sketches to multiple relations
– Find optimal amount of data to sketch
– Fully integrate sketches into DBO system
Medium term
– Develop data aggregation & approximation techniques for other types of architectures
    Multicore processors, GPUs
    Distributed processing (Map-Reduce, Hadoop, …)
Long term
– Design & build scalable analytic processing system
    Aggregation & approximation

Publications
A. Dobra, C. Jermaine, F. Rusu, F. Xu – Turbo-Charging Estimate Convergence in DBO. In VLDB 2009.
F. Rusu and A. Dobra – Sketching Sampled Data Streams. In ICDE 2009.
F. Rusu et al. – The DBO Database System. In SIGMOD 2008 (demo).
F. Rusu and A. Dobra – Sketches for Size of Join Estimation. In TODS, vol. 33, no. 3, 2008.
F. Rusu and A. Dobra – Pseudo-Random Number Generation for Sketch-Based Estimations. In TODS, vol. 32, no. 2, 2007.
F. Rusu and A. Dobra – Statistical Analysis of Sketch Estimators. In SIGMOD 2007.
F. Rusu and A. Dobra – Fast Range-Summable Random Variables for Efficient Aggregate Estimation. In SIGMOD 2006.