
1 Towards a Synopsis Warehouse Peter J. Haas IBM Almaden Research Center San Jose, CA

2 Acknowledgements: Kevin Beyer Paul Brown Rainer Gemulla (TU Dresden) Wolfgang Lehner (TU Dresden) Berthold Reinwald Yannis Sismanis

3 Information Discovery for the Enterprise [Architecture diagram] Data sources: syndicated data providers, the crawlable/deep Web, and company data, ranging from unstructured (office documents, product manuals) and semi-structured (ECM: reports, spreadsheets, financial docs (XBRL)) to structured (ERP (SAP), CRM, WBI, BPM, SCM). Pipeline: crawl/ETL into an enterprise repository (content, metadata, business objects), then analyze and integrate to support business-object discovery, search, business intelligence, and data analysis & similarity over business objects such as orders, accounts, and customers. Example queries: “Explain the product movement, buyer behavior, maximize the ROI on my product campaigns.” “The sales team is visiting company XYZ next week. What do they need to know about XYZ?”

4 Motivation, Continued Challenge: Scalability –Massive amounts of data at high speed Batches and/or streams –Structured, semi-structured, unstructured data Want quick approximate analyses –Automated data integration and schema discovery –“Business object” identification –Quick approximate answers to queries –Data browsing/auditing Our approach: a warehouse of synopses –For scalability and flexibility

5 A Synopsis Warehouse [Diagram] A full-scale warehouse of data partitions feeds one synopsis per partition (S_1,1, S_1,2, …, S_n,m). The synopses are stored in a warehouse of synopses, where they can be merged into combined synopses such as S_*,* or S_1-2,3-7.

6 Outline Synopsis 1: Uniform samples –Background –Creating and combining samples Hybrid Bernoulli and Hybrid Reservoir algorithms –Updating samples Stable datasets: random pairing Growing datasets: resizing algorithms Maintaining Bernoulli samples of multisets Synopsis 2: AKMV samples for DV estimation –Base partitions: KMV synopses DV estimator and properties –Compound partitions: augmentation DV estimator and closure properties

7 Synopsis 1: Uniform Samples Design goals –True uniformity –Bounded memory –Keep sample full –Support for compressed samples 80% of 1000 customer datasets had < 4000 distinct values [Diagram: a uniform sample of the data feeds other synopses, stratified samples, mining algorithms, and statistical procedures.]

8 Classical Uniform Methods Bernoulli sampling –Coin flip: includes each element with prob = q –Random, unbounded (binomial) sample size –Easy to merge: Bern(q) ∪ Bern(q) = Bern(q) Reservoir sampling –Creates uniform sample of fixed size k Insert first k elements into sample Then insert ith element with prob. p_i = k / i –Variants and optimizations (e.g., Vitter) –Merging is harder [Illustration: stream x_1, …, x_6; a reservoir of size 3 currently holds {x_1, x_2, x_4}; the 5th element is included with probability 3/5.]
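To make the reservoir scheme above concrete, here is a minimal sketch in Python of the classical algorithm (insert the first k elements, then include the i-th with probability k/i); the function name and toy stream are illustrative, and none of the optimizations mentioned on the slide (e.g., Vitter's skip-based variants) are included.

```python
import random

def reservoir_sample(stream, k):
    """Return a uniform random sample of k elements from a single pass over stream."""
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            sample.append(item)                    # keep the first k elements
        elif random.random() < k / i:              # include i-th element with prob k/i
            sample[random.randrange(k)] = item     # evict a uniformly chosen element
    return sample

# Toy usage: a size-3 sample of the stream x1..x6
print(reservoir_sample(["x1", "x2", "x3", "x4", "x5", "x6"], 3))
```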

9 Drawback of Basic Methods Neither method is very compact –Ex: a dataset containing many copies of A and many copies of B –Stored as (A,A,…,A,B,B,…,B) chars Concise sampling (GM 98) –Compact: purge Bern(q) sample S if too large A Bern(q’/q) subsample of S yields a Bern(q’) sample –Not uniform (rare items under-represented)

10 New Sampling Methods (ICDE ’06) Two flavors: –Hybrid reservoir (HR) –Hybrid Bernoulli (HB) Properties –Truly uniform –Bounded footprint at all times –Will store exact distribution if possible –Samples stored in compressed form –Merging algorithms available

11 Hybrid Reservoir (HR) Sampling [Worked example: sample capacity = two (value, count) pairs or three values. Phase 1 maintains the exact frequency distribution as compressed (value, count) pairs while it fits, e.g. over several insertions of a and b. When a further insertion would exceed the footprint, the compressed sample is subsampled and expanded into plain values, e.g. {a, b, b}, and Phase 2 continues with ordinary reservoir sampling: inserting c yields {c, b, b}, then inserting d yields {c, b, d}.]

12 Hybrid Bernoulli Similar to Hybrid Reservoir except –Expand into Bernoulli sample in Phase 2 –Revert to Reservoir sample in Phase 3 If termination in Phase 2 –Uniform sample –“Almost” a Bernoulli sample (controllable engineering approximation)

13 Merging Samples Both samples in Phase 2 (usual case) –Bernoulli: equalize q’s and take union Take subsample to equalize q’s –Reservoir: take subsamples and merge Random (hypergeometric) subsample size Corner cases –One sample in Phase 1, etc. –See ICDE ’06 paper for details
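A minimal sketch of the Bernoulli branch of the merge, assuming the two samples come from disjoint partitions D1 and D2: equalize the sampling rates by thinning the sample with the larger rate (the Bern(q'/q) trick from the Concise Sampling discussion) and take the union. Names are illustrative, not the authors' implementation.

```python
import random

def merge_bernoulli(s1, q1, s2, q2):
    """Merge a Bern(q1) sample of D1 and a Bern(q2) sample of D2 (disjoint)
    into a Bern(q) sample of D1 u D2, where q = min(q1, q2)."""
    q = min(q1, q2)

    def thin(sample, q_old):
        # A Bern(q/q_old) subsample of a Bern(q_old) sample is a Bern(q) sample
        return [x for x in sample if random.random() < q / q_old]

    return thin(s1, q1) + thin(s2, q2), q

# Toy usage with hypothetical samples and sampling rates
merged, q = merge_bernoulli(list("abcde"), 0.5, list("vwxyz"), 0.2)
```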

14 HB versus HR Advantages: –HB samples are cheaper to merge Disadvantages: –HR sampling controls sample size better –Need to know partition size in advance For subsampling during sample creation –Engineering approximation required

15 Speedup: HB Sampling You derive “speed-up” advantages from parallelism with up to about 100 partitions.

16 Speedup: HR Sampling Similar results to the previous slide, but merging HR samples is more complex than merging HB samples.

17 Linear Scale-Up [Plots: both HB sampling and HR sampling exhibit linear scale-up.]

18 Updates Within a Partition Arbitrary inserts/deletes (updates trivial) Previous goals still hold –True uniformity –Bounded sample size –Keep sample size close to upper bound Also: minimize/avoid base-data access [Diagram: inserts, deletes, and updates are applied to the partition and reflected in its sample in the synopsis warehouse; going back to the full-scale warehouse for base data is marked as expensive.]

19 New Algorithms (VLDB ’06+) Stable datasets: Random pairing –Generalizes reservoir/stream sampling Handles deletions Avoids base-data accesses –Dataset insertions paired randomly with “uncompensated deletions” Only requires counters (c_g, c_b) of “good” and “bad” UD’s Insert into sample with probability c_b / (c_b + c_g) –Extended sample-merging algorithm (VLDBJ ’07) Growing datasets: Resizing –Theorem: can’t avoid base-data access –Main ideas: Temporarily convert to Bern(q): may require base-data access Drift up to new size (stay within new footprint at all times) Choose q optimally to reduce overall resizing time –Approximate and Monte Carlo methods
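As a rough illustration of the random-pairing idea (insertions compensate earlier deletions, chosen via the counters c_b and c_g), here is a sketch in Python. It simplifies several details of the VLDB '06 algorithm, e.g. duplicate items and the exact bookkeeping of the reservoir step, so treat it as a sketch of the control flow rather than the published algorithm.

```python
import random

class RandomPairingSample:
    """Bounded uniform sample under inserts and deletes, without base-data access (sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.sample = []
        self.c_bad = 0    # uncompensated deletions that removed a sample item ("bad" UDs)
        self.c_good = 0   # uncompensated deletions of non-sample items ("good" UDs)
        self.n = 0        # current dataset size

    def insert(self, item):
        self.n += 1
        if self.c_bad + self.c_good > 0:
            # Pair this insertion with a previously uncompensated deletion
            if random.random() < self.c_bad / (self.c_bad + self.c_good):
                self.sample.append(item)
                self.c_bad -= 1
            else:
                self.c_good -= 1
        elif len(self.sample) < self.capacity:
            self.sample.append(item)                      # reservoir not yet full
        elif random.random() < self.capacity / self.n:    # classical reservoir step
            self.sample[random.randrange(self.capacity)] = item

    def delete(self, item):
        self.n -= 1
        if item in self.sample:
            self.sample.remove(item)
            self.c_bad += 1
        else:
            self.c_good += 1
```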

20 Bernoulli Samples of Multisets (PODS ’07) Bernoulli samples over multisets (w. deletions) –When boundedness is not an issue –Compact, easy to parallelize –Problem: how to handle deletions (pairing?) Idea: maintain “tracking counter” –# inserts into DS since first insertion into sample (GM98) Can exploit tracking counter –To estimate frequencies, sums, avgs Unbiased (except avg) and low variance –To estimate # distinct values (!) Maintaining tracking counter –Subsampling: new algorithm –Merging: negative result

21 Outline Synopsis 1: Uniform samples –Background –Creating and combining samples Hybrid Bernoulli and Hybrid Reservoir algorithms –Updating samples Stable datasets: random pairing Growing datasets: resizing algorithms Maintaining Bernoulli samples of multisets Synopsis 2: AKMV samples for DV estimation –Base partitions: KMV synopses DV estimator and properties –Compound partitions: augmentation DV estimator and closure properties

22 AKMV Samples (SIGMOD ’07) Goal: Estimate # distinct values –Dataset similarity (Jaccard distance) –Key detection –Data cleansing Within warehouse framework –Must handle multiset union, intersection, difference

23 KMV Synopsis Used for a base partition Synopsis: k smallest hashed values –vs bitmaps (e.g., logarithmic counting) Need inclusion/exclusion to handle intersection Less accuracy, poor scaling –vs sample counting Random size K (between k/2 and k) –vs Bellman [DJMS02] minHash for k independent hash functions O(k) time per arriving value, vs O(log k) Can view as uniform sample of DV’s

24 The Basic Estimator Estimate: D̂_k = (k – 1) / U_(k), where –U_(k) = kth smallest (normalized) hashed value Properties (theory of uniform order statistics) –Normalized hashed values “look like” i.i.d. uniform[0,1] RVs Large-D scenario (simpler formulas) –Theorem: U_(k) approx. = sum of k i.i.d. exp(D) random variables –Analysis coincides with [Cohen97] –Can use simpler formulas to choose synopsis size
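A compact sketch of a KMV synopsis along these lines: keep the k smallest normalized hash values in a max-heap (O(log k) per arriving value, as on the previous slide) and estimate D with the bias-corrected (k − 1)/U_(k) from the "Intuition" slide in the backup material. The hash function and class layout here are illustrative choices, not the paper's code.

```python
import hashlib
import heapq

class KMVSynopsis:
    """Keep the k minimum normalized hash values over the distinct items seen (sketch)."""

    def __init__(self, k):
        self.k = k
        self.heap = []        # max-heap of the k smallest values, stored negated
        self.kept = set()     # hash values currently in the synopsis

    @staticmethod
    def _hash(item):
        h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16)
        return h / float(1 << 160)                  # normalize to (0, 1)

    def add(self, item):
        u = self._hash(item)
        if u in self.kept:
            return                                   # duplicates hash to the same value
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, -u)
            self.kept.add(u)
        elif u < -self.heap[0]:                      # smaller than current k-th smallest
            evicted = -heapq.heappushpop(self.heap, -u)
            self.kept.discard(evicted)
            self.kept.add(u)

    def estimate_dv(self):
        if len(self.heap) < self.k:
            return len(self.heap)                    # fewer than k DVs seen: exact count
        u_k = -self.heap[0]                          # U_(k), the k-th smallest hash value
        return (self.k - 1) / u_k                    # basic estimator

# Toy usage: estimate the number of distinct values in a stream with duplicates
kmv = KMVSynopsis(k=64)
for v in range(10000):
    kmv.add(v % 3000)
print(kmv.estimate_dv())   # roughly 3000
```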

25 Compound Partitions Given a multiset expression E –In terms of base partitions A_1, …, A_n –Union, intersection, multiset difference Augmented KMV synopsis –KMV synopsis for the union A_1 ∪ ⋯ ∪ A_n –Counters: c_E(v) = multiplicity of value v in E –AKMV synopses are closed under multiset operations Estimator (unbiased) for # DVs in E uses K_E = # positive counters
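A plausible reconstruction of that unbiased estimator, consistent with the basic (k − 1)/U_(k) estimator above and the ratio form of the SIGMOD ’07 approach (the exact expression is an assumption here, not taken from the slide):

\hat{D}_E = \frac{K_E}{k} \cdot \frac{k-1}{U_{(k)}}

where U_(k) is the largest of the k hash values retained in the combined KMV synopsis and K_E is the number of those values v with c_E(v) > 0.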

26 Experimental Comparison [Plot: absolute relative error of the DV estimators Unbiased, SDLogLog, Sample-Counting, and Unbiased-baseline.]

27 For More Details
“Toward automated large scale information integration and discovery.” P. Brown, P. J. Haas, J. Myllymaki, H. Pirahesh, B. Reinwald, and Y. Sismanis. In Data Management in a Connected World, T. Härder and W. Lehner, eds. Springer-Verlag.
“Techniques for warehousing of sample data.” P. G. Brown and P. J. Haas. ICDE ’06.
“A dip in the reservoir: maintaining sample synopses of evolving datasets.” R. Gemulla, W. Lehner, and P. J. Haas. VLDB ’06.
“Maintaining Bernoulli samples over evolving multisets.” R. Gemulla, W. Lehner, and P. J. Haas. PODS ’07.
“On synopses for distinct-value estimation under multiset operations.” K. Beyer, P. J. Haas, B. Reinwald, Y. Sismanis, and R. Gemulla. SIGMOD ’07.
“Maintaining bounded-size sample synopses of evolving multisets.” R. Gemulla, W. Lehner, and P. J. Haas. VLDB Journal, 2007.

28 Backup Slides

29 Bernoulli Sampling –Bern(q) independently includes each element with probability q –Random, uncontrollable sample size –Easy to merge Bernoulli samples: union of 2 Bern(q) samp’s = Bern(q) [Probability tree for q = 1/3: as t_1, t_2, t_3 are inserted, each element is included with probability 1/3 and excluded with probability 2/3, so each possible sample occurs with the product of these probabilities: roughly 30% for the empty sample, 15% for each singleton, 7% for each pair, and 4% for all three elements.]

30 Reservoir Sampling (Example) [Example: sample size M = 2. Elements t_1 and t_2 are inserted with probability 100%; when t_3 arrives it replaces each of them with probability 1/3, so each of the three possible size-2 samples {t_1, t_2}, {t_1, t_3}, {t_2, t_3} is equally likely (33%).]

31 Concise-Sampling Example Dataset –D = { a, a, a, b, b, b } Footprint –F = one pair Three (possible) samples of size = 3 –S_1 = { a, a, a }, S_2 = { b, b, b }, S_3 = { a, a, b } –Stored compactly: S_1 = { (a,3) }, S_2 = { (b,3) }, S_3 = { (a,2), (b,1) } Three samples should have equal likelihood –But Prob(S_1) = Prob(S_2) > 0 and Prob(S_3) = 0, since S_3 needs two pairs and exceeds the footprint In general: –Concise sampling under-represents ‘rare’ population elements

32 Hybrid Bernoulli Algorithm Phase 1 –Start by storing 100% sample compactly –Termination in Phase 1 ⇒ exact distribution Abandon Phase 1 if footprint too big –Take subsample and expand –Fall back to Bernoulli sampling (Phase 2) –If footprint exceeded: revert to reservoir sampling (Phase 3) Compress sample upon termination If Phase 2 termination: (almost) Bernoulli sample If Phase 3 termination: Bounded reservoir sample Stay within footprint at all times –Messy details

33 Subsampling in HB Algorithm Goal: find q such that P{|S| > n_F} = p Solve numerically (|S| is Binomial(|D|, q)): ∑_{j = n_F + 1}^{|D|} C(|D|, j) q^j (1 – q)^{|D| – j} = p An approximate closed-form solution has < 3% error

34 Merging HB Samples If both samples in Phase 2 –Choose q as before (w.r.t. |D_1 ∪ D_2|) –Convert both samples to compressed Bern(q) [Use Bern(q’/q) trick as in Concise Sampling] –If union of compressed samples fits in memory then join and exit else use reservoir sampling (unlikely)

35 Merging a Pair of HR Samples If both samples in Phase 2 –Set k = min(|S_1|, |S_2|) –Select L elements from S_1 and k – L from S_2 L has hypergeometric distribution on {0, 1, …, k} –Distribution depends on |D_1|, |D_2| Take (compressed) reservoir subsamples of S_1, S_2 Join (compressed union) and exit

36 Generating Realizations of L L is a random variable with probability mass function P(l) = P{ L = l } for l = 0, 1, …, k – 1 Simplest implementation –Compute P recursively –Use inversion method (probe cumulative distribution at each merge) Optimizations when |D|’s and |S|’s unchanging –Use alias methods to generate L from cached distributions in O(1) time
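Assuming L follows the standard hypergeometric law implied by the previous slide (L = number of the k merged elements that come from D_1), a sketch of the inversion method and the resulting merge could look as follows; function names are illustrative.

```python
import math
import random

def hypergeom_pmf(l, n1, n2, k):
    """P{L = l}: of k elements of D1 u D2, l come from D1, with |D1| = n1 and |D2| = n2."""
    return math.comb(n1, l) * math.comb(n2, k - l) / math.comb(n1 + n2, k)

def sample_L(n1, n2, k):
    """Inversion method: probe the cumulative distribution with one uniform draw."""
    u, cdf = random.random(), 0.0
    for l in range(k + 1):
        cdf += hypergeom_pmf(l, n1, n2, k)
        if u <= cdf:
            return l
    return k    # guard against floating-point round-off

def merge_reservoir(s1, n1, s2, n2):
    """Merge reservoir samples of disjoint partitions D1, D2 into one of size min(|s1|, |s2|)."""
    k = min(len(s1), len(s2))
    L = sample_L(n1, n2, k)
    return random.sample(s1, L) + random.sample(s2, k - L)
```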

37 Naïve/Prior Approaches (Algorithm / Technique / Comments)
RS with deletions / conduct deletions, continue with smaller sample / unstable
Naïve / use insertions to immediately refill the sample / not uniform
RS with resampling / let sample size decrease, but occasionally recompute / expensive, unstable
CAR(WOR) / immediately sample from base data to refill the sample / stable but expensive
Bernoulli sampling with purging / “coin flip” sampling with deletions, purge if too large / not uniform (!)
Counting samples / modification of concise sampling / not uniform
Passive sampling / developed for data streams (sliding windows only) / special case of our RP algorithm
Distinct-value sampling / tailored for multiset populations / expensive, low space efficiency in our setting

38 Random Pairing

39 Performance

40 A Negative Result Theorem –Any resizing algorithm MUST access base data Example –a data set with samples of size 2 –after insertions, a new data set with samples of size 3 –Without reading the base data, a size-3 sample can only contain the two items kept in the old size-2 sample plus new items, so items discarded earlier can never reappear Not uniform!

41 Resizing: Phase 1 Conversion to Bernoulli sample –Given q, randomly determine sample size U = Binomial(|D|, q) –Reuse S to create Bernoulli sample Subsample if U < |S| Else sample additional tuples (base data access) –Choice of q: small q ⇒ fewer base-data accesses, large q ⇒ more base-data accesses
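A rough sketch of this Phase 1 conversion: draw the Bernoulli sample size U ~ Binomial(|D|, q), then either subsample the existing uniform sample S or fetch additional tuples from the base data. `fetch_from_base_data` is a hypothetical callback standing in for the expensive warehouse access; it is assumed to return tuples drawn uniformly from the dataset outside the current sample.

```python
import random

def convert_to_bernoulli(sample, dataset_size, q, fetch_from_base_data):
    """Phase 1 of resizing (sketch): turn a uniform sample into a Bern(q) sample."""
    # U = Binomial(|D|, q), drawn here as |D| coin flips for clarity (O(|D|) time)
    u = sum(random.random() < q for _ in range(dataset_size))
    if u <= len(sample):
        return random.sample(sample, u)              # subsample: no base-data access
    # Need more tuples than the sample holds: expensive base-data access
    extra = fetch_from_base_data(u - len(sample))
    return list(sample) + list(extra)
```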

42 Resizing: Phase 2 Run Bernoulli sampling –Include new tuples with probability q –Delete from sample as necessary –Eventually reach new sample size –Revert to reservoir sampling –Choice of q: small q ⇒ long drift time, large q ⇒ short drift time

43 Choosing q (Inserts Only) Expected Phase 1 (conversion) time Expected Phase 2 (drifting) time Choose q to minimize E[T_1] + E[T_2]

44 Resizing Behavior Example (dependence on base-access cost): –resize by 30% if sampling fraction drops below 9% –dependent on costs of accessing base data Low costs ⇒ immediate resizing Moderate costs ⇒ combined solution High costs ⇒ degenerates to Bernoulli sampling

45 Choosing q (w. Deletes) Simple approach (insert prob. = p > 0.5) –Expected change in partition size (Phase 2) (p)(1)+(1-p)(-1) = 2p-1 –So scale Phase 2 cost by 1/(2p-1) More sophisticated approach –Hitting time of Markov chain to boundary –Stochastic approximation algorithm Modified Kiefer-Wolfowitz

46 The RPMerge Algorithm Conceptually: defer deletions until after the merge Generate the Y_i’s directly –Can assume that deletions happen after the insertions

47 New Maintenance Method Idea: use tracking counters –After the j-th transaction, the augmented sample S_j is S_j = { ( X_j(t), Y_j(t) ) : t ∈ T and X_j(t) > 0 } X_j(t) = frequency of item t in the sample Y_j(t) = net # of insertions of t into R since t joined the sample [Diagram: the dataset holds N_j(t) copies of item t; the sample holds X_j(t) copies.] Insertion of t: insert t into the sample with prob. q Deletion of t: delete t from the sample with prob. (X_j(t) – 1) / (Y_j(t) – 1)

48 Frequency Estimation Naïve (Horvitz-Thompson) unbiased estimator Exploit tracking counter: Theorem Can extend to other aggregates (see paper)
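The two estimators referred to above can be written out as follows; the tracking-counter form follows the GM98-style counting-sample estimator and is easy to check for unbiasedness in the insert-only case, but the full theorem (covering deletions) is in the PODS ’07 paper, so treat these expressions as a reconstruction rather than the slide’s formulas:

\hat{N}_j^{\text{naive}}(t) = X_j(t) / q

\hat{N}_j^{\text{track}}(t) = Y_j(t) - 1 + 1/q \text{ if } X_j(t) > 0, \text{ and } 0 \text{ otherwise.}

Sanity check (a single insertion of t, no deletions): t enters the sample with probability q and then contributes 1/q, so the expected estimate is q · (1/q) = 1.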

49 Estimating Distinct-Value Counts If usual DV estimators unavailable (BH+07) Obtain S’ from S: insert t ∈ D(S) with a probability chosen so that (can show) P(t ∈ S’) = q for t ∈ D(R) HT unbiased estimator: = |S’| / q Improve via conditioning ( Var[ E[U|V] ] ≤ Var[U] )

50 Estimating the DV Count Exact computation via sorting –Usually infeasible Sampling-based estimation –Very hard problem (need large samples) Probabilistic counting schemes –Single-pass, bounded memory –Several flavors (mostly bit-vector synopses) Linear counting (ASW87) Logarithmic counting (FM85,WVT90,AMS, DF03) Sample counting (ASW87,Gi01, BJKST02)

51 Intuition Look at spacings –Example with k = 4 and D = 7 –E[V] ≈ 1 / D, so that D ≈ 1 / E[V] –Estimate D as 1 / Avg(V_1, …, V_k) –I.e., as k / Sum(V_1, …, V_k) –I.e., as k / U_(k) –Upward bias (Jensen’s inequality), so change k to k – 1, giving D̂ = (k – 1) / U_(k)
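A short justification of the k → k − 1 correction, under the idealization from slide 24 that the D normalized hashed values behave like i.i.d. Uniform(0,1) random variables:

U_{(k)} \sim \mathrm{Beta}(k, D - k + 1), \text{ so } \mathrm{E}[U_{(k)}] = \frac{k}{D+1} \text{ and } \mathrm{E}\!\left[\frac{1}{U_{(k)}}\right] = \frac{D}{k-1}.

Hence k / U_(k) overestimates D on average (Jensen’s inequality), while the corrected estimator satisfies \mathrm{E}[(k-1)/U_{(k)}] = D, i.e., it is unbiased in this idealized model.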