Probabilistic Histograms for Probabilistic Data. Graham Cormode (AT&T Labs-Research), Antonios Deligiannakis (Technical University of Crete), Minos Garofalakis (Technical University of Crete), Andrew McGregor (University of Massachusetts, Amherst).




Similar presentations
Answering Approximate Queries over Autonomous Web Databases Xiangfu Meng, Z. M. Ma, and Li Yan College of Information Science and Engineering, Northeastern.

Active Appearance Models
Design of Experiments Lecture I
Mining Compressed Frequent- Pattern Sets Dong Xin, Jiawei Han, Xifeng Yan, Hong Cheng Department of Computer Science University of Illinois at Urbana-Champaign.
Kaushik Chakrabarti (Univ. of Illinois), Minos Garofalakis (Bell Labs), Rajeev Rastogi (Bell Labs), Kyuseok Shim (KAIST and AITrc). Presented at the 26th VLDB Conference,
CS4432: Database Systems II
Cleaning Uncertain Data with Quality Guarantees Reynold Cheng, Jinchuan Chen, Xike Xie 2008 VLDB Presented by SHAO Yufeng.
Fast Algorithms For Hierarchical Range Histogram Constructions
Introduction to Histograms Presented By: Laukik Chitnis
Experimental Design, Response Surface Analysis, and Optimization
Chapter 4 Probability and Probability Distributions
Efficient Query Evaluation on Probabilistic Databases
Parameterized Timing Analysis with General Delay Models and Arbitrary Variation Sources Khaled R. Heloue and Farid N. Najm University of Toronto {khaled,
LUDWIG- MAXIMILIANS- UNIVERSITY MUNICH DATABASE SYSTEMS GROUP DEPARTMENT INSTITUTE FOR INFORMATICS Probabilistic Similarity Queries in Uncertain Databases.
Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data Wenjie Zhang University of New South Wales & NICTA, Australia Joint work:
AI in Game Programming, IT University of Copenhagen: Statistical Learning Methods, Marco Loog.
New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007.
A Generic Framework for Handling Uncertain Data with Local Correlations Xiang Lian and Lei Chen Department of Computer Science and Engineering The Hong.
Deterministic Wavelet Thresholding for Maximum-Error Metrics Minos Garofalakis Bell Laboratories Lucent Technologies 600 Mountain Avenue Murray Hill, NJ.
© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.
Processing Data-Stream Joins Using Skimmed Sketches Minos Garofalakis Internet Management Research Department Bell Labs, Lucent Technologies Joint work.
1 Wavelet synopses with Error Guarantees Minos Garofalakis Phillip B. Gibbons Information Sciences Research Center Bell Labs, Lucent Technologies Murray.
Computer Science Characterizing and Exploiting Reference Locality in Data Stream Applications Feifei Li, Ching Chang, George Kollios, Azer Bestavros Computer.
Review Rong Jin. Comparison of Different Classification Models  The goal of all classifiers Predicating class label y for an input x Estimate p(y|x)
Radial Basis Function Networks
1 Assessment of Imprecise Reliability Using Efficient Probabilistic Reanalysis Farizal Efstratios Nikolaidis SAE 2007 World Congress.
Hashed Samples Selectivity Estimators for Set Similarity Selection Queries.
Copyright © Cengage Learning. All rights reserved.
Modeling (Chap. 2) Modern Information Retrieval Spring 2000.
Regression Analysis Regression analysis is a statistical technique that is very useful for exploring the relationships between two or more variables (one.
CONGRESSIONAL SAMPLES FOR APPROXIMATE ANSWERING OF GROUP-BY QUERIES Swarup Acharya Phillip Gibbons Viswanath Poosala ( Information Sciences Research Center,
Probabilistic and Statistical Techniques 1 Lecture 24 Eng. Ismail Zakaria El Daour 2010.
1 A Bayesian Method for Guessing the Extreme Values in a Data Set Mingxi Wu, Chris Jermaine University of Florida September 2007.
Ranking Queries on Uncertain Data: A Probabilistic Threshold Approach Wenjie Zhang, Xuemin Lin The University of New South Wales & NICTA Ming Hua,
General Database Statistics Using Maximum Entropy Raghav Kaushik 1, Christopher Ré 2, and Dan Suciu 3 1 Microsoft Research 2 University of Wisconsin--Madison.
Join Synopses for Approximate Query Answering Swarup Achrya Philip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented by Bhushan Pachpande.
Biswanath Panda, Mirek Riedewald, Daniel Fink ICDE Conference 2010 The Model-Summary Problem and a Solution for Trees 1.
Constructing Optimal Wavelet Synopses Dimitris Sacharidis Timos Sellis
© 2010 AT&T Intellectual Property. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Intellectual Property. Structure-Aware Sampling:
Christopher Re and Dan Suciu University of Washington Efficient Evaluation of HAVING Queries on a Probabilistic Database.
CS 782 – Machine Learning Lecture 4 Linear Models for Classification  Probabilistic generative models  Probabilistic discriminative models.
The Haar + Tree: A Refined Synopsis Data Structure Panagiotis Karras HKU, September 7 th, 2006.
Graph-based Text Classification: Learn from Your Neighbors Ralitsa Angelova , Gerhard Weikum : Max Planck Institute for Informatics Stuhlsatzenhausweg.
Efficient EMD-based Similarity Search in Multimedia Databases via Flexible Dimensionality Reduction / 16 I9 CHAIR OF COMPUTER SCIENCE 9 DATA MANAGEMENT.
Event retrieval in large video collections with circulant temporal encoding CVPR 2013 Oral.
Privacy-preserving data publishing
Robust Estimation With Sampling and Approximate Pre-Aggregation Author: Christopher Jermaine Presented by: Bill Eberle.
HASE: A Hybrid Approach to Selectivity Estimation for Conjunctive Queries Xiaohui Yu University of Toronto Joint work with Nick Koudas.
Uncertainty Management In Rule Based Information Extraction Systems Author Eirinaios Michelakis Rajasekar Krishnamurthy Peter J. Haas Shivkumar Vaithyanathan.
Histograms for Selectivity Estimation, Part II Speaker: Ho Wai Shing Global Optimization of Histograms.
XCluster Synopses for Structured XML Content Alkis Polyzotis (UC Santa Cruz) Minos Garofalakis (Intel Research, Berkeley)
1 A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries Surajit Chaudhuri Gautam Das Vivek Narasayya Proceedings of the.
1 Travel Times from Mobile Sensors Ram Rajagopal, Raffi Sevlian and Pravin Varaiya University of California, Berkeley Singapore Road Traffic Control TexPoint.
1 Overview of Query Evaluation Chapter Outline  Query Optimization Overview  Algorithm for Relational Operations.
Written By: Presented By: Swarup Acharya,Amr Elkhatib Phillip B. Gibbons, Viswanath Poosala, Sridhar Ramaswamy Join Synopses for Approximate Query Answering.
Chapter 13 Simple Linear Regression
Analyzing Graphs of Functions 1.5
Data Transformation: Normalization
Data Mining: Concepts and Techniques
A paper on Join Synopses for Approximate Query Answering
Data Mining K-means Algorithm
Probabilistic Data Management
Data Mining Lecture 11.
Probabilistic Data Management
Data Integration with Dependent Sources
Introduction Second report for TEGoVA ‘Assessing the Accuracy of Individual Property Values Estimated by Automated Valuation Models’ Objective.
Probabilistic Data Management
Feifei Li, Ching Chang, George Kollios, Azer Bestavros
DATABASE HISTOGRAMS E0 261 Jayant Haritsa
Probabilistic Databases
Presentation transcript:

Probabilistic Histograms for Probabilistic Data Graham Cormode AT&T Labs-Research Antonios Deligiannakis Technical University of Crete Minos Garofalakis Technical University of Crete Andrew McGregor University of Massachusetts, Amherst

Slide 2: Talk Outline
* The need for probabilistic histograms
  - Sources and hardness of probabilistic data
  - Problem definition, interesting metrics
* Proposed solution
* Query processing using probabilistic histograms
  - Selections, joins, aggregation, etc.
* Experimental study
* Conclusions and future directions

Slide 3: Sources of Probabilistic Data
* Increasingly, data is uncertain and imprecise
  - Data collected from sensors has errors and imprecision
  - Record linkage has confidence of matches
  - Learning yields probabilistic rules
* Recent efforts to build uncertainty into the DBMS
  - MystiQ, Orion, Trio, MCDB and MayBMS projects
  - Model uncertainty and correlations within tuples
    - Attribute values use a probabilistic distribution over mutually exclusive alternatives
    - Assume independence across tuples
  - Aim to allow general-purpose queries over uncertain data
    - Selections, joins, aggregations, etc.

Slide 4: Probabilistic Data Reduction
* Probabilistic data can be difficult to work with
  - Even simple queries can be #P-hard [Dalvi, Suciu '04]
    - Joins and projections between (statistically) independent probabilistic relations need to track the history of generated tuples
  - Want to avoid materializing all possible worlds
* Seek compact representations of probabilistic data
  - Data synopses which capture key properties
  - Expensive operations can then be performed on the compact summaries

Slide 5: Shortcomings of Prior Approaches
* [CG'09] builds histograms that minimize the expectation of a given error metric
  - The domain is split into buckets
  - Each bucket is approximated by a single value
* Too much information is lost in this process
  - The expected frequency of an item tells us little about the probability that it appears i times
    - How to do joins, or selections based on frequency?
* Not a complete representation scheme
  - Given maximum space, the input representation cannot be fully captured

Slide 6: Our Contribution
* A more powerful representation of uncertain data
* Represent each bucket with a PDF
  - Captures the probability of each item appearing i times
* A complete representation
* Targets several metrics
  - EMD, Kullback-Leibler divergence, Hellinger distance
  - Max error, variation distance (L1), sum squared error, etc.

Slide 7: Talk Outline
* The need for probabilistic histograms
  - Sources and hardness of probabilistic data
  - Problem definition, interesting metrics
* Proposed solution
* Query processing using probabilistic histograms
  - Selections, joins, aggregation, etc.
* Experimental study
* Conclusions and future directions

Slide 8: Probabilistic Data Model
* Ordered domain U of data items (i.e., {1, 2, ..., N})
* Each item in U takes values from a value domain V
  - Each with a different frequency, so each item is described by a PDF
* Examples:
  - The PDF of item i describes the probability that i appears 0, 1, 2, ... times
  - The PDF of item i describes the probability that i measured value V1, V2, etc.

Slide 9: Representation Used
* Goal: partition the domain U into buckets
* Within each bucket b = (s, e), where s and e are the start and end of the bucket:
  - Approximate the (e - s + 1) PDFs with a single piecewise-constant PDF X(b)
* Error of the above approximation
  - Let d() denote a distance function on PDFs; per-bucket errors are typically combined by summation or MAX
* Given a space bound, we need to determine
  - the number of buckets
  - the number of terms (i.e., the PDF complexity) in each bucket
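As a concrete illustration of the representation above, a bucket covering items s..e that shares one piecewise-constant PDF can be sketched in Python as follows. The `Bucket` class and its field names are our own illustration, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Bucket:
    start: int                  # s: first item covered by the bucket
    end: int                    # e: last item covered by the bucket
    # Piecewise-constant PDF over the value domain, shared by all
    # (end - start + 1) items: one (value_lo, value_hi, prob) piece per term.
    pieces: List[Tuple[int, int, float]]

    def prob(self, v: int) -> float:
        # Probability mass the bucket PDF assigns to value v.
        for lo, hi, p in self.pieces:
            if lo <= v <= hi:
                return p
        return 0.0
```

A probabilistic histogram is then just a list of such buckets whose total number of pieces (terms) respects the space bound.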

Slide 10: Targeted Error Metrics
* Common probability metrics:
  - Variation distance (L1)
  - Sum squared error
  - Max error (L∞)
  - (Squared) Hellinger distance
  - Kullback-Leibler divergence (relative entropy)
* Earth Mover's Distance (EMD): distance between probabilities over the value domain
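All of the listed metrics compare two discrete PDFs. A minimal Python sketch of the standard definitions (function names are our own; the PDFs are equal-length lists of probabilities):

```python
import math

def variation_distance(p, q):    # L1 distance
    return sum(abs(a - b) for a, b in zip(p, q))

def sum_squared_error(p, q):     # squared L2 distance
    return sum((a - b) ** 2 for a, b in zip(p, q))

def max_error(p, q):             # L-infinity distance
    return max(abs(a - b) for a, b in zip(p, q))

def sq_hellinger(p, q):          # (1/2) * sum (sqrt(a) - sqrt(b))^2
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))

def kl_divergence(p, q):         # relative entropy; assumes q > 0 wherever p > 0
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)
```

EMD is omitted: unlike the pointwise metrics above, it depends on the distance between values in the value domain, which is why the paper treats it separately.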

Slide 11: General DP Scheme: Inter-Bucket
* Let B-OPT_b[w, T] denote the error of approximating the first w frequency values of the PDFs within bucket b using T terms
* Let H-OPT[m, T] denote the error of approximating the first m items in U using T terms
* Recurrence: check every start position of the last bucket and every number of terms to assign to it:
  - H-OPT[m, T] = min over k < m and t ≤ T of ( H-OPT[k, T - t] + B-OPT_(k+1,m)[V, t] )
  - i.e., use T - t terms for the first k items, and approximate all V + 1 frequency values of the last bucket (k + 1, m) using t terms
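The recurrence above can be sketched as a direct nested-loop DP. This is an illustrative Python version, not the paper's optimized implementation; `bucket_cost(s, e, t)` is an assumed callback returning B-OPT for bucket (s, e) with t terms.

```python
# Illustrative sketch of the inter-bucket DP recurrence.
def h_opt(N, T_total, bucket_cost):
    INF = float("inf")
    # H[m][T] = best error for the first m items using T terms
    H = [[INF] * (T_total + 1) for _ in range(N + 1)]
    for T in range(T_total + 1):
        H[0][T] = 0.0                      # zero items incur zero error
    for m in range(1, N + 1):
        for T in range(1, T_total + 1):
            for k in range(m):             # last bucket is (k + 1, m)
                for t in range(1, T + 1):  # terms given to the last bucket
                    c = H[k][T - t] + bucket_cost(k + 1, m, t)
                    if c < H[m][T]:
                        H[m][T] = c
    return H[N][T_total]
```

Per-bucket errors are combined here by summation; swapping `+` for `max` gives the MAX-combined variant mentioned on slide 9.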

Slide 12: General DP Scheme: Intra-Bucket
* Each bucket b = (s, e) summarizes the PDFs of items s, ..., e
  - Using from 1 to V = |V| terms
* Let VALERR(b, u, v) denote the minimum possible error of approximating the frequency values in [u, v] of bucket b with a single term. Then:
  - B-OPT_b[w, T] = min over u < w of ( B-OPT_b[u, T - 1] + VALERR(b, u + 1, w) )
  - i.e., check where the last term starts, and use T - 1 terms for the first u frequency values of the bucket
* The intra-bucket DP is not needed for the max-error (L∞) distance
* VALERR can be computed efficiently for each metric by utilizing pre-computations
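The intra-bucket recurrence has the same shape as the inter-bucket one. An illustrative Python sketch, assuming a `valerr(u, v)` callback that returns VALERR(b, u, v) for the bucket under consideration:

```python
# Illustrative sketch of the intra-bucket DP recurrence.
def b_opt(V, T_max, valerr):
    INF = float("inf")
    # B[w][T] = best error for the first w frequency values with T terms
    B = [[INF] * (T_max + 1) for _ in range(V + 1)]
    B[0][0] = 0.0
    for w in range(1, V + 1):
        for T in range(1, T_max + 1):
            for u in range(w):             # last term covers values u+1 .. w
                c = B[u][T - 1] + valerr(u + 1, w)
                if c < B[w][T]:
                    B[w][T] = c
    return B
```

The table B then supplies the B-OPT_b[w, T] values consumed by the outer (inter-bucket) DP.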

Slide 13: Sum Squared Error and (Squared) Hellinger Distance
* These are the simpler cases (and are solved similarly). Assume bucket b = (s, e), and suppose we want to compute VALERR(b, v, w)
* (Squared) Hellinger distance (SSE is similar):
  - Represent the rectangle [s, e] x [v, w] of PDF entries by a single value p
  - VALERR(b, v, w) is then computed in constant time using O(UV) pre-computed values: each needed quantity is obtained from 4 entries of pre-computed prefix-sum arrays A[] and B[]
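To make the single-representative step concrete: for the squared Hellinger objective, minimizing the sum of (sqrt(x) - sqrt(p))^2 over the entries x of the rectangle gives p equal to the squared mean of the square roots (by the usual least-squares argument applied to the square roots). A naive sketch, ignoring the constant-time prefix-sum machinery:

```python
import math

def hellinger_rep(cells):
    # Best single value p for a flat list of PDF entries under
    # squared Hellinger error: the squared mean of the square roots.
    s = sum(math.sqrt(x) for x in cells)
    return (s / len(cells)) ** 2

def hellinger_cost(cells):
    # Resulting single-term error for the rectangle, with the 1/2 factor.
    p = hellinger_rep(cells)
    return 0.5 * sum((math.sqrt(x) - math.sqrt(p)) ** 2 for x in cells)
```

The paper's prefix-sum arrays make `hellinger_cost` O(1) per rectangle; the naive loop here is O(number of cells).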

Slide 14: Variation Distance
* An interesting case, with several variations
* The best representative within a bucket is the median of the P values
  - Need to calculate the sum of the values below the median, a two-dimensional range-sum median problem
* The optimal PDF generated is NOT normalized
  - The normalized PDF produced by scaling is within a factor of 2 of optimal
* Extensions for ε-error (normalized) approximation
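The median property can be checked directly on a flat list of PDF entries. A naive sketch (the paper avoids the sort by using two-dimensional range-sum precomputations instead):

```python
def l1_rep_and_cost(cells):
    # For variation (L1) distance, any median of the entries minimizes the
    # sum of absolute deviations; return one median and the resulting cost.
    vals = sorted(cells)
    med = vals[len(vals) // 2]
    return med, sum(abs(x - med) for x in cells)
```

Note that using the median per rectangle generally breaks normalization of the bucket PDF, which is exactly the issue the factor-of-2 scaling result on this slide addresses.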

Slide 15: Other Distance Metrics
* Max error can be minimized efficiently using sophisticated pre-computations
  - No intra-bucket DP is needed
  - Complexity lower than all other metrics: O(TVN^2)
* The EMD case is more difficult (and costly) to handle
* Details in the paper…

Slide 16: Handling Selections and Joins
* Simple statistics such as expectations are simple to extract
* Selections on the item domain are straightforward
  - Discard the irrelevant buckets
  - The result is itself a probabilistic histogram
* Selections on the value domain are more challenging
  - They correspond to extracting the distribution conditioned on the selection criteria
* Range predicates are clean: the result is a probabilistic histogram of approximately the same size
  - (Slide figure: conditioning a PDF on X ≥ 3 via Pr[X = x | X ≥ 3] yields the renormalized probabilities 1/2, 1/3, 1/6)
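The range-predicate conditioning can be reproduced in a few lines. The dict-based PDF encoding is our own; the numbers match the slide's 1/2, 1/3, 1/6 figure when the retained masses are 0.3, 0.2, 0.1:

```python
def condition_on_range(pdf, lo):
    # Condition a discrete PDF (dict: value -> probability) on X >= lo:
    # keep the qualifying values and renormalize by the retained mass.
    mass = sum(p for v, p in pdf.items() if v >= lo)
    return {v: p / mass for v, p in pdf.items() if v >= lo}
```

Applied per bucket term, this is why a range selection yields a probabilistic histogram of approximately the same size as the input.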

Slide 17: Handling Joins and Aggregates
* The result of joining two probabilistic relations can be represented by joining their histograms
  - Assume the PDFs of each relation are independent
  - Example: for an equijoin on V, form the join by taking the product of the PDFs for each pair of intersecting buckets (joining on the bucket boundaries)
  - If the input histograms have B1 and B2 buckets respectively, the result has at most B1 + B2 - 1 buckets
    - Each bucket has at most T1 + T2 - 1 terms
* Aggregate queries are also supported
  - E.g., count(#tuples) in the result
  - Details in the paper…
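A naive sketch of the product construction, assuming each histogram is a list of (start, end, pdf) buckets with dict PDFs over a shared value domain (our own encoding, not the paper's). Because the output buckets are the pairwise intersections of the two bucket partitions of the item domain, at most B1 + B2 - 1 of them are non-empty:

```python
def join_histograms(h1, h2):
    # For each pair of buckets that overlap on the item domain, the joint
    # probability of agreeing on value v is the product of the two PDFs
    # (independence assumption from the slide above).
    out = []
    for s1, e1, p1 in h1:
        for s2, e2, p2 in h2:
            s, e = max(s1, s2), min(e1, e2)
            if s <= e:                     # buckets overlap
                joint = {v: p1[v] * p2[v] for v in p1 if v in p2}
                out.append((s, e, joint))
    return out
```
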

Slide 18: Experimental Study
* Evaluated on two probabilistic data sets
  - Real data from the MystiQ project (127K tuples, 27,700 items)
  - Synthetic data from the MayBMS generator (30K items)
* Competing technique considered: IDEAL-1TERM
  - One bucket per item (i.e., no space bound)
  - A single term per bucket
* Investigated:
  - Scalability of PHist for each metric
  - Error compared to IDEAL-1TERM

Slide 19: Quality of Probabilistic Histograms
* Clear benefit when compared to IDEAL-1TERM
  - PHist is able to approximate the full distribution

Slide 20: Scalability
* Time cost is linear in T and quadratic in N
  - Variation distance (almost cubic complexity in N) scales poorly
* Observe the "knee" in the right figure: the cost of a bucket with more than V terms is the same as with exactly V terms, so the inner DP reuses already-computed costs

Slide 21: Concluding Remarks
* Presented techniques for building probabilistic histograms over probabilistic data
  - They capture the full distribution of the data items, not just expectations
  - They support several minimization metrics
  - The resulting histograms can handle selection, join, and aggregation queries
* Future work
  - The current model assumes independence of items; seek extensions where this assumption does not hold
  - Running-time improvements:
    - (1+ε)-approximate solutions [Guha, Koudas, Shim: ACM TODS 2006]
    - Prune the search space (e.g., very large buckets) using lower bounds on bucket costs