Cleaning Uncertain Data with Quality Guarantees
Reynold Cheng, Jinchuan Chen, Xike Xie (VLDB 2008)
Presented by SHAO Yufeng


Outline: Background, Related works, Data and query model, PWS-quality model, Cleaning procedure, Experimental results

Uncertain Database (old model)
• Uncertainty is inherent in various applications
• Examples:
  • RFID data
  • Sensor networks
  • Data protected for privacy reasons
• Eliminating all uncertainty is infeasible in many models

Uncertain Database (new model)
• Previous models focus on querying the uncertain database
• But what if we are able to reduce SOME of the uncertainty in this kind of database?
• A new model is required to produce an optimal solution

Example 1: Sensor probing
• Some sensors in the sensor network might have transmission problems and cannot update their data
• Commands can be sent to refresh some sensors
• New, certain data are obtained
• Because of limited bandwidth and battery power, sensors cannot be probed too often

Example 2: Movie rating
• Movie ratings (IMDB, Netflix) collected from customers might contain some uncertainty
• Managers can communicate with customers to verify the rating data
• New, certain movie rating data are obtained
• Limited by manpower or other resources

Cleaning Data
• Uncertain DB → query → ambiguous result
• Cleaning procedure: uncertain DB → LESS uncertain DB
• LESS uncertain DB → query → LESS ambiguous result

Real model example
• A database of some products and their (uncertain) prices

Key  Product ID  Price ($)  Prob.
a1   a           120        0.7
a2   a           80         0.3
b1   b           ?          0.6
b2   b           90         0.4
c1   c           ?          0.5
c2   c           ?          0.3
c3   c           100        0.2
d1   d           10         1

• The price of product a has two possible values: 120 (prob. 0.7) or 80 (prob. 0.3)

Query Example 1
• Query 1 (range query) on the product table: select products with price in the range [$100, $110]
• Possible-world results: ({b1,c2}, 0.18), ({b1,c3}, 0.12), ({b1}, 0.3), ({c2}, 0.12), ({c3}, 0.08), (∅, 0.2)

Query Example 2
• Query 2 (max query) on the product table: select the product with the highest price
• Possible-world results: ({c1}, 0.5), ({a1}, 0.35), ({c2}, 0.036), ({b1}, 0.036), ({b1,c2}, 0.054), ({c3}, 0.024)
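The two query examples can be reproduced by enumerating possible worlds directly. The sketch below is only an illustration: the transcript's table does not preserve the prices of b1, c1 and c2, so the values 110, 140 and 110 used here are assumptions chosen to be consistent with the listed results, and the function names are hypothetical.

```python
from itertools import product
from collections import defaultdict

# x-tuple -> list of (key, price, existential probability).
# ASSUMPTION: prices of b1 (110), c1 (140) and c2 (110) are not in the
# transcript; they are picked to be consistent with the query results shown.
DB = {
    "a": [("a1", 120, 0.7), ("a2", 80, 0.3)],
    "b": [("b1", 110, 0.6), ("b2", 90, 0.4)],
    "c": [("c1", 140, 0.5), ("c2", 110, 0.3), ("c3", 100, 0.2)],
    "d": [("d1", 10, 1.0)],
}

def possible_worlds(db):
    """Yield (world, probability); a world picks exactly one tuple per x-tuple."""
    for world in product(*db.values()):
        prob = 1.0
        for _, _, e in world:
            prob *= e
        yield world, prob

def range_query(world, lo=100, hi=110):
    """Keys of all tuples whose price lies in [lo, hi]."""
    return frozenset(key for key, price, _ in world if lo <= price <= hi)

def max_query(world):
    """Keys of all tuples that share the highest price (ties are kept)."""
    top = max(price for _, price, _ in world)
    return frozenset(key for key, price, _ in world if price == top)

def pw_results(db, query):
    """Group possible worlds by query result and add up their probabilities."""
    dist = defaultdict(float)
    for world, prob in possible_worlds(db):
        dist[query(world)] += prob
    return dict(dist)

if __name__ == "__main__":
    for name, query in [("range", range_query), ("max", max_query)]:
        results = pw_results(DB, query)
        for result, prob in sorted(results.items(), key=lambda kv: -kv[1]):
            print(name, sorted(result), round(prob, 3))
```

The database has only 2 × 2 × 3 × 1 = 12 possible worlds, so brute force is fine here; the number of worlds grows exponentially with the number of x-tuples.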

Clean-up example
• Suppose we have some amount of resources to clean up some data
• Assume we clean up the information related to products a and c
• New database with less uncertainty:

Key  Product ID  Price ($)  Prob.
a2   a           80         1
b1   b           ?          0.6
b2   b           90         0.4
c3   c           100        1
d1   d           10         1

Clean-up example (cont.)
• Run Query 1 again on the cleaned database: select products with price in the range [$100, $110]
• New possible-world results: ({b1,c3}, 0.6), ({c3}, 0.4)
• Old possible-world results: ({b1,c2}, 0.18), ({b1,c3}, 0.12), ({b1}, 0.3), ({c2}, 0.12), ({c3}, 0.08), (∅, 0.2)
• The cleaned database is clearly less uncertain, but the clean-up procedure is limited by the budget

Outline: Background, Related works, Data and query model, PWS-quality model, Cleaning procedure, Experimental results

Important related works
• Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar: Evaluating Probabilistic Queries over Imprecise Data. SIGMOD Conference 2003
  • Mentions the idea of cleaning for max/min and range queries, but gives no real implementation
• P. Andritsos, A. Fuxman, and R. Miller: Clean answers over dirty databases: A probabilistic approach. In ICDE
  • Introduces a technique for rewriting queries

Important related works (cont.)
• Jinchuan Chen, Reynold Cheng: Quality-Aware Probing of Uncertain Data with Resource Constraints. SSDBM 2008
  • Similar cleaning method
  • Uses a continuous pdf to represent uncertainty
  • Supports fewer query types (range queries only)
• Chris Mayfield, Jennifer Neville, Sunil Prabhakar: ERACER: A Database Approach for Statistical Inference and Data Cleaning. SIGMOD 2010
  • Uses attribute-level correlation to provide optimized clean-up

Outline: Background, Related works, Data and query model, PWS-quality model, Cleaning procedure, Experimental results

System Structure

Important Notations

Key  Product ID  Price ($)  Prob.
a1   a           120        0.7
a2   a           80         0.3
b1   b           ?          0.6
b2   b           90         0.4
c1   c           ?          0.5
c2   c           ?          0.3
c3   c           100        0.2
d1   d           10         1

• Tuple t_i (n tuples in total): one row of the table
• x-tuple τ_i (m x-tuples in total): the rows for one product form one x-tuple
• Uncertain attribute: Price ($)
• Existential probability (e_i): Prob.


Query in possible world model
• Query answer: (b1, 0.28), (c2, 0.18), (c3, 0.1)
• Possible-world results: ({b1,c2}, 0.18), ({b1,c3}, …)
• Qualification probability (p_i) of tuple c2: 0.18
• Qualification probability (P_k) of x-tuple c: 0.28

Probabilistic Range Query (PRQ)
• Given a closed interval [a, b], where a, b ∈ ℝ and a ≤ b, a PRQ returns a set of tuples (t_i, p_i), where p_i is the non-zero probability that the value of t_i lies in [a, b].
• Range query on the product table: select products with price in the range [$100, $110]
• Possible-world result set, each result r_j with its probability q_j of occurrence: ({b1,c2}, 0.18), ({b1,c3}, 0.12), ({b1}, 0.3), ({c2}, 0.12), ({c3}, 0.08), (∅, 0.2)

Probabilistic Maximum Query (PMaxQ)
• A PMaxQ returns a set of tuples (t_i, p_i), where p_i, the qualification probability of t_i, is the non-zero probability that v_i ≥ v_j, where j = 1, …, n and j ≠ i.
• Query on the product table: select the product with the highest price
• Possible-world results: ({c1}, 0.5), ({a1}, 0.35), ({c2}, 0.036), ({b1}, 0.036), ({b1,c2}, 0.054), ({c3}, 0.024)
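Tuple qualification probabilities for a max query can be computed without enumerating possible worlds, because x-tuples are independent: a tuple is a maximum when it exists and every other x-tuple takes a value no larger than its own. The following is a minimal sketch under two assumptions: the prices of b1, c1 and c2 (110, 140, 110) are not in the transcript and are chosen to match the listed results, and every x-tuple's existential probabilities sum to 1. The function name is hypothetical.

```python
# x-tuple -> list of (key, price, existential probability).
# ASSUMPTION: prices of b1, c1 and c2 are illustrative, not from the slides.
DB = {
    "a": [("a1", 120, 0.7), ("a2", 80, 0.3)],
    "b": [("b1", 110, 0.6), ("b2", 90, 0.4)],
    "c": [("c1", 140, 0.5), ("c2", 110, 0.3), ("c3", 100, 0.2)],
    "d": [("d1", 10, 1.0)],
}

def max_qualification_probs(db):
    """Return {tuple key: p_i} for tuples with non-zero probability of being a maximum."""
    probs = {}
    for xt, tuples in db.items():
        for key, price, e in tuples:
            p = e  # the tuple itself must exist
            for other, other_tuples in db.items():
                if other == xt:
                    continue
                # Probability that the other x-tuple's value does not exceed `price`.
                p *= sum(e2 for _, v2, e2 in other_tuples if v2 <= price)
            if p > 0:
                probs[key] = p
    return probs

if __name__ == "__main__":
    p = max_qualification_probs(DB)
    for key in sorted(p):
        print(key, round(p[key], 3))
    # x-tuple qualification probability P_k is the sum over its tuples.
    print("P_c =", round(sum(p.get(k, 0.0) for k, _, _ in DB["c"]), 3))
```

With these assumed prices, b1 gets 0.09, which equals the sum of the probabilities of the results {b1} and {b1,c2} in the possible-world answer.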

Outline: Background, Related works, Data and query model, PWS-quality model, Cleaning procedure, Experimental results

PWS-quality
• Suppose we have two sets of possible-world results
• (Figure: two sets of possible-world results built from results such as {a2,b1}, {a1,b2,c1}, {b3,c2}, {b1} and {a1,c1}; only one probability, 0.3, is legible)
• We need a measure that tells which result set is more uncertain, and by how much
• Solution: use an entropy-like measure to calculate the PWS-quality (degree of uncertainty)

PWS-Quality: Calculation
• Let q_j be the probability of getting the distinct PW-result r_j
• Let d be the number of distinct PW-results
• $S(D, Q) = \sum_{j=1}^{d} q_j \log_2 q_j$
• The score S(D, Q) is negative; the larger the score, the better the quality
• 0 means no uncertainty (only one possible-world result exists)

PWS-quality example
• Suppose we have a set of three possible-world results, {b1}, {a1,c1} and {b2}, with probabilities 0.5, 0.4 and 0.1 among them
• PWS score: S(D, Q) = 0.5 log 0.5 + 0.4 log 0.4 + 0.1 log 0.1 ≈ -1.36

PWS-quality problem
• However, calculating the PWS-quality over all possible worlds is too expensive
• The number of possible-world results can be exponential
• The algorithm needs to be sped up

x-Form PWS-Quality
• x-form of the PWS-quality: $S(D, Q) = \sum_{k=1}^{m} g(k, D, Q)$
• g(k, D, Q) = func(existential and qualification probabilities of the tuples in the k-th x-tuple)
• The score is the sum of the quality information of all result x-tuples
• Only x-tuples whose tuples are in the query answer are considered
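The score in this example can be checked numerically. The sketch below assumes base-2 logarithms, which is what the -1.17 figure in the later computation example suggests; the function name is hypothetical.

```python
import math

def pws_quality(probs):
    """Entropy-like PWS-quality: sum of q_j * log2(q_j) over distinct PW-results."""
    return sum(q * math.log2(q) for q in probs if q > 0)

# Three possible-world results with probabilities 0.5, 0.4 and 0.1.
print(round(pws_quality([0.5, 0.4, 0.1]), 2))  # -1.36

# A single certain result carries no uncertainty.
print(pws_quality([1.0]))  # 0.0
```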

x-Form of PRQ (Range Query)
• Each g(k, D, Q) requires only O(|τ_k|) time
• p_i and P_k are the qualification probabilities of the current tuple t_i and the current x-tuple τ_k; both can be calculated easily

x-Form of PMaxQ (Max Query)
• Calculating g(k, D, Q) for PMaxQ requires O(|τ_k|²) time
• The details of the proof are covered at the end of the presentation

x-form PWS-quality summary
• By transforming the original PWS-quality calculation into the x-form calculation, we avoid the exponential computation time
• Total computation time: O(m log(n/m))
• Compared to the query time, the x-form PWS-quality calculation time is small (shown in the experiments)
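The slide's formula for g(k, D, Q) is not preserved in the transcript. For a range query, one expression that follows from the independence of x-tuples is g(k) = Σ e_i log e_i over the qualifying tuples of τ_k, plus (1 - P_k) log(1 - P_k) for the case that no tuple of τ_k qualifies; it may differ in notation from the paper's. The sketch below compares it with brute-force enumeration. The prices of b1, c1 and c2 are assumptions, and all function names are hypothetical.

```python
import math
from itertools import product
from collections import defaultdict

# ASSUMPTION: prices of b1 (110), c1 (140) and c2 (110) are illustrative.
DB = {
    "a": [("a1", 120, 0.7), ("a2", 80, 0.3)],
    "b": [("b1", 110, 0.6), ("b2", 90, 0.4)],
    "c": [("c1", 140, 0.5), ("c2", 110, 0.3), ("c3", 100, 0.2)],
    "d": [("d1", 10, 1.0)],
}

def xlogx(p):
    """p * log2(p), with the usual convention that 0 * log 0 = 0."""
    return p * math.log2(p) if p > 0 else 0.0

def g_prq(tuples, lo, hi):
    """Contribution of one x-tuple, computed in a single pass over its tuples."""
    total, qualifying_mass = 0.0, 0.0
    for _, price, e in tuples:
        if lo <= price <= hi:
            total += xlogx(e)       # this tuple appears in the result
            qualifying_mass += e    # accumulates P_k
    # Remaining mass: no tuple of this x-tuple appears in the result.
    return total + xlogx(1.0 - qualifying_mass)

def quality_xform(db, lo, hi):
    """PWS-quality as a sum of per-x-tuple contributions."""
    return sum(g_prq(tuples, lo, hi) for tuples in db.values())

def quality_brute_force(db, lo, hi):
    """PWS-quality by enumerating every possible world (exponential)."""
    dist = defaultdict(float)
    for world in product(*db.values()):
        prob = math.prod(e for _, _, e in world)
        result = frozenset(k for k, price, _ in world if lo <= price <= hi)
        dist[result] += prob
    return sum(xlogx(q) for q in dist.values())

if __name__ == "__main__":
    print(round(quality_xform(DB, 100, 110), 2))        # -2.46
    print(round(quality_brute_force(DB, 100, 110), 2))  # -2.46
```

Both routes give the same score on this database, while the x-form touches each tuple only once.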

Outline: Background, Related works, Data and query model, PWS-quality model, Cleaning procedure, Experimental results

Cleaning with limited budget
• With a limited budget, say 10 units, which x-tuples in the product table should we clean?
• Cleaning costs annotated on the table: 5 units, 7 units and 10 units

Example of cleaning
• After cleaning, the existential probability of the remaining tuple becomes 1
• The x-tuple contracts to a single tuple with a certain attribute value

Key  Product ID  Price ($)  Prob.
a1   a           120        0.7
a2   a           80         0.3
b1   b           ?          0.6
b2   b           90         0.4
c3   c           100        1
d1   d           10         1

Quality improvement
• Expected quality after cleaning
• The set of x-tuples that we are going to clean is represented by X = {τ_1, ···, τ_|X|}
• Quality improvement: the expected quality after cleaning minus the quality before cleaning
• However, calculating the quality improvement directly takes exponential time

Computation example
• Query 2 (max query): select the product with the highest price, on the product table extended with a qualification-probability (QP) column
• Suppose we decide to clean x-tuple c

Computation example (cont.)
• One possible case: c3 is the real-world value
• New PWS-quality: S(D', Q) = -1.17

Computation example (cont.)
• Another possible case: c2 is the real-world value
• New PWS-quality: S(D', Q) = -1.17

Computation example (cont.)
• Cleaning x-tuple c has 3 possible real-world outcomes
• Expected quality after cleaning x-tuple c = 0 × 0.5 + (-1.17) × 0.3 + (-1.17) × 0.2 ≈ -0.59
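The expected quality in this example can be recomputed by cleaning x-tuple c to each of its three possible values in turn and weighting the resulting PWS-quality by that value's existential probability. This is a brute-force sketch for illustration only: the prices of b1, c1 and c2 (110, 140, 110) are assumptions, base-2 logarithms are assumed, and the function names are hypothetical.

```python
import math
from itertools import product
from collections import defaultdict

# ASSUMPTION: prices of b1, c1 and c2 are illustrative, not from the slides.
DB = {
    "a": [("a1", 120, 0.7), ("a2", 80, 0.3)],
    "b": [("b1", 110, 0.6), ("b2", 90, 0.4)],
    "c": [("c1", 140, 0.5), ("c2", 110, 0.3), ("c3", 100, 0.2)],
    "d": [("d1", 10, 1.0)],
}

def pws_quality(db):
    """PWS-quality of the max query, by enumerating all possible worlds."""
    dist = defaultdict(float)
    for world in product(*db.values()):
        prob = math.prod(e for _, _, e in world)
        top = max(price for _, price, _ in world)
        dist[frozenset(k for k, price, _ in world if price == top)] += prob
    return sum(q * math.log2(q) for q in dist.values() if q > 0)

def expected_quality_after_cleaning(db, xtuple):
    """Average the quality over every value the cleaned x-tuple may turn out to have."""
    expected = 0.0
    for key, price, e in db[xtuple]:
        cleaned = dict(db)
        cleaned[xtuple] = [(key, price, 1.0)]  # the x-tuple becomes certain
        expected += e * pws_quality(cleaned)
    return expected

if __name__ == "__main__":
    before = pws_quality(DB)
    after = expected_quality_after_cleaning(DB, "c")
    print("quality before cleaning:", round(before, 2))
    print("expected quality after cleaning c:", round(after, 2))  # -0.59
    print("expected improvement:", round(after - before, 2))
```

If c1 turns out to be the real value, the answer is always {c1} and the quality is 0; the c2 and c3 cases each give -1.17, which yields the weighted sum shown on the slide.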

x-form quality improvement
• In x-form, the quality improvement becomes $I(X, D, Q) = -\sum_{\tau_k \in X} g(k, D, Q)$
• X is the set of x-tuples that we are going to clean
• Proof sketch: rewrite the original E(S(D', Q)) as a sum over the x-tuples in X (left part) plus a sum over the remaining x-tuples (right part); the left part equals 0, and the right part is unchanged by the cleaning

Optimal Data Cleaning Algorithm
• With the x-form quality improvement, we get the following objective: choose the set X that maximizes I(X, D, Q) subject to $\sum_{\tau_k \in X} c_k \le C$
• c_k: the cleaning cost of the k-th x-tuple
• C: total cleaning budget
• Z: total number of x-tuples with p_i in (0,1)
• This can be transformed into a 0/1 knapsack problem

DP algorithm
• Time complexity: O(CZ)
• Space complexity: O(CZ²)
• C: total budget; Z: number of x-tuples

Other heuristic methods
• Random
• MaxQP: select the x-tuples with the highest qualification probability
• Greedy: rank the x-tuples by expected quality improvement per unit of cleaning cost
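Since the selection problem reduces to 0/1 knapsack, a textbook dynamic program over (x-tuple, budget) pairs is one way to solve it. The sketch below is a generic knapsack DP, not necessarily the exact algorithm of the paper; it assumes integer cleaning costs, and the gains and the assignment of costs to x-tuples in the usage example are hypothetical.

```python
def select_xtuples(gains, costs, budget):
    """0/1 knapsack: pick x-tuples maximizing total gain with total cost <= budget.

    gains[k]  -- quality improvement from cleaning x-tuple k (non-negative)
    costs[k]  -- cleaning cost of x-tuple k (positive integer)
    budget    -- total cleaning budget C (non-negative integer)
    Returns (best total gain, sorted list of chosen indices).
    """
    n = len(gains)
    # best[k][b]: best gain using only the first k x-tuples with budget b.
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for k in range(1, n + 1):
        g, c = gains[k - 1], costs[k - 1]
        for b in range(budget + 1):
            best[k][b] = best[k - 1][b]  # skip x-tuple k-1
            if c <= b and best[k - 1][b - c] + g > best[k][b]:
                best[k][b] = best[k - 1][b - c] + g  # clean x-tuple k-1
    # Walk back through the table to recover the chosen set.
    chosen, b = [], budget
    for k in range(n, 0, -1):
        if best[k][b] != best[k - 1][b]:
            chosen.append(k - 1)
            b -= costs[k - 1]
    return best[n][budget], sorted(chosen)

if __name__ == "__main__":
    # Hypothetical gains for three x-tuples, with costs 5, 7 and 10 units.
    gains, costs = [0.0, 0.97, 1.49], [5, 7, 10]
    print(select_xtuples(gains, costs, 10))
    print(select_xtuples(gains, costs, 17))
```

The table has (Z + 1) × (C + 1) entries and each entry is filled in constant time, which matches the O(CZ) running time quoted on the slide.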
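The greedy heuristic can be sketched in a few lines: sort the x-tuples by improvement per unit cost and take them in that order while the budget allows. The function name and the example numbers are hypothetical, and unlike the DP this heuristic is not guaranteed to find the optimal set.

```python
def greedy_select(gains, costs, budget):
    """Pick x-tuples in decreasing order of gain per unit cost until the budget runs out."""
    order = sorted(range(len(gains)), key=lambda k: gains[k] / costs[k], reverse=True)
    chosen, spent = [], 0
    for k in order:
        # Skip x-tuples that bring no improvement or no longer fit the budget.
        if gains[k] > 0 and spent + costs[k] <= budget:
            chosen.append(k)
            spent += costs[k]
    return sorted(chosen)

if __name__ == "__main__":
    # Hypothetical gains for three x-tuples, with costs 5, 7 and 10 units.
    gains, costs = [0.0, 0.97, 1.49], [5, 7, 10]
    print(greedy_select(gains, costs, 10))
    print(greedy_select(gains, costs, 17))
```

Sorting dominates the cost, so the heuristic runs in O(Z log Z) regardless of the budget.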

Outline: Background, Related works, Data and query model, PWS-quality model, Cleaning procedure, Experimental results

Experiment setup

Size of DB           10K x-tuples, 100K tuples (synthetic); 4,999 x-tuples, 10,037 tuples (Netflix movie ratings)
Prob. distributions  Gaussian (variance = 100)
Cleaning cost        Uniform in [1, 10]
Resource budget      [20, 500], default = 30

Figure: PWS-quality (S) vs. database size (Z) (PRQ)

Figure: Quality evaluation performance (PRQ); x-axis: database size

Figure: Running time for clean-up selection (PMaxQ); x-axis: total budget

Figure: Quality improvement vs. budget (PRQ); x-axis: total budget, y-axis: quality improvement

Figure: Quality improvement vs. budget (PMaxQ); x-axis: total budget, y-axis: quality improvement

Figure: Quality improvement vs. budget (PRQ, real data); x-axis: total budget, y-axis: quality improvement

Thank you Q & A

Appendix: Deriving x-form of PRQ

Appendix: Deriving x-form of PMaxQ A number in [0, ]