Cleaning Uncertain Data for Top-k Queries Luyi Mo, Reynold Cheng, Xiang Li, David Cheung, Xuan Yang The University of Hong Kong {lymo, ckcheng, xli, dcheung,



Outline
• Introduction
• Quality Metric for Top-k Queries
  - Definition
  - Efficient computation
  - Results
• Cleaning for Top-k Queries
  - Definition
  - Solutions
  - Results
• Conclusion

Data Uncertainty
• Inherent in various applications
  - Location-based services (e.g., using GPS, RFID)
  - Natural habitat monitoring with sensor networks
  - Data integration

Uncertain Databases
• Model data uncertainty
  - e.g., tuple t has existential probability e
• Enable probabilistic queries
• Produce ambiguous query answers
  - e.g., tuple t has probability p of satisfying a query

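“Cleaning” of Uncertain Data
[Diagram: a query on the uncertain DB returns an ambiguous result. Cleaning, which costs resources ($$) and may fail, turns the uncertain DB into a LESS uncertain DB, on which the query returns a LESS ambiguous result.]
• A quality metric is needed to quantify the ambiguity of query results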

Example: Sensor Probing
• In natural habitat monitoring, sensors are used to track the external environment
• The system probes sensors to refresh stale data
• Probes may fail due to network reliability problems
• Battery and network resources should be optimized

Related Work: Cleaning Uncertain DB
• Cleaning for range/max queries [Cheng VLDB’08]
• Explore or exploit to disambiguate databases [Cheng VLDB’10]
  - Models different factors of cleaning operations
  - Considers no probabilistic model or query
• Probing from stream sources [Chen SSDBM’08]
  - Range query
• Improving integration quality by user feedback [Keulen VLDBJ’09]
• Analyzing sensitivity of answers to input data [Kanagal SIGMOD’11]
We consider uncertain data cleaning for probabilistic top-k queries.

Related Work: Top-k Queries
• Various query semantics
  - U-Topk, U-kRanks [Soliman 07]
  - PT-k [Hua 08]
  - Global-topk [Zhang 08]
  - Expected Rank [Cormode 09]
  - ...
• Efficient evaluation [Bernecker 10, Yi 08, Li 09, Lian 08]
Cleaning for top-k queries is challenging.

Our Contributions
• Measure quality of query answer for three top-k queries
  - Adopt PWS-quality
  - Develop efficient computation for quality score
• Clean uncertain data for top-k queries
  - Model cost, budget, cleaning successfulness
  - Propose cleaning algorithms to attain the highest expected improvement in PWS-quality

Probabilistic Data Model (x-tuple model)

Sensor ID (x-tuple)   Key (tuple t_i)   Temp. (°C) (querying attribute v_i)   Prob. (existential probability e_i)
S1                    t0, t1
S2                    t2, t3
S3                    t4, t5
S4                    t6                26                                    1

• Each sensor is an x-tuple; t_i denotes the i-th tuple
• v_i is the querying attribute of t_i and e_i is its existential probability
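To make the x-tuple model concrete, here is a minimal Python sketch. It assumes, as is usual for this model, that the tuples of one x-tuple are mutually exclusive and that different x-tuples are independent. The class name `UncertainTuple`, the function `possible_worlds`, and every reading except t6 = (26 °C, probability 1) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class UncertainTuple:
    key: str      # tuple identifier t_i
    value: float  # querying attribute v_i (temperature)
    prob: float   # existential probability e_i


# x-tuple -> mutually exclusive alternatives.
# Only t6 = (26, 1.0) comes from the slide; the rest is made up.
DB = {
    "S1": [UncertainTuple("t0", 21, 0.6), UncertainTuple("t1", 32, 0.4)],
    "S2": [UncertainTuple("t2", 30, 0.7), UncertainTuple("t3", 22, 0.3)],
    "S3": [UncertainTuple("t4", 25, 0.4), UncertainTuple("t5", 27, 0.6)],
    "S4": [UncertainTuple("t6", 26, 1.0)],
}


def possible_worlds(db):
    """Yield (world, probability); a world keeps at most one tuple per x-tuple."""
    per_xtuple = []
    for alternatives in db.values():
        total = sum(t.prob for t in alternatives)
        assert total <= 1 + 1e-9, "probabilities of an x-tuple must not exceed 1"
        choices = [(t, t.prob) for t in alternatives]
        if total < 1 - 1e-9:
            choices.append((None, 1 - total))  # the x-tuple yields no tuple
        per_xtuple.append(choices)
    for combo in product(*per_xtuple):
        world = [t for t, _ in combo if t is not None]
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield world, prob


worlds = list(possible_worlds(DB))
total_prob = sum(p for _, p in worlds)
```

The number of worlds is the product of the number of choices per x-tuple, which is why the later slides avoid enumerating them.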

Probabilistic Top-k Queries
Answers on the example database for k = 2:
• U-kRanks: (t2, t5)
• PT-k (probability threshold top-k), threshold = 0.4: (t1, t2, t5)
• Global-topk: (t2, t5)
[Table: rank probability information (k = 2), giving for each of t0 to t6 its probability at each rank and its top-k probability]
• No prior work addresses how to measure the quality of these query answers
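The three semantics can all be read off the rank probability information. The sketch below uses the commonly cited definitions: U-kRanks returns, for each rank, the tuple most likely to hold it; PT-k returns tuples whose top-k probability reaches the threshold; Global-topk returns the k tuples with the highest top-k probability. The numbers in `RANK_PROB` are hypothetical, not the values from the slide, and were picked so that the answers coincide with the ones listed above.

```python
# Hypothetical rank probability information for k = 2:
# RANK_PROB[key][h] is the probability that the tuple is ranked (h+1)-th.
RANK_PROB = {
    "t0": [0.0, 0.0],
    "t1": [0.4, 0.0],
    "t2": [0.42, 0.28],
    "t3": [0.0, 0.0],
    "t4": [0.0, 0.054],
    "t5": [0.126, 0.378],
    "t6": [0.054, 0.288],
}


def topk_prob(rank_prob):
    """Top-k probability of a tuple = sum of its rank probabilities."""
    return {key: sum(ps) for key, ps in rank_prob.items()}


def u_kranks(rank_prob, k):
    """For each rank h, the tuple with the highest probability of being at rank h."""
    return [max(rank_prob, key=lambda t: rank_prob[t][h]) for h in range(k)]


def pt_k(rank_prob, threshold):
    """All tuples whose top-k probability is at least the threshold."""
    p = topk_prob(rank_prob)
    return [t for t in rank_prob if p[t] >= threshold]


def global_topk(rank_prob, k):
    """The k tuples with the highest top-k probability."""
    p = topk_prob(rank_prob)
    return sorted(p, key=p.get, reverse=True)[:k]


answers = (u_kranks(RANK_PROB, 2), pt_k(RANK_PROB, 0.4), global_topk(RANK_PROB, 2))
```

Because all three answers depend only on rank probabilities, one pass that computes them can serve every query type.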

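Probabilistic Top-k Queries
[Diagram: under the possible world semantics, the database gives rise to possible-world results (pw-results) and to rank probability information; one probability, 0.28, is legible in the figure]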

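The Possible World Semantics Quality (PWS-Quality) [Cheng VLDB’08]
• PWS-quality is the negated entropy of the distribution of pw-results: $\text{PWS-quality} = \sum_{j} \Pr(r_j) \log_2 \Pr(r_j)$, where $r_j$ ranges over the distinct pw-results
• Expensive to compute!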
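A brute-force sketch shows why this is expensive: it enumerates every possible world, groups worlds by their result, and takes the negated entropy. It assumes that the pw-result of a world is the ordered list of its k highest-valued tuples and that entropy uses log base 2. The names `pw_results` and `pws_quality` and all readings except t6 = (26, 1.0) are hypothetical.

```python
from collections import defaultdict
from itertools import product
from math import log2

# x-tuple -> list of (key, value, existential probability); values are made up
DB = {
    "S1": [("t0", 21, 0.6), ("t1", 32, 0.4)],
    "S2": [("t2", 30, 0.7), ("t3", 22, 0.3)],
    "S3": [("t4", 25, 0.4), ("t5", 27, 0.6)],
    "S4": [("t6", 26, 1.0)],
}


def pw_results(db, k):
    """Distribution over pw-results, obtained by enumerating all possible worlds."""
    choices = []
    for alternatives in db.values():
        options = list(alternatives)
        missing = 1.0 - sum(e for _, _, e in alternatives)
        if missing > 1e-12:
            options.append((None, None, missing))  # x-tuple yields no tuple
        choices.append(options)
    dist = defaultdict(float)
    for world in product(*choices):
        prob = 1.0
        for _, _, e in world:
            prob *= e
        present = sorted(((v, key) for key, v, _ in world if key is not None),
                         reverse=True)
        result = tuple(key for _, key in present[:k])  # top-k of this world
        dist[result] += prob
    return dict(dist)


def pws_quality(dist):
    """Negated entropy: 0 for a certain answer, more negative when ambiguous."""
    return sum(p * log2(p) for p in dist.values() if p > 0)


quality = pws_quality(pw_results(DB, k=2))
```

A database with a single certain answer scores 0, and two equally likely answers score -1, so higher (closer to 0) means less ambiguous.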

PWR: Derives PW-Results Directly
• No. of distinct pw-results is bounded by n^k (n is the database size)
• Advantage: reduces complexity
• Not efficient enough if the number of pw-results is large!

TP: Computation based on Rank Prob.
• PSR [Bernecker, TKDE10]
  - An efficient solution framework for top-k query evaluation
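The flavor of such a rank-probability computation can be sketched with a simple dynamic program. This is not PSR itself: it assumes independent tuples (no x-tuple exclusivity), which keeps the recurrence short. The function name `rank_probabilities` is hypothetical. Tuples are visited in descending score order while tracking the probability that exactly j higher-ranked tuples exist, giving O(nk) time.

```python
def rank_probabilities(probs, k):
    """Rank probabilities for independent tuples.

    probs: existential probabilities, sorted by descending score.
    Returns rho with rho[i][h] = P(tuple i is ranked (h+1)-th), for h < k.
    """
    # count[j] = P(exactly j of the tuples seen so far exist), for j < k
    count = [1.0] + [0.0] * (k - 1)
    rho = []
    for e in probs:
        # tuple is (h+1)-th iff it exists and exactly h higher-ranked tuples exist
        rho.append([e * count[h] for h in range(k)])
        # fold this tuple into the count distribution (high j first, so that
        # count[j - 1] still holds the value from before this tuple)
        for j in range(k - 1, 0, -1):
            count[j] = count[j] * (1 - e) + count[j - 1] * e
        count[0] *= 1 - e
    return rho


rho = rank_probabilities([0.5, 0.5], k=2)
topk = [sum(row) for row in rho]  # top-k probability of each tuple
```

The top-k probability of a tuple is the sum of its rank probabilities, so the same table serves both query answering and quality evaluation.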

TP: Tuple Form of PWS-Quality
• PWS-quality can be expressed by the existential probabilities and top-k probabilities of tuples, together with some function of the existential probabilities of tuples in D

TP: Sharing of Computation Effort
• Steps of TP:
  - O(nk) for PSR [Bernecker, TKDE10] to compute the rank probability information of all tuples
  - O(n) for an incremental method to compute the function of existential probabilities for all tuples
• Rank probability information can be shared by query and quality evaluation!

Experiment Setup
Size of DB:           5 K x-tuples, 50 K tuples (synthetic); 4,999 x-tuples, 10,037 tuples (Netflix movie ratings)
Prob. distributions:  Gaussian (variance = 100); mean of each x-tuple uniform in [0, 10000]
Top-k queries:        k = 15; threshold for PT-k (value not preserved)
• By default, results are shown on synthetic data.

Quality Score vs. k
[Figure]

Evaluation Time
[Figure]

TP: Effect of Sharing (1)
[Figure: Query+Quality Time vs. k; annotation: 48%]
• Top-k query: PT-k
• Non-sharing: rank probability information is recomputed when computing the quality score

TP: Effect of Sharing (2)
[Figure: PT-k Time vs. Quality Time (with sharing)]

Results on Real Data
[Figures: Quality Score vs. k; PT-k Time vs. Quality Time (with sharing)]
• Similar to results on synthetic data


Example
[Table: sensor readings for S1 (t0, t1), S2 (t2, t3), S3 (t4, t5), S4 (t6), with columns Sensor ID, Key, Temp. (°C), Prob., Sc-prob.]
• Cost: cleaning may require resources ($11, $3, $9, $1)
• Limited budget: a budget (e.g., $12) restricts the number of cleaning actions
• Successfulness: a cleaning action has a successful cleaning probability (sc-prob)
• Cleaning plan: which x-tuples should be cleaned? How many times should the cleaning actions be performed?
• Objective: optimize the quality improvement after cleaning

Cleaning Model
• D: uncertain database, a set of x-tuples
• τ_l: the l-th x-tuple
• c_l: cost of cleaning τ_l once
• p_l: probability that a cleaning action on τ_l succeeds (sc-probability)
• B: cleaning budget
• (X, M): cleaning plan that cleans each x-tuple τ_l in X for M_l times
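A minimal sketch of this model follows. It assumes that repeated cleaning attempts on the same x-tuple succeed independently, so that M_l attempts succeed at least once with probability 1 - (1 - p_l)^{M_l}; that reading is an assumption, not something stated on the slide. The sc-probabilities and the $12 budget are taken from the example slides, while the assignment of the costs $11, $3, $9, $1 to S1 through S4 in that order is a guess. The names `XTupleCleaning`, `plan_cost`, `is_feasible`, and `success_prob` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class XTupleCleaning:
    cost: int       # c_l: cost of one cleaning action on the x-tuple
    sc_prob: float  # p_l: probability that one cleaning action succeeds


# Cost-to-sensor assignment is assumed; sc-probabilities follow the example.
PARAMS = {
    "S1": XTupleCleaning(cost=11, sc_prob=0.8),
    "S2": XTupleCleaning(cost=3, sc_prob=0.3),
    "S3": XTupleCleaning(cost=9, sc_prob=0.7),
    "S4": XTupleCleaning(cost=1, sc_prob=0.6),
}


def plan_cost(plan, params):
    """Total cost of a plan given as {x-tuple: M_l}."""
    return sum(params[x].cost * times for x, times in plan.items())


def is_feasible(plan, params, budget):
    """A plan is feasible if its total cost stays within the budget B."""
    return plan_cost(plan, params) <= budget


def success_prob(sc_prob, times):
    """P(at least one of `times` independent cleaning attempts succeeds)."""
    return 1 - (1 - sc_prob) ** times


plan = {"S2": 2, "S4": 3}        # clean S2 twice and S4 three times
cost = plan_cost(plan, PARAMS)   # 3*2 + 1*3
ok = is_feasible(plan, PARAMS, budget=12)
```

Repeating a cheap, unreliable cleaning action can be preferable to a single expensive one, which is why the plan records a count per x-tuple rather than a yes/no choice.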

An Optimization Problem
• I(X,M): expected quality improvement of cleaning plan (X,M)
• Goal: find the plan (X,M) that maximizes I(X,M) subject to the budget constraint
• Challenges:
  - Computation of I(X,M) is nontrivial
  - The number of possible cleaning plans may be exponential
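Using the notation of the cleaning model (c_l, M_l, B), the problem can plausibly be written as below. The exact formula on the slide is not available here, so the form of the budget constraint is an assumption derived from the definitions of cost and budget.

```latex
\max_{(X,M)} \; I(X,M)
\quad \text{subject to} \quad
\sum_{\tau_l \in X} c_l \, M_l \;\le\; B,
\qquad M_l \in \{1, 2, \dots\} \ \text{for each } \tau_l \in X .
```

Each x-tuple in X contributes its per-action cost once for every time it is cleaned, which is what makes the problem a variant of knapsack.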

Expected Quality Improvement
[Table: sensors S1 to S4 with sc-probabilities 0.8, 0.3, 0.7, 0.6 and columns Key, Temp. (°C), Prob., Top-k Prob.]
• Given a cleaning plan: clean S3 once
  - Cleaning on S3 is successful (probability 0.7): the PWS-quality depends on which tuple of S3 turns out to be the true reading
  - Cleaning on S3 fails (probability 1 - 0.7): the PWS-quality is unchanged
• Expected quality of cleaning x-tuple S3 = 0.7 * (0.4 * […] + […] * -1.85) + (1 - 0.7) * […]
• No. of possible cleaned results is exponential!

Efficient Expected Quality Improvement Evaluation
• Given a cleaning plan (X,M) and the tuple form of PWS-quality, the expected quality improvement can be computed in time linear in |X|

Cleaning Algorithms
• Optimal solution:
  - Variant of the knapsack problem
  - DP (dynamic programming)
• Heuristics:
  - RandU (x-tuples have equal probability of being cleaned)
  - RandP (x-tuples with higher top-k probability also have higher probability of being cleaned)
  - Greedy (select the x-tuples with the largest marginal expected quality improvement to clean)
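The DP and Greedy ideas can be sketched under simplifying assumptions that are not taken from the slides: costs are integers, and the expected improvement decomposes into a sum over x-tuples, each contributing gain_l * (1 - (1 - p_l)^{M_l}), where gain_l is a hypothetical quality gain realized if cleaning τ_l succeeds at least once. All names and the numbers in `ITEMS` are illustrative.

```python
# x-tuple -> (cost c_l, sc-probability p_l, hypothetical gain if cleaned)
ITEMS = {
    "S1": (11, 0.8, 0.9),
    "S2": (3, 0.3, 0.5),
    "S3": (9, 0.7, 0.9),
    "S4": (1, 0.6, 0.0),  # already certain, so nothing to gain
}


def expected_improvement(plan, items):
    """Assumed decomposable form of I(X, M) for a plan {x-tuple: M_l}."""
    return sum(gain * (1 - (1 - p) ** plan.get(name, 0))
               for name, (_, p, gain) in items.items())


def dp_plan(items, budget):
    """Knapsack-style DP: best[b] = (value, plan) with total cost at most b."""
    best = [(0.0, {})] * (budget + 1)
    for name, (cost, p, gain) in items.items():
        new_best = list(best)
        for b in range(budget + 1):
            for m in range(1, b // cost + 1):  # clean this x-tuple m times
                prev_value, prev_plan = best[b - m * cost]
                value = prev_value + gain * (1 - (1 - p) ** m)
                if value > new_best[b][0]:
                    new_best[b] = (value, {**prev_plan, name: m})
        best = new_best
    return best[budget]


def greedy_plan(items, budget):
    """Repeatedly add the affordable action with the largest marginal gain."""
    plan, remaining = {}, budget
    while True:
        choice, choice_gain = None, 0.0
        for name, (cost, p, gain) in items.items():
            if cost > remaining:
                continue
            m = plan.get(name, 0)
            marginal = gain * (1 - p) ** m * p  # gain from the (m+1)-th attempt
            if marginal > choice_gain:
                choice, choice_gain = name, marginal
        if choice is None:
            return plan
        plan[choice] = plan.get(choice, 0) + 1
        remaining -= items[choice][0]


dp_value, dp_best = dp_plan(ITEMS, budget=12)
greedy_best = greedy_plan(ITEMS, budget=12)
greedy_value = expected_improvement(greedy_best, ITEMS)
```

On this toy input the greedy heuristic spends most of the budget on S1 and ends with a lower value than the DP plan, which combines S2 and S3. A variant that ranks actions by marginal gain per unit cost is an equally reasonable reading of the heuristic.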

Experiment Setup
Cleaning cost:        Uniform in [1,10]
Sc-probability:       Uniform in [0,1]
Resource budget:      100
Size of DB:           5 K x-tuples, 50 K tuples (synthetic); 4,999 x-tuples, 10,037 tuples (Netflix movie ratings)
Prob. distributions:  Gaussian (variance = 100)
Top-k queries:        k = 15; threshold for PT-k (value not preserved)
• Results are shown on synthetic data.

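Effectiveness of Cleaning Algorithms
[Figure: Improvement vs. Budget; axes: I(X,M) against Budget]

Effect of Avg. sc-probability
[Figure: I(X,M) against average sc-probability]

Efficiency on Budget
[Figure: running time against Budget]

Efficiency on k
[Figure: running time against k]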

Conclusion
• Efficient computation of PWS-quality for probabilistic top-k queries
• Cleaning probabilistic databases under a limited budget
  - Model cleaning operations
  - Develop optimal and efficient cleaning algorithms for top-k queries
• Future work
  - Study other probabilistic data models
  - Support other top-k queries, skyline queries, etc.

Thank you!
Contact Info: Luyi Mo, University of Hong Kong

References
• [Soliman 07] M. A. Soliman, I. F. Ilyas, and K. C.-C. Chang, “Top-k query processing in uncertain databases,” in ICDE, 2007
• [Hua 08] M. Hua, J. Pei, W. Zhang, and X. Lin, “Ranking queries on uncertain data: a probabilistic threshold approach,” in SIGMOD, 2008
• [Yi 08] K. Yi, F. Li, G. Kollios, and D. Srivastava, “Efficient processing of top-k queries in uncertain databases with x-relations,” TKDE, 2008
• [Zhang 08] X. Zhang and J. Chomicki, “On the semantics and evaluation of top-k queries in probabilistic databases,” in ICDE Workshop, 2008
• [Cormode 09] G. Cormode, F. Li, and K. Yi, “Semantics of ranking queries for probabilistic data and expected ranks,” in ICDE, 2009
• [Bernecker 10] T. Bernecker, H. Kriegel, N. Mamoulis, M. Renz, and A. Zuefle, “Scalable probabilistic similarity ranking in uncertain databases,” TKDE, 2010
• [Cheng 08] R. Cheng, J. Chen, and X. Xie, “Cleaning uncertain data with quality guarantees,” 2008
• [Li 09] J. Li, B. Saha, and A. Deshpande, “A unified approach to ranking in probabilistic databases,” 2009
• [Lian 08] X. Lian and L. Chen, “Probabilistic ranked queries in uncertain databases,” in EDBT, 2008
• [Keulen 09] M. van Keulen and A. de Keijzer, “Qualitative effects of knowledge rules and user feedback in probabilistic data integration,” The VLDB Journal, 2009
• [Kanagal 11] B. Kanagal, J. Li, and A. Deshpande, “Sensitivity analysis and explanations for robust query evaluation in probabilistic databases,” in SIGMOD, 2011
• [Cheng 10] R. Cheng, E. Lo, X. S. Yang, M.-H. Luk, X. Li, and X. Xie, “Explore or exploit? Effective strategies for disambiguating large databases,” 2010
• [Chen 08] J. Chen and R. Cheng, “Quality-aware probing of uncertain data with resource constraints,” in SSDBM, 2008
• [Cheng04] R. Cheng, Y. Xia, S. Prabhakar, R. Shah, and J. S. Vitter, “Efficient indexing methods for probabilistic threshold queries over uncertain data,” in VLDB
• [Tao05] Y. Tao, R. Cheng, X. Xiao, W. K. Ngai, B. Kao, and S. Prabhakar, “Indexing multi-dimensional uncertain data with arbitrary probability density functions,” in VLDB

Related Works
Data Models
• Independent tuple/attribute uncertainty [Barbara92]
• x-tuple (ULDB) [Benjelloun06]
• Graphical model [Sen07]
• Categorical uncertain data [Singh07]
• World-set descriptor sets [Antova08]
Query Evaluation
• Probabilistic query classification [Cheng 03]
• Efficiency of query evaluation [Dalvi04]
• Range queries [Cheng04, Tao05, Cheng07]
• MIN/MAX [Cheng03, Deshpande04]
• Top-k query evaluation [Soliman07, Re07, Yi08, Bernecker 10, Li 09, Lian 08]

Related Works
Quality metric for uncertain DB
• Result probability > threshold [Cheng04, Deshpande04]
• PWS-quality (Possible World Semantics Quality) [Cheng 08]
• Number of alternatives (non-prob. DB) [Cheng 10]

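Example: PT-k
[Table: example sensor readings for S1 (t0, t1), S2 (t2, t3), S3 (t4, t5), S4 (t6: 26 °C, probability 1), with columns Sensor ID, Key, Temp. (°C), Prob.]
• Query: return the sensors that have at least a 40% probability of yielding one of the 2 highest temperatures
• PT-k with k = 2, T = 0.4
[Table: PW-Results, with columns Result and Prob.]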

Example: cleaning objective
[Table: example sensor readings for S1 to S4, with columns Sensor ID, Key, Temp. (°C), Prob.]
• Query: return the sensors that yield the 2 highest temperatures
• The database may be cleaned by probing a sensor to obtain its latest reading
• Suppose we clean sensor S3
• PWS-quality values shown: -1.85 and -2.55

Example: PT-k
[Two tables of pw-results with columns Result and Prob.; PWS-quality = -1.85 for one and PWS-quality = -2.55 for the other]

The Possible World Semantics Quality (PWS-Quality) [Cheng 08]
• PWS-quality is defined through the entropy of the pw-results
• Expensive to compute!
[Diagram: PWS-quality of the database before and after some uncertainty of the DB is removed]

PWR: PW-Results Derivation and Probability Computation
• Derivation: O(n^k)
  - Enumerate all combinations with exactly k tuples
  - When tuples are pre-sorted, pruning techniques can be applied
• Probability computation: O(n)
  - Given a pw-result, the tuples in the pw-result must exist, and the higher-scored tuples that are not in the pw-result must not exist
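The probability computation for one candidate pw-result can be sketched as a single pass over the sorted tuples. The sketch assumes the x-tuple model (alternatives of one x-tuple are mutually exclusive, x-tuples are independent) and treats a pw-result as a set of exactly k tuples. The function name `pw_result_prob` and all probabilities in `TUPLES` are hypothetical.

```python
from itertools import combinations

# (key, x-tuple, existential probability), sorted by descending score.
# The probabilities are made up for illustration.
TUPLES = [
    ("t1", "S1", 0.4), ("t2", "S2", 0.7), ("t5", "S3", 0.6), ("t6", "S4", 1.0),
    ("t4", "S3", 0.4), ("t3", "S2", 0.3), ("t0", "S1", 0.6),
]


def pw_result_prob(tuples, result):
    """P(the top-k of a possible world is exactly the set `result`)."""
    position = {key: i for i, (key, _, _) in enumerate(tuples)}
    last = max(position[key] for key in result)  # lowest-ranked result tuple
    chosen = {}   # x-tuple -> probability of its tuple that is in the result
    blocked = {}  # x-tuple -> probability mass of higher-scored tuples left out
    for key, xt, e in tuples[: last + 1]:
        if key in result:
            if xt in chosen:
                return 0.0  # two alternatives of one x-tuple cannot coexist
            chosen[xt] = e
        else:
            blocked[xt] = blocked.get(xt, 0.0) + e
    prob = 1.0
    for e in chosen.values():
        prob *= e  # tuples in the result must exist
    for xt, mass in blocked.items():
        if xt not in chosen:  # exclusivity already rules these out otherwise
            prob *= 1.0 - mass  # higher-scored tuples must not exist
    return prob


keys = [key for key, _, _ in TUPLES]
total = sum(pw_result_prob(TUPLES, set(c)) for c in combinations(keys, 2))
```

Since every x-tuple in this toy database has total probability 1, each world contains at least two tuples, and the probabilities over all 2-tuple candidates sum to 1.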

TP: Tuple Form of PWS-Quality
• PWS-quality can be expressed by the existential probabilities and top-k probabilities of tuples, together with, for each tuple, some function of the existential probabilities of the tuples that are in the same x-tuple and ranked higher

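TP: Example
[Figure: tuples are processed in the order t1, t2, t5, t6, t4, t3, t0, with an early stop before the end of the list; the resulting quality score is not legible]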

Results on Real Data
[Figure: Quality Score vs. k]

Results on Real Data
[Figure: Quality and Query Evaluation Time with Sharing]

Results on Real Data
[Figure]

Comparison with PW
[Figure]

Effect of sc-pdf (Cleaning Algorithms)
[Figure]

Effect of Avg. sc-probability (Cleaning Algorithms)
[Figure]

Efficiency on k (Cleaning Algorithms)
[Figure]