IEEE ICDE 2008 Probabilistic Verifiers: Evaluating Constrained Nearest-Neighbor Queries over Uncertain Data Reynold Cheng Hong Kong Polytechnic University.


IEEE ICDE 2008
Probabilistic Verifiers: Evaluating Constrained Nearest-Neighbor Queries over Uncertain Data
Reynold Cheng, Hong Kong Polytechnic University (csckcheng@comp.polyu.edu.hk, http://www.comp.polyu.edu.hk/~csckcheng)
Jinchuan Chen, Hong Kong Polytechnic University (csjcchen@comp.polyu.edu.hk)
Mohamed Mokbel and Chi-Yin Chow, The University of Minnesota-Twin Cities ({mokbel,cchow}@cs.umn.edu)

Location and Sensor Applications
Example queries: "Find a cab closest to my current location." "What is the region that gives max temperature?"
[Figure: GPS, RF-ID, sensor network, service provider]
Positioning technologies such as GPS, GSM, RF-ID and WiFi have developed rapidly in recent years. They allow the locations of users to be determined and enable a new class of applications known as location-based services. Examples:
A moving-object database monitors the locations of mobile devices.
An air-conditioning system uses temperature sensors to adjust the temperature of each room.
Sensors are used to detect whether hazardous materials are present and how they are spreading.
Probabilistic Verifiers Cheng, Chen, Mokbel, Chow

Data Uncertainty
Measurement error [TDRP98, ISSD99]
Sampling error [TDRP98, ISSD99]
Network latency [TKDE04]
Uncertainty manually injected by users to protect location privacy [PET06, VLDB06]

Attribute Uncertainty Model [TDRP98, ISSD99, VLDB04b]
[Figure: an uncertainty region and its uncertainty pdf]
We represent an uncertainty pdf as a histogram.

Probabilistic Nearest Neighbor Query (PNN) [TKDE04]
INPUT
A query point q
A set of n objects X1, X2, …, Xn with uncertainty regions and pdfs
OUTPUT
A set of (Xi, pi) tuples, where pi is the non-zero probability (qualification probability) that Xi is the nearest neighbor of q
Prior work addresses range queries over a limited model of uncertainty (normal distribution and straight-line movement). Here we illustrate a more difficult query, in which objects interact to produce the result (nearest neighbor).

Basic Solution [TKDE04]
di(r): distance pdf of Xi from q
Di(r): distance cdf of Xi from q
ni: smallest distance of Xi from q
f: shortest maximum distance of all objects from q
[Figure: query point q, objects X1 to X6, near distance n1 and bound f]
Once these functions and values are known for the particular uncertainty model (line uncertainty or free-moving uncertainty), we can plug them into this algorithm and compute the probabilities. As an illustration, we briefly describe how these four parameters are derived for free-moving uncertainty.
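To make the four quantities concrete, here is a minimal numerical sketch of the integral used in [TKDE04], pi = ∫ from ni to f of di(r) · Π over k≠i of (1 − Dk(r)) dr. It assumes, purely for illustration, that each distance pdf is uniform between a near and a far distance; the function names and the midpoint-rule integration are not from the slides.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (near, far) distance of one object from q

def cdf_uniform(r: float, iv: Interval) -> float:
    """Distance cdf D(r), assuming a uniform distance pdf on [near, far]."""
    near, far = iv
    if r <= near:
        return 0.0
    if r >= far:
        return 1.0
    return (r - near) / (far - near)

def pdf_uniform(r: float, iv: Interval) -> float:
    """Distance pdf d(r) under the same uniform assumption."""
    near, far = iv
    return 1.0 / (far - near) if near <= r <= far else 0.0

def qualification_probs(objs: List[Interval], steps: int = 2000) -> List[float]:
    """Midpoint-rule estimate of p_i for every object."""
    f = min(far for _, far in objs)  # shortest maximum distance
    probs = []
    for i, iv in enumerate(objs):
        n_i = iv[0]
        if n_i >= f:                 # X_i can never be the nearest neighbor
            probs.append(0.0)
            continue
        h = (f - n_i) / steps
        total = 0.0
        for s in range(steps):
            r = n_i + (s + 0.5) * h
            others_farther = 1.0     # probability that all other objects lie beyond r
            for k, other in enumerate(objs):
                if k != i:
                    others_farther *= 1.0 - cdf_uniform(r, other)
            total += pdf_uniform(r, iv) * others_farther * h
        probs.append(total)
    return probs
```

Each object requires an integration that touches every other candidate, which is the cost the verifiers described later try to avoid.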

2 Assumptions
A user only needs answers with confidence higher than some threshold.
Approximation of qualification probabilities is allowed.

Constrained Probabilistic Nearest-Neighbor Query (C-PNN)
Denote
pi.l: lower bound of pi
pi.u: upper bound of pi
P: probability threshold
∆: tolerance
Given (P, ∆), return the set {Xi} such that pi.u ≥ P, and either pi.l ≥ P or pi.u − pi.l ≤ ∆.
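The C-PNN condition can be written as a three-way classifier. The sketch below is one possible reading: the labels "satisfy", "fail" and "unknown" follow the wording used in the experiment slides, and treating pi.u < P as "fail" is an assumption consistent with that wording. The function name `classify` is hypothetical.

```python
def classify(p_l: float, p_u: float, P: float, delta: float) -> str:
    """Label an object from its probability bound [p_l, p_u]."""
    if p_u < P:
        return "fail"      # even the upper bound is below the threshold
    if p_l >= P or (p_u - p_l) <= delta:
        return "satisfy"   # bound is above P, or tight enough to accept
    return "unknown"       # bound must be tightened further
```

With P = 0.8 and ∆ = 0.15, a bound of [0.70, 0.84] is accepted because its width is within the tolerance, while [0.50, 0.90] stays unknown.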

Illustrating C-PNN (with P=0.8, ∆=0.15)
[Figure: bounds pi.l and pi.u compared with the threshold P = 0.8; label: "To be refined"]

Intuition
If [pi.l, pi.u] is known, whether Xi satisfies C-PNN can be decided without knowing pi.
[Figure labels: p1.l ≥ 0.3, p3.u ≤ 1 − 0.3]
Compute [pi.l, pi.u] for any distance pdf.

Solution Framework

Probabilistic Verifiers
Each verifier tests whether Xi satisfies or fails the query.
Verifiers are applied in ascending order of computational complexity.

Example: P=0.5, Δ=0.15
[Figure: candidates A, B and C (after filtering) pass through the verifiers and the classifier; objects that cannot be classified go to incremental refinement]

Partitioning uncertainty pdfs into subregions
We first explain the subregions in general. Next, we show how to generate them from the query point and the uncertain data objects.

End-Points
[Figure: end-points e1 to e6 delimiting subregions S1 to S5, with bound f]
The end-points include:
All near points
The minimum and maximum of the far points
The points at which a distance pdf changes
The subregions are then formed by adjacent pairs of end-points.
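The construction of subregions from end-points can be sketched as follows. This is an illustrative reading of the three rules above; the function name and the input lists are hypothetical, and the sample values are chosen only so that six end-points and five subregions result, as in the figure.

```python
from typing import Iterable, List, Tuple

def build_subregions(near_points: Iterable[float],
                     far_points: Iterable[float],
                     change_points: Iterable[float] = ()) -> List[Tuple[float, float]]:
    """Collect end-points and pair adjacent ones into subregions."""
    far = list(far_points)
    ends = set(near_points)            # all near points
    ends.update((min(far), max(far)))  # minimum and maximum of the far points
    ends.update(change_points)         # points where a distance pdf changes
    e = sorted(ends)
    return list(zip(e, e[1:]))         # S_j = [e_j, e_{j+1}]

# Three objects with near points 1, 2, 4, far points 5, 6, 7,
# and one pdf change point at distance 3.
subregions = build_subregions([1, 2, 4], [5, 6, 7], [3])
```

Here the far point 6 is not an end-point, because only the smallest and largest far points are used.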

Subregion Data Structure

Rightmost-Subregion (RS) Verifier
X3 has no chance to be the nearest neighbor when R2 > f2.
p3 ≤ 1 − 0.3 = 0.7
p1 ≤ 1 − 0.2 = 0.8

RS Verifier
p3 ≤ 0.7
p1 ≤ 0.8
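The arithmetic behind these bounds can be sketched in a few lines. The sketch assumes that 0.3 and 0.2 are the probability masses of X3 and X1 in the rightmost subregion (beyond f), where an object cannot be the nearest neighbor; that reading, the function name and the dictionary input are assumptions made for illustration.

```python
from typing import Dict

def rs_upper_bounds(mass_beyond_f: Dict[str, float]) -> Dict[str, float]:
    """Upper bound on each p_i: one minus the mass in the rightmost subregion."""
    return {name: 1.0 - mass for name, mass in mass_beyond_f.items()}

bounds = rs_upper_bounds({"X1": 0.2, "X3": 0.3})
```

One subtraction per candidate is consistent with the O(|C|) cost listed for RS.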

L-SR and U-SR Verifiers
[Formula labels: number of objects in subregion Sj; qualification probability of Xi in subregion Sj]

L-SR and U-SR Verifiers
There are many combinations of values for the other objects, e.g. R2 and R3. The value of q_ij becomes more precise as more combinations are included in the calculation, but it is not possible to consider all of them. We bound q_ij by choosing a small set of combinations that are easy to handle and contribute a large part of q_ij.
Example for subregion S3 = [e3, e4]:
q13 = 1 if both R2 and R3 are larger than e4
q13 = 0 if either R2 or R3 is smaller than e3
q13 = 1/3 if both R2 and R3 are inside S3
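The value 1/3 in the last case follows from symmetry: if three distances fall in the same subregion and are independent and identically distributed there, each is equally likely to be the smallest. The simulation below checks this under the assumption of a uniform distribution inside the subregion; the function name and parameters are illustrative only.

```python
import random

def estimate_q(n_objects: int = 3, trials: int = 200_000, seed: int = 0) -> float:
    """Estimate the chance that object 1 is nearest when all distances share a subregion."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        # Positions inside the subregion, rescaled to [0, 1).
        distances = [rng.random() for _ in range(n_objects)]
        if distances[0] == min(distances):
            wins += 1
    return wins / trials
```

With c objects in a subregion the same argument gives 1/c, which is why the number of objects in Sj appears in the bound.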

Complexity of Verifiers
Algorithm   Qualification prob. bound   Cost
RS          Upper                       O(|C|)
L-SR        Lower                       O(|C|M)
U-SR        Upper
|C| = no. of candidates with non-zero probability
M = no. of subregions

Incremental Refinement
We can decompose the refinement of a qualification probability into a series of probability calculations inside subregions. Once we have computed the probability qij for subregion Sj, we collapse [qij.l, qij.u] into the single value qij, update the probability bound [pi.l, pi.u], and test the new bound with the classifier. We repeat this process with another subregion until we can classify the object. For example, after calculating the exact values of q11 and q12, we may find R1 could be rejected and the refinement process is finished.
[p2.l, p2.u] = [q21.l, q21.u] * 0.3 + [q22.l, q22.u] * 0.3 + [q23.l, q23.u] * 0.4
[p2.l, p2.u] = q21 * 0.3 + [q22.l, q22.u] * 0.3 + [q23.l, q23.u] * 0.4
[p2.l, p2.u] = q21 * 0.3 + q22 * 0.3 + [q23.l, q23.u] * 0.4
p2 = q21 * 0.3 + q22 * 0.3 + q23 * 0.4
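The collapse-and-retest loop can be sketched as below. The weights 0.3, 0.3, 0.4 come from the example; the interval values, the exact q values, the `exact_q` callback and the inlined classifier are hypothetical stand-ins for illustration.

```python
from typing import Callable, List, Tuple

Bound = Tuple[float, float]

def classify(p_l: float, p_u: float, P: float, delta: float) -> str:
    if p_u < P:
        return "fail"
    if p_l >= P or (p_u - p_l) <= delta:
        return "satisfy"
    return "unknown"

def p_bounds(weights: List[float], q: List[Bound]) -> Bound:
    """Weighted sum of per-subregion intervals [q_ij.l, q_ij.u]."""
    lo = sum(w * l for w, (l, _) in zip(weights, q))
    hi = sum(w * u for w, (_, u) in zip(weights, q))
    return lo, hi

def refine(weights: List[float], q: List[Bound],
           exact_q: Callable[[int], float], P: float, delta: float) -> Tuple[str, int]:
    """Collapse one subregion at a time until the object can be classified."""
    q = list(q)
    evaluated = 0
    while True:
        p_l, p_u = p_bounds(weights, q)
        label = classify(p_l, p_u, P, delta)
        if label != "unknown" or evaluated == len(q):
            return label, evaluated
        value = exact_q(evaluated)      # expensive exact computation for S_j
        q[evaluated] = (value, value)   # collapse [q.l, q.u] to a single value
        evaluated += 1

exact = [0.5, 0.3, 0.1]  # hypothetical exact q values
result = refine([0.3, 0.3, 0.4], [(0.2, 0.9), (0.1, 0.8), (0.0, 0.5)],
                lambda j: exact[j], P=0.5, delta=0.01)
```

In this run the bound starts at [0.09, 0.71], becomes [0.18, 0.59] after the first subregion and [0.24, 0.44] after the second, so the object fails without the third subregion being evaluated.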

Experiment Setup
Uncertain object DB       Long Beach (53k) (http://www.census.gov/geo/www/tiger/)
Uncertainty pdf           Uniform (default); Gaussian (μ: center, σ: 1/6 of range)
Size of R-tree/PTI node   4 Kbytes
Threshold (P)             0.3
Delta (∆)                 0.01

1. Effect of Filtering
The figure shows the fraction of total time spent in filtering and in the Basic solution on synthetic data sets with different candidate set sizes. As the total table size |T| increases, the time spent on the Basic solution grows faster than the filtering time, and it starts to dominate once the candidate set size exceeds 5000. As we show next, other methods can alleviate this problem.

2. Effect of Verification
We compare the time required by the three evaluation strategies (Basic, Refine and VR) over a wide range of values of P. Both Refine and VR perform better than Basic. At P = 0.3, for instance, the costs of Refine and VR are 80% and 16% of Basic respectively. The reason is that both techniques allow query processing to finish once all objects have been classified, without waiting for the exact qualification probabilities to be computed. For large values of P, most objects can be classified as fail quickly, because their upper probability bounds are found to be lower than P. Moreover, VR is consistently better than Refine: it is five times faster than Refine at P = 0.3, and 40 times faster at P = 0.7. This is explained by the next figure.

2. Analysis of VR
This figure shows the execution time of filtering, verification and refinement for VR. While the filtering time is fixed, the refinement time decreases with P. Verification takes only 1 ms on average, and it significantly reduces the number of objects to be refined. In fact, when P > 0.3, no more qualification probabilities need to be computed. Thus, VR performs better than Refine.

3. Effect of Threshold
The figure shows the fraction of objects labeled unknown after the verifiers are executed in the order {RS, L-SR, U-SR}. This fraction reflects the amount of work needed to finish the query. At P = 0.1, about 75% of the objects remain unknown after RS finishes; L-SR removes 7% more; 15% are left unknown after U-SR is executed. When P is large, RS and U-SR perform better, since they reduce upper probability bounds, so that objects have a higher chance of being labeled as fail. L-SR works better for small P (as seen from the gap between the RS and L-SR curves): it raises the lower probability bound, so an object is more easily classified as satisfy at small P. In this experiment, U-SR performs better than L-SR. This is because the candidate set is large (about 96 objects), so the probabilities of the objects are generally quite small. Since U-SR reduces their upper probability bounds, they are relatively easy to verify as fail, compared with L-SR, which attempts to raise their lower probability bounds.

4. Effect of Tolerance
Now we measure the fraction of queries finished after verification under different tolerances. The figure shows that as Δ increases from 0 to 0.2, more queries are completed. When Δ = 0.16, about 10% more queries are completed than when Δ = 0. Thus, the use of tolerance can improve query performance.

5. Gaussian pdf
Finally, we examine the effect of using a Gaussian distribution as the uncertainty pdf of each object. Each Gaussian pdf, approximated by a 300-bar histogram, has its mean at the center of its range and a standard deviation of 1/6 of the width of the uncertainty region. Figure 14 shows the time on a log scale. VR again outperforms the other two methods, and the saving is more significant than with the uniform pdf. This is because probability evaluation for a Gaussian pdf is expensive, but the verifiers can effectively avoid this operation. This experiment shows that our method also works well with Gaussian pdfs. The low time cost of both Refine and VR at threshold P = 1 is due to the fact that at most one candidate can satisfy the query at P = 1. By checking against these conditions, both methods can accept or reject candidate objects with ease.
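A histogram of the kind described here could be built as in the following sketch, which uses one-dimensional ranges for simplicity. The mean, the standard deviation of 1/6 of the width, and the 300 bars follow the setup above; the renormalization of the small tail mass cut off at the region boundary and the function name are assumptions.

```python
import math
from typing import List

def gaussian_histogram(lo: float, hi: float, bars: int = 300) -> List[float]:
    """Probability mass per bar for a Gaussian truncated to [lo, hi]."""
    mu = (lo + hi) / 2.0       # mean at the center of the range
    sigma = (hi - lo) / 6.0    # standard deviation: 1/6 of the width

    def cdf(x: float) -> float:
        return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

    width = (hi - lo) / bars
    mass = [cdf(lo + (b + 1) * width) - cdf(lo + b * width) for b in range(bars)]
    total = sum(mass)          # about 0.9973, since the range spans ±3 sigma
    return [m / total for m in mass]

hist = gaussian_histogram(0.0, 60.0)
```

Every bar boundary is a point where the histogram pdf changes, which suggests why a 300-bar pdf makes exact evaluation more expensive than a single uniform bar.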

Related Works
PNNQ: R-tree based [TKDE04]; Monte-Carlo based [DASFAA07]; line approximation of uncertainty pdf [ICDE07b]
Range queries [DPD99, ISSD99, VLDB04a, VLDB05, ICDE07a]
Top-k queries [ICDE07c, ICDE08b, ICDE08c]
Skylines [VLDB07] and reverse skylines [SIGMOD08]
Identification in uncertain biometric databases [ICDE06]

Other Uncertainty Models
Probabilistic database: each tuple is augmented with a probability value (tuple uncertainty).
Dalvi & Suciu [VLDB04b, ICDE07d] studied efficient query operator evaluation with ranked results.
[VLDB06, ICDE08b] combined the attribute and tuple uncertainty models.
A large branch of work deals with fuzzy modeling [IGP06].
Acyclic data structure (Hung, Getoor & Subrahmanian) [ICDE03]

References
[TKDE04] R. Cheng, D. V. Kalashnikov, and S. Prabhakar. Querying imprecise data in moving object environments. IEEE TKDE, 16(9), Sept. 2004.
[SIGMOD03] R. Cheng, D. V. Kalashnikov, and S. Prabhakar. Evaluating probabilistic queries over imprecise data. In Proc. ACM SIGMOD, 2003.
[DASFAA07] H. Kriegel, P. Kunath, and M. Renz. Probabilistic nearest-neighbor query on uncertain objects. In DASFAA, 2007.
[ICDE06] C. Böhm, A. Pryakhin, and M. Schubert. The Gauss-tree: Efficient object identification in databases of probabilistic feature vectors. In Proc. ICDE, 2006.
[ICDE07a] J. Chen and R. Cheng. Efficient evaluation of imprecise location-dependent queries. In Proc. ICDE, 2007.
[IGP06] J. Galindo, A. Urrutia, and M. Piattini. Fuzzy Databases: Modeling, Design, and Implementation. Idea Group Publishing, 2006.
[ICDE08b] M. Hua, J. Pei, X. Lin, and W. Zhang. Efficiently answering probabilistic threshold top-k queries on uncertain data. In Proc. ICDE, 2008.
[SIGMOD08] X. Lian and L. Chen. Monochromatic and bichromatic reverse skyline search over uncertain databases. In Proc. SIGMOD, 2008.
[ICDE08c] K. Yi, F. Li, D. Srivastava, and G. Kollios. Efficient processing of top-k queries in uncertain databases. In Proc. ICDE, 2008.

References
[VLDB05] Y. Tao, R. Cheng, X. Xiao, W. K. Ngai, B. Kao, and S. Prabhakar. Indexing multi-dimensional uncertain data with arbitrary probability density functions. In Proc. VLDB, 2005.
[VLDB04b] N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. In VLDB, 2004.
[ICDE07d] C. Re, N. Dalvi, and D. Suciu. Efficient top-k query evaluation on probabilistic data. In ICDE, 2007.
[VLDB04c] A. Deshpande, C. Guestrin, S. Madden, J. Hellerstein, and W. Hong. Model-driven data acquisition in sensor networks. In VLDB, 2004.
[VLDB06] O. Benjelloun, A. Das Sarma, A. Halevy, and J. Widom. ULDBs: Databases with uncertainty and lineage. In VLDB, 2006.
[ICDE07b] V. Ljosa and A. K. Singh. APLA: Indexing arbitrary probability distributions. In Proc. ICDE, 2007.
[ADI00] Y. Manolopoulos, Y. Theodoridis, and V. J. Tsotras. Chapter 4: Access methods for intervals. In Advanced Database Indexing, Kluwer, 2000.
[VLDB07] J. Pei, B. Jiang, X. Lin, and Y. Yuan. Probabilistic skylines on uncertain data. In Proc. VLDB, 2007.
[DPD99] O. Wolfson, P. Sistla, S. Chamberlain, and Y. Yesha. Updating and querying databases that track mobile units. Distributed and Parallel Databases, 7(3), 1999.
[ISSD99] D. Pfoser and C. S. Jensen. Capturing the uncertainty of moving-object representations. In Proc. of the Sixth International Symposium on Spatial Databases, Hong Kong, July 20-23, 1999, pp. 111-132.
[ICDE08a] Singh et al. Database support for pdf attributes. In Proc. ICDE, 2008.
[ICDE07c] M. Soliman, I. Ilyas, and K. Chang. Top-k query processing in uncertain databases. In ICDE, 2007.

Conclusions
To avoid expensive evaluation of PNNQ, we propose the notion of constrained PNNQ (P, ∆).
We present a framework which gradually refines the bounds of qualification probabilities:
RS, L-SR, and U-SR verifiers
Incremental refinement
The method deals with arbitrary uncertainty pdfs.