The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries
Yufei Tao (U. Hong Kong), Christos Faloutsos (CMU), Dimitris Papadias (Hong Kong UST)

Slide 2: Roadmap
- Problem – motivation
- Survey
- Proposed method – main idea
- Proposed method – details
- Experiments
- Conclusions

Slide 3: Target query types
DB = set of m-dimensional points.
- Range search (RS)
- k nearest neighbor (KNN)
- Regional distance (self-) join (RDJ): e.g., in Louisiana, find all pairs of music stores closer than 1 mile to each other
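
To make the three query types concrete, here is a minimal brute-force sketch (illustration only, not from the paper; the `region` predicate and the choice of the L2 metric are assumptions):

```python
# Brute-force reference implementations of the three target query types.
# Illustration only: the paper is about *estimating* the selectivity / I/O
# cost of these queries, not about answering them.
from math import dist   # Euclidean (L2) distance; any Lp metric could be used

def range_search(points, q, r):
    """RS: all points within distance r of the query point q."""
    return [p for p in points if dist(p, q) <= r]

def knn(points, q, k):
    """KNN: the k points closest to q."""
    return sorted(points, key=lambda p: dist(p, q))[:k]

def regional_distance_self_join(points, region, r):
    """RDJ: all pairs of points inside `region` that lie within distance r of
    each other, e.g. pairs of music stores in Louisiana closer than 1 mile.
    `region` is a hypothetical predicate: point -> bool."""
    inside = [p for p in points if region(p)]
    return [(a, b) for i, a in enumerate(inside)
            for b in inside[i + 1:] if dist(a, b) <= r]
```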

Slide 4: Target problem
Estimate
- query selectivity
- query (I/O) cost
- for any Lp metric
- using a single method

Slide 5: Target problem
- for any Lp metric
- using a single method

            RS        KNN       RDJ
  Sel.                 X
  I/O

(The crossed-out Sel./KNN cell: a kNN query always returns exactly k points, so its selectivity needs no estimation.)

Slide 6: Roadmap
- Problem – motivation
- Survey
- Proposed method – main idea
- Proposed method – details
- Experiments
- Conclusions

Slide 7: Older query estimation approaches
- Vast literature: sampling, kernel estimation, singular value decomposition, compressed histograms, sketches, maximal independence, Euler formula, etc.
- BUT: they target specific cases (mostly range-search selectivity under the L∞ norm), and their extensions to other problems are unclear

Slide 8: Main competitors
- Local method
  - Representative methods: histograms
- Global method
  - Provides a single estimate corresponding to the average selectivity/cost of all queries, independently of their locations
  - Representative methods: fractal and power law

Slide 9: Rationale and problems of histograms
- Partition the data space into a set of buckets and assume (local) uniformity
- Problems:
  - the uniformity assumption
  - tricky/slow estimations for all but the L∞ norm

Slide 10: Roadmap
- Problem – motivation
- Survey
- Proposed method – main idea
- Proposed method – details
- Experiments
- Conclusions

Slide 11: Inherent defect of histograms
- Density trap: what is the density in the vicinity of q?
  - diameter = 10: 10/100 = 0.1
  - diameter = 100: 100/10,000 = 0.01
- Q: What is going on?

Slide 12: Inherent defect of histograms
- Density trap: what is the density in the vicinity of q?
  - diameter = 10: 10/100 = 0.1
  - diameter = 100: 100/10,000 = 0.01
- Q: What is going on?
- A: we are asking a silly question, roughly "what is the area of a line?"

Slide 13: "Density trap"
- Not caused by a mathematical oddity like the Hilbert curve, but by a line, a perfectly well-behaved Euclidean object!
- This 'trap' will appear for any non-uniform dataset
- Almost ALL real point sets are non-uniform -> the trap is real
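
A tiny numerical illustration of the trap (not from the slides; the line dataset and window sizes are chosen to mirror the 10-vs-100 example above):

```python
# 'Density trap' on a perfectly ordinary line of points: the measured
# density around q depends on the window used to measure it.
points = [(x, x) for x in range(1000)]            # points along a line
q = (500, 500)

def neighbors_within(r):                          # L-infinity r-neighborhood
    return sum(1 for (x, y) in points
               if abs(x - q[0]) <= r and abs(y - q[1]) <= r)

for r in (5, 50):                                 # window diameters 10 and 100
    cnt, area = neighbors_within(r), (2 * r) ** 2
    print(f"diameter {2*r:>3}: {cnt} neighbors / area {area} "
          f"-> 'density' {cnt / area:.4f}")
# The 'density' shrinks by ~10x as the window grows, so "the density near q"
# is simply not a well-defined number for this (1-dimensional) point set.
```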

Slide 14: "Density trap"
- In short: 'density' (= count / area) is a meaningless quantity here
- What should we do instead?

Slide 15: "Density trap"
- In short: 'density' (= count / area) is a meaningless quantity here
- What should we do instead?
- A: look at log(count_of_neighbors) vs log(area)

Slide 16: Local power law
- In more detail, the 'local power law' (LPL): nb_p(r) = c_p * r^(n_p), where
  - nb_p(r): # neighbors of point p within radius r
  - c_p: 'local constant'
  - n_p: 'local exponent' (= local intrinsic dimensionality)
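
A minimal sketch of how the two coefficients can be measured for a point p (assumed approach: a least-squares fit on the log-log plot; the radii are arbitrary choices, not the paper's):

```python
# Fit the local power law  nb_p(r) = c_p * r^(n_p)  for a single point p by
# regressing log(neighbor count) on log(radius).
import numpy as np

def local_power_law(points, p, radii, metric=np.inf):
    pts = np.asarray(points, dtype=float)
    counts = [np.sum(np.linalg.norm(pts - np.asarray(p, dtype=float),
                                    ord=metric, axis=1) <= r)
              for r in radii]
    n_p, log_c_p = np.polyfit(np.log(radii), np.log(counts), 1)
    return np.exp(log_c_p), n_p            # (local constant, local exponent)

# For the line dataset above, the local exponent comes out close to 1:
line = [(x, x) for x in range(1000)]
c_p, n_p = local_power_law(line, (500, 500), radii=[2, 4, 8, 16, 32])
print(f"c_p ~ {c_p:.2f}, n_p ~ {n_p:.2f}")
```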

Slide 17: Local power law
- Intuitively: to avoid the 'density trap', use n_p (the local intrinsic dimensionality) instead of density

Slide 18: Does LPL make sense?
- For point 'q' (from the earlier example): LPL gives nb_q(r) = r^1 (no need for 'density', nor uniformity)
  - diameter = 10: 10/100 = 0.1
  - diameter = 100: 100/10,000 = 0.01

Slide 19: Local power law and Lx
- If a point obeys the L.P.L. under L∞, it does so under any other Lx metric, with the same 'local exponent'
- -> the LPL works easily for ANY Lx metric
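
One way to see why the exponent survives a change of metric (a sketch of the intuition, not the paper's Lemma 3.2): in d dimensions the L∞ ball of radius r contains the L2 ball of radius r and is contained in the L2 ball of radius r√d, so the L2 neighbor count is sandwiched by the L∞ power law:

```latex
c_p \left(\tfrac{r}{\sqrt{d}}\right)^{n_p}
  \;\le\; nb^{L_2}_p(r) \;\le\; c_p\, r^{\,n_p}
```

Hence nb^{L2}_p(r) also grows as r^(n_p): only the local constant can shift (here by at most a factor d^(n_p/2)), never the local exponent; the same norm-equivalence argument applies to any Lx metric.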

Slide 20: Examples
- p1 has a higher 'local exponent' (= 'local intrinsic dimensionality') than p2
[figure: log-log plot of #neighbors(<= r) vs radius, one curve for p1 and one for p2]

Slide 21: Examples

Slide 22: Roadmap
- Problem – motivation
- Survey
- Proposed method – main idea
- Proposed method – details
- Experiments
- Conclusions

Slide 23: Proposed method
- Main idea: if we know (or can approximate) the c_p and n_p of every point p, we can solve all the problems:

Slide 24: Target problem
- for any Lp metric
- using a single method

            RS        KNN       RDJ
  Sel.                 X
  I/O

Slide 25: Target problem
- for any Lp metric (Lemma 3.2)
- using a single method

            RS        KNN       RDJ
  Sel.    Thm 3.1      X      Thm 3.2
  I/O     Thm 3.3   Thm 3.4   Thm 3.5
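
For flavor, the simplest of these estimates (a sketch of the idea, not the paper's exact statement of Theorem 3.1): for a range query of radius r around q, the LPL directly predicts the number of qualifying points, so with N the dataset cardinality

```latex
sel_{RS}(q, r) \;\approx\; \frac{nb_q(r)}{N} \;=\; \frac{c_q\, r^{\,n_q}}{N}
```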

Slide 26: Theoretical results
- Interesting observation (Thm 3.4): the cost of a kNN query q depends
  - only on the 'local exponent'
  - and NOT on the 'local constant'
  - nor on the cardinality of the dataset
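
A back-of-envelope argument for why the local constant drops out (my sketch of the intuition, not the paper's proof of Theorem 3.4): the kNN radius r_k solves c_q r_k^(n_q) = k, and a leaf node near q holding about f points has extent r_f with c_q r_f^(n_q) ≈ f; a rough count of the leaves touched by the kNN sphere then involves only their ratio:

```latex
r_k = \left(\tfrac{k}{c_q}\right)^{1/n_q},\qquad
r_f \approx \left(\tfrac{f}{c_q}\right)^{1/n_q},\qquad
\text{leaves touched} \;\sim\; \left(\tfrac{r_k + r_f}{r_f}\right)^{n_q}
  = \left(\left(\tfrac{k}{f}\right)^{1/n_q} + 1\right)^{n_q}
```

c_q cancels, leaving only k, the node fanout f, and the local exponent n_q.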

Slide 27: Implementation
- Given a query point q, we need its local exponent and constant to perform estimation
- But: too expensive to store them for every point
  - Q: What to do?

Slide 28: Implementation
- Given a query point q, we need its local exponent and constant to perform estimation
- But: too expensive to store them for every point
  - Q: What to do?
  - A: exploit locality:

Slide 29: Implementation
- Nearby points usually have similar local constants and exponents; thus, one solution:
- 'Anchors': pre-compute the LPLaw for a set of representative points (anchors); use the anchor nearest to q

Slide 30: Implementation
- Choose anchors with sampling, DBS, or any other method.
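
A minimal sketch of the anchor machinery (assumed interface, not the paper's implementation; it reuses the local_power_law helper from the earlier sketch and plain random sampling for anchor selection):

```python
# Precompute LPL coefficients at a few anchors; at query time, borrow the
# coefficients of the anchor nearest to q to produce an estimate.
import random
import numpy as np

def build_anchors(points, num_anchors, radii):
    anchors = random.sample(list(points), num_anchors)      # simple sampling
    return [(np.asarray(a, dtype=float), *local_power_law(points, a, radii))
            for a in anchors]                                # (location, c, n)

def nearest_anchor(anchors, q):
    q = np.asarray(q, dtype=float)
    return min(anchors, key=lambda t: np.linalg.norm(t[0] - q, ord=np.inf))

def estimate_range_selectivity(anchors, q, r, N):
    _, c, n = nearest_anchor(anchors, q)
    return min(1.0, c * r ** n / N)      # LPL-based estimate, clipped to [0, 1]
```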

Slide 31: Implementation
- (In addition to 'anchors', we also tried 'patches' of near-constant c_p and n_p; this gave similar accuracy at the cost of a more complicated implementation)

Slide 32: Experiments – settings
- Datasets:
  - SC: 40k points representing the coastlines of Scandinavia
  - LB: 53k points corresponding to locations in Long Beach County
- Structure: R*-tree
- Compare the Power-method to:
  - Minskew
  - Global method (fractal)

Slide 33: Experiments – settings
- The LPLaw coefficients of each anchor point are computed using L∞ 0.05-neighborhoods
- Queries: biased (following the data distribution)
  - A query workload contains 500 queries
  - We report the average error: Σ_i |act_i - est_i| / Σ_i act_i
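
The reported error measure, spelled out as code over a workload of actual and estimated values:

```python
def average_relative_error(actual, estimated):
    """sum_i |act_i - est_i| / sum_i act_i, as used in the experiments."""
    return sum(abs(a - e) for a, e in zip(actual, estimated)) / sum(actual)
```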

Slide 34: Target problem
- for any Lp metric (Lemma 3.2)
- using a single method

            RS        KNN       RDJ
  Sel.    Thm 3.1      X      Thm 3.2
  I/O     Thm 3.3   Thm 3.4   Thm 3.5

Slide 35: Range search selectivity
- The LPL method wins

Slide 36: Target problem
- for any Lp metric (Lemma 3.2)
- using a single method

            RS        KNN       RDJ
  Sel.    Thm 3.1      X      Thm 3.2
  I/O     Thm 3.3   Thm 3.4   Thm 3.5

Slide 37: Regional distance join selectivity
- No known global method exists in this case
- The LPL method wins, by a wider margin

Slide 38: Target problem
- for any Lp metric (Lemma 3.2)
- using a single method

            RS        KNN       RDJ
  Sel.    Thm 3.1      X      Thm 3.2
  I/O     Thm 3.3   Thm 3.4   Thm 3.5

Slide 39: Range search query cost

Slide 40: k nearest neighbor cost

Slide 41: Regional distance join cost

Slide 42: Conclusions
- We spotted the "density trap" problem of the local uniformity assumption (used by histograms)
- We showed how to resolve it, using the 'local intrinsic dimension' instead (-> the 'Local Power Law')
- And we solved all posed problems:

Slide 43: Conclusions – cont'd
- for any Lp metric
- using a single method

            RS        KNN       RDJ
  Sel.                 X
  I/O

Slide 44: Conclusions – cont'd
- for any Lp metric (Lemma 3.2)
- using a single method (LPL & 'anchors')

            RS        KNN       RDJ
  Sel.    Thm 3.1      X      Thm 3.2
  I/O     Thm 3.3   Thm 3.4   Thm 3.5