CMU SCS 15-826: Multimedia Databases and Data Mining Lecture #11: Fractals: M-trees and dim. curse (case studies – Part II) C. Faloutsos.

Slides:



Advertisements
Similar presentations
CMU SCS : Multimedia Databases and Data Mining Lecture #17: Text - part IV (LSI) C. Faloutsos.
Advertisements

Trees for spatial indexing
CMU SCS : Multimedia Databases and Data Mining Lecture #7: Spatial Access Methods - IV Grid files, dim. curse C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #4: Multi-key and Spatial Access Methods - I C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #19: SVD - part II (case studies) C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #11: Fractals: M-trees and dim. curse (case studies – Part II) C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #7: Spatial Access Methods - Metric trees C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #10: Fractals - case studies - I C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture#5: Multi-key and Spatial Access Methods - II C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture#2: Primary key indexing – B-trees Christos Faloutsos - CMU
CMU SCS Carnegie Mellon Univ. Dept. of Computer Science /615 – DB Applications Faloutsos & Pavlo Lecture#11 (R&G ch. 11) Hashing.
CMU SCS : Multimedia Databases and Data Mining Lecture#3: Primary key indexing – hashing C. Faloutsos.
Indexing and Data Mining in Multimedia Databases Christos Faloutsos CMU
CMU SCS : Multimedia Databases and Data Mining Lecture #25: Multimedia indexing C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #9: Fractals - introduction C. Faloutsos.
15-826: Multimedia Databases and Data Mining
1 NNH: Improving Performance of Nearest- Neighbor Searches Using Histograms Liang Jin (UC Irvine) Nick Koudas (AT&T Labs Research) Chen Li (UC Irvine)
Similarity Search for Adaptive Ellipsoid Queries Using Spatial Transformation Yasushi Sakurai (NTT Cyber Space Laboratories) Masatoshi Yoshikawa (Nara.
Multimedia DBs. Multimedia dbs A multimedia database stores text, strings and images Similarity queries (content based retrieval) Given an image find.
ADBIS 2003 Revisiting M-tree Building Principles Tomáš Skopal 1, Jaroslav Pokorný 2, Michal Krátký 1, Václav Snášel 1 1 Department of Computer Science.
CMU SCS : Multimedia Databases and Data Mining Lecture #16: Text - part III: Vector space model and clustering C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #11: Fractals - case studies Part III (regions, quadtrees, knn queries) C. Faloutsos.
Indexing and Data Mining in Multimedia Databases Christos Faloutsos CMU
Lars Arge1, Mark de Berg2, Herman Haverkort3 and Ke Yi1
CMU SCS Graph and stream mining Christos Faloutsos CMU.
R-tree Analysis. R-trees - performance analysis How many disk (=node) accesses we’ll need for range nn spatial joins why does it matter?
School of Computer Science Carnegie Mellon Boston U., 2005C. Faloutsos1 Data Mining using Fractals and Power laws Christos Faloutsos Carnegie Mellon University.
10-603/15-826A: Multimedia Databases and Data Mining SVD - part I (definitions) C. Faloutsos.
R-tree Analysis. R-trees - performance analysis How many disk (=node) accesses we’ll need for range nn spatial joins why does it matter?
Spatial Indexing. Spatial Queries Given a collection of geometric objects (points, lines, polygons,...) organize them on disk, to answer point queries.
CMU SCS : Multimedia Databases and Data Mining Lecture #30: Conclusions C. Faloutsos.
Introduction to Fractals and Fractal Dimension Christos Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #8: Fractals - introduction C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #8: Fractals - introduction C. Faloutsos.
Fast Subsequence Matching in Time-Series Databases Christos Faloutsos M. Ranganathan Yannis Manolopoulos Department of Computer Science and ISR University.
CMU SCS : Multimedia Databases and Data Mining Lecture #10: Fractals: M-trees and dim. curse (case studies – Part II) C. Faloutsos.
School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos1 Advanced Data Mining Tools: Fractals and Power Laws for Graphs, Streams and Traditional.
CMU SCS : Multimedia Databases and Data Mining Lecture #9: Fractals – examples & algo’s C. Faloutsos.
The X-Tree An Index Structure for High Dimensional Data Stefan Berchtold, Daniel A Keim, Hans Peter Kriegel Institute of Computer Science Munich, Germany.
M- tree: an efficient access method for similarity search in metric spaces Reporter : Ximeng Liu Supervisor: Rongxing Lu School of EEE, NTU
CMU SCS : Multimedia Databases and Data Mining Lecture #12: Fractals - case studies Part III (quadtrees, knn queries) C. Faloutsos.
Parallel dynamic batch loading in the M-tree Jakub Lokoč Department of Software Engineering Charles University in Prague, FMP.
CMU SCS : Multimedia Databases and Data Mining Lecture #18: SVD - part I (definitions) C. Faloutsos.
R-trees: An Average Case Analysis. R-trees - performance analysis How many disk (=node) accesses we ’ ll need for range nn spatial joins why does it matter?
Multimedia and Time-Series Data When Is “ Nearest Neighbor ” Meaningful? Group member: Terry Chan, Edward Chu, Dominic Leung, David Mak, Henry Yeung, Jason.
Spatial Range Querying for Gaussian-Based Imprecise Query Objects Yoshiharu Ishikawa, Yuichi Iijima Nagoya University Jeffrey Xu Yu The Chinese University.
The Power-Method: A Comprehensive Estimation Technique for Multi- Dimensional Queries Yufei Tao U. Hong Kong Christos Faloutsos CMU Dimitris Papadias Hong.
School of Computer Science Carnegie Mellon WRIGHT, 2005C. Faloutsos1 Data Mining using Fractals and Power laws Christos Faloutsos Carnegie Mellon University.
CMU SCS : Multimedia Databases and Data Mining Lecture #7: Spatial Access Methods - Metric trees C. Faloutsos.
Rethinking Choices for Multi-dimensional Point Indexing You Jung Kim and Jignesh M. Patel University of Michigan.
Carnegie Mellon Data Mining – Research Directions C. Faloutsos CMU
15-826: Multimedia Databases and Data Mining
Next Generation Data Mining Tools: SVD and Fractals
Spatial Indexing.
Indexing and Data Mining in Multimedia Databases
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
R-trees: An Average Case Analysis
15-826: Multimedia Databases and Data Mining
Liang Jin (UC Irvine) Nick Koudas (AT&T Labs Research)
Presentation transcript:

CMU SCS : Multimedia Databases and Data Mining Lecture #11: Fractals: M-trees and dim. curse (case studies – Part II) C. Faloutsos

CMU SCS Copyright: C. Faloutsos (2011)2 Must-read Material Alberto Belussi and Christos Faloutsos, Estimating the Selectivity of Spatial Queries Using the `Correlation' Fractal Dimension Proc. of VLDB, p , 1995 Estimating the Selectivity of Spatial Queries Using the `Correlation' Fractal Dimension

CMU SCS Copyright: C. Faloutsos (2011)3 Optional Material Optional, but very useful: Manfred Schroeder Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise W.H. Freeman and Company, 1991

CMU SCS Copyright: C. Faloutsos (2011)4 Outline Goal: ‘Find similar / interesting things’ Intro to DB Indexing - similarity search Data Mining

CMU SCS Copyright: C. Faloutsos (2011)5 Indexing - Detailed outline primary key indexing secondary key / multi-key indexing spatial access methods –z-ordering –R-trees –misc fractals –intro –applications text...

CMU SCS Copyright: C. Faloutsos (2011)6 Indexing - Detailed outline fractals –intro –applications disk accesses for R-trees (range queries) dimensionality reduction selectivity in M-trees dim. curse revisited “fat fractals” quad-tree analysis [Gaede+]

CMU SCS Copyright: C. Faloutsos (2011)7 What else can they solve? separability [KDD’02] forecasting [CIKM’02] dimensionality reduction [SBBD’00] non-linear axis scaling [KDD’02] disk trace modeling [Wang+’02] selectivity of spatial/multimedia queries [PODS’94, VLDB’95, ICDE’00]...

CMU SCS Copyright: C. Faloutsos (2011)8 Metric trees - analysis Problem: How many disk accesses, for an M-tree? Given: – N (# of objects) – C (fanout of disk pages) – r (radius of range query - BIASED model) Optional

CMU SCS Copyright: C. Faloutsos (2011)9 Metric trees - analysis Problem: How many disk accesses, for an M-tree? Given: – N (# of objects) – C (fanout of disk pages) – r (radius of range query - BIASED model) NOT ENOUGH - what else do we need? Optional

CMU SCS Copyright: C. Faloutsos (2011)10 Metric trees - analysis A: something about the distribution Optional

CMU SCS Copyright: C. Faloutsos (2011)11 Metric trees - analysis A: something about the distribution [Ciaccia, Patella, Zezula, PODS98]: assumed that the distance distribution is the same, for every object: Paolo CiacciaMarco Patella Optional

CMU SCS Copyright: C. Faloutsos (2011)12 Metric trees - analysis A: something about the distribution [Ciaccia+, PODS98]: assumed that the distance distribution is the same, for every object: F1(d) = Prob(an object is within d from object #1) = F2(d) =... = F(d) Optional

CMU SCS Copyright: C. Faloutsos (2011)13 Metric trees - analysis A: something about the distribution Given our ‘fractal’ tools, we could try them - which one? Optional

CMU SCS Copyright: C. Faloutsos (2011)14 Metric trees - analysis A: something about the distribution Given our ‘fractal’ tools, we could try them - which one? A: Correlation integral [Traina+, ICDE2000] Optional

CMU SCS Copyright: C. Faloutsos (2011)15 Metric trees - analysis English dictionary Portuguese dictionary log(d) log(#pairs) log(d) log(#pairs) Optional

CMU SCS Copyright: C. Faloutsos (2011)16 Metric trees - analysis log(d) log(#pairs) Divina Comedia Eigenfaces Optional

CMU SCS Copyright: C. Faloutsos (2011)17

CMU SCS Copyright: C. Faloutsos (2011)18 Metric trees - analysis Optional

CMU SCS Copyright: C. Faloutsos (2011)19 Metric trees - analysis So, what is the # of disk accesses, for a node of radius r d, on a query of radius r q ? rdrd rqrq Optional

CMU SCS Copyright: C. Faloutsos (2011)20 Metric trees - analysis So, what is the # of disk accesses, for a node of radius r d, on a query of radius r q ? A: ~ (r d +r q ).... rdrd rqrq Optional

CMU SCS Copyright: C. Faloutsos (2011)21 Metric trees - analysis So, what is the # of disk accesses, for a node of radius r d, on a query of radius r q ? A: ~ (r d +r q )^D rdrd rqrq Optional

CMU SCS Copyright: C. Faloutsos (2011)22 Accuracy of selectivity formulas log(rq) log(#d.a.) Optional

CMU SCS Copyright: C. Faloutsos (2011)23 Fast estimation of D Normally, D takes O(N^2) time Anything faster? suppose we have already built an M-tree Optional

CMU SCS Copyright: C. Faloutsos (2011)24 Fast estimation of D Hint: Optional

CMU SCS Copyright: C. Faloutsos (2011)25 Fast estimation of D Hint: ratio of radii: r1^D * C = r2^D D ~ log(C) / log(r2/r1) r1 r2 Optional

CMU SCS Copyright: C. Faloutsos (2011)26 Indexing - Detailed outline fractals –intro –applications disk accesses for R-trees (range queries) dimensionality reduction selectivity in M-trees dim. curse revisited “fat fractals” quad-tree analysis [Gaede+]

CMU SCS Copyright: C. Faloutsos (2011)27 Dim. curse revisited (Q: how serious is the dim. curse, e.g.:) Q: what is the search effort for k-nn? –given N points, in E dimensions, in an R-tree, with k-nn queries (‘biased’ model) [Pagel, Korn + ICDE 2000]

CMU SCS Copyright: C. Faloutsos (2011)28 (Overview of proofs) assume that your points are uniformly distributed in a d-dimensional manifold (= hyper-plane) derive the formulas substitute d for the fractal dimension

CMU SCS Copyright: C. Faloutsos (2011)29 Reminder: Hausdorff Dimension (D 0 ) r = side length (each dimension) B(r) = # boxes containing points  r D0 r = 1/2 B = 2r = 1/4 B = 4r = 1/8 B = 8 logr = -1 logB = 1 logr = -2 logB = 2 logr = -3 logB = 3 proof

CMU SCS Copyright: C. Faloutsos (2011)30 Reminder: Correlation Dimension (D 2 ) S(r) =  p i 2 (squared % pts in box)  r D2  #pairs( within <= r ) r = 1/2 S = 1/2r = 1/4 S = 1/4r = 1/8 S = 1/8 logr = -1 logS = -1 logr = -2 logS = -2 logr = -3 logS = -3 proof

CMU SCS Copyright: C. Faloutsos (2011)31 Observation #1 How to determine avg MBR side l? –N = #pts, C = MBR capacity Hausdorff dimension: B(r)  r D0 B(l) = N/C = l -D0  l = (N/C) -1/D0 l proof

CMU SCS Copyright: C. Faloutsos (2011)32 Observation #2 k-NN query   -range query –For k pts, what radius  do we expect? Correlation dimension: S(r)  r D2 22 proof

CMU SCS Copyright: C. Faloutsos (2011)33 Observation #3 Estimate avg # query-sensitive anchors: –How many expected q will touch avg page? –Page touch: q stabs  -dilated MBR(p) p MBR(p) q  l l  q p  proof

CMU SCS Copyright: C. Faloutsos (2011)34 Asymptotic Formula k-NN page accesses as N   –C = page capacity –D = fractal dimension (=D0 ~ D2)

CMU SCS Copyright: C. Faloutsos (2011)35 Asymptotic Formula NO mention of the embedding dimensionality!! Still have dim. curse, but on f.d. D

CMU SCS Copyright: C. Faloutsos (2011)36 Synthetic Data plane –D 0 = D 2 = 2 –embedded in E-space –N = 100K manifold –E = 8 –D 0 =D 2 varies from 1-6 –line, plane, etc. (in 8-d)

CMU SCS Copyright: C. Faloutsos (2011)37 Accuracy of L  Formula

CMU SCS Copyright: C. Faloutsos (2011)38 Embedding Dimension plane k = 50 L  dist

CMU SCS Copyright: C. Faloutsos (2011)39 Intrinsic Dimensionality manifold k = 50 L  dist

CMU SCS Copyright: C. Faloutsos (2011)40 Non-Euclidean Data Set sierpinski, k = 50, L  dist

CMU SCS Copyright: C. Faloutsos (2011)41 Conclusions Worst-case theory is over-pessimistic High dimensional data can exhibit good performance if correlated, non-uniform Many real data sets are self-similar Determinant is intrinsic dimensionality –multiple fractal dimensions (D 0 and D 2 ) –indication of how far one can go

CMU SCS Copyright: C. Faloutsos (2011)42 References Ciaccia, P., M. Patella, et al. (1998). A Cost Model for Similarity Queries in Metric Spaces. PODS. Pagel, B.-U., F. Korn, et al. (2000). Deflating the Dimensionality Curse Using Multiple Fractal Dimensions. ICDE, San Diego, CA. Traina, C., A. J. M. Traina, et al. (2000). Distance Exponent: A New Concept for Selectivity Estimation in Metric Trees. ICDE, San Diego, CA.