CMU SCS 15-826: Multimedia Databases and Data Mining Lecture #10: Fractals - case studies - I C. Faloutsos.

Slides:



Advertisements
Similar presentations
CMU SCS : Multimedia Databases and Data Mining Lecture #17: Text - part IV (LSI) C. Faloutsos.
Advertisements

CMU SCS : Multimedia Databases and Data Mining Lecture #7: Spatial Access Methods - IV Grid files, dim. curse C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #4: Multi-key and Spatial Access Methods - I C. Faloutsos.
The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation Yasushi Sakurai (NTT Cyber Space Laboratories) Masatoshi Yoshikawa.
CMU SCS : Multimedia Databases and Data Mining Lecture #20: SVD - part III (more case studies) C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #19: SVD - part II (case studies) C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #11: Fractals: M-trees and dim. curse (case studies – Part II) C. Faloutsos.
Spatial Indexing SAMs. Spatial Indexing Point Access Methods can index only points. What about regions? Z-ordering and quadtrees Use the transformation.
Spatial and Temporal Data Mining V. Megalooikonomou Spatial Access Methods (SAMs) II (some slides are based on notes by C. Faloutsos)
CMU SCS : Multimedia Databases and Data Mining Lecture #7: Spatial Access Methods - Metric trees C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture#5: Multi-key and Spatial Access Methods - II C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture#2: Primary key indexing – B-trees Christos Faloutsos - CMU
CMU SCS : Multimedia Databases and Data Mining Lecture#3: Primary key indexing – hashing C. Faloutsos.
Multidimensional Data
CMU SCS : Multimedia Databases and Data Mining Lecture #6: Spatial Access Methods Part III: R-trees C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #11: Fractals: M-trees and dim. curse (case studies – Part II) C. Faloutsos.
15-826: Multimedia Databases and Data Mining
CMU SCS : Multimedia Databases and Data Mining Lecture #16: Text - part III: Vector space model and clustering C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #11: Fractals - case studies Part III (regions, quadtrees, knn queries) C. Faloutsos.
Spatial Mining.
10-603/15-826A: Multimedia Databases and Data Mining SVD - part II (more case studies) C. Faloutsos.
R-tree Analysis. R-trees - performance analysis How many disk (=node) accesses we’ll need for range nn spatial joins why does it matter?
1 Lecture 20: Indexes Friday, February 25, Outline Representing data elements (12) Index structures (13.1, 13.2) B-trees (13.3)
Tirgul 6 B-Trees – Another kind of balanced trees Problem set 1 - some solutions.
10-603/15-826A: Multimedia Databases and Data Mining SVD - part I (definitions) C. Faloutsos.
R-tree Analysis. R-trees - performance analysis How many disk (=node) accesses we’ll need for range nn spatial joins why does it matter?
Spatial Indexing. Spatial Queries Given a collection of geometric objects (points, lines, polygons,...) organize them on disk, to answer point queries.
CMU SCS : Multimedia Databases and Data Mining Lecture #30: Conclusions C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #6: Spatial Access Methods Part III: R-trees C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #8: Fractals - introduction C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #10: Fractals: M-trees and dim. curse (case studies – Part II) C. Faloutsos.
A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.
Multidimensional Indexes Applications: geographical databases, data cubes. Types of queries: –partial match (give only a subset of the dimensions) –range.
CMU SCS : Multimedia Databases and Data Mining Lecture #12: Fractals - case studies Part III (quadtrees, knn queries) C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #30: Data Mining - assoc. rules C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #18: SVD - part I (definitions) C. Faloutsos.
R-trees: An Average Case Analysis. R-trees - performance analysis How many disk (=node) accesses we ’ ll need for range nn spatial joins why does it matter?
The Power-Method: A Comprehensive Estimation Technique for Multi- Dimensional Queries Yufei Tao U. Hong Kong Christos Faloutsos CMU Dimitris Papadias Hong.
A Spatial Index Structure for High Dimensional Point Data Wei Wang, Jiong Yang, and Richard Muntz Data Mining Lab Department of Computer Science University.
IMinMax B.C. Ooi, K.-L Tan, C. Yu, S. Stephen. Indexing the Edges -- A Simple and Yet Efficient Approach to High dimensional Indexing. ACM SIGMOD-SIGACT-
CMU SCS : Multimedia Databases and Data Mining Lecture #7: Spatial Access Methods - Metric trees C. Faloutsos.
Rethinking Choices for Multi-dimensional Point Indexing You Jung Kim and Jignesh M. Patel University of Michigan.
Carnegie Mellon Data Mining – Research Directions C. Faloutsos CMU
Database Applications (15-415) DBMS Internals- Part III Lecture 13, March 06, 2016 Mohammad Hammoud.
Spatial Data Management
15-826: Multimedia Databases and Data Mining
Spatial Indexing.
Spatial Indexing I Point Access Methods.
COMP 430 Intro. to Database Systems
Database Applications (15-415) DBMS Internals- Part III Lecture 15, March 11, 2018 Mohammad Hammoud.
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
Multidimensional Indexes
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
R-trees: An Average Case Analysis
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
C. Faloutsos Spatial Access Methods - R-trees
Index Structures Chapter 13 of GUW September 16, 2019
Presentation transcript:

CMU SCS : Multimedia Databases and Data Mining Lecture #10: Fractals - case studies - I C. Faloutsos

CMU SCS Copyright: C. Faloutsos (2011)2 Must-read Material Christos Faloutsos and Ibrahim Kamel, Beyond Uniformity and Independence: Analysis of R-trees Using the Concept of Fractal Dimension, Proc. ACM SIGACT- SIGMOD-SIGART PODS, May 1994, pp. 4-13, Minneapolis, MN. Beyond Uniformity and Independence: Analysis of R-trees Using the Concept of Fractal Dimension

CMU SCS Copyright: C. Faloutsos (2011)3 Optional Material Optional, but very useful: Manfred Schroeder Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise W.H. Freeman and Company, 1991 (on reserve in the WeH library)

CMU SCS Reminder Code at Also, in ‘R’ > library(fdim); Copyright: C. Faloutsos (2011)4

CMU SCS Copyright: C. Faloutsos (2011)5 Outline Goal: ‘Find similar / interesting things’ Intro to DB Indexing - similarity search Data Mining

CMU SCS Copyright: C. Faloutsos (2011)6 Indexing - Detailed outline primary key indexing secondary key / multi-key indexing spatial access methods –z-ordering –R-trees –misc fractals –intro –applications text...

CMU SCS Copyright: C. Faloutsos (2011)7 Indexing - Detailed outline fractals –intro –applications disk accesses for R-trees (range queries) dimensionality reduction selectivity in M-trees dim. curse revisited “fat fractals” quad-tree analysis [Gaede+]

CMU SCS Copyright: C. Faloutsos (2011)8 (Fractals mentioned before:) for performance analysis of R-trees fractals for dim. reduction

CMU SCS Copyright: C. Faloutsos (2011)9 Case study#1: R-tree performance Problem Given – N points in E-dim space Estimate # disk accesses for a range query (q1 x... x q E ) (assume: ‘good’ R-tree, with tight, cube-like MBRs)

CMU SCS Copyright: C. Faloutsos (2011)10 Case study#1: R-tree performance Problem Given – N points in E-dim space –with fractal dimension D Estimate # disk accesses for a range query (q1 x... x q E ) (assume: ‘good’ R-tree, with tight, cube-like MBRs) Typically, in DB Q-opt: uniformity + independence

CMU SCS Copyright: C. Faloutsos (2011)11 Examples:World’s countries BUT: area vs population for ~200 countries (1991 CIA fact-book). area pop log(area) log(pop)

CMU SCS Copyright: C. Faloutsos (2011)12 Examples:World’s countries neither uniform, nor independent! area pop log(area) log(pop)

CMU SCS Copyright: C. Faloutsos (2011)13 For fun: identification

CMU SCS Copyright: C. Faloutsos (2011)14 For fun: identification

CMU SCS Copyright: C. Faloutsos (2011)15 For fun: identification Highest density lowest density 47 residents(!)

CMU SCS Copyright: C. Faloutsos (2011)16 For fun: identification 47 residents(!)

CMU SCS Copyright: C. Faloutsos (2011)17 Examples: TIGER files neither uniform, nor independent! MG countyLB county

CMU SCS Copyright: C. Faloutsos (2011)18 How to proceed? recall the [Pagel+] formula, for range queries of size q1 x q2 #DiskAccesses(q1,q2) = sum ( x i,1 + q1) * (x i,2 + q2) But: formula needs to know the x i,j sizes of MBRs!

CMU SCS Copyright: C. Faloutsos (2011)19 How to proceed? But: formula needs to know the x i,j sizes of MBRs! Answer (jumping ahead): s = (C/N) 1/D0

CMU SCS Copyright: C. Faloutsos (2011)20 How to proceed? But: formula needs to know the x i,j sizes of MBRs! Answer (jumping ahead): s = (C/N) 1/D0 side of (parent) MBR page capacity # of data points Hausdorff fd

CMU SCS Copyright: C. Faloutsos (2011)21 Let’s see the rationale s = (C/N) 1/D0

CMU SCS Copyright: C. Faloutsos (2011)22 R-trees - performance analysis I.e: for range queries - how many disk accesses, if we just now that we have - N points in E-d space? A: can not tell! need to know distribution proof

CMU SCS Copyright: C. Faloutsos (2011)23 R-trees - performance analysis Q: OK - so we are told that the Hausdorff fractal dim. = D0 - Next step? (also know that there are at most C points per page) D0=1 D0=2 proof

CMU SCS Copyright: C. Faloutsos (2011)24 R-trees - performance analysis Assumption1: square-like parents (s*s) Assumption2: fully packed (C points each) Assumption3: non-overlapping D0=1 D0=2 s1=s2=s proof

CMU SCS Copyright: C. Faloutsos (2011)25 R-trees - performance analysis Assumption1: square-like parents (s*s) Assumption2: fully packed (N/C non-empty) Assumption3: non-overlapping D0=1 s1=s2=s proof

CMU SCS Copyright: C. Faloutsos (2011)26 R-trees - performance analysis Hint: dfn of Hausdorff f.d.: Felix Hausdorff ( ) proof

CMU SCS Copyright: C. Faloutsos (2011)27 Reminder: Hausdorff or box-counting fd: Box counting plot: Log( N ( r ) ) vs Log ( r) r: grid side N (r ): count of non-empty cells (Hausdorff) fractal dimension D0:

CMU SCS Copyright: C. Faloutsos (2011)28 Reminder Hausdorff fd: r log(r) log(#non-empty cells) D0D0 proof N/C s

CMU SCS Copyright: C. Faloutsos (2011)29 Reminder dfn of Hausdorff fd implies that N(r) ~ r (-D0) # non-empty cells of side r proof

CMU SCS Copyright: C. Faloutsos (2011)30 R-trees - performance analysis Q (rephrased): what is the side s1, s2,... of parent nodes, given N data points, packed by C, with f.d. = D0 D0=1 D0=2 proof

CMU SCS Copyright: C. Faloutsos (2011)31 R-trees - performance analysis Q (rephrased): what is the side s1, s2,... of parent nodes, given N data points, packed by C, with f.d. = D0 D0=1 D0=2 s1 s2 proof

CMU SCS Copyright: C. Faloutsos (2011)32 R-trees - performance analysis Q (rephrased): what is the side s1, s2,... of parent nodes, given N data points, packed by C, with f.d. = D0 D0=1 D0=2 s1=s2=s proof

CMU SCS Copyright: C. Faloutsos (2011)33 R-trees - performance analysis A: (educated guess) s=s1=s2 (=... ) - square-like MBRs N/C non-empty cells = K * s (-D0) D0=1 D0=2 s1 s2 log(s) log(#cells) proof

CMU SCS Copyright: C. Faloutsos (2011)34 R-trees - performance analysis Details of derivations: in [PODS 94]. Finally, expected side s of parent MBRs: s = (C/N) 1/D0 Q: sanity check: how does s change with D0? A: proof

CMU SCS Copyright: C. Faloutsos (2011)35 R-trees - performance analysis Details of derivations: in [Kamel+, PODS 94]. Finally, expected side s of parent MBRs: s = (C/N) 1/D0 Q: sanity check: how does s change with D0? A: s grows with D0 Q: does it make sense? Q: does it suffer from (intrinsic) dim. curse?

CMU SCS Copyright: C. Faloutsos (2011)36 R-trees - performance analysis Q: Final-final formula (# disk accesses for range queries q1 x q2 x... ): A:

CMU SCS Copyright: C. Faloutsos (2011)37 R-trees - performance analysis Q: Final-final formula (# disk accesses for range queries q1 x q2 x... ): A: # of parent-node accesses: N/C * (s + q1) * (s + q2 ) *... ( s + q E ) A: # of grand-parent node accesses

CMU SCS Copyright: C. Faloutsos (2011)38 R-trees - performance analysis Q: Final-final formula (# disk accesses for range queries q1 x q2 x... ): A: # of parent-node accesses: N/C * (s + q1) * (s + q2 ) *... ( s + q E ) A: # of grand-parent node accesses N/(C^2) * (s’ + q1) * (s’ + q2 ) *... ( s’ + q E ) s’ = ??

CMU SCS Copyright: C. Faloutsos (2011)39 R-trees - performance analysis Q: Final-final formula (# disk accesses for range queries q1 x q2 x... ): A: # of parent-node accesses: N/C * (s + q1) * (s + q2 ) *... ( s + q E ) A: # of grand-parent node accesses N/(C^2) * (s’ + q1) * (s’ + q2 ) *... ( s’ + q E ) s’ = (C^2/N) 1/D0

CMU SCS Copyright: C. Faloutsos (2011)40 R-trees - performance analysis Results: IUE (x-y star coordinates) # leaf accesses query side

CMU SCS Copyright: C. Faloutsos (2011)41 R-trees - performance analysis Results: LB County # leaf accesses query side

CMU SCS Copyright: C. Faloutsos (2011)42 R-trees - performance analysis Results: MG-county # leaf accesses query side

CMU SCS Copyright: C. Faloutsos (2011)43 R-trees - performance analysis Results: 2D- uniform # leaf accesses query side

CMU SCS Copyright: C. Faloutsos (2011)44 R-trees - performance analysis Conclusions: usually, <5% relative error, for range queries

CMU SCS Copyright: C. Faloutsos (2011)45 Indexing - Detailed outline fractals –intro –applications disk accesses for R-trees (range queries) dimensionality reduction selectivity in M-trees dim. curse revisited “fat fractals” quad-tree analysis [Gaede+].... Optional

CMU SCS Copyright: C. Faloutsos (2011)46 Case study #2: Dim. reduction Problem definition: ‘Feature selection’ given N points, with E dimensions keep the k most ‘informative’ dimensions [Traina+,SBBD’00] Caetano Traina Agma Traina Leejay Wu Optional

CMU SCS Copyright: C. Faloutsos (2011)47 Dim. reduction - w/ fractals not informative Optional

CMU SCS Copyright: C. Faloutsos (2011)48 Dim. reduction Problem definition: ‘Feature selection’ given N points, with E dimensions keep the k most ‘informative’ dimensions Re-phrased: spot and drop attributes with strong (non-)linear correlations Q: how do we do that? Optional

CMU SCS Copyright: C. Faloutsos (2011)49 Dim. reduction A: Hint: correlated attributes do not affect the intrinsic/fractal dimension, e.g., if y = f(x,z,w) we can drop y (hence: ‘partial fd’ (PFD) of a set of attributes = the fd of the dataset, when projected on those attributes) Optional

CMU SCS Copyright: C. Faloutsos (2011)50 Dim. reduction - w/ fractals PFD~0 PFD=1 global FD=1 Optional

CMU SCS Copyright: C. Faloutsos (2011)51 Dim. reduction - w/ fractals PFD=1 global FD=1 Optional

CMU SCS Copyright: C. Faloutsos (2011)52 Dim. reduction - w/ fractals PFD~1 global FD=1 Optional

CMU SCS Copyright: C. Faloutsos (2011)53 Dim. reduction - w/ fractals (problem: given N points in E-d, choose k best dimensions) Q: Algorithm? Optional

CMU SCS Copyright: C. Faloutsos (2011)54 Dim. reduction - w/ fractals Q: Algorithm? A: e.g., greedy - forward selection: –keep the attribute with highest partial fd –add the one that causes the highest increase in pfd –etc., until we are within epsilon from the full f.d. Optional

CMU SCS Copyright: C. Faloutsos (2011)55 Dim. reduction - w/ fractals (backward elimination: ~ reverse) –drop the attribute with least impact on the p.f.d. –repeat –until we are epsilon below the full f.d. Optional

CMU SCS Copyright: C. Faloutsos (2011)56 Dim. reduction - w/ fractals Q: what is the smallest # of attributes we should keep? Optional

CMU SCS Copyright: C. Faloutsos (2011)57 Dim. reduction - w/ fractals Q: what is the smallest # of attributes we should keep? A: we should keep at least as many as the f.d. (and probably, a few more) Optional

CMU SCS Copyright: C. Faloutsos (2011)58 Dim. reduction - w/ fractals Results: E.g., on the ‘currency’ dataset (daily exchange rates for USD, HKD, BP, FRF, DEM, JPY - i.e., 6-d vectors, one per day - base currency: CAD) e.g.: USD FRF Optional

CMU SCS Copyright: C. Faloutsos (2011)59 E.g., on the ‘currency’ dataset correlation integral log(r) log(#pairs(<=r)) Optional

CMU SCS Copyright: C. Faloutsos (2011)60 E.g., on the ‘currency’ dataset Optional

CMU SCS Copyright: C. Faloutsos (2011)61 Dim. reduction - w/ fractals Conclusion: can do non-linear dim. reduction PFD~1 global FD=1 Optional

CMU SCS Copyright: C. Faloutsos (2011)62 References [PODS94] Faloutsos, C. and I. Kamel (May 24-26, 1994). Beyond Uniformity and Independence: Analysis of R-trees Using the Concept of Fractal Dimension. Proc. ACM SIGACT-SIGMOD-SIGART PODS, Minneapolis, MN. [Traina+, SBBD’00] Traina, C., A. Traina, et al. (2000). Fast feature selection using the fractal dimension. XV Brazilian Symposium on Databases (SBBD), Paraiba, Brazil.