High-Dimensional Data. Topics Motivation Similarity Measures Index Structures.

Presentation transcript:

High-Dimensional Data

Topics: Motivation, Similarity Measures, Index Structures

R trees, redux We want to minimize coverage and overlap. [Figure: rectangles c, d, e, f, g grouped under nodes A and B] We descend both branches to search for
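A minimal sketch of the two quantities named above, assuming axis-aligned 2-D rectangles stored as (xmin, ymin, xmax, ymax). The tuple layout and the function names are illustrative assumptions, not part of the slides.

```python
from typing import Tuple

# Assumed layout for illustration: (xmin, ymin, xmax, ymax).
Rect = Tuple[float, float, float, float]


def area(r: Rect) -> float:
    """Coverage of one bounding rectangle."""
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])


def overlap(a: Rect, b: Rect) -> float:
    """Area shared by two bounding rectangles (0 if they are disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, width) * max(0.0, height)


if __name__ == "__main__":
    A = (0.0, 0.0, 2.0, 2.0)
    B = (1.0, 1.0, 3.0, 3.0)
    print(area(A), area(B), overlap(A, B))
```

A query point that falls inside the overlap of A and B cannot rule out either node, which is why a large overlap forces a search down both branches.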

R+ Trees Store d in both A and B, like splitting d into two pieces. [Figure: d appears under both A and B]

R* trees When a node overflows, don’t split it right away; reinsert some of its entries. [Figure: a new entry x arrives; nodes A and B hold c, d, e, f, g]

R* trees Normal insertion: the overflowing node is split, creating a new node X. [Figure: root (A, B, X); leaves c, d, f, g under A and B; e, x under X]

R* trees Reinsert c instead of splitting the node. [Figure: root (A, B); leaves x, d, e, f, g, c]
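A rough sketch of how an overflowing node might choose which entries to reinsert. The slides do not say how entries are chosen; the rule below (take the entries whose centers lie farthest from the center of the node's bounding rectangle, about 30% of them) is an assumption based on the usual description of R*-trees, and every name here is hypothetical.

```python
import math
from typing import List, Tuple

# Assumed layout for illustration: (xmin, ymin, xmax, ymax).
Rect = Tuple[float, float, float, float]


def mbr(rects: List[Rect]) -> Rect:
    """Minimum bounding rectangle of a list of rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def center(r: Rect) -> Tuple[float, float]:
    return ((r[0] + r[2]) / 2, (r[1] + r[3]) / 2)


def pick_reinsert(entries: List[Rect], fraction: float = 0.3):
    """Split an overflowing node's entries into (kept, to_reinsert).

    Assumption: the entries farthest from the node center are reinserted.
    """
    node_center = center(mbr(entries))
    ranked = sorted(entries,
                    key=lambda e: math.dist(center(e), node_center),
                    reverse=True)
    p = max(1, int(len(entries) * fraction))
    return ranked[p:], ranked[:p]


if __name__ == "__main__":
    node = [(0, 0, 2, 2), (1, 1, 3, 3), (2, 0, 4, 2), (2, 2, 4, 4), (9, 3, 10, 4)]
    kept, again = pick_reinsert(node)
    print("reinsert:", again)
```

The removed entries are then inserted again from the root, so they may land in a different node and the split may be avoided altogether.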

Curse of Dimensionality Coverage and overlap as a function of dimension? [Figure: panels for d=1, d=2, d=3]

Curse of Dimensionality Generally: exponential growth of the hypervolume as a function of dimension. Other manifestations: the number of samples required to maintain the same accuracy; the number of nodes in a neural network required to “monitor” the input space; lots more.
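Two small calculations make the growth concrete. This is only an illustration, not something taken from the slides: the number of cells in a grid with k cells per axis, and the fraction of the unit hypercube covered by the ball inscribed in it, using the standard formula for the volume of a d-dimensional ball.

```python
import math


def grid_cells(k: int, d: int) -> int:
    """Number of cells when each of d axes is cut into k intervals."""
    return k ** d


def ball_to_cube_ratio(d: int) -> float:
    """Volume of the radius-0.5 ball divided by the volume of the unit cube."""
    r = 0.5
    return math.pi ** (d / 2) * r ** d / math.gamma(d / 2 + 1)


if __name__ == "__main__":
    for d in (1, 2, 3, 10, 20):
        print(d, grid_cells(10, d), ball_to_cube_ratio(d))
```

As d grows, almost all of the cube's volume sits in its corners, outside the inscribed ball, so a bounding region of a fixed shape covers more and more empty space.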

High-dimensional data Finance; Multimedia: sound, music (“query by humming”), images, video; Document retrieval; Biology/Medicine: DNA sequence matching, medical imagery; Moving objects: [(t0,x0,y0), (t1,x1,y1), …]; High-energy physics

High-dimensional Access Methods Three components: Similarity Measure, Index Structures, Search Strategy. We won’t cover search strategy.

Similarity Measure When are two vectors similar? Q = DB =

Similarity Measure Define a function $s : V \times V \to \mathbb{R}$. What properties should s have? Reflexive: s(x,x) = 0 (or infinity). Symmetric: s(x,y) = s(y,x). Triangle inequality: s(x,y) + s(y,z) >= s(x,z).

Timeseries Indexing Q = A = B =

Timeseries Indexing A B C D Q

Euclidean distance; Dynamic Time Warping [Jagadish, Faloutsos 1998; Keogh 2002]; Wavelets [Miller 2003]; LCSS [Vlachos, Kollios, Gunopulos 2002]; EDR [Chen, Ozsu, Oria 2005]

Euclidean Distance Q = A = $d(Q, A) = \sqrt{\sum_i (q_i - a_i)^2} = 7.8$

Euclidean Distance (2) Q A B
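For reference, a short sketch of the Euclidean distance between two equal-length sequences. The function name is illustrative and the example vectors are made up, since the vectors on the slide did not survive.

```python
import math


def euclidean(q, a):
    """Square root of the summed squared differences, point by point."""
    if len(q) != len(a):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((qi - ai) ** 2 for qi, ai in zip(q, a)))


if __name__ == "__main__":
    print(euclidean([0.0, 0.0], [3.0, 4.0]))
```

Because the i-th point of one series is always compared with the i-th point of the other, two series with the same shape but a small shift in time can end up far apart.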

Dynamic Time Warping

Dynamic Time Warping (2)

Dynamic Time Warping (3)

Dynamic Time Warping (4) Drawbacks: sensitive to noise; expensive to compute.
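A minimal sketch of the textbook dynamic-programming formulation of Dynamic Time Warping, assuming one-dimensional series and absolute difference as the local cost. It uses no warping window, which is an assumption; the slides do not say which variant they show.

```python
import math


def dtw(q, a):
    """Cost of the cheapest monotone alignment between q and a."""
    n, m = len(q), len(a)
    # cost[i][j] = best cost of aligning q[:i] with a[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(q[i - 1] - a[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # q[i-1] also matches an earlier point of a
                                     cost[i][j - 1],      # a[j-1] also matches an earlier point of q
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


if __name__ == "__main__":
    print(dtw([1, 2, 3], [1, 2, 2, 3]))
```

The table has n × m cells, which is the source of the cost drawback: every comparison is quadratic in the series length, against linear for Euclidean distance.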

Wavelets Fourier Transform: represents a timeseries as a sum of sine waves. The coefficients of the constituent waves indicate the dominant structure.

Wavelets (2) Same trick, different basis function: Sum of sine waves? Sum of Dirac delta functions? Sum of …

Wavelets (3) Haar wavelet transform: $s_i + s_{i+1}$ and $s_i - s_{i+1}$. Hierarchical decomposition allows fine-tuning.
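A sketch of the Haar decomposition built from the pairwise sums and differences above. The 1/sqrt(2) scaling is an assumption added here so that the transform preserves Euclidean length; the slide shows the plain sum and difference. The function names are illustrative, and the input length is assumed to be a power of two.

```python
import math


def haar_step(s):
    """One level: scaled pairwise sums (coarse) and differences (detail)."""
    if len(s) % 2:
        raise ValueError("length must be even")
    k = 1 / math.sqrt(2)
    sums = [(s[i] + s[i + 1]) * k for i in range(0, len(s), 2)]
    diffs = [(s[i] - s[i + 1]) * k for i in range(0, len(s), 2)]
    return sums, diffs


def haar(s):
    """Full decomposition: [overall coefficient, coarse details, ..., fine details]."""
    if len(s) == 0 or len(s) & (len(s) - 1):
        raise ValueError("length must be a power of two")
    current, details = list(s), []
    while len(current) > 1:
        current, d = haar_step(current)
        details = d + details  # coarser details go in front
    return current + details


if __name__ == "__main__":
    print(haar([4.0, 4.0, 4.0, 4.0]))
    print(haar([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]))
```

Keeping only the first few coefficients gives a coarse view of the series, and adding later ones refines it; that is the fine-tuning the hierarchical decomposition allows.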

Wavelets (4) After one horizontal filtering

Wavelets (5) After two vertical and horizontal filterings

Wavelets (6) Wavelets can reduce dimensionality, like Principal Component Analysis (PCA), Singular Value Decomposition (SVD), and others. Index in the reduced feature space. False positives are OK; false negatives aren’t. Use a more refined similarity measure to eliminate false positives.
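A sketch of the filter-and-refine idea, assuming Euclidean distance and a simple reduction: segment sums divided by the square root of the segment width, which for power-of-two widths corresponds to keeping only the coarse Haar coefficients. Distances in this reduced space never exceed the true distance, so the filter cannot drop a true answer. All names are illustrative; a real system would store the reduced vectors in an index instead of scanning a list.

```python
import math


def reduce_dim(s, w):
    """Scaled segment sums: one value per block of w points."""
    return [sum(s[i:i + w]) / math.sqrt(w) for i in range(0, len(s), w)]


def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def candidates(db, q, eps, w=2):
    """Filter step in the reduced space; may return false positives."""
    rq = reduce_dim(q, w)
    return [s for s in db if dist(reduce_dim(s, w), rq) <= eps]


def range_query(db, q, eps, w=2):
    """Refine step: check the survivors with the full distance."""
    return [s for s in candidates(db, q, eps, w) if dist(s, q) <= eps]


if __name__ == "__main__":
    db = [[1, 2, 3, 4], [2, 1, 4, 3], [9, 9, 9, 9]]
    q = [1, 2, 3, 4]
    print(candidates(db, q, 0.5))   # includes a false positive
    print(range_query(db, q, 0.5))  # false positive removed
```

In the example, [2, 1, 4, 3] has the same segment sums as the query, so it passes the filter but is removed by the refine step.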

Other measures Longest Common Subsequence (LCSS); Edit Distance on Real sequence (EDR)
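Hedged sketches of both measures for one-dimensional series, assuming two points match when they differ by at most a threshold eps. The published versions work on trajectories, and LCSS as published also limits how far apart in time matched points may be; both details are left out here, so treat this as a simplification.

```python
def lcss(a, b, eps):
    """Length of the longest common subsequence under eps-matching."""
    n, m = len(a), len(b)
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(a[i - 1] - b[j - 1]) <= eps:
                best[i][j] = best[i - 1][j - 1] + 1
            else:
                best[i][j] = max(best[i - 1][j], best[i][j - 1])
    return best[n][m]


def edr(a, b, eps):
    """Edit distance where a substitution is free if the points match."""
    n, m = len(a), len(b)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        cost[i][0] = i  # delete everything in a[:i]
    for j in range(m + 1):
        cost[0][j] = j  # insert everything in b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if abs(a[i - 1] - b[j - 1]) <= eps else 1
            cost[i][j] = min(cost[i - 1][j - 1] + sub,
                             cost[i - 1][j] + 1,
                             cost[i][j - 1] + 1)
    return cost[n][m]


if __name__ == "__main__":
    clean = [1.0, 2.0, 3.0, 4.0]
    noisy = [1.0, 100.0, 3.0, 4.0]  # one outlier
    print(lcss(clean, noisy, 0.5), edr(clean, noisy, 0.5))
```

Both measures count matches instead of summing differences, so the outlier costs one unit no matter how large it is.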

Index Structures
SS-Tree [White, Jain 96]: R*-Tree using Minimum Bounding Spheres
SR-Tree [Katayama, Satoh 97]: uses MBR during construction, but MBS during lookup
X-Tree [Berchtold, Keim, Kriegel 96]: R*-Tree using extended nodes to avoid splits and control maximum overlap
M-Tree [Ciaccia, Patella 00]: build tree based on representative points
TV-tree [Lin, Jagadish, Faloutsos 94]
SR-Tree and M-Tree appear to outperform others

M-Tree
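A sketch of how a tree built on representative points can prune a range query with the triangle inequality. It assumes each node stores a representative point and a covering radius; the Node layout and field names are hypothetical, and the real M-tree keeps further stored distances that this sketch leaves out.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, ...]


@dataclass
class Node:
    pivot: Point    # representative point of the subtree
    radius: float   # covering radius: everything below lies within this distance of pivot
    children: List["Node"] = field(default_factory=list)
    points: List[Point] = field(default_factory=list)  # filled only in leaves


def range_search(node: Node, q: Point, r: float, dist=math.dist) -> List[Point]:
    """Return all stored points within distance r of q."""
    # Triangle inequality: if q is farther from the pivot than radius + r,
    # nothing in this subtree can be within r of q.
    if dist(q, node.pivot) > node.radius + r:
        return []
    hits = [p for p in node.points if dist(q, p) <= r]
    for child in node.children:
        hits.extend(range_search(child, q, r, dist))
    return hits


if __name__ == "__main__":
    left = Node((0.0, 0.0), 1.5, points=[(0.0, 0.0), (1.0, 1.0)])
    right = Node((10.0, 10.0), 1.5, points=[(10.0, 10.0), (11.0, 11.0)])
    root = Node((5.0, 5.0), 9.0, children=[left, right])
    print(range_search(root, (0.5, 0.5), 1.0))
```

Only a distance function is needed, not coordinates, so the same pruning works for any measure that satisfies the triangle inequality.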

Telescoping Vector Tree (TV) node = (center, radius); dim(center) >= # of “active dimensions”