Presenters: Amool Gupta Amit Sharma

MOTIVATION What basic problem does it address, and why? What other techniques solve the same problem, and how is this one a step ahead? Basic fundamentals of this indexing structure

Similarity Search Problem

Similarity Searching Effectiveness: how well the formulated similarity measure models human perception. Efficiency: how the required performance is achieved over huge volumes of data, i.e. through an index structure.

Examples of Distance Functions Lp metric functions (for vectors): L1 (Manhattan distance), L2 (Euclidean distance), L∞ (maximum distance). Edit distance (for strings). Hausdorff distance. Earth mover's distance. Quadratic form distance.
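As an illustration of two of the distance functions listed above, here is a minimal Python sketch. The function names lp_distance and edit_distance are chosen for this example only; the edit distance shown is the usual Levenshtein variant with unit costs, which is an assumption because the slides do not say which variant is meant.

```python
from typing import Sequence


def lp_distance(x: Sequence[float], y: Sequence[float], p: float) -> float:
    """Lp distance between two vectors of equal length; p may be float('inf')."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == float("inf"):
        # L-infinity: the largest coordinate difference
        return max(diffs, default=0.0)
    return sum(diff ** p for diff in diffs) ** (1.0 / p)


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance with unit cost for insert, delete and substitute."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # substitute or match
        prev = cur
    return prev[-1]
```

With p = 1 the first function gives the Manhattan distance, with p = 2 the Euclidean distance, and with p = float("inf") the maximum distance.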

Metric Spaces: An Abstraction of Similarity A metric space M = (D, d) is a pair, where D is a domain (“universe”) of values and d is a distance function that, ∀ x, y, z ∈ D, satisfies the metric axioms: d(x,y) ≥ 0 and d(x,y) = 0 ⇔ x = y (positivity); d(x,y) = d(y,x) (symmetry); d(x,y) ≤ d(x,z) + d(z,y) (triangle inequality). All the distance functions in the previous examples are metrics, and so are the (weighted) Lp-norms. A common distance that does not fit the metric framework is dynamic time warping (DTW), which violates the triangle inequality. Metric indexes use only the metric axioms to organize objects, and exploit the triangle inequality to prune the search space.
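The axioms above can be spot-checked on a finite sample of objects. The sketch below is only a sanity check under that assumption: passing it does not prove that d is a metric on the whole domain, and the helper name violated_axiom and the tolerance tol are hypothetical.

```python
from itertools import product


def violated_axiom(points, d, tol=1e-9):
    """Return the name of the first metric axiom that d violates on the
    given sample of points, or None if no violation is found."""
    for x, y in product(points, repeat=2):
        dxy = d(x, y)
        if dxy < -tol:
            return "positivity"
        if (dxy <= tol) != (x == y):
            # d(x, y) = 0 must hold exactly when x = y
            return "positivity"
        if abs(dxy - d(y, x)) > tol:
            return "symmetry"
    for x, y, z in product(points, repeat=3):
        if d(x, y) > d(x, z) + d(z, y) + tol:
            return "triangle inequality"
    return None
```

For example, the absolute difference on real numbers passes the check, while the squared difference fails the triangle inequality on the sample 0, 1, 3 because 9 > 1 + 4.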

Limitations of SAMs: SAMs can only index database objects that are represented by feature values in a multidimensional vector space (a more generic indexing strategy is needed). Dissimilarity between objects is measured by an Lp distance between feature values. Distance computation is assumed to be trivial. Limitations of earlier metric trees: they do not support dynamic database environments. They reduce distance computations but pay no attention to I/O costs.

What is a relative distance? OA + AB = OB, so AB = OB − OA. AB is the relative position of B with respect to A. (Figure: points O, A and B.)

M-Tree The key idea is to reduce the number of distance computations and, at the same time, reduce I/O. The M-tree partitions objects on the basis of their relative distances, as measured by a specific distance function, and stores these objects in nodes. (Figure: root, parent object P(Or), and routing object Or with covering radius r(Or).)

M-Tree Structure Leaf nodes store all indexed database objects, by their keys or feature values. Internal nodes are called routing nodes and store routing objects. Each entry for a routing object Or holds: Or = feature value of the routing object; ptr(T(Or)) = pointer to the root of the subtree T(Or); r(Or) = covering radius, i.e. the maximum distance from Or to any object in the subtree T(Or); d(Or, P(Or)) = distance of the routing object from its parent object P(Or).

M-Tree Structure Leaf node: each entry for a database object Oj holds: Oj = feature value of the database object; oid(Oj) = object key; d(Oj, P(Oj)) = distance of Oj from its parent P(Oj).
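The two entry layouts described above could be modelled as follows. This is only a sketch of the fields named on the slides; the class names LeafEntry, RoutingEntry and Node and their attribute names are hypothetical, and page layout or serialization is not addressed.

```python
from dataclasses import dataclass, field
from typing import Any, List, Union


@dataclass
class LeafEntry:
    obj: Any               # Oj: feature value of the database object
    oid: int               # oid(Oj): object key
    dist_to_parent: float  # d(Oj, P(Oj))


@dataclass
class RoutingEntry:
    obj: Any               # Or: feature value of the routing object
    subtree: "Node"        # ptr(T(Or)): root of the covered subtree
    radius: float          # r(Or): covering radius
    dist_to_parent: float  # d(Or, P(Or)); not meaningful for root entries


@dataclass
class Node:
    is_leaf: bool
    entries: List[Union[LeafEntry, RoutingEntry]] = field(default_factory=list)
```

Storing dist_to_parent in every entry is what later allows a search to skip some distance computations.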

Processing Queries SAMs generally try to prune the tree for a given query, and their main emphasis is on efficient pruning methods that reduce the number of disk accesses. Once the tree is pruned, however, the distance from the query point Q to every point in the remaining subtrees must still be computed. The M-tree, by contrast, emphasizes both pruning and reducing the number of distance computations, which it achieves by making maximum use of the precomputed distances stored in its nodes.

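Range Query Given a query point Q and a maximum search distance r(Q), the range query range(Q, r(Q)) returns all objects Oj such that d(Oj, Q) ≤ r(Q). (Figure: root, P(Or), Or with covering radius r(Or), and Q with radius r(Q).) The subtree of Or is of interest only if the query region intersects the region of Or. How can intersection be detected using precomputed distances? The two regions intersect if the distance between Q and Or is at most the sum of the two radii, d(Or, Q) ≤ r(Q) + r(Or). By the triangle inequality, d(Or, Q) ≥ |d(P(Or), Q) − d(Or, P(Or))|, so when this lower bound already exceeds r(Q) + r(Or) the subtree can be discarded without computing d(Or, Q).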

Range Query: Leaf Node (Figure: root, P(Or), leaf object Oj, and Q with radius r(Q).) An object Oj in a leaf node is a solution to the range query if it lies within the query radius, d(Oj, Q) ≤ r(Q). The relative distance can again be used to find out whether the object can lie within that radius or not.

Algorithm for Range Queries
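The slide's algorithm figure is not part of the transcript. The sketch below is one possible recursive range search that follows the pruning ideas described for routing and leaf entries; it is not a reproduction of the slide. The Entry and Node classes, the function name range_query, and the choice to include objects exactly at distance r(Q) are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class Entry:
    obj: Any                          # Or (routing) or Oj (leaf)
    dist_to_parent: float = 0.0       # d(O, P(O)); ignored at the root
    radius: float = 0.0               # r(Or); 0 for leaf entries
    subtree: Optional["Node"] = None  # T(Or); None for leaf entries


@dataclass
class Node:
    is_leaf: bool
    entries: List[Entry] = field(default_factory=list)


def range_query(node: Node, q: Any, rq: float,
                d: Callable[[Any, Any], float],
                d_parent_q: Optional[float] = None,
                out: Optional[list] = None) -> list:
    """Collect every indexed object O with d(O, q) <= rq."""
    if out is None:
        out = []
    for e in node.entries:
        bound = rq + e.radius  # for leaf entries this is just rq
        # Cheap test with stored distances only: the triangle inequality
        # gives d(O, q) >= |d(P, q) - d(O, P)|.
        if d_parent_q is not None and abs(d_parent_q - e.dist_to_parent) > bound:
            continue
        dist = d(e.obj, q)  # the expensive distance computation
        if dist > bound:
            continue
        if node.is_leaf:
            out.append(e.obj)
        else:
            range_query(e.subtree, q, rq, d, dist, out)
    return out
```

At the root there is no parent object, so d_parent_q is left as None and every root entry has its distance computed. Below the root, an entry that fails the first test is skipped without calling d at all.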

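k-Nearest-Neighbor Queries Given a query point Q and an integer k ≥ 1, the k-NN query NN(Q, k) returns the k indexed objects that have the shortest distances to Q. (Figure: root, P(Or), Or with covering radius r(Or), Q, and the minimum and maximum bounds on the distance from Q to objects in T(Or).)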
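The minimum and maximum bounds in the figure follow from the triangle inequality: every object in T(Or) lies within r(Or) of Or, so its distance to Q is at least d(Or, Q) − r(Or) (or 0 if Q falls inside the region) and at most d(Or, Q) + r(Or). A minimal sketch, with hypothetical function names:

```python
def subtree_bounds(d_or_q: float, r_or: float) -> tuple:
    """Lower and upper bounds on d(O, Q) for any object O in T(Or),
    given d(Or, Q) and the covering radius r(Or)."""
    d_min = max(d_or_q - r_or, 0.0)
    d_max = d_or_q + r_or
    return d_min, d_max


def can_skip_subtree(d_or_q: float, r_or: float, d_k: float) -> bool:
    """True if no object in T(Or) can beat the current k-th best distance."""
    d_min, _ = subtree_bounds(d_or_q, r_or)
    return d_min > d_k
```

A k-NN search can shrink the k-th best distance as candidates are found, and then skip any subtree whose minimum bound already exceeds it.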

SPLIT MANAGEMENT The M-tree grows in a bottom-up fashion. Overflow of a node N is managed by splitting N into two nodes, N and a newly created N’. PARTITIONING: the entries are distributed among N and N’. PROMOTE: two entries are promoted as routing objects and moved to the parent level.

SPLIT MANAGEMENT If the split node is a leaf, the covering radius of a promoted object, say Op1, is set to r(Op1) = max{d(Oj, Op1) | Oj ∈ N1}, whereas if overflow occurs in an internal node, r(Op1) = max{d(Or, Op1) + r(Or) | Or ∈ N1}.
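The two covering-radius formulas translate directly into code. In this sketch the function names are hypothetical, leaf entries are passed as plain objects, and routing entries are passed as (Or, r(Or)) pairs.

```python
def leaf_covering_radius(promoted, objects, d):
    """r(Op1) = max{ d(Oj, Op1) | Oj in N1 } for a split leaf node."""
    return max((d(o, promoted) for o in objects), default=0.0)


def internal_covering_radius(promoted, routing_entries, d):
    """r(Op1) = max{ d(Or, Op1) + r(Or) | Or in N1 } for a split internal node.

    routing_entries is an iterable of (Or, r(Or)) pairs.
    """
    return max((d(o, promoted) + r for o, r in routing_entries), default=0.0)
```

The internal-node version adds r(Or) so that the new region is guaranteed to contain everything stored beneath each child routing object.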

SPLIT POLICIES A specific implementation of the Promote and Partition methods defines a split policy. An ideal split policy should promote two objects and partition the other objects so that the two resulting regions have minimum volume and minimum overlap. How is this different from SAMs?

PROMOTE: Choosing Routing Objects m_RAD: minimizes the sum of the two covering radii. mM_RAD: minimizes the maximum of the two radii. M_LB_DIST: maximum lower bound on distance. RANDOM. SAMPLING.
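As a rough illustration of the two radius-based policies, the sketch below tries every pair of candidate objects, assigns each object to the nearer candidate, and keeps the pair with the smallest sum (m_RAD) or smallest maximum (mM_RAD) of the resulting radii. It assumes a leaf-level split where entries are plain objects and uses nearest-object assignment as the partition step; the function name promote and its signature are hypothetical.

```python
from itertools import combinations


def promote(objects, d, policy="mM_RAD"):
    """Pick two routing objects by exhaustive search over all pairs."""
    if policy not in ("m_RAD", "mM_RAD"):
        raise ValueError("unknown policy: " + policy)
    if len(objects) < 2:
        raise ValueError("need at least two objects to promote")
    score = sum if policy == "m_RAD" else max
    best = None
    for p1, p2 in combinations(objects, 2):
        r1 = r2 = 0.0
        for o in objects:
            d1, d2 = d(o, p1), d(o, p2)
            if d1 <= d2:
                r1 = max(r1, d1)  # o would be covered by p1
            else:
                r2 = max(r2, d2)  # o would be covered by p2
        cost = score((r1, r2))
        if best is None or cost < best[0]:
            best = (cost, p1, p2)
    return best[1], best[2]
```

The exhaustive search performs a number of distance computations that grows cubically with the node size.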

PARTITIONING: Distribution of Entries Generalized hyperplane (unbalanced split (why?)): assign each object Oj ∈ N to the nearest routing object: if d(Oj, Op1) ≤ d(Oj, Op2) then assign Oj to N1, else assign Oj to N2. Balanced: compute d(Oj, Op1) and d(Oj, Op2) for all Oj ∈ N. Repeat until N is empty: assign to N1 the nearest neighbor of Op1 in N and remove it from N; assign to N2 the nearest neighbor of Op2 in N and remove it from N.
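Both partitioning strategies can be sketched in a few lines. The function names are hypothetical and entries are treated as plain objects. For clarity the balanced version recomputes distances on each pass instead of caching them as the slide suggests.

```python
def generalized_hyperplane(entries, p1, p2, d):
    """Assign each entry to the nearer of the two promoted objects."""
    n1, n2 = [], []
    for o in entries:
        (n1 if d(o, p1) <= d(o, p2) else n2).append(o)
    return n1, n2


def balanced(entries, p1, p2, d):
    """Alternately give N1 the nearest remaining neighbor of p1 and
    N2 the nearest remaining neighbor of p2 until nothing is left."""
    remaining = list(entries)
    n1, n2 = [], []
    while remaining:
        nearest = min(remaining, key=lambda o: d(o, p1))
        remaining.remove(nearest)
        n1.append(nearest)
        if remaining:
            nearest = min(remaining, key=lambda o: d(o, p2))
            remaining.remove(nearest)
            n2.append(nearest)
    return n1, n2
```

On the entries 0, 1, 2, 3, 10 with promoted objects 0 and 10, the hyperplane strategy puts four entries in N1 and one in N2, which shows why it is unbalanced when objects cluster near one routing object. The balanced strategy yields a 3/2 split but assigns 3 to the far object 10, enlarging that region.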

Experimental Results A constant node size was assumed and all split policies were tested. Results: the balanced partition method was shown to add significant overhead and to increase the I/O cost. The fastest split policy was observed to be RANDOM and the slowest m_RAD. For average volume covered per page (a measure of the quality of tree construction), M_LB_DIST proved effective.

Experimental Results(2)

I/O cost

Avg Volume per page

I/O cost

I/O cost for M-Tree & R*-Tree

Thanks