Presenters: Amool Gupta, Amit Sharma
MOTIVATION
- What basic problem does it address, and why?
- What other techniques solve the same problem, and how is this one a step ahead?
- Basic fundamentals of this indexing structure
Similarity Search Problem
Similarity Searching
- Effectiveness: how well the formulated similarity measure models human perception
- Efficiency: how the required performance is achieved over huge volumes of data (the index structure)
Examples of Distance Functions
- Lp metric functions (for vectors): L1 (Manhattan distance), L2 (Euclidean distance), L∞
- Edit distance (for strings)
- Hausdorff distance
- Earth mover's distance
- Quadratic form distance
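To make the first two kinds of distance concrete, here is a minimal Python sketch. The function names `minkowski` and `edit_distance` are illustrative choices, not taken from the slides, and the edit distance is assumed to be the usual Levenshtein variant with unit costs.

```python
import math


def minkowski(x, y, p):
    """L_p distance between two equal-length vectors; p = math.inf gives L-infinity."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        return max(diffs)
    return sum(diff ** p for diff in diffs) ** (1.0 / p)


def edit_distance(s, t):
    """Levenshtein distance: fewest single-character insertions, deletions, substitutions."""
    prev = list(range(len(t) + 1))          # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]
```

For the vectors (0, 0) and (3, 4), the L1, L2, and L∞ distances are 7, 5, and 4 respectively.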
Metric Spaces: An Abstraction of Similarity
A metric space M = (D, d) is a pair, where D is a domain (“universe”) of values and d is a distance function that, ∀ x, y, z ∈ D, satisfies the metric axioms:
- d(x,y) ≥ 0, d(x,y) = 0 ⇔ x = y (positivity)
- d(x,y) = d(y,x) (symmetry)
- d(x,y) ≤ d(x,z) + d(z,y) (triangle inequality)
All the distance functions in the previous examples are metrics, and so are the (weighted) Lp-norms.
DTW (dynamic time warping), by contrast, does not fit the metric framework.
Metric indexes use only the metric axioms to organize objects, and exploit the triangle inequality to prune the search space.
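The axioms can be checked mechanically on a finite sample of objects. The sketch below is only a sanity check, not a proof, since passing on a sample does not establish that a function is a metric; the helper name `satisfies_metric_axioms` and the squared Euclidean counterexample are illustrative assumptions.

```python
import itertools
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def squared_euclidean(x, y):
    # Not a metric: it violates the triangle inequality.
    return sum((a - b) ** 2 for a, b in zip(x, y))


def satisfies_metric_axioms(points, d, tol=1e-9):
    """Check positivity, symmetry, and the triangle inequality on a finite sample."""
    for x, y in itertools.product(points, repeat=2):
        if d(x, y) < 0 or (d(x, y) <= tol) != (x == y):
            return False                      # positivity fails
        if abs(d(x, y) - d(y, x)) > tol:
            return False                      # symmetry fails
    for x, y, z in itertools.product(points, repeat=3):
        if d(x, y) > d(x, z) + d(z, y) + tol:
            return False                      # triangle inequality fails
    return True


sample = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
euclidean_ok = satisfies_metric_axioms(sample, euclidean)
squared_ok = satisfies_metric_axioms(sample, squared_euclidean)
```

On this sample the squared Euclidean distance fails because d((0,0),(2,0)) = 4 exceeds 1 + 1, the sum of the two shorter hops.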
Limitations of SAMs
- SAMs (spatial access methods) are limited to indexing database objects represented by feature values in a multi-dimensional vector space; a more generic indexing strategy is needed
- Dissimilarity of objects is measured by an Lp distance between feature values
- They assume that distance computation is trivial
Limitations of Metric Trees
- They do not support dynamic database environments
- They reduce distance computations but pay no attention to I/O costs
What is a relative distance?
OA + AB = OB
AB = OB – OA
AB = relative position of B w.r.t. A
(Figure: points O, A, and B)
M-Tree
- The key idea is to reduce distance computations and, at the same time, reduce I/O.
- The M-Tree partitions objects on the basis of their relative distances, as measured by a specific distance function, and stores these objects in nodes.
(Figure: root, parent object P(Or), and routing object Or with covering radius r(Or))
M-Tree Structure
- Leaf nodes store all indexed database objects, by their keys or feature values.
- Internal nodes are called routing nodes and store routing objects.
Entry for a routing object Or:
- Or = feature value of the routing object (itself a DB object)
- ptr(T(Or)) = pointer to the root of the subtree T(Or)
- r(Or) = covering radius: the maximum distance from the routing object Or to any object in the subtree T(Or)
- d(Or, P(Or)) = distance of the routing object from its parent object P(Or)
M-Tree Structure
Leaf node: entry for a database object Oj:
- Oj = feature value of the DB object
- oid(Oj) = object key
- d(Oj, P(Oj)) = distance of Oj from its parent P(Oj)
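The two entry layouts above can be written down as plain records. This is a minimal sketch; the class and field names (`LeafEntry`, `RoutingEntry`, `Node`, `dist_to_parent`, and so on) are hypothetical, and the 0.0 used for the root-level entry is only a placeholder, since root entries have no parent object.

```python
from dataclasses import dataclass, field
from typing import Any, List, Union


@dataclass
class LeafEntry:
    obj: Any                 # O_j: feature value of the DB object
    oid: int                 # oid(O_j): object key
    dist_to_parent: float    # d(O_j, P(O_j))


@dataclass
class RoutingEntry:
    obj: Any                 # O_r: feature value of the routing object
    covering_radius: float   # r(O_r)
    dist_to_parent: float    # d(O_r, P(O_r))
    subtree: "Node"          # ptr(T(O_r))


@dataclass
class Node:
    is_leaf: bool
    entries: List[Union[LeafEntry, RoutingEntry]] = field(default_factory=list)


# A two-level tree: one routing object at (0, 0) covering two leaf objects.
leaf = Node(is_leaf=True, entries=[
    LeafEntry(obj=(0.0, 0.0), oid=1, dist_to_parent=0.0),
    LeafEntry(obj=(1.0, 0.0), oid=2, dist_to_parent=1.0),
])
root = Node(is_leaf=False, entries=[
    RoutingEntry(obj=(0.0, 0.0), covering_radius=1.0, dist_to_parent=0.0, subtree=leaf),
])
```

Storing `dist_to_parent` in every entry is what later allows a search to discard entries without computing a new distance.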
Processing Queries
- Generally, SAMs try to prune the tree for a given query, and the main emphasis is on developing efficient pruning methods that reduce the number of disk accesses. Once the tree is pruned, however, the distance from the query point Q to every point in the remaining part of the tree must still be computed.
- In contrast, the M-Tree emphasizes pruning as well as reducing the number of distance computations, which it achieves by making maximal use of the precomputed distances stored in its nodes.
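Range Query
Given a query point Q and a maximum search distance r(Q), the range query range(Q, r(Q)) returns all objects Oj such that d(Oj, Q) ≤ r(Q).
(Figure: root, P(Or), Or with covering radius r(Or), and query point Q with radius r(Q))
- Or is of interest only if the region around Or with radius r(Or) intersects the query region around Q with radius r(Q).
- How can intersection be detected using precomputed distances?
- The regions intersect if the distance between Q and Or is at most the sum of the two radii: d(Or, Q) ≤ r(Q) + r(Or).
- By the triangle inequality, if |d(P(Or), Q) – d(Or, P(Or))| > r(Q) + r(Or), the regions cannot intersect, so the subtree is pruned without computing d(Or, Q). Here d(P(Or), Q) is already known from the parent level and d(Or, P(Or)) is stored in the entry.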
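The intersection test can be split into a cheap check on stored distances and an exact check. The triangle inequality gives d(Or, Q) ≥ |d(P(Or), Q) – d(Or, P(Or))|, so if that lower bound already exceeds r(Q) + r(Or) the subtree cannot contain an answer. The sketch below assumes this reading; the function and parameter names are hypothetical.

```python
def can_prune_without_distance(d_parent_q, d_entry_parent, r_entry, r_query):
    """True if stored distances alone prove the two regions cannot intersect.

    d_parent_q     : d(P(O_r), Q), computed one level up in the tree
    d_entry_parent : d(O_r, P(O_r)), stored in the routing entry
    """
    return abs(d_parent_q - d_entry_parent) > r_query + r_entry


def regions_intersect(d_entry_q, r_entry, r_query):
    """Exact test, used only when the cheap test could not prune."""
    return d_entry_q <= r_query + r_entry
```

With d(P(Or), Q) = 10, d(Or, P(Or)) = 2, r(Or) = 1, and r(Q) = 3, the lower bound 8 exceeds 4, so the subtree is skipped and d(Or, Q) is never computed.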
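Range Query: Leaf Node
(Figure: root, P(Or), leaf object Oj, and query point Q with radius r(Q))
- An object Oj in a leaf node is a solution to the range query if it lies within the query radius: d(Oj, Q) ≤ r(Q).
- The relative distance can again be used to decide whether the object can lie within the query radius: if |d(P(Oj), Q) – d(Oj, P(Oj))| > r(Q), Oj is discarded without computing d(Oj, Q).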
Algorithm for Range Queries
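A range search along these lines might look like the following Python sketch. It is an in-memory illustration under assumed names (`Entry`, `Node`, `range_search`), not the slide's original pseudocode; leaf entries are given a radius of 0 so that the same two tests serve both node types.

```python
import math
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Entry:
    obj: Any                           # O_r (routing entry) or O_j (leaf entry)
    dist_to_parent: float              # precomputed d(O, P(O))
    radius: float = 0.0                # r(O_r); stays 0 for leaf entries
    subtree: Optional["Node"] = None   # ptr(T(O_r)); None for leaf entries
    oid: Optional[int] = None          # oid(O_j); None for routing entries


@dataclass
class Node:
    is_leaf: bool
    entries: List[Entry] = field(default_factory=list)


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def range_search(node, q, r_q, d, d_parent_q=None, out=None):
    """Collect the oids of all objects within distance r_q of q.

    d_parent_q is d(P, Q) for the parent object of `node`, or None at the root.
    """
    if out is None:
        out = []
    for e in node.entries:
        # Step 1: test on stored distances only, no new distance computation.
        if d_parent_q is not None and abs(d_parent_q - e.dist_to_parent) > r_q + e.radius:
            continue
        # Step 2: compute the actual distance and test again.
        d_eq = d(e.obj, q)
        if d_eq <= r_q + e.radius:
            if node.is_leaf:
                out.append(e.oid)
            else:
                range_search(e.subtree, q, r_q, d, d_eq, out)
    return out


leaf_a = Node(True, [Entry((0.0, 0.0), 0.0, oid=1),
                     Entry((1.0, 0.0), 1.0, oid=2),
                     Entry((0.0, 2.0), 2.0, oid=3)])
leaf_b = Node(True, [Entry((10.0, 0.0), 0.0, oid=4),
                     Entry((11.0, 0.0), 1.0, oid=5)])
# Root entries have no parent object; their dist_to_parent is an unused placeholder.
root = Node(False, [Entry((0.0, 0.0), 0.0, radius=2.0, subtree=leaf_a),
                    Entry((10.0, 0.0), 0.0, radius=1.0, subtree=leaf_b)])

hits = range_search(root, (1.0, 1.0), 1.2, euclidean)
```

In this example the second subtree is rejected by the exact test at the root, and object 1 is rejected by the stored-distance test alone, so only four distances are computed for five indexed objects.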
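K Nearest Neighbors Queries
Given a query point Q and an integer k ≥ 1, the k-NN query NN(Q, k) returns the k indexed objects that have the shortest distances to Q.
(Figure: root, P(Or), Or with covering radius r(Or), query point Q, and the minimum and maximum bounds on the distance from Q to the objects in T(Or))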
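The minimum and maximum bounds in the figure can be computed from d(Or, Q) and r(Or): no object in T(Or) is closer to Q than max(d(Or, Q) – r(Or), 0) or farther than d(Or, Q) + r(Or). The sketch below shows the bounds and how they rule out a subtree for k = 1; the names and the three sample subtrees are made up for illustration, and every subtree is assumed to be non-empty.

```python
def d_min(d_or_q, r_or):
    """Lower bound on the distance from Q to any object in T(O_r)."""
    return max(d_or_q - r_or, 0.0)


def d_max(d_or_q, r_or):
    """Upper bound on the distance from Q to any object in T(O_r)."""
    return d_or_q + r_or


# (name, d(O_r, Q), r(O_r)) for three hypothetical subtrees
subtrees = [("A", 5.0, 2.0), ("B", 1.0, 3.0), ("C", 12.0, 1.0)]

# For k = 1, the nearest neighbor is no farther than the smallest upper bound.
upper = min(d_max(dq, r) for _, dq, r in subtrees)

# A subtree whose lower bound exceeds that value cannot hold the nearest neighbor.
survivors = [name for name, dq, r in subtrees if d_min(dq, r) <= upper]
```

Here subtree B guarantees an object within distance 4, so subtree C, whose closest possible object is at distance 11, never needs to be read.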
SPLIT MANAGEMENT
- The M-Tree grows in a bottom-up fashion.
- Overflow of a node N is managed by splitting N into two nodes, N and N’ (newly created).
- PARTITIONING: the entries are distributed between N and N’.
- PROMOTE: two entries are promoted as routing objects and moved to the parent level.
SPLIT MANAGEMENT
If the split node is a leaf, the covering radius of a promoted object, say Op1, is set to
r(Op1) = max{ d(Oj, Op1) | Oj ∈ N1 }
whereas if the overflow occurs in an internal node,
r(Op1) = max{ d(Or, Op1) + r(Or) | Or ∈ N1 }
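Both radius formulas translate directly into code. In this sketch the function names are hypothetical, and one-dimensional points with the absolute difference as distance are assumed purely to keep the arithmetic easy to follow.

```python
def leaf_covering_radius(promoted, objects, d):
    """r(O_p1) after a leaf split: farthest object assigned to the node."""
    return max(d(o, promoted) for o in objects)


def internal_covering_radius(promoted, routing_entries, d):
    """r(O_p1) after an internal split.

    routing_entries is an iterable of (O_r, r(O_r)) pairs assigned to the node;
    each child's own covering radius is added to its distance from O_p1.
    """
    return max(d(o_r, promoted) + r for o_r, r in routing_entries)


def abs_diff(a, b):
    return abs(a - b)


leaf_radius = leaf_covering_radius(0, [0, 1, 3], abs_diff)
internal_radius = internal_covering_radius(0, [(2, 1.5), (5, 0.5)], abs_diff)
```

The internal-node value is a safe upper bound rather than the exact farthest distance, because it is derived from the children's radii instead of from the objects they cover.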
SPLIT POLICIES
- A specific implementation of the Promote and Partition methods defines a split policy.
- An ideal split policy promotes two objects and partitions the remaining objects so that the resulting regions have
 - minimum volume
 - minimum overlap
- How is this different from SAMs?
PROMOTE: Choosing Routing Objects
- m_RAD: minimizes the sum of the two covering radii
- mM_RAD: minimizes the maximum of the two covering radii
- M_LB_DIST: maximum lower bound on distance
- RANDOM
- SAMPLING
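The two radius-based policies can be sketched as an exhaustive search over candidate pairs. This is a simplified illustration for a leaf node: it assumes that each pair is evaluated by assigning every entry to its nearer candidate, and the names `promote`, `m_rad`, and `mm_rad` are hypothetical.

```python
import itertools


def promote(entries, d, combine):
    """Return the pair of entries, and its score, minimizing combine(r1, r2)."""
    best_pair, best_score = None, None
    for p1, p2 in itertools.combinations(entries, 2):
        # Tentative partition: each entry goes to the nearer candidate.
        n1 = [o for o in entries if d(o, p1) <= d(o, p2)]
        n2 = [o for o in entries if d(o, p1) > d(o, p2)]
        r1 = max((d(o, p1) for o in n1), default=0.0)
        r2 = max((d(o, p2) for o in n2), default=0.0)
        score = combine(r1, r2)
        if best_score is None or score < best_score:
            best_pair, best_score = (p1, p2), score
    return best_pair, best_score


def m_rad(entries, d):
    return promote(entries, d, lambda r1, r2: r1 + r2)   # minimum sum of radii


def mm_rad(entries, d):
    return promote(entries, d, max)                      # minimum of the larger radius


def abs_diff(a, b):
    return abs(a - b)


points = [0, 1, 2, 3, 10]
m_rad_choice = m_rad(points, abs_diff)
mm_rad_choice = mm_rad(points, abs_diff)
```

Every candidate pair requires a full pass over the entries with fresh distance computations, which makes this kind of search far more expensive than picking the two objects at random.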
PARTITIONING: Distribution of Entries
Generalized Hyperplane (unbalanced split; why?):
- Assign each object Oj ∈ N to the nearest routing object: if d(Oj, Op1) ≤ d(Oj, Op2) then assign Oj to N1, else assign Oj to N2.
Balanced:
- Compute d(Oj, Op1) and d(Oj, Op2) for all Oj ∈ N.
- Repeat until N is empty:
 - Assign to N1 the nearest neighbor of Op1 in N and remove it from N.
 - Assign to N2 the nearest neighbor of Op2 in N and remove it from N.
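The two distribution strategies can be compared side by side in a short Python sketch. Function names are illustrative, and one-dimensional points with the absolute difference as distance are assumed for readability.

```python
def generalized_hyperplane(entries, p1, p2, d):
    """Assign each entry to the nearer promoted object; group sizes may differ."""
    n1, n2 = [], []
    for o in entries:
        (n1 if d(o, p1) <= d(o, p2) else n2).append(o)
    return n1, n2


def balanced(entries, p1, p2, d):
    """Alternately hand out the nearest remaining entry to each promoted object."""
    remaining = list(entries)
    n1, n2 = [], []
    while remaining:
        nearest = min(remaining, key=lambda o: d(o, p1))
        remaining.remove(nearest)
        n1.append(nearest)
        if remaining:
            nearest = min(remaining, key=lambda o: d(o, p2))
            remaining.remove(nearest)
            n2.append(nearest)
    return n1, n2


def abs_diff(a, b):
    return abs(a - b)


points = [0, 1, 2, 3, 10]
hyperplane_split = generalized_hyperplane(points, 0, 10, abs_diff)
balanced_split = balanced(points, 0, 10, abs_diff)
```

With promoted objects 0 and 10, the hyperplane split yields groups of sizes 4 and 1, which answers the "why?" above: nothing forces equal sizes. The balanced split evens out the sizes but pulls the point 3 into the group of 10, enlarging that group's covering radius from 0 to 7.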
Experimental Results
- Assumed constant node size
- Tested all split policies
Results
- The balanced partition method was shown to add significant overhead and to increase the I/O cost
- The fastest split policy was observed to be RANDOM and the slowest m_RAD
- Average volume covered per page (quality of tree construction)
- M_LB_DIST proved effective
Experimental Results (2)
I/O cost
Avg Volume per page
I/O cost
I/O cost for M-Tree & R*-Tree
Thanks