Christian Böhm, Bernhard Braunmüller, Florian Krebs, and Hans-Peter Kriegel, University of Munich
Epsilon Grid Order: An Algorithm for the Similarity Join on Massive High-Dimensional Data
Feature-Based Similarity
Simple Similarity Queries. Specify a query object and either find all similar objects (range query) or find the k most similar objects (nearest neighbor query).
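The two query types can be illustrated with a minimal linear-scan sketch. It assumes that objects are feature vectors compared by Euclidean distance; the names `dist`, `range_query`, and `knn_query` are hypothetical and not taken from the slides.

```python
import heapq
import math

def dist(p, q):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def range_query(data, query, eps):
    """Return all objects whose distance to the query is at most eps."""
    return [p for p in data if dist(p, query) <= eps]

def knn_query(data, query, k):
    """Return the k objects closest to the query, nearest first."""
    return heapq.nsmallest(k, data, key=lambda p: dist(p, query))

# Hypothetical 2-d feature vectors.
data = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
in_range = range_query(data, (0.0, 0.0), 1.5)
nearest = knn_query(data, (0.0, 0.0), 2)
```

Both functions scan the whole data set once per query, which is the baseline that index structures and join algorithms try to improve on.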
Join Applications: Catalogue Matching. E.g. matching two astronomic catalogues R and S.
Join Applications: Clustering. Clustering (e.g. DBSCAN) can be based on a similarity self-join.
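As a point of reference, the similarity self-join itself can be written as a nested loop over all pairs. This is only a baseline sketch of what the join computes, assuming Euclidean distance and a threshold `eps`; it is not the algorithm presented in this talk, and the function name is hypothetical.

```python
import math
from itertools import combinations

def similarity_self_join(points, eps):
    """Return index pairs (i, j) with i < j whose points lie within eps."""
    result = []
    for (i, p), (j, q) in combinations(enumerate(points), 2):
        if math.dist(p, q) <= eps:
            result.append((i, j))
    return result

# Hypothetical data: two tight groups of two points each.
points = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0), (5.2, 5.1)]
pairs = similarity_self_join(points, 1.0)
```

The nested loop needs a quadratic number of distance computations, which motivates grid-based and sort-based join strategies.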
Grid partitioning. General idea: grid approximation where the grid line distance = ε. Similar idea in the ε-kdB-tree [Shim, Srikant, Agrawal: High-dimensional Similarity Joins, ICDE 1997]. Disadvantage of any grid approach: the number of neighboring grid cells is 3^d - 1.
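The neighbor count can be checked with a short sketch. It assumes that a point is mapped to its grid cell by dividing each coordinate by ε and rounding down; `grid_cell` and `neighbor_cells` are hypothetical helper names.

```python
import math
from itertools import product

def grid_cell(point, eps):
    """Map a point to the integer coordinates of its grid cell."""
    return tuple(math.floor(x / eps) for x in point)

def neighbor_cells(cell):
    """Enumerate all cells adjacent to `cell`, excluding the cell itself."""
    offsets = product((-1, 0, 1), repeat=len(cell))
    return [tuple(c + o for c, o in zip(cell, off))
            for off in offsets if any(off)]

cell = grid_cell((0.25, 0.71, 0.05), 0.1)
neighbors = neighbor_cells(cell)
```

For d = 3 this yields 26 neighbors; for d = 16 the same formula gives 3^16 - 1 = 43,046,720 cells, which is why enumerating neighboring cells does not scale to high dimensions.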
Scalability of the ε-kdB-tree. Assumption: 2 adjacent ε-stripes fit in main memory. Unrealistic for large data sets which are ... clustered, skewed, and high-dimensional.
Epsilon Grid Order
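The slide text does not reproduce the definition, so the following sketch rests on an assumption: the order is taken to be the lexicographic order on grid cell coordinates, where p precedes q if, in the first dimension in which their cells differ, p lies in the lower cell. The function name `ego_less` is hypothetical.

```python
import math

def ego_less(p, q, eps):
    """True if p precedes q in the (assumed) epsilon grid order."""
    for a, b in zip(p, q):
        cell_a, cell_b = math.floor(a / eps), math.floor(b / eps)
        if cell_a != cell_b:
            # First dimension with different cells decides the order.
            return cell_a < cell_b
    # Same grid cell in every dimension: neither point precedes the other.
    return False

p, q = (0.2, 3.9), (0.7, 4.1)
p_before_q = ego_less(p, q, 1.0)
q_before_p = ego_less(q, p, 1.0)
```

Note that the comparison never needs the grid to be materialized; it only looks at the two points and ε.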
ε-Grid-Order Is a Total Strict Order. Properties: irreflexivity, transitivity, asymmetry. The ε-grid-order can be used in any sorting algorithm.
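Because the order only compares points pairwise, a standard sort routine is sufficient. The sketch below assumes the order is lexicographic on grid cell coordinates and expresses it as a sort key; `ego_key` is a hypothetical helper.

```python
import math

def ego_key(point, eps):
    """Grid cell coordinates of a point; tuples compare lexicographically."""
    return tuple(math.floor(x / eps) for x in point)

eps = 0.5
points = [(1.2, 0.1), (0.1, 0.9), (0.4, 0.2), (0.6, 0.3), (0.1, 0.6)]

# Python's sort is stable, so points in the same cell keep their input order.
ordered = sorted(points, key=lambda p: ego_key(p, eps))
keys = [ego_key(p, eps) for p in ordered]
```

For data that does not fit in memory, the same key could drive an external merge sort; that step is not shown here.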
ε-Interval. Coarse approximation of the join mates of a point; used for I/O processing.
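A minimal sketch of the idea, under the assumption that the ε-interval of a point p spans from p - [ε, ..., ε] to p + [ε, ..., ε] in ε-grid order, so that all join mates of p fall into one contiguous range of the sorted file. The helper names are hypothetical.

```python
import math
from bisect import bisect_left, bisect_right

def ego_key(point, eps):
    """Grid cell coordinates of a point; tuples compare lexicographically."""
    return tuple(math.floor(x / eps) for x in point)

def eps_interval(sorted_keys, point, eps):
    """Return bounds [lo, hi) of the sorted file that may hold join mates."""
    key = ego_key(point, eps)
    lower = tuple(c - 1 for c in key)  # cell of point - [eps, ..., eps]
    upper = tuple(c + 1 for c in key)  # cell of point + [eps, ..., eps]
    return bisect_left(sorted_keys, lower), bisect_right(sorted_keys, upper)

eps = 1.0
points = sorted(
    [(0.5, 0.5), (1.5, 7.5), (2.5, 2.5), (3.5, 3.5), (6.5, 0.5)],
    key=lambda p: ego_key(p, eps),
)
keys = [ego_key(p, eps) for p in points]
lo, hi = eps_interval(keys, (2.5, 2.5), eps)
candidates = points[lo:hi]
```

The candidate range contains (1.5, 7.5), which is far from the query point. This shows why the interval is only a coarse approximation: it bounds which part of the file must be read, while the exact distance test still has to be applied afterwards.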
I/O Processing for the Self Join. Decompose the sorted file into I/O units.
CPU Processing. I/O units are further decomposed before joining. Simple divide-and-conquer; no further sorting. Decomposition: maximize active dimensions.
CPU Processing. Point distance computations: order of dimensions. Dimension categories: neighboring inactive dimensions, unspecified dimensions, active dimension, aligned inactive dimensions.
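Why the order of dimensions matters can be seen in a distance test that stops as soon as the partial sum exceeds ε². The sketch below is an illustration under stated assumptions: it uses squared Euclidean distance and takes the dimension order as a parameter `dim_order`; which category of dimension should come first is not reproduced from the slide, and the function name is hypothetical.

```python
def within_eps(p, q, eps, dim_order=None):
    """Test dist(p, q) <= eps, aborting once the partial sum exceeds eps^2.

    Returns (is_within, number_of_dimensions_examined).
    """
    order = range(len(p)) if dim_order is None else dim_order
    limit = eps * eps
    total = 0.0
    examined = 0
    for i in order:
        diff = p[i] - q[i]
        total += diff * diff
        examined += 1
        if total > limit:
            return False, examined  # early termination
    return True, examined

p, q = (0.0, 0.0, 0.0), (0.0, 0.0, 5.0)
default_order = within_eps(p, q, 1.0)
best_first = within_eps(p, q, 1.0, dim_order=(2, 0, 1))
```

With the default order, all three dimensions are examined before the pair is rejected; when the discriminating dimension is tested first, one dimension is enough.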
Experimental Results. 8-dimensional uniformly distributed vectors.
Experimental Results (2). 16-d feature vectors from a CAD application.
Conclusions. Summary: high potential for performance gains of the similarity join by page capacity optimization; necessary to separately optimize I/O and CPU. Future research potential: similarity join for metric index structures; approximate similarity join; parallel similarity join algorithms.