Published by Sonny Pedley. Modified over 10 years ago.
1
Clustered Pivot Tables for I/O-optimized Similarity Search
Juraj Moško, Jakub Lokoč, Tomáš Skopal
Department of Software Engineering, Faculty of Mathematics and Physics, Charles University in Prague
SISAP 2011, Lipari
2
Presentation outline
  Similarity search in metric spaces
  Pivot tables
  Clustered pivot tables
    Static variant
    Dynamic variant
  Experiments
3
Similarity search
  Suitable for unstructured data; the query object is often not in the database
  Similarity is often modeled by a metric distance
  Expensive distance functions: EMD, SQFD, DTW, …
  Metric indexing
    Based on lower-bounding
    If |d(p, q) - d(p, o)| > r, object o is filtered out (p = pivot, q = query object, r = query radius)
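The lower-bounding rule follows from the triangle inequality: |d(p, q) - d(p, o)| <= d(q, o), so an object whose lower bound already exceeds r cannot be within distance r of the query. A minimal sketch of that test follows; the one-dimensional metric and the sample values are assumptions chosen only to keep the example short.

```python
def lower_bound(d_pq: float, d_po: float) -> float:
    """Triangle-inequality lower bound on d(q, o) obtained through one pivot p."""
    return abs(d_pq - d_po)


def can_filter(d_pq: float, d_po: float, r: float) -> bool:
    """True if object o can be discarded without computing d(q, o)."""
    return lower_bound(d_pq, d_po) > r


# Toy metric on the real line (an assumption for illustration only).
def d(x: float, y: float) -> float:
    return abs(x - y)


p, q, r = 0.0, 10.0, 2.0            # pivot, query object, query radius
objects = [1.0, 9.0, 11.5, 20.0]

# Objects that survive filtering still need the real distance computation.
survivors = [o for o in objects if not can_filter(d(p, q), d(p, o), r)]
```

Filtering is safe but not complete: a surviving object may still lie outside the query ball, which is why a refinement step with the original distance is needed.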
4
Pivot tables
  Simple yet efficient main-memory metric index
  Given k static pivots P_i and a database S of n objects O_j, the pivot table stores all distances d(P_i, O_j) in a k × n matrix
  A pivot table consists of two structures: the distance matrix and the data file
  Cheap filtering of non-relevant objects (lower-bounding)
  Non-filtered objects are refined by the original, expensive distance function
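The following sketch shows how a pivot table could answer a range query: build the k × n matrix once, filter with the best lower bound over all pivots, then refine the survivors. The function names, the Euclidean distance and the sample points are assumptions for illustration, not the authors' implementation.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]
Dist = Callable[[Point, Point], float]


def euclidean(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def build_pivot_table(pivots: Sequence[Point], data: Sequence[Point],
                      dist: Dist) -> List[List[float]]:
    # k x n matrix: row i holds d(P_i, O_j) for every object O_j.
    return [[dist(p, o) for o in data] for p in pivots]


def range_query(q: Point, r: float, pivots: Sequence[Point],
                data: Sequence[Point], table: List[List[float]],
                dist: Dist) -> Tuple[List[int], int]:
    q_to_pivots = [dist(q, p) for p in pivots]
    result, computed = [], 0
    for j, o in enumerate(data):
        # Tightest lower bound on d(q, O_j) over all pivots.
        lb = max(abs(qp - row[j]) for qp, row in zip(q_to_pivots, table))
        if lb > r:
            continue                  # filtered without touching the object
        computed += 1                 # refinement: one expensive distance call
        if dist(q, o) <= r:
            result.append(j)
    return result, computed


pivots = [(0.0, 0.0), (10.0, 0.0)]
data = [(1.0, 1.0), (5.0, 5.0), (9.0, 1.0), (2.0, 0.0)]
table = build_pivot_table(pivots, data, euclidean)
hits, computed = range_query((1.0, 0.0), 1.5, pivots, data, table, euclidean)
```

In this toy run two of the four objects are discarded by the matrix alone, so only two real distance computations are performed.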
5
Clustered pivot tables
  What if the pivot table does not fit into main memory?
  Solution 1: just slice the data file
    + simple to construct
    - sequential scan => high I/O cost
  Solution 2: reorganize and slice the data file
    + similar objects in one page (page = cluster) => higher probability that all objects in a page are filtered out => lower I/O cost
    - metric clustering is expensive
6
Metric clustering? M-tree!
  Dynamic, persistent, balanced structure
  A leaf node represents a cluster of similar objects
  Many construction strategies addressing the quality of the M-tree hierarchy, with complexity below O(n^2)
    Single/Multi/Hybrid-way leaf selection
    Slim-down algorithm
    Reinsertions
7
Static CPT
  Data file = objects serialized from M-tree leaves
  Classic pivot table over the reorganized input
  Fixed page size in a paged data file
  Preserve the M-tree?
    Future re-indexing
    Query processing
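One way to read the static variant is sketched below: the objects are written out in leaf order and the resulting sequence is cut into fixed-size pages. The `leaves` input stands in for the leaf contents of an already built M-tree, and `serialize_leaves` and `page_capacity` are hypothetical names; how the original implementation aligns leaves with pages is not stated on the slide.

```python
from typing import List, Tuple


def serialize_leaves(leaves: List[List[int]],
                     page_capacity: int) -> Tuple[List[int], List[List[int]]]:
    """Flatten leaf clusters into one object order, then slice into pages.

    leaves: object ids grouped by M-tree leaf (assumed to be given).
    page_capacity: number of objects per fixed-size page (assumed parameter).
    """
    # Objects of the same leaf stay adjacent in the data file.
    order = [obj for leaf in leaves for obj in leaf]
    # Fixed-size slicing; a leaf may straddle a page boundary.
    pages = [order[i:i + page_capacity]
             for i in range(0, len(order), page_capacity)]
    return order, pages


leaves = [[7, 2], [5, 9, 1], [4]]
order, pages = serialize_leaves(leaves, page_capacity=2)
```

The columns of the distance matrix would be written in the same `order`, so that column j and position j in the data file refer to the same object.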
8
Dynamic CPT
  Data file = set of M-tree leaves
  Distance matrix connected to the M-tree leaves
  Internal fragmentation
    M-tree leaves contain different numbers of data objects, so utilization is not 100%
  Dynamic operations do not degrade the created clusters
9
CPT - Querying
  Filtering based on lower-bounding
  If all data objects from one page are filtered out, that page is not loaded from the data file into memory => I/O optimization
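The page-skipping idea can be sketched as a range query that consults the in-memory distance matrix first and reads a page only if at least one of its objects survives filtering. The names `cpt_range_query` and `load_page`, the one-dimensional metric and the sample data are assumptions for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def cpt_range_query(q: float, r: float, pivots: Sequence[float],
                    pages: List[List[int]], table: List[List[float]],
                    dist: Callable[[float, float], float],
                    load_page: Callable[[int], Dict[int, float]]
                    ) -> Tuple[List[int], int]:
    """pages[p] lists the object ids stored in page p; table[i][j] = d(P_i, O_j)."""
    q_to_pivots = [dist(q, p) for p in pivots]
    result, pages_loaded = [], 0
    for page_id, ids in enumerate(pages):
        survivors = [
            j for j in ids
            if max(abs(qp - row[j]) for qp, row in zip(q_to_pivots, table)) <= r
        ]
        if not survivors:
            continue                      # whole page filtered: no I/O at all
        objects = load_page(page_id)      # simulated disk read
        pages_loaded += 1
        result.extend(j for j in survivors if dist(q, objects[j]) <= r)
    return result, pages_loaded


def dist(a: float, b: float) -> float:
    return abs(a - b)


# Three clusters on the real line, one cluster per page (toy data).
data = [1.0, 2.0, 3.0, 50.0, 51.0, 52.0, 97.0, 98.0, 99.0]
pages = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
pivots = [0.0, 100.0]
table = [[dist(p, o) for o in data] for p in pivots]


def load_page(page_id: int) -> Dict[int, float]:
    return {j: data[j] for j in pages[page_id]}


hits, pages_loaded = cpt_range_query(50.5, 1.0, pivots, pages, table, dist, load_page)
```

Because each page holds one cluster, a query that falls near the middle cluster reads a single page out of three in this toy setting.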
10
CPT - Querying problems
  Problem 1: the LAESA kNN algorithm sorts DB objects according to their lower bound to the query object, which is not optimal for I/O cost
  Solution: CPT does not sort objects => objects are processed sequentially
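A sequential kNN pass of this kind might look like the sketch below: pages are visited in storage order, the dynamic radius is the distance to the current k-th neighbor, and a page is read only if some object in it has a lower bound within that radius. All names and the toy data are assumptions; the in-memory `data` list stands in for the paged data file.

```python
import heapq
import math
from typing import Callable, List, Sequence, Tuple


def knn_sequential(q: float, k: int, pivots: Sequence[float],
                   pages: List[List[int]], table: List[List[float]],
                   dist: Callable[[float, float], float],
                   data: Sequence[float]) -> Tuple[List[Tuple[float, int]], int]:
    q_to_pivots = [dist(q, p) for p in pivots]

    def lb(j: int) -> float:
        return max(abs(qp - row[j]) for qp, row in zip(q_to_pivots, table))

    heap: List[Tuple[float, int]] = []    # max-heap via negated distances
    radius = math.inf                      # dynamic query radius
    pages_loaded = 0
    for ids in pages:                      # storage order, no sorting by bound
        if all(lb(j) > radius for j in ids):
            continue                       # page skipped, no I/O
        pages_loaded += 1
        for j in ids:
            if lb(j) > radius:             # radius may have shrunk in this page
                continue
            dj = dist(q, data[j])
            if len(heap) < k:
                heapq.heappush(heap, (-dj, j))
            elif dj < -heap[0][0]:
                heapq.heapreplace(heap, (-dj, j))
            if len(heap) == k:
                radius = -heap[0][0]
    return sorted((-nd, j) for nd, j in heap), pages_loaded


def dist(a: float, b: float) -> float:
    return abs(a - b)


data = [1.0, 2.0, 3.0, 50.0, 51.0, 52.0, 97.0, 98.0, 99.0]
pages = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
pivots = [0.0, 100.0]
table = [[dist(p, o) for o in data] for p in pivots]
neighbors, pages_loaded = knn_sequential(2.25, 2, pivots, pages, table, dist, data)
```

Here the query lies inside the first stored cluster, so the radius tightens immediately and the remaining pages are skipped; a query near a later cluster would start with a much looser radius.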
11
CPT - Querying problems
  Problem 2: in CPT the dynamic radius decreases more slowly during kNN processing
  Solution: the first batch of objects is not clustered
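The effect can be illustrated with a toy comparison of the radius obtained after the first batch of objects, once with a fully clustered layout and once with a mixed first batch. This is only one reading of the slide; the helper name, the data and the choice of the mixed batch are assumptions.

```python
from typing import Sequence


def kth_distance(q: float, k: int, values: Sequence[float]) -> float:
    """Dynamic kNN radius after processing `values`: distance to the k-th nearest."""
    return sorted(abs(q - v) for v in values)[k - 1]


q, k = 98.5, 1

# Fully clustered layout: the first page holds one tight cluster far from q.
clustered_first_page = [1.0, 2.0, 3.0]
radius_clustered = kth_distance(q, k, clustered_first_page)

# Mixed first batch: one object from each cluster (assumed sample).
mixed_first_batch = [1.0, 50.0, 97.0]
radius_mixed = kth_distance(q, k, mixed_first_batch)

# A page can be skipped only if every object in it lies beyond the radius.
middle_page = [50.0, 51.0, 52.0]
skip_with_clustered = all(abs(q - v) > radius_clustered for v in middle_page)
skip_with_mixed = all(abs(q - v) > radius_mixed for v in middle_page)
```

With the clustered first page the radius stays so large that the middle page cannot be skipped, whereas the mixed batch yields a tight radius that rules it out.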
13
Experiments (1)
  2 real datasets
    Subset of CoPhIR, subset of Corel
  2 synthetic datasets
    Cloud, PolygonSet
  Several M-tree variants were considered
    Single/Multi-way leaf selection
    Reinsertions
  Measured I/O cost
    CPT vs. PT vs. M-tree
14
Experiments (2)
15
Experiments (3)
16
Conclusion
  We have designed an I/O-optimized method for persistent pivot tables
  Future work
    Thorough experiments on SSDs
    Use of other metric clustering techniques
17
Thank you