A Metric Cache for Similarity Search
Fabrizio Falchi, Claudio Lucchese, Salvatore Orlando, Fausto Rabitti, Raffaele Perego
Similarity Search in Databases
o Objects are "unknown"; only the distances between them are "well known"
o Metric space assumptions: identity, symmetry, triangle inequality
o Distance functions include: Minkowski distances, edit distance, Jaccard distance, ...
o Applications include: images, 3D shapes, medical data, text, DNA sequences, graphs, etc.
o Metric space indexing works better than multidimensional indexing.
[Figure: an example content-based image retrieval query]
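The three metric-space assumptions above can be checked numerically. A minimal sketch, assuming a Euclidean (Minkowski, p = 2) distance on a few sample points; the function name `euclidean` and the sample data are illustrative only:

```python
import itertools
import math

def euclidean(a, b):
    """Minkowski distance with p = 2, one of the metrics mentioned above."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]

# Identity: d(x, x) = 0
assert all(euclidean(p, p) == 0.0 for p in points)
# Symmetry: d(x, y) = d(y, x)
assert all(euclidean(a, b) == euclidean(b, a)
           for a, b in itertools.permutations(points, 2))
# Triangle inequality: d(x, z) <= d(x, y) + d(y, z)
assert all(euclidean(x, z) <= euclidean(x, y) + euclidean(y, z)
           for x, y, z in itertools.permutations(points, 3))
```

The triangle inequality is the property the metric cache will rely on later for its safe-range guarantee.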
Distributed Similarity Search System
[Figure: a parallel & distributed CBIR system — the front-end routes top-k queries to n units, each holding an index of MM objects]
Search cost is close to O(|DB|)!
Distributed Similarity Search System
[Figure: the same system with a Metric Cache added at the front-end, intercepting top-k queries before they reach the index units]
What's different in a Metric Cache?
What's different in a Metric Cache?
o The cache stores result objects, not only result pointers (e.g., documents vs. document ids)
o The cache is a peculiar sample of the whole dataset: the set of objects most recently seen by the users (= the most interesting ones!?)
o Claim: an interesting object may be used to answer a query approximately, if it is sufficiently similar to the query.
What's different in a Metric Cache? ...and...
o Queries may be approximate!
o [Zobel et al., CIVR 07]: at least 8% of the images on the web are near-duplicates, mostly due to cropping, contrast adjustment, etc.
o Requirement: the system must be robust w.r.t. near-duplicate queries.
What's different in a Metric Cache?
[Figure: the front-end forwards q1 and q3 to the parallel & distributed CBIR system for exact answers, while the Metric Cache answers q2 approximately]
Two algorithms: RCache vs. QCache

RCache(q, k):
  if q ∈ Cache: return R(q)            // cache hit
  R = Cache.knn(q, k)
  if quality(R) ≥ threshold: return R  // approximate hit
  R = DB.knn(q, k)                     // cache miss
  Cache.add(q, R)
  return R

In case of an approximate hit, the cached query q', being the closest to q, is marked as used. The least recently used query and its results are evicted.
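The RCache policy above can be sketched in Python. This is a minimal sketch under assumptions, not the paper's implementation: `db_knn` stands for the exact k-NN search over the distributed index, and the `quality` measure is a simple coverage placeholder (the paper's quality function is richer):

```python
from collections import OrderedDict

class RCache:
    """Sketch of RCache: cache (query, results) pairs and try to answer a
    new query by a k-NN search over *all* cached result objects."""

    def __init__(self, db_knn, dist, capacity, threshold, k):
        self.db_knn = db_knn        # exact k-NN search over the full database
        self.dist = dist            # metric distance function
        self.capacity = capacity    # max number of cached queries
        self.threshold = threshold  # minimum acceptable quality
        self.k = k
        self.store = OrderedDict()  # query -> cached result list (LRU order)

    def quality(self, approx):
        # Placeholder quality measure (assumption): fraction of the k
        # requested results that the cache could provide.
        return len(approx) / self.k

    def knn(self, q):
        if q in self.store:                       # exact cache hit
            self.store.move_to_end(q)             # mark as recently used
            return self.store[q]
        # Search among ALL cached result objects: O(|Cache|) distances.
        pool = [o for res in self.store.values() for o in res]
        approx = sorted(pool, key=lambda o: self.dist(q, o))[:self.k]
        if approx and self.quality(approx) >= self.threshold:
            return approx                         # approximate hit
        exact = self.db_knn(q, self.k)            # cache miss: query the index
        if len(self.store) >= self.capacity:
            self.store.popitem(last=False)        # evict least recently used
        self.store[q] = exact
        return exact
```

With a toy 1-D database, a first query for 10 is a miss and gets the exact answer; a later query for 11 can be answered approximately from the cached results alone, never touching the index.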
Costs of RCache vs. QCache

RCache(q, k):
o Hash table access: O(1)
o Search among all the cached result objects: O(|Cache|)
o Search among all the objects in the database: O(|DB|)

|Cache| is the number of cached objects, and |DB| is the size of the database.
Two algorithms: RCache vs. QCache

RCache(q, k):
  if q ∈ Cache: return R(q)            // cache hit
  R = Cache.knn(q, k)
  if quality(R) ≥ threshold: return R  // approximate hit
  R = DB.knn(q, k)                     // cache miss
  Cache.add(q, R)
  return R

QCache(q, k):
  if q ∈ Cache: return R(q)            // cache hit
  Q* = Cache.knn(q, ·)                 // nearest cached queries
  R* = {results of the queries in Q*}
  R = R*.knn(q, k)
  if quality(R) ≥ threshold: return R  // approximate hit
  R = DB.knn(q, k)                     // cache miss
  Cache.add(q, R)
  return R

In case of an approximate hit, the cached query q', being the closest to q, is marked as used. The least recently used query and its results are evicted.
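QCache differs from RCache only in the lookup step: it first finds the cached queries nearest to q, then searches among their results only. A minimal sketch under the same assumptions as before (`db_knn`, placeholder quality, and a hypothetical `n_queries` parameter for the number of nearest cached queries to consider):

```python
from collections import OrderedDict

class QCache:
    """Sketch of QCache: first retrieve the cached queries nearest to q,
    then run the k-NN search over *their* pooled results only."""

    def __init__(self, db_knn, dist, capacity, threshold, k, n_queries):
        self.db_knn = db_knn        # exact k-NN search over the full database
        self.dist = dist            # metric distance function
        self.capacity = capacity    # max number of cached queries
        self.threshold = threshold  # minimum acceptable quality
        self.k = k
        self.n_queries = n_queries  # how many nearest cached queries to use
        self.store = OrderedDict()  # query -> cached result list (LRU order)

    def knn(self, q):
        if q in self.store:                       # exact cache hit
            self.store.move_to_end(q)
            return self.store[q]
        # Step 1: nearest cached QUERIES -- O(|Cache|/k) distances,
        # assuming k results are stored per query.
        q_star = sorted(self.store, key=lambda p: self.dist(q, p))[:self.n_queries]
        # Step 2: k-NN over their pooled results only.
        pool = {o for p in q_star for o in self.store[p]}
        approx = sorted(pool, key=lambda o: self.dist(q, o))[:self.k]
        quality = len(approx) / self.k            # placeholder quality (assumption)
        if quality >= self.threshold:
            return approx                         # approximate hit
        exact = self.db_knn(q, self.k)            # cache miss
        if len(self.store) >= self.capacity:
            self.store.popitem(last=False)        # evict least recently used
        self.store[q] = exact
        return exact
```

The design trade-off: QCache computes far fewer distances per lookup than RCache, at the price of possibly missing good result objects that were cached under a query far from q.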
Costs of RCache vs. QCache

RCache(q, k):
o Hash table access: O(1)
o Search among all the cached result objects: O(|Cache|)
o Search among all the objects in the database: O(|DB|)

QCache(q, k):
o Search among the cached query objects: O(|Cache|/k), supposing k results are stored for each query.

|Cache| is the number of cached objects, and |DB| is the size of the database.
Approximation & Guarantees
o Let the safe range be: s = r* − d(q, q*), where q* is a cached query and r* the distance from q* to its farthest cached result.
o The cached k* objects within distance s of q are the true top-k* of the new query.
o Every cached query may provide some additional guarantee.
[Figure: the ball of radius r* around q* and the safe range s around the new query q]
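The safe-range guarantee above follows from the triangle inequality: any database object within distance s of q must lie inside q*'s ball of radius r*, and everything in that ball is already cached. A small sketch on a toy 1-D metric; the function name and the sample values are hypothetical:

```python
def safe_results(q, q_star, cached_results, dist):
    """Cached results of q_star that are *guaranteed* true nearest
    neighbours of the new query q, via the safe range s = r* - d(q, q*)."""
    # r* = radius of q_star's cached k-NN ball (distance to farthest result).
    r_star = max(dist(q_star, o) for o in cached_results)
    s = r_star - dist(q, q_star)
    # Triangle inequality: d(q_star, o) <= d(q_star, q) + d(q, o) <= r*
    # whenever d(q, o) <= s, so no uncached object can fall in this range.
    return [o for o in cached_results if dist(q, o) <= s]

dist = lambda a, b: abs(a - b)   # toy 1-D metric (assumption)
cached = [10, 9, 11, 8, 13]      # hypothetical cached top-5 of q* = 10
print(safe_results(12, 10, cached, dist))  # -> [11, 13]
```

Here r* = 3 and d(q, q*) = 2, so only the two cached objects within distance s = 1 of q = 12 come with a guarantee; the rest of the cached answer is merely approximate.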
Experimental setup
o A collection of 1,000,000 images downloaded from Flickr; we extracted 5 MPEG-7 descriptors, which were used to measure similarity.
o A query log of 100,000 images: a random sample with replacement, weighted by image views; 20% training, 80% testing.
o k = 20, · = 10
o The quality function is based on the safe range.
Hit ratio
Throughput
Approximation quality I
Approximation quality II
What we also did... o Take queries from a different collection. o Inject duplicates in the query log. o Use an expectation of RES as quality measure.
Approximation quality III o Bug: RES=0.07 o Portrait: RES=0.09 o Sunset: RES=0.12
Acknowledgements o The European Project SAPIR o P. Zezula and his colleagues for the M-Tree implementation o The dataset used is available at cophir.isti.cnr.it
Thank you.
Backup slides