The many facets of approximate similarity search
Marco Patella and Paolo Ciaccia, DEIS, University of Bologna, Italy

Roadmap
– Why? motivation for approximate search
– How? a classification schema
– How much? optimality in the context of approximate search
– How good? assessing the quality of results

What is approximate similarity search? Well, it's similarity search… …but with approximation! We try to speed up query resolution by accepting an error in the result. The user is offered a quality/time trade-off.

When is approximating a good idea? The user's perception of similarity differs from the one implemented by the system ("Give me the picture of a bull…")

When is approximating a good idea? In the early stages of an iterative search, the user may want a quick look at the data ("Is there any image of a bull in this collection?")

When is approximating a good idea? The user might be satisfied with a "good enough" result ("I need refueling… Gimme a gas station within 3 miles!* QUICK!") *= 800 taxi drivers' meters (mt)

What are you talking about? k-NN queries
– cost:
  - number of computed distances
  - number of accessed nodes (for disk-based techniques)
– quality (wrt the exact result):
  - distance to the query object
  - same ordering
  - more on this later…
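To make the cost measure concrete, here is a minimal linear-scan k-NN that counts the number of computed distances; the function name and interface are illustrative sketches, not from the talk.

```python
import heapq

def knn_linear_scan(query, objects, k, dist):
    """Exact k-NN by linear scan; returns (result, cost), where cost is
    the number of computed distances (the cost measure of the talk)."""
    computed = 0
    heap = []  # max-heap of the k best seen so far, as (-distance, object)
    for obj in objects:
        d = dist(query, obj)
        computed += 1
        heapq.heappush(heap, (-d, obj))
        if len(heap) > k:
            heapq.heappop(heap)  # drop the current k+1-th farthest
    result = sorted((-nd, obj) for nd, obj in heap)  # (distance, object) pairs
    return result, computed
```

A linear scan always pays n distance computations; the index-based techniques below try to answer with far fewer.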

A classification schema for approximate techniques
Useful to compare existing (and new) approaches:
– a plethora of approximate methods have been proposed over the years
– usually, each technique is not put "into context"
– highlights similarities between approaches
– uncovers limitations in the applicability of some techniques

The many (4!) facets of approximate similarity search
Independent coordinates:
– data type
– approximation type
– quality guarantees
– user interaction

Coord. I: Data type
In increasing order of generality:
– vector spaces, L_p (Minkowski) distance
  - e.g., Manhattan distance, Euclidean distance
– vector spaces, any distance
  - correlation between coordinates is allowed, e.g., quadratic forms
– metric spaces
  - only the triangle inequality is required
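The first two data types can be sketched in a few lines; both function names are hypothetical. The quadratic form generalizes L_2 by weighting cross-coordinate terms through a matrix A (A = identity recovers the Euclidean distance).

```python
def minkowski(x, y, p):
    """L_p (Minkowski) distance: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def quadratic_form(x, y, A):
    """d(x,y) = sqrt((x-y)^T A (x-y)); A encodes correlation between
    coordinates (A must be positive semi-definite for a valid distance)."""
    diff = [a - b for a, b in zip(x, y)]
    n = len(diff)
    return sum(diff[i] * A[i][j] * diff[j]
               for i in range(n) for j in range(n)) ** 0.5
```

Metric spaces are more general still: no coordinates at all, only a black-box distance satisfying the triangle inequality.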

Coord. II: Approximation type
How approximate techniques are able to reduce costs for similarity searches:
– changing space: solving the exact problem in an "easier" space
– reducing comparisons:
  - by aggressive pruning: avoid visiting regions of the space that are unlikely to (but still may) contain qualifying objects
  - by early stopping: stopping the search before the correctness of the result can be proved
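The talk does not name a specific space-changing technique here; one common instance is random projection, which maps vectors to a lower-dimensional space where distances are only approximately preserved but the exact search is cheaper. The sketch below (hypothetical helper name) uses a Gaussian projection matrix.

```python
import math
import random

def random_projection(vectors, out_dim, seed=0):
    """One way of 'changing space': project vectors to out_dim dimensions
    with a random Gaussian matrix, scaled so L2 norms are preserved in
    expectation; the exact search in the small space approximates the
    original one."""
    rng = random.Random(seed)
    in_dim = len(vectors[0])
    P = [[rng.gauss(0.0, 1.0) / math.sqrt(out_dim) for _ in range(in_dim)]
         for _ in range(out_dim)]
    return [[sum(p_row[i] * v[i] for i in range(in_dim)) for p_row in P]
            for v in vectors]
```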

Coord. III: Quality guarantees
Can an approximate technique guarantee that its errors stay below a given value?
– no guarantee: heuristic conditions to approximate the search
– deterministic guarantees: deterministic bounds (from above) on the error
– probabilistic guarantees:
  - parametric: the data follow a certain distribution; only a few parameters are unknown and need to be estimated
  - non-parametric: no assumption is made on the distribution of objects; such information has to be estimated and stored, e.g., the distribution of distances in a histogram
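The non-parametric case can be sketched by estimating the distance distribution from sampled object pairs. Here the estimate is kept as a sorted sample rather than a binned histogram, purely for brevity; the function name is hypothetical.

```python
import bisect
import random

def empirical_distance_cdf(objects, dist, n_pairs=1000, seed=0):
    """Non-parametric estimate of the distance distribution
    F(r) = P(d(o1, o2) <= r), built from random object pairs with no
    assumption on how the objects are distributed."""
    rng = random.Random(seed)
    sample = sorted(dist(rng.choice(objects), rng.choice(objects))
                    for _ in range(n_pairs))
    def F(r):
        return bisect.bisect_right(sample, r) / len(sample)
    return F
```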

Coord. IV: User interaction
Possibility given to the user to specify, at query time, the parameters for the search:
– static: the user cannot freely choose the parameters for query approximation (e.g., the maximum error)
– interactive: not bound to a specific set of parameters; can be used interactively by varying the parameters at query time

Some examples… Radius shrinking
– like exact search, but the search radius (the distance to the current NN) is reduced by a factor ε
– the (relative) error on distance is always ≤ ε
(figure: a tree node, the query q, the current k-NN, and the shrunken radius)
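A minimal 1-NN sketch of radius shrinking, assuming the index is a flat list of ball-shaped nodes (center, radius, objects) as a stand-in for a real tree; all names are illustrative. Pruning with the shrunken radius r/(1+ε) instead of r is what keeps the relative error within ε.

```python
import heapq

def nn_radius_shrinking(query, nodes, dist, eps):
    """Approximate 1-NN: visit ball nodes in MinDist order, but prune a
    node whenever its MinDist exceeds the shrunken radius
    best_d / (1 + eps); the reported distance is at most (1 + eps)
    times the true NN distance, i.e. relative error <= eps."""
    def mindist(node):
        center, radius, _ = node
        return max(dist(query, center) - radius, 0.0)

    heap = [(mindist(n), i) for i, n in enumerate(nodes)]
    heapq.heapify(heap)
    best, best_d = None, float("inf")
    while heap:
        md, i = heapq.heappop(heap)
        if md > best_d / (1.0 + eps):  # shrunken search radius
            break  # no remaining node can beat the shrunken radius
        for obj in nodes[i][2]:
            d = dist(query, obj)
            if d < best_d:
                best, best_d = obj, d
    return best, best_d
```

With eps = 0 the pruning condition reduces to the exact one.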

Radius shrinking is:
– Data type: VS-L_p / VS / MS
– Approx.: CS / RC-AP / RC-ES
– Quality: NG / DG / PG-par / PG-npar
– Interaction: SA / IA

PAC queries
Given parameters δ and ε:
– estimate the distance of the 1-NN (using the distance distribution)
– find a search radius r so that the probability of finding a 1-NN with distance ≤ r is ≤ δ
– use radius shrinking with a factor ε
– stop when an object is found at a distance ≤ r
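The radius-selection step can be sketched as follows, assuming the 1-NN distance distribution G is available as tabulated values (e.g., from a histogram, as in the non-parametric case); the function name is hypothetical.

```python
def pac_stopping_radius(g, radii, delta):
    """Given tabulated values g[i] = G(radii[i]) = P(d_1NN <= radii[i]),
    return the largest candidate radius r with G(r) <= delta: stopping
    as soon as an object within r is found misses a closer object with
    probability at most delta."""
    best = 0.0
    for r, p in zip(radii, g):
        if p <= delta:
            best = max(best, r)
    return best
```

The search then combines this stopping radius (the δ part) with radius shrinking (the ε part).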

PAC is:
– Data type: VS-L_p / VS / MS
– Approx.: CS / RC-AP / RC-ES
– Quality: NG / DG / PG-par / PG-npar
– Interaction: SA / IA

Proximity searching with order permutations
Linear method, similar to LAESA:
– p pivots are chosen off-line
– only a fraction f of the objects is visited
– for each object, pivots are sorted from closest to farthest
– the same ordering is done for the query
– the order according to which points are visited is obtained by comparing how pivots are sorted (similarity between sorted lists, Spearman coeff.)
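The steps above can be sketched as follows; helper names are illustrative, and the list-comparison measure used here is the Spearman footrule (sum of absolute position differences), one common choice for comparing pivot permutations.

```python
def pivot_permutation(obj, pivots, dist):
    """Indices of the pivots sorted from closest to farthest from obj."""
    return sorted(range(len(pivots)), key=lambda i: dist(obj, pivots[i]))

def spearman_footrule(perm_a, perm_b):
    """Sum over pivots of |position in perm_a - position in perm_b|;
    small values suggest the two objects see the pivots similarly."""
    pos_b = {p: i for i, p in enumerate(perm_b)}
    return sum(abs(i - pos_b[p]) for i, p in enumerate(perm_a))

def candidate_order(query, objects, pivots, dist, f):
    """Visit only the fraction f of objects whose pivot permutation is
    closest (by footrule) to the query's permutation."""
    qp = pivot_permutation(query, pivots, dist)
    ranked = sorted(objects,
                    key=lambda o: spearman_footrule(
                        pivot_permutation(o, pivots, dist), qp))
    return ranked[:max(1, int(f * len(objects)))]
```

In a real implementation the object permutations are precomputed off-line, so no distance to the actual objects is needed for the ordering.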

Proximity searching with order permutations is:
– Data type: VS-L_p / VS / MS
– Approx.: CS / RC-AP / RC-ES
– Quality: NG / DG / PG-par / PG-npar
– Interaction: SA / IA

Optimality of approximate search
We focus here on RC-ES algorithms:
– the only difference with exact search is early stopping
This can be viewed as an on-line process:
– the quality improves over time
– the exact result can be reached if enough time is allocated

A typical k-NN search
(figure: distance vs. cost)
– the quality increases quickly in the first steps
– the correct result is found, but we still have to prove it!
– once we have proved the result correct, the quality has not increased
– early stopping: distance threshold or cost threshold

What does optimality mean?
– minimum distance after a given cost has been paid (distance-optimality)
– least cost for reaching a given distance (cost-optimality)
The scenario we consider is:
– recursive conservative partitioning of the space (tree)
– a compact representation of each tree node is available
Which is the best way of ordering tree nodes (schedule) so as to obtain optimality?

Optimality of exact search
The schedule based on MinDist is optimal for exact search:
– minimizes the cost for producing the correct result
– does not necessarily provide better results earlier
(figure: distance vs. cost for the MinDist schedule and a non-optimal schedule)
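The MinDist schedule can be sketched as follows, again assuming flat ball-shaped nodes (center, radius, objects) in place of a real tree; the function also records the best distance after each visited node, i.e. the distance-vs-cost profile of the schedule. Names are illustrative.

```python
def nn_mindist_schedule(query, nodes, dist):
    """Exact 1-NN visiting nodes in increasing MinDist order; stops once
    the next MinDist exceeds the best distance found (correctness
    proved). Returns (best_d, profile), where profile[i] is the best
    distance after visiting i+1 nodes."""
    def mindist(node):
        center, radius, _ = node
        return max(dist(query, center) - radius, 0.0)

    schedule = sorted((mindist(n), i) for i, n in enumerate(nodes))
    best_d = float("inf")
    profile = []
    for md, i in schedule:
        if md > best_d:
            break  # remaining nodes cannot contain anything closer
        best_d = min([best_d] + [dist(query, o) for o in nodes[i][2]])
        profile.append(best_d)
    return best_d, profile
```

The profile makes the talk's point concrete: MinDist minimizes the total cost, but nothing forces its profile to descend faster than another schedule's in the early steps.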

Optimality of approximate search
An optimal schedule is better (no worse) than any other over all distances and costs
The two notions of optimality coincide

Optimality: an impossible task
(figure: query q, its NN, and the candidate nodes)
Which is the best way of ordering the nodes?

Optimality: an impossible task
The problem lies in the incomplete knowledge of the nodes' content
Note that this also holds for exact search:
– our notion of optimality is slightly different
– as said, MinDist does not necessarily provide better results earlier…
We shift our aim toward optimal-on-the-average schedules:
– optimal when a random query is considered

Optimal-on-the-average schedules
– Cost-optimality: given a distance threshold θ, minimize the avg. cost
– Distance-optimality: given a cost threshold c, minimize the avg. distance
We use the distance distribution G_i(r) of the 1-NN of a random query in node N_i:
G_i(r) = probability of finding in N_i (at least) a point with distance ≤ r

Optimal-on-the-average schedules
– Cost-optimality: given a distance threshold θ, minimize the avg. cost
  - choose, at each step, the node maximizing G_i(θ)
  - intuitively, we maximize the probability to stop
– Distance-optimality: given a cost threshold c, minimize the avg. distance
  - choose, at each step, the node minimizing the expected 1-NN distance ∫₀^∞ (1 − G_i(r)) dr
  - intuitively, we choose the node having the minimum avg. 1-NN distance
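Both criteria can be sketched directly from G_i, here modeled as callables; the expected 1-NN distance uses the standard identity E[d] = ∫(1 − G_i(r)) dr for a non-negative random variable, approximated numerically. Function names are illustrative.

```python
def cost_optimal_schedule(G, theta):
    """Cost-optimal (on average) visiting order: node indices sorted by
    decreasing G_i(theta), i.e. by the probability that node i contains
    a point within distance theta, so the search can stop soonest."""
    return sorted(range(len(G)), key=lambda i: -G[i](theta))

def expected_nn_distance(Gi, r_max, steps=1000):
    """E[d_1NN] = integral of (1 - Gi(r)) dr over [0, r_max], by a left
    Riemann sum; the distance-optimal schedule picks, at each step, the
    node minimizing this value."""
    h = r_max / steps
    return sum((1.0 - Gi(i * h)) * h for i in range(steps))
```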

Comparing schedules
– Corel dataset: d-dimensional vectors, 4000 nodes, 682 queries

Quality of results
How is the quality of the attained results assessed? Commonly, by comparing the results of the approximate and exact algorithms. Virtually every technique in the literature proposes its own definition of result quality:
– lack of a common framework
– difficult to compare results from different papers

An example (k=5) Exact result (ID, distance): (A, 1)(B, 2)(C, 3)(D, 4)(E, 5) Approximate result: (A, 1)(C, 3)(D, 4)(F, 5)(G, 5) How do we evaluate the quality of the approximate result?

Two families of quality measures
– ranking-based: compare the ranking (position) of objects between the approximate and exact results
  - may require a (costly) full ranking of the objects; e.g., in the previous example we should know the position of objects F and G in the exact result
  - inaccurate in case of ties
– distance-based: compare the distance to the query of the approximate and exact results
  - no additional information is required

Some examples…
– ranking-based:
  - precision (fraction of exact results in the approximate result)
  - error on position (average difference between the positions of objects in the two results)
– distance-based:
  - effective error (relative error on distance)
  - total distance ratio (ratio of the sums of distances of the exact and approximate results)

An example (k=5) (cont.)
Exact result (ID, distance): (A, 1)(B, 2)(C, 3)(D, 4)(E, 5)
Approximate result: (A, 1)(C, 3)(D, 4)(F, 5)(G, 5)
– precision = 3/5
– error on position = (0 + 1 + 1 + 2 + 2)/(5 · 7) = 6/35 (assuming F and G rank 6th and 7th in the exact result)
– relative error = (0 + 1/2 + 1/3 + 1/4 + 0)/5 = 13/60
– total distance ratio = (1 + 2 + 3 + 4 + 5)/(1 + 3 + 4 + 5 + 5) = 15/18
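Three of these measures are simple enough to sketch directly (error on position is omitted, since it needs the full exact ranking of the approximate objects); function names are illustrative, and results are (ID, distance) lists as in the example above.

```python
def precision(exact, approx):
    """Fraction of exact result IDs also present in the approximate result."""
    exact_ids = {o for o, _ in exact}
    approx_ids = {o for o, _ in approx}
    return len(exact_ids & approx_ids) / len(exact)

def relative_error(exact, approx):
    """Effective error: average relative error on distances, comparing
    the i-th approximate distance against the i-th exact one."""
    return sum((da - de) / de
               for (_, de), (_, da) in zip(exact, approx)) / len(exact)

def total_distance_ratio(exact, approx):
    """Sum of exact distances over sum of approximate distances."""
    return sum(d for _, d in exact) / sum(d for _, d in approx)
```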

Which measure is best? Both are needed!
(distance of the exact 1st NN = 1)

          distance of approx. NN   rank of approx. NN
query 1:            2                      2
query 2:            2                    100
query 3:          100                      2
query 4:          100                    100

Which query attains the best result?
Application requirements might favor one quality measure over the others
– e.g., distance-based for the gas station example

What’s next? Use the classification schema for new techniques –The paper contains the classification of 25 existing approaches Two underestimated facets of approximate search –Optimality of scheduling policies –Quality assessment