MindReader: Querying databases through multiple examples. Yoshiharu Ishikawa (Nara Institute of Science and Technology, Japan), Ravishankar Subramanya (Pittsburgh Supercomputing Center), Christos Faloutsos (Carnegie Mellon University)
Outline: Background & Introduction (Query by Example; Our Approach; Relevance Feedback; What's New in MindReader?); Proposed Method (Problem Formulation; Theorems); Experimental Results; Discussion & Conclusion
Query-by-Example: an example. Searching for "mildly overweight" patients (plot axes: Height vs. Weight; ratings: very good / good). The doctor selects examples by browsing the patient database. The examples show an "oblique" correlation, so we can "guess" the implied query q
Query-by-Example: the question. Assume that the user gives multiple examples, optionally assigns scores to the examples, and the samples have a spatial correlation. How can we "guess" the implied query?
Our Approach: automatically derive the distance measure from the given examples. Two important notions: 1. diagonal queries: query isosurfaces are ellipsoids; 2. multiple-level scores: the user can assign "goodness scores" to samples
Isosurfaces of Distance Functions: Euclidean, weighted Euclidean, generalized ellipsoid distance (each centered at the query point q)
Distance Function Formulas. Euclidean: D(x, q) = (x − q)^T (x − q). Weighted Euclidean: D(x, q) = Σ_i m_i (x_i − q_i)². Generalized ellipsoid distance: D(x, q) = (x − q)^T M (x − q)
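The three distance formulas above can be sketched directly in NumPy; the example points x, q and matrix M below are illustrative, not from the slides:

```python
import numpy as np

def euclidean(x, q):
    """Plain Euclidean distance (squared): (x - q)^T (x - q)."""
    d = x - q
    return float(d @ d)

def weighted_euclidean(x, q, m):
    """Weighted Euclidean: sum_i m_i (x_i - q_i)^2, i.e. a diagonal M."""
    d = x - q
    return float(np.sum(m * d * d))

def ellipsoid(x, q, M):
    """Generalized ellipsoid distance: (x - q)^T M (x - q)."""
    d = x - q
    return float(d @ M @ d)

# Illustrative values: with M = I all three functions coincide.
x = np.array([2.0, 1.0])
q = np.array([0.0, 0.0])
M = np.array([[2.0, 1.0],
              [1.0, 2.0]])  # symmetric, off-diagonal => "oblique" isosurfaces
```

The off-diagonal entries of M are what tilt the isosurface ellipses away from the axes, which is exactly the "oblique" correlation in the overweight-patients example.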
Relevance Feedback: a popular method in IR. The query is modified based on relevance judgments from the user. Two major approaches: 1. query-point movement; 2. re-weighting
Relevance Feedback — Query-point Movement. The query point is moved toward "good" examples (Rocchio's formula in IR). Legend: Q_0: initial query point; dots: retrieved data; marks: relevance judgments; Q_1: new query point
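The classic Rocchio update can be sketched as follows; the weighting constants alpha, beta, gamma are illustrative defaults, not values from the slides:

```python
import numpy as np

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Move the query point toward the mean of relevant examples and away
    from the mean of non-relevant ones (textbook Rocchio; the constants
    are conventional defaults, chosen here for illustration)."""
    q1 = alpha * np.asarray(q0, dtype=float)
    if len(relevant):
        q1 = q1 + beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q1 = q1 - gamma * np.mean(nonrelevant, axis=0)
    return q1

# Illustrative: Q_0 at the origin, two relevant points, no non-relevant ones.
q1 = rocchio(np.zeros(2), np.array([[1.0, 1.0], [3.0, 1.0]]), np.empty((0, 2)))
```

Note that Rocchio only moves the point; it never changes the shape of the isosurfaces, which stay spherical.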
Relevance Feedback — Re-weighting. The standard deviation method in the MARS (UIUC) image retrieval system. Assumption: if the deviation of a feature is high, the feature is not important. For each feature i, the weight w_i = 1/σ_i is assigned (axes f_1, f_2: low-deviation "good" feature gets high weight, high-deviation "bad" feature gets low weight, yielding the implied query). MARS didn't provide any justification for this formula
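A minimal sketch of the standard deviation re-weighting, assuming the deviations σ_i are estimated from the "good" examples (the sample data below is hypothetical):

```python
import numpy as np

def std_dev_weights(samples):
    """MARS-style re-weighting: w_i = 1 / sigma_i, so a feature on which
    the good examples agree (low deviation) gets a high weight."""
    sigma = np.std(samples, axis=0, ddof=1)  # per-feature sample deviation
    return 1.0 / sigma

# Two hypothetical good examples: feature 0 varies less than feature 1.
w = std_dev_weights(np.array([[0.0, 0.0],
                              [2.0, 4.0]]))
```

The result is always a diagonal weighting, so the isosurfaces are axis-aligned ellipses; oblique correlations between features are invisible to this method.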
What's New in MindReader? MindReader does not use ad hoc heuristics (cf. Rocchio's formula, re-weighting in MARS); it can handle multiple levels of scores and can derive a generalized ellipsoid distance
What's New in MindReader? MindReader can derive generalized ellipsoid distances
Isosurfaces of Distance Functions: Euclidean (Rocchio), weighted Euclidean (MARS), generalized ellipsoid distance (MindReader), each centered at the query point q
Method: distance function. Generalized ellipsoid distance function: D(x, q) = (x − q)^T M (x − q), or equivalently D(x, q) = Σ_j Σ_k m_jk (x_j − q_j)(x_k − q_k). q: query point vector; x: data point vector; M = [m_jk]: symmetric distance matrix
Method: definitions. N: number of samples; n: number of dimensions (features); x_i: n-d sample data vectors, x_i = [x_i1, …, x_in]^T; X: N×n sample data matrix, X = [x_1, …, x_N]^T; v: N-d score vector, v = [v_1, …, v_N]
Method: problem formulation. Given: N sample n-d vectors and (optional) multiple-level scores. Estimate: the optimal distance matrix M and the optimal new query point q
Method: optimality. How do we measure "optimality"? By minimizing a "penalty". What is the "penalty"? The score-weighted sum of distances between the query point and the sample vectors. Therefore, minimize Σ_i v_i (x_i − q)^T M (x_i − q) under the constraint det(M) = 1
Theorems: theorem 1. Solved with Lagrange multipliers. Theorem 1: the optimal query point is q = x̄ = [x̄_1, …, x̄_n]^T = X^T v / Σ_i v_i, i.e., the optimal query point is the weighted average of the sample data vectors
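Theorem 1 is one line of linear algebra; the sample matrix and scores below are illustrative:

```python
import numpy as np

def optimal_query_point(X, v):
    """Theorem 1: q* = X^T v / sum_i(v_i), the score-weighted average
    of the sample vectors (rows of X)."""
    return X.T @ v / np.sum(v)

# Illustrative: the second sample has three times the score of the first,
# so the query point lands three quarters of the way toward it.
q = optimal_query_point(np.array([[0.0, 0.0],
                                  [2.0, 2.0]]), np.array([1.0, 3.0]))
```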
Theorems: theorems 2 & 3. Theorem 2: the optimal distance matrix is M = (det(C))^{1/n} C^{−1}, where C = [c_jk] is the weighted covariance matrix, c_jk = Σ_i v_i (x_ij − x̄_j)(x_ik − x̄_k). Theorem 3: if we restrict M to a diagonal matrix, our method reduces to the standard deviation method (MindReader includes MARS!)
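Theorem 2 can be sketched as follows; the four sample points and unit scores are illustrative. Note the det(C)^(1/n) factor just rescales C^{-1} so that det(M) = 1, satisfying the constraint from the problem formulation:

```python
import numpy as np

def optimal_distance_matrix(X, v, q):
    """Theorem 2: M = det(C)^(1/n) * C^{-1}, where C is the score-weighted
    covariance of the samples around q (by Theorem 1, q equals the
    weighted mean, so x_i - q are the deviations)."""
    n = X.shape[1]
    D = X - q                        # deviations x_i - q, one row per sample
    C = (v[:, None] * D).T @ D       # c_jk = sum_i v_i (x_ij - q_j)(x_ik - q_k)
    return np.linalg.det(C) ** (1.0 / n) * np.linalg.inv(C)

# Illustrative: samples spread twice as far along axis 1 as along axis 0,
# so M penalizes axis 0 more (weight 2 vs. 0.5), with det(M) = 1.
X = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, 0.0], [0.0, -2.0]])
M = optimal_distance_matrix(X, np.ones(4), np.zeros(2))
```

With oblique sample clouds, C picks up off-diagonal terms and M tilts the isosurface ellipses accordingly, which is what the diagonal (MARS-style) restriction of Theorem 3 cannot do.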
Experiments. 1. Estimation of the optimal distance function: can MindReader estimate the target distance matrix M_hidden appropriately? Based on synthetic data; comparison with the standard deviation method. 2. Query-point movement. 3. Application to real data sets (GIS data)
Experiment 1: target data Two-dimensional normal distribution
Experiment 1: idea. Assume the user has a "hidden" distance M_hidden in mind. Simulate iterative query refinement. Q: how fast can we discover the "hidden" distance? The query point is fixed at (0, 0)
Experiment 1: iteration steps. 1. Make initial samples: compute the k-NNs with Euclidean distance. 2. For each object x, calculate a score that reflects the hidden distance M_hidden. 3. MindReader estimates the matrix M. 4. Retrieve the k-NNs with the derived matrix M. 5. If the result is improved, go to step 2
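The retrieval in steps 1 and 4 can be sketched as a brute-force scan under the ellipsoid distance (a sketch only; as the discussion slide notes, a real system would use a spatial index):

```python
import numpy as np

def knn(data, q, M, k):
    """Brute-force k nearest neighbors of q under (x - q)^T M (x - q).
    Step 1 uses M = I (Euclidean); step 4 uses the estimated M."""
    d = data - q
    dist = np.einsum('ij,jk,ik->i', d, M, d)  # per-row quadratic form
    return data[np.argsort(dist)[:k]]

# Illustrative 2-NN query with Euclidean distance (step 1).
data = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
nn = knn(data, np.zeros(2), np.eye(2), 2)
```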
Experiment 1: scores. Calculation of scores in terms of the "hidden" distance function: 1. Calculate the distance from the query point q based on the hidden distance matrix M_hidden: d = D(x, q). 2. Translate the distance value d to a score s = exp(−d²/2) (0 < s < 1), then to v = log(s / (1 − s)) (−∞ < v < ∞)
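The two-step score translation is straightforward to sketch; the Gaussian-style squashing and the logit transform follow the formulas above:

```python
import math

def score(d):
    """Squash a distance d >= 0 into a 'goodness' score s in (0, 1]:
    s = exp(-d^2 / 2); nearer objects get scores closer to 1."""
    return math.exp(-d * d / 2.0)

def score_to_v(s):
    """Logit transform v = log(s / (1 - s)), mapping (0, 1) onto the
    whole real line; s > 1/2 gives a positive v."""
    return math.log(s / (1.0 - s))
```

The logit stretches the bounded score back onto an unbounded scale, so objects the hidden distance considers very good or very bad get correspondingly extreme weights v in the estimation.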
Experiment 1: evaluation measure. Used to check whether the query result improved. CD-k measure (CD stands for "cumulative distance"): for the k-NNs retrieved by matrix M, compute their actual distances by matrix M_hidden, then take the sum
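The CD-k measure as defined above is a short computation; the neighbor points below are illustrative:

```python
import numpy as np

def cd_k(neighbors, q, M_hidden):
    """CD-k: sum of *hidden* distances (x - q)^T M_hidden (x - q) over the
    k-NNs retrieved with the current estimate of M; lower is better."""
    d = neighbors - q
    return float(np.sum(np.einsum('ij,jk,ik->i', d, M_hidden, d)))

# Illustrative: two unit points under the identity hidden matrix.
value = cd_k(np.array([[1.0, 0.0], [0.0, 1.0]]), np.zeros(2), np.eye(2))
```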
Experiment 1: final k-NNs. Ellipse: isosurface for M_hidden. Red points: final k-NNs obtained by the standard deviation method. Green points: final k-NNs obtained by MindReader
Experiment 1: speed of convergence. x-axis: number of iterations; y-axis: CD-k measure value. Red: standard deviation method; green: MindReader; blue: best CD-k value possible for the data set
Experiment 1: changes of isosurfaces After 0th and 2nd iterations
Experiment 1: changes of isosurfaces After 4th and 8th iterations
Experiment 2: query-point movement. Starts from query point (0.5, 0.5). MindReader converges to M_hidden within five iterations
Experiment 3: real data set. End-points of road segments from Montgomery County, MD. The data is normalized to [−1, 1] × [−1, 1]. The query specifies five points along route I-270. Can we estimate a good distance function?
Experiment 3: isosurfaces After 0th and 2nd iterations: fast convergence!
Discussion: efficiency. Don't worry about speed! Ellipsoid queries are supported by spatial access methods: Seidl & Kriegel [VLDB97]; Ankerst, Braunmüller, Kriegel, Seidl [VLDB98]. For the derived distance, we can efficiently use a spatial index
Conclusion. MindReader automatically guesses diagonal queries from the given examples; it supports multiple levels of scores and includes "Rocchio" and "MARS" (the standard deviation method) as special cases. Contributions: problem formulation & solution, plus evaluation based on the experiments