Learning for Efficient Retrieval of Structured Data with Noisy Queries

Learning for Efficient Retrieval of Structured Data with Noisy Queries
Charles Parker, Alan Fern, Prasad Tadepalli (Oregon State University)

Structured Data Retrieval: The Problem
- Vast collections of structured data (images, music, video)
- Development of noise-tolerant retrieval tools
- Query-by-content
- Retrieval that is accurate as well as efficient

Sequence Alignment: Introduction
Given a query sequence and a set of targets, choose the best-matching (correct) target for the given query. Useful in many applications:
- Protein secondary structure prediction
- Speech recognition
- Plant identification (Sinha et al.)
- Query-by-humming

Obligatory Overview Slide
- Sequence Alignment Basics
- Metric Access Methods
- The Triangular Inequality
- VP-Trees
- Boosting for Efficiency
- Results and Conclusion

Sequence Alignment: Basics
Given a query sequence and a target sequence, we can align the two. Matching (or close) characters should align; characters present in only one of the sequences should not. Suppose we have query = "DRAIN" and target = "CABIN" . . .

Sequence Alignment: Alignment Costs
To evaluate an alignment, we score it. Suppose we have a function c that scores edit operations:
- c(a, b) = 3 if a = b, and 0 for any other pair of non-null characters
- c(-, b) = 1
- c(a, -) = 1
The alignment shown on the slide scores 13: for example, aligning D, R, C, and B against gaps and matching A, I, and N gives 1 + 1 + 1 + 3 + 1 + 3 + 3 = 13.

The Dynamic Time Warping (Smith-Waterman) Algorithm
- Find the best reward path from any point in the target and query
- A dynamic programming approach: start from the base case align(0, 0) = 0 and fill in the matrix of alignment scores; the alignment itself can then be recovered from the table
[figure: the DP matrix for query "DRAIN" against target "CABIN"; the best path scores 13]
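To make the table-filling concrete, here is a minimal sketch of the dynamic program under the scores from the previous slide (all names are my own, not the authors'):

```python
# A minimal sketch of the table-filling dynamic program, using the scores
# from the alignment-costs slide; higher totals are better.
def align_score(query, target, match=3, mismatch=0, gap=1):
    """Best total reward for aligning `query` against `target`."""
    n, m = len(query), len(target)
    # score[i][j] = best reward aligning query[:i] with target[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue  # base case: align(0, 0) = 0
            best = []
            if i > 0:
                best.append(score[i - 1][j] + gap)  # query char against a gap
            if j > 0:
                best.append(score[i][j - 1] + gap)  # gap against a target char
            if i > 0 and j > 0:
                s = match if query[i - 1] == target[j - 1] else mismatch
                best.append(score[i - 1][j - 1] + s)  # align the two characters
            score[i][j] = max(best)
    return score[n][m]

print(align_score("DRAIN", "CABIN"))  # -> 13, matching the previous slide
```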

Gradient Boosting: Learning Distance Functions
- Define the margin to be the score of the correct target minus the score of the highest-scoring incorrect target
- Formulate a loss function in terms of this definition of margin
- Take the gradient of this loss at each possible "replacement" that occurs in the training data
- Iteratively move in this direction (see the sketch below)
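A loudly hedged sketch of this idea, not the authors' exact loss or update rule. It assumes two helpers: `score(q, t, costs)` returns the best alignment score under a cost table, and `feature_counts(q, t, costs)` returns how often each edit operation is used in that best alignment (same shape as `costs`):

```python
import numpy as np

# Sketch: margin-based gradient steps on an edit-cost table (assumptions above).
def boost_costs(costs, train, score, feature_counts, step=0.1, iters=30):
    for _ in range(iters):
        grad = np.zeros_like(costs)
        for query, correct, incorrect in train:
            # margin: score of the correct target minus the best incorrect score
            wrong = max(incorrect, key=lambda t: score(query, t, costs))
            margin = score(query, correct, costs) - score(query, wrong, costs)
            weight = np.exp(-margin)  # exponential loss emphasizes violated margins
            # d(score)/d(cost of an op) = usage count of that op in the best
            # alignment, so the loss gradient is a weighted difference of counts
            grad += weight * (feature_counts(query, wrong, costs)
                              - feature_counts(query, correct, costs))
        costs = costs - step * grad  # step against the gradient
    return costs
```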

Metric Access Methods: Overview
- Accuracy is not enough!
- Avoid linear search over the target set
- Cull whole subsets with the computation of a single distance
- Exploit the triangular inequality
- Build on previous work to provide some level of guarantee

The Triangular Inequality (Skopal, 2006)
- A learned distance may violate the triangular inequality; we need to increase small distances while large ones stay the same
- Applying a concave function to the distances does the job
- The function maps distances into the same range
- Too much concavity could create a useless (indiscriminative) metric space
- Use line search over the concavity parameter to assure optimality (see the sketch below)
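A minimal sketch in the spirit of Skopal's approach; the particular modifier family below (a fractional power on normalized distances) is an assumption, not necessarily the one used in the paper:

```python
# Concave modifier: inflates small distances relative to large ones.
def concave_modifier(d, w):
    """d assumed normalized to [0, 1]; w >= 0 controls concavity."""
    return d ** (1.0 / (1.0 + w))

def triangle_violations(dist, triples, w):
    """Count triangular-inequality violations over sampled triples (a, b, c)."""
    bad = 0
    for a, b, c in triples:
        ab = concave_modifier(dist(a, b), w)
        bc = concave_modifier(dist(b, c), w)
        ac = concave_modifier(dist(a, c), w)
        if ac > ab + bc:
            bad += 1
    return bad

# Line search: the smallest w with (near) zero violations keeps the modified
# space both metric and discriminative.
```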

The Triangular Inequality: Concave Function Application

Vantage Point Trees: Overview
- Given a set S, select a "vantage point" v in S
- Split S into S_l and S_r according to distance from v (points closer than the median distance go to S_l)
- Recurse on S_l and S_r
- This builds a balanced binary tree (a construction sketch follows)
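A compact construction sketch; the median split is what keeps the tree balanced, since each half receives roughly half of the points:

```python
import random

class VPNode:
    def __init__(self, point, median, left, right):
        self.point, self.median = point, median
        self.left, self.right = left, right

def build_vp_tree(points, dist):
    if not points:
        return None
    v = points[random.randrange(len(points))]  # choose a vantage point
    rest = [p for p in points if p is not v]
    if not rest:
        return VPNode(v, 0.0, None, None)
    d = sorted(dist(v, p) for p in rest)
    median = d[len(d) // 2]
    s_l = [p for p in rest if dist(v, p) < median]   # inside the median ball
    s_r = [p for p in rest if dist(v, p) >= median]  # outside the median ball
    return VPNode(v, median, build_vp_tree(s_l, dist), build_vp_tree(s_r, dist))
```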

Vantage Point Trees: Demonstration
[sequence of figures: the points A through H are split around the median distance from a chosen vantage point, and each half is recursively split again, yielding a balanced binary tree over A–H]

Vantage Point Trees: Searching for Nearest Neighbors Within t
[figure: query Q among the points A–H; vantage point A with median radius m; left subtree S_l inside the median ball]
Suppose d(Q, A) > m + t. Assume S_l contains a sequence within distance t of Q, and that the triangular inequality holds. Then:
- d(Q, S_l) < t (the assumed near neighbor)
- d(A, S_l) < m (everything in S_l lies within the median distance of A)
- so d(Q, A) > m + t > d(A, S_l) + d(Q, S_l)
But according to the triangular inequality, d(Q, A) <= d(A, S_l) + d(Q, S_l). Thus we have a contradiction, and we can eliminate S_l from consideration.

Vantage Point Trees: Optimizing
- A similar proof holds for d(Q, A) < m - t (then S_r can be eliminated)
- However, if m - t < d(Q, A) < m + t, we can prune nothing and must search both subtrees, degenerating toward linear search
- If there is no nearest neighbor within t, there are no guarantees
- So t should be as small as possible . . .
- . . . or target/query distances should fall as far as possible on the correct side of the median distance (see the search sketch below)
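A sketch of range search over the tree built earlier; each branch is culled exactly when the contradiction argument from the previous slides applies:

```python
def search(node, query, t, dist, results):
    """Collect all points within distance t of the query."""
    if node is None:
        return
    d = dist(query, node.point)
    if d <= t:
        results.append(node.point)
    if d < node.median - t:
        search(node.left, query, t, dist, results)   # S_r can be eliminated
    elif d > node.median + t:
        search(node.right, query, t, dist, results)  # S_l can be eliminated
    else:
        # m - t < d < m + t: no pruning is possible; search both sides
        search(node.left, query, t, dist, results)
        search(node.right, query, t, dist, results)
```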

Boosting for Efficiency: Summary
- Build an instance of a metric access data structure over the target set
- Define a loss function particular to that structure
- Take the gradient of this loss function
- Use the gradient to tune the distance metric to the structure

Boosting for Efficiency: VP-Tree-Based Loss Function
- Sum the loss over each training query and each target along the path to the correct target
- Scale the loss according to the "cost" of a mistake
- Notation: i is the training query; j is a target along the path; v_ij encodes "left or right"; m_ij is the median; f_ij(a, b) is the replacement count for the jth target on the ith training query's path
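The loss formula itself appeared as an image on the slide. What follows is only my reconstruction from the definitions above, an assumption to be checked against the paper rather than the authors' verified formula:

```latex
% Hypothetical reconstruction (the slide's equation was an image):
% v_ij = +1 if the path to the correct target goes left at node j
%        (the distance should fall below the median), -1 otherwise;
% w_ij scales the loss by the cost of a mistake at node j.
L(c) = \sum_{i}\sum_{j} w_{ij}\,
       \exp\!\bigl(-v_{ij}\,\bigl(m_{ij} - d_c(q_i, t_j)\bigr)\bigr),
\qquad
d_c(q_i, t_j) = \sum_{a,b} c(a,b)\, f_{ij}(a,b)
```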

Boosting for Efficiency: Gradient Expression
- Retains the properties of the accuracy gradient (easy computation, ability to approximate)
- Expresses the desired notion of loss and margin

Synthetic Domain: Summary
- Targets are sequences of five tuples (x, y), with x in (0, 9) and y in (0, 29)
- Queries are generated by moving sequentially through the target:
  - With p = 0.3, generate a random query event with y >= 15 (an insertion)
  - Otherwise, if the target's y < 15, generate a match; if the target's y >= 15, skip to the next target element (a deletion)
  - Matches map (x1, y1) to (x1, (y1 + 1) % 30)
- The domain is engineered to have structure (a generator sketch follows)
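A sketch of the generator described above; where the slide is silent (for example, the x-range of inserted events, and whether an insertion also consumes a target element), the choices below are assumptions:

```python
import random

def make_query(target, p_insert=0.3):
    query = []
    for x, y in target:
        if random.random() < p_insert:
            # random query event with y >= 15 (an insertion)
            query.append((random.randint(0, 9), random.randint(15, 29)))
        elif y < 15:
            query.append((x, (y + 1) % 30))  # a match, perturbed as on the slide
        # else: y >= 15, skip to the next target element (a deletion)
    return query

target = [(random.randint(0, 9), random.randint(0, 29)) for _ in range(5)]
print(target, "->", make_query(target))
```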

Query-by-Humming: Summary
- A database of songs (targets) that can be queried aurally
- Applications: commercial, entertainment, legal
- Enables music to be queried on its own terms

Query-by-Humming: Basic Techniques for Query Processing
- Queries require preprocessing: filtering, pitch detection, note segmentation
- Targets, too: polyphonic transcription, melody spotting
- We end up with a sequence of tuples (pitch, time); an illustrative example follows
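For concreteness, a preprocessed query might look like the following; the particular encoding (MIDI pitch number, duration in seconds) is an illustrative assumption, not the paper's stated representation:

```python
# Illustrative only: a hummed query after pitch detection and note
# segmentation, as (pitch, time) tuples (assumed: MIDI pitch, seconds).
query = [(60, 0.5), (62, 0.5), (64, 1.0), (62, 0.5), (60, 1.0)]
```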

Application to Query-by-Humming: Our Data Set
- 587 queries
- 50 subjects (college and church choirs)
- 12 query songs
- Queries split into training and test sets

Results: Experimental Setup
- First 30 iterations: accuracy boosting only, then construct the VP-tree
- Compare two methods for the second 30 iterations:
  - Accuracy only
  - Accuracy + efficiency (may require rebuilding the tree)
- Efficiency is measured by plotting the % of the target set culled vs. error while varying t
- We would like low error and a high % culled; in reality, lowering t introduces error (see the evaluation sketch below)
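A sketch of how this measurement could be run, reusing the `search` function from the VP-tree sketch above; counting distance computations as a proxy for the fraction of the target set visited is my assumption:

```python
def evaluate(tree, queries, answers, n_targets, t, dist):
    """Average fraction of the target set culled, and error rate (misses)."""
    culled, errors = 0.0, 0
    for q, correct in zip(queries, answers):
        calls = [0]
        def counting_dist(a, b):
            calls[0] += 1  # one distance computation per node visited
            return dist(a, b)
        found = []
        search(tree, q, t, counting_dist, found)
        culled += 1.0 - calls[0] / n_targets
        errors += correct not in found  # correct target missed at this t
    return culled / len(queries), errors / len(queries)
```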

Results: Query-by-Humming Domain

Results: Synthetic Domain

Conclusion
- Designed a way to specialize a metric to a metric data structure
- Showed empirically that accuracy alone is not enough
- Showed successful "twisting" of the metric space

Future Work
- Extending to other types of structured data
- Extending to other metric access methods: some are better suited (metric trees), some worse (cover trees)
- Using this as a general solution to the structured prediction problem
- Applications in automated planning and reinforcement learning

Fin.