Learning for Efficient Retrieval of Structured Data with Noisy Queries Charles Parker, Alan Fern, Prasad Tadepalli Oregon State University
Structured Data Retrieval: The Problem Vast collections of structured data (images, music, video) Development of noise-tolerant retrieval tools Query-by-content Accurate as well as efficient retrieval
Sequence Alignment: Introduction Given a query sequence and a set of targets, choose the best-matching (correct) target for the query. Useful in many applications Protein secondary structure prediction Speech recognition Plant identification (Sinha et al.) Query-by-humming
Obligatory Overview Slide Sequence Alignment Basics Metric Access Methods The Triangular Inequality VP-Trees Boosting for Efficiency Results and Conclusion
Sequence Alignment: Basics Given a query sequence and a target sequence, we can align the two. Matching (or close) characters should align; characters present in only one sequence should not. Suppose we have query = “DRAIN” and target = “CABIN” . . .
Sequence Alignment: Alignment Costs Scoring the alignment for evaluation Suppose we have a function c that assigns a reward to each edit operation: c(a, b) = 3 if a = b, and 0 for any other pair of non-null characters c(-, b) = 1 c(a, -) = 1 The best alignment of “DRAIN” and “CABIN” (delete D and R, insert C and B, match A, I, N) scores 3 · 3 + 4 · 1 = 13
The Dynamic Time Warping (Smith-Waterman) Algorithm Find the best reward path from any point in the target and query Fill in the matrix values by dynamic programming, starting from the base case align(0, 0) = 0 [figure: DP score matrix for query “DRAIN” against target “CABIN”]
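To make the table-filling concrete, here is a minimal Python sketch of the recurrence under the scoring function c from the previous slide. The function name align and the reward values come from the slides; treating the edge cells as pure runs of gaps is a straightforward reading, not the authors' exact code.

```python
def align(query, target, match=3, mismatch=0, gap=1):
    """Fill the DP table of best-reward alignments, starting from the
    base case align(0, 0) = 0, and return the best total reward."""
    n, m = len(query), len(target)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue                       # base case: align(0, 0) = 0
            best = float("-inf")
            if i > 0 and j > 0:                # align query[i-1] with target[j-1]
                r = match if query[i - 1] == target[j - 1] else mismatch
                best = max(best, score[i - 1][j - 1] + r)
            if i > 0:                          # gap: delete query[i-1]
                best = max(best, score[i - 1][j] + gap)
            if j > 0:                          # gap: insert target[j-1]
                best = max(best, score[i][j - 1] + gap)
            score[i][j] = best
    return score[n][m]

print(align("DRAIN", "CABIN"))                 # 13, as on the previous slide
```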
Gradient Boosting: Learning Distance Functions Define the margin to be the score of the correct target minus the score of the highest scoring incorrect target Formulate a loss function according to this definition of margin Take the gradient of this function at each possible “replacement” that occurs in the training data. Iteratively move in this direction
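A minimal sketch of one such gradient step, assuming an exponential loss over the margin and a score table indexed by replacement pairs; the exponential form, the fixed alignments, and all names here are illustrative assumptions rather than the paper's exact formulation.

```python
import math
from collections import Counter

def alignment_score(cost, counts):
    """Score an alignment from its replacement counts f(a, b)."""
    return sum(cost.get(pair, 0.0) * n for pair, n in counts.items())

def boosting_step(cost, examples, lr=0.1):
    """One gradient step on an exponential margin loss.  Each example is a
    pair (f_correct, f_incorrect) of Counters over replacement pairs for
    the correct target and the highest-scoring incorrect target."""
    grad = Counter()
    for f_cor, f_inc in examples:
        margin = alignment_score(cost, f_cor) - alignment_score(cost, f_inc)
        w = math.exp(-margin)                   # misranked queries weigh more
        for pair in set(f_cor) | set(f_inc):
            grad[pair] += w * (f_cor[pair] - f_inc[pair])
    for pair, g in grad.items():
        cost[pair] = cost.get(pair, 0.0) + lr * g   # move to raise the margins
    return cost
```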
Metric Access Methods: Overview Accuracy is not enough! Avoidance of linear search Ability to cull subsets with the computation of a single distance Use of the triangular inequality Use of prior work to guarantee a level of retrieval quality
The Triangular Inequality (Skopal, 2006) Need to increase small distances while large ones stay the same Applying a concave function does the job The function keeps distances within the same range Too much concavity creates a useless metric space Use line search to find the best trade-off
The Triangular Inequality: Concave Function Application
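A sketch of the line search in Python, assuming distances are normalized to [0, 1]. The fractional-power modifier and the triplet-sampling violation test follow the spirit of Skopal's TriGen, but the exact forms here are illustrative assumptions.

```python
import random

def modify(d, w):
    """Concave modifier on distances in [0, 1]: raising to a power below 1
    inflates small distances while leaving 0 and 1 fixed."""
    return d ** (1.0 / (1.0 + w))

def violation_rate(dist, points, w, samples=2000, rng=random):
    """Fraction of sampled triplets violating the triangular inequality
    under the modified distance."""
    bad = 0
    for _ in range(samples):
        a, b, c = rng.sample(points, 3)
        if modify(dist(a, c), w) > modify(dist(a, b), w) + modify(dist(b, c), w):
            bad += 1
    return bad / samples

def line_search(dist, points, weights=(0.0, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0)):
    """Pick the smallest weight with no sampled violations: enough concavity
    to restore the inequality, but no more, since pushing every distance
    toward 1 yields a useless metric space."""
    for w in weights:
        if violation_rate(dist, points, w) == 0.0:
            return w
    return weights[-1]
```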
Vantage Point Trees: Overview Given a set S Select a “vantage point” v in S Split S into Sl and Sr according to distance from v (at the median distance m) Call recursively on Sl and Sr Builds a balanced binary tree
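A minimal construction sketch in Python. Choosing the vantage point uniformly at random is a simplification (published vp-tree variants pick it more carefully); everything else follows the recipe above.

```python
import random
from statistics import median

class VPNode:
    def __init__(self, vantage, m, left, right):
        self.vantage = vantage   # the vantage point v
        self.m = m               # median distance from v to the rest of S
        self.left = left         # Sl: points with d(v, x) <= m
        self.right = right       # Sr: points with d(v, x) >  m

def build_vp_tree(points, dist):
    """Split S at the median distance from v and recurse, producing a
    balanced binary tree."""
    if not points:
        return None
    i = random.randrange(len(points))
    v, rest = points[i], points[:i] + points[i + 1:]
    if not rest:
        return VPNode(v, 0.0, None, None)
    m = median(dist(v, p) for p in rest)
    left = [p for p in rest if dist(v, p) <= m]
    right = [p for p in rest if dist(v, p) > m]
    return VPNode(v, m, build_vp_tree(left, dist), build_vp_tree(right, dist))
```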
Vantage Point Trees: Demonstration [series of figures: the points A–H are split at the median distance from a chosen vantage point, and each half is split recursively, yielding a balanced binary tree]
Vantage Point Trees: Searching for Nearest Neighbors Within t [figure: query Q among points A–H, with vantage point A, median distance m, inner set Sl, and outer set Sr] Suppose d(Q, A) > m + t. Assume Sl contains a sequence within distance t of Q, and that the triangular inequality holds. Then d(Q, Sl) < t, and since Sl lies inside the median, d(A, Sl) < m. Adding these, d(Q, A) > m + t > d(A, Sl) + d(Q, Sl). But according to the triangular inequality, d(Q, A) <= d(A, Sl) + d(Q, Sl). Thus we have a contradiction and can eliminate Sl from consideration.
Vantage Point Trees: Optimizing A similar proof for d(Q, A) < m – t eliminates Sr However, if m – t < d(Q, A) < m + t, we can prune nothing and must search both subtrees, degenerating toward linear search If there are no nearest neighbors within t, there are no guarantees t should be as small as possible . . . . . . or, target/query distances should fall as far as possible on the correct side of the median distance
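A search sketch over the VPNode structure from the construction slide. The two pruning tests are exactly the contradictions proved above; when neither fires, both subtrees must be visited.

```python
def search(node, q, dist, t, found=None):
    """Return all stored points within distance t of the query q."""
    if found is None:
        found = []
    if node is None:
        return found
    d = dist(q, node.vantage)
    if d <= t:
        found.append(node.vantage)
    if d <= node.m + t:      # otherwise no point of Sl can be within t of q
        search(node.left, q, dist, t, found)
    if d >= node.m - t:      # otherwise no point of Sr can be within t of q
        search(node.right, q, dist, t, found)
    return found
```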
Boosting for Efficiency: Summary Create an instance of a metric access data structure given a target set Define a loss function particular to that structure Take the gradient of this loss function Use the gradient to tune the distance metric to the structure
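Put together, the loop might look like the sketch below, reusing build_vp_tree from the earlier slide. The gradient of the structure-specific loss is passed in as a parameter because its exact form is the subject of the next slides; rebuilding the tree every iteration is one simple policy, not necessarily the authors'.

```python
def boost_for_efficiency(targets, queries, cost, dist, loss_gradient,
                         iterations=30, lr=0.1):
    """Tune the edit-cost table against a metric access structure."""
    for _ in range(iterations):
        # 1. instantiate the structure under the current metric
        tree = build_vp_tree(targets, lambda a, b: dist(a, b, cost))
        # 2-3. take the gradient of the structure-specific loss
        grad = loss_gradient(tree, queries, cost)
        # 4. use the gradient to tune the metric to the structure
        for pair, g in grad.items():
            cost[pair] = cost.get(pair, 0.0) - lr * g
    return cost
```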
Boosting for Efficiency: vp-tree-based Loss Function Sum the loss for each training query and each target along the path to the correct target Scale the loss according to the “cost” of a mistake i is the training query, j is a target along its path v_ij encodes “left or right”, m_ij is the median distance f_ij(a, b) is the replacement count for the jth target in the ith training query's path
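The equation itself did not survive extraction. A plausible reconstruction from the symbols above, assuming an exponential margin loss of the same family used for accuracy boosting (the weights $w_{ij}$ and the exact form are assumptions, not the paper's formula):

$$ L(c) = \sum_i \sum_j w_{ij} \, \exp\!\big( v_{ij} \, ( d_c(q_i, x_j) - m_{ij} ) \big), \qquad d_c(q_i, x_j) = \sum_{a,b} f_{ij}(a, b) \, c(a, b) $$

Here $v_{ij} \in \{-1, +1\}$ encodes whether the path to the correct target goes left or right at node $j$, $m_{ij}$ is that node's median, and $w_{ij}$ scales the term by the cost of the mistake.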
Boosting for Efficiency: Gradient Expression Retains properties of the accuracy gradient (easy computation, ability to approximate) Expresses desired notion of loss and margin
Synthetic Domain: Summary Targets are sequences of five tuples (x, y) with domains (0, 9) and (0, 29) respectively Queries are generated by moving sequentially through the target With p = 0.3, generate a random query event with y >= 15 (insert) Else, if the target y is < 15, generate a match; if the target y is >= 15, skip to the next target element (delete) Matches are (x1, y1) -> (x1, (y1 + 1) % 30) The domain is engineered to have structure
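A generator sketch under one reasonable reading of the rules above; whether an insertion consumes the current target element is ambiguous on the slide, so the loop below simply retries it, which is an assumption.

```python
import random

def make_target(length=5, rng=random):
    """A target: a sequence of (x, y) tuples, x in 0..9, y in 0..29."""
    return [(rng.randint(0, 9), rng.randint(0, 29)) for _ in range(length)]

def make_query(target, p_insert=0.3, rng=random):
    """Walk the target, inserting noise events, matching low-y elements
    (shifted by one, mod 30), and deleting high-y elements."""
    query, i = [], 0
    while i < len(target):
        if rng.random() < p_insert:
            query.append((rng.randint(0, 9), rng.randint(15, 29)))  # insert
            continue                       # stay on the same target element
        x, y = target[i]
        if y < 15:
            query.append((x, (y + 1) % 30))                         # match
        i += 1                             # y >= 15 is skipped     # delete
    return query
```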
Query-by-Humming: Summary Database of songs (targets) Can be queried aurally Applications Commercial Entertainment Legal Enables music to be queried on its own terms
Query-by-Humming: Basic techniques for query processing Queries require preprocessing Filtering Pitch detection Note segmentation Targets, too! Polyphonic transcription Melody spotting We end up with a sequence of tuples (pitch, time)
Application to Query-by-Humming: Our Data Set 587 Queries 50 Subjects (college and church choirs) 12 Query songs Queries split for training and test
Results: Experimental Setup First 30 iterations: accuracy boosting only Construct the VP-tree Compare two methods for the second 30 iterations: accuracy only vs. accuracy + efficiency (which may require rebuilding the tree) Efficiency is measured by plotting the % of the target set culled vs. error as t varies We would like low error and a high % culled In reality, lowering t introduces error
Results: Query-by-Humming Domain
Results: Synthetic Domain
Conclusion Designed a way to specialize a metric to a metric data structure Showed empirically that accuracy is not enough Showed successful “twisting” of the metric space
Future Work Extending to other types of structured data Extending to other metric access methods Some are better (metric trees) Some are worse (cover trees) Use as a general solution to the structured prediction problem Use in automated planning and reinforcement learning
Fin.