Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases O. Ozturk and H. Ferhatosmanoglu. IEEE International Symp. on Bioinformatics.


Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases O. Ozturk and H. Ferhatosmanoglu. IEEE International Symp. on Bioinformatics and Bioengineering (BIBE '03), pp Washington, DC. March 2003.

BMI Winter'04

Overview
– Applications of queries
– Background on queries
– Current problem
– Solutions and our solution
– Comparison experiments and results
– Future work

Queries in general
– We need a metric distance function
  – To measure the (dis)similarity between objects
– Dynamic programming algorithm
  – O(|string1| * |string2|) time and space, i.e. O(n^2) where n is the length of the strings
  – Especially bad for genetic sequence queries, where the sequences are long

Two kinds of queries
– ε-range queries
  – Retrieve all objects that are similar to the query beyond a certain degree ε

Two kinds of queries
– k-nearest neighbor (k-NN) queries
  – Retrieve the k most similar objects
  – No domain knowledge necessary
  – Example: 4-NN

Two kinds of queries
– ε-range queries
  – Require domain knowledge
    – Data distribution and distance definition
  – ε too small: none returned

Two kinds of queries
– ε-range queries
  – ε too large: all returned
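The two query types can be sketched as plain linear scans. This is only an illustration: the function names, the toy data (points on a line) and the choice of "distance at most ε" as the range condition are assumptions, not taken from the slides.

```python
import heapq
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def range_query(db: Iterable[T], q: T, eps: float,
                dist: Callable[[T, T], float]) -> List[T]:
    """Return every object whose distance to q is at most eps (assumed condition)."""
    return [x for x in db if dist(x, q) <= eps]


def knn_query(db: Iterable[T], q: T, k: int,
              dist: Callable[[T, T], float]) -> List[T]:
    """Return the k objects closest to q, nearest first."""
    return heapq.nsmallest(k, db, key=lambda x: dist(x, q))


def abs_diff(a: float, b: float) -> float:
    """Toy metric: absolute difference between two numbers."""
    return abs(a - b)


# Hypothetical toy database.
points = [1, 4, 6, 9, 15, 20]

too_small = range_query(points, 10, 0.5, abs_diff)  # eps too small: nothing returned
too_large = range_query(points, 10, 100, abs_diff)  # eps too large: everything returned
four_nn = knn_query(points, 10, 4, abs_diff)        # needs no knowledge of the distance scale
```

The range query only gives a useful answer when ε is chosen with some knowledge of the data and the distance scale, whereas the k-NN query always returns exactly k objects.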

Measuring similarity
– We need a metric distance function
  – To measure the (dis)similarity between objects
– Edit Distance (ED)
  – Three kinds of operations: insert (I), delete (D), replace (R)
  – Example: ACTTAGC to AATGATAG

      A C T - - T A G C
        R   I I       D      → ED = 4
      A A T G A T A G -

  – Dynamic programming algorithm
  – O(mn) time and space
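A minimal sketch of the dynamic programming algorithm for edit distance, assuming unit cost for each insert, delete and replace (the slide does not state the costs). The function name is illustrative.

```python
from typing import List


def edit_distance(s: str, t: str) -> int:
    """Unit-cost edit distance computed with the full O(mn) table."""
    m, n = len(s), len(t)
    # d[i][j] = edit distance between the prefixes s[:i] and t[:j]
    d: List[List[int]] = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i characters of s[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all j characters of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete s[i-1]
                d[i][j - 1] + 1,         # insert t[j-1]
                d[i - 1][j - 1] + cost,  # match or replace
            )
    return d[m][n]


example = edit_distance("ACTTAGC", "AATGATAG")  # the pair used on the slide
```

On the slide's pair this gives 4, matching the alignment shown (one replace, two inserts, one delete).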

DPA [figure]

DPA 2 [figure]

String/Genome Data
– Queries ask for the substrings in the database that are most similar to the given string
– BLAST has ε-range queries
  – Naïve search (linear scan)
  – Scalability problems
– How to handle size
  – Use partial information rather than the whole database
  – Approximate (compress) the string data
    → may fit in memory
    → may be used for indexing, clustering

How to Handle Size
Three approaches to make use of compressed data:
1. Prune irrelevant data; I/O only for non-pruned entries
   → calculate exact values for the non-pruned entries (especially ε-range queries)
2. Get approximate answers with virtually no I/O (I/O only for the answers) (especially k-NN queries)
3. Approximate pruning for ε-range queries

Overview
– Background on queries
– Current problem
– Transformation and indexing
– Comparison experiments and results
– Future work

Big Picture
General approach, step by step:
– Preprocessing
  – Transform (large) string data into (hopefully smaller) multi-dimensional vectors
  – Develop a distance function df in the vector space to approximate the string similarity
  – Build a multi-dimensional index on top of the multi-dimensional vectors
– Query
  – Implement one of the three approaches mentioned

Preprocessing [diagram]
1. Windowing: string database → overlapping windows
2. Transformation into vector space: overlapping windows → multidimensional vectors
3. Indexing: vectors indexed with respect to some distance function

Using the index [diagram]
1. Transformation: query sequence → vector
2a. Approximate query (k-NN or ε-range) on the index of vectors
2b. Exact query (k-NN or ε-range) on the index of vectors
– Result: done, or a candidate set
  – The vectors returned represent most of the k-NN (or the vectors in the ε-range) plus some false positives
(continued)

Using the index (continued) [diagram]
3. Refine the candidate set
   – I/O for the strings represented by those vectors
   – Calculate ED for each of them (remove false positives)
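The filter-and-refine flow can be sketched for an ε-range query as follows. This is a simplified illustration, not the authors' implementation: it scans the vectors linearly instead of using a multi-dimensional index, uses character frequencies (FV1) with the frequency distance as the lower bound, and all names and the toy database are assumptions.

```python
from collections import Counter
from typing import List, Sequence, Tuple

ALPHABET = "ACGT"  # assumed component order of the frequency vector


def fv1(s: str) -> List[int]:
    """Character-frequency vector of s."""
    counts = Counter(s)
    return [counts[ch] for ch in ALPHABET]


def fd1(u: Sequence[int], v: Sequence[int]) -> int:
    """Frequency distance on 1-gram vectors (a lower bound on edit distance)."""
    pos = sum(a - b for a, b in zip(u, v) if a > b)
    neg = sum(b - a for a, b in zip(u, v) if b > a)
    return max(pos, neg)


def edit_distance(s: str, t: str) -> int:
    """Unit-cost edit distance, keeping only two rows of the table."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


def range_query(db: Sequence[str], query: str, eps: int) -> Tuple[List[str], List[str]]:
    """Filter with the cheap lower bound, then refine with the exact distance."""
    qv = fv1(query)
    # Filter: anything with fd1 > eps cannot have edit distance <= eps.
    candidates = [s for s in db if fd1(fv1(s), qv) <= eps]
    # Refine: compute ED only for the candidates, dropping false positives.
    answers = [s for s in candidates if edit_distance(s, query) <= eps]
    return candidates, answers


# Hypothetical toy database.
db = ["ACTTAGC", "AATGATAC", "GATAATGA", "CCCCCCCC", "AATGAT"]
candidates, answers = range_query(db, "AATGATAG", eps=2)
```

Here "GATAATGA" has the same character counts as the query, so it survives the filter and is only removed in the refinement step; because the filter distance is a lower bound, no true answer is lost.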

1st Step: Partitioning into overlapping windows
AACCGGTTACGTACGT…
– e.g. window size W = 6
– e.g. shift amount = 2
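As a rough sketch, the windowing step can be written as a slice over the sequence. The function name is illustrative, and dropping a trailing partial window is an assumption; the slide does not say how the end of the sequence is handled.

```python
from typing import List


def overlapping_windows(seq: str, w: int, shift: int) -> List[str]:
    """Cut seq into windows of length w, starting every `shift` characters.

    Assumption: a trailing piece shorter than w is dropped.
    """
    return [seq[i:i + w] for i in range(0, len(seq) - w + 1, shift)]


windows = overlapping_windows("AACCGGTTACGTACGT", w=6, shift=2)
```

With W = 6 and a shift of 2, consecutive windows share four characters, so the sequence shown on the slide yields six windows.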

2nd Step: Mapping windows into vector space
– Choose a tuple size k
– Associate an integer with each of the 4^k k-tuples
– The frequencies of those k-tuples form the vector
– If k = 2 → 4^k = 16 k-tuples
  AA, AC, AG, AT, CA, CC, CG, CT
  TA, TC, TG, TT, GA, GC, GG, GT

Example Mapping
– The integers assigned
  AA=0, AC=1, AG=2, AT=3, CA=4, CC=5, CG=6, CT=7
  TA=8, TC=9, TG=10, TT=11, GA=12, GC=13, GG=14, GT=15
– Assume the window AACCGG
  – AA, AC, CC, CG, GG all occur once
  – (1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0) is the matching vector
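The mapping for k = 2 can be sketched with a lookup table that follows the integer assignment listed on the slide. The names `TUPLES`, `CODE` and `fv2` are illustrative, and the sketch assumes the window contains only the characters A, C, G and T.

```python
from typing import Dict, List

# 2-tuples in the order used on the slide (AA=0, ..., CT=7, TA=8, ..., GT=15).
TUPLES: List[str] = [
    "AA", "AC", "AG", "AT", "CA", "CC", "CG", "CT",
    "TA", "TC", "TG", "TT", "GA", "GC", "GG", "GT",
]
CODE: Dict[str, int] = {t: i for i, t in enumerate(TUPLES)}


def fv2(window: str) -> List[int]:
    """Frequency vector of the overlapping 2-tuples in a window."""
    vec = [0] * len(TUPLES)
    for i in range(len(window) - 1):
        vec[CODE[window[i:i + 2]]] += 1
    return vec


vector = fv2("AACCGG")
```

For the window AACCGG the tuples AA, AC, CC, CG and GG each occur once, so the components 0, 1, 5, 6 and 14 are 1 and the rest are 0.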

Different transformations & Distance Functions
– Tuple size → transformation size
  – 1 → 4 (frequencies of A, C, G, T): FV1
  – 2 → 16 (frequencies of 2-tuples): FV2

Different transformations & Distance Functions 2
– WVn transformation
  – Split the string into halves x, y
  – Compute the FVn of x and of y → FVx, FVy
  – Concatenate their sum and their difference: [FVx + FVy, FVx - FVy]
– Wavelet 1 on an example: TCACTTAG
  – 1st: divide into halves and find the FV1 transformation (counts in A, C, G, T order)
    x: TCAC → (1, 2, 0, 1)    y: TTAG → (1, 0, 1, 2)
  – 2nd: add and subtract
    WV1 = (2, 2, 1, 3, 0, 2, -1, -1)
– The same operations on 2-tuples give WV2
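A minimal sketch of the WV1 transformation, assuming the counts are kept in A, C, G, T order and that an odd-length string is split with the shorter half first (the slide only shows an even-length example). The function names are illustrative.

```python
from collections import Counter
from typing import List

ALPHABET = "ACGT"  # assumed component order


def fv1(s: str) -> List[int]:
    """Character-frequency vector of s."""
    counts = Counter(s)
    return [counts[ch] for ch in ALPHABET]


def wv1(s: str) -> List[int]:
    """Sum of the two halves' frequency vectors, followed by their difference."""
    half = len(s) // 2
    x, y = fv1(s[:half]), fv1(s[half:])
    sums = [a + b for a, b in zip(x, y)]
    diffs = [a - b for a, b in zip(x, y)]
    return sums + diffs


wavelet = wv1("TCACTTAG")
```

The first four components equal the FV1 of the whole string, and the last four record how the characters are distributed between the two halves.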

Distance Functions on the Vector Spaces
– All of them are proved to be lower bounds on the edit distance
– FD1 → distance on FV1
– FD2 → distance on FV2
– WD1 → distance on WV1
– WD2 → distance on WV2

Frequency Distance FDn

Algorithm: FDn(n-gram frequencies u, v)
  posDist := negDist := 0
  for all dimensions u_i, v_i
    – if u_i > v_i then posDist += u_i - v_i
    – else negDist += v_i - u_i
  return max(posDist, negDist) / n

Example (n = 1)
  u: ACTTAGC → (2, 2, 1, 2)
  v: AATGATAG → (4, 0, 2, 2)
  – 2 - 4 < 0: negDist += |2 - 4|
  – 2 - 0 > 0: posDist += |2 - 0|
  – 1 - 2 < 0: negDist += |1 - 2|
  – 2 - 2 = 0
  posDist: 2, negDist: 3, so FD1 is 3
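The frequency distance can be sketched as below. It accumulates the positive and negative differences, as the worked example on the slide does; the function name and the plain-list vector representation are assumptions.

```python
from typing import Sequence


def frequency_distance(u: Sequence[int], v: Sequence[int], n: int = 1) -> float:
    """FDn between two n-gram frequency vectors of equal dimension."""
    pos_dist = 0  # total amount by which u exceeds v
    neg_dist = 0  # total amount by which v exceeds u
    for ui, vi in zip(u, v):
        if ui > vi:
            pos_dist += ui - vi
        else:
            neg_dist += vi - ui
    return max(pos_dist, neg_dist) / n


# Frequency vectors of ACTTAGC and AATGATAG in A, C, G, T order (from the slide).
u = [2, 2, 1, 2]
v = [4, 0, 2, 2]
fd1 = frequency_distance(u, v, n=1)
```

For this pair the result is 3, which does not exceed the edit distance of 4 given for the same strings on the edit distance slide.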

FDn: why is it a lower bound?
– On the example
  – Need to increase A by 2 and G by 1 → 3
  – Need to decrease C by 2
– An increase and a decrease can be done in one operation when we can replace (see the Measuring similarity slide)
– So in the best case the edit distance is only FD1
– That may not be the case: more operations may be needed because the character positions do not line up
– The division by n is because changing one character updates the frequencies of n n-grams

Wavelet Distance WDn

Algorithm: WDn(n-gram frequency wavelets u, v)
  find posDist and negDist on u, v
  m := min(posDist, negDist)
  d := (posDist - negDist) / 2
  if m < d
    – return d / n
  else
    – return (d + (m - d)/2) / n

Example (n = 1)
  u: ACTC TAGC → (2, 3, 1, 2, 0, 1, -1, 0)
  v: AATG ATAG → (4, 0, 2, 2, 0, 0, 0, 0)
  posDist: 3 + 1 = 4
  negDist: 2 + 1 + 1 = 4
  m: 4, d: 0
  (0 + 4/2) / 1 → return 2
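A direct transcription of the slide's wavelet distance pseudocode, offered only as a sketch. Two things are assumptions: posDist and negDist are accumulated over all wavelet components in the same way as for the frequency distance, and the difference in `d` is taken as an absolute value so that the result does not depend on argument order. The slide's example has posDist = negDist, so it is unaffected by the second assumption.

```python
from typing import Sequence


def wavelet_distance(u: Sequence[int], v: Sequence[int], n: int = 1) -> float:
    """WDn between two wavelet vectors, following the slide's pseudocode."""
    pos_dist = sum(a - b for a, b in zip(u, v) if a > b)
    neg_dist = sum(b - a for a, b in zip(u, v) if b > a)
    m = min(pos_dist, neg_dist)
    d = abs(pos_dist - neg_dist) / 2  # absolute value is an assumption
    if m < d:
        return d / n
    return (d + (m - d) / 2) / n


# WV1 vectors of ACTCTAGC and AATGATAG (sums, then differences, in A, C, G, T order).
u = [2, 3, 1, 2, 0, 1, -1, 0]
v = [4, 0, 2, 2, 0, 0, 0, 0]
wd1 = wavelet_distance(u, v, n=1)
```

For the slide's pair, posDist and negDist are both 4, so m = 4, d = 0 and the function returns 2.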

WDn: why is it a lower bound?
– Assume a string is transformed into the wavelet [a_1, …, a_d, b_1, …, b_d]
– Largest change: posDist += 3, negDist -= 1, or vice versa
  – So use this change whenever posDist ≠ negDist

Overview
– Background on queries
– Current problem
– Transformation and indexing
– Comparison experiments and results
– Future work

Experiment Design
– Implemented the transformations and distance functions
– Evaluated experimentally, on real genetic data, their pruning efficiency on ε-range queries and their approximation efficiency on k-NN queries
– Ran queries with different parameters
  – Varying string size W and shift amount
  – Some containing an exact match, some not
  – For ε-range queries, different ε values
  – For k-NN queries, different k values


Sorted Graphs
– Meant to show why our distance functions perform so well on k-NN queries
– Imitate what our k-NN approximation does, and graph the result
  – It sorts the data values in increasing order and takes the k nearest ones

[Figures: sorted graphs, two slides, each with a "… nearest" panel and a "50 nearest" panel]

Nature of the distance functions
– WD2 performs very well on k-NN queries even though it does not prune as well
  – The variance of its ratio to the edit distance is much lower than that of the others, which is what you would like from a distance function


Results
– Tested the parameters obtained from these random experiments on real data
– Then also repeated the parameter extraction using real data

Comparison of index structures

Future Work
– Check the applicability of these methods to other kinds of sequence data
  – Text
  – Image search
– Implement the index structure in the standalone program and evaluate its performance