Dr. N. MamoulisAdvanced Database Technologies1 Topic 7: Strings and Biological Data In some applications we store, search and analyze long sequences of.

Slides:

Advertisements

Similar presentations

Indexing DNA Sequences Using q-Grams

Advertisements

Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Part C Part A:  Index Definition in SQL  Ordered Indices  Index Sequential.

Quick Review of Apr 10 material B+-Tree File Organization –similar to B+-tree index –leaf nodes store records, not pointers to records stored in an original.

©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree.

Parsimony based phylogenetic trees Sushmita Roy BMI/CS 576 Sep 30 th, 2014.

Quick Sort, Shell Sort, Counting Sort, Radix Sort AND Bucket Sort

Bar Ilan University And Georgia Tech Artistic Consultant: Aviya Amir.

Image Indexing and Retrieval using Moment Invariants Imran Ahmad School of Computer Science University of Windsor – Canada.

Genome-scale disk-based suffix tree indexing Benjarath Phoophakdee Mohammed J. Zaki Compiled by: Amit Mahajan Chaitra Venus.

1 Suffix Trees and Suffix Arrays Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto Addison-Wesley, (Chapter 8)

Combinatorial Pattern Matching CS 466 Saurabh Sinha.

Refining Edits and Alignments Υλικό βασισμένο στο κεφάλαιο 12 του βιβλίου: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University.

Sabegh Singh Virdi ASC Processor Group Computer Science Department

Indexing Text with Approximate q-grams Adriano Galati & Marjolijn Elsinga.

CS Data Structures Chapter 10 Search Structures (Selected Topics)

Modern Information Retrieval

Heuristic alignment algorithms and cost matrices

Searching strings using the waves An efficient index structure for string databases Ingmar Brouns Jacob Kleerekoper.

Sequence Alignment Variations Computing alignments using only O(m) space rather than O(mn) space. Computing alignments with bounded difference Exclusion.

This material in not in your text (except as exercises) Sequence Comparisons –Problems in molecular biology involve finding the minimum number of edit.

Obtaining Provably Good Performance from Suffix Trees in Secondary Storage Pang Ko & Srinivas Aluru Department of Electrical and Computer Engineering Iowa.

Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases O. Ozturk and H. Ferhatosmanoglu. IEEE International Symp. on Bioinformatics.

B + -Trees (Part 1). Motivation AVL tree with N nodes is an excellent data structure for searching, indexing, etc. –The Big-Oh analysis shows most operations.

B + -Trees (Part 1) COMP171. Slide 2 Main and secondary memories  Secondary storage device is much, much slower than the main RAM  Pages and blocks.

1 Indexing Structures for Files. 2 Basic Concepts  Indexing mechanisms used to speed up access to desired data without having to scan entire.

Hashing General idea: Get a large array

Fa05CSE 182 CSE182-L5: Scoring matrices Dictionary Matching.

Fast Subsequence Matching in Time-Series Databases Christos Faloutsos M. Ranganathan Yannis Manolopoulos Department of Computer Science and ISR University.

Database Index to Large Biological Sequences Ela Hunt, Malcolm P. Atkinson, and Robert W. Irving Proceedings of the 27th VLDB Conference,2001 Presented.

Exact Indexing of Dynamic Time Warping

Vakhitov Alexander Approximate Text Indexing. Using simple mathematical arguments the matching probabilities in the suffix tree are bound and by a clever.

Space-Efficient Sequence Alignment Space-Efficient Sequence Alignment Bioinformatics 202 University of California, San Diego Lecture Notes No. 7 Dr. Pavel.

ICS 220 – Data Structures and Algorithms Week 7 Dr. Ken Cosh.

ALGORITHMS FOR ISNE DR. KENNETH COSH WEEK 6.

CSCE350 Algorithms and Data Structure Lecture 17 Jianjun Hu Department of Computer Science and Engineering University of South Carolina

Analysis of Algorithms

1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester

CS Data Structures Chapter 10 Search Structures.

A Query Adaptive Data Structure for Efficient Indexing of Time Series Databases Presented by Stavros Papadopoulos.

Reference-Based Indexing of Sequence Databases (VLDB ’ 06) Jayendra Venkateswaran Deepak Lachwani Tamer Kahveci Christopher Jermaine Presented by Angela.

1 An Efficient Index Structure for String Databases Tamer Kahveci Ambuj K. Singh Department of Computer Science University of California Santa Barbara.

CSC 211 Data Structures Lecture 13

CMSC 341 B- Trees D. Frey with apologies to Tom Anastasio.

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.

Sequence Comparison Algorithms Ellen Walker Bioinformatics Hiram College.

Similarity Searching in High Dimensions via Hashing Paper by: Aristides Gionis, Poitr Indyk, Rajeev Motwani.

Chapter 18: Searching and Sorting Algorithms. Objectives In this chapter, you will: Learn the various search algorithms Implement sequential and binary.

Event retrieval in large video collections with circulant temporal encoding CVPR 2013 Oral.

Exact indexing of Dynamic Time Warping

Joint Advanced Student School Compressed Suffix Arrays Compression of Suffix Arrays to linear size Fabian Pache.

Fast Query-Optimized Kernel Machine Classification Via Incremental Approximate Nearest Support Vectors by Dennis DeCoste and Dominic Mazzoni International.

An Efficient Index Structure for String Databases Tamer Kahveci Ambuj K. Singh Presented By Atul Ugalmugale/Nikita Rasam 1.

Mismatch String Kernals for SVM Protein Classification Christina Leslie, Eleazar Eskin, Jason Weston, William Stafford Noble Presented by Pradeep Anand.

8/3/2007CMSC 341 BTrees1 CMSC 341 B- Trees D. Frey with apologies to Tom Anastasio.

Chapter 15 Running Time Analysis. Topics Orders of Magnitude and Big-Oh Notation Running Time Analysis of Algorithms –Counting Statements –Evaluating.

4.2 - Algorithms Sébastien Lemieux Elitra Canada Ltd.

Database Applications (15-415) DBMS Internals- Part III Lecture 13, March 06, 2016 Mohammad Hammoud.

Advanced Data Structures Lecture 8 Mingmin Xie. Agenda Overview Trie Suffix Tree Suffix Array, LCP Construction Applications.

Fast Subsequence Matching in Time-Series Databases.

Spatial Data Management

A database index to large biological sequences

Query Processing in Databases Dr. M. Gavrilova

B- Trees D. Frey with apologies to Tom Anastasio

Chapter 3 Brute Force Copyright © 2007 Pearson Addison-Wesley. All rights reserved.

CSE 589 Applied Algorithms Spring 1999

3. Brute Force Selection sort Brute-Force string matching

3. Brute Force Selection sort Brute-Force string matching

3. Brute Force Selection sort Brute-Force string matching

Presentation transcript:

Dr. N. MamoulisAdvanced Database Technologies1 Topic 7: Strings and Biological Data In some applications we store, search and analyze long sequences of discrete characters, which we call “strings” Typical Applications are Text Retrieval, Computational Biology, Signal Processing, etc. Queries on string sequences often allow errors in matches. Therefore an interesting and challenging subject is approximate string matching

Dr. N. MamoulisAdvanced Database Technologies2 Application 1: Information Retrieval A typical application of information retrieval is text searching; given a large collection of documents and some text keywords we want to find the documents which contain these keywords. In many cases search should allow errors. For example text collections digitized by OCR contain a large percentage of errors (7-16%). In addition, there could also be typing or spelling errors in documents. Therefore there are two research directions in text retrieval: exact search and approximate search. Naturally, approximate search is more difficult and expensive than exact search.

Dr. N. MamoulisAdvanced Database Technologies3 Application 2: Computational Biology The problem is similar in computational biology; here we have a long DNA sequence and we want to find subsequences in it that match approximately a query sequence. DNA and protein sequences can be seen as long texts over specific alphabets (e.g., {A,C,G,T}). Those sequences represent the genetic code of living beings. The similarity of two DNA substrings from different organisms may correspond to the same functional or physical relationship between these organisms. Exact searching here is of little use, since the query patterns rarely match the text exactly; the correct results may have small differences due to mutations and evolutionary alternations.

Dr. N. MamoulisAdvanced Database Technologies4 Application 3: Signal Processing In Speech Recognition the general problem is to determine a textual message from a transmitted audio signal. The signal could be compressed, or some words may not be pronounced well, so approximate matching is used. Another related problem is error correction. Compression introduces errors in the transmitted signals, so approximate search is needed for correcting these errors.

Dr. N. MamoulisAdvanced Database Technologies5 Similarity Metrics The similarity metric between two strings is typically dependent on the application and does not allow for general-purpose solutions. The most widely accepted similarity metric is the “edit distance”. The edit distance between two strings is defined by the number of primitive operations (insert, delete, replace) necessary to transform one string to the other

Dr. N. MamoulisAdvanced Database Technologies6 What is the edit distance between “survey” and “surgery”? Example of edit distance s u r v e y s u r g e y replace (+1) s u r g e r y insert (+1) Edit distance = 2

Dr. N. MamoulisAdvanced Database Technologies7 In the general version of edit distance, different operations may have different costs, or the costs depend on the characters involved. For example replacement could be more expensive than insertion, or replacing “a” with “o” could be less expensive than replacing “a” with “k”. The general edit distance is powerful enough for a wide range of applications therefore most query processing algorithms consider it as a standard. The general edit distance does not satisfy the triangular inequality and thus it is not a metric. Edit Distance (cont’d)

Dr. N. MamoulisAdvanced Database Technologies8 Example of generic edit distance: String Alignment In biological string matching the similarity metric is often called “string alignment” Let S and T be strings. An alignment maps S and T into string S' and T' by inserting spaces into S and T, such that |S'|=|T'|. S=acgcaggtc T=agcgtc acgcaggtc ||||||||| ag_cg__tc acgcaggtc ||||||||| a_gc__gtc Example: optimal alignment=distance

Dr. N. MamoulisAdvanced Database Technologies9 Other Similarity Metrics Edit distance with transpositions (e.g., ab  ba) LCS distance (allows only insertions/deletions) Hamming distance (allows only substitutions – only when |s1|=|s2|) Episode distance (allows only insertions – not symmetric) Reversals (allows reversing substribgs) Block distance (allows rearrangement and permutation of substrings) q-gram distance (based on finding common substrings of fixed length q).

Dr. N. MamoulisAdvanced Database Technologies10 Definition of the basic approximate search problem Given a query substring q and a long sequence t, find all substrings in t which are similar to q given a distance metric. Two versions of the problem: 1. We are given a similarity threshold k and ask for all substrings within this distance from q 2. We are given a number k and we ask for the k- Nearest Neighbors In a more general problem version, the long sequences t could be more than one.

Dr. N. MamoulisAdvanced Database Technologies11 A dynamic progamming algorithm for computing the edit distance Problem: find the edit distance between strings x and y. Create a (|x|+1)  (|y|+1) matrix C, where C i,j represents the minimum number of operations to match x 1..i with y 1..j. The matrix is constructed as follows. C i,0 = i C 0,j = j C i,j =  C i-1,C j-1 if x i =y i,  1+min(C i-1,C j, C i,C j-1, C i-1,C j-1 ), else.

Dr. N. MamoulisAdvanced Database Technologies12 surgery s u r v e y Example: Optimal alignment

Dr. N. MamoulisAdvanced Database Technologies13 How do we perform substring search? The same dynamic programming algorithm can be used to find the most similar substrings of a query sting q. The difference is that we set C 0,j =0 for all j, since any text position could be the potential start of a match. If the similarity distance bound is k, we report all positions, where C m ≤k (m is the last row – m = |q|).

Dr. N. MamoulisAdvanced Database Technologies14 surgery s u r v e y Example: q=survey, t=surgery, k=

Dr. N. MamoulisAdvanced Database Technologies15 Comments on the dynamic programming algorithm Observe that essentially it is the same algorithm used for Dynamic Time Warping. It is very flexible, since it can be used to compute most distance metrics under the “generic edit distance” definition. It also requires only O(|q|) space (only the previous column is necessary to compute the next one). Thus a single scan of t suffices. However, it is not efficient. The worst-case complexity is O(|q||t|). Several variations have been proposed to reduce its complexity.

Dr. N. MamoulisAdvanced Database Technologies16 Using diagonals of the matrix to compute the edit distance faster Improved versions of the basic dynamic programming algorithm are based on the observation that the diagonals of the matrix are monotonically increasing. These are called “diagonal transition” algorithms. The dynamic programming matrix is computed diagonal-wise instead of column- wise. The key to fast computation is to find in constant time where each stroke (i.e., diagonal sequence with the same error)

Dr. N. MamoulisAdvanced Database Technologies17 surgery s u r v e y Example: q=survey, t=surgery, k= Strokes with error e≤k, ending at |q|th row indicate results

Dr. N. MamoulisAdvanced Database Technologies18 Comments on the diagonal approach The complexity is O(|t|k), since there are O(|t|) diagonals and at which we compute at most k strokes. We can use the ending of previous strokes to detect the beginning of new ones The hard part is to find the length of the next stroke in constant time. This is equivalent to finding the longest prefix of the remainder of q that matches t. It can be computed by building a suffix tree for tq

Dr. N. MamoulisAdvanced Database Technologies19 Comments on the diagonal approach (cont’d) Several improvements on this method have been proposed. However still the algorithm has a relatively high complexity and needs to scan the whole string t. Thus the question is whether we can avoid scanning the whole database for each query.

Dr. N. MamoulisAdvanced Database Technologies20 Filtering algorithms for approximate string matching They do not provide solutions, but aim to reduce the search effort. They are based on the idea that it is most probable for a text area not to match a query rather than to match it. They work in two steps. First a cheap heuristic is used to determine whether an area of the text t could match with the query q. If not search is abandoned for this area, otherwise an expensive search algorithm (i.e., dynamic programming) is applied

Dr. N. MamoulisAdvanced Database Technologies21 Dynamic Filtering Scans the text t linearly and at each position finds the longest substring that is included in q. Thus the text is transformed to a sequence of substrings in q, separated by non-matching characters. Each time an area of k+1 substrings is examined. If this area is shorter than |q|-k, search is abandoned for it, otherwise dynamic programming is applied

Dr. N. MamoulisAdvanced Database Technologies22 Dynamic Filtering Example (k=2) q: agatacat t: gattacgggaaggtttac Longer than |q|-k : we have to examine it Shorter than |q|-k : we do not have to examine it

Dr. N. MamoulisAdvanced Database Technologies23 A similar method is based on q-grams All substrings (q-grams) of a specific length m in q are computed. For each area of the text the number of q-grams that occur there are computed. If this number is smaller than (|q|-m+1- km) the text area is pruned, otherwise dynamic programming is applied.

Dr. N. MamoulisAdvanced Database Technologies24 A sampling method A sample of non-overlaping substrings of q is computed so that they have a fixed gap between them. If a text area does not contain any of the samples, it is pruned Otherwise the neighborhood around the found sample is verified. q: agatacat t: gattacgggaagatttac samples sample found

Dr. N. MamoulisAdvanced Database Technologies25 Sub-linear algorithms In order to avoid applying the above methods at every position of the text, the text is split to blocks of (|q|-k)/2 size and substring checking is applied only at the beginning of each block. Blocks should be entirely contained in results, so if a block does not qualify, we immediately move to the next one. q: agatacatacagatat t: gattacgggaaggtttac discarded block move to next

Dr. N. MamoulisAdvanced Database Technologies26 Comments on the methods discussed so far They mainly focus on improving worst- case theoretical bounds In general, they require a linear scan of the database, so they are not attractive for very large problems Recently, there are efforts from DB researchers to make these methods more scalable

Dr. N. MamoulisAdvanced Database Technologies27 An adaptation of the suffix tree for secondary memory [Hunt et al., 2001] Suffix trees are main memory structures used for exact substring search. Algorithms for performing approximate search on suffix trees are also available. However, there is no version of the suffix tree for secondary memory. A recent paper tries to solve this hard problem.

Dr. N. MamoulisAdvanced Database Technologies28 Example of a suffix trie S=A C A T C T T A String Suffix trie

Dr. N. MamoulisAdvanced Database Technologies29 Example of a suffix tree S=A C A T C T T A String Suffix tree The suffix tree is a suffix trie, where unary paths are compressed Numbers showing the substring indexed by a node are added $ Special termination characters are added Suffix links connect node indexing aw to node indexing w

Dr. N. MamoulisAdvanced Database Technologies30 Suffix links Suffix links were introduced to improve the construction cost of the suffix tree to O(n). However they transform the tree to a graph, that cannot be managed in memory unless small. If we rely on the virtual memory of the machine a lot of swapping takes place and the construction algorithm is very slow. Biological sequences are very long (with millions or trillions of characters), thus we cannot use the suffix-link algorithm to construct them.

Dr. N. MamoulisAdvanced Database Technologies31 A new method for Suffix tree construction is proposed Suffix links are abandoned. The algorithm scans the string multiple times in order to construct the suffix tree for a subrange of suffixes in each pass. Abandoning suffix links means that the worst- case construction cost becomes O(n 2 ), but due to the pseudo-random nature of DNA, the average behavior is O(nlogn)

Dr. N. MamoulisAdvanced Database Technologies32 Description of the algorithm First the number of partitions is determined. If the expected size of the full suffix tree is S and the available memory is M, the number of partitions is given by  S/M . Each partition corresponds to a set of suffixes. For each partition we build a separate suffix tree. These suffix trees are then connected via a common root.

Dr. N. MamoulisAdvanced Database Technologies33 Description of the algorithm (cont’d) The second step is to assign suffixes to partitions The partition for a suffix is determined upon its prefix. For example, suffix ATCTTA is indexed in the subtree following branches AT. Notice that these subtrees are independent. Thus the suffixes in the same group have the same prefix.

Dr. N. MamoulisAdvanced Database Technologies34 Description of the algorithm (cont’d) There are two ways to determine the partitions. We scan the string once and count the occurencies of every 3-character substring. Then split them in groups with equal cardinality We just partition lexicographically all 3-character strings, since we expect them to appear with equal probability in DNA AAA32 AAC42 AAG23 AAT76 ACA43 ACC TTT P1P1 P2P2

Dr. N. MamoulisAdvanced Database Technologies35 Description of the algorithm (cont’d) Common root …. Suffix tree for P 1 Suffix tree for P 2 Suffix tree for P K

Dr. N. MamoulisAdvanced Database Technologies36 Description of the algorithm (cont’d) For each partition j Initialize the suffix tree that corresponds to this partition. Then scan the string. For each position i of the string, if the suffix s i is in partition j, insert the suffix to the subtree which corresponds to this partition Thus one suffix-tree is created at each pass and these are stored independently on disk

Dr. N. MamoulisAdvanced Database Technologies37 Example of Suffix tree creation Create a suffix tree for AGA The insertion order of suffixes is AGA$, GA$,A$ root GA $ AGA $

Dr. N. MamoulisAdvanced Database Technologies38 Example of Suffix tree creation Create a suffix tree for AGA The insertion order of suffixes is AGA$, GA$,A$ root GA $ $ A $

Dr. N. MamoulisAdvanced Database Technologies39 Overview of the partitioned suffix tree The partitioned suffix tree is constructed for long sequences at no additional computational cost, despite the abandonment of suffix links. The experiments show that the construction time is linear to the sequence length. However the tree is tested only for exact matches, where only one path is traversed for each query. On the other hand, approximate search needs to combine information from multiple paths and it is questionable whether this method can perform well.

Dr. N. MamoulisAdvanced Database Technologies40 A Filtering Algorithm for a Database of Long Strings A new filtering algorithm for large datasets (that do not fit in memory) was recently proposed. The database S may contain many, potentially long strings S = {s 1,s 2,…,s d } The algorithm transforms the substrings in this database to high dimensional points and indexes them. A query is then applied on the indexes using a lower bound of the edit distance.

Dr. N. MamoulisAdvanced Database Technologies41 Idea: A Transformation and a Lower Bound for the Edit Distance Given an alphabet Σ = {α 1,α 2,...,α d }, we can transform a string to a d-dimensional point, called frequency vector. Example m=4, Σ = {A,C,G,T}. String s=TACTTAG is transformed to f(s)=[2,1,1,3]. The transformation maps strings of the same length n to points on the (n-1) dimensional plane TACTTAG [2,1,1,3] sum(f(s i ))=n

Dr. N. MamoulisAdvanced Database Technologies42 Relation between edit distance and frequency vector Edit operations (insert, delete, replace) has the following effect on the frequency vector: A scalar increases (insert) A scalar decreases (delete) A scalar increases and another decreases (replace) Example: TACTTAGTCCTTAG [2,1,1,3][1,2,1,3]

Dr. N. MamoulisAdvanced Database Technologies43 Neighborhood and the frequency distance Two frequency vectors are called neighbors, if one can be transformed to the other by a single edit operation (e.g., [2,1,1,3] and [1,2,1,3] are neighbors). The frequency distance FD 1 between two vectors is defined by the minimum number of steps in order to go from one to the other by moving to a neighbor point at a time. The frequency distance is a lower bound of the edit distance between the corresponding strings.

Dr. N. MamoulisAdvanced Database Technologies44 Computing the frequency distance Let u, v be two d-dimensional vectors. Let posDist = 0, negDist = 0 For each dimension i If u i >v i posDist += u i -v i ; Else negDist += v i -u i ; Return max(posDist, negDist) Assume w.l.o.g. that posDist>negDist. The rationale is that we can combine each deletion in negDist with an insertion in posDist, and the map the remainder of posDist as an insertion.

Dr. N. MamoulisAdvanced Database Technologies45 Computing the frequency distance (example) s 1 = TACTTAG s 2 = TTAGAG u = [2,1,1,3] v = [2,0,2,2] posDist = sum([0,0,1,0]) = 1 negDist = sum([0,1,0,1]) = 2 FD 1 (u,v) = 2 ED(s 1, s 2 ) = 4

Dr. N. MamoulisAdvanced Database Technologies46 Use of the frequency distance Suppose we want to find all substrings s in t which match query q within edit distance k. If the frequency distance FD 1 (f(s),f(q)) is larger than k we can prune s because the edit distance is also larger than k.

Dr. N. MamoulisAdvanced Database Technologies47 Using Wavelet Transforms to enrich the frequency vector For biological applications, where Σ = {A,C,G,T}, a 4-dimensional vector is too small to capture in detail the contents of the strings. Thus we need to capture also the local frequencies of the characters This can be done by applying a wavelet transformation to the string.

Dr. N. MamoulisAdvanced Database Technologies48 1 st /2 nd Wavelet Coefficients The idea is to break the string s in two parts s 1, s 2 of equal length and compute two vectors; The first vector is f(s). The second f’(s) is computed by array subtraction f(s 1 )-f(s 2 ). Thus f’(s) gives the difference between the frequencies in the left substring and the frequences in the right substring. Then we approximate the string by a new feature vector ψ(s) = [f(s),f’(s)].

Dr. N. MamoulisAdvanced Database Technologies49 Example s = TCACTTAG, f(s) = [2,2,1,3] s 1 = TCAC, s 2 = TTAG f(s 1 ) = [1,2,0,1], f(s 2 ) = [1,0,1,2] f(s’) = f(s 1 ) – f(s 2 ) = [0,2,-1,-1] ψ(s) = [f(s), f(s’)] = [[2,2,1,3],[0,2,-1,-1]]

Dr. N. MamoulisAdvanced Database Technologies50 New Lower Bound The new edit distance lower bound between two strings x and y is then defined by FD(x,y) = max{FD 1 (f(x),f(y)),FD 2 (ψ(x),ψ(y))} FD 1 is the previously defined frequency distance between frequency vectors FD 2 is a more complex distance defined on the wavelet trasforms (details in the paper).

Dr. N. MamoulisAdvanced Database Technologies51 Generalizing the Idea We can generalize the idea to derive k th - level wavelet coefficients ψ k (s), longer vectors and higher accuracy. ψ k (s) divides the string into |s|/2 k substrings and gives the Wavelet Coefficient for each substring, recursively. larger k provides more information, but is more expensive to evaluate.

Dr. N. MamoulisAdvanced Database Technologies52 Example for k=3 s = TCACTTAG ψ(s) = ψ 3 (s) = [[2,2,1,3],[0,2,-1,-1]] TCACTTAG [[1,2,0,1],[-1,0,0,1]]ψ 2 (s)[[1,0,1,2],[-1,0,-1,2]] ψ 1 (s) TCACTTAG [[0,1,0,1],[0,-1,0,1]]…. [[1,0,1,0],[1,0,-1,0]]

Dr. N. MamoulisAdvanced Database Technologies53 The Indexing Method Assume that we want to index a long sequence S. Let 2 a, be the minimum length of a query (a power of 2). We slide a window of length 2 a to each position of S and compute the first two wavelet coefficients to a vector v. Thus each position is indexed by a 2d-dimensional point. The points are then indexed in an R-tree-like index. [2,1,2,2,0,2,-2,-1] [2,1,0,2,0,1,-2,-1] [1,1,2,22,-2,-2,-1] [0,1,2,3,-1,2,0,-1] slide MBR

Dr. N. MamoulisAdvanced Database Technologies54 The Indexing Method (cont’d) We do the same for all window sizes up to a maximum length 2 b. Therefore for each long sequence S a number of MBR-lists for various lengths are created. If we have multiple long sequence, a matrix-like index structure is created, where each row corresponds to a resolution level and each column to an indexed sequence.

Dr. N. MamoulisAdvanced Database Technologies55 The Indexing Method (example) S1S1 T 1,1 T 2,1 T l,1 2a2a 2 a+1 2 a+l-1 S2S2 T 1,2 T 2,2 T l,2 …… SmSm T 1,m T 2,m T l,m …

Dr. N. MamoulisAdvanced Database Technologies56 Query Processing The query is chopped to as large pieces as possible in order to search the lowest-possible row of the index. Example: q=AGAGTATTTACCTG is chopped to AGAGTATT, TACC, and TG The pieces are used to search the corresponding index raw. The qualifying results are merged to get the candidate sequences and then evaluated using dynamic programming.

Dr. N. MamoulisAdvanced Database Technologies57 Comments This index can also be used to evaluate nearest-neighbor queries It performs well experimentally. However: It relies on the quality of the MBRs in the index. It relies on the effectiveness of the wavelet coefficients to approximate the structure It has a high preprocessing cost (i.e., building the index is costly).

Dr. N. MamoulisAdvanced Database Technologies58 Summary There are many methods proposed for approximate string matching. Most are based on the generic edit distance function. Some improve the speed of the dynamic programming method. Some extend exact string search methods (like suffix trees) for approximate search Some apply filtering techniques that avoid expensive comparisons in large parts of the queried sequence. Some apply filtering after transforming the data to an approximate space appropriate for conventional search techniques.

Dr. N. MamoulisAdvanced Database Technologies59 Summary (cont’d) Although there have been many efforts for efficient approximate string matching, there is still room for improvement. This is because of the complex distance functions which make it hard to use traditional search techniques. Biologist are still not (and will never be!) satisfied by computational techniques, since they expect them to be general enough to solve every special search problem they encounter.

Dr. N. MamoulisAdvanced Database Technologies60 References G. Navarro, A Guided Tour to Approximate String Matching, ACM Computing Surveys, 33(1): 31-88, March Ela Hunt, Malcolm P. Atkinson, Robert W. Irving: A Database Index to Large Biological Sequences. VLDB Tamer Kahveci, Ambuj K. Singh: Efficient Index Structures for String Databases, VLDB HKU CSIS7402 (Computer Technology for Bioinformatics) Lecture Notes, Special thanks to Lok Lam Cheng