Vakhitov Alexander: Approximate Text Indexing

Presentation transcript:

Vakhitov Alexander Approximate Text Indexing. Using simple mathematical arguments, the matching probabilities in the suffix tree are bounded, and by a clever division of the search pattern sub-linear time is achieved in “A Hybrid Method for Approximate String Matching” by G. Navarro and R. Baeza-Yates.

The Task The task is to find substrings of the long text T that approximately match our pattern P. For example, we have text T='adbc' and P='abc' (s is the starting position of the substring): adbc = a+d+b+c: insertion of 'd' (s=1); dbc = (a->d)+b+c: replacement of 'a' with 'd' (s=2); bc = (a)+b+c: deletion of 'a' (s=3).

Errors There are 3 kinds of transformations which introduce errors into the initial string: insertion, replacement and deletion. If we transform S into S' with such changes, then we can transform S' back into S with the same number of changes. The minimal number of deletions, insertions and replacements needed to transform string A into string B is called the edit distance between A and B, ed(A,B). Example: ed('abc','adbc')=ed('dbc','abc')=ed('bc','abc')=1; ed('survey','surgery')=2: survey -> surgey (replace 'v' with 'g') -> surgery (insert 'r').

The resulting algorithm The algorithm solves the approximate string matching problem in O(n^λ log n) time (n is the size of the text T, λ ∈ (0,1)), if the error level α < 1 − e/√σ, where σ is the size of the alphabet, e = 2.718..., α = k/m, k is the number of errors, and m is the size of the pattern P.

Plan of the report ● Some useful ideas & basic algorithms ● The main algorithm ● Analysis of the complexity of the algorithm in different cases

Dividing the pattern Lemma: let A and B be strings with ed(A,B) ≤ k, and divide A into j substrings A_i. Then at least one of the A_i appears in B with at most ⌊k/j⌋ errors. Proof idea: we need k changes to transform A into B; each change affects one of the A_i, so the k changes are distributed among the j substrings, and the average number of changes per substring is k/j. Example: ed('he likes','they like')=3=k; A_1='he ', A_2='likes' => j=2; ed('he ','they ')=2; ed('likes','like')=1=⌊k/j⌋.
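As a sanity check, the lemma can be verified directly on the slide's example. Here `ed` is the standard dynamic-programming edit distance (the recurrence appears on a later slide), and `best_ed` is a brute-force helper introduced only for this illustration:

```python
def ed(a, b):
    # Standard dynamic-programming edit distance, kept to one O(m) column.
    m = len(a)
    col = list(range(m + 1))
    for j, cb in enumerate(b, 1):
        prev, col[0] = col[0], j
        for i, ca in enumerate(a, 1):
            prev, col[i] = col[i], min(prev + (ca != cb),
                                       col[i] + 1, col[i - 1] + 1)
    return col[m]

def best_ed(piece, text):
    # Minimal ed between `piece` and any substring of `text` (brute force).
    return min(ed(piece, text[s:s + L])
               for s in range(len(text) + 1)
               for L in range(len(text) - s + 1))

A, B, k, j = "he likes", "they like", 3, 2
pieces = ["he ", "likes"]            # A divided into j = 2 substrings
assert ed(A, B) == k
# At least one piece appears in B with at most floor(k/j) = 1 errors:
assert min(best_ed(p, B) for p in pieces) <= k // j
```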

Computing edit distance Given strings x and y, how do we compute ed(x,y)? Let x = x_1 x_2 ... x_m and y = y_1 y_2 ... y_n, with x_p, y_q ∈ Σ. Define C_{i,j} = ed(x_1..x_i, y_1..y_j); C is the matrix with rows 0..|x| and columns 0..|y|, filled with the C_{i,j}. Computing C_{i,j}: C_{0,j} = j; C_{i,0} = i; C_{i,j} = C_{i-1,j-1} if x_i = y_j, else 1 + min{C_{i-1,j-1}, C_{i-1,j}, C_{i,j-1}}. Example (shown as a figure on the slide): the matrix for x='survey', y='surgery'; green marks cells with x_i ≠ y_j, red marks x_i = y_j, and an arrow shows the element used to compute C_{i,j}.
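The recurrence above transcribes directly into a few lines; a minimal sketch that fills the whole matrix:

```python
def edit_distance_matrix(x, y):
    """Fill C[i][j] = ed(x[:i], y[:j]) using the slide's recurrence."""
    m, n = len(x), len(y)
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        C[i][0] = i                      # C_{i,0} = i
    for j in range(n + 1):
        C[0][j] = j                      # C_{0,j} = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                C[i][j] = C[i - 1][j - 1]        # characters match: no cost
            else:
                C[i][j] = 1 + min(C[i - 1][j - 1],   # replacement
                                  C[i - 1][j],       # deletion from x
                                  C[i][j - 1])       # insertion into x
    return C

C = edit_distance_matrix("survey", "surgery")
assert C[6][7] == 2      # ed('survey','surgery') = 2, as on the slide
```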

Edit distance in the case of text T and pattern P We need to find a substring of the text T which matches P with the minimal number of errors. Let x be the pattern and y the text. The matching text substring can begin at any text position, so we initialize C_{0,j} with 0. The rest is unchanged from the previous task. The algorithm can store only the last column and analyze the text incrementally from the beginning, moving through the matrix column by column and filling it with the C_{i,j}.
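A sketch of this column-at-a-time scan, assuming we only need the end positions and error counts of the matches:

```python
def approx_search(P, T, k):
    """Report (end_position, errors) for text substrings matching P
    with <= k errors. C_{0,j} = 0 for every text position, so a match
    may start anywhere in T; only one column is stored."""
    m = len(P)
    col = list(range(m + 1))          # column for the empty text prefix
    matches = []
    for j, c in enumerate(T, 1):      # process T incrementally
        prev, col[0] = col[0], 0      # C_{0,j} = 0: free start
        for i in range(1, m + 1):
            prev, col[i] = col[i], min(prev + (P[i - 1] != c),
                                       col[i] + 1, col[i - 1] + 1)
        if col[m] <= k:               # last cell: best match ending at j
            matches.append((j, col[m]))
    return matches

# P='abc' occurs in T='adbc' with one error (the inserted 'd'):
assert approx_search("abc", "adbc", 1) == [(4, 1)]
```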

Examples of the matrices

Construction of the NFA We build a Nondeterministic Finite Automaton which searches for text substrings approximately matching the pattern with k errors. It consists of k+1 rows and m+1 columns. Transitions: pattern and text characters are the same (horizontal); insert a character into the pattern (vertical); replace the pattern character with the text one (solid diagonal); delete the pattern character (dashed diagonal).

Nondeterministic Finite Automaton This automaton performs approximate string matching for the pattern 'survey' with 2 errors
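The NFA need not be built explicitly: a standard way to simulate it keeps one bit mask per error row (a Wu-Manber-style bit-parallel simulation; the slides do not name this technique, so treat it as an illustrative assumption rather than the presented method). Each transition type from the slide appears as one term of the update:

```python
def nfa_search(P, T, k):
    """Simulate the (k+1)-row approximate-matching NFA with bit masks.
    Bit i of row d is set iff P[:i+1] matches a suffix of the text read
    so far with <= d errors. Returns end positions of matches."""
    m = len(P)
    B = {}                                    # B[c]: bit i set iff P[i] == c
    for i, c in enumerate(P):
        B[c] = B.get(c, 0) | (1 << i)
    R = [(1 << d) - 1 for d in range(k + 1)]  # row d: first d chars "deleted"
    out = []
    for pos, c in enumerate(T, 1):
        mask = B.get(c, 0)
        old = R[0]
        R[0] = ((R[0] << 1) | 1) & mask       # horizontal: match + self-loop
        for d in range(1, k + 1):
            new = ((((R[d] << 1) | 1) & mask) # horizontal: match
                   | old                      # vertical: insertion
                   | (old << 1)               # solid diagonal: replacement
                   | (R[d - 1] << 1)          # dashed diagonal: deletion
                   | ((1 << d) - 1))          # first d chars deleted
            old, R[d] = R[d], new
        if R[k] & (1 << (m - 1)):             # final state reached
            out.append(pos)
    return out

assert nfa_search("abc", "axc", 1) == [3]     # one replacement
```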

Depth-first search DEF: the “k-neighborhood” is the set of strings that match P with at most k errors: U_k(P) = {x ∈ Σ* : ed(x,P) ≤ k}. Searching for these strings in the text (without errors) would solve the problem, but |U_k(P)| = O(m^k σ^k) is quite large. We can determine which strings from U_k(P) appear in the text by traversing the suffix tree of the text. Here we can use the condensed set U_k^t(P): the neighborhood elements which are not prefixes of other elements.
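For a tiny alphabet the k-neighborhood can be enumerated by brute force, which makes the O(m^k σ^k) growth tangible. The `neighborhood` helper below is purely illustrative (the real algorithm never materializes this set):

```python
from itertools import product

def neighborhood(P, k, alphabet):
    """Brute-force U_k(P): all strings over `alphabet` within edit
    distance k of P. Feasible only for tiny inputs - which is the point:
    |U_k(P)| grows like O(m^k * sigma^k)."""
    def ed(a, b):
        col = list(range(len(a) + 1))
        for j, cb in enumerate(b, 1):
            prev, col[0] = col[0], j
            for i, ca in enumerate(a, 1):
                prev, col[i] = col[i], min(prev + (ca != cb),
                                           col[i] + 1, col[i - 1] + 1)
        return col[len(a)]
    U = set()
    for L in range(max(0, len(P) - k), len(P) + k + 1):
        for s in product(alphabet, repeat=L):
            if ed("".join(s), P) <= k:
                U.add("".join(s))
    return U

U1 = neighborhood("abc", 1, "abc")
assert {"abc", "ab", "abcc", "bbc"} <= U1   # deletion, insertion, replacement
assert "aaa" not in U1                      # ed('aaa','abc') = 2 > 1
```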

Algorithm for searching on the suffix tree ● Starts from the root ● Considers the string x incrementally ● Determines when ed(P,x) ≤ k ● Determines when ed(P,xy) > k for any y

Algorithm for searching on the suffix tree – Each new character of x corresponds to a new column in the matrix (adding a character s ∈ Σ to x updates the column in O(m) time). – A match is detected when the last element of the column is ≤ k. – x cannot be extended to match P when all the values of the last column are > k.

Algorithm for searching on the suffix tree (illustration)
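A minimal sketch of this traversal, using a naive suffix trie in place of a real suffix tree (an implementation shortcut for illustration, not the structure the slides assume):

```python
def suffix_trie(T):
    """Naive suffix trie of T (quadratic size; a real index uses a suffix tree)."""
    root = {}
    for s in range(len(T)):
        node = root
        for c in T[s:]:
            node = node.setdefault(c, {})
    return root

def search_tree(P, root, k):
    """DFS over the trie, carrying one edit-distance column per node.
    Reports strings x with ed(P, x) <= k; stops at a match (deeper
    strings have x as a prefix, as in the condensed neighborhood) and
    prunes when no extension xy can match."""
    m = len(P)
    results = []
    def dfs(node, x, col):
        if col[m] <= k:                    # ed(P, x) <= k: a match
            results.append(x)
            return
        if min(col) > k:                   # ed(P, xy) > k for every y: prune
            return
        for c, child in node.items():
            new = [col[0] + 1]             # next column, O(m) per character
            for i in range(1, m + 1):
                new.append(min(col[i - 1] + (P[i - 1] != c),
                               col[i] + 1, new[i - 1] + 1))
            dfs(child, x + c, new)
    dfs(root, "", list(range(m + 1)))
    return results

root = suffix_trie("adbc")
assert set(search_tree("abc", root, 1)) == {"adbc", "dbc", "bc"}
```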

Partitioning the pattern The cost of the suffix tree search is exponential in m and k, so it is better to perform j searches with patterns of length m/j and k/j errors; that is why we divide the pattern. So we divide our pattern into j pieces and search for them using the above algorithm. Then, for each match found ending at text position i, we verify the text area [i-m-k..i+m+k]. But the larger j is, the more text positions need to be verified; the optimal j is determined in the analysis that follows.
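A sketch of the partitioning scheme for the special case j = k+1, where ⌊k/j⌋ = 0 and every piece can be searched exactly; `str.find` stands in for the suffix-tree piece search, and the verification uses the column-wise dynamic programming from the earlier slides:

```python
def partition_search(P, T, k):
    """Pattern-partitioning filter with j = k+1 pieces: each piece must
    occur with 0 errors, and the area [i-m-k .. i+m+k] around every
    occurrence is verified by dynamic programming."""
    m, j = len(P), k + 1
    # split P into j contiguous pieces of nearly equal length
    q, r = divmod(m, j)
    pieces, pos = [], 0
    for i in range(j):
        L = q + (1 if i < r else 0)
        pieces.append(P[pos:pos + L])
        pos += L
    # exact search of every piece yields candidate end positions
    candidates = set()
    for piece in pieces:
        s = T.find(piece)
        while s != -1:
            candidates.add(s + len(piece))
            s = T.find(piece, s + 1)
    # verify the text area around each candidate
    hits = set()
    for i in candidates:
        lo, hi = max(0, i - m - k), min(len(T), i + m + k)
        col = list(range(m + 1))
        for off, c in enumerate(T[lo:hi], 1):
            prev, col[0] = col[0], 0
            for p in range(1, m + 1):
                prev, col[p] = col[p], min(prev + (P[p - 1] != c),
                                           col[p] + 1, col[p - 1] + 1)
            if col[m] <= k:
                hits.add(lo + off)       # 1-based end position of a match
    return sorted(hits)

assert partition_search("abc", "xxadbcxx", 1) == [6]
```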

Searching pieces of the pattern Let's use the NFA with the depth-first search (DFS) technique (the suffixes from the suffix tree will be the input of the automaton). First, we transform our NFA: – the initial self-loop isn't needed (it allowed us earlier to start matching at every position of the text); – we remove the lower-left triangle of the automaton, because we avoid initial insertions into the pattern; – we can start matching only with the k+1 first pattern characters.

The changes to NFA

Using a suffix array instead of the suffix tree The suffix array can replace the suffix tree in our algorithm. It has lower space requirements, but the time complexity must be multiplied by log n. The suffix array replaces nodes with intervals, and traversing to a node becomes going to an interval. If a node has children, then the node's interval contains the children's intervals.
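A minimal illustration of nodes-as-intervals, with a naive suffix array and binary search standing in for the real construction (the `'\xff'` sentinel is an assumption: it requires ordinary characters to sort below it):

```python
from bisect import bisect_left, bisect_right

def build_suffix_array(T):
    """Naive suffix array: suffix start positions in lexicographic order."""
    return sorted(range(len(T)), key=lambda i: T[i:])

def interval(T, sa, prefix):
    """The contiguous interval of sa whose suffixes start with `prefix`;
    this interval plays the role of a suffix-tree node."""
    suffixes = [T[i:] for i in sa]        # materialized for clarity only
    lo = bisect_left(suffixes, prefix)
    hi = bisect_right(suffixes, prefix + "\xff")
    return lo, hi

T = "adbc"
sa = build_suffix_array(T)                # ['adbc', 'bc', 'c', 'dbc']
lo, hi = interval(T, sa, "b")
assert [T[i:] for i in sa[lo:hi]] == ["bc"]
```

Descending from a node to a child narrows the prefix by one character, which narrows the interval; each narrowing costs a binary search, hence the extra log n factor.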

Analysis for the algorithm: the average number of nodes at level l ● For a small l, all the text suffixes (except the last l) are longer than l, so nearly n suffixes reach level l; ● The maximum number of nodes at level l is σ^l, where σ = |Σ|; ● We use the model of n balls randomly thrown into σ^l urns. The average number of filled urns is σ^l (1-(1-1/σ^l)^n) = σ^l (1-e^{-Θ(n/σ^l)}) = Θ(min{n, σ^l})
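The urn formula can be checked numerically at both extremes; `avg_filled_urns` is an illustrative helper, not part of the algorithm:

```python
def avg_filled_urns(n, sigma, l):
    """Expected number of distinct urns hit when n balls are thrown
    uniformly into sigma**l urns: sigma^l * (1 - (1 - 1/sigma^l)^n)."""
    u = sigma ** l
    return u * (1 - (1 - 1 / u) ** n)

# The expectation behaves like min(n, sigma^l) at both extremes:
n, sigma = 10_000, 4
assert avg_filled_urns(n, sigma, 3) > 0.9 * min(n, sigma ** 3)    # few urns: nearly all filled
assert avg_filled_urns(n, sigma, 10) > 0.9 * min(n, sigma ** 10)  # many urns: about n filled
```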

Probability of processing a given node at depth l in the suffix tree. If l ≤ m', at least l-k text characters must match the pattern (m' is the pattern size), and if l ≥ m', at least m'-k pattern characters must match the text. We sum the matching probabilities over the different pattern prefixes. The largest term of the 1st sum is the first one, C(l,k)² / σ^{l-k}, and by using Stirling's approximation we have:

Probability of processing a given node at depth l in the suffix tree...which is: γ(α)^l O(1/l), where γ(α) = 1/(σ^{1-α} α^{2α} (1-α)^{2(1-α)}). The whole first summation is bounded by l-k times its largest term, so we get (l-k) γ(α)^l O(1/l) = O(γ(α)^l). The first summation decreases exponentially if γ(α) < 1. This means that σ > e²/(1-α)² (because e^{-1} ≤ α^{α/(1-α)} for α ∈ [0,1]), where α = k/l.

Probability of processing a given node at depth l in the suffix tree...or, equivalently, α < 1 - e/√σ. The second summation can also be bounded by this O(γ(α)^l). So the upper bound for the probability of processing a given node at depth l in the suffix tree is O(γ(α)^l). In practice, e should be replaced by c = 1.09 (determined experimentally), because we have only derived an upper bound on the probability.

Analysis of the single pattern search in the suffix tree Using the formulas bounding the probability of matching, we consider that at levels l ≤ L(k) all the nodes are visited, while nodes at level l > L(k) are visited with probability O(γ(k/l)^l). Remember that the average number of visited nodes at level l (for small l) is Θ(min{n, σ^l}).

Three cases of analysis

The cases of analysis (a) L(k) ≥ log_σ n, n ≤ σ^{L(k)}: “small n”, online search preferable, no index needed (since the total work is Θ(n)); (b) m+k ≤ log_σ n: “large n”, the total cost O(σ^{m+k}) is independent of n; (c) L(k) ≤ log_σ n ≤ m+k: “intermediate n”, time sublinear in n.

Analysis of pattern partitioning We need to perform j searches and then verify all the candidate matches. We again distinguish three cases, matching the previous slide: (a) L(k/j) ≥ log_σ n: “small n”, complexity O(n); (b) (m+k)/j ≤ log_σ n: if the error level is α < 1 - e/√σ, the complexity is O(n^λ) with λ < 1, sublinear in n (using j = (m+k)/log_σ n); (c) with the same error level as in (b) and the same j, we also get sublinear complexity.

Other types of algorithms ● The limited depth-first search technique determines viable prefixes (the prefixes of possible pattern matches) and searches for them in the suffix tree (it is expensive and cannot be implemented on a suffix array). ● Filtering algorithms discard large parts of the text by checking a necessary condition (simpler than the matching condition). Most existing filters are based on finding substrings of the pattern without errors, and at high error levels they stop working.

Summary & Conclusions ● The splitting technique balances between traversing too many nodes of the suffix tree and verifying too many text positions ● The resulting index has sublinear retrieval time (O(n^λ), 0 < λ < 1) if the error level is moderate ● In the future, more exact algorithms may appear for determining the correct number of pieces into which the pattern is divided, and there are (and may appear) better algorithms for verification after matching a piece of the pattern.