1
Languages with mismatches and an application to approximate indexing Chiara Epifanio, Alessandra Gabriele, and Filippo Mignosi
2
Mondello, 07/07/2005
Outline
1. Motivations and basic definitions
2. The languages L(S,k,r)
3. The repetition index R(S,k,r)
4. Some combinatorial properties of the repetition index
5. A trie based approach for approximate indexing data structures
6. Conclusions and related works
3
Main motivation: Approximate String Matching
It concerns finding strings in texts in the presence of “errors” or “mismatches”.
It has several applications in data analysis and data retrieval, such as:
- Recovering the original signals after their transmission over noisy channels;
- Finding DNA subsequences after possible mutations;
- Text searching where there are typing or spelling errors;
- Retrieving musical passages.
4
Each application uses a different error model, which defines how different two strings are. Some of the best-studied error models are:
- Levenshtein or edit distance [Levenshtein, 1965]: it allows insertions, deletions and substitutions of single characters (each replaced by a different one) in both strings;
- Hamming distance [Sankoff and Kruskal, 1983]: it allows only substitutions;
- Scoring functions: they are not distances in the mathematical sense; they measure the degree of similarity between two words.
5
The distance d(x,y) between two strings x and y is the minimal cost of a sequence of operations that transform x into y (and ∞ if no such sequence exists).
The different possible operations are: 1) Insertion, 2) Deletion, 3) Substitution, 4) Transposition.
We consider the Hamming distance, which allows only substitutions, each of cost 1 (simplified definition). It is finite whenever |x|=|y|, and then 0 ≤ d(x,y) ≤ |x|.
Ex.: x=acgtatct, y=aggttact, d(x,y)=3 (in the simplified definition)
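The simplified Hamming distance above can be sketched in a few lines of Python. The function name `hamming` is chosen for this illustration only; returning infinity for strings of different lengths follows the convention that d(x,y) is infinite when no sequence of substitutions transforms x into y.

```python
import math

def hamming(x, y):
    """Simplified Hamming distance: number of positions where x and y differ.

    Only substitutions (cost 1 each) are allowed, so strings of different
    lengths cannot be transformed into each other and get distance infinity.
    """
    if len(x) != len(y):
        return math.inf
    return sum(a != b for a, b in zip(x, y))

# The example from the slide: the strings differ at positions 2, 5 and 6.
print(hamming("acgtatct", "aggttact"))
```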
6
Typical approaches for finding a string x in a text S: consider a percentage D of errors, or fix the number k of them.
Hybrid approach: introduce a new parameter r and allow at most k errors in any window (or factor) of length r.
Let S be a string over the alphabet Σ, and let k, r be non-negative integers such that k ≤ r. A string u occurs in S at position l up to k errors in a window of size r, or simply k_r-occurs in S at position l, if one of the following two conditions holds:
− if |u| < r: d(u, S(l, l+|u|-1)) ≤ k;
− if |u| ≥ r: ∀ i, 1 ≤ i ≤ |u|-r+1, d(u(i,i+r-1), S(l+i-1, l+i+r-2)) ≤ k.
A string u satisfying the above property is a k_r-occurrence of S.
Let L(S,k,r) be the set of words u that k_r-occur in S at position l, for some l, 1 ≤ l ≤ |S|-|u|+1.
The parameter r introduced in the previous definition can be fixed or can vary as a function of the text.
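As an illustration of the definition, the two conditions can be checked directly. This is a minimal sketch: the names `occurs_at` and `in_language` are hypothetical, positions are 1-based as on the slide, and no attempt is made at efficiency.

```python
def occurs_at(u, S, l, k, r):
    """True if u k_r-occurs in S at 1-based position l (Hamming mismatches)."""
    if l < 1 or l + len(u) - 1 > len(S):
        return False
    factor = S[l - 1:l - 1 + len(u)]
    mismatch = [a != b for a, b in zip(u, factor)]
    if len(u) < r:
        # Short word: at most k errors overall.
        return sum(mismatch) <= k
    # Long word: at most k errors in every window of size r.
    return all(sum(mismatch[i:i + r]) <= k for i in range(len(u) - r + 1))

def in_language(u, S, k, r):
    """True if u belongs to L(S,k,r), i.e. k_r-occurs at some position of S."""
    return any(occurs_at(u, S, l, k, r) for l in range(1, len(S) - len(u) + 2))

print(occurs_at("bb", "abaa", 2, 1, 2))  # "bb" against the factor "ba"
```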
7
Example of L(S,k,r)
S = abaa, k = 1, r = 2
L(S,1,2) = {a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, bab, bba, bbb, aaaa, aaab, abaa, abab, abba, bbaa, bbab, bbba}
bbba ∈ L(S,1,2), but bbba ∉ L(S,1,4)
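The example can be reproduced by brute force for such a small text. The sketch below assumes the alphabet is exactly the set of letters of S (here {a, b}); the function names are illustrative, and the enumeration is exponential in |S|, so it is only meant for toy inputs.

```python
from itertools import product

def compatible(w, f, k, r):
    """True if equal-length w and f have at most k mismatches in every
    window of size r (or overall, when they are shorter than r)."""
    mismatch = [a != b for a, b in zip(w, f)]
    if len(w) < r:
        return sum(mismatch) <= k
    return all(sum(mismatch[i:i + r]) <= k for i in range(len(w) - r + 1))

def language(S, k, r, alphabet=None):
    """Enumerate L(S,k,r) over the given alphabet (default: letters of S)."""
    alphabet = sorted(set(S)) if alphabet is None else alphabet
    words = set()
    for n in range(1, len(S) + 1):
        factors = {S[i:i + n] for i in range(len(S) - n + 1)}
        for w in product(alphabet, repeat=n):
            if any(compatible(w, f, k, r) for f in factors):
                words.add("".join(w))
    return words

L12 = language("abaa", 1, 2)
print(sorted(L12, key=lambda w: (len(w), w)))
```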
8
The Repetition Index R(S,k,r) of S is the smallest integer h such that every string of length h k_r-occurs in at most one position of the text:
R(S,k,r) = min{h ≥ 1 s.t. ∀ i, j, 1 ≤ i, j ≤ |S| - h + 1, V(S(i,i+h-1),k,r) ∩ V(S(j,j+h-1),k,r) ≠ ∅ ⇒ i=j},
where V(u,k,r) is the set of all words of length |u| that have at most k errors in every window of size r with respect to u.
Remarks:
1. R(S,k,r) is well defined because the integer h=|S| is an element of the set described above;
2. If k/r ≥ 1/2 then R(S,k,r)=|S|.
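A direct, brute-force reading of this definition is sketched below. It builds each set V(u,k,r) explicitly, so it is exponential and only suitable for very short texts; the function names are hypothetical, and the alphabet is assumed to be the set of letters of S.

```python
from itertools import product

def compatible(w, u, k, r):
    """At most k mismatches between w and u in every window of size r
    (or overall, when the words are shorter than r)."""
    mismatch = [a != b for a, b in zip(w, u)]
    if len(u) < r:
        return sum(mismatch) <= k
    return all(sum(mismatch[i:i + r]) <= k for i in range(len(u) - r + 1))

def neighborhood(u, k, r, alphabet):
    """V(u,k,r): all words of length |u| compatible with u."""
    return {w for w in product(alphabet, repeat=len(u)) if compatible(w, u, k, r)}

def repetition_index(S, k, r, alphabet=None):
    """Smallest h such that the neighborhoods of the length-h factors at
    distinct positions of S are pairwise disjoint."""
    alphabet = sorted(set(S)) if alphabet is None else alphabet
    for h in range(1, len(S) + 1):
        seen, clash = set(), False
        for i in range(len(S) - h + 1):
            V = neighborhood(S[i:i + h], k, r, alphabet)
            if V & seen:          # some word k_r-occurs at two positions
                clash = True
                break
            seen |= V
        if not clash:
            return h
    return len(S)

print(repetition_index("abaa", 1, 2))  # k/r = 1/2, so the result is |S|
```

With k = 0 the neighborhoods are singletons, and the function returns the smallest h for which all factors of length h of S are distinct.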
9
Example
Let us consider the string
S = a b c d e f g h i j k l m n o a b z d e z g h z j k z m n z
with k = 1 and r = 2. Since k/r = 1/2, R(S,1,2) = |S| = 30.
A word w of length R(S,1,2)-1 = 29 that 1_2-occurs at positions 1 and 2 is
w = a c c e e g g i i k k m m o o b b d d z z h h j j z z n n
10
Some combinatorial properties of R(S,k,r)
Lemma 1: If k and S are fixed, R(S,k,r) is a non-increasing function of r.
Lemma 2: If r and S are fixed, R(S,k,r) is a non-decreasing function of k.
Lemma 3: If k and S are fixed and r ≥ R(S,k,r), the repetition index becomes constant.
Theorem: If k and S are fixed, there exists only one solution to the equation r = R(S,k,r).
11
An Index over a fixed text S is an abstract data type whose basic set is Fact(S) and that contains operations giving access to factors of S. The principal operations are:
1) Membership: given a word x, say whether x ∈ Fact(S);
2) Position: given x ∈ Fact(S), find the left position of its first (resp. last) occurrence in S;
3) Number of occurrences: given x ∈ Fact(S), find the number of occurrences of x in S;
4) List of positions: given x ∈ Fact(S), produce the list occ(x) of the occurrences of x in S.
All these operations can easily be extended to the case of approximate string matching.
12
We give the following results.
- The size of this indexing data structure is, on average, linear times a polylog of the size of the text S, i.e. O(|S| log^k |S|).
- For each word x, the time spent by our algorithms to find the list occ(x) of all k_r-occurrences of x in the text S is proportional to |x|+|occ(x)| on average.
13
Description of the indexing data structure
- Build the trie T(I,k,r) that represents the set of all strings of length R(S,k,r) that k_r-occur in the string S;
- Add to each leaf of the trie T(I,k,r) an integer i, the starting position of the k_r-occurrence of S represented by the concatenation of the labels from the root to that leaf.
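The two construction steps can be sketched with nested dictionaries. This is a naive illustration, not the authors' construction: it enumerates all words of length h over the letters of S, expects the caller to supply h = R(S,k,r), and stores the leaf position under the hypothetical key `"$pos"`. If h is too small, some word would need two positions, and the sketch raises an error.

```python
from itertools import product

def compatible(w, f, k, r):
    """At most k mismatches in every window of size r (or overall if shorter)."""
    mismatch = [a != b for a, b in zip(w, f)]
    if len(f) < r:
        return sum(mismatch) <= k
    return all(sum(mismatch[i:i + r]) <= k for i in range(len(f) - r + 1))

def build_trie(S, k, r, h, alphabet=None):
    """Trie of all words of length h that k_r-occur in S; each leaf stores
    the unique 1-based starting position of that occurrence."""
    alphabet = sorted(set(S)) if alphabet is None else alphabet
    root = {}
    for pos in range(len(S) - h + 1):
        factor = S[pos:pos + h]
        for w in product(alphabet, repeat=h):
            if not compatible(w, factor, k, r):
                continue
            node = root
            for c in w:
                node = node.setdefault(c, {})
            if node.setdefault("$pos", pos + 1) != pos + 1:
                raise ValueError("h is smaller than R(S,k,r)")
    return root

trie = build_trie("abaa", 0, 1, 2)   # exact matching, R = 2
print(trie)
```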
14
Finding all k_r-occurrences of a string x in a text S
“Read” the string x as long as possible in the trie, and let q be the last visited node.
i) If q is a leaf and |x| = R(S,k,r), return i;
ii) If q is a leaf and |x| > R(S,k,r): if x k_r-occurs in S at position i then return i, else “x is a false positive”;
iii) If |x| < R(S,k,r), return occ(x).
In cases i) and ii) the list of all k_r-occurrences of x has at most one element. In case iii) the list of all k_r-occurrences of x can have more than one element.
In iii) we use the Colored Range Query solution [Muthukrishnan, SODA’02].
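The three cases can be sketched as follows, under several assumptions that are not in the slides. The trie is built naively, h stands for R(S,k,r) and is supplied by the caller, and case iii) collects the distinct leaf positions below q by plain traversal instead of the Colored Range Query structure, so it does not achieve the stated time bound. The slides also do not say how short patterns matching within the last R(S,k,r)-1 positions of S are handled; the sketch checks those start positions directly.

```python
from itertools import product

def compatible(w, f, k, r):
    """At most k mismatches in every window of size r (or overall if shorter)."""
    mismatch = [a != b for a, b in zip(w, f)]
    if len(f) < r:
        return sum(mismatch) <= k
    return all(sum(mismatch[i:i + r]) <= k for i in range(len(f) - r + 1))

def build_trie(S, k, r, h):
    """Naive trie of all words of length h that k_r-occur in S (h = R(S,k,r))."""
    root = {}
    for pos in range(len(S) - h + 1):
        for w in product(sorted(set(S)), repeat=h):
            if compatible(w, S[pos:pos + h], k, r):
                node = root
                for c in w:
                    node = node.setdefault(c, {})
                node["$pos"] = pos + 1      # 1-based starting position
    return root

def leaf_positions(node):
    """All positions stored in the leaves below node."""
    if "$pos" in node:
        yield node["$pos"]
        return
    for child in node.values():
        yield from leaf_positions(child)

def search(trie, S, k, r, h, x):
    """List of 1-based positions where x k_r-occurs in S."""
    m, q = len(x), trie
    for c in x[:h]:                         # "read" x as long as possible
        q = q.get(c)
        if q is None:
            break
    if m >= h:
        if q is None:
            return []
        i = q["$pos"]
        if m == h:                          # case i)
            return [i]
        cand = S[i - 1:i - 1 + m]           # case ii): verify the candidate
        ok = len(cand) == m and compatible(x, cand, k, r)
        return [i] if ok else []            # otherwise x is a false positive
    found = set(leaf_positions(q)) if q is not None else set()   # case iii)
    for p in range(len(S) - h + 1, len(S) - m + 1):  # tail positions (assumption)
        if compatible(x, S[p:p + m], k, r):
            found.add(p + 1)
    return sorted(found)

S = "abaa"
trie = build_trie(S, 0, 1, 2)
print(search(trie, S, 0, 1, 2, "a"))
```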
15
Results
Proposition: The overall time for finding all k_r-occurrences of a string x in a text S is O(|x|+|occ(x)|).
Theorem: The k-mismatch problem on a text S over a fixed alphabet can be settled by a data structure having average size O(|S|∙log^k |S|) that answers queries in time O(|x|+|occ(x)|), for any query word x.
16
Conclusions and related work
1. The results of this paper are in the PhD Thesis of A. Gabriele [January, 2005].
2. Independently, M. Maass and J. Nowak gave analogous results, using essentially the same data structure and the CRQ solution [Preprint March, 2005 – CPM June, 2005], but:
- a window is not used
- the analysis of the size of the data structure is improved
- the technique is extended to edit distance
3. It is still open to find an indexing data structure of size linear times a polylog and searching time O(|x|+|occ(x)|).