Languages with mismatches and an application to approximate indexing Chiara Epifanio, Alessandra Gabriele, and Filippo Mignosi.

1 Languages with mismatches and an application to approximate indexing Chiara Epifanio, Alessandra Gabriele, and Filippo Mignosi

2 Mondello, 07/07/2005
Outline
1. Motivations and basic definitions
2. The languages L(S,k,r)
3. The repetition index R(S,k,r)
4. Some combinatorial properties of the repetition index
5. A trie-based approach for approximate indexing data structures
6. Conclusions and related works

3 Main motivation: Approximate String Matching
It concerns finding strings in texts in the presence of “errors” or “mismatches”. It has several applications in data analysis and data retrieval, such as:
- recovering the original signals after their transmission over noisy channels;
- finding DNA subsequences after possible mutations;
- text searching where there are typing or spelling errors;
- retrieving musical passages.

4 Each application uses a different error model, which defines how different two strings are. Some of the best-studied error models are:
- Levenshtein or edit distance [Levenshtein, 1965]: it allows us to insert, delete and substitute single characters (with a different one) in both strings;
- Hamming distance [Sankoff and Kruskal, 1983]: it allows only substitutions;
- Scoring functions: they are not distances in the mathematical sense; they measure the degree of similarity between two words.

5 The distance d(x,y) between two strings x and y is the minimal cost of a sequence of operations that transforms x into y (and ∞ if no such sequence exists). The possible operations are: 1) Insertion, 2) Deletion, 3) Substitution, 4) Transposition.
We consider the Hamming distance, which allows only substitutions, each of cost 1 (simplified definition). It is finite whenever |x|=|y|, and in that case 0 ≤ d(x,y) ≤ |x|.
Ex.: x=acgtatct, y=aggttact, d(x,y)=3 (in the simplified definition).
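The example on this slide can be checked with a few lines of Python. This is only an illustrative sketch of the simplified (substitution-only) definition; the function name `hamming` is our own choice.

```python
import math

def hamming(x: str, y: str) -> float:
    """Number of mismatching positions, or infinity when the lengths differ
    (no sequence of substitutions can then turn x into y)."""
    if len(x) != len(y):
        return math.inf
    return sum(a != b for a, b in zip(x, y))

# The slide's example: mismatches at positions 2, 5 and 6.
print(hamming("acgtatct", "aggttact"))
```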

6 Typical approaches for finding a string x in a text S: to consider a percentage D of errors, or to fix the number k of them.
Hybrid approach: to introduce a new parameter r and to allow at most k errors in any window (or factor) of length r.
Let S be a string over the alphabet Σ, and let k, r be non-negative integers such that k ≤ r. A string u occurs in S at position l up to k errors in a window of size r, or simply k_r-occurs in S at position l, if one of the following two conditions holds:
- if |u| < r ⇒ d(u, S(l, l+|u|-1)) ≤ k;
- if |u| ≥ r ⇒ ∀ i, 1 ≤ i ≤ |u|-r+1, d(u(i,i+r-1), S(l+i-1, l+i+r-2)) ≤ k.
A string u satisfying the above property is a k_r-occurrence of S.
Let L(S,k,r) be the set of words u that k_r-occur in S at position l, for some l, 1 ≤ l ≤ |S|-|u|+1.
The parameter r introduced in the previous definition can be fixed or can vary as a function of the text.
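The two conditions of the definition translate almost literally into code. The sketch below is an illustration, not the authors' implementation: the names `kr_occurs` and `in_language` are hypothetical, positions are 1-based as on the slide, and rejecting the empty word is our own assumption.

```python
def hamming(x: str, y: str) -> int:
    """Mismatch count between two strings of equal length."""
    return sum(a != b for a, b in zip(x, y))

def kr_occurs(u: str, S: str, l: int, k: int, r: int) -> bool:
    """True if u k_r-occurs in S at the 1-based position l."""
    n = len(u)
    if n == 0 or l < 1 or l + n - 1 > len(S):
        return False  # assumption: the empty word and out-of-range positions are rejected
    target = S[l - 1 : l - 1 + n]  # the factor S(l, l+|u|-1)
    if n < r:
        # Short word: at most k mismatches overall.
        return hamming(u, target) <= k
    # Long word: at most k mismatches in every window of size r.
    return all(hamming(u[i:i + r], target[i:i + r]) <= k
               for i in range(n - r + 1))

def in_language(u: str, S: str, k: int, r: int) -> bool:
    """True if u belongs to L(S,k,r), i.e. u k_r-occurs at some position."""
    return any(kr_occurs(u, S, l, k, r)
               for l in range(1, len(S) - len(u) + 2))

print(kr_occurs("bbba", "abaa", 1, 1, 2))  # windows bb/ab, bb/ba, ba/aa
print(kr_occurs("bbba", "abaa", 1, 1, 4))  # one window with two mismatches
```

With S=abaa, the word bbba has mismatches at positions 1 and 3, so it is accepted for r=2 (no window of size 2 contains both) and rejected for r=4.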

7 Example of L(S,k,r)
S=abaa, k=1, r=2
L(S,1,2)={a,b,aa,ab,ba,bb,aaa,aab,aba,abb,baa,bab,bba,bbb,aaaa,aaab,abaa,abab,abba,bbaa,bbab,bbba}
bbba ∈ L(S,1,2), but bbba ∉ L(S,1,4)
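For a text this small, the language can be enumerated by brute force. The following sketch assumes the alphabet {a,b}, which the slide leaves implicit; the helper names `neighborhood` and `language` are hypothetical, and the enumeration is exponential in the word length, so it is meant only for toy inputs.

```python
from itertools import product

def neighborhood(u, k, r, alphabet):
    """Words of length |u| with at most k mismatches against u in every
    window of size r (a single window covering u when |u| < r)."""
    n = len(u)
    win = min(r, n)
    words = set()
    for cand in product(alphabet, repeat=n):
        errs = [c != a for c, a in zip(cand, u)]
        if all(sum(errs[i:i + win]) <= k for i in range(n - win + 1)):
            words.add("".join(cand))
    return words

def language(S, k, r, alphabet):
    """L(S,k,r): union of the neighborhoods of all non-empty factors of S."""
    words = set()
    for h in range(1, len(S) + 1):
        for i in range(len(S) - h + 1):
            words |= neighborhood(S[i:i + h], k, r, alphabet)
    return words

L = language("abaa", 1, 2, "ab")
print(sorted(L, key=lambda w: (len(w), w)))
print("bbba" in L, "bbba" in language("abaa", 1, 4, "ab"))
```

The first call yields 22 words: both words of length 1, all four of length 2, all eight of length 3, and eight of the sixteen words of length 4.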

8 The Repetition Index R(S,k,r) of S is the smallest integer h such that all strings of length h k_r-occur in at most one position of the text:
R(S,k,r) = min{h ≥ 1 s.t. ∀ i, j, 1 ≤ i, j ≤ |S|-h+1, V(S(i,i+h-1),k,r) ∩ V(S(j,j+h-1),k,r) ≠ ∅ ⇒ i=j},
where V(u,k,r) is the set of all words of length |u| that have at most k errors in every window of size r with respect to u.
Remarks:
1. R(S,k,r) is well defined, because the integer h=|S| is an element of the set described above;
2. If k/r ≥ 1/2 then R(S,k,r)=|S|.
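The definition can be evaluated directly on toy strings. The sketch below is a brute-force illustration under our own assumptions, not the authors' algorithm: for each h it looks for a word lying in the neighborhoods of two different positions, trying only letters that occur in S (a letter outside S can always be replaced by a letter of S without adding mismatches). The names `compatible` and `repetition_index` are hypothetical, and the running time is exponential in h.

```python
from itertools import product

def compatible(w, u, k, r):
    """True if w is in V(u,k,r): at most k mismatches in every window of size r."""
    n = len(u)
    win = min(r, n)
    errs = [a != b for a, b in zip(w, u)]
    return all(sum(errs[i:i + win]) <= k for i in range(n - win + 1))

def repetition_index(S, k, r):
    """Smallest h such that no word of length h k_r-occurs at two positions of S."""
    alphabet = sorted(set(S))
    n = len(S)
    for h in range(1, n + 1):
        factors = [S[i:i + h] for i in range(n - h + 1)]
        clash = False
        for cand in product(alphabet, repeat=h):
            # Count the positions whose neighborhood contains cand.
            if sum(compatible(cand, f, k, r) for f in factors) > 1:
                clash = True
                break
        if not clash:
            return h
    return n  # not reached: h = |S| leaves a single position

print(repetition_index("abaa", 0, 1))    # exact case: factors of length 2 are distinct
print(repetition_index("abaa", 1, 2))    # k/r = 1/2, as in Remark 2
print(repetition_index("abcabd", 1, 3))
```

For S=abaa the calls return 2 and 4=|S|, and for S=abcabd with k=1, r=3 the result is 4.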

9 Example
Let us consider the string S = abcdefghijklmnoabzdezghzjkzmnz with k = 1 and r = 2. Since k/r = 1/2 ⇒ R(S,1,2) = |S| = 30. A word w of length R(S,1,2)-1 = 29 that 1_2-occurs at positions 1 and 2 is w = acceeggiikkmmoobbddzzhhjjzznn.
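The claim of this example can be verified mechanically. The short sketch below re-implements the window condition for the case |w| ≥ r only; the function name `kr_occurs` is hypothetical and positions are 1-based.

```python
S = "abcdefghijklmnoabzdezghzjkzmnz"
w = "acceeggiikkmmoobbddzzhhjjzznn"

def kr_occurs(u, S, l, k, r):
    """True if u (with |u| >= r) k_r-occurs in S at the 1-based position l."""
    target = S[l - 1 : l - 1 + len(u)]
    if len(target) != len(u):
        return False
    errs = [a != b for a, b in zip(u, target)]
    return all(sum(errs[i:i + r]) <= k for i in range(len(u) - r + 1))

positions = [l for l in range(1, len(S) - len(w) + 2) if kr_occurs(w, S, l, 1, 2)]
print(len(S), len(w), positions)
```

Against position 1 the mismatches of w fall on the even positions, and against position 2 on the odd ones, so no window of size 2 ever contains two errors.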

10 Some combinatorial properties of R(S,k,r)
Lemma 1: If k and S are fixed, R(S,k,r) is a non-increasing function of r;
Lemma 2: If r and S are fixed, R(S,k,r) is a non-decreasing function of k;
Lemma 3: If k and S are fixed and r ≥ R(S,k,r), the repetition index becomes constant.
Theorem: If k and S are fixed, there exists only one solution to the equation r = R(S,k,r).

11 An Index over a fixed text S is an abstract data type whose basic set is Fact(S) and that contains operations giving access to factors of S. The principal operations are:
1) Membership: given a word x, say whether x ∈ Fact(S);
2) Position: given x ∈ Fact(S), find the left position of its first (resp. last) occurrence in S;
3) Number of occurrences: given x ∈ Fact(S), find the number of occurrences of x in S;
4) List of positions: given x ∈ Fact(S), produce the list occ(x) of the occurrences of x in S.
All these operations can easily be extended to the case of approximate string matching.

12 We give the following results.
The size of this indexing data structure is, on average, linear times a polylog of the size of the text S, i.e. O(|S| log^k |S|).
For each word x, the time spent by our algorithms for finding the list occ(x) of all k_r-occurrences of the word x in the text S is proportional to |x|+|occ(x)| on average.

13 Description of the indexing data structure
- Build the trie T(I,k,r) that represents the set of all possible strings of length R(S,k,r) that k_r-occur in the string S;
- Add to each leaf of the trie T(I,k,r) an integer i, which is the starting position in S of the k_r-occurrence represented by the concatenation of the labels from the root to that leaf.
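A toy version of this construction can be sketched in Python. Everything below is an assumption made for illustration: the trie is a nested `dict`, a leaf is stored as a plain integer (the 1-based position), the length `h` is passed in by the caller and is assumed to equal R(S,k,r), and the words are generated by exhaustive enumeration, which is exponential and says nothing about the size bounds of the actual data structure.

```python
from itertools import product

def neighborhood(u, k, r, alphabet):
    """Words with at most k mismatches against u in every window of size r."""
    n = len(u)
    win = min(r, n)
    for cand in product(alphabet, repeat=n):
        errs = [c != a for c, a in zip(cand, u)]
        if all(sum(errs[i:i + win]) <= k for i in range(n - win + 1)):
            yield "".join(cand)

def build_trie(S, k, r, h, alphabet):
    """Trie of all length-h words that k_r-occur in S; each leaf holds the
    1-based starting position. Assumes h = R(S,k,r), so that every word
    has a single position and no leaf is overwritten."""
    root = {}
    for i in range(len(S) - h + 1):
        for word in neighborhood(S[i:i + h], k, r, alphabet):
            node = root
            for ch in word[:-1]:
                node = node.setdefault(ch, {})
            node[word[-1]] = i + 1  # leaf: starting position in S
    return root

# Toy input: for S = abcabd, k = 1, r = 3, checking the definition by hand gives R = 4.
trie = build_trie("abcabd", 1, 3, 4, "abcd")
print(trie["d"]["b"]["c"]["a"])  # dbca differs from abca in one letter
```

In this toy case each of the three factors of length 4 contributes 22 words, and the three neighborhoods are disjoint, so the trie has 66 leaves.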

14 Finding all k_r-occurrences of a string x in a text S
“Read” the string x in the trie for as long as possible, and let q be the last visited node; if q is a leaf, let i be the integer stored in it.
i) If q is a leaf and |x|=R(S,k,r) ⇒ return i;
ii) If q is a leaf and |x|>R(S,k,r) ⇒ if x k_r-occurs in S at position i then return i, else “x is a false positive”;
iii) If |x|<R(S,k,r) ⇒ return occ(x).
In cases i) and ii), the list of all k_r-occurrences of x has at most one element. In case iii), the list of all k_r-occurrences of x can have more than one element; there we use the Colored Range Query solution [Muthukrishnan, SODA’02].
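The three cases can be illustrated on a toy trie. The sketch below is not the authors' algorithm: it builds the trie by exhaustive enumeration, and in case iii) it simply walks the subtree and collects the distinct leaf positions, where the slides use a Colored Range Query structure to reach the stated time bound. It also only reports positions at which a full factor of length h fits in S; handling of the end of the text is left out. All names (`window_ok`, `build_trie`, `search`) and the parameter `h`, assumed equal to R(S,k,r), are hypothetical.

```python
from itertools import product

def window_ok(w, u, k, r):
    """At most k mismatches between w and u in every window of size r."""
    win = min(r, len(u))
    errs = [a != b for a, b in zip(w, u)]
    return all(sum(errs[i:i + win]) <= k for i in range(len(u) - win + 1))

def build_trie(S, k, r, h, alphabet):
    """Nested-dict trie of the length-h words that k_r-occur in S;
    a leaf is the 1-based starting position."""
    root = {}
    for i in range(len(S) - h + 1):
        for cand in product(alphabet, repeat=h):
            if window_ok(cand, S[i:i + h], k, r):
                node = root
                for ch in cand[:-1]:
                    node = node.setdefault(ch, {})
                node[cand[-1]] = i + 1
    return root

def leaf_positions(node):
    """Distinct positions stored in the leaves below node (naive traversal)."""
    if isinstance(node, int):
        return {node}
    out = set()
    for child in node.values():
        out |= leaf_positions(child)
    return out

def search(x, trie, S, k, r, h):
    """Sorted list of reported 1-based positions of k_r-occurrences of x."""
    node = trie
    for ch in x[:h]:  # read x for as long as the trie allows
        if isinstance(node, int) or ch not in node:
            return []
        node = node[ch]
    if len(x) < h:  # case iii): several positions are possible
        return sorted(leaf_positions(node))
    i = node  # a leaf was reached after exactly h letters
    if len(x) == h:  # case i)
        return [i]
    end = i - 1 + len(x)  # case ii): verify the rest of x against the text
    if end <= len(S) and window_ok(x, S[i - 1:end], k, r):
        return [i]
    return []  # x is a false positive

S, k, r, h = "abcabd", 1, 3, 4
trie = build_trie(S, k, r, h, "abcd")
for x in ["dbca", "abcad", "abcdd", "b"]:
    print(x, search(x, trie, S, k, r, h))
```

Here dbca falls under case i), abcad and abcdd under case ii) (the second is a false positive, because its last window contains two mismatches), and b under case iii).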

15 Results
Proposition: The overall time for finding all k_r-occurrences of a string x in a text S is O(|x|+|occ(x)|).
Theorem: The k-mismatch problem on a text S over a fixed alphabet can be settled by a data structure having average size O(|S|∙log^k |S|) that answers queries in time O(|x|+|occ(x)|), for any query word x.

16 Conclusions and related work
1. The results of this paper are in the PhD thesis of A. Gabriele [January, 2005].
2. Independently, M. Maass and J. Nowak gave analogous results, using essentially the same data structure and the CRQ solution [Preprint March, 2005 – CPM June, 2005], but:
- a window is not used;
- the analysis of the size of the data structure is improved;
- the technique is extended to edit distance.
3. It is still open to find an indexing data structure of linear times a polylog size and searching time O(|x|+|occ(x)|).

