Finding approximate palindromes in strings. Alexandre H. L. Porto and Valmir C. Barbosa, Pattern Recognition, vol. 35, pp. 2581-2591, 2002. Advisor: Prof. R. C. T. Lee. Speaker: L. C. Chen.
Definition. S: a string of n characters. S[i]: the i-th character of S. S[i…j]: the substring of S whose first and last characters are S[i] and S[j]. S^R: the reverse of S. Example: S = abcab, S^R = bacba.
Definition. An even (odd) palindrome is a string of the form S^R S (S^R a S), where a is a single character. Thus abaccaba is an even palindrome, because abac is the reverse of caba. c: the center of a palindrome S[i…j] in S; for an even palindrome, c is the position of the last character of its left half. Example: in S = abaccaba, S[2…7] = baccab is an even palindrome with center c = 4.
Edit distance. In edit distance, there are three types of differences between two strings X and Y. Insertion: a symbol of Y is missing in X at a corresponding position (X: A - T, Y: A G T). Substitution: symbols at corresponding positions are distinct (X: A C C, Y: T C C). Deletion: a symbol of X is missing in Y at a corresponding position (X: G C A, Y: G - A).
ED(A, B) denotes the edit distance between two strings A and B: the minimum number of substitutions, insertions and deletions of characters needed to transform B into A. Example alignment: A = abcab-a, B = cb-abbc, with 1 insertion, 2 substitutions and 1 deletion.
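The definition of ED can be sketched in Python with the standard dynamic-programming recurrence. This is only an illustration; the function name `edit_distance` is an assumption and does not come from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning b into a."""
    # prev[j] is the distance between the prefix of a processed so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j - 1] + (ca != cb),  # match or substitution
                prev[j] + 1,               # a symbol of a has no partner in b
                cur[j - 1] + 1,            # a symbol of b has no partner in a
            ))
        prev = cur
    return prev[-1]


# The three single-difference pairs from the edit distance slide.
examples = [("AT", "AGT"), ("ACC", "TCC"), ("GCA", "GA")]
distances = [edit_distance(x, y) for x, y in examples]
```

Each of the three pairs differs by exactly one operation, so every entry of `distances` is 1.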
Approximate palindromes. An approximate palindrome with error up to k: a string of the form XY (even) or XaY (odd), where a is a single character, such that ED(X^R, Y) ≦ k; for k = 0 this is the exact form S^R S (S^R a S). An approximate palindrome is maximal if no other approximate palindrome for the same c and k exists having strictly greater size, or the same size but strictly fewer errors.
To simplify our discussion, we only discuss even approximate palindromes here. Example: S = aabaabcd and k = 1. At c = 3, abaa (substitute b with a) and aabaa (delete b) are even approximate palindromes, and aabaa is a maximal approximate palindrome.
Problem. Given a string T of size n, we want to find all maximal approximate palindromes in T with up to k errors. For each c, we find the largest i' and j' in T[c+1…n] and T^R[1…c] respectively such that ED(T[c+1…i'], T^R[1…j']) ≦ k. Here T^R[1…c] denotes the reverse of T[1…c], and T^R[1…j'] its prefix of length j'.
Let S2 = T^R[1…c] and S1 = T[c+1…n], where 1 ≦ c ≦ n. In the dynamic programming approach, we construct an (n'+1) × (m'+1) matrix D, where D_{i,j} is the minimum edit distance between S1[1…i] and S2[1…j], and n' and m' are the lengths of S1 and S2 respectively.
Example: T = dbcaabac and k = 2. At c = 3, S2 = T^R[1…3] = cbd and S1 = T[4…8] = aabac. [Figure: D table for S1 = aabac (index i) and S2 = cbd (index j). Arrows: ↖ substitution or match, ↑ deletion, ← insertion.] We can find that the maximal approximate palindrome is bcaab.
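The full-table approach can be sketched as follows. This is a quadratic baseline under stated assumptions, not the paper's faster method: the names `edit_table` and `maximal_even_palindrome` are illustrative, and c is the 1-based center used in the slides.

```python
def edit_table(s1: str, s2: str) -> list[list[int]]:
    """D[i][j] = edit distance between s1[:i] and s2[:j]."""
    n1, m1 = len(s1), len(s2)
    D = [[0] * (m1 + 1) for _ in range(n1 + 1)]
    for i in range(n1 + 1):
        D[i][0] = i
    for j in range(m1 + 1):
        D[0][j] = j
    for i in range(1, n1 + 1):
        for j in range(1, m1 + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]),  # match or substitution
                D[i - 1][j] + 1,                             # step from the left neighbour
                D[i][j - 1] + 1,                             # step from the upper neighbour
            )
    return D


def maximal_even_palindrome(T: str, c: int, k: int) -> str:
    """Maximal even approximate palindrome of T centered at c with at most k errors."""
    s1 = T[c:]          # T[c+1..n] in the slides' 1-based notation
    s2 = T[:c][::-1]    # reverse of T[1..c]
    D = edit_table(s1, s2)
    best_i = best_j = 0  # the empty palindrome always qualifies
    for i in range(len(s1) + 1):
        for j in range(len(s2) + 1):
            if D[i][j] <= k:
                # Prefer greater size i + j, then fewer errors.
                if (i + j, -D[i][j]) > (best_i + best_j, -D[best_i][best_j]):
                    best_i, best_j = i, j
    return T[c - best_j : c + best_i]
```

For the example on this slide, `maximal_even_palindrome("dbcaabac", 3, 2)` selects the cell i = 3, j = 2 and returns bcaab.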
How can we compute the table faster? In this paper, the method in [LV89] (L. Y. Huang) is used.
We shall heavily use the concept of a diagonal. Diagonal d is defined as all of the D_{i,j}'s where d = i - j. The diagonal property: D_{i,j} - D_{i-1,j-1} = 0 or 1 [U85]. It means that along a diagonal the values are monotonically non-decreasing. [Figure: D table with Diagonal 0 and Diagonal 2 marked.]
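The diagonal property can be checked numerically on a small table. The sketch below uses the strings S1 = gggtcta and S2 = gttc from the later slides; the helper names `edit_table` and `diagonal` are assumptions made for illustration.

```python
def edit_table(s1: str, s2: str) -> list[list[int]]:
    """D[i][j] = edit distance between s1[:i] and s2[:j]."""
    D = [[i + j if i == 0 or j == 0 else 0 for j in range(len(s2) + 1)]
         for i in range(len(s1) + 1)]
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            D[i][j] = min(D[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]),
                          D[i - 1][j] + 1,
                          D[i][j - 1] + 1)
    return D


def diagonal(D: list[list[int]], d: int) -> list[int]:
    """Values of D on Diagonal d (cells with i - j = d), by increasing j."""
    n1, m1 = len(D) - 1, len(D[0]) - 1
    return [D[j + d][j] for j in range(max(0, -d), min(m1, n1 - d) + 1)]


D = edit_table("gggtcta", "gttc")
# Consecutive values on every diagonal differ by 0 or 1.
property_holds = all(
    b - a in (0, 1)
    for d in range(-4, 8)
    for a, b in zip(diagonal(D, d), diagonal(D, d)[1:])
)
```

On this table Diagonal 0 reads 0, 0, 1, 2, 3 and Diagonal 1 reads 1, 1, 2, 2, 2.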
Consider Diagonal d = 0. Let us find the largest j, if it exists, such that (i, j) is on Diagonal d (i - j = d) and D_{i,j} = 0. Let us now label all of these locations. Example: S1 = gggtcta, S2 = gttc. [Figure: D table for S1 (index i) and S2 (index j) with Diagonal 0 marked.]
Having found the above locations (i, j) where Di,j = 0, we can further find the largest j, if it exists, such that (i, j) is on Diagonal d and Di,j = 1. To do this, we use the following observation: Each element in Diagonal d can only influence elements in Diagonals d-1, d and d+1.
Let us consider any (i, j) location on Diagonal d. D_{i,j} can only be influenced by three neighbours: D_{i-1,j-1} on Diagonal d (substitution or match), D_{i,j-1} on Diagonal d+1 (deletion), and D_{i-1,j} on Diagonal d-1 (insertion). Thus, we conclude that we only need to consider Diagonals d-1, d and d+1 for each D_{i,j}.
Observe the following two strings T1 and T2. Let A1 = T1[1…i], x = T1[i+1], A2 = T2[1…j] and y = T2[j+1]. If i and j are the largest i and j on their diagonal such that ED(T1[1…i], T2[1…j]) = k, and x ≠ y, then ED(A1+x, A2+y) = k+1.
Consider T1 = abcd and T2 = cbde. ED(T1[1…i], T2[1…j]) = 2, where the largest such i and j are 2 and 3 respectively, and T1[i+1] = c ≠ T2[j+1] = e. Thus ED(ab+c, cbd+e) = 2+1 = 3.
Based upon the above discussion, on a Diagonal d we can find the largest i and j such that D_{i,j} = e. How can we find the largest row containing a value smaller than or equal to k? We let L_{d,e} denote the largest row j such that D_{i,j} is on Diagonal d (i - j = d) and D_{i,j} = e ≦ k.
Based upon the definition of L_{d,e}, e is the edit distance between S1[1…i] and S2[1…j], where i and j are the largest such indices on Diagonal d, and S2[j+1] ≠ S1[i+1]. Example: S1 = gggtcta, S2 = gttc. [Figure: D table for S1 (index i) and S2 (index j) with Diagonal d = 0 marked.] At d = 0: L_{0,0} = 1, L_{0,1} = 2, L_{0,2} = 3 and L_{0,3} = 4.
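The values L_{d,e} can be read directly off a full D table, which gives a reference to compare the faster computation against. The sketch below does this for S1 = gggtcta and S2 = gttc; `edit_table` and `furthest_rows` are illustrative names, and building the whole table is exactly the work the diagonal method is meant to avoid.

```python
def edit_table(s1: str, s2: str) -> list[list[int]]:
    """D[i][j] = edit distance between s1[:i] and s2[:j]."""
    D = [[i + j if i == 0 or j == 0 else 0 for j in range(len(s2) + 1)]
         for i in range(len(s1) + 1)]
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            D[i][j] = min(D[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]),
                          D[i - 1][j] + 1,
                          D[i][j - 1] + 1)
    return D


def furthest_rows(s1: str, s2: str) -> dict[tuple[int, int], int]:
    """L[(d, e)] = largest row j on Diagonal d = i - j with D[i][j] == e."""
    L: dict[tuple[int, int], int] = {}
    for i, values in enumerate(edit_table(s1, s2)):
        for j, value in enumerate(values):
            key = (i - j, value)
            L[key] = max(L.get(key, -1), j)
    return L


L = furthest_rows("gggtcta", "gttc")
diagonal_zero = [L[(0, e)] for e in range(4)]
```

Here `diagonal_zero` holds L_{0,0} through L_{0,3}. Pairs (d, e) for which no cell on Diagonal d has value e, such as (2, 1), are simply absent from the dictionary.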
How can we compute the value of L_{d,e}? We define row_{d,e} = max[L_{d,e-1}+1 (substitution), L_{d-1,e-1} (insertion), L_{d+1,e-1}+1 (deletion)]. Then L_{d,e} = row_{d,e} + t, where t is the length of the longest common prefix of S1[d+row_{d,e}+1…n'] and S2[row_{d,e}+1…m']. If t = 0, it means that S1[d+row_{d,e}+1] ≠ S2[row_{d,e}+1].
Consider D_{3,2}. L_{1,1} = 1: the largest j on Diagonal d = 1 for which D_{i,j} = 1 is j = 1. In this case, d = 1 and e = 2. L_{d,e-1} = L_{1,1} = 1, L_{d-1,e-1} = L_{0,1} = 2 and L_{d+1,e-1} = L_{2,1} = -1 (by initialization, since no entry on Diagonal 2 has value 1). Thus row_{d,e} = row_{1,2} = max(L_{1,1}+1, L_{0,1}, L_{2,1}+1) = max(1+1, 2, -1+1) = max(2, 2, 0) = 2. [Figure: D table for S1 = gggtcta (index i) and S2 = gttc (index j) with Diagonals d = 0, 1 and 2 marked.]
e = 1, d = -1. S1 = gggtcta, S2 = gttc. How to compute L_{-1,1}? row_{-1,1} = max[L_{d,e-1}+1, L_{d-1,e-1}, L_{d+1,e-1}+1] = max[L_{-1,0}+1, L_{-2,0}, L_{0,0}+1] = max[0+1, 0, 1+1] = max[1, 0, 2] = 2. Since S1[d+row_{d,e}+1] = S1[-1+2+1] = S1[2] = g ≠ S2[row_{d,e}+1] = S2[2+1] = S2[3] = t, the common prefix has length 0 and L_{-1,1} = row_{-1,1} + 0 = 2. [Figure: D table with Diagonal d = -1 marked.]
S1 = gggtcta, S2 = gttc. How to compute L_{1,2}? row_{1,2} = max[L_{d,e-1}+1, L_{d-1,e-1}, L_{d+1,e-1}+1] = max[L_{1,1}+1, L_{0,1}, L_{2,1}+1] = max[1+1, 2, -1+1] = max[2, 2, 0] = 2. Since the length of the longest common prefix of S1[d+row_{1,2}+1…n'] = S1[4…7] = tcta and S2[row_{1,2}+1…m'] = S2[3…4] = tc is 2, L_{1,2} = row_{1,2} + 2 = 4. [Figure: D table for S1 (index i) and S2 (index j) with Diagonal d = 1 marked; D_{3,2} = D_{4,3} = D_{5,4} = 2.]
L_{d,e} = row_{d,e} + t, where t is the length of the longest common prefix of S1[d+row_{d,e}+1…n'] and S2[row_{d,e}+1…m']. How can we compute t? In this paper, LCA (lowest common ancestor) is used.
Consider two substrings T1 and T2. [Figure: T1 consists of A1, S1 and a following character x; T2 consists of A2, S2 and a following character y.] If ED(A1, A2) = k and S1 = S2, then ED(A1+S1, A2+S2) = k.
When we find ED(A1, A2) = k, we want to determine whether the longest common prefix S of B1 and B2 exists. [Figure: strings B1 and B2 with S1 and S2 marked.] This paper uses LCA (lowest common ancestor) to find S.
Obviously, suffixes S1' and S2' have a common prefix S. To find such an S, if it exists, we may concatenate S1 and S2 into a new string. [Figure: concatenated string with suffixes S1' and S2' marked.]
Let us concatenate S1 = gggtcta and S2 = gttc into a new string, gggtctagttc. Consider D_{3,2}: the substring after ggg is tctagttc = S1'. The substring after gt in S2 is tc = S2'. Note that S2' and S1' have a common prefix of length 2. Thus we have D_{3,2} = D_{4,3} = D_{5,4} = 2. [Figure: D table for S1 (index i) and S2 (index j) with Diagonal d = 1 marked.]
S1 = gggtcta, S2 = gttc. Let us concatenate S1 and S2 into a new string: gggtctagttc. Then we construct its suffix tree. The suffix after ggg is tctagttc = S1'. The suffix after gt in S2 is tc = S2'. Note that the lowest common ancestor of S2' and S1' in the suffix tree spells tc, which has length 2.
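The longest-common-prefix query can be sketched without building a suffix tree. The version below swaps in a different but equivalent technique: a suffix array with an LCP array (Kasai's construction), where the minimum over a range of the LCP array plays the role of the LCA depth. The separator characters and the name `build_lcp_oracle` are assumptions of this sketch. The naive sorting and the linear-time range minimum keep the code short; a real implementation would use linear-time construction and constant-time queries.

```python
def build_lcp_oracle(s1: str, s2: str):
    """Return query(a, b) = length of the longest common prefix of s1[a:] and s2[b:]."""
    # Two distinct separators (assumed absent from the inputs) stop a match
    # from running across the boundary between s1 and s2.
    text = s1 + "\x00" + s2 + "\x01"
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])   # naive suffix array
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    # Kasai's algorithm: lcp_arr[r] = LCP of the suffixes ranked r-1 and r.
    lcp_arr = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp_arr[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0

    def query(a: int, b: int) -> int:
        if a >= len(s1) or b >= len(s2):
            return 0
        lo, hi = sorted((rank[a], rank[len(s1) + 1 + b]))
        # Range minimum over the LCP array (linear scan in this sketch).
        return min(lcp_arr[lo + 1 : hi + 1])

    return query


query = build_lcp_oracle("gggtcta", "gttc")
t = query(3, 2)   # suffixes tcta... and tc (0-based offsets 3 and 2)
```

With the slide's strings, `t` is the common prefix length of the suffix after ggg and the suffix after gt, namely 2.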
Algorithm
Initialization:
  for all d, 1 ≦ d ≦ k+1, and all e with d > e: L_{d,e} = -1.
  for all d, -(k+1) ≦ d ≦ -1: L_{d,|d|-1} = |d|-1, L_{d,|d|-2} = |d|-2.
  for all e, -1 ≦ e ≦ k: L_{n'+1,e} = -1.
  L_{0,0} = the length of the longest common prefix of S1 and S2.
For e = 1 to k do
  For d = -e to e do
    row_{d,e} = max[L_{d,e-1}+1, L_{d-1,e-1}, L_{d+1,e-1}+1]
    row_{d,e} = min(row_{d,e}, m')
    if row_{d,e} < m' and row_{d,e}+d < n' then
      t = the length of the longest common prefix of S1[d+row_{d,e}+1…n'] and S2[row_{d,e}+1…m']
      row_{d,e} = row_{d,e} + t
    L_{d,e} = row_{d,e}
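The algorithm above can be sketched in Python as follows. Several choices here are assumptions of the sketch rather than statements about the paper: the common prefix is found by direct character comparison instead of LCA queries, the boundary values of the initialization are produced on demand by a helper `get`, the row is also clamped so that i = row + d stays within S1, and the names `lcp`, `diagonal_table` and `maximal_even_palindrome` are illustrative.

```python
def lcp(s1: str, s2: str, a: int, b: int) -> int:
    """Length of the longest common prefix of s1[a:] and s2[b:] (0-based offsets)."""
    t = 0
    while a + t < len(s1) and b + t < len(s2) and s1[a + t] == s2[b + t]:
        t += 1
    return t


def diagonal_table(s1: str, s2: str, k: int) -> dict[tuple[int, int], int]:
    """L[(d, e)] = furthest row j reached on Diagonal d = i - j with at most e errors."""
    n1, m1 = len(s1), len(s2)
    L = {(0, 0): lcp(s1, s2, 0, 0)}

    def get(d: int, e: int) -> int:
        if (d, e) in L:
            return L[(d, e)]
        # Initialization values: -1 for d > e >= 0; for d < 0 the two
        # entries L_{d,|d|-1} = |d|-1 and L_{d,|d|-2} = |d|-2 both equal e.
        return e if d < 0 else -1

    for e in range(1, k + 1):
        for d in range(max(-e, -m1), min(e, n1) + 1):
            row = max(get(d, e - 1) + 1,      # substitution
                      get(d - 1, e - 1),      # insertion
                      get(d + 1, e - 1) + 1)  # deletion
            row = min(row, m1, n1 - d)        # stay inside the table
            L[(d, e)] = row + lcp(s1, s2, row + d, row)
    return L


def maximal_even_palindrome(T: str, c: int, k: int) -> str:
    """Maximal even approximate palindrome of T at 1-based center c, up to k errors."""
    s1, s2 = T[c:], T[:c][::-1]
    L = diagonal_table(s1, s2, k)
    # Size is i + j = 2j + d; prefer the greatest size, then the fewest errors.
    (d, e), j = max(L.items(), key=lambda item: (2 * item[1] + item[0][0], -item[0][1]))
    return T[c - j : c + j + d]


L = diagonal_table("gggtcta", "gttc", 2)
```

For S1 = gggtcta and S2 = gttc with k = 2, the table contains L_{1,2} = 4, which corresponds to i = 5 and j = 4 and hence to a palindrome of size 9.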
Example: T = cttggggtcta and k = 2. At c = 4, T[1…4] = cttg, S2 = T^R[1…4] = gttc and S1 = T[5…11] = gggtcta. [Figure: empty D table for S1 (index i) and S2 (index j).]
At d = 0, find the largest j such that S2[1…j] is equal to S1[1…i] (with i = j), then set L_{0,0} = j. Here S2[1] = S1[1] = g and S2[2] = t ≠ S1[2] = g, so L_{0,0} = 1. [Figure: D table with Diagonal d = 0 marked.]
e = 1, d = -1. row_{-1,1} = max[L_{d,e-1}+1, L_{d-1,e-1}, L_{d+1,e-1}+1] = max[L_{-1,0}+1, L_{-2,0}, L_{0,0}+1] = max[1, 0, 2] = 2. The length of the longest common prefix of ggtctagttc and tc is 0. L_{-1,1} = 2. [Figure: D table with Diagonal d = -1 marked.]
The LCA of the suffixes ggtctagttc and tc spells a string of length 0.
e = 1, d = 0. row_{0,1} = max[L_{d,e-1}+1, L_{d-1,e-1}, L_{d+1,e-1}+1] = max[L_{0,0}+1, L_{-1,0}, L_{1,0}+1] = max[2, 0, 0] = 2. The length of the longest common prefix of gtctagttc and tc is 0. L_{0,1} = 2. [Figure: D table with Diagonal d = 0 marked.]
The LCA of the suffixes gtctagttc and tc spells a string of length 0.
e = 1, d = 1. row_{1,1} = max[L_{d,e-1}+1, L_{d-1,e-1}, L_{d+1,e-1}+1] = 1. The length of the longest common prefix of gtctagttc and ttc is 0. L_{1,1} = 1. [Figure: D table for S1 = gggtcta (index i) and S2 = gttc (index j) with Diagonal d = 1 marked.]
The LCA of the suffixes gtctagttc and ttc spells a string of length 0.
e = 2, d = 1. row_{1,2} = max[L_{d,e-1}+1, L_{d-1,e-1}, L_{d+1,e-1}+1] = 2. [Figure: D table with Diagonal d = 1 marked.]
e = 2, d = 1. We find that the longest common prefix of S2' = tc and S1' = tctagttc is tc. L_{1,2} = row_{1,2} + 2 = 2+2 = 4. [Figure: D table for S1 = gggtcta (index i) and S2 = gttc (index j) with Diagonal d = 1 marked.]
The LCA of the suffixes tctagttc and tc spells a string of length 2.
e = 2, d = 2. row_{2,2} = max[L_{d,e-1}+1, L_{d-1,e-1}, L_{d+1,e-1}+1] = 1. We find that the length of the longest common prefix of S2' = ttc and S1' = tctagttc is 1. L_{2,2} = row_{2,2} + 1 = 1+1 = 2. [Figure: D table with Diagonal d = 2 marked.]
The LCA of the suffixes ttc and tctagttc spells a string of length 1.
T = cttggggtcta and k = 2. At c = 4, T[1…4] = cttg, S2 = T^R[1…4] = gttc and S1 = T[5…11] = gggtcta. Since L_{1,2} = 4 gives j = 4 and i = 5, cttggggtc is the maximal approximate palindrome. [Figure: completed D table for S1 (index i) and S2 (index j).]
References
[U85] E. Ukkonen, Finding approximate patterns in strings, Journal of Algorithms, vol. 6, 1985, pp. 132-137.
[LV89] G. Landau and U. Vishkin, Fast parallel and serial approximate string matching, Journal of Algorithms, vol. 10, 1989, pp. 157-169.