Fixed-Parameter Algorithms for CLOSEST STRING and Related Problems. Algorithmica (2003). Jens Gramm, Rolf Niedermeier, Peter Rossmanith
Outline: Introduction; Preliminaries; Linear-time solution for constant d; Related problems; Linear-time solution for fixed k; Conclusion
Intro: Problem Definition. Input: strings s_1, s_2, …, s_k over alphabet Σ, each of length L, and a nonnegative integer d. Question: is there a string s of length L such that d_H(s, s_i) ≤ d for all i = 1, …, k? Here d_H is the Hamming distance: d_H(s_1, s_2) = |{i | s_1[i] ≠ s_2[i]}| for |s_1| = |s_2|.
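The definition above translates almost directly into code. The following is a minimal sketch; the function names `hamming` and `is_center` are illustrative and do not come from the slides.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def is_center(s: str, strings: list[str], d: int) -> bool:
    """True if s is within Hamming distance d of every input string."""
    return all(hamming(s, t) <= d for t in strings)
```

Checking a given candidate is easy, taking O(kL) time; the hard part of CLOSEST STRING is finding a candidate that passes the check.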
NP-completeness. CLOSEST STRING is NP-complete. d is usually small in biological applications. Result in this paper: an O(kL + kd·d^d) algorithm. A PTAS was given by Li et al.
Extended problems: d-MISMATCH; DISTINGUISHING STRING SELECTION; DISTINGUISHING SUBSTRING SELECTION.
Preliminaries. Given a set of strings S = {s_1, …, s_k}, each of length L: s is an optimal center string iff there is no s' such that max_{i=1,…,k} d_H(s', s_i) < max_{i=1,…,k} d_H(s, s_i); s is an optimal median string iff there is no s' such that Σ_{i=1,…,k} d_H(s', s_i) < Σ_{i=1,…,k} d_H(s, s_i).
Given a set of k strings of length L, think of the set as a k × L matrix, one string per row:
s1 = abcd
s2 = aadb
s3 = bcda
s4 = accc
An optimal median string: acca
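Unlike an optimal center string, an optimal median string can be read off the matrix column by column, because the sum of distances decomposes over columns: picking a most frequent symbol in each column is optimal. A minimal sketch follows; the names `median_string` and `total_distance` are illustrative, and ties between equally frequent symbols are broken arbitrarily, so the result may differ from the string on the slide while having the same total distance.

```python
from collections import Counter


def median_string(strings: list[str]) -> str:
    """Pick a most frequent symbol in every column of the k x L matrix."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*strings)
    )


def total_distance(s: str, strings: list[str]) -> int:
    """Sum of Hamming distances from s to all input strings."""
    return sum(sum(x != y for x, y in zip(s, t)) for t in strings)
```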
Main idea: search! Fixed-parameter tractability. Reduction to a problem kernel.
LEMMA 1. Let S = {s_1, …, s_k} be a set of strings, each of length L, and let σ: {1, …, L} → {1, …, L} be a permutation. Then s is an optimal center string for {s_1, …, s_k} iff σ(s) is an optimal center string for {σ(s_1), σ(s_2), …, σ(s_k)}.
LEMMA 2. To compute an optimal center string, it is sufficient to solve a normalized and reordered instance. From this, the solution of the original instance can be derived in linear time.
      original   normalized   reordered
s1    abcd       abaa         baaa
s2    aadb       acbb         cabb
s3    bcda       babc         abbc
s4    accc       aaad         aaad
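The normalization in the example can be sketched as a per-column renaming: within each column, the most frequent symbol becomes a, the next one b, and so on. The sketch below is an illustration under assumptions, not the paper's procedure. The names `normalize` and `permute_columns` are hypothetical, ties are broken by first occurrence in the column (an assumed rule that happens to reproduce the example), and a column may contain at most 26 distinct symbols.

```python
from collections import Counter
import string


def normalize(strings: list[str]) -> list[str]:
    """Rename symbols column by column so that frequent symbols come first."""
    new_columns = []
    for column in zip(*strings):
        counts = Counter(column)
        # Assumed tie-break: equally frequent symbols keep their order of
        # first occurrence in the column.
        ranked = sorted(counts, key=lambda ch: (-counts[ch], column.index(ch)))
        rename = {ch: string.ascii_lowercase[r] for r, ch in enumerate(ranked)}
        new_columns.append("".join(rename[ch] for ch in column))
    # Transpose back from columns to rows.
    return ["".join(row) for row in zip(*new_columns)]


def permute_columns(strings: list[str], order: list[int]) -> list[str]:
    """Reorder columns: new column j is old column order[j] (0-indexed)."""
    return ["".join(s[p] for p in order) for s in strings]
```

To map a solution of the normalized instance back to the original one, the per-column `rename` tables and the column order would have to be kept and inverted; that bookkeeping is omitted here.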
LEMMA 3. A CLOSEST STRING instance with arbitrary alphabet Σ, |Σ| > k, is isomorphic to a CLOSEST STRING instance with alphabet Σ', |Σ'| = k. Proof: by normalization.
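LEMMA 4. Given a CLOSEST STRING instance s_1, …, s_k of length L and d. If the resulting k × L matrix has more than kd dirty columns, then there is no string s with max_{i=1,…,k} d_H(s, s_i) ≤ d. A column is dirty iff it contains at least two different symbols from alphabet Σ. Proof: by the pigeonhole principle. Any s differs from at least one s_i in every dirty column, so with more than kd dirty columns some s_i differs from s in more than d positions.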
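Lemma 4 gives the reduction to a problem kernel: columns that are not dirty can be dropped, since a center string simply takes the common symbol there, and an instance with more than kd dirty columns can be rejected at once. A minimal sketch with hypothetical names `dirty_columns` and `kernelize`:

```python
def dirty_columns(strings: list[str]) -> list[int]:
    """0-indexed positions of columns containing at least two symbols."""
    return [p for p, col in enumerate(zip(*strings)) if len(set(col)) > 1]


def kernelize(strings: list[str], d: int):
    """Return (reduced strings, dirty positions), or None if Lemma 4
    already rules out a center string within distance d."""
    dirty = dirty_columns(strings)
    if len(dirty) > len(strings) * d:
        return None  # more than k*d dirty columns
    reduced = ["".join(s[p] for p in dirty) for s in strings]
    return reduced, dirty
```

The pass over the matrix takes O(kL) time, after which every string has length at most kd.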
A Linear-Time Solution for Constant d. Bounded search tree algorithm. LEMMA 5. Given a set of strings S = {s_1, …, s_k} and a positive integer d. If there are i, j ∈ {1, …, k} with d_H(s_i, s_j) > 2d, then there is no string s with max_{i=1,…,k} d_H(s, s_i) ≤ d (by the triangle inequality, d_H(s_i, s_j) ≤ d_H(s_i, s) + d_H(s, s_j) ≤ 2d).
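Theorem 1. Given a set of strings S = {s_1, …, s_k} and d, Algorithm D determines in O(kL + kd·d^d) time whether there is a string s with max_{i=1,…,k} d_H(s, s_i) ≤ d. By Lemma 4, the input instance is reduced to at most kd columns in O(kL) time. The search tree has depth d. Steps D0 to D3 together take O(kd) time per search tree node, by building a table containing the distances of the candidate string (initially s_1) to all other given strings.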
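Correctness. We show only the correctness of the first branching step. Suppose s_1 is not a solution but a center string s exists. Then some s_i has d_H(s_1, s_i) > d; let P be a set of d + 1 positions p with s_1[p] ≠ s_i[p]. Define P_{s_1≠s=s_i} := {p ∈ P | s_1[p] ≠ s[p] = s_i[p]} and P_{s≠s_i} := {p ∈ P | s[p] ≠ s_i[p]}. Goal: P_{s_1≠s=s_i} ≠ ∅. We have P = P_{s≠s_i} ∪ P_{s_1≠s=s_i} (disjoint) and |P_{s≠s_i}| ≤ d, so |P_{s_1≠s=s_i}| ≥ 1. Hence d + 1 subcases are sufficient.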
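The bounded search tree can be sketched as follows. This is an illustration based on the argument above, not a transcription of Algorithm D: the step labels in the comments are an assumed reading of D0 to D3, the name `closest_string` is hypothetical, and the sketch recomputes all distances at every node in O(kL) time instead of maintaining the distance table, so it does not reach the stated running time.

```python
def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def closest_string(strings: list[str], d: int):
    """Return a string within distance d of all inputs, or None."""

    def search(candidate: str, budget: int):
        # (D0, assumed label) no modifications left to spend
        if budget < 0:
            return None
        dists = [hamming(candidate, t) for t in strings]
        # (D1) some string is too far away to be repaired with the budget
        if max(dists) > d + budget:
            return None
        # (D2) the candidate is already a center string
        if max(dists) <= d:
            return candidate
        # (D3) pick a violated string and branch on d + 1 mismatch positions
        target = strings[next(i for i, x in enumerate(dists) if x > d)]
        diff = [p for p in range(len(candidate)) if candidate[p] != target[p]]
        for p in diff[: d + 1]:
            child = candidate[:p] + target[p] + candidate[p + 1:]
            found = search(child, budget - 1)
            if found is not None:
                return found
        return None

    # Start from the first input string with d allowed modifications.
    return search(strings[0], d)
```

Each node has at most d + 1 children and the depth is at most d, which bounds the tree size by (d + 1)^d = O(d^d).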
Related Problems: the d-MISMATCH problem. s_{i,p,L} denotes the length-L substring of a given string s_i starting at position p; the given strings have length n. Question: is there a string s of length L and a position p with 1 ≤ p ≤ n − L + 1 such that d_H(s, s_{i,p,L}) ≤ d for all i = 1, …, k? Stojanovic et al. give a linear-time algorithm for 1-MISMATCH. Theorem 2. d-MISMATCH is solvable in O(kL + (n − L)·kd·d^d) time, which is O(nk) for fixed d. Naively: O(n·(kL + kd·d^d)). Idea: maintain a queue of dirty columns. Considering only the first L columns, we can build a FIFO queue in O(kL) time; the queue is then updated at each position in O(k) time.
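The queue maintenance can be sketched with a double-ended queue: when the window moves one position to the right, the leftmost column leaves (and is popped if it was dirty) and one new column is tested for dirtiness in O(k) time. The generator name `dirty_windows` and the 0-indexed positions are choices made for this sketch; copying the queue on every step is for illustration only and would be avoided in an implementation that aims for the stated bound.

```python
from collections import deque


def dirty_windows(strings: list[str], L: int):
    """Yield (p, dirty columns in window [p, p + L)) for every start p."""
    n = len(strings[0])

    def is_dirty(c: int) -> bool:
        first = strings[0][c]
        return any(s[c] != first for s in strings)  # O(k) per column

    # Initial window: scan the first L columns once, O(kL).
    queue = deque(c for c in range(L) if is_dirty(c))
    yield 0, list(queue)
    for p in range(1, n - L + 1):
        # Column p - 1 leaves the window; it can only be the queue head.
        if queue and queue[0] == p - 1:
            queue.popleft()
        new_col = p + L - 1
        if is_dirty(new_col):
            queue.append(new_col)
        yield p, list(queue)
```

For each window, the dirty columns form the kernel on which the bounded search tree is run.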
DSS problem: DISTINGUISHING STRING SELECTION. Given S = {s_1, …, s_k1} and S' = {s'_1, …, s'_k2}, all strings of the same length L, and d_1, d_2 ≥ 0, is there a string s such that max_{i=1,…,k1} d_H(s, s_i) ≤ d_1 and min_{j=1,…,k2} d_H(s, s'_j) ≥ L − d_2? LEMMA 6. Given two sets of strings S_1 = {s_1, …, s_k1} and S_2 = {s'_1, …, s'_k2} and positive d_1, d_2. If there are i ∈ {1, …, k_1} and j ∈ {1, …, k_2} with d_H(s_i, s'_j) < L − (d_1 + d_2), then there is no string s satisfying both max_{i=1,…,k1} d_H(s, s_i) ≤ d_1 and min_{j=1,…,k2} d_H(s, s'_j) ≥ L − d_2. Proof: d_H(s, s'_j) ≤ d_H(s, s_i) + d_H(s_i, s'_j) < d_1 + L − (d_1 + d_2) = L − d_2.
A Linear-Time Solution for Fixed k. Is CLOSEST STRING fixed-parameter tractable with respect to k? Use integer linear programming (ILP). Lenstra: an ILP with a fixed number of variables can be solved in linear time (exponential space).
CLOSEST STRING as an ILP. Column types for k strings; for k = 3: (a,a,a)^T, (a,a,b)^T, (a,b,a)^T, (b,a,a)^T, (a,b,c)^T. Number of column types: B(k) ≤ k!, where B(k) is the k-th Bell number. Variables x_{t,φ}, with t a column type and φ ∈ Σ: the number of columns of type t whose corresponding character in the desired solution string of CLOSEST STRING is set to φ. B(k)·k variables are needed. Objective: minimize max_{i=1,…,k} d_H(s, s_i), expressed through the x_{t,φ}, where φ_{t,i} denotes the alphabet symbol at the i-th entry of column type t.
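One way to write the program so that it is consistent with these definitions is given below. It is a reconstruction under assumptions rather than the slide's own formula; in particular, the symbol #t for the number of columns of type t in the instance is introduced here. The solution string disagrees with s_i in a column of type t exactly when the symbol chosen for that column differs from φ_{t,i}, which gives the inner sums.

```latex
\begin{align*}
\text{minimize}\quad
  & \max_{1 \le i \le k} \; \sum_{t} \;
    \sum_{\varphi \in \Sigma \setminus \{\varphi_{t,i}\}} x_{t,\varphi} \\
\text{subject to}\quad
  & \sum_{\varphi \in \Sigma} x_{t,\varphi} = \#t
    \quad \text{for every column type } t, \\
  & x_{t,\varphi} \in \mathbb{Z}_{\ge 0}
    \quad \text{for every column type } t \text{ and } \varphi \in \Sigma .
\end{align*}
```

After normalization the alphabet has k symbols, so the number of variables B(k)·k depends only on k, which is what makes Lenstra's result applicable.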
Conclusion. Fixed-parameter tractability of CLOSEST STRING with respect to d and with respect to k. Improves on previous work for d-MISMATCH. DSS and CLOSEST SUBSTRING?