Efficient Algorithms for Some Variants of the Farthest String Problem Chih Huai Cheng, Ching Chiang Huang, Shu Yu Hu, Kun-Mao Chao
Abstract Given k strings of the same length L and an integer d, find a string s such that the Hamming distance between s and each of the k strings is at least d. The problem is NP-complete if the distance d is not given. We provide an efficient algorithm for fixed k and L.
Let’s Begin Input: Strings s_1, s_2, ..., s_k of length L over an alphabet Σ, and a nonnegative integer d. Question: Is there a string s of length L such that d_H(s, s_i) ≥ d for all i = 1, ..., k? FARTHEST STRING can be solved in O(kL · (|Σ|(L-d))^(L-d)) time by a bounded search tree algorithm for fixed parameters L and d.
Definitions s: a string of length L. S: a set of strings of length L. d_H(s_1, s_2): the Hamming distance between the two strings s_1 and s_2.
Key Observation Given a set of binary strings S = {s_1, s_2, ..., s_k} and a positive integer d. If there are i, j ∈ {1, ..., k} with d_H(s_i, s_j) > x, then there is no string s with min_{i=1,...,k} d_H(s, s_i) > L - x/2. Reason: the positions where s_i and s_j are the same contribute at most L-x, and the positions where they are different contribute at most x/2, so the minimum is at most (L-x) + x/2 = L - x/2.
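The counting argument behind this observation can be written out as a short derivation. This is a sketch for the binary case only; the symbols y (the exact distance d_H(s_i, s_j), with y ≥ x) and a (the number of positions where s_i and s_j agree and s differs from both) are introduced here for illustration and do not appear on the slide. At every position where s_i and s_j differ, a binary string s differs from exactly one of them, which gives the term y.

```latex
\begin{align*}
d_H(s, s_i) + d_H(s, s_j) &= 2a + y, \qquad a \le L - y, \\
\min\{d_H(s, s_i),\, d_H(s, s_j)\}
  &\le \tfrac{1}{2}\bigl(2(L - y) + y\bigr)
   = L - \tfrac{y}{2} \le L - \tfrac{x}{2}.
\end{align*}
```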
Key Observation Given a set of binary strings S = {s_1, s_2, ..., s_k} and a positive integer d. If there are i, j ∈ {1, ..., k} with d_H(, s_j) = d. This can be used to discard some of the strings.
The Idea Of The Algorithm Choose a “candidate string” first. If there is a string s_i, i = 2, ..., k, that matches the candidate string in more than L-d positions, we recursively try several ways to move the candidate string “away from” s_i. Stop either if the candidate has moved “too far away” (more than L-d steps) or if we find a solution. By a careful selection of subcases, we can limit the size of this search tree to O((L-d)^(L-d)).
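Algorithm By Pseudo-Code Annotations on the pseudo-code: In the beginning, the call is FSd(,L-d). If the remaining step count is negative, we have moved more than L-d steps and, from the previous observation, we stop. If the candidate string satisfies every input string, we have found the answer. Otherwise, choose an unsatisfied string. P is the set of positions where the candidate string and the unsatisfied string have the same symbol. Choose a subset of P that has L-d+1 positions. For each chosen position, change that position once and make a recursive call.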
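The annotated steps can be sketched in Python as follows. This is a minimal sketch for the binary alphabet only, not the authors' code. The initial candidate is lost from the slide text, so starting from the bitwise complement of the first string is an assumption made here; it is chosen because any solution s with d_H(s, s_1) ≥ d lies within L-d flips of that complement. The names `farthest_string`, `hamming`, and `search` are hypothetical.

```python
from typing import List, Optional


def hamming(a: str, b: str) -> int:
    """Number of positions where a and b differ (equal lengths assumed)."""
    return sum(x != y for x, y in zip(a, b))


def farthest_string(strings: List[str], d: int) -> Optional[str]:
    """Return a binary string at distance >= d from every input, or None."""
    L = len(strings[0])
    flip = {"0": "1", "1": "0"}
    # Assumption: start from the complement of the first string.
    start = "".join(flip[c] for c in strings[0])

    def search(cand: str, budget: int) -> Optional[str]:
        if budget < 0:
            # The candidate has been moved more than L-d steps.
            return None
        # Look for an unsatisfied string (closer than d to the candidate).
        unsat = next((t for t in strings if hamming(cand, t) < d), None)
        if unsat is None:
            return cand  # every string is satisfied: answer found
        # P: positions where candidate and unsatisfied string agree.
        same = [p for p in range(L) if cand[p] == unsat[p]]
        # Branch on L-d+1 of those positions, changing one position each.
        for p in same[: L - d + 1]:
            moved = cand[:p] + flip[cand[p]] + cand[p + 1:]
            found = search(moved, budget - 1)
            if found is not None:
                return found
        return None

    return search(start, L - d)
```

Each call branches into at most L-d+1 subcases and the recursion depth is bounded by L-d, which is where the bound on the search tree size comes from. For a larger alphabet, each branch would also have to try the alternative symbols at the chosen position.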
Illustration By Graph [Figure: the search tree for the example L = 7, d = 5, annotated with the number of branch nodes and the height of the tree.]
Pseudo-Code & Time Complexity Annotated costs: O(1) for the constant-time steps; O((L-d)^(L-d)) recursive calls; O(kL) work per call; total = O(kL · (L-d)^(L-d)).
Correctness We have to show that Algorithm FSd will find a string s with min_{i=1,...,k} d_H(s, s_i) ≥ d, if such an s exists. Case 1: The candidate string satisfies min_{i=1,...,k} d_H(s, s_i) ≥ d. Case 2: The candidate string is not a solution, but there exists a string s that satisfies min_{i=1,...,k} d_H(s, s_i) ≥ d. In Case 2 there is a string s_i, i = 2, ..., k, whose Hamming distance to the candidate string is less than d. We will explain why the algorithm creates L-d+1 subcases and prove that it reaches a correct answer.
Correctness – Case2 Take the first recursion as an example:
1. d_H(s, ) ≥ d: this means there are at most L-d positions that have the same symbol between s and .
2. d_H(s, s_i) ≥ d: this means there are at most L-d positions that have the same symbol between s and s_i.
3. We choose L-d+1 positions where the candidate string and s_i have the same symbol.
4. Because s and s_i have the same symbol in at most L-d positions, by the pigeonhole principle at least one of the L-d+1 chosen positions is a position where s and s_i differ.
5. Changing the candidate at that position moves the candidate string closer to the farthest string.
6. In at most L-d steps, the farthest string is reached.
Correctness - Case2 [Figure: an example with L = 9, d = 4, marking the positions where the strings are the same and where they are different.]
Farthest String by Maximum Hamming Distance Sum Input: Strings s_1, s_2, ..., s_k of length L over an alphabet Σ. Question: Find a string s ∈ {s_1, s_2, ..., s_k} that maximizes the sum of Hamming distances Σ_{i=1..k} d_H(s, s_i).
Naïve Approach Approach: In each column, select the symbol that occurs the fewest times. This is the so-called minority vote. Example: the number of 1’s in the first bit is the least among the candidates, so we choose 1 as the minority vote; the number of 0’s in the second bit is the least, so the minority vote is 0. It still doesn’t work: the resulting string need not be one of the input strings.
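A small sketch of the minority vote makes the difficulty concrete. The function name `minority_vote`, the `alphabet` parameter, and the sample input are assumptions for illustration and are not taken from the slides; ties are broken by the order of the symbols in `alphabet`.

```python
from collections import Counter
from typing import List


def minority_vote(strings: List[str], alphabet: str = "01") -> str:
    """Pick, in every column, the symbol that occurs the fewest times."""
    result = []
    for column in zip(*strings):
        counts = Counter(column)
        # Counter returns 0 for symbols that never occur in this column.
        result.append(min(alphabet, key=lambda a: counts[a]))
    return "".join(result)


strings = ["000", "001", "111"]
vote = minority_vote(strings)
print(vote, vote in strings)
```

On this input the vote is built column by column from the rarest symbols, and the resulting string is not among the three inputs, so it cannot be returned when s must come from {s_1, ..., s_k}.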
The concept of weighted sum Therefore we want to decide, for each column, how much each symbol contributes to the total Hamming distance. To achieve this we use an array that records the number of times each symbol occurs in every column. We then calculate the sum of Hamming distances for every string.
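Key Observation Definition: num(α, i) is the number of times the symbol α appears in the i-th column. We have to prove that the sum of the Hamming distances from a string s to all k strings equals, summed over the columns, the total number of strings minus the number of times the symbol of s appears in that column: Σ_{j=1..k} d_H(s, s_j) = Σ_{i=1..L} (k - num(s[i], i)). If this is proven, then the string with the maximum sum will be our answer.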
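One way to see the identity is to exchange the order of summation. This is a sketch of the argument, not taken from the slides; the bracket [A] is assumed to denote 1 when A holds and 0 otherwise, and s[i] denotes the symbol of s in column i.

```latex
\begin{align*}
\sum_{j=1}^{k} d_H(s, s_j)
  &= \sum_{j=1}^{k} \sum_{i=1}^{L} \bigl[\, s[i] \ne s_j[i] \,\bigr] \\
  &= \sum_{i=1}^{L} \sum_{j=1}^{k} \bigl[\, s[i] \ne s_j[i] \,\bigr] \\
  &= \sum_{i=1}^{L} \bigl( k - \mathrm{num}(s[i], i) \bigr).
\end{align*}
```

The last step holds because, in column i, exactly num(s[i], i) of the k strings carry the same symbol as s, so the remaining k - num(s[i], i) strings differ from s there.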
Pseudo-code & Time Complexity
    for p = 1 to L
        for i = 1 to k
            num[p][s_i[p]] += 1
    farthest = 1
    dis = 0
    for i = 1 to k
        temp_dis = 0
        for p = 1 to L
            temp_dis += k - num[p][s_i[p]]
        if temp_dis > dis
            dis = temp_dis
            farthest = i
    return s_farthest
Counting the number of times each symbol occurs in each column and storing it in the two-dimensional array num[][] takes O(kL) time. Calculating the weighted sum of one string takes O(L) time, so the sums for all k strings take O(kL) time, and the total time is therefore O(kL).
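The pseudo-code translates almost directly into Python. The following is a minimal sketch under the assumption that all strings have the same length; the function name `farthest_by_sum` is hypothetical, and a list of `collections.Counter` objects stands in for the two-dimensional array num.

```python
from collections import Counter
from typing import List, Tuple


def farthest_by_sum(strings: List[str]) -> Tuple[str, int]:
    """Return the input string with the largest Hamming distance sum."""
    k = len(strings)
    # num[p][a]: how often symbol a occurs in column p (O(kL) to build).
    num = [Counter(column) for column in zip(*strings)]
    best, best_sum = strings[0], -1
    for s in strings:
        # Weighted sum of one string: O(L).
        total = sum(k - num[p][ch] for p, ch in enumerate(s))
        if total > best_sum:
            best, best_sum = s, total
    return best, best_sum


print(farthest_by_sum(["000", "001", "111"]))
```

For the three sample strings, the distance sums are 4, 3, and 5 respectively, so the last string is selected.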
Thank You