ON THE EFFICIENCY OF THE HAMMING C-CENTERSTRING PROBLEMS Amihood Amir Liam Roditty Jessica Ficler Oren Sar Shalom
Motivation – the Conference Location Problem
Consensus String Problem Input: points in space. Output: a point whose maximum distance from all input points is smallest.
Hamming Distance
Consensus String Problem (1-HRC)
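A minimal sketch of the definitions, assuming strings of equal length over a small alphabet. The function names (`hamming`, `radius`, `brute_force_consensus`) are illustrative and not taken from the slides, and the exhaustive search is exponential in the string length, so it is only meant to make the objective concrete on tiny inputs.

```python
from itertools import product


def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(s, t))


def radius(center, strings):
    """Maximum Hamming distance from the center to any input string."""
    return max(hamming(center, s) for s in strings)


def brute_force_consensus(strings, alphabet):
    """Try every candidate string of length l and keep one of minimum radius.

    Illustrative only: this enumerates |alphabet|**l candidates.
    """
    length = len(strings[0])
    candidates = ("".join(p) for p in product(alphabet, repeat=length))
    return min(candidates, key=lambda c: radius(c, strings))


if __name__ == "__main__":
    data = ["000", "011", "101"]
    center = brute_force_consensus(data, "01")
    print(center, radius(center, data))
```

Note that the returned center need not be one of the input strings.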
History: Frances and Litman [1997]: the problem is NP-complete even for binary alphabets. Therefore: 3 directions. 1. Solution for small k. 2. Fixed parameter tractability. 3. Approximation algorithms.
History: Solution for small k: Gramm, Niedermeier, and Rossmanith [2001] (3) Boucher, Brown, and Durocher [2008] (4 binary) A., Landau, Na, Park, Park, and Sim [2009] (3, radius & dist. sum optimization) A., Paryenty, and Roditty [2012] (5 binary, l 2 for all k: l k )
History: Fixed Parameter Tractability for all Parameters: Fixed l: Ben-Dor, Lancia, Perone, and Ravi [1997] Fixed k: Gramm, Niedermeier, and Rossmanith [2003] Fixed d: Stojanovic, Berman, Gumucio, Hardison, and Miller [1997] Lanctot, Li, Ma, Wang, and Zhang [1999] Sze, Lu, and Chen [2004]
History: Approximations: PTAS: Li, Ma, and Wang [2002] – not practical. Rounded LP: Ben-Dor, Lancia, Perone, and Ravi [1997] large number of variables: |Σ| l Chimani, Woste, and Bocker [2011]: can be reduced to: |Σ|( l -1) A., Paryenty, and Roditty [2011]: |T(S)| |Σ| (T(S)= set of column types)
Another Motivation – Clustering. The C-CenterStrings problem Input: 1. Points in space 2. Number c 3. Objective function f. Output: Divide the points into c sets such that, for the c consensus strings c_1, c_2, …, c_c, f(c_1, c_2, …, c_c) is maximum/minimum.
Three Types of Objective functions: Let HRC (Hamming Radius Clustering) be the consensus string problem defined before. 1. c-HRC: partition into c sets, each of which has center with radius d. 2. c-HRLC: partition into c sets, each of which has center with radius d, but center is part of input set. 3. c-HRSC: partition into c sets, each of which has a center and the sum of the radii does not exceed d.
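The three objectives can be compared by checking a proposed solution against each of them. The sketch below assumes a solution is given as a list of non-empty clusters with one center per cluster; the function names are hypothetical, and reading "center is part of input set" as "center is any input string" is an assumption.

```python
def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))


def cluster_radius(center, cluster):
    """Largest distance from the center to a string of its (non-empty) cluster."""
    return max(hamming(center, s) for s in cluster)


def is_hrc_solution(clusters, centers, d):
    """c-HRC: every cluster lies within radius d of its center."""
    return all(cluster_radius(c, cl) <= d for c, cl in zip(centers, clusters))


def is_hrlc_solution(clusters, centers, d):
    """c-HRLC: as c-HRC, and every center is one of the input strings."""
    inputs = {s for cl in clusters for s in cl}
    return is_hrc_solution(clusters, centers, d) and all(c in inputs for c in centers)


def is_hrsc_solution(clusters, centers, d):
    """c-HRSC: the sum of the cluster radii does not exceed d."""
    return sum(cluster_radius(c, cl) for c, cl in zip(centers, clusters)) <= d


if __name__ == "__main__":
    clusters = [["001", "010"], ["111"]]
    centers = ["000", "111"]
    print(is_hrc_solution(clusters, centers, 1))   # center 000 is allowed here
    print(is_hrlc_solution(clusters, centers, 1))  # 000 is not an input string
    print(is_hrsc_solution(clusters, centers, 1))  # radii are 1 and 0
```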
The Hamming radius c-clustering problem (c-HRC) Example: For the following strings and d=1, we show it belongs to 2-HRC.
The Hamming radius local c-clustering problem (c-HRLC) Example: For the following strings and d=2, we show it belongs to 2-HRLC. Does it belong to 2-HRLC when d=1?
The Hamming radius c-clustering sum problem (c-HRSC) Example: For the following strings and d=2, we show it belongs to 2-HRSC.
In this Paper: We consider: 1. Parameterized Complexity, and 2. Approximations. Small k is not too meaningful in the context of clustering.
C-CenterString Parameterized Complexity
Columns: c fixed; k fixed; d fixed (d=1); d/l and c fixed; l fixed (l=2)
HRC: NPC; polynomial time; NPC; polynomial time; ?
HRLC: polynomial time; ?; NPC
HRSC: NPC; polynomial time; ?; ?
Theorem: HRC, HRLC and HRSC can be solved in polynomial time for fixed k. If k ≤ c, each input string can serve as its own center, so d = 0. Otherwise c < k. There are c^k < k^k options for partitioning k strings into c sets. - For each set, find the consensus center in polynomial time. - The partition that gives the best result is the optimal solution.
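The enumeration in the proof can be sketched as follows for the HRC objective. This is an illustration under stated assumptions, not the authors' implementation: `best_radius` is a hypothetical stand-in that searches all candidate centers exhaustively (exponential in l), whereas the theorem relies on a consensus step that is polynomial for fixed k.

```python
from itertools import product


def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(s, t))


def best_radius(cluster, alphabet):
    """Stand-in for the consensus step: exhaustive search over all centers."""
    length = len(cluster[0])
    return min(
        max(hamming(cand, s) for s in cluster)
        for cand in product(alphabet, repeat=length)
    )


def min_hrc_radius(strings, c, alphabet):
    """Smallest d for which the strings admit a partition into c clusters."""
    k = len(strings)
    if k <= c:
        return 0  # every string is its own center
    best = None
    for labels in product(range(c), repeat=k):  # all c**k assignments
        clusters = [
            [s for s, lab in zip(strings, labels) if lab == i] for i in range(c)
        ]
        # Empty clusters are skipped; the objective is the largest radius.
        worst = max(best_radius(cl, alphabet) for cl in clusters if cl)
        if best is None or worst < best:
            best = worst
    return best


if __name__ == "__main__":
    data = ["000", "001", "111", "110"]
    print(min_hrc_radius(data, 2, "01"))
```

Replacing the `max` over clusters with a `sum` would give the corresponding sketch for the HRSC objective.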
Theorem: HRC is NP-complete even if the radius is fixed to d = 1 and the alphabet is binary. By reduction from Vertex Cover for triangle-free graphs. Our input: G, a triangle-free graph; t, the size of the vertex cover.
The construction: The c parameter is t. The distance parameter d is 1. Encode edges as bit strings of length |V|: set the bits of the two endpoints of the edge.
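The edge encoding can be sketched as below, with vertices numbered 0 to |V|-1 (an assumption). The helper `centers_from_cover` is hypothetical and only illustrates one direction of the argument: a string with a single set bit for a cover vertex is at Hamming distance 1 from every edge string incident to that vertex. It is not claimed to be the authors' full proof.

```python
def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))


def encode_edges(num_vertices, edges):
    """Each edge (u, v) becomes a bit string with exactly bits u and v set."""
    strings = []
    for u, v in edges:
        bits = ["0"] * num_vertices
        bits[u] = bits[v] = "1"
        strings.append("".join(bits))
    return strings


def centers_from_cover(num_vertices, cover):
    """One candidate center per cover vertex: only that vertex's bit is set."""
    return [
        "".join("1" if i == v else "0" for i in range(num_vertices))
        for v in cover
    ]


def covered_within(strings, centers, d):
    """True if every string is within distance d of at least one center."""
    return all(any(hamming(s, c) <= d for c in centers) for s in strings)


if __name__ == "__main__":
    # Path 0-1-2-3 is triangle-free; {1, 2} is a vertex cover of size 2.
    edge_strings = encode_edges(4, [(0, 1), (1, 2), (2, 3)])
    print(edge_strings)
    print(covered_within(edge_strings, centers_from_cover(4, [1, 2]), 1))
```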
Theorem: HRLC is NP-complete even if the length is fixed to l = 2. We prove this by reduction from Minimum Maximal Matching for bipartite graphs. Our input: G, a bipartite graph; t, the size of a minimum maximal matching. Figure panels: Maximal Matching; Minimum Maximal Matching.
The construction: The c parameter is t. The distance parameter d is
Move strings [6,2] and [5,2] if there are centers that begin with 5 or … Change the center to one of the remaining strings. We keep going until no two centers share a common symbol.
Approximation Algorithms 1. A linear-time 4-Approximation for the 2-HRSC problem. 2. A polynomial time 3-Approximation for the 2-HRSC problem. 3. Special case PTAS – by computing the clusters and doing 1-HRC approximation on each cluster.
>2d Lemma
Proof center
If we had a representative from each cluster, we could assign the rest of the strings to the appropriate cluster. Then use a known approximation algorithm for 1-HRC to find the consensus string of each cluster. (Figure label: >2d)
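The two steps can be sketched as follows, assuming the representatives are already known. Both function names are hypothetical. The center step is a simple stand-in rather than the specific 1-HRC algorithm the slide refers to: it picks an input string of minimum radius, which by the triangle inequality has radius at most twice the optimum.

```python
def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))


def assign_to_representatives(strings, representatives):
    """Group each string with its nearest representative (ties go to the first)."""
    clusters = [[] for _ in representatives]
    for s in strings:
        nearest = min(
            range(len(representatives)),
            key=lambda i: hamming(s, representatives[i]),
        )
        clusters[nearest].append(s)
    return clusters


def input_string_center(cluster):
    """Stand-in 1-HRC step: the input string with the smallest radius."""
    return min(cluster, key=lambda c: max(hamming(c, s) for s in cluster))


if __name__ == "__main__":
    data = ["0000", "0001", "1111", "1110"]
    groups = assign_to_representatives(data, ["0000", "1111"])
    print(groups)
    print([input_string_center(g) for g in groups])
```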
>4d Lemma Cluster c-center
Proof ≤d
Polynomial time approximation algorithm for 2-HRSC problem
Future work 1. We presented a heuristic algorithm that did very well in practice. What is its approximation ratio? 2. There are some gaps in the parameterized complexity table: a. What happens in the HRLC/HRSC cases for fixed d? b. What happens in the HRC/HRSC cases for fixed l? 3. Is there a PTAS for c-HRC? 4. Can we approximate c-HRC using LP? SDP?