Flipping letters to minimize the support of a string
Giuseppe Lancia, Franca Rinaldi, Romeo Rizzi
University of Udine
Outline of talk: 1. Problem definition 2. Parametrized complexity 3. Polynomial cases 4. NP-hardness 5. ILP formulations
1. Problem definition
We are given a string s and a parameter k (e.g., k = 3).
The string has a set of k-mers, its support, K(s).
Sliding a window of length k over the example string s collects its k-mers one by one:
K(s) = { 010, 100, 001, 011 }, |K(s)| = 4
By flipping some bits, we could reduce the number of k-mers. For a suitable s':
K(s') = { 010, 100, 001 }, |K(s')| = 3
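The definition of the support can be sketched in a few lines of Python. This is only an illustration: the function name `support` and the two example strings are assumptions, chosen so that the supports agree with the sets listed above; they are not taken from the slides.

```python
def support(s: str, k: int) -> set[str]:
    """Return K(s): the set of distinct length-k substrings (k-mers) of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


# Hypothetical strings, chosen only so that the supports match the example.
s = "010010011"
s_flipped = "010010010"  # s with its last bit flipped

print(sorted(support(s, 3)))          # 4 distinct 3-mers
print(sorted(support(s_flipped, 3)))  # 3 distinct 3-mers
```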
The Problem:
Ingredients:
- A string s over an alphabet Σ
- A parameter k (k-mer size)
- A budget B
Objective:
Change at most B letters in s so that the resulting s' has as few distinct k-mers as possible.
Equivalently: find a string s' with d(s,s') <= B with the smallest number of k-mers, where d is the Hamming distance.
Motivation:
Real: Curiosity-driven (it's a cute combinatorial problem)
Fictitious: Analysis of DNA sequences
atcgattgatccttta: atc, tcg, cga, gat, ...
3-mers are amino acid codons. Protein complexity relates to the number of codons. Mutations may reduce complexity...
Our results:
The problem has many parameters (|s|, |Σ|, k, B); we study all versions (where possibly some of the parameters are bounded).
- Polynomial special cases (e.g., for B fixed, or for both k and |Σ| fixed)
- NP-hard special cases (even for k = 2 or |Σ| = 2)
2. Parametrized complexity
[Table: the 16 cases obtained by asking, for each of |Σ|, k, |s| and B, whether the parameter is bounded (YES) or not (NO). Rows: |Σ| and k. Columns: |s| and B.]
We can assume:
- k <= |s|
- B <= |s|
- |Σ| <= |s| (we don't need any symbol not already in s)
The table marks the polynomial cases and the NP-hard cases: NP-hard for |Σ| = 2, NP-hard for k = 2.
3. Polynomial cases
The case |Σ| and k fixed:
We start with this subproblem:
SUB(A): Given a set of k-mers A, can we correct s within budget so that it has all of its k-mers in A?
Example: A = { 0100, 1001, 0010, 0001 }, B = 3
[Figure: layered graph built from s and A]
- Each path corresponds to a string s' with all its k-mers in A
- The length of the path is the Hamming distance d(s', s)
- SUB(A) has a solution iff the shortest path is <= B
Consequences:
- We can solve SUB(A) in polytime, O(|A| |Σ| |s|) = O(|s|), since |A| <= |Σ|^k is a constant when |Σ| and k are fixed
- There are "only" 2^(|Σ|^k) possible subsets A to try, so the problem is solved in polytime O(|s|)
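The shortest-path idea behind SUB(A) can be sketched as a layer-by-layer dynamic program. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the indexing of predecessors by their (k-1)-suffix, and the test strings are all hypothetical, and every k-mer in A is assumed to have length exactly k <= |s|.

```python
from itertools import combinations, product

INF = float("inf")


def min_distance_within(s: str, k: int, A: set[str]) -> float:
    """Smallest Hamming distance d(s', s) over strings s' of length |s|
    whose k-mers all lie in A (INF if no such s' exists)."""
    # Layer 0: cost of turning the first window of s into each k-mer of A.
    dist = {a: sum(x != y for x, y in zip(a, s[:k])) for a in A}
    for i in range(1, len(s) - k + 1):
        # Cheapest way to reach any k-mer of the previous layer, per suffix.
        best_by_suffix: dict[str, float] = {}
        for a, d in dist.items():
            best_by_suffix[a[1:]] = min(d, best_by_suffix.get(a[1:], INF))
        new_char = s[i + k - 1]  # the only new letter in window i
        # b can follow a only if a's (k-1)-suffix equals b's (k-1)-prefix.
        dist = {b: best_by_suffix.get(b[:-1], INF) + (b[-1] != new_char)
                for b in A}
    return min(dist.values(), default=INF)


def sub_a(s: str, k: int, A: set[str], B: int) -> bool:
    """SUB(A): can s be corrected within budget B so all its k-mers are in A?"""
    return min_distance_within(s, k, A) <= B


def min_support_size(s: str, k: int, B: int, alphabet: str = "01") -> int:
    """Try subsets A of all k-mers by increasing size; first feasible size wins."""
    all_kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    for size in range(1, len(all_kmers) + 1):
        for A in combinations(all_kmers, size):
            if sub_a(s, k, set(A), B):
                return size
    return len(all_kmers)
```

The subset enumeration is only practical for very small |Σ| and k (256 subsets for binary 3-mers), which is in line with the "only" in the bullet above.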
The case of B fixed:
For B fixed, we can try all possible solutions.
There are O(|s|^B) possible choices of bits to flip. Since |Σ| <= |s|, the number of ways to flip them is bounded by |Σ|^B <= |s|^B.
We can try them all, and count the # of k-mers.
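The exhaustive search for fixed B can be sketched as follows. This is an illustrative brute force, not the authors' code: the name `brute_force`, the default alphabet (the symbols already occurring in s) and the test string are assumptions.

```python
from itertools import combinations, product


def brute_force(s: str, k: int, B: int, alphabet=None):
    """Try every way of rewriting at most B positions of s.
    Returns (smallest number of distinct k-mers, one string achieving it)."""
    if alphabet is None:
        alphabet = sorted(set(s))  # no symbol outside s is ever needed

    def n_kmers(t: str) -> int:
        return len({t[i:i + k] for i in range(len(t) - k + 1)})

    best = (n_kmers(s), s)
    for b in range(1, B + 1):
        for positions in combinations(range(len(s)), b):
            for letters in product(alphabet, repeat=b):
                t = list(s)
                for p, c in zip(positions, letters):
                    t[p] = c  # c may equal the old letter; still within budget
                candidate = "".join(t)
                count = n_kmers(candidate)
                if count < best[0]:
                    best = (count, candidate)
    return best


count, s_prime = brute_force("010010011", 3, 1)
```

The number of candidates grows like |s|^B times |Σ|^B, so this is polynomial only because B is treated as a constant.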
4. NP-hardness
- Theorem: the problem is NP-hard even for k=2.
- Proof: reduction from COMPACT BIPARTITE SUBGRAPH (CBS)
INSTANCE: a bipartite graph G = (U,V;E). Integers n, m
PROBLEM: does there exist a set X of nodes such that (i) |X| <= n and (ii) at least m edges have both endpoints in X?
Example: n = 6, m = 5
(Note that CBS is NP-hard because it includes MAX BALANCED BIP GRAPH, for n = 2t, m = t^2.)
The reduction:
Recall CBS: does there exist a set X satisfying (i) and (ii)?
[Figure: example bipartite graph on the nodes a b c d e f g h i l]
Let Σ = {a,b,c,d,e,f,g,h,i,l} ∪ {two extra symbols}, B = |E| - m
IDEA: Encode an edge (i,j) as … i j …, together with the extra symbols, and make unavoidable all k-mers that pair a node symbol x with the first extra symbol (in either order), as well as a k-mer made of extra symbols (i.e., insert a LOT of each of them in s).
s = a a ... a a b b ... b b ... l l l ... l a g b g e i (extra symbols not shown)
The only k-mers that can be destroyed are those that pair a node symbol x with the second extra symbol (in either order), and this is achieved by "flipping" the second extra symbol into the first one. This corresponds to removing the edge.
The k-mers of this type which remain define the set X, which covers at least m edges.
- With similar reductions, we can also prove the Theorem: the problem is NP-hard even for |Σ| = 2
5. Integer Linear Programming formulations
Exponential-size formulation (Σ = {0,1})
Let K be the set of all possible k-mers, e.g., for k = 3:
K = { 000, 001, 010, 011, 100, 101, 110, 111 }
Define a 0/1 variable for each k-mer in K and a 0/1 variable for each position.
[Formulation: min … (objective and constraints shown as formulas on the slide)]
- Exponential number of variables/constraints: pricing & separation (P&S) problems
- Our P&S strategy is exponential in the general case (polynomial for k fixed)
- Fractional solutions may lead to effective heuristics (e.g., solving SUB(A) with …)
Polynomial-size formulation (Σ = {0,1})
In addition to the variables z, there are variables w for positions i and j saying "is the k-mer starting at i identical to the k-mer starting at j?"
The variables w depend on s and z via linear constraints.
To count only k-mers that have no identical k-mer following them, a position i is counted if all k-mers starting after i are different from the k-mer at i.
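The counting rule above (count a position only if no identical k-mer starts later) gives exactly the number of distinct k-mers, because each distinct k-mer is counted once, at its last occurrence. A small sketch of this rule on a fixed string, where the helper `w(i, j)` only mimics the role of the comparison variables and all names are hypothetical:

```python
def count_by_last_occurrence(s: str, k: int) -> int:
    """Count positions i whose k-mer does not occur again at any later position."""
    starts = range(len(s) - k + 1)

    def w(i: int, j: int) -> bool:
        # "Is the k-mer starting at i identical to the k-mer starting at j?"
        return s[i:i + k] == s[j:j + k]

    return sum(1 for i in starts if not any(w(i, j) for j in starts if j > i))


def count_distinct(s: str, k: int) -> int:
    """Direct count of distinct k-mers, for comparison."""
    return len({s[i:i + k] for i in range(len(s) - k + 1)})
```

Both functions agree on any input; the pairwise version uses a quadratic number of comparisons, which mirrors the polynomial number of w variables.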
- The formulation (not shown) is basically a big boolean formula
- It yields a poor bound compared to the exponential formulation
- It can be improved in many ways (especially via valid cuts)
- At this stage it's not clear which method is best for not-too-small instances
- We'll run experiments with variants of both formulations