Published by Mavis Bruce. Modified over 9 years ago.
1
Flipping letters to minimize the support of a string
Giuseppe Lancia, Franca Rinaldi, Romeo Rizzi
University of Udine
2
Outline of talk:
1. Problem definition
2. Parametrized complexity
3. Polynomial cases
4. NP-hardness
5. ILP formulations
3
1. Problem definition
4
We are given a string s and a parameter k (e.g., k = 3). The string has a set of k-mers, its support, K(s).
s = 010010011
15
We are given a string s and a parameter k (e.g., k = 3).
s = 010010011
K(s) = {010, 100, 001, 011}, |K(s)| = 4
By flipping some bits, we could reduce the number of k-mers:
s' = 010010010
K(s') = {010, 100, 001}, |K(s')| = 3
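The support of the example string can be checked with a few lines of Python. This is a minimal sketch; the function name `support` is chosen here for illustration and does not come from the slides.

```python
def support(s: str, k: int) -> set[str]:
    """Return K(s): the set of distinct length-k substrings (k-mers) of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


s = "010010011"
s_flipped = "010010010"  # the last bit of s flipped from 1 to 0

print(sorted(support(s, 3)))          # ['001', '010', '011', '100']
print(sorted(support(s_flipped, 3)))  # ['001', '010', '100']
```

Flipping the last bit removes the k-mer 011 and creates no new one, which is why the support shrinks from 4 to 3.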
19
The Problem:
Ingredients:
- A string s over an alphabet Σ
- A parameter k (k-mer size)
- A budget B
Objective: Change at most B letters in s so that the resulting s' has as few distinct k-mers as possible
20
The Problem, objective restated: Find a string s' with d(s, s') <= B that has the smallest number of k-mers (d is the Hamming distance)
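In symbols, the objective can be written as below. The notation Σ, K(·) and d follows the slides; the name opt is introduced here only for illustration.

```latex
\[
\mathrm{opt}(s,k,B) \;=\; \min_{s' \in \Sigma^{|s|}} \bigl\{\, |K(s')| \;:\; d(s,s') \le B \,\bigr\},
\qquad
d(s,s') = \bigl|\{\, i : s_i \ne s'_i \,\}\bigr| .
\]
```

For instance, the earlier example with s = 010010011 and k = 3 shows that opt(s, 3, 1) is at most 3.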
22
Motivation:
Real: Curiosity-driven (it's a cute combinatorial problem)
Fictitious: Analysis of DNA sequences
atcgattgatccttta: atc, tcg, cga, gat, …
3-mers are amino acid codons. Protein complexity relates to the number of codons. Mutations may reduce complexity…
23
Our results:
The problem has many parameters (|s|, |Σ|, k, B); we study all versions (where some of the parameters may be bounded)
- Polynomial special cases (e.g., for B fixed, or both k and |Σ| fixed)
- NP-hard special cases (even for k = 2 or |Σ| = 2)
24
2. Parametrized complexity
25
Each parameter is either bounded (YES) or not (NO), giving a 4 x 4 table of cases:
Columns: |s| NO, B NO | |s| YES, B NO | |s| NO, B YES | |s| YES, B YES
Rows: |Σ| NO, k NO | |Σ| YES, k NO | |Σ| NO, k YES | |Σ| YES, k YES
26
We can assume: k <= |s|
28
We can assume: B <= |s|
29
We can assume: |Σ| <= |s| (we don't need any symbol not already in s)
31
Polynomial cases
32
NP-hard cases: NP-hard for |Σ| = 2, NP-hard for k = 2
33
3. Polynomial cases
34
The case |Σ| and k fixed:
36
The case |Σ| and k fixed: We start with this subproblem: SUB(A): Given a set of k-mers A, can we correct s within budget so that all of its k-mers are in A?
46
The case |Σ| and k fixed (example):
s = 01000100101110
A = {0100, 1001, 0010, 0001}
B = 3
[Figure: layered graph; each layer lists the k-mers of A (0100, 1001, 0010, 0001), and the arcs carry numeric costs]
Each path corresponds to a string s' with all its k-mers in A
The length of the path is the Hamming distance d(s', s)
SUB(A) has a solution iff the shortest path has length <= B
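The shortest-path idea for SUB(A) can be sketched as a dynamic program over the layers of the graph. The arc costs used here are an assumed reading of the figure, not something the slides state: entering the first layer with k-mer a costs the Hamming distance between a and the first k letters of s, and each later arc costs 1 if the appended letter differs from s and 0 otherwise. The function names are hypothetical.

```python
from math import inf


def hamming(a: str, b: str) -> int:
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def sub_a_min_cost(s: str, kmers: set[str]) -> float:
    """Minimum Hamming distance from s to a string whose k-mers all lie in `kmers`.

    Assumes `kmers` is non-empty, all of one length k, and len(s) >= k.
    Returns math.inf if no string of length len(s) has all its k-mers in `kmers`.
    """
    k = len(next(iter(kmers)))
    # Layer 0: pick the first k-mer; its cost is the distance to s[0:k].
    cost = {a: hamming(s[:k], a) for a in kmers}
    # Each later layer appends one letter: b must overlap a on k-1 letters.
    for pos in range(k, len(s)):
        nxt = {}
        for b in kmers:
            best = min((c for a, c in cost.items() if a[1:] == b[:-1]), default=inf)
            nxt[b] = best + (b[-1] != s[pos])
        cost = nxt
    return min(cost.values())


example = sub_a_min_cost("01000100101110", {"0100", "1001", "0010", "0001"})
print(example)  # 4
```

Under these assumptions the cheapest path for the example data has length 4, so SUB(A) with B = 3 would be answered "no" for this particular A. Each layer is processed in time proportional to |A| squared in this naive version; indexing predecessors by their (k-1)-letter suffix would bring it closer to the bound quoted in the slides.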
48
The case |Σ| and k fixed:
- We can solve SUB(A) in polytime O(|A| |Σ| |s|) = O(|s|), since |A| <= |Σ|^k is a constant
- There are "only" 2^(|Σ|^k) possible subsets A to try, so the problem is solved in polytime O(|s|)
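The linear bound can be checked by summing the cost of SUB(A) over every candidate set A of k-mers. This is a sketch of the counting argument, using only the fact that A is a subset of the |Σ|^k possible k-mers:

```latex
\[
\sum_{A \subseteq \Sigma^k} O\bigl(|A|\,|\Sigma|\,|s|\bigr)
\;\le\; 2^{|\Sigma|^k} \cdot O\bigl(|\Sigma|^{k+1}\,|s|\bigr)
\;=\; O(|s|) \quad \text{for fixed } |\Sigma| \text{ and } k .
\]
```

The hidden constant is doubly exponential in k, so the bound is of theoretical rather than practical interest.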
49
The case of B fixed:
51
The case of B fixed: For B fixed, we can try all possible solutions. There are O(|s|^B) possible choices of positions to flip. We can try them all and count the number of k-mers. Since |Σ| <= |s|, the number of ways to flip them is bounded by |s|^B.
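The enumeration for fixed B can be sketched directly. This is an illustrative brute force, not the authors' code; the names `support` and `brute_force` are hypothetical, and only letters already present in s are tried, following the assumption |Σ| <= |s| made earlier in the talk.

```python
from itertools import combinations, product


def support(s: str, k: int) -> set[str]:
    """Set of distinct k-mers of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def brute_force(s: str, k: int, budget: int) -> tuple[int, str]:
    """Try every way of rewriting at most `budget` positions of s.

    Returns (fewest distinct k-mers found, one string achieving it).
    """
    alphabet = sorted(set(s))  # letters not in s are never needed
    best = (len(support(s, k)), s)
    for b in range(1, budget + 1):
        for positions in combinations(range(len(s)), b):
            for letters in product(alphabet, repeat=b):
                t = list(s)
                for p, c in zip(positions, letters):
                    t[p] = c
                cand = "".join(t)
                best = min(best, (len(support(cand, k)), cand))
    return best


print(brute_force("010010011", 3, 1))  # (3, '010010010')
```

The three nested loops visit on the order of |s|^B position sets times |Σ|^B letter choices, which is polynomial in |s| only because B is treated as a constant.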
52
4. NP-hardness
53
- Theorem: the problem is NP-hard even for k=2.
58
- Proof: reduction from COMPACT BIPARTITE SUBGRAPH (CBS)
INSTANCE: a bipartite graph G = (U, V; E). Integers n, m
PROBLEM: does there exist a set X of nodes such that (i) |X| <= n and (ii) X covers at least m edges?
Example: n = 6, m = 5
(Note that CBS is NP-hard because it includes MAX BALANCED BIP GRAPH, for n = 2t, m = t^2)
64
The reduction:
CBS: does there exist a set X such that (i) |X| <= n and (ii) X covers at least m edges?
[Figure: bipartite graph on nodes a, b, c, d, e, f, g, h, i, l]
Let Σ = {a, b, c, d, e, f, g, h, i, l} U { , } and B = |E| - m
IDEA: Encode an edge (i,j) as … i j … and make all k-mers x, x and unavoidable (i.e., insert a LOT of each of them in s)
The only k-mers that can be destroyed are of the form x or x, and this is achieved by "flipping" the into a . This corresponds to removing the edge: i j .. ... i j
The set of k-mers of type x or x which remain defines the set X, which covers at least m edges
S = a a ... a a b b ... b b ... l l l ... l a g b g e i
65
- With similar reductions, we can also prove the Theorem: the problem is NP-hard even for |Σ| = 2
66
5. Integer Linear Programming formulations
67
Exponential-size formulation (Σ = {0,1})
Let K be the set of all possible k-mers, e.g., for k = 3: K = {000, 001, 010, 011, 100, 101, 110, 111}
Define a 0/1 variable for each k-mer in K and a 0/1 variable for each position
68
Exponential-size formulation (Σ = {0,1})
min [objective and constraints missing]
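One model that fits the description "a 0/1 variable per k-mer and a 0/1 variable per position" is sketched below for Σ = {0,1}. It is an assumption made for illustration and may differ from the authors' formulation; the variable names x_i (the letter of s' at position i) and y_u (k-mer u occurs in s') and the helper δ are introduced here.

```latex
\begin{align*}
\min\ & \sum_{u \in K} y_u \\
\text{s.t.}\ & y_u \;\ge\; 1 - \sum_{j=1}^{k} \delta(u_j, x_{i+j-1})
  && \forall u \in K,\ i = 1,\dots,|s|-k+1 \\
& \sum_{i=1}^{|s|} \delta(s_i, x_i) \;\le\; B \\
& x_i \in \{0,1\},\quad y_u \in \{0,1\}
\end{align*}
where $\delta(c, x) = x$ if $c = 0$ and $\delta(c, x) = 1 - x$ if $c = 1$.
```

The first family of constraints forces y_u = 1 whenever some window of s' spells u, and the budget constraint bounds the Hamming distance to s. With 2^k k-mer variables and 2^k(|s| - k + 1) linking constraints, the size of this sketch is exponential in k.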
72
Exponential number of variables/constraints: pricing & separation problems
Our P&S strategy is exponential in the general case (polynomial for k fixed)
Fractional solutions may lead to effective heuristics (e.g., solving SUB(A) with
75
Polynomial-size formulation (Σ = {0,1})
In addition to variables, there are variables for positions i and j saying "is the k-mer starting at i identical to the k-mer starting at j?"
The variables w depend on s and z via linear constraints
To count only k-mers that have no identical k-mer following them: if all k-mers starting after i are different from the k-mer at i
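The counting rule behind this formulation can be illustrated outside the ILP: the number of distinct k-mers equals the number of start positions whose k-mer never starts again at a later position. The sketch below checks that identity on the running example; the function name and the `same` table (standing in for the pairwise "identical k-mers" indicators) are hypothetical.

```python
def count_by_last_occurrence(s: str, k: int) -> int:
    """Count positions i whose k-mer does not start again at any j > i."""
    n_pos = len(s) - k + 1
    kmer = [s[i:i + k] for i in range(n_pos)]
    # same[i][j] is True when the k-mers starting at i and j are identical.
    same = [[kmer[i] == kmer[j] for j in range(n_pos)] for i in range(n_pos)]
    return sum(
        all(not same[i][j] for j in range(i + 1, n_pos))
        for i in range(n_pos)
    )


print(count_by_last_occurrence("010010011", 3))  # 4
print(count_by_last_occurrence("010010010", 3))  # 3
```

Every distinct k-mer has exactly one last occurrence, so counting last occurrences needs only a quadratic number of pairwise indicators instead of one variable per possible k-mer.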
76
Polynomial-size formulation (Σ = {0,1})
- The formulation (not shown) is basically a big Boolean formula
- It yields a poor bound compared to the exponential formulation
- It can be improved in many ways (especially via valid cuts)
- At this stage it's not clear which method is best for instances that are not too small
- We'll run experiments with variants of both formulations