1
Parsimony population haplotyping
Giuseppe Lancia, University of Udine
Romeo Rizzi, Cristina Pinotti, University of Trento
2
Polymorphisms
A polymorphism is a feature
- common to everybody
- not identical in everybody
- the possible variants (alleles) are just a few
E.g. think of eye color
3
Polymorphisms
A polymorphism is a feature
- common to everybody
- not identical in everybody
- the possible variants (alleles) are just a few
E.g. think of eye color
Or blood type, for a feature not visible from outside
4
At the DNA level, a polymorphism is a sequence of nucleotides varying in a population.
5
Single Nucleotide Polymorphism (SNP)
At the DNA level, a polymorphism is a sequence of nucleotides varying in a population.
The shortest possible sequence has only 1 nucleotide, hence Single Nucleotide Polymorphism (SNP)
6
Single Nucleotide Polymorphism (SNP)
At the DNA level, a polymorphism is a sequence of nucleotides varying in a population.
The shortest possible sequence has only 1 nucleotide, hence Single Nucleotide Polymorphism (SNP)
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
7
Single Nucleotide Polymorphism (SNP)
At the DNA level, a polymorphism is a sequence of nucleotides varying in a population.
The shortest possible sequence has only 1 nucleotide, hence Single Nucleotide Polymorphism (SNP)
atcggcttagttagggcacaggacgtac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacgtac
atcggcttagttagggcacaggacgtac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacggac
8
HOMOZYGOUS: same allele on both chromosomes
atcggcttagttagggcacaggacgtac atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac atcggattagttagggcacaggacgtac atcggattagttagggcacaggacgtac atcggattagttagggcacaggacgtac atcggcttagttagggcacaggacgtac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac
10
HOMOZYGOUS: same allele on both chromosomes
HETEROZYGOUS: different alleles atcggcttagttagggcacaggacgtac atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac atcggattagttagggcacaggacgtac atcggattagttagggcacaggacgtac atcggattagttagggcacaggacgtac atcggcttagttagggcacaggacgtac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac
12
HAPLOTYPE: chromosome content at SNP sites
HOMOZYGOUS: same allele on both chromosomes HETEROZYGOUS: different alleles HAPLOTYPE: chromosome content at SNP sites atcggcttagttagggcacaggacgtac atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac atcggattagttagggcacaggacgtac atcggattagttagggcacaggacgtac atcggattagttagggcacaggacgtac atcggcttagttagggcacaggacgtac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac
13
HAPLOTYPE: chromosome content at SNP sites
HOMOZYGOUS: same allele on both chromosomes HETEROZYGOUS: different alleles HAPLOTYPE: chromosome content at SNP sites atcggcttagttagggcacaggacgtac atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac atcggattagttagggcacaggacgtac atcggattagttagggcacaggacgtac atcggattagttagggcacaggacgt atcggcttagttagggcacaggacgtac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac
14
HAPLOTYPE: chromosome content at SNP sites
HOMOZYGOUS: same allele on both chromosomes
HETEROZYGOUS: different alleles
HAPLOTYPE: chromosome content at SNP sites
ct ag cg at at at ct ag ag cg ag ag ag cg
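A minimal sketch of how the two-letter haplotypes on this slide can be read off the aligned chromosome sequences: find the columns where the sequences disagree (the SNP sites) and keep only those letters. The function names `snp_sites` and `haplotypes` are illustrative and not from the presentation; only the first four sequences of the slide are used.

```python
def snp_sites(sequences):
    """Return the 0-based columns where the aligned sequences disagree."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i in range(length) if len({s[i] for s in sequences}) > 1]


def haplotypes(sequences):
    """Restrict every sequence to the SNP columns."""
    sites = snp_sites(sequences)
    return ["".join(s[i] for i in sites) for s in sequences]


# First four chromosome sequences from the slide.
chromosomes = [
    "atcggcttagttagggcacaggacgtac",
    "atcggattagttagggcacaggacggac",
    "atcggcttagttagggcacaggacggac",
    "atcggattagttagggcacaggacgtac",
]

print(snp_sites(chromosomes))   # [5, 25]
print(haplotypes(chromosomes))  # ['ct', 'ag', 'cg', 'at']
```

The output matches the first four haplotypes listed on the slide (ct, ag, cg, at).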
15
HAPLOTYPE: chromosome content at SNP sites
HOMOZYGOUS: same allele on both chromosomes
HETEROZYGOUS: different alleles
HAPLOTYPE: chromosome content at SNP sites
GENOTYPE: “union” of 2 haplotypes
[Figure: the 14 haplotypes paired into 7 genotypes]
16
CHANGE OF SYMBOLS: each SNP takes only two values in a population.
Call them 0 and 1. Also, call 2 the fact that a site is heterozygous.
HAPLOTYPE: string over 0,1
GENOTYPE: string over 0,1,2
[Figure: the 14 haplotypes paired into 7 genotypes, still written with nucleotide letters]
17
CHANGE OF SYMBOLS: each SNP takes only two values in a population (bio).
Call them 0 and 1. Also, call 2 the fact that a site is heterozygous.
HAPLOTYPE: string over 0,1
GENOTYPE: string over 0,1,2
01 02 10 00 11 12 11 11 11 01 22 10 10 20 00 10 10 10 20 10 00
18
CHANGE OF SYMBOLS: each SNP takes only two values in a population (bio).
Call them 0 and 1. Also, call 2 the fact that a site is heterozygous.
RULES: 0 + 0 = 0, 1 + 1 = 1, 0 + 1 = 1 + 0 = 2
01 02 10 00 11 12 11 11 11 01 22 10 10 20 00 10 10 10 20 10 00
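The combination rule on this slide (two equal alleles give that allele, two different alleles give 2) can be sketched in a few lines. The function name `combine` is an illustrative choice, not something named in the presentation.

```python
def combine(h1, h2):
    """Genotype obtained from two haplotypes over {0, 1}.

    Site by site: equal alleles are kept (0+0=0, 1+1=1),
    different alleles become '2' (heterozygous site).
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same number of sites")
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


print(combine("01", "10"))  # 22
print(combine("00", "01"))  # 02
print(combine("11", "11"))  # 11
```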
19
The haplotype reconstruction problem (from genotypes – which are much cheaper to obtain than haplotypes)
20
The “biological” problem…
011 101 011 000 010 001 111 010 011
22
The “biological” problem…
011 000 011 101 010 001 111 010 011
23
The “biological” problem…
#*&$$# !!! 011 000 011 101 010 001 111 010 011
24
The “biological” problem…
#*&$$# !!! 011 000 011 101 010 001 010 101 010 101 111 010 011 010 111
25
The “biological” problem…
011 000 011 101 010 001 010 101 010 101 111 010 011 010 111
27
The “biological” problem…
011 000 011 101 010 001 111 010 101 010 101 010 011 010 111
28
The “biological” problem…
011 000 011 101 010 001 111 *$**$& !!! 010 101 *&X*# !!! 010 101 010 011 010 111
29
The “biological” problem…
011 000 011 101 010 001 111 011 111 *$**$& !!! 011 111 000 111 010 101 *&X*# !!! 010 101 010 011 010 111 010
30
The “biological” problem…
011 000 011 101 010 001 111 011 111 011 111 000 111 010 101 010 101 010 011 010 111 010
31
The “biological” problem…
011 000 010 001 111 011 111 011 101 011 111 010 101 000 111 010 101 010 011 010 111 010
32
The “biological” problem…
011 000 010 001 111 011 111 011 111 010 101 011 101 010 101 010 011 000 111 010 111 010
33
The “biological” problem…
011 000 010 001 111 011 111 011 111 010 101 011 101 010 101 010 011 010 111 000 111 010
34
The “biological” problem…
011 000 010 001 111 011 111 011 111 010 101 011 101 011 010 101 010 000 010 011 010 111 010 111 000 111 010
36
The “biological” problem…
011 000 010 001 111 011 111 010 011 011 111 010 101 011 010 011 101 011 011 000 010 101 010 000 010 011 010 111 010 111 000 111 010 000
37
We observe GENOTYPES
011 000 022 010 001 111 011 111 022 010 011 111 221 012 011 111 010 101 211 011 010 222 011 101 012 011 011 000 221 011 022 010 101 010 000 010 011 010 111 222 020 012 212 010 111 000 111 000 212 010 222 000 010
38
We observe GENOTYPES
022 022 111 221 012 211 222 012 221 011 022 222 020 012 212 212 222 000 010
39
PROBLEM: given input GENOTYPE data
INPUT: G = {11221, 22221, 11011, 21221, 00011}
40
PROBLEM: given input GENOTYPE data
INPUT: G = {11221, 22221, 11011, 21221, 00011}
OUTPUT: H = {11011, 11101, 00011, 01101}
Each genotype is explained by two haplotypes:
11221 = 11011 + 11101
22221 = 11101 + 00011
11011 = 11011 + 11011
21221 = 11011 + 01101
00011 = 00011 + 00011
OBJ: The cardinality of H is MINIMUM (Parsimony, aka Occam’s razor)
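As a sanity check on this example, here is a brute-force sketch (not the method of the presentation, which uses ILP) that verifies the slide's H explains every genotype in G and searches for a smallest explaining set by trying candidate subsets in increasing size. All function names are illustrative, and the search is exponential, so it is only meant for toy inputs like this one.

```python
from itertools import combinations, product


def combine(h1, h2):
    """Equal alleles are kept, different alleles become '2'."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


def expand(g):
    """All haplotypes compatible with genotype g (each '2' may be 0 or 1)."""
    options = [("0", "1") if c == "2" else (c,) for c in g]
    return {"".join(p) for p in product(*options)}


def explains(H, g):
    """True if some pair of haplotypes in H (possibly the same one twice) gives g."""
    return any(combine(a, b) == g for a in H for b in H)


def min_haplotype_set(G):
    """Smallest explaining set, found by exhaustive search over subsets."""
    candidates = sorted(set().union(*(expand(g) for g in G)))
    for size in range(1, len(candidates) + 1):
        for H in combinations(candidates, size):
            if all(explains(H, g) for g in G):
                return set(H)
    return set()  # only reached when G is empty


G = ["11221", "22221", "11011", "21221", "00011"]
H_slide = {"11011", "11101", "00011", "01101"}

print(all(explains(H_slide, g) for g in G))  # True
print(len(min_haplotype_set(G)))             # 4
```

The optimum need not be unique: the search may return a different set of four haplotypes than the one on the slide, but no set of three explains G.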
41
Other objectives for reconstruction:
- Clark’s inference rule (Gusfield, JCB 2001)
- Solution fits a perfect phylogeny (Eskin, Halperin, Karp, JBCB 2003) (Bafna, Gusfield, Lancia, Yooseph, JCB 2003)
42
MENU:
- Prove problem is difficult
- Give an exact algorithm (ILP)
- Give approximation algorithms
- Dessert
43
1. The problem is APX-Hard
Reduction from VERTEX-COVER on graphs G=(V,E) for which (thanks to a theorem by Nemhauser and Trotter, 1975)
44
B A C D E
45
A B C D E * B A C D E
46
A B C D E * AB BC AE DE AD B A C D E
47
A B C D E * AB BC AE DE AD A B C D E B A C D E
49
A B C D E * AB BC AE DE AD A 0 B C D E B A C D E
50
A B C D E * AB BC AE DE AD A B C D E B A C D E
52
A B C D E * AB BC AE DE AD A B C D E B A C D E
G = (V,E) has a node cover X of size k if and only if there is a set H of |V| + k haplotypes that explain all genotypes
54
A B C D E * AB BC AE DE AD A B C D E A’ B’ E’ B A C D E
G = (V,E) has a node cover X of size k if and only if there is a set H of |V| + k haplotypes that explain all genotypes
55
A B C D E * AB BC AE DE AD A B C D E A’ B’ E’ B A C D E
It can be shown that a (1 + ε)-approximation for Haplotyping would imply a (1 + 3ε)-approximation for Vertex Cover
56
2. An exact algorithm based on Integer Linear Programming
57
Expand your input G in all possible ways
220 022 120
58
Expand your input G in all possible ways
220 022 120 , ,
59
Expand your input G in all possible ways
220 → {000, 010, 100, 110}, 022 → {000, 001, 010, 011}, 120 → {100, 110}
This is the set of haplotypes obtained: H(G) = {000, 001, 010, 011, 100, 110}
60
Expand your input G in all possible ways
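220 → {000, 010, 100, 110}, 022 → {000, 001, 010, 011}, 120 → {100, 110}
This is the set of haplotypes obtained: H(G) = {000, 001, 010, 011, 100, 110}
Define a 0-1 variable for each haplotype h in H(G) and a 0-1 variable for each pair of haplotypes that explains a genotype g in G (the set PAIR(g))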
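The expansion step can be sketched as follows: every "2" in a genotype is replaced by 0 and by 1, and each resulting haplotype is matched with its complement on the heterozygous sites to form the pairs in PAIR(g). The function names `expand` and `pairs` are illustrative assumptions; the slides only name the sets H(G) and PAIR(g).

```python
from itertools import product


def expand(g):
    """All haplotypes compatible with genotype g (each '2' may be 0 or 1)."""
    options = [("0", "1") if c == "2" else (c,) for c in g]
    return {"".join(p) for p in product(*options)}


def pairs(g):
    """PAIR(g): unordered pairs of haplotypes whose combination is g."""
    result = set()
    for h1 in expand(g):
        # The partner agrees with h1 on homozygous sites and differs on '2' sites.
        h2 = "".join(
            c if gc != "2" else ("1" if c == "0" else "0")
            for c, gc in zip(h1, g)
        )
        result.add(tuple(sorted((h1, h2))))
    return result


G = ["220", "022", "120"]
H_G = sorted(set().union(*(expand(g) for g in G)))

print(H_G)                   # ['000', '001', '010', '011', '100', '110']
print(sorted(pairs("220")))  # [('000', '110'), ('010', '100')]
print(sorted(pairs("120")))  # [('100', '110')]
```

The printed H(G) agrees with the set shown on the slide.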
61
OBJ: min Expand your input G in all possible ways 220 022 120
, , This is the set of haplotypes obtained H(G) = {000, 001, 010, 011, 100, 110} Define a 01 variable h in H(G) and a 01 variable pair that explains a g in G (the set PAIR(g)) OBJ: min
62
Provided that: Expand your input G in all possible ways 220 022 120
, , This is the set of haplotypes obtained H(G) = {000, 001, 010, 011, 100, 110} Define a 01 variable h in H(G) and a 01 variable pair that explains a g in G (the set PAIR(g)) Provided that:
63
and that: Expand your input G in all possible ways 220 022 120
, , This is the set of haplotypes obtained H(G) = {000, 001, 010, 011, 100, 110} Define a 01 variable h in H(G) and a 01 variable pair that explains a g in G (the set PAIR(g)) and that:
64
The resulting Integer Program:
minimize
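The formulas on the ILP slides did not survive extraction. The following is a plausible reconstruction from the surrounding text (one 0-1 variable per haplotype, one per explaining pair, minimum number of haplotypes), not a verbatim copy of the slide. The symbols x_h and y_{h1,h2} are assumed names.

```latex
% Assumed notation: x_h = 1 iff haplotype h is selected,
% y_{h_1,h_2} = 1 iff the pair (h_1,h_2) is used to explain its genotype.
\begin{align*}
\min \quad & \sum_{h \in H(G)} x_h \\
\text{s.t.} \quad
  & \sum_{(h_1,h_2) \in \mathrm{PAIR}(g)} y_{h_1,h_2} \ge 1
      && \forall g \in G \\
  & y_{h_1,h_2} \le x_{h_1}, \qquad y_{h_1,h_2} \le x_{h_2}
      && \forall g \in G,\ \forall (h_1,h_2) \in \mathrm{PAIR}(g) \\
  & x_h \in \{0,1\}, \qquad y_{h_1,h_2} \in \{0,1\}
\end{align*}
```

The first constraint family corresponds to "Provided that" (every genotype is explained by some pair), the second to "and that" (a pair may be used only if both of its haplotypes are selected).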
65
- The ILP problem can be solved by Branch and Bound, within a time depending on many factors
- This IP model can have a lot of variables and constraints
- Some tricks can be used to reduce the number of variables/constraints (do not define variables for haplotypes that apply only to one pair)
66
- The ILP problem can be solved by Branch and Bound, within a time depending on many factors
- This IP model can have a lot of variables and constraints
- Some tricks can be used to reduce the number of variables/constraints (do not define variables for haplotypes that apply only to one pair)
- Simulator by R. Hudson (coalescent theory) used to simulate haplotypes (with level of recombination r = 0, 4, 16, 40)
- 50 individuals, 10 and 30 SNP sites (using ILOG CPLEX)
- Compared with PHASE
- At levels r <= 16, results are the same as PHASE (correctness depended on r; at r = 0 both are % correct)
- For 50 individuals, 10 sites and r = 40, correctness in 75-95%
67
- The ILP problem can be solved by Branch and Bound, within a time depending on many factors
- This IP model can have a lot of variables and constraints
- Some tricks can be used to reduce the number of variables/constraints (do not define variables for haplotypes that apply only to one pair)
- 15 instances on 30 sites, r = 0: size of ILP very variable, from 300 vars (0.03 secs) to 135,000 vars (2.5 mins) to 10^6 vars (no optimal found within 30 mins)
- Most had 10,000 vars, solved in under 2 mins
- Solved 13/15, accuracy 80-96%. PHASE took much more time to achieve no better accuracy.
- REDUCTION VARS: 50 indiv, 30 SNPs, r=4, vars: 28,580 “ “ r= “ ,352 129,812
- Increasing r makes the problem simpler (but the model less accurate)
68
3. An approximation algorithm based on Integer Linear Programming and rounding
69
LINEAR PROGRAMMING RELAXATION:
OPT := min
70
LINEAR PROGRAMMING RELAXATION:
LP := min
71
LINEAR PROGRAMMING RELAXATION:
LP := min Clearly, LP <= OPT
72
LP := min LP ROUNDING TO INTEGER:
Assume each genotype has at most k sites “2”
73
LP := min LP ROUNDING TO INTEGER:
Assume each genotype has at most k sites “2”
Then, each g gives rise to at most 2^k haplotypes, and at most 2^(k-1) pairs (for k >= 1)
74
LP := min LP ROUNDING TO INTEGER:
Assume each genotype has at most k sites “2”
Then, each g gives rise to at most 2^k haplotypes, and at most 2^(k-1) pairs (for k >= 1)
The above LP can be solved in POLYNOMIAL TIME
75
LP := min LP ROUNDING TO INTEGER:
Let x* be the optimal (possibly fractional) LP-solution
76
LP := min LP ROUNDING TO INTEGER:
Let x* be the optimal (possibly fractional) LP-solution For each h in H(G), take h in solution S iff
77
LP := min LP ROUNDING TO INTEGER:
Let x* be the optimal (possibly fractional) LP-solution For each h in H(G), take h in solution S iff |S| <= 2^(k-1) LP <= 2^(k-1) OPT
78
LP := min LP ROUNDING TO INTEGER:
Solution is feasible, since, for each g, And hence at least one of
79
LP := min LP ROUNDING TO INTEGER:
Solution is feasible, since, for each g, And hence at least one of This implies also and
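The formulas of the rounding argument were lost in extraction. The derivation below is a reconstruction that is consistent with the surviving bound |S| <= 2^(k-1) LP <= 2^(k-1) OPT. The rounding threshold and the symbols x*_h, y*_p are assumptions, with p = (h1, h2) ranging over the pairs in PAIR(g) and k >= 1.

```latex
% Assumed notation: (x^*, y^*) is an optimal solution of the LP relaxation.
\begin{align*}
S &= \{\, h \in H(G) : x^*_h \ge 2^{-(k-1)} \,\} \\
\sum_{p \in \mathrm{PAIR}(g)} y^*_p \ge 1,\quad |\mathrm{PAIR}(g)| \le 2^{k-1}
  &\;\Longrightarrow\; \exists\, p = (h_1,h_2) \in \mathrm{PAIR}(g):\ y^*_p \ge 2^{-(k-1)} \\
x^*_{h_1} \ge y^*_p,\quad x^*_{h_2} \ge y^*_p
  &\;\Longrightarrow\; h_1 \in S \ \text{and}\ h_2 \in S \\
|S| \;\le\; \sum_{h \in S} 2^{k-1}\, x^*_h &\;\le\; 2^{k-1}\,\mathrm{LP} \;\le\; 2^{k-1}\,\mathrm{OPT}
\end{align*}
```

The second and third lines give feasibility (every genotype g has an explaining pair inside S), and the last line gives the approximation ratio.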
80
Summarizing: there is a 2^(k-1)-approximate algorithm for the case in which each genotype has at most k heterozygous sites.
We also have a probabilistic, 2^(k+2)-approximate algorithm which does not use Linear Programming.
81
TO DO
- Better exact algorithms (e.g. Combinatorial Branch and Bound)
- Better approximation algorithm (not depending on k, or with better dependence on k. BTW, any greedy algorithm is a approximation for n genotypes)
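The approximation factor of the greedy remark is missing from the extracted text. The counting argument below is a hedged sketch of why a bound of order sqrt(n) is plausible. It assumes the n genotypes are distinct and that the greedy solution uses at most two haplotypes per genotype. It is not taken from the slide.

```latex
% q haplotypes form at most q(q+1)/2 unordered pairs (a haplotype may be paired
% with itself), hence they explain at most q(q+1)/2 distinct genotypes.
\begin{align*}
\frac{\mathrm{OPT}\,(\mathrm{OPT}+1)}{2} \ge n
  &\;\Longrightarrow\; \mathrm{OPT} \ge \frac{\sqrt{8n+1}-1}{2} \\
|H_{\mathrm{greedy}}| \le 2n
  &\;\Longrightarrow\; \frac{|H_{\mathrm{greedy}}|}{\mathrm{OPT}}
     \le \frac{4n}{\sqrt{8n+1}-1} = \frac{\sqrt{8n+1}+1}{2} = O(\sqrt{n})
\end{align*}
```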
82
BYE, EVERYBODY!