Parsimony population haplotyping


1 Parsimony population haplotyping
Giuseppe Lancia (University of Udine); Romeo Rizzi, Cristina Pinotti (University of Trento)

2-3 Polymorphisms
A polymorphism is a feature
- common to everybody
- not identical in everybody
- whose possible variants (alleles) are just a few
E.g. think of eye color, or of blood type for a feature not visible from outside.

4-7 Single Nucleotide Polymorphism (SNP)
At the DNA level, a polymorphism is a sequence of nucleotides varying in a population. The shortest possible sequence has only 1 nucleotide, hence Single Nucleotide Polymorphism (SNP).
atcggcttagttagggcacaggacgtac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacgtac
atcggcttagttagggcacaggacgtac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacggac

8-13 HOMOZYGOUS: same allele on both chromosomes
HETEROZYGOUS: different alleles
HAPLOTYPE: chromosome content at SNP sites
[Figure: the 14 sequences from the SNP slide, repeated on each build]

14 HAPLOTYPE: chromosome content at SNP sites
ct ag cg at at at ct ag ag cg ag ag ag cg
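The step from full sequences to two-letter haplotypes can be sketched in a few lines. This is only an illustration, assuming the chromosomes are already aligned and of equal length; the function names `snp_sites` and `haplotypes` are hypothetical, and only the first four sequences from the slides are used.

```python
def snp_sites(sequences):
    """Return the column indices where the aligned sequences are not all identical."""
    length = len(sequences[0])
    assert all(len(s) == length for s in sequences), "sequences must be aligned"
    return [i for i in range(length) if len({s[i] for s in sequences}) > 1]


def haplotypes(sequences):
    """Project every sequence onto its SNP columns."""
    sites = snp_sites(sequences)
    return ["".join(s[i] for i in sites) for s in sequences]


# First four chromosomes shown on the slides.
chromosomes = [
    "atcggcttagttagggcacaggacgtac",
    "atcggattagttagggcacaggacggac",
    "atcggcttagttagggcacaggacggac",
    "atcggattagttagggcacaggacgtac",
]

print(snp_sites(chromosomes))   # [5, 25]
print(haplotypes(chromosomes))  # ['ct', 'ag', 'cg', 'at']
```

The output matches the first four entries of the haplotype list on slide 14.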

15 GENOTYPE: “union” of 2 haplotypes
ct OcE ag cg at OaE at OaOt at ct EE ag ag EOg cg ag ag OaOg OgE ag cg

16-18 CHANGE OF SYMBOLS: each SNP takes only two values in a population (bio).
Call them 0 and 1. Also, call 2 the fact that a site is heterozygous.
HAPLOTYPE: string over 0,1
GENOTYPE: string over 0,1,2
RULES: 0 + 0 = 0, 1 + 1 = 1, 0 + 1 = 1 + 0 = 2
Old symbols: ct OcE ag cg at OaE at OaOt at ct EE ag ag EOg cg ag ag OaOg OgE ag cg
New symbols: 01 02 10 00 11 12 11 11 11 01 22 10 10 20 00 10 10 10 20 10 00
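The combination rule (equal alleles keep their value, different alleles give 2) can be written as a small function. This is a minimal sketch; the name `genotype` is a hypothetical helper, not something from the talk.

```python
def genotype(h1, h2):
    """Combine two 0/1 haplotypes into a 0/1/2 genotype, site by site."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    # Same allele on both chromosomes: keep it. Different alleles: heterozygous, coded 2.
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


print(genotype("01", "00"))  # 02
print(genotype("11", "10"))  # 12
print(genotype("01", "10"))  # 22
```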

19 The haplotype reconstruction problem (from genotypes – which are much cheaper to obtain than haplotypes)

20-36 The “biological” problem…
[Animation over 17 slides: a population of 3-site haplotypes (011, 101, 000, 010, 001, 111) is rearranged and extended frame by frame; the layout is not recoverable from the transcript. Final frame:]
011 000 010 001 111 011 111 010 011 011 111 010 101 011 010 011 101 011 011 000 010 101 010 000 010 011 010 111 010 111 000 111 010 000

37 We observe GENOTYPES 011 000 022 010 001 111 011 111 022 010 011 111 221 012 011 111 010 101 211 011 010 222 011 101 012 011 011 000 221 011 022 010 101 010 000 010 011 010 111 222 020 012 212 010 111 000 111 000 212 010 222 000 010

38 We observe GENOTYPES 022 022 111 221 012 211 222 012 221 011 022 222 020 012 212 212 222 000 010

39-40 PROBLEM: given input GENOTYPE data
INPUT: G = { 11221, 22221, 11011, 21221, 00011}
OUTPUT: H = { 11011, 11101, 00011, 01101}
Each genotype is explained by two haplotypes:
21221 = 11011 + 01101
11221 = 11011 + 11101
11011 = 11011 + 11011
22221 = 00011 + 11101
00011 = 00011 + 00011
OBJ: The cardinality of H is MINIMUM (Parsimony, aka Occam’s razor)
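For an instance this small, the parsimony objective can be checked by exhaustive search. The sketch below is not the method of the talk: it simply tries haplotype subsets in increasing size, which is only viable for toy inputs. All function names are hypothetical.

```python
from itertools import combinations, product


def expand(g):
    """All haplotypes compatible with genotype g (each '2' may be 0 or 1)."""
    options = [("0", "1") if c == "2" else (c,) for c in g]
    return ["".join(p) for p in product(*options)]


def explains(h1, h2, g):
    """True if the pair (h1, h2) combines into genotype g."""
    return all((a != b) if c == "2" else (a == b == c) for a, b, c in zip(h1, h2, g))


def feasible(haps, genotypes):
    """True if every genotype is explained by some pair (possibly h, h) from haps."""
    return all(
        any(explains(h1, h2, g) for h1 in haps for h2 in haps) for g in genotypes
    )


def min_haplotype_set(genotypes):
    """Smallest feasible haplotype set, found by brute force over subsets."""
    candidates = sorted({h for g in genotypes for h in expand(g)})
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            if feasible(subset, genotypes):
                return set(subset)
    return set()


G = ["11221", "22221", "11011", "21221", "00011"]
H = min_haplotype_set(G)
print(len(H))  # 4
```

The search confirms that 4 haplotypes are needed for this input. More than one optimal set can exist, so the set returned here need not be the H shown on the slide.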

41 Other objectives for reconstruction:
- Clark’s inference rule (Gusfield, JCB 2001)
- Solution fits a perfect phylogeny (Eskin, Halperin, Karp, JBCB 2003) (Bafna, Gusfield, Lancia, Yooseph, JCB 2003)

42 MENU:
1. Prove problem is difficult
2. Give an exact algorithm (ILP)
3. Give approximation algorithms
4. Dessert

43 1. The problem is APX-Hard
Reduction from VERTEX-COVER on graphs G=(V,E) for which (thanks to a theorem by Nemhauser and Trotter, 1975)

44-51 [Figure, built up over eight slides: an example graph on vertices A, B, C, D, E, next to a matrix with columns A, B, C, D, E, * and rows labelled AB, BC, AE, DE, AD and A, B, C, D, E. The matrix entries are not recoverable from the transcript.]

52-54 [Figure: reduction example on the graph with vertices A, B, C, D, E; slide 54 adds the labels A’, B’, E’.]
G = (V,E) has a node cover X of size k ⇔ there is a set H of |V| + k haplotypes that explain all genotypes

55 It can be shown that a (1 + ε)-approximation for Haplotyping would imply a (1 + 3ε)-approximation for Vertex Cover

56 2. An exact algorithm based on
Integer Linear Programming

57-63 Expand your input G in all possible ways
G = {220, 022, 120}
This is the set of haplotypes obtained: H(G) = {000, 001, 010, 011, 100, 110}
Define a 0-1 variable for each h in H(G) and a 0-1 variable for each pair that explains a g in G (the set PAIR(g))
OBJ: min
Provided that:
and that:
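The expansion step can be sketched as follows. The functions `expand` and `pairs` are hypothetical names; `pairs(g)` plays the role of PAIR(g), under the assumption that a pair is unordered and consists of a haplotype and its complement on the heterozygous sites.

```python
from itertools import product


def expand(g):
    """All haplotypes compatible with genotype g (each '2' may be 0 or 1)."""
    options = [("0", "1") if c == "2" else (c,) for c in g]
    return ["".join(p) for p in product(*options)]


def pairs(g):
    """Unordered haplotype pairs explaining g: each haplotype with its mate."""
    result = set()
    for h in expand(g):
        # The mate agrees with h on homozygous sites and is flipped on '2' sites.
        mate = "".join(
            ("1" if a == "0" else "0") if c == "2" else a for a, c in zip(h, g)
        )
        result.add(tuple(sorted((h, mate))))
    return sorted(result)


G = ["220", "022", "120"]
H_G = sorted({h for g in G for h in expand(g)})
print(H_G)  # ['000', '001', '010', '011', '100', '110']
for g in G:
    print(g, pairs(g))
```

A genotype with j heterozygous sites expands to 2^j haplotypes and, for j >= 1, to 2^(j-1) pairs, which is where the size of the model comes from.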

64 The resulting Integer Program:
minimize
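The formulas on these slides did not survive in the transcript. A plausible reconstruction, consistent with the variables described above and with the standard pure-parsimony ILP, is given below. The symbols x_h (haplotype h is selected) and y_{h,h'} (pair (h,h') is used) are assumed names, not taken from the slides.

```latex
\begin{align*}
\min\ & \sum_{h \in H(G)} x_h \\
\text{s.t.}\ & \sum_{(h,h') \in \mathrm{PAIR}(g)} y_{h,h'} \ge 1
   && \forall g \in G \\
& y_{h,h'} \le x_h, \quad y_{h,h'} \le x_{h'}
   && \forall g \in G,\ \forall (h,h') \in \mathrm{PAIR}(g) \\
& x_h \in \{0,1\}, \quad y_{h,h'} \in \{0,1\}
\end{align*}
```

The first constraint family says every genotype must be explained by at least one pair; the second says a pair can be used only if both of its haplotypes are selected.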

65-67 The ILP problem can be solved by Branch and Bound, within a time depending on many factors
- This IP model can have a lot of variables and constraints
- Some tricks can be used to reduce the number of variables/constraints (do not define variables for haplotypes that appear in only one pair)

Experiments:
- Simulator by R. Hudson (coalescent theory) to simulate haplotypes (with level of recombination r = 0, 4, 16, 40)
- 50 individuals, 10 and 30 SNP sites (using ILOG CPLEX)
- Compared with PHASE
- At levels r <= 16, results same as PHASE (correctness depended on r; at r = 0 both are % correct)
- For 50 individuals, 10 sites and r = 40, correctness in 75-95%
- 15 instances on 30 sites, r = 0, size of ILP very variable: from 300 vars (0.03 secs) to 135,000 vars (2.5 mins) to 10^6 vars (no optimal found within 30 mins)
- Most had 10,000 vars, solved in under 2 mins
- Solved 13/15, accuracy 80-96%. PHASE took much more time to achieve no better accuracy.
- REDUCTION VARS: 50 indiv, 30 SNPs, r=4, vars: 28,580 → “ “ r= “ ,352 → 129,812
- Increasing r makes the problem simpler (but the model less accurate)

68 3. An approximation algorithm based
on Integer Linear Programming and rounding

69-71 LINEAR PROGRAMMING RELAXATION:
OPT := min
LP := min
Clearly, LP <= OPT

72-74 LP ROUNDING TO INTEGER:
Assume each genotype has at most k sites “2”
Then, each g gives rise to haplotypes, and
The above LP can be solved in POLYNOMIAL TIME

75-77 LP ROUNDING TO INTEGER:
Let x* be the optimal (possibly fractional) LP-solution
For each h in H(G), take h in solution S iff
|S| <= 2^(k-1) LP <= 2^(k-1) OPT
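The rounding threshold is missing from the transcript. The bound |S| <= 2^(k-1) LP suggests the rule "take h iff x*_h >= 1/2^(k-1)", and the sketch below assumes exactly that. The fractional point `x_star` is a hand-picked feasible point for G = {220, 022, 120} with k = 2, used only for illustration; it is not the output of an LP solver.

```python
from fractions import Fraction


def round_lp(x_star, k):
    """Keep every haplotype whose fractional value is at least 1 / 2^(k-1).

    Assumed rounding rule, inferred from the 2^(k-1) bound on the slide.
    """
    threshold = Fraction(1, 2 ** (k - 1))
    return {h for h, value in x_star.items() if value >= threshold}


# Hypothetical fractional values for H(G) = {000, 001, 010, 011, 100, 110}.
x_star = {
    "000": Fraction(1, 2),
    "001": Fraction(1, 2),
    "010": Fraction(1, 2),
    "011": Fraction(1, 2),
    "100": Fraction(1),
    "110": Fraction(1),
}

k = 2
S = round_lp(x_star, k)
lp_value = sum(x_star.values())
print(sorted(S), lp_value)
# Every selected h has x*_h >= 1/2^(k-1), so |S| <= 2^(k-1) * sum of x*.
print(len(S) <= 2 ** (k - 1) * lp_value)  # True
```

Exact fractions are used instead of floats so that the comparison with the threshold is not affected by rounding error.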

78-79 LP ROUNDING TO INTEGER:
Solution is feasible, since, for each g,
And hence at least one of
This implies also
and
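The formulas of the feasibility argument are missing from the transcript. One way to fill in the chain, assuming the pair variables are called y_{h,h'}, are bounded by x_h and x_{h'}, and that the rounding threshold is 1/2^(k-1) (all of which are assumptions, not recovered from the slides), is:

```latex
\begin{align*}
&\sum_{(h,h') \in \mathrm{PAIR}(g)} y^*_{h,h'} \ge 1
  \quad\text{and}\quad |\mathrm{PAIR}(g)| \le 2^{k-1} \\
&\Longrightarrow\ \exists\, (h,h') \in \mathrm{PAIR}(g) :\
  y^*_{h,h'} \ge \frac{1}{2^{k-1}} \\
&\Longrightarrow\ x^*_h \ge \frac{1}{2^{k-1}}
  \ \text{ and } \ x^*_{h'} \ge \frac{1}{2^{k-1}}
  \ \Longrightarrow\ h, h' \in S \\[4pt]
&|S| \;\le\; 2^{k-1} \sum_{h \in H(G)} x^*_h
  \;=\; 2^{k-1}\,\mathrm{LP} \;\le\; 2^{k-1}\,\mathrm{OPT}
\end{align*}
```

The first three lines show that every genotype keeps at least one complete pair after rounding; the last line holds because each haplotype in S contributes at least 1/2^(k-1) to the LP objective.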

80 Summarizing: there is a 2^(k-1)-approximation algorithm
for the case in which each genotype has at most k heterozygous sites. We also have a probabilistic 2^(k+2)-approximation algorithm which does not use Linear Programming.

81 TO DO
- Better exact algorithms (e.g. Combinatorial Branch and Bound)
- Better approximation algorithms (not depending on k, or with better dependence on k. BTW, any greedy algorithm is a approximation for n genotypes)

82 BYE, EVERYBODY!

