
1 L6: Haplotype phasing

2 Genotypes and Haplotypes  Each individual has two “copies” of each chromosome. At each site, each chromosome has one of two alleles. Current genotyping technology doesn’t give phase.
Haplotype 1: 0 1 1 1 0 0 1 1 0
Haplotype 2: 1 1 0 1 0 0 1 0 0
Genotype for the individual: 2 1 2 1 0 0 1 2 0
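A minimal sketch of how the genotype row on this slide follows from the two haplotype rows. The encoding is read off the slide's own numbers (0 or 1 where the two copies agree, 2 where they differ); the function name `genotype` is only illustrative.

```python
def genotype(h1, h2):
    """Combine two haplotypes into an unphased genotype.

    Encoding inferred from the slide's example: the shared allele (0 or 1)
    where both copies agree, 2 where they differ (heterozygous site).
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same sites")
    return [a if a == b else 2 for a, b in zip(h1, h2)]


h1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
h2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
print(genotype(h1, h2))  # [2, 1, 2, 1, 0, 0, 1, 2, 0]

# Phase is lost: two different haplotype pairs give the same genotype.
print(genotype([0, 1], [1, 0]) == genotype([0, 0], [1, 1]))  # True
```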

3 Haplotype Phasing  Haplotype phasing is the resolution of a genotype into its two haplotypes.  Haplotypes increase the power of an association between marker loci and phenotypic traits.  Current approaches to haplotyping: via technological innovations (expensive); statistical methods (ML, Phase, PL); a combinatorial approach to the phasing problem, which is efficient and has provable solution quality but is not completely generalizable (as yet).

4 Clark’s idea  Using the HWE principle, infer phase using homozygous sites.  Not described as an algorithm, but as a methodology to infer phase. 0 1 1 1 0 0 1 1 0 1 1 0 2 0 0 2 0 0 2 1 2 0 0 0 0 0 0
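The slide stresses that Clark's idea is a methodology rather than a fixed algorithm. The sketch below is one common way to turn it into a procedure, under stated assumptions: genotypes with at most one heterozygous site seed the pool of known haplotypes, and an ambiguous genotype is resolved by the first known haplotype that is compatible with it. The names `resolve_with` and `clark`, and the first-match rule, are illustrative choices and not taken from the slides.

```python
def resolve_with(known, genotype):
    """Return the partner haplotype if `known` can explain `genotype`, else None."""
    partner = []
    for h, g in zip(known, genotype):
        if g == 2:            # heterozygous: the partner carries the other allele
            partner.append(1 - h)
        elif g == h:          # homozygous and consistent with the known haplotype
            partner.append(h)
        else:                 # homozygous but inconsistent
            return None
    return tuple(partner)


def clark(genotypes):
    """Greedy phasing: returns {genotype index: (haplotype, haplotype)}."""
    known, phased = [], {}
    # Seed: genotypes with at most one heterozygous site are unambiguous.
    for i, g in enumerate(genotypes):
        if list(g).count(2) <= 1:
            a = tuple(0 if x == 2 else x for x in g)
            b = tuple(1 if x == 2 else x for x in g)
            phased[i] = (a, b)
            for h in (a, b):
                if h not in known:
                    known.append(h)
    # Repeatedly resolve ambiguous genotypes against the known pool.
    progress = True
    while progress:
        progress = False
        for i, g in enumerate(genotypes):
            if i in phased:
                continue
            for h in list(known):
                partner = resolve_with(h, g)
                if partner is not None:
                    phased[i] = (h, partner)
                    if partner not in known:
                        known.append(partner)
                    progress = True
                    break
    return phased
```

Genotypes that no known haplotype can explain are simply left out of the result, and the outcome can depend on the order in which genotypes and haplotypes are tried.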

5 Maximum likelihood estimation of phase  Input: Genotypes 1…m with counts n_1, n_2, …  Output: Haplotype frequencies (also individual haplotype assignments).  Define (unknown) genotype probabilities P_1, P_2, P_3, …  Likelihood function (based on genotype probabilities)

6 Genotypes and Haplotypes  Let c_j be the number of haplotype pairings that give genotype j; then P_j is the sum of Pr(h_k, h_l) over those c_j pairings.  Use HWE to compute Pr(h_k, h_l).

7 Likelihood using haplotype frequencies
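A sketch of the likelihood these slides refer to, assuming the usual multinomial form for genotype counts and Hardy-Weinberg proportions for haplotype pairs. The symbols p_k (frequency of haplotype h_k) and the index i over the c_j pairings of genotype j are notation introduced here, not taken from the slides.

```latex
P_j = \sum_{i=1}^{c_j} \Pr(h_{k_i}, h_{l_i}),
\qquad
\Pr(h_k, h_l) =
\begin{cases}
  p_k^2      & \text{if } k = l, \\
  2 p_k p_l  & \text{if } k \neq l,
\end{cases}

L(p_1, p_2, \dots) \;\propto\; \prod_{j=1}^{m} P_j^{\,n_j}
 \;=\; \prod_{j=1}^{m} \Bigl( \sum_{i=1}^{c_j} \Pr(h_{k_i}, h_{l_i}) \Bigr)^{n_j}
```

The proportionality constant is the multinomial coefficient in the counts n_1, …, n_m, which does not depend on the haplotype frequencies.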

8 The Expectation Step  Q: Given haplotype frequencies, what are the paired haplotype frequencies?  A: Initially  Subsequently, (gth iteration)

9 The M Step  The indicator for pair i and haplotype t is 0, 1, or 2 (# of times haplotype t occurs in paired haplotype i).
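The E and M steps can be sketched together as follows. This is a minimal illustration under stated assumptions, not the exact update from the slides: haplotype pairs are weighted by Hardy-Weinberg proportions, the starting frequencies are uniform over every haplotype that can explain some genotype, and the loop runs a fixed number of iterations. The names `explaining_pairs` and `em_haplotype_frequencies` are hypothetical.

```python
from collections import Counter
from itertools import product


def explaining_pairs(genotype):
    """All unordered haplotype pairs consistent with a genotype (2 = heterozygous)."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        a, b = list(genotype), list(genotype)
        for i, bit in zip(het, bits):
            a[i], b[i] = bit, 1 - bit
        pairs.add(tuple(sorted((tuple(a), tuple(b)))))
    return sorted(pairs)


def em_haplotype_frequencies(genotypes, iterations=50):
    counts = Counter(tuple(g) for g in genotypes)
    n = sum(counts.values())
    pairs = {g: explaining_pairs(g) for g in counts}
    haps = sorted({h for ps in pairs.values() for p in ps for h in p})
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform start (an assumption)
    for _ in range(iterations):
        expected = dict.fromkeys(haps, 0.0)
        for g, n_g in counts.items():
            # E step: weight of each explaining pair under HWE, normalised per genotype.
            w = [freq[a] ** 2 if a == b else 2 * freq[a] * freq[b] for a, b in pairs[g]]
            total = sum(w)
            for (a, b), w_ab in zip(pairs[g], w):
                share = n_g * w_ab / total
                expected[a] += share   # a homozygous pair adds its share twice
                expected[b] += share
        # M step: n individuals carry 2n haplotypes in total.
        freq = {h: expected[h] / (2 * n) for h in haps}
    return freq


freq = em_haplotype_frequencies([[0, 0], [0, 0], [1, 1], [1, 1], [2, 2]])
print({h: round(p, 3) for h, p in freq.items()})
```

In this toy input the doubly heterozygous individual is pulled toward the pair (0 0, 1 1) because both haplotypes are already common among the homozygotes.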

10 Bayesian approach to phasing  Idea: Small variants of common haplotypes should also be considered common even though they have low frequency

11 Phase

12 Phase  As described, each haplotype arises from the prior set only through mutations. Recombination is not considered  In subsequent versions, recombination is explicitly considered in the equation

13 Phase results  Phase versus EM versus Clark  Error rate: Proportion of individuals incorrectly predicted

14 Combinatorial Approach to Haplotyping

15 The Perfect Phylogeny Model  We assume that the evolution of extant haplotypes can be displayed on a rooted, directed tree, with the all-0 haplotype at the root, where each site changes from 0 to 1 on exactly one edge, and each extant haplotype is created by accumulating the changes on a path from the root to a leaf, where that haplotype is displayed.  In other words, the extant haplotypes evolved along a perfect phylogeny with all-0 root.
[Figure: tree rooted at 00000 with edges labeled by sites 1 to 5. Extant haplotypes (sites 1 2 3 4 5): 10100, 10000, 01011, 00010, 01010]

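16 Haplotyping via Perfect Phylogeny  PPH: Given a set of genotypes, find an explaining set of haplotypes that fits a perfect phylogeny.
Genotypes (sites 1 2): a = 2 2, b = 0 2, c = 1 0
Haplotypes (sites 1 2): a = 1 0 and 0 1, b = 0 0 and 0 1, c = 1 0 and 1 0
[Figure: tree with edges labeled by sites 1 and 2; leaves 1 0 (a, c, c), 0 1 (a, b), 0 0 (b)]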

17 The Alternative Explanation
Genotypes (sites 1 2): a = 2 2, b = 0 2, c = 1 0
Haplotypes (sites 1 2): a = 1 1 and 0 0, b = 0 0 and 0 1, c = 1 0 and 1 0
No tree possible for this explanation.

18 The 4 Gamete Test for Perfect Phylogeny  Arrange the haplotypes in a matrix, two haplotypes for each individual.  Then (with no duplicate columns), the haplotypes fit a unique perfect phylogeny if and only if no two columns contain all four pairs (Buneman): 0,0 and 0,1 and 1,0 and 1,1.
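A minimal sketch of the 4 gamete test as stated on this slide, applied to the two explanations of genotypes a, b, c and to the extant haplotypes of the perfect phylogeny slide. The function name is illustrative, and the pairwise scan shown here is the straightforward quadratic check, not an optimised one.

```python
from itertools import combinations


def passes_four_gamete_test(haplotypes):
    """True if no pair of columns shows all four pairs 00, 01, 10, 11."""
    n_sites = len(haplotypes[0])
    for i, j in combinations(range(n_sites), 2):
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:
            return False
    return True


# Two haplotypes per individual a, b, c (values copied from the slides).
tree_explanation = [(1, 0), (0, 1), (0, 0), (0, 1), (1, 0), (1, 0)]
alternative = [(1, 1), (0, 0), (0, 0), (0, 1), (1, 0), (1, 0)]
extant = ["10100", "10000", "01011", "00010", "01010"]

print(passes_four_gamete_test(tree_explanation))  # True
print(passes_four_gamete_test(alternative))       # False
print(passes_four_gamete_test(extant))            # True
```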

19 The Alternative Explanation
Genotypes (sites 1 2): a = 2 2, b = 0 2, c = 1 0
Haplotypes (sites 1 2): a = 1 1 and 0 0, b = 0 0 and 0 1, c = 1 0 and 1 0
No tree possible for this explanation.

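20 The Tree Explanation Again
Genotypes (sites 1 2): a = 2 2, b = 0 2, c = 1 0
Haplotypes (sites 1 2): a = 1 0 and 0 1, b = 0 0 and 0 1, c = 1 0 and 1 0
[Figure: tree with edges labeled by sites 1 and 2; leaves for a, b, c]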

21 The Combinatorial Problem  Input: A ternary matrix (0,1,2) M with N rows.  Output: A binary matrix M’ created from M by expanding each row into two rows, with each 2 replaced by a 0 in one row and a 1 in the other, such that M’ passes the 4 gamete test.  Gusfield (RECOMB 2002) proposed a solution which used a reduction to matroids.  We present a (slightly inefficient) solution using elementary techniques.  Independently by (Eskin, Halperin, Karp ’02).
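To make the input and output concrete, here is a brute-force sketch that enumerates every expansion of M and keeps those passing the 4 gamete test. It is exponential in the number of 2s and is not the algorithm developed in these slides; it only illustrates the problem statement on tiny inputs. All function names are hypothetical.

```python
from itertools import combinations, product


def expansions(row):
    """Yield each unordered haplotype pair that explains one genotype row."""
    het = [i for i, g in enumerate(row) if g == 2]
    # Fix the first heterozygous site to (0, 1) so each pair appears only once.
    for bits in product((0, 1), repeat=max(len(het) - 1, 0)):
        a, b = list(row), list(row)
        for i, bit in zip(het, (0,) + bits):
            a[i], b[i] = bit, 1 - bit
        yield tuple(a), tuple(b)


def four_gamete_ok(haplotypes):
    n_sites = len(haplotypes[0])
    return all(
        len({(h[i], h[j]) for h in haplotypes}) < 4
        for i, j in combinations(range(n_sites), 2)
    )


def pph_solutions(M):
    """All expansions M' of M (one haplotype pair per row) passing the 4 gamete test."""
    solutions = []
    for choice in product(*(list(expansions(r)) for r in M)):
        haplotypes = [h for pair in choice for h in pair]
        if four_gamete_ok(haplotypes):
            solutions.append(choice)
    return solutions


# Genotypes a, b, c from the earlier slides: only the tree explanation survives.
for solution in pph_solutions([[2, 2], [0, 2], [1, 0]]):
    print(solution)
```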

22 Initial Observations  Forced Expansions:  EX 1: If two columns (sites) of M contain the rows 2 0 and 0 2, then M’ will contain a row with 1 0 and a row with 0 1 in those columns.  EX 2: Similarly, if two columns of M contain the rows 2 1 and 2 0, then M’ will contain rows with 1 1 and 0 0 in those columns.

23 Initial Observations  If a forced expansion of two columns creates rows 0 1 and 1 0 in those columns, then any 2 2 in those columns must be set to 0 1 and 1 0. We say that the two columns are forced out-of-phase.  If a forced expansion of two columns creates rows 1 1 and 0 0 in those columns, then any 2 2 in those columns must be set to 1 1 and 0 0. We say that the two columns are forced in-phase.
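The two observations can be sketched as a check on a single pair of columns. Rows with a 2 in both columns are skipped, since their phase is what is being decided; every other row contributes the pairs it must produce in any expansion. Column indices are 0-based here, and the function names and the return values "in", "out", "fail" are illustrative labels.

```python
def forced_gametes(M, c1, c2):
    """Pairs that every expansion of M must contain in columns c1 and c2."""
    gametes = set()
    for row in M:
        x, y = row[c1], row[c2]
        if x == 2 and y == 2:
            continue  # unforced row: its phase is not fixed by the row itself
        xs = (0, 1) if x == 2 else (x,)
        ys = (0, 1) if y == 2 else (y,)
        gametes.update((a, b) for a in xs for b in ys)
    return gametes


def forced_phase(M, c1, c2):
    """'out', 'in', 'fail' (all four pairs already forced), or None if unforced."""
    g = forced_gametes(M, c1, c2)
    out_of_phase = {(0, 1), (1, 0)} <= g
    in_phase = {(0, 0), (1, 1)} <= g
    if out_of_phase and in_phase:
        return "fail"
    if out_of_phase:
        return "out"
    if in_phase:
        return "in"
    return None


print(forced_phase([[2, 0], [0, 2]], 0, 1))          # out
print(forced_phase([[2, 1], [0, 2]], 0, 1))          # in
print(forced_phase([[2, 0], [1, 2], [0, 2]], 0, 1))  # fail
```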

24 Immediate Failure  It can happen that the forced expansion of cells creates a 4x2 submatrix that fails the 4-Gamete Test. In that case, there is no PPH solution for M.  Example: the rows 2 0, 1 2, and 0 2 will fail the 4-Gamete Test.

25 An O(ns^2)-time Algorithm  Find all the forced phase relationships by considering columns in pairs.  Find all the inferred, invariant, phase relationships.  Find a set of column pairs whose phase relationship can be arbitrarily set, so that all the remaining phase relationships can be inferred.  Result: An implicit representation of all solutions to the PPH problem.

26 A Running Example
   1 2 3 4 5 6 7
A  1 2 2 2 0 0 0
B  2 0 2 0 0 0 2
C  1 2 2 2 0 2 0
D  1 2 2 0 2 0 0
E  2 2 0 0 0 2 0
F  0 0 0 0 0 0 0

27 Companion Graph G_c  Each node represents a column in M, and each edge indicates that the pair of columns has a row with 2’s in both columns. The algorithm builds this graph, and then checks whether any pair of nodes is forced in or out of phase.
[Figure: graph G_c on nodes 1 to 7, shown next to the running-example matrix M]

28 Phasing Edges in G_c  Each red edge indicates that the columns are forced in-phase. Each blue edge indicates that the columns are forced out-of-phase. Let G_f be the sub-graph of G_c defined by the red and blue edges.
[Figure: graph G_c on nodes 1 to 7 with red and blue edges]

29 Connected Components in G_f  Graph G_f has three connected components.
[Figure: graph G_f on nodes 1 to 7]
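The construction of G_c and G_f can be sketched on the running example. The matrix below assumes the six rows of the example are A to F in the order printed and the columns are 1 to 7; the helper names and the small union-find are illustrative, and the phase check repeats the pairwise rule from the earlier slides so that the block is self-contained.

```python
from itertools import combinations

# Running example, rows A..F, columns 1..7 (row order is an assumption).
M = [
    [1, 2, 2, 2, 0, 0, 0],
    [2, 0, 2, 0, 0, 0, 2],
    [1, 2, 2, 2, 0, 2, 0],
    [1, 2, 2, 0, 2, 0, 0],
    [2, 2, 0, 0, 0, 2, 0],
    [0, 0, 0, 0, 0, 0, 0],
]


def forced_phase(M, c1, c2):
    """'in', 'out', 'fail', or None for columns c1, c2 (0-based)."""
    g = set()
    for row in M:
        x, y = row[c1], row[c2]
        if x == 2 and y == 2:
            continue
        xs = (0, 1) if x == 2 else (x,)
        ys = (0, 1) if y == 2 else (y,)
        g.update((a, b) for a in xs for b in ys)
    out, inn = {(0, 1), (1, 0)} <= g, {(0, 0), (1, 1)} <= g
    return "fail" if out and inn else "out" if out else "in" if inn else None


def components(nodes, edges):
    """Connected components via a small union-find."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for v in nodes:
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())


s = len(M[0])
# G_c: an edge for every column pair sharing a row with 2 in both columns.
gc_edges = [(i + 1, j + 1) for i, j in combinations(range(s), 2)
            if any(r[i] == 2 and r[j] == 2 for r in M)]
# G_f: the edges of G_c that are forced in-phase or out-of-phase.
colors = {(i, j): forced_phase(M, i - 1, j - 1) for i, j in gc_edges}
gf_edges = {e: c for e, c in colors.items() if c in ("in", "out")}

print(gf_edges)
print(components(range(1, s + 1), gf_edges))  # [[1, 2, 3, 4, 6], [5], [7]]
```

Under that reading of the matrix, G_f comes out with three connected components, which agrees with the slide.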

30 Phase-parity Lemma  That’s nice, but how do we assign the colors?  Lemma 1: There is a solution to the PPH problem for M if and only if there is a coloring of the black edges of G_c with the following property: for any triangle in G_c containing at least one black edge, the coloring makes either 0 or 2 of the edges blue (i.e., out of phase).

31 A Weak Triangulation Rule  Theorem 1: If there are any black edges whose ends are in the same connected component of G_f, at least one such edge is in a triangle where the other edges are not black.  In every PPH solution, it must be colored so that the triangle has an even number of blue (out of phase) edges.  This is an “inferred” coloring.
[Figure: graph G_f on nodes 1 to 7]

32 [Figure: graph on nodes 2 to 7]

33 [Figure: graph on nodes 2 to 7]

34 [Figure: graph on nodes 2 to 7]

35 [Figure: graph on nodes 2 to 7]

36 Corollary  Inside any connected component of G_f, ALL the phase relationships on edges (columns of M) are uniquely determined, either as forced relationships based on pair- wise column comparisons, or by triangle-based inferred colorings.  Hence, the phase relationships of all the columns in a connected component of G_f are INVARIANT over all the solutions to the PPH problem.  The black edges in G_f can be ordered so that the inferred colorings can be done in linear time. Modification of DFS.
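The inferred colorings can be sketched as repeated application of the triangle rule: whenever a triangle has exactly one black edge, that edge is colored so the triangle ends up with 0 or 2 blue edges. The version below is a simple fixed-point loop over all triangles, not the linear-time DFS ordering mentioned on the slide, and it does not check for contradictory triangles. The edge colors in the example were worked out by hand from the running example under the assumption that its rows are A to F in the order printed.

```python
from itertools import combinations


def infer_colors(colors):
    """colors maps frozenset({u, v}) to 'red', 'blue', or None (black)."""
    colors = dict(colors)
    nodes = sorted({v for edge in colors for v in edge})
    changed = True
    while changed:
        changed = False
        for a, b, c in combinations(nodes, 3):
            triangle = [frozenset(p) for p in ((a, b), (b, c), (a, c))]
            if not all(e in colors for e in triangle):
                continue  # not a triangle of the graph
            black = [e for e in triangle if colors[e] is None]
            if len(black) != 1:
                continue
            blues = sum(colors[e] == "blue" for e in triangle)
            # One blue so far: the black edge must be blue; otherwise red.
            colors[black[0]] = "blue" if blues == 1 else "red"
            changed = True
    return colors


def E(u, v):
    return frozenset((u, v))


# Component {1, 2, 3, 4, 6} of G_f: forced edges plus its three black edges.
component = {
    E(1, 2): "red", E(1, 3): "red", E(1, 6): "red",
    E(2, 3): "blue", E(3, 6): "blue", E(4, 6): "blue",
    E(2, 4): None, E(2, 6): None, E(3, 4): None,
}
result = infer_colors(component)
print(result[E(2, 6)], result[E(3, 4)], result[E(2, 4)])  # red red blue
```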

37 Phase Parity Lemma: Proof  If two columns contain the rows 2 X, Y 2, and 2 2, with X ≠ 2 and Y ≠ 2, then the two columns are forced.

38 Phase Parity Lemma: Proof  Lemma: If a triangle contains a black edge, then a PPH solution exists only if there are 0 or 2 blue edges in the final coloring.  Proof: No black edge unless x = 2, y = 2, or z = 2 (previous lemma). If there is a row with all 2s, then there must be an even number of blue edges.
Rows of M in columns A, B, C:
A B C
2 2 y
x 2 2
2 z 2
[Figure: triangle on nodes A, B, C]

39 Proof of Weak Triangulation Theorem  Arbitrary chordless cycles are possible in the graph, with forced edges. See example. The pattern 0,2; 2,0; and 2,2 implies a blue (out of phase) edge.  A single unforced edge changes the picture.
A B C D E
2 2 0 0 0
0 2 2 0 0
0 0 2 2 0
0 0 0 2 2
2 0 0 0 2
[Figure: cycle on nodes A, B, C, D, E]

40 Proof of Weak Triangulation Theorem  Let (J,J’) be a black edge connecting a ‘long’ path J,K,…K’,J’ of forced edges  In the Matrix, x ≠ 2, otherwise there is a chord. Likewise y≠2  By previous lemma, (J,J’) is forced 2 2 x y 2 2 2 2 K J J’ K’ J J’ K K’

41 Finishing the Solution  Problem: A connected component C of G_c may contain several connected components of G_f, so any edge crossing two components of G_f will still be black. How should they be colored?

42 How should we color the remaining black edges in a connected component C of G_c?
[Figure: graph on nodes 1 to 7]

43 Answer  For a connected component C of G_c with k connected components of G_f, select any subset S of k-1 black edges in C, so that S together with the red and blue edges span all the nodes of C. Arbitrarily color each edge in S either red or blue. Infer the color of any remaining black edges by successive use of the triangle rule.
[Figure: graph on nodes 2 to 7]

44 [Figure: graph on nodes 2 to 7]

45 Theorem 2  Any selected S works (allows the triangle rule to work) and any coloring of the edges in S determines the colors of any remaining black edges.  Different colorings of S determine different colorings of the remaining black edges.  Each different coloring of S determines a different solution to the PPH problem.  All PPH solutions can be obtained in this way, i.e. using just one selected S set, but coloring it in all 2^(k-1) ways.

46 Corollary  In a single connected component C of G_c with k connected components in G_f, there are exactly 2^(k-1) different solutions to the PPH problem in the columns of M represented by C.  If G_c has r connected components and G_f has t connected components, then there are exactly 2^(t-r) solutions to the PPH problem.  There is one unique PPH solution if and only if each connected component in G_c is a connected component in G_f.
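The count in this corollary can be sketched directly from the two edge lists. The example uses the edges of G_c and G_f worked out by hand for the running example (rows assumed to be A to F in the order printed), and the formula is only meaningful when at least one PPH solution exists. Function names are illustrative.

```python
def count_components(nodes, edges):
    """Number of connected components, using a small union-find."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in edges:
        parent[find(a)] = find(b)
    return len({find(v) for v in nodes})


def count_pph_solutions(nodes, gc_edges, gf_edges):
    """2^(t - r): r components of G_c, t components of G_f (if a solution exists)."""
    r = count_components(nodes, gc_edges)
    t = count_components(nodes, gf_edges)
    return 2 ** (t - r)


nodes = range(1, 8)
gc_edges = [(1, 2), (1, 3), (1, 6), (1, 7), (2, 3), (2, 4), (2, 5),
            (2, 6), (3, 4), (3, 5), (3, 6), (3, 7), (4, 6)]
gf_edges = [(1, 2), (1, 3), (1, 6), (2, 3), (3, 6), (4, 6)]

print(count_pph_solutions(nodes, gc_edges, gf_edges))  # 4
```

With r = 1 and t = 3 the running example has 2^2 = 4 solutions, one for each way of coloring the two selected black edges that join columns 5 and 7 to the large component.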

47 Algorithm  Build graph G_c and find its connected components. Solve each connected component C of G_c separately.  Find the forced (red or blue) edges. Let G_f be the subgraph of C containing colored edges.  Find each connected component of G_f and make the inferred edge colorings (phase decisions).  Find a spanning tree of uncolored edges in C, color those edges arbitrarily, and follow the inferred edge colorings.

48 Conclusion  In the special case of blocks with no recombination and no recurrent mutations, the haplotypes satisfy a perfect phylogeny.  Given a set of genotypes, there is an efficient (O(ns^2)) algorithm for representing all possible haplotype solutions that satisfy a perfect phylogeny.  Efficiency: Input is size O(ns). All operations except building the graph are O(ns+s^2). Valid PPH only if s = O(n). Is O(ns) possible?  Current best solution is O(ns+n^(1-e) s^2) using a matrix multiplication idea.  Future work involves combining this with some heuristics to deal with general cases (low recombination/high recombination).

49 Simulated Data  Coalescent model (Hudson)  No Recombination  400 chromosomes, 100 sites  Infinite sites  Recombination  100 chromosomes  Infinite sites  R=4.0 2501  Pr(Recombination) = 4*10^(-9) between adjacent bases

50 Error Measurement  Discrepancy = 1 (Num Haplotypes incorrectly predicted)  Switch Error = 2 00101 01010 00000 11111 01010 00101 01010 10101 02222 22222
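A hedged sketch of the switch error for one individual, using the common definition: walk along the heterozygous sites and count how often the predicted haplotype changes which true haplotype it follows. The slide's own example values are not reproduced here; the inputs below are made up for illustration, and the function assumes the predicted pair explains the same genotype as the true pair.

```python
def switch_errors(true_pair, predicted_pair):
    """Number of phase switches between a true and a predicted haplotype pair."""
    t1, t2 = true_pair
    p1, _ = predicted_pair
    het = [i for i in range(len(t1)) if t1[i] != t2[i]]
    # At each heterozygous site: does predicted haplotype 1 follow true haplotype 1?
    follows_t1 = [p1[i] == t1[i] for i in het]
    return sum(a != b for a, b in zip(follows_t1, follows_t1[1:]))


true_pair = ("00000", "11111")
print(switch_errors(true_pair, ("00000", "11111")))  # 0
print(switch_errors(true_pair, ("11111", "00000")))  # 0, order of the pair is irrelevant
print(switch_errors(true_pair, ("00011", "11100")))  # 1
print(switch_errors(true_pair, ("01010", "10101")))  # 4
```

A single switch can make both predicted haplotypes wrong, which is why switch error and the count of incorrectly predicted haplotypes can tell different stories about the same prediction.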

51 No Recombination

52

53 Choosing between solutions

54

55

56 Conclusion  Extremely low error rates (< 1% discrepancy) if no recombination  Randomly choosing between equivalent solutions is sufficient  Other measures (Parsimony, Likelihood, Entropy) do not improve the quality of solution

57 With Recombination

58 Problems  Many of the earlier problems (structure, recombination rate, etc.) correspond to phased data.  Can they be resolved for unphased data?

