
1 L5: Estimating Recombination Rates

2 Review
• m_M: min. number of recombination events in any explanation of the haplotypes in M
• Last time, we covered 3 lower bounds on m_M
• The only exact algorithm that is known is super-exponential. Not even an exponential-time algorithm is known.
• Can we get efficient upper bounds that are tight?
• Idea: An R_s-like method can be used to get an upper bound.

3 Upper bounds
R_s bound:
Procedure Compute_R_s(M)
  if ∃ non-informative column s
    return Compute_R_s(M − {s})
  else if ∃ redundant row h
    return Compute_R_s(M − {h})
  else
    return 1 + min_h Compute_R_s(M − {h})
Upper bound:
Procedure Compute_U(M)
  if ∃ non-informative column s
    return Compute_U(M − {s})
  else if ∃ redundant row h
    return Compute_U(M − {h})
  else
    return min_h ( f(h, M − {h}) + Compute_U(M − {h}) )
where f(h, M − {h}) is the number of recombinations needed to explain h.
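The R_s recursion on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a column is treated as "non-informative" when its minor allele appears in fewer than two rows, a row is "redundant" when it duplicates another row, and the names `compute_rs` and `_informative` are hypothetical. The recursion is exponential, so it is only usable on tiny matrices.

```python
from functools import lru_cache

def _informative(col):
    # Assumption: a column is informative only if both alleles occur at least twice.
    ones = sum(col)
    return 2 <= ones <= len(col) - 2

@lru_cache(maxsize=None)
def compute_rs(rows):
    """rows: tuple of equal-length tuples of 0/1 (one tuple per haplotype)."""
    if len(rows) <= 1 or not rows[0]:
        return 0
    cols = list(zip(*rows))
    keep = [j for j, c in enumerate(cols) if _informative(c)]
    if len(keep) < len(cols):
        # Drop all non-informative columns, then recurse.
        return compute_rs(tuple(tuple(r[j] for j in keep) for r in rows))
    if len(set(rows)) < len(rows):
        # Drop redundant (duplicate) rows, then recurse.
        return compute_rs(tuple(sorted(set(rows))))
    # No reduction applies: remove one row at the cost of one recombination.
    return 1 + min(compute_rs(rows[:i] + rows[i + 1:]) for i in range(len(rows)))
```

Compute_U has the same shape; only the last line changes, replacing the constant 1 with f(h, M − {h}), which this sketch does not attempt to compute.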

4 Many approaches to estimating ρ

5 1. Counting methods
• R_m
• R_h
• R_s
• ARG with min number of recombinations
• These numbers correlate with ρ, but how do we get a value for ρ given this number?
• These numbers still have value in defining hot-spots of recombination (showing variance in local recombination rates)
• They generally underestimate the true number of recombinations

6 2. Model based approaches
• Full likelihood approaches
• Approximate likelihood approaches
Fearnhead, Donnelly

7 Approximate Likelihood approaches
• Two locus sampling
• 4 gamete violation implies recombination.
• Generalization
  – Define vector n = {n_00, n_01, n_10, n_11} for a pair of loci
  – The distribution of n depends upon ρ, θ
  – Can we compute Pr(n | ρ, θ)? Then, we can iterate to get the max likelihood estimator for ρ.
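As a small illustration of the two-locus summary, the count vector n for a pair of loci can be tabulated directly from a list of haplotypes. The function names `two_locus_counts` and `four_gamete_violation` are hypothetical, and haplotypes are assumed to be sequences of 0/1 values.

```python
from collections import Counter

GAMETES = [(0, 0), (0, 1), (1, 0), (1, 1)]

def two_locus_counts(haplotypes, i, j):
    """Return {gamete: count} for loci i and j, i.e. n_00, n_01, n_10, n_11."""
    counts = Counter((h[i], h[j]) for h in haplotypes)
    return {g: counts.get(g, 0) for g in GAMETES}

def four_gamete_violation(n):
    """True when all four gametes are observed for the pair of loci."""
    return all(c > 0 for c in n.values())

haps = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 1)]
n = two_locus_counts(haps, 0, 1)
print(n, four_gamete_violation(n))
```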

8 Two locus method
• Generate MANY random ARGs with n = n_00 + n_01 + n_10 + n_11 leaves.
• For each ARG, generate the two trees corresponding to the two loci
• Drop 2 mutations at random, to get a value for n
• How can you make this more efficient?
• Given an ARG (topology), we know the edge pairs that would generate desired n.

9 Two locus estimation

10 Multi locus estimator
• For a region with multiple loci, assume each pair of loci to be independent, each generating a vector n_i
• Assume recombination rate (per bp) to be constant in the region
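Under these two assumptions, the composite log-likelihood is a sum of pairwise terms, and a grid search picks the per-bp rate that maximizes it. The sketch below is hedged: `pair_loglik` is a hypothetical user-supplied function returning log Pr(n | ρ) for one pair (for example a lookup into pre-computed tables), and `pairs` is assumed to hold (n, distance in bp) tuples.

```python
def composite_loglik(rho_per_bp, pairs, pair_loglik):
    """Sum of pairwise log-likelihoods, treating every pair of loci as independent.

    pairs: iterable of (n, dist_bp); the rate for a pair is rho_per_bp * dist_bp,
    which encodes the constant per-bp rate assumption.
    """
    return sum(pair_loglik(n, rho_per_bp * dist) for n, dist in pairs)

def estimate_rho(pairs, pair_loglik, grid):
    """Return the grid value of the per-bp rate with the highest composite likelihood."""
    return max(grid, key=lambda r: composite_loglik(r, pairs, pair_loglik))
```

A grid search is used here only because it is the simplest maximizer to state; any one-dimensional optimizer would serve the same role.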

11 Performance of the 2 locus estimator
• The composite likelihood estimator performs ‘well’ in practice.
• Note that the values of Pr(n | ρ, θ) can be pre-computed, making this a fast method.
• Note that this plot does not describe the variance

12 Performance: 90/10 percentile

13 Research: 2 locus versus other statistics
• Q1: Can we use some of the counting based methods as a summary statistic?
• It is better than composite likelihood in that
  – It does not assume independence between loci.
  – There is a direct linear relationship (expected number of recombination events is ρ log n)
  – Variation might be better.
• Can we compute Pr(R_h | ρ, θ) efficiently? In a sense, it does not matter, because we can pre-compute the numbers.
• Incorporate distance constraints in computing these summary statistics. It is reasonable to assume that the rate is constant per bp within a window.

14 Research Problem
• Recombination hot-spots are NOT correlated between humans and chimps.
  – 99% sequence identity
  – Virtually no overlap between hot-spots (generated using population genetics).
• What can cause this?
  – Method
    Europeans/Africans share hot-spots
    Concordance with sperm typing
  – Population sub-structure? Not (as shown by structure)
  – Genomic factors

15 Genomic factors
• Recombination is elevated in GC rich regions
• Epigenetic factors (such as acetylation, methylation) that affect chromatin structure might be key.
• Yeast is a useful model for studying recombination
• In yeast, recombination hotspots can be eliminated by insertion of transposable elements!
• Can differential insertion of Alus explain the differences between chimps/humans?

16 Haplotype Phasing

17 Genotypes and Haplotypes
• Each individual has two “copies” of each chromosome.
• At each site, each chromosome has one of two alleles
• Current genotyping technology doesn’t give phase
Haplotype 1: 0 1 1 1 0 0 1 1 0
Haplotype 2: 1 1 0 1 0 0 1 0 0
Genotype for the individual: 2 1 2 1 0 0 1 2 0
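The 0/1/2 genotype coding used on this slide can be reproduced with a short helper. The function name `genotype` is hypothetical; the coding assumed is 0 or 1 for a homozygous site and 2 for a heterozygous site, which matches the slide's example.

```python
def genotype(h1, h2):
    """Combine two haplotypes (sequences of 0/1) into a 0/1/2 genotype.

    A site where both haplotypes agree keeps that allele; a site where they
    differ is heterozygous and is coded 2, which discards the phase.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same number of sites")
    return [a if a == b else 2 for a, b in zip(h1, h2)]

h1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
h2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
print(genotype(h1, h2))
```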

18 Why is haplotype phasing important?

19 Haplotype Phasing
• Haplotype phasing is the resolution of a genotype into the two haplotypes.
• Haplotypes increase the power of an association between marker loci and phenotypic traits
• Current approaches to haplotyping
  – Via technological innovations (expensive)
  – Statistical methods (ML, Phase, PL)
• This lecture, we will consider a combinatorial approach to the phasing problem
  – Efficient, provable quality of solution
  – Not completely generalizable (as yet)

20 The Perfect Phylogeny Model
• We assume that the evolution of extant haplotypes can be displayed on a rooted, directed tree, with the all-0 haplotype at the root, where each site changes from 0 to 1 on exactly one edge, and each extant haplotype is created by accumulating the changes on a path from the root to a leaf, where that haplotype is displayed.
• In other words, the extant haplotypes evolved along a perfect phylogeny with all-0 root.
[Figure: perfect phylogeny rooted at 00000 with edges labeled by sites 1 to 5. Extant haplotypes (sites 1 2 3 4 5): 10100, 10000, 01011, 00010, 01010]

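21 Haplotyping via Perfect Phylogeny
PPH: Given a set of genotypes, find an explaining set of haplotypes that fits a perfect phylogeny
Genotypes (sites 1 2): a 2 2; b 0 2; c 1 0
Haplotypes (sites 1 2): a 1 0; a 0 1; b 0 0; b 0 1; c 1 0; c 1 0
[Figure: perfect phylogeny with edges labeled by sites 1 and 2 and leaves 10 (c, c, a), 01 (a, b), 00 (b)]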

22 The Alternative Explanation
Genotypes (sites 1 2): a 2 2; b 0 2; c 1 0
Haplotypes (sites 1 2): a 1 1; a 0 0; b 0 0; b 0 1; c 1 0; c 1 0
No tree possible for this explanation

23 The 4 Gamete Test for Perfect Phylogeny
• Arrange the haplotypes in a matrix, two haplotypes for each individual.
• Then (with no duplicate columns), the haplotypes fit a unique perfect phylogeny if and only if no two columns contain all four pairs (Buneman): 0,0 and 0,1 and 1,0 and 1,1
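The test on this slide is a pairwise scan over columns. The sketch below assumes haplotypes are given as equal-length sequences of 0/1; the function name `passes_four_gamete_test` is hypothetical. The two haplotype sets checked are the tree explanation and the alternative explanation for genotypes a, b, c from the surrounding slides.

```python
from itertools import combinations

def passes_four_gamete_test(haplotypes):
    """True if no pair of columns shows all four pairs 00, 01, 10, 11."""
    if not haplotypes:
        return True
    n_sites = len(haplotypes[0])
    for i, j in combinations(range(n_sites), 2):
        if len({(h[i], h[j]) for h in haplotypes}) == 4:
            return False
    return True

tree_explanation = [(1, 0), (0, 1), (0, 0), (0, 1), (1, 0), (1, 0)]
alternative = [(1, 1), (0, 0), (0, 0), (0, 1), (1, 0), (1, 0)]
print(passes_four_gamete_test(tree_explanation), passes_four_gamete_test(alternative))
```

This scan costs O(n s^2) for n haplotypes and s sites, since every column pair is examined once.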

24 The Alternative Explanation
Genotypes (sites 1 2): a 2 2; b 0 2; c 1 0
Haplotypes (sites 1 2): a 1 1; a 0 0; b 0 0; b 0 1; c 1 0; c 1 0
No tree possible for this explanation

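25 The Tree Explanation Again
Genotypes (sites 1 2): a 2 2; b 0 2; c 1 0
Haplotypes (sites 1 2): a 1 0; a 0 1; b 0 0; b 0 1; c 1 0; c 1 0
[Figure: perfect phylogeny with edges labeled by sites 1 and 2 and leaves c, c, a, a, b, b]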

26 The Combinatorial Problem
• Input: A ternary matrix (0,1,2) M with N rows
• Output: A binary matrix M’ with 2N rows, created from M by expanding each row into two rows and replacing each 2 with a 0 in one row and a 1 in the other, such that M’ passes the 4 gamete test
• Gusfield (RECOMB 2002) proposed a solution which used a reduction to matroids.
• We present a (slightly inefficient) solution using elementary techniques
• Independently by (Eskin, Halperin, Karp ’02)
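To make the input/output contract concrete, here is a brute-force baseline that enumerates every way of expanding the 2s and keeps the expansions that pass the 4 gamete test. It is exponential and is not the algorithm of this lecture; the names `phasings` and `brute_force_pph` are hypothetical.

```python
from itertools import combinations, product

def passes_four_gamete_test(haps):
    return all(len({(h[i], h[j]) for h in haps}) < 4
               for i, j in combinations(range(len(haps[0])), 2))

def phasings(g):
    """Yield the distinct haplotype pairs (h1, h2) that explain genotype row g."""
    het = [k for k, v in enumerate(g) if v == 2]
    base = [0 if v == 2 else v for v in g]
    if not het:
        yield tuple(base), tuple(base)
        return
    # Fix the first heterozygous site to 0 in h1 so mirrored pairs are not repeated.
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(base), list(base)
        for k, b in zip(het, (0,) + bits):
            h1[k], h2[k] = b, 1 - b
        yield tuple(h1), tuple(h2)

def brute_force_pph(M):
    """Return every expansion M' (as a list of 2N haplotypes) that passes the test."""
    solutions = []
    for choice in product(*(list(phasings(g)) for g in M)):
        haps = [h for pair in choice for h in pair]
        if passes_four_gamete_test(haps):
            solutions.append(haps)
    return solutions

print(brute_force_pph([(2, 2), (0, 2), (1, 0)]))
```

On the three-genotype example a = 22, b = 02, c = 10 only the out-of-phase expansion of a survives, which agrees with the tree explanation shown earlier.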

27 Initial Observations
• Forced Expansions:
• EX 1: If two columns (sites) of M contain the following rows
  2 0
  0 2
  then M’ will contain a row with 1 0 and a row with 0 1 in those columns.
• EX 2: Similarly, if two columns of M contain the rows
  2 1
  2 0
  then M’ will contain rows with 1 1 and 0 0 in those columns

28 Initial Observations
• If a forced expansion of two columns creates rows 0 1 and 1 0 in those columns, then any 2 2 in those columns must be set to be
  0 1
  1 0
  We say that the two columns are forced out-of-phase.
• If a forced expansion of two columns creates rows 1 1 and 0 0 in those columns, then any 2 2 in those columns must be set to be
  1 1
  0 0
  We say that the two columns are forced in-phase.

29 Immediate Failure
It can happen that the forced expansion of cells creates a 4x2 submatrix that fails the 4-Gamete Test. In that case, there is no PPH solution for M.
Example:
  2 0
  1 2
  0 2
Will fail the 4-Gamete Test
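The forced-phase rules and the immediate-failure case can be checked for one column pair with the sketch below. It makes an assumption about how to read the slides: every row that is not 2 2 in the pair contributes its forced gametes, and the pair is classified from that set. The names `forced_gametes` and `forced_phase` are hypothetical, and columns are 0-indexed.

```python
def forced_gametes(M, i, j):
    """Gametes that must appear in columns i, j of any expansion of M."""
    out = set()
    for row in M:
        x, y = row[i], row[j]
        if x == 2 and y == 2:
            continue  # a 2 2 row is the free choice, not a forced gamete
        xs = (0, 1) if x == 2 else (x,)
        ys = (0, 1) if y == 2 else (y,)
        out.update((a, b) for a in xs for b in ys)
    return out

def forced_phase(M, i, j):
    """Return 'fail', 'in', 'out', or None (unforced) for columns i and j."""
    g = forced_gametes(M, i, j)
    if len(g) == 4:
        return "fail"  # all four gametes are already forced
    if {(0, 0), (1, 1)} <= g:
        return "in"    # a 2 2 row must become 0 0 / 1 1
    if {(0, 1), (1, 0)} <= g:
        return "out"   # a 2 2 row must become 0 1 / 1 0
    return None

print(forced_phase([(2, 0), (0, 2), (2, 2)], 0, 1))   # EX 1 pattern
print(forced_phase([(2, 0), (1, 2), (0, 2)], 0, 1))   # immediate failure example
```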

30 An O(ns^2)-time Algorithm
• Find all the forced phase relationships by considering columns in pairs.
• Find all the inferred, invariant, phase relationships.
• Find a set of column pairs whose phase relationship can be arbitrarily set, so that all the remaining phase relationships can be inferred.
• Result: An implicit representation of all solutions to the PPH problem.

31 A Running Example
     1 2 3 4 5 6 7
  A  1 2 2 2 0 0 0
  B  2 0 2 0 0 0 2
  C  1 2 2 2 0 2 0
  D  1 2 2 0 2 0 0
  E  2 2 0 0 0 2 0
  F  0 0 0 0 0 0 0

32 Companion Graph G_c
Each node represents a column in M, and each edge indicates that the pair of columns has a row with 2’s in both columns. The algorithm builds this graph, and then checks whether any pair of nodes is forced in or out of phase.
[Figure: companion graph G_c on nodes 1 to 7 for the running example matrix]
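Building the companion graph is a single pass over the rows. In the sketch below the function name `companion_graph` is hypothetical and columns are 0-indexed, so node k in the code corresponds to column k + 1 on the slides.

```python
from itertools import combinations

def companion_graph(M):
    """Return the edge set of G_c: pairs of columns sharing a row with 2s in both."""
    edges = set()
    for row in M:
        het = [k for k, v in enumerate(row) if v == 2]
        edges.update(combinations(het, 2))  # pairs (i, j) with i < j
    return edges

rows = "1222000 2020002 1222020 1220200 2200020 0000000".split()
M = [tuple(int(c) for c in r) for r in rows]
print(sorted(companion_graph(M)))
```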

33 Phasing Edges in G_c
Each Red edge indicates that the columns are forced in-phase. Each Blue edge indicates that the columns are forced out-of-phase. Let G_f be the sub-graph of G_c defined by the red and blue edges.
[Figure: G_c on nodes 1 to 7 with red and blue edges marked]

34 Connected Components in G_f
• Graph G_f has three connected components
[Figure: G_f on nodes 1 to 7]

35 Phase-parity Lemma
That’s nice, but how do we assign the colors?
• Lemma 1: There is a solution to the PPH problem for M if and only if there is a coloring of the black (unforced) edges of G_c with the following property: For any triangle in G_c containing at least one black edge, the coloring makes either 0 or 2 of the edges blue (i.e., out of phase)

36 A Weak Triangulation Rule
• Theorem 1: If there are any black edges whose ends are in the same connected component of G_f, at least one edge is in a triangle where the other edges are not black
• In every PPH solution, it must be colored so that the triangle has an even number of Blue (out of Phase) edges.
• This is an “inferred” coloring.
[Figure: graph G_f]

37-40 [Figure-only slides: graph on nodes 2 to 7]

41 Corollary
• Inside any connected component of G_f, ALL the phase relationships on edges (pairs of columns of M) are uniquely determined, either as forced relationships based on pairwise column comparisons, or by triangle-based inferred colorings.
• Hence, the phase relationships of all the columns in a connected component of G_f are INVARIANT over all the solutions to the PPH problem.
• The black edges inside a connected component of G_f can be ordered so that the inferred colorings can be done in linear time. Modification of DFS.
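The triangle rule can be sketched as a simple fixed-point loop: whenever a black edge closes a triangle whose other two edges are already colored, give it the color that makes the number of blue edges even. This is a naive propagation, not the linear-time DFS ordering mentioned above. The name `infer_colors` is hypothetical, and the forced colors listed for the running example were worked out by hand from the pairwise rule, so treat them as an assumption.

```python
def key(a, b):
    return (a, b) if a < b else (b, a)

def infer_colors(nodes, edges, forced):
    """Extend the forced coloring to black edges using the even-blue triangle rule."""
    color = dict(forced)
    changed = True
    while changed:
        changed = False
        for (a, b) in edges:
            if (a, b) in color:
                continue
            for k in nodes:
                e1, e2 = key(a, k), key(b, k)
                if k not in (a, b) and e1 in color and e2 in color:
                    blues = (color[e1] == "blue") + (color[e2] == "blue")
                    # One blue neighbor edge forces blue; zero or two force red.
                    color[(a, b)] = "blue" if blues == 1 else "red"
                    changed = True
                    break
    return color

# Running example, columns numbered 1 to 7 (edge colors are a hand-derived assumption).
edges = {(1, 2), (1, 3), (1, 6), (1, 7), (2, 3), (2, 4), (2, 5),
         (2, 6), (3, 4), (3, 5), (3, 6), (3, 7), (4, 6)}
forced = {(1, 2): "red", (1, 3): "red", (1, 6): "red",
          (2, 3): "blue", (3, 6): "blue", (4, 6): "blue"}
print(infer_colors(range(1, 8), edges, forced))
```

Edges that cross between components of G_f, such as those touching columns 5 and 7 here, stay uncolored because no triangle through them has two colored sides.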

42 Phase Parity Lemma: Proof
  2 X
  Y 2
  2 2
If X ≠ 2 and Y ≠ 2, then the two columns are forced

43 Phase Parity Lemma: Proof
  A B C
  2 2 y
  x 2 2
  2 z 2
• Lemma: If a triangle contains a black edge, then a PPH solution exists only if there are 0 or 2 blue edges in the final coloring.
• Proof:
  – No black edge unless x==2, or y==2, or z==2 (previous lemma)
  – If there is a row with all 2s, then there must be an even number of blue edges
[Figure: triangle on nodes A, B, C]

44 Proof of Weak Triangulation Theorem
• Arbitrary chordless cycles are possible in the graph, with forced edges.
• See example. The pattern 0,2; 2,0; and 2,2 implies a blue (out of phase) edge
• A single unforced edge changes the picture
  A B C D E
  2 2 0 0 0
  0 2 2 0 0
  0 0 2 2 0
  0 0 0 2 2
  2 0 0 0 2
[Figure: chordless cycle on nodes A, B, C, D, E]

45 Proof of Weak Triangulation Theorem
• Let (J,J’) be a black edge connecting a ‘long’ path J,K,…K’,J’ of forced edges
• In the matrix, x ≠ 2, otherwise there is a chord. Likewise y ≠ 2
• By previous lemma, (J,J’) is forced
[Figure: matrix on columns K, J, J’, K’ with entries x and y, and the path J, K, …, K’, J’]

46 Finishing the Solution
Problem: A connected component C of G_c may contain several connected components of G_f, so any edge crossing two components of G_f will still be black. How should they be colored?

47 How should we color the remaining black edges in a connected component C of G_c?
[Figure: graph on nodes 1 to 7]

48 Answer
For a connected component C of G_c with k connected components of G_f, select any subset S of k-1 black edges in C, so that S together with the red and blue edges span all the nodes of C. Arbitrarily color each edge in S either red or blue. Infer the color of any remaining black edges by successive use of the triangle rule.
[Figure: graph on nodes 2 to 7]

49 [Figure-only slide: graph on nodes 2 to 7]

50 Theorem 2
• Any selected S works (allows the triangle rule to work) and any coloring of the edges in S determines the colors of any remaining black edges.
• Different colorings of S determine different colorings of the remaining black edges.
• Each different coloring of S determines a different solution to the PPH problem.
• All PPH solutions can be obtained in this way, i.e. using just one selected S set, but coloring it in all 2^(k-1) ways.

51 Corollary
• In a single connected component C of G_c with k connected components in G_f, there are exactly 2^(k-1) different solutions to the PPH problem in the columns of M represented by C.
• If G_c has r connected components and G_f has t connected components, then there are exactly 2^(t-r) solutions to the PPH problem.
• There is one unique PPH solution if and only if each connected component in G_c is a connected component in G_f.
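The 2^(t-r) count can be computed with two connected-component passes. The sketch below assumes a PPH solution exists (it does not check for immediate failure or parity conflicts), uses a union-find for the components, and uses hypothetical names `forced`, `components`, and `count_pph_solutions`.

```python
from itertools import combinations

def forced(M, i, j):
    """True if columns i, j are forced in-phase or out-of-phase by non-2 2 rows."""
    g = set()
    for row in M:
        x, y = row[i], row[j]
        if (x, y) != (2, 2):
            g.update((a, b) for a in ((0, 1) if x == 2 else (x,))
                            for b in ((0, 1) if y == 2 else (y,)))
    return {(0, 0), (1, 1)} <= g or {(0, 1), (1, 0)} <= g

def components(n, edges):
    """Number of connected components of a graph on nodes 0..n-1 (union-find)."""
    parent = list(range(n))
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    for a, b in edges:
        parent[find(a)] = find(b)
    return len({find(v) for v in range(n)})

def count_pph_solutions(M):
    """2^(t - r), with r and t the component counts of G_c and G_f."""
    n_cols = len(M[0])
    gc = {e for row in M
          for e in combinations([k for k, v in enumerate(row) if v == 2], 2)}
    gf = {e for e in gc if forced(M, *e)}
    r, t = components(n_cols, gc), components(n_cols, gf)
    return 2 ** (t - r)

rows = "1222000 2020002 1222020 1220200 2200020 0000000".split()
print(count_pph_solutions([tuple(int(c) for c in r) for r in rows]))
```

For the running example G_c is connected (r = 1) and G_f has three components (t = 3), so the formula gives four solutions.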

52 Algorithm
• Build graph G_c and find its connected components. Solve each connected component C of G_c separately.
• Find the forced (red or blue) edges. Let G_f be the subgraph of C containing colored edges.
• Find each connected component of G_f and make the inferred edge colorings (phase decisions).
• Find a spanning tree of uncolored edges in C, color those edges arbitrarily, and follow the inferred edge colorings

53 Conclusion
• In the special case of blocks with no recombination and no recurrent mutations, the haplotypes satisfy a perfect phylogeny
• Given a set of genotypes, there is an efficient (O(ns^2)) algorithm for representing all possible haplotype solutions that satisfy a perfect phylogeny
• Efficiency:
  – Input is size O(ns),
  – All operations except building the graph are O(ns+s^2)
  – Valid PPH only if s = O(n). Is O(ns) possible?
  – Current best solution is O(ns+n^(1-e) s^2) using Matrix Multiplication idea
• Future work involves combining this with some heuristics to deal with general cases (low recombination/high recombination)

54 Simulated Data
• Coalescent model (Hudson)
• No Recombination
  – 400 chromosomes, 100 sites
  – Infinite sites
• Recombination
  – 100 chromosomes
  – Infinite sites
  – R=4.0 2501
  – Pr(Recombination) = 4*10^(-9) between adjacent bases

55 Error Measurement
• Discrepancy = 1 (Num Haplotypes incorrectly predicted)
• Switch Error = 2
[Figure: example haplotypes 00101, 01010, 00000, 11111, 01010, 00101, 01010, 10101 and genotypes 02222, 22222]

56 No Recombination

58 Choosing between solutions

61 Conclusion
• Extremely low error rates (< 1% discrepancy) if no recombination
• Randomly choosing between equivalent solutions is sufficient
• Other measures (Parsimony, Likelihood, Entropy) do not improve the quality of solution

62 With Recombination

