L5: Estimating Recombination Rates

Review
- m_M: the minimum number of recombination events in any explanation of the haplotypes in M.
- Last time, we covered 3 lower bounds on m_M.
- The only exact algorithm known is super-exponential. Not even an exponential-time algorithm is known.
- Can we get efficient upper bounds that are tight?
- Idea: an R_s-like method can be used to get an upper bound.

Upper bounds

R_s bound
Procedure Compute_R_s(M)
  if ∃ non-informative column s
    return Compute_R_s(M - {s})
  else if ∃ redundant row h
    return Compute_R_s(M - {h})
  else
    return 1 + min_h Compute_R_s(M - {h})

Upper bound
Procedure Compute_U(M)
  if ∃ non-informative column s
    return Compute_U(M - {s})
  else if ∃ redundant row h
    return Compute_U(M - {h})
  else
    return min_h ( f(h, M - {h}) + Compute_U(M - {h}) )

Here f(h, M - {h}) is the number of recombinations needed to explain h.
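The Compute_U recursion can be sketched in Python as below. This is only an illustration under stated assumptions: the slide does not define f, so the sketch takes f(h, M - {h}) to be the fewest breakpoints needed to copy h piecewise from the remaining rows (the helper `mosaic_cost` is a hypothetical name), and it treats a column as non-informative when its rarer allele occurs fewer than twice. The running time is exponential in the number of rows, so it is only meant for tiny matrices.

```python
from functools import lru_cache


def reduce_matrix(rows):
    """Repeatedly drop non-informative columns and duplicate rows."""
    rows = set(rows)  # a set removes redundant (duplicate) rows
    while True:
        n_cols = len(next(iter(rows)))
        keep = [
            j for j in range(n_cols)
            if min(sum(r[j] for r in rows), sum(1 - r[j] for r in rows)) >= 2
        ]
        if len(keep) == n_cols:
            return tuple(sorted(rows))
        rows = {tuple(r[j] for j in keep) for r in rows}


def mosaic_cost(h, others):
    """Assumed f: fewest breakpoints needed to copy h from rows in others."""
    inf = float("inf")
    cost = [0 if r[0] == h[0] else inf for r in others]
    for j in range(1, len(h)):
        best = min(cost)
        cost = [min(c, best + 1) if r[j] == h[j] else inf
                for c, r in zip(cost, others)]
    return min(cost)


@lru_cache(maxsize=None)
def _compute_u(rows):
    # rows is already reduced; fewer than two rows need no recombination
    if len(rows) <= 1:
        return 0
    candidates = []
    for h in rows:
        rest = tuple(r for r in rows if r != h)
        candidates.append(mosaic_cost(h, rest) + compute_u(rest))
    return min(candidates)


def compute_u(rows):
    """Upper-bound sketch; rows is a non-empty collection of 0/1 tuples."""
    return _compute_u(reduce_matrix(rows))
```

For the four haplotypes 00, 01, 10, 11, removing any one row costs one breakpoint and the remaining three rows reduce to nothing, so the sketch returns 1.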

Many approaches to estimating ρ

1. Counting methods
- R_m
- R_h
- R_s
- ARG with the minimum number of recombinations
- These numbers correlate with ρ, but how do we get a value for ρ given such a number?
- These numbers still have value in defining hot-spots of recombination (showing variance in local recombination rates).
- They generally underestimate the true number of recombinations.

2. Model based approaches
- Full likelihood approaches
- Approximate likelihood approaches
Fearnhead, Donnelly

Approximate Likelihood approaches
- Two-locus sampling
- A 4-gamete violation implies recombination.
- Generalization:
  - Define the vector n = {n_00, n_01, n_10, n_11} for a pair of loci.
  - The distribution of n depends upon ρ, θ.
  - Can we compute Pr(n | ρ, θ)? Then we can iterate to get the maximum likelihood estimator for ρ.
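As a small illustration of the two-locus vector, the following Python sketch counts n = (n_00, n_01, n_10, n_11) for a chosen pair of loci. The function names and the tuple-of-0/1 haplotype encoding are assumptions made for this example, not part of the lecture.

```python
from collections import Counter


def two_locus_counts(haplotypes, i, j):
    """Return (n_00, n_01, n_10, n_11) for loci i and j."""
    counts = Counter((h[i], h[j]) for h in haplotypes)
    return tuple(counts[(a, b)] for a in (0, 1) for b in (0, 1))


def violates_four_gamete(n):
    """All four gametes present means a 4-gamete violation."""
    return all(c > 0 for c in n)


haplotypes = [(0, 0), (0, 1), (1, 0), (1, 0)]
n = two_locus_counts(haplotypes, 0, 1)
```

Here `n` is `(1, 1, 2, 0)`, so this pair of loci shows no 4-gamete violation.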

Two locus method
- Generate MANY random ARGs with n = n_00 + n_01 + n_10 + n_11 leaves.
- For each ARG, generate the two trees corresponding to the two loci.
- Drop 2 mutations at random, to get a value for n.
- How can you make this more efficient?
  - Given an ARG (topology), we know the edge pairs that would generate the desired n.

Two locus estimation

Multi locus estimator
- For a site with multiple loci, assume each pair to be independent, each generating a vector n_i.
- Assume the recombination rate (per bp) to be constant in the region.
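Under these two assumptions the pairwise likelihoods multiply, so their logarithms add. The sketch below shows that structure only. The callable `pair_log_lik(n, rho)` is a hypothetical stand-in for a pre-computed two-locus table of log Pr(n | ρ) with θ held fixed, and the grid search is a simple substitute for whatever optimizer is actually used.

```python
def composite_log_likelihood(rho_per_bp, pairs, pair_log_lik):
    """Sum two-locus log-likelihoods over all locus pairs.

    pairs: list of (n, distance_bp), where n = (n_00, n_01, n_10, n_11).
    pair_log_lik: hypothetical function returning log Pr(n | rho) for one pair.
    """
    return sum(pair_log_lik(n, rho_per_bp * d) for n, d in pairs)


def max_composite_likelihood(pairs, pair_log_lik, grid):
    """Pick the per-bp rate on the grid with the highest composite likelihood."""
    return max(
        grid,
        key=lambda rho: composite_log_likelihood(rho, pairs, pair_log_lik),
    )


def toy_log_lik(n, rho):
    """Toy stand-in (not a real two-locus likelihood): peaks at rho = 2."""
    return -(rho - 2.0) ** 2


pairs = [((5, 1, 2, 2), 1000), ((4, 2, 3, 1), 1000)]
best = max_composite_likelihood(pairs, toy_log_lik, [0.001, 0.002, 0.003])
```

With the toy function, every pair is 1000 bp apart, so the grid point 0.002 per bp is selected.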

Performance of the 2 locus estimator
- The composite likelihood estimator performs 'well' in practice.
- Note that the required values can be pre-computed, making this a fast method.
- Note that this plot does not describe the variance.

Performance: 90/10 percentile

Research: 2 locus versus other statistics
- Q1: Can we use some of the counting based methods as a summary statistic?
  - It is better than composite likelihood in that:
    - It does not assume independence between loci.
    - There is a direct linear relationship (the expected number of recombination events is ρ log n).
    - Variation might be better.
  - Can we compute Pr(R_h | ρ, θ) efficiently? In a sense, it does not matter, because we can pre-compute the numbers.
- Incorporate distance constraints in computing these summary statistics. It is reasonable to assume that the rate is constant per bp within a window.

Research Problem
- Recombination hot-spots are NOT correlated between humans and chimps.
  - 99% sequence identity
  - Virtually no overlap between hot-spots (generated using pop. genetics)
- What can cause this?
  - Method
    - Europeans/Africans share hot-spots
    - Concordance with sperm typing
  - Population sub-structure? Not (as shown by structure)
  - Genomic factors

Genomic factors
- Recombination is elevated in GC rich regions.
- Epigenetic factors (such as acetylation, methylation) that affect chromatin structure might be key.
- Yeast is a useful model for studying recombination.
- In yeast, recombination hotspots can be eliminated by insertion of transposable elements!
- Can differential insertion of Alus explain the differences between chimps/humans?

Haplotype Phasing

Genotypes and Haplotypes
- Each individual has two "copies" of each chromosome.
- At each site, each chromosome has one of two alleles.
- Current genotyping technology doesn't give phase.
[Figure: genotype for the individual]

Why is haplotype phasing important?

Haplotype Phasing
- Haplotype phasing is the resolution of a genotype into the two haplotypes.
- Haplotypes increase the power of an association between marker loci and phenotypic traits.
- Current approaches to haplotyping:
  - Via technological innovations (expensive)
  - Statistical methods (ML, Phase, PL)
- In this lecture, we will consider a combinatorial approach to the phasing problem:
  - Efficient, provable quality of solution
  - Not completely generalizable (as yet)

The Perfect Phylogeny Model
- We assume that the evolution of extant haplotypes can be displayed on a rooted, directed tree, with the all-0 haplotype at the root, where each site changes from 0 to 1 on exactly one edge, and each extant haplotype is created by accumulating the changes on a path from the root to a leaf, where that haplotype is displayed.
- In other words, the extant haplotypes evolved along a perfect phylogeny with an all-0 root.
[Figure: tree with the extant haplotypes at the leaves]

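Haplotyping via Perfect Phylogeny
PPH: Given a set of genotypes, find an explaining set of haplotypes that fits a perfect phylogeny.

Genotypes
     1 2
  a  2 2
  b  0 2
  c  1 0

Haplotypes
     1 2
  a  1 0
  a  0 1
  b  0 0
  b  0 1
  c  1 0
  c  1 0

[Figure: perfect phylogeny displaying these haplotypes, with leaves c, c, a, a, b, b]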

The Alternative Explanation

Genotypes
     1 2
  a  2 2
  b  0 2
  c  1 0

Haplotypes
     1 2
  a  1 1
  a  0 0
  b  0 0
  b  0 1
  c  1 0
  c  1 0

No tree is possible for this explanation.

The 4 Gamete Test for Perfect Phylogeny
- Arrange the haplotypes in a matrix, two haplotypes for each individual.
- Then (with no duplicate columns), the haplotypes fit a unique perfect phylogeny if and only if no two columns contain all four pairs 0,0 and 0,1 and 1,0 and 1,1 (Buneman).
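The test is easy to state in code. The sketch below, with an assumed function name and a tuple-of-0/1 row encoding, lists every column pair that contains all four gametes, and applies it to the two explanations of genotypes a = 2 2, b = 0 2, c = 1 0 from the slides.

```python
from itertools import combinations


def four_gamete_violations(haplotypes):
    """Return the column pairs (i, j) that contain all four gametes."""
    n_sites = len(haplotypes[0])
    return [
        (i, j)
        for i, j in combinations(range(n_sites), 2)
        if len({(h[i], h[j]) for h in haplotypes}) == 4
    ]


# Tree explanation: a -> 10, 01; b -> 00, 01; c -> 10, 10
tree_explanation = [(1, 0), (0, 1), (0, 0), (0, 1), (1, 0), (1, 0)]
# Alternative explanation: a -> 11, 00; b -> 00, 01; c -> 10, 10
alternative_explanation = [(1, 1), (0, 0), (0, 0), (0, 1), (1, 0), (1, 0)]

tree_result = four_gamete_violations(tree_explanation)
alternative_result = four_gamete_violations(alternative_explanation)
```

The tree explanation yields no violating pair, while the alternative explanation reports the pair of sites (0, 1), matching "no tree possible".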

The Alternative Explanation

Genotypes
     1 2
  a  2 2
  b  0 2
  c  1 0

Haplotypes
     1 2
  a  1 1
  a  0 0
  b  0 0
  b  0 1
  c  1 0
  c  1 0

No tree is possible for this explanation.

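The Tree Explanation Again

Genotypes
     1 2
  a  2 2
  b  0 2
  c  1 0

Haplotypes
     1 2
  a  1 0
  a  0 1
  b  0 0
  b  0 1
  c  1 0
  c  1 0

[Figure: perfect phylogeny displaying these haplotypes, with leaves c, c, a, a, b, b]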

The Combinatorial Problem
- Input: A ternary matrix (0,1,2) M with N rows.
- Output: A binary matrix M' created from M by expanding each row into two rows, replacing each 2 in M with a 0 in one row and a 1 in the other, such that M' passes the 4 gamete test.
- Gusfield (Recomb 2002) proposed a solution which used a reduction to matroids.
- We present a (slightly inefficient) solution using elementary techniques.
- Obtained independently by Eskin, Halperin, Karp ('02).
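To make the input/output definition concrete, here is a brute-force sketch that tries every expansion of M and keeps those passing the 4 gamete test. It is exponential and is not the algorithm developed in the lecture; the function names are assumptions for this example.

```python
from itertools import combinations, product


def expansions(genotype):
    """Yield each unordered haplotype pair that explains one genotype."""
    twos = [i for i, g in enumerate(genotype) if g == 2]
    if not twos:
        yield tuple(genotype), tuple(genotype)
        return
    # Fix the first heterozygous site to 0 in the first haplotype
    # so that mirror-image pairs are not listed twice.
    for bits in product((0, 1), repeat=len(twos) - 1):
        first, second = list(genotype), list(genotype)
        for site, bit in zip(twos, (0,) + bits):
            first[site], second[site] = bit, 1 - bit
        yield tuple(first), tuple(second)


def passes_four_gamete_test(haplotypes):
    n_sites = len(haplotypes[0])
    return all(
        len({(h[i], h[j]) for h in haplotypes}) < 4
        for i, j in combinations(range(n_sites), 2)
    )


def brute_force_pph(genotypes):
    """Return every expansion M' of the genotypes that passes the test."""
    solutions = []
    for choice in product(*(list(expansions(g)) for g in genotypes)):
        haplotypes = [h for pair in choice for h in pair]
        if passes_four_gamete_test(haplotypes):
            solutions.append(haplotypes)
    return solutions


solutions = brute_force_pph([(2, 2), (0, 2), (1, 0)])
```

For the genotypes a = 2 2, b = 0 2, c = 1 0, only the expansion that resolves a into 0 1 and 1 0 survives.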

Initial Observations
- Forced expansions:
  - EX 1: If two columns (sites) of M contain the following rows [figure: example rows], then M' will contain a row with 1 0 and a row with 0 1 in those columns.
  - EX 2: Similarly, if two columns of M contain the rows [figure: example rows], then M' will contain rows with 1 1 and 0 0 in those columns.

Initial Observations
- If a forced expansion of two columns creates rows 0 1 and 1 0 in those columns, then any 2 2 in those columns must be set to 0 1 and 1 0. We say that the two columns are forced out-of-phase.
- If a forced expansion of two columns creates rows 1 1 and 0 0 in those columns, then any 2 2 in those columns must be set to 1 1 and 0 0. We say that the two columns are forced in-phase.

Immediate Failure
It can happen that the forced expansion of cells creates a 4x2 submatrix that fails the 4-Gamete Test. In that case, there is no PPH solution for M.
[Figure: example pair of columns whose forced expansion fails the 4-Gamete Test]
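The forced relationships and the immediate-failure case can be checked for one pair of columns as sketched below. The function names and the string labels are assumptions for this illustration. Rows with a 2 in both columns are skipped, since their phase is what is being decided.

```python
def forced_pairs(column_x, column_y):
    """Collect the 0/1 pairs that rows without a double 2 must produce."""
    forced = set()
    for x, y in zip(column_x, column_y):
        if x == 2 and y == 2:
            continue  # this row's phase is decided later
        xs = (0, 1) if x == 2 else (x,)
        ys = (0, 1) if y == 2 else (y,)
        forced.update((a, b) for a in xs for b in ys)
    return forced


def phase_relation(column_x, column_y):
    """Classify a column pair as fail, out-of-phase, in-phase or unforced."""
    forced = forced_pairs(column_x, column_y)
    if len(forced) == 4:
        return "fail"  # all four gametes are already forced
    if {(0, 1), (1, 0)} <= forced:
        return "out-of-phase"
    if {(0, 0), (1, 1)} <= forced:
        return "in-phase"
    return "unforced"
```

For example, the rows 2 0, 0 2 and 2 2 give `phase_relation([2, 0, 2], [0, 2, 2]) == "out-of-phase"`, while the rows 2 0 and 2 1 force all four gametes and return `"fail"`.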

An O(ns^2)-time Algorithm
- Find all the forced phase relationships by considering columns in pairs.
- Find all the inferred, invariant, phase relationships.
- Find a set of column pairs whose phase relationship can be arbitrarily set, so that all the remaining phase relationships can be inferred.
- Result: An implicit representation of all solutions to the PPH problem.

A Running Example
[Figure: example matrix M with columns A, B, C, D, E, F]

Companion Graph G_c
Each node represents a column in M, and each edge indicates that the pair of columns has a row with 2's in both columns. The algorithm builds this graph, and then checks whether any pair of nodes is forced in or out of phase.
[Figure: graph G_c on nodes A, B, C, D, E, F]
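Building G_c from the matrix takes only a few lines. In this sketch columns are numbered from 0 rather than lettered, and the function names and the small matrix are assumptions made for illustration (the running example's matrix is not reproduced here).

```python
from itertools import combinations


def companion_graph(M):
    """Edges join two columns that share a row with 2's in both."""
    edges = set()
    for row in M:
        heterozygous = [j for j, value in enumerate(row) if value == 2]
        edges.update(combinations(heterozygous, 2))
    return edges


def connected_components(n_nodes, edges):
    """Connected components by depth-first search."""
    adjacency = {v: set() for v in range(n_nodes)}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, components = set(), []
    for start in range(n_nodes):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            v = stack.pop()
            if v in component:
                continue
            component.add(v)
            stack.extend(adjacency[v] - component)
        seen |= component
        components.append(component)
    return components


M = [(2, 2, 0, 0), (0, 2, 2, 0), (1, 0, 0, 1)]
edges = companion_graph(M)
components = connected_components(4, edges)
```

In this small matrix columns 0, 1 and 2 end up in one component and column 3 is isolated.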

Phasing Edges in G_c
- Each red edge indicates that the columns are forced in-phase.
- Each blue edge indicates that the columns are forced out-of-phase.
- Let G_f be the sub-graph of G_c defined by the red and blue edges.

Connected Components in G_f
- Graph G_f has three connected components.

Phase-parity Lemma
That's nice, but how do we assign the colors?
- Lemma 1: There is a solution to the PPH problem for M if and only if there is a coloring of the black edges of G_c with the following property: for any triangle in G_c containing at least one black edge, the coloring makes either 0 or 2 of the edges blue (i.e., out of phase).
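The parity condition means that once two edges of a triangle are colored, the third is determined. The sketch below applies that rule repeatedly until no further black edge can be colored. The edge representation (a frozenset of two node names) and the function names are assumptions for this example, and the simple repeated scan is not the linear-time ordering mentioned later in the lecture.

```python
RED, BLUE = "red", "blue"


def edge(u, v):
    return frozenset((u, v))


def infer_third_edge(color_1, color_2):
    """A triangle must have 0 or 2 blue edges, so the third edge is blue
    exactly when one of the two known edges is blue."""
    return BLUE if (color_1 == BLUE) != (color_2 == BLUE) else RED


def propagate_triangle_rule(colors, black_edges):
    """Color black edges that close a triangle with two colored edges."""
    colors = dict(colors)
    remaining = set(black_edges)
    nodes = {v for e in list(colors) + list(remaining) for v in e}
    changed = True
    while changed:
        changed = False
        for e in list(remaining):
            u, v = tuple(e)
            for w in nodes - e:
                a, b = colors.get(edge(u, w)), colors.get(edge(v, w))
                if a and b:
                    colors[e] = infer_third_edge(a, b)
                    remaining.discard(e)
                    changed = True
                    break
    return colors, remaining


forced = {edge("A", "B"): BLUE, edge("B", "C"): RED, edge("C", "D"): RED}
colors, remaining = propagate_triangle_rule(
    forced, {edge("A", "C"), edge("A", "D")}
)
```

Here A-C is inferred blue from triangle A, B, C, and then A-D is inferred blue from triangle A, C, D.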

A Weak Triangulation Rule
- Theorem 1: If there are any black edges whose ends are in the same connected component of G_f, at least one such edge is in a triangle where the other edges are not black.
- In every PPH solution, it must be colored so that the triangle has an even number of blue (out-of-phase) edges.
- This is an "inferred" coloring.
[Figure: graph G_f]

Corollary
- Inside any connected component of G_f, ALL the phase relationships on edges (columns of M) are uniquely determined, either as forced relationships based on pairwise column comparisons, or by triangle-based inferred colorings.
- Hence, the phase relationships of all the columns in a connected component of G_f are INVARIANT over all the solutions to the PPH problem.
- The black edges inside a connected component of G_f can be ordered so that the inferred colorings can be done in linear time (a modification of DFS).

Phase Parity Lemma: Proof
Consider two columns containing the rows 2 X and Y 2. If X ≠ 2 and Y ≠ 2, then the two columns are forced.

Phase Parity Lemma: Proof
[Figure: three columns A, B, C with entries 2, 2, x, y, z, and the triangle A, B, C]
- Lemma: If a triangle contains a black edge, then a PPH solution exists only if there are 0 or 2 blue edges in the final coloring.
- Proof:
  - No black edge unless x == 2, or y == 2, or z == 2 (previous lemma).
  - If there is a row with all 2s, then there must be an even number of blue edges.

Proof of Weak Triangulation Theorem
- Arbitrary chordless cycles are possible in the graph, with forced edges.
- See example. The pattern 0,2; 2,0; and 2,2 implies a blue (out-of-phase) edge.
- A single unforced edge changes the picture.
[Figure: example matrix and cycle on columns A, B, C, D, E]

Proof of Weak Triangulation Theorem
- Let (J,J') be a black edge connecting a 'long' path J,K,…,K',J' of forced edges.
- In the matrix, x ≠ 2, otherwise there is a chord. Likewise y ≠ 2.
- By the previous lemma, (J,J') is forced.
[Figure: matrix rows on columns J, K, K', J' with entries 2, 2, x, y, and the cycle J, K, …, K', J']

Finishing the Solution
Problem: A connected component C of G_c may contain several connected components of G_f, so any edge crossing two components of G_f will still be black. How should they be colored?

How should we color the remaining black edges in a connected component C of G_c?

Answer
- For a connected component C of G_c with k connected components of G_f, select any subset S of k-1 black edges in C, so that S together with the red and blue edges spans all the nodes of C.
- Arbitrarily color each edge in S either red or blue.
- Infer the color of any remaining black edges by successive use of the triangle rule.

Theorem 2
- Any selected S works (allows the triangle rule to work), and any coloring of the edges in S determines the colors of any remaining black edges.
- Different colorings of S determine different colorings of the remaining black edges.
- Each different coloring of S determines a different solution to the PPH problem.
- All PPH solutions can be obtained in this way, i.e. using just one selected S set, but coloring it in all 2^(k-1) ways.

Corollary
- In a single connected component C of G_c with k connected components in G_f, there are exactly 2^(k-1) different solutions to the PPH problem in the columns of M represented by C.
- If G_c has r connected components and G_f has t connected components, then there are exactly 2^(t-r) solutions to the PPH problem.
- There is one unique PPH solution if and only if each connected component in G_c is a connected component in G_f.

Algorithm
- Build graph G_c and find its connected components. Solve each connected component C of G_c separately.
- Find the forced (red or blue) edges. Let G_f be the subgraph of C containing the colored edges.
- Find each connected component of G_f and make the inferred edge colorings (phase decisions).
- Find a spanning tree of uncolored edges in C, color those edges arbitrarily, and follow the inferred edge colorings.

Conclusion
- In the special case of blocks with no recombination and no recurrent mutations, the haplotypes satisfy a perfect phylogeny.
- Given a set of genotypes, there is an efficient (O(ns^2)) algorithm for representing all possible haplotype solutions that satisfy a perfect phylogeny.
- Efficiency:
  - Input is of size O(ns).
  - All operations except building the graph are O(ns+s^2).
  - Valid PPH only if s = O(n). Is O(ns) possible?
  - Current best solution is O(ns+n^(1-e) s^2), using a matrix multiplication idea.
- Future work involves combining this with some heuristics to deal with general cases (low recombination/high recombination).

Simulated Data
- Coalescent model (Hudson)
- No recombination
  - 400 chromosomes, 100 sites
  - Infinite sites
- Recombination
  - 100 chromosomes
  - Infinite sites
  - R=
  - Pr(Recombination) = 4*10^(-9) between adjacent bases

Error Measurement
- Discrepancy = 1 (Num Haplotypes incorrectly predicted)
- Switch Error =

No Recombination

Choosing between solutions

Conclusion
- Extremely low error rates (< 1% discrepancy) if no recombination.
- Randomly choosing between equivalent solutions is sufficient.
- Other measures (Parsimony, Likelihood, Entropy) do not improve the quality of solution.

With Recombination