L4: Counting Recombination events

Algorithm: Structure
Iteratively estimate (Z(0),P(0)), (Z(1),P(1)), ..., (Z(m),P(m)). After 'convergence', Z(m) is the answer.
Iteration:
  Guess Z(0)
  For m = 1, 2, ...
    Sample P(m) from Pr(P | X, Z(m-1))
    Sample Z(m) from Pr(Z | X, P(m))
How is this sampling done?
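One way to picture the two sampling steps is the sketch below. It is a simplified illustration under stated assumptions, not the Structure program itself: individuals are haploid, every locus has two alleles coded 0/1, P(k,l) (the frequency of allele 1 in population k at locus l) has a uniform Beta(1,1) prior, and every population is equally likely a priori. The function name `structure_gibbs` and its parameters are hypothetical.

```python
import random

def structure_gibbs(X, K, n_iter=200, seed=0):
    """Alternate sampling of P given (X, Z) and of Z given (X, P).

    X: list of individuals, each a list of 0/1 alleles (assumed haploid, biallelic).
    K: assumed number of populations. Returns the last sampled assignment Z.
    """
    rng = random.Random(seed)
    n, n_loci = len(X), len(X[0])
    Z = [rng.randrange(K) for _ in range(n)]  # guess Z(0)
    for _ in range(n_iter):
        # Sample P(m) from Pr(P | X, Z(m-1)): with a Beta(1,1) prior the
        # posterior of each frequency is Beta(1 + #ones, 1 + #zeros).
        P = []
        for k in range(K):
            members = [X[i] for i in range(n) if Z[i] == k]
            P.append([
                rng.betavariate(1 + sum(x[l] for x in members),
                                1 + sum(1 - x[l] for x in members))
                for l in range(n_loci)
            ])
        # Sample Z(m) from Pr(Z | X, P(m)): weight of population k for
        # individual i is the probability of i's alleles under P[k].
        for i in range(n):
            weights = []
            for k in range(K):
                w = 1.0
                for l in range(n_loci):
                    w *= P[k][l] if X[i][l] == 1 else 1.0 - P[k][l]
                weights.append(w)
            Z[i] = rng.choices(range(K), weights=weights)[0]
    return Z

X = [[0, 0, 0, 0], [0, 0, 0, 1], [1, 1, 1, 1], [1, 1, 1, 0]]
print(structure_gibbs(X, K=2, n_iter=50, seed=1))
```

Cluster labels are arbitrary (population 0 and 1 can swap between runs with different seeds), so only the grouping of individuals is meaningful.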

Allowing for admixture
Define q(i,k) as the fraction of individual i that originated from population k.
Iteration:
  Guess Z(0)
  For m = 1, 2, ...
    Sample P(m), Q(m) from Pr(P,Q | X, Z(m-1))
    Sample Z(m) from Pr(Z | X, P(m), Q(m))

Estimating Z (admixture case)
Instead of estimating Pr(Z(i)=k | X,P,Q) (the origin of individual i is k), we estimate Pr(Z(i,j,l)=k | X,P,Q), the origin of a single allele copy of individual i at a single locus.

Results on admixture prediction: simulated data

Results: Thrush data
For each individual, q(i) is plotted as the distance to the opposite side of the triangle. The assignment is reliable, and there is evidence of admixture.

Population Structure
377 locations (loci) were sampled in 1000 people from 52 populations. 6 genetic clusters were obtained, which corresponded to 5 geographic regions (Rosenberg et al. Science 2002).
Regions: Oceania, Eurasia, East Asia, Africa, America

NJ versus Structure: thrush data
The objective function is different in standard clustering algorithms!

Population sub-structure: research problem
Systematically explore the effect of admixture. Can admixture be predicted for a locus, or for an individual?
The sampling approach may or may not be appropriate. Formulate as an optimization/learning problem:
  (w/out admixture) Assign individuals to sub-populations so as to maximize linkage equilibrium and Hardy-Weinberg equilibrium in each of the sub-populations.
  (w/ admixture) Assign (individuals, loci) to sub-populations.

Admixture mapping

Estimating Recombination Rates

Recombination in human chromosome 22 (Mb scale)
Dawson et al. Nature 2002
Q: Can we give a direct count of the number of recombination events?

Recombination hot-spots (fine scale)

Recombination rates (chimp/human)
Fine-scale recombination rates differ between chimp and human. The six hot-spots seen in human are not seen in chimp.

Combinatorial bounds for estimating the recombination rate
Recall that the expected #recombinations = ρ log n.
Procedure:
  Generate N random ARGs that result in the given sample.
  Compute the mean of the number of recombinations.
Alternatively, generate a summary statistic s from the population.
  For each ρ, generate many populations, and compute the mean and variance of s (this only needs to be done once).
  Use this to select the most likely ρ.
What is the correct summary statistic? Today, we talk about the minimum number of recombination events as a possible summary statistic. It is not the most natural, but it is the most interesting computationally.

The Infinite Sites Assumption & the 4 gamete condition
(Figure: a tree rooted at haplotype 00000000. A mutation at site 3 gives 00100000; further mutations at sites 8 and 5 give 00100001 and 00101000.)
Consider a history without recombination. No pair of sites shows all four gametes 00, 01, 10, 11.
A pair of sites with all 4 gametes implies a recombination event.
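The four-gamete condition is easy to check mechanically. The sketch below assumes haplotypes are given as equal-length strings of '0' and '1'; the function names are illustrative and the example rows are the haplotypes that appear in the slide's figure.

```python
from itertools import combinations

def has_four_gametes(matrix, i, j):
    """True if columns i and j together show all of 00, 01, 10 and 11."""
    return len({(row[i], row[j]) for row in matrix}) == 4

def incompatible_pairs(matrix):
    """All site pairs (i, j), i < j, that fail the four-gamete condition."""
    n_sites = len(matrix[0])
    return [(i, j) for i, j in combinations(range(n_sites), 2)
            if has_four_gametes(matrix, i, j)]

# Haplotypes produced by mutations only (sites 3, 8 and 5): no conflicts.
tree_like = ["00000000", "00100000", "00100001", "00101000"]
print(incompatible_pairs(tree_like))                 # []

# Two sites showing all four gametes: a recombination is implied.
print(incompatible_pairs(["00", "01", "10", "11"]))  # [(0, 1)]
```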

Hudson & Kaplan
Any pair of sites (i,j) containing all 4 gametes must admit a recombination event between i and j.
Disjoint (non-overlapping) intervals (i,j) must contain distinct recombination events, which can be summed!
This gives a lower bound on the number of recombination events.
Based on simulations, this bound is not tight.
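A minimal sketch of this bound: treat every pair of sites with four gametes as an interval, then pick as many non-overlapping intervals as possible with the standard greedy rule (earliest right endpoint first). Two intervals that share only an endpoint site are treated as disjoint here, on the assumption that recombination happens between sites. Haplotypes are assumed to be 0/1 strings and the function name is illustrative.

```python
from itertools import combinations

def hudson_kaplan_bound(matrix):
    """Maximum number of disjoint site intervals that each show four gametes."""
    n_sites = len(matrix[0])
    intervals = [(i, j) for i, j in combinations(range(n_sites), 2)
                 if len({(row[i], row[j]) for row in matrix}) == 4]
    intervals.sort(key=lambda ij: ij[1])  # earliest right endpoint first
    count, last_end = 0, 0
    for i, j in intervals:
        if i >= last_end:  # does not overlap the last chosen interval
            count += 1
            last_end = j
    return count

all_eight = ["000", "001", "010", "011", "100", "101", "110", "111"]
print(hudson_kaplan_bound(all_eight))  # 2
```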

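Myers and Griffiths'03: Idea 1
Let B(i,j) be a lower bound on the number of recombinations between sites i and j.
For a partition P of the sites, 1 = i1 < i2 < ... < ik = n, the sum R(P) = B(i1,i2) + B(i2,i3) + ... + B(ik-1,ik) is also a lower bound.
Can we compute max_P R(P) efficiently?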
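One way to answer the question is dynamic programming over the right endpoint. The sketch below assumes R(P) is the sum of B over consecutive breakpoints of the partition P, and that B is supplied as a function of two 0-based site indices. The names `composite_bound` and `four_gamete_indicator` are illustrative, and the four-gamete indicator is only one possible choice of local bound.

```python
def composite_bound(n_sites, local_bound):
    """max over partitions P of the sum of local_bound(i, j) on consecutive breakpoints."""
    if n_sites == 0:
        return 0
    best = [0] * n_sites  # best[j]: best total for sites 0..j
    for j in range(1, n_sites):
        best[j] = max(best[i] + local_bound(i, j) for i in range(j))
    return best[-1]

def four_gamete_indicator(matrix):
    """Local bound: 1 if some site pair inside [i, j] shows four gametes, else 0."""
    def bound(i, j):
        for a in range(i, j):
            for b in range(a + 1, j + 1):
                if len({(row[a], row[b]) for row in matrix}) == 4:
                    return 1
        return 0
    return bound

all_eight = ["000", "001", "010", "011", "100", "101", "110", "111"]
print(composite_bound(3, four_gamete_indicator(all_eight)))  # 2
```

With n sites the recurrence makes O(n^2) calls to the local bound, so the cost is dominated by how expensive B(i,j) is to evaluate.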

The Rm bound

Improved lower bounds
The Rm bound also gives a general technique for combining local lower bounds into an overall lower bound.
In the example, Rm=2, but we cannot give any ARG with 2 recombination events.
Can we improve upon Hudson and Kaplan to get better local lower bounds?
  0 0 0
  0 0 1
  0 1 0
  0 1 1
  1 0 0
  1 0 1
  1 1 0
  1 1 1

Myers and Griffiths: Idea 2
Consider the history of individuals. Let Ht denote the number of distinct haplotypes at time t.
One of three things might happen at time t:
  Mutation: Ht increases by at most 1
  Recombination: Ht increases by at most 1
  Coalescence: Ht does not increase

The RH bound
With H distinct haplotypes and S segregating sites, R >= H - S - 1.
Ex: R >= 8-3-1 = 4
  0 0 0
  0 0 1
  0 1 0
  0 1 1
  1 0 0
  1 0 1
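The arithmetic in the example can be reproduced with a few lines. This sketch assumes the bound has the form (distinct haplotypes) - (segregating sites) - 1, that haplotypes are 0/1 strings, and that the example refers to all eight haplotypes on three sites; the function name is illustrative.

```python
def haplotype_bound(matrix):
    """RH = (distinct rows) - (columns that are not constant) - 1."""
    distinct = set(matrix)
    n_sites = len(matrix[0])
    segregating = sum(1 for c in range(n_sites)
                      if len({row[c] for row in matrix}) > 1)
    return len(distinct) - segregating - 1

all_eight = ["000", "001", "010", "011", "100", "101", "110", "111"]
print(haplotype_bound(all_eight))  # 4
```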

RH bound
In general, RH can be quite weak: consider the case when S > H.
However, it can be improved:
  Partitioning idea: sum RH over disjoint intervals.
  Apply to any subset of columns (BB'05).
Ex: Apply RH to the yellow columns
  000000000000000
  000000000000001
  000000010000000
  000000010000001
  100000000000000
  100000000000001
  100000010000000
  111111111111111
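The effect of restricting RH to a subset of columns can be checked on the matrix above. The colour highlighting did not survive in the transcript, so the sketch assumes the highlighted columns are 1, 8 and 15 (0-based indices 0, 7, 14), the only columns that vary among the first seven rows. The exhaustive search over small subsets is one simple strategy for illustration, not necessarily the method of BB'05; function names are illustrative.

```python
from itertools import combinations

def rh_on_columns(matrix, cols):
    """RH restricted to the given columns; constant columns are ignored."""
    seg = [c for c in cols if len({row[c] for row in matrix}) > 1]
    distinct = {tuple(row[c] for c in seg) for row in matrix}
    return len(distinct) - len(seg) - 1

def best_subset_rh(matrix, max_size):
    """Largest RH over all column subsets of size 2..max_size (brute force)."""
    n_sites = len(matrix[0])
    best = 0
    for size in range(2, max_size + 1):
        for cols in combinations(range(n_sites), size):
            best = max(best, rh_on_columns(matrix, cols))
    return best

matrix = [
    "000000000000000", "000000000000001", "000000010000000", "000000010000001",
    "100000000000000", "100000000000001", "100000010000000", "111111111111111",
]
print(rh_on_columns(matrix, range(15)))    # -8: useless, since S > H
print(rh_on_columns(matrix, (0, 7, 14)))   # 4
print(best_subset_rh(matrix, 3))           # 4
```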

The Rs bound
Compute a lower bound on the minimum number of recombination events R in any ARG. Note that we do not explicitly construct the ARG.
Consider a matrix M with H rows and S columns. The rows correspond to haplotypes. The columns correspond to sites.

Rs bound: Observation 1
Non-informative column: If a site s contains at most one 1, or at most one 0, then in any history it can be obtained by adding a mutation to a branch.
Ex: if a is the only haplotype containing a 1 at s, the mutation can simply be added to the branch leading to a, without increasing the number of recombination events:
  R(M) = R(M-{s})

Rs bound: Observation 2
Redundant rows: If two rows h1 and h2 are identical, then
  R(M) = R(M-{h1})

Rs bound: Observation 3
Suppose M has no non-informative columns and no redundant rows. Then at least one of the haplotypes is a recombinant:
  there exists h s.t. R(M) = R(M-{h}) + 1
Which h should you choose?

Rs bound (procedural)
Procedure Compute_Rs(M)
  If M has at most one row
    return 0
  Else if ∃ non-informative column s
    return Compute_Rs(M-{s})
  Else if ∃ redundant row h
    return Compute_Rs(M-{h})
  Else
    return 1 + min_h Compute_Rs(M-{h})
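The procedure can be turned into a short memoised recursion. The sketch below is an illustration under assumptions: haplotypes are 0/1 strings, a matrix with at most one row costs 0, and the two reduction rules are applied repeatedly before branching. The function names are hypothetical, and the running time is exponential in the number of rows, so it is only suitable for small examples.

```python
from functools import lru_cache

def reduce_matrix(rows):
    """Drop redundant rows and non-informative columns until nothing changes."""
    rows = tuple(sorted(set(rows)))
    while True:
        n_cols = len(rows[0]) if rows else 0
        # Informative column: at least two 1s and at least two 0s.
        keep = [c for c in range(n_cols)
                if 2 <= sum(r[c] == "1" for r in rows) <= len(rows) - 2]
        if len(keep) == n_cols:
            return rows
        rows = tuple(sorted({"".join(r[c] for c in keep) for r in rows}))

@lru_cache(maxsize=None)
def _rs(rows):
    rows = reduce_matrix(rows)
    if len(rows) <= 1:
        return 0
    # No reduction applies: some haplotype is treated as a recombinant.
    return 1 + min(_rs(rows[:k] + rows[k + 1:]) for k in range(len(rows)))

def compute_rs(matrix):
    return _rs(tuple(matrix))

print(compute_rs(["00", "01", "10", "11"]))  # 1
print(compute_rs(["000", "001", "010", "011", "100", "101", "110", "111"]))  # 4
```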

Results

Additional results/problems
Using dynamic programming, Rs can be computed in 2^n poly(mn) time. Also, Rs can be augmented to handle intermediates.
Are there poly. time lower bounds?
  The number of connected components in the conflict graph is a lower bound (BB'04).
Fast algorithms for computing ARGs with minimum recombination:
  Poly. time to get an ARG with 0 recombinations
  Poly. time to get ARGs that are galled trees (Gusfield'03)
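The conflict-graph bound can be computed with a union-find pass over the sites. The sketch assumes the conflict graph has one node per site and an edge between two sites that show all four gametes, and it counts only components containing at least one edge (an isolated site cannot force a recombination). Haplotypes are assumed to be 0/1 strings; the function name is illustrative.

```python
from itertools import combinations

def conflict_components(matrix):
    """Number of connected components of the conflict graph that contain an edge."""
    n_sites = len(matrix[0])
    parent = list(range(n_sites))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    in_conflict = set()
    for i, j in combinations(range(n_sites), 2):
        if len({(row[i], row[j]) for row in matrix}) == 4:
            parent[find(i)] = find(j)
            in_conflict.update((i, j))
    return len({find(s) for s in in_conflict})

# Sites 1-2 conflict and sites 3-4 conflict, with no conflict across the two pairs.
two_blocks = ["0000", "0100", "1000", "1100", "1101", "1110", "1111"]
print(conflict_components(two_blocks))  # 2
```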

Underperforming lower bounds
Sometimes, Rs can be quite weak. An RI lower bound that uses intermediates can help (BB'05).

LPL data set
71 individuals, 9.7 Kbp genomic sequence
Rm=22, Rh=70