CSE280b: Population Genetics
Vineet Bafna/Pavel Pevzner
www.cse.ucsd.edu/classes/sp05/cse291
March 2006


Simulating population data
– Generate a coalescent (topology + branch lengths).
– For each branch, drop mutations at the mutation rate, in proportion to the branch length.
– Generate sequence data.
Note that the resulting sequences form a perfect phylogeny. Given such sequence data, can you reconstruct the coalescent tree? (Only the topology, not the branch lengths.)
Also note that all pairs of positions are correlated (they should have high LD).
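The simulation steps on this slide can be sketched in Python as follows. This is a minimal sketch, not the course's code: the function names, the parameter `theta` (with `theta / 2` taken as the per-lineage mutation rate when time is measured in units of 2N generations), and the infinite-sites assumption that every mutation creates a new column are all assumptions made for illustration.

```python
import random


def poisson(rng, mean):
    """Draw a Poisson(mean) count by counting unit-rate arrivals before `mean`."""
    count, total = 0, rng.expovariate(1.0)
    while total < mean:
        count += 1
        total += rng.expovariate(1.0)
    return count


def simulate_coalescent_snps(n, theta, seed=None):
    """Return an n-row 0/1 SNP matrix simulated on a random coalescent tree."""
    rng = random.Random(seed)
    # Each lineage is identified by the set of sampled individuals below it.
    lineages = [frozenset([i]) for i in range(n)]
    columns = []
    while len(lineages) > 1:
        k = len(lineages)
        # Waiting time to the next coalescence among k lineages.
        wait = rng.expovariate(k * (k - 1) / 2)
        # Drop mutations on every current branch segment of length `wait`.
        for leaves in lineages:
            for _ in range(poisson(rng, theta / 2 * wait)):
                columns.append([1 if i in leaves else 0 for i in range(n)])
        # Merge a uniformly chosen pair of lineages.
        a, b = rng.sample(range(k), 2)
        merged = lineages[a] | lineages[b]
        lineages = [s for idx, s in enumerate(lineages) if idx not in (a, b)]
        lineages.append(merged)
    # Transpose so that rows are individuals and columns are SNPs.
    return [[col[i] for col in columns] for i in range(n)]
```

Because every column is the set of leaves below one tree edge, any two columns are either nested or disjoint, which is the perfect phylogeny property mentioned on the slide.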

Coalescent with Recombination
An individual may have one parent or two parents.

ARG: Coalescent with recombination
Given: mutation rate, recombination rate, population size 2N (diploid), sample size n.
– How can you generate the ARG (topology + branch lengths) efficiently?
– How will you generate sequences for n individuals?
– Given sequence data, can you reconstruct the ARG (topology)?

Recombination
Define r as the probability of recombining per generation. Assume k individuals in a generation. The following might happen:
1. An individual arises because of a recombination event between two individuals (it will have 2 parents).
2. Two individuals coalesce.
3. Neither (each individual has a distinct parent).
4. Multiple events (low probability).

Recombination
We ignore the case of multiple (> 1) events in one generation.
– Pr(no recombination) = 1 - kr
– Pr(no coalescence) = 1 - k(k-1)/(4N), i.e. 1 minus the number of pairs, k(k-1)/2, divided by 2N
Consider scaled time in units of 2N generations. Going back in time, the number of individuals increases with rate 2Nrk and decreases with rate k(k-1)/2.
The value 2rN is usually small, and therefore the process will ultimately coalesce to a single individual (MRCA).

ARG
Let k = n. With time scaled in units of 2N generations, recombinations occur at rate 2Nrk and coalescences at rate k(k-1)/2. Iterate until k = 1:
– Choose the time to the next event from an exponential distribution with rate 2Nrk + k(k-1)/2.
– Pick the event as a recombination with probability 2Nrk / (2Nrk + k(k-1)/2).
– If the event is a recombination, choose an individual to recombine and a position; else choose a pair to coalesce.
– Update k, and continue.
What is the flaw in this procedure?
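The iteration on this slide can be sketched in Python as below. It is only a sketch under stated assumptions: `scaled_recomb` stands for 2Nr, the function name and the `max_events` safety cap are hypothetical, and only the sequence of events and lineage counts is tracked (not which individuals or positions are involved). It follows the slide's procedure literally, so whatever shortcoming the slide's closing question points to is present here too.

```python
import random


def simulate_arg_events(n, scaled_recomb, seed=None, max_events=100000):
    """Return a list of (time, event, lineages_after) tuples, back in time."""
    rng = random.Random(seed)
    k, t, events = n, 0.0, []
    while k > 1 and len(events) < max_events:
        coal_rate = k * (k - 1) / 2      # pairs of lineages that can coalesce
        rec_rate = k * scaled_recomb     # each lineage recombines at rate 2Nr
        total = coal_rate + rec_rate
        t += rng.expovariate(total)      # waiting time to the next event
        if rng.random() < rec_rate / total:
            k += 1                       # one lineage splits into two parents
            events.append((t, "recombination", k))
        else:
            k -= 1                       # two lineages merge into one
            events.append((t, "coalescence", k))
    return events


if __name__ == "__main__":
    for event in simulate_arg_events(5, 0.5, seed=2):
        print(event)
```

With `scaled_recomb = 0` the loop reduces to the ordinary coalescent and produces exactly n - 1 coalescence events.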

Ancestral Recombination Graph

Simulating sequences on the ARG
– Generate topology and branch lengths as before.
– For each recombination, generate a position.
– Next, generate mutations at random along the branches; for each mutation, select a position as well.
– Generate sequence data.
A program called ms (Hudson) is a commonly used coalescent simulator.

Coalescent theory applications
Coalescent simulations allow us to test various hypotheses. The coalescent/ARG is usually not inferred, unlike in phylogenies.

Coalescent theory: example
Ex: ~1400 bp at the Sod locus in Drosophila.
– 10 taxa.
– 5 were identical. The other 5 had 55 mutations.
– Q: Is this a chance event, or is there selection for this haplotype?

Coalescent application
– Coalescent simulations were performed on 10 taxa.
– 55 mutations were placed on the coalescent branches.
– Count the number of times 5 lineages are identical.
– The event happened in 1.1% of the cases.
– Conclusion: selection, or some other mechanism, explains this data.

Coalescent example: Out of Africa hypothesis
– Looking at lineage-specific mutations might help discard the candelabra model. How?
– How do we decide between the multi-regional and Out-of-Africa models?
– How do we decide if the ancestor was African?

Human Samples
We look at data from human samples (Gabriel et al., Science 2002).
– 3 populations were sampled at multiple regions spanning the genome.
– 54 regions (average size 250 kb)
– SNP density: 1 per 2 kb
– 90 individuals from Nigeria (Yoruban)
– 93 Europeans
– 42 Asians
– 50 African Americans

Population specific recombination
D’ was used as the measure of LD between SNP pairs. SNP pairs were classified into one of the following:
– Strong LD
– Strong evidence for recombination
– Others (13% of cases)
This roughly favors Out-of-Africa. A coalescent simulation can help give confidence values on this.
(Gabriel et al., Science 2002)
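As a reminder of how the pairwise measure is computed, here is a small Python sketch of |D’| for two biallelic SNPs given as 0/1 columns over the same phased chromosomes. The function name and input layout are assumptions for illustration; the thresholds used to call a pair "strong LD" or "strong evidence for recombination" are not given on the slide, so none are assumed here.

```python
def d_prime(snp_a, snp_b):
    """Return |D'| for two equal-length lists of 0/1 alleles (phased haplotypes)."""
    n = len(snp_a)
    p_a = sum(snp_a) / n                      # frequency of allele 1 at SNP a
    p_b = sum(snp_b) / n                      # frequency of allele 1 at SNP b
    p_ab = sum(1 for a, b in zip(snp_a, snp_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # raw LD coefficient D
    # D is normalised by the largest magnitude it could take at these frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:                            # a monomorphic SNP carries no LD signal
        return 0.0
    return abs(d) / d_max
```

|D’| equals 1 whenever at most three of the four possible haplotypes are present, which is why values below 1 are read as evidence of historical recombination between the two sites.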

Haplotype Blocks
A haplotype block is a region of low recombination.
– Define a region as a block if less than 5% of the pairs show strong evidence for recombination.
Much of the genome is in blocks. The distribution of block sizes varies across populations.

Testing Out-of-Africa
Generate simulations with and without migration. Check the size of haplotype blocks.
– Does it vary when migrations are allowed?
– When the ‘new’ population has a bottleneck?
If there was a bottleneck that created European and Asian populations, can we say anything about the frequency of alleles that are ‘African specific’?
– Should they be high frequency, or low frequency, in African populations?

Haplotype Block: implications
The genome is mostly partitioned into haplotype blocks. Within a block, there is extensive LD.
– Is this good, or bad, for association mapping?

Coalescent reconstruction
Reconstructing likely coalescents

Re-constructing history in the absence of recombination

An algorithm for constructing a perfect phylogeny
We will consider the case where 0 is the ancestral state and 1 is the mutated state. This restriction is lifted later (unrooted case).
In any tree, each node (except the root) has a single parent.
– It is sufficient to construct a parent for every node.
In each step, we add a column and refine some of the nodes containing multiple children. Stop when all columns have been considered.

Inclusion Property
For any pair of columns i, j:
– i < j if and only if i_1 ⊇ j_1, where i_1 is the set of rows with a 1 in column i.
Note that if i < j, then the edge containing i is an ancestor of the edge containing j.

Example
Initially, there is a single clade r, and each node (A, B, C, D, E) has r as its parent.

Sort columns
Sort columns according to the inclusion property (note that the columns are already sorted here). This can be achieved by considering the columns as binary representations of numbers (most significant bit in row 1) and sorting in decreasing order.
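The sorting step can be sketched in Python as follows. The matrix layout (a list of rows of 0/1 values) and the function name are assumptions; the slide's own example matrix did not survive, so any matrix used with this sketch is hypothetical. Comparing columns as tuples read from row 1 downwards is equivalent to comparing them as binary numbers with the most significant bit in row 1.

```python
def sort_columns(matrix):
    """Return column indices ordered by decreasing binary value of the column."""
    n_cols = len(matrix[0])
    # Column c read top to bottom; tuple comparison is lexicographic,
    # which matches binary-number comparison with row 1 most significant.
    columns = [tuple(row[c] for row in matrix) for c in range(n_cols)]
    return sorted(range(n_cols), key=lambda c: columns[c], reverse=True)


if __name__ == "__main__":
    example = [
        [1, 1, 0],
        [0, 1, 0],
        [0, 1, 1],
    ]
    print(sort_columns(example))
```

A column whose set of 1s strictly contains another's always has the larger binary value, so this order places ancestors before descendants.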

Add first column
In adding column i:
– Check each edge and decide which side you belong to.
– Finally, add a node if you can resolve a clade.
[Figure: tree with root r, new internal node u, and leaves A, B, C, D, E]

Adding other columns
Add other columns on edges using the ordering property.
[Figure: resulting tree with root r and leaves A, B, C, D, E]
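One way to realise the column-by-column construction is sketched below in Python. It is a simple quadratic-time illustration, not the near-linear algorithm from the literature, and the names (`build_perfect_phylogeny`, the root label "r", the parent dictionary) are assumptions. Each column stands for the tree edge carrying that mutation, and the function records, for every column, the column on the edge directly above it.

```python
def build_perfect_phylogeny(matrix):
    """Return {column: parent column or "r"}, or None if no rooted
    perfect phylogeny (with 0 ancestral) exists for the 0/1 matrix."""
    n_cols = len(matrix[0])
    columns = {c: tuple(row[c] for row in matrix) for c in range(n_cols)}
    # Decreasing binary value, so a superset column precedes its subsets.
    order = sorted(columns, key=lambda c: columns[c], reverse=True)
    ones = {c: {r for r, bit in enumerate(columns[c]) if bit == 1} for c in order}
    parent = {}
    for pos, c in enumerate(order):
        parent[c] = "r"                       # default: hangs off the root
        for earlier in order[:pos]:
            if ones[c] & ones[earlier]:
                if not ones[c] <= ones[earlier]:
                    return None               # overlapping but not nested: conflict
                parent[c] = earlier           # last hit is the smallest superset
    return parent


if __name__ == "__main__":
    rows = [            # hypothetical taxa A..E, four characters
        [1, 1, 0, 0],
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 1, 1],
        [0, 0, 1, 0],
    ]
    print(build_perfect_phylogeny(rows))
```

Each taxon then attaches below the deepest column in which it carries a 1, or directly to the root if it carries none.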

Unrooted case
The important point is that the perfect phylogeny condition does not change when you interchange 1s and 0s at a column.
– Switch the values in each column, so that 0 is the majority element.
– Apply the algorithm for the rooted case.
Homework: show that this is a correct algorithm.
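The column-switching step is short enough to sketch directly in Python. The function name and the tie rule (a column with exactly half 1s is left unchanged) are assumptions; the slide does not say how ties are handled.

```python
def root_by_majority(matrix):
    """Return a copy of the 0/1 matrix in which every column has 0 as its
    majority value; columns with more 1s than 0s are complemented."""
    n_rows = len(matrix)
    flipped = [list(row) for row in matrix]
    for c in range(len(matrix[0])):
        ones = sum(row[c] for row in matrix)
        if 2 * ones > n_rows:                 # 1 is the strict majority: swap labels
            for row in flipped:
                row[c] = 1 - row[c]
    return flipped
```

The output can then be passed to any rooted perfect phylogeny routine that treats 0 as the ancestral state.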

Population Sub-structure

Population sub-structure can increase LD
Consider two populations that were isolated and evolving independently. They might have different allele frequencies in some regions. Pick two regions that are far apart (LD is very low, close to 0).
Pop. A: p_1 = 0.1, q_1 = 0.9, P_11 = 0.1, D = 0.01
Pop. B: p_1 = 0.9, q_1 = 0.1, P_11 = 0.1, D = 0.01

Recent ad-mixing of population
If the populations came together recently (Ex: African and European population), artificial LD might be created.
Pop. A+B: p_1 = 0.5, q_1 = 0.5, P_11 = 0.1, D = 0.1 - 0.25 = -0.15
– |D| = 0.15 (instead of 0.01), a 15-fold increase.
– This spurious LD might lead to false associations.
– Other genetic events can cause LD to arise, and one needs to be careful.
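The arithmetic behind this example can be checked with a few lines of Python. The sketch assumes D = P_11 - p_1 q_1 (which reproduces D = 0.01 for each isolated population) and an equal-proportion mix of the two populations; the function names are hypothetical.

```python
def ld_coefficient(p1, q1, p11):
    """D = P_11 - p_1 * q_1 for allele frequencies p_1, q_1 and haplotype frequency P_11."""
    return p11 - p1 * q1


def mix(pop_a, pop_b, weight_a=0.5):
    """Frequencies in a population formed by pooling pop_a and pop_b."""
    return tuple(weight_a * a + (1 - weight_a) * b for a, b in zip(pop_a, pop_b))


pop_a = (0.1, 0.9, 0.1)   # (p_1, q_1, P_11) in population A
pop_b = (0.9, 0.1, 0.1)   # (p_1, q_1, P_11) in population B

if __name__ == "__main__":
    print("D in A:  ", ld_coefficient(*pop_a))
    print("D in B:  ", ld_coefficient(*pop_b))
    print("D in A+B:", ld_coefficient(*mix(pop_a, pop_b)))
```

Pooling leaves P_11 at 0.1 but moves both allele frequencies to 0.5, so the product p_1 q_1 jumps from 0.09 to 0.25 and the magnitude of D grows accordingly.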

Determining population sub-structure
Given a mix of people, can you sub-divide them into ethnic populations?
Turn the ‘problem’ of spurious LD into a clue.
– Find markers that are too far apart to show LD.
– If they do show LD (correlation), that shows the existence of multiple populations.
– Sub-divide them into populations so that LD disappears.

Determining Population sub-structure
Same example as before: the two markers are too far apart to show any LD, yet they do show LD.
However, if you split the individuals so that all 0..1 haplotypes are in one population and all 1..0 haplotypes are in another, LD disappears.

Iterative algorithm for population sub-structure
Define:
– N = number of individuals (each has a single chromosome)
– k = number of sub-populations
– Z ∈ {1..k}^N is a vector giving the sub-population of each individual; Z_i = k’ means individual i is assigned to population k’
– X_{i,j} = allelic value for individual i in position j
– P_{k,j,l} = frequency of allele l at position j in population k

Example
Ex: consider the following assignment
P_{1,1,0} = 0.9
P_{2,1,0} =

Goal
X is known. P, Z are unknown. The goal is to estimate Pr(P,Z|X). Various learning techniques can be employed:
– max_{P,Z} Pr(X|P,Z) (maximum likelihood estimate)
– max_{P,Z} Pr(X|P,Z) Pr(P,Z) (MAP)
– Sample P, Z from Pr(P,Z|X)
Here a Bayesian (MCMC) scheme is employed to sample from Pr(P,Z|X). We will only consider a simplified version.

Algorithm: Structure
Iteratively estimate
– (Z^(0), P^(0)), (Z^(1), P^(1)), .., (Z^(m), P^(m))
After ‘convergence’, Z^(m) is the answer.
Iteration:
– Guess Z^(0)
– For m = 1, 2, ..
   Sample P^(m) from Pr(P | X, Z^(m-1))
   Sample Z^(m) from Pr(Z | X, P^(m))
How is this sampling done?

Example
Choose Z at random, so each individual is assigned to one of 2 populations. See example.
Now, we need to sample P^(1) from Pr(P | X, Z^(0)). Simply count:
– N_{k,j,l} = number of people in population k which have allele l in position j
– p_{k,j,l} = N_{k,j,l} / N_{k,j,*}

Example
– N_{k,j,l} = number of people in population k which have allele l in position j
– p_{k,j,l} = N_{k,j,l} / N_{k,j,*}
N_{1,1,0} = 4, N_{1,1,1} = 6
p_{1,1,0} = 4/10, p_{1,2,0} = 4/10
Thus, we can sample P^(m).

Sampling Z
Pr[Z_1 = 1] = Pr["01" belongs to population 1]?
We know that each position should be in linkage equilibrium and independent (assuming HWE and LE).
Pr["01" | Population 1] = p_{1,1,0} * p_{1,2,1} = (4/10)*(6/10) = 0.24
Pr["01" | Population 2] = p_{2,1,0} * p_{2,2,1} = (6/10)*(4/10) = 0.24
Pr[Z_1 = 1] = 0.24/(0.24 + 0.24) = 0.5

Sampling
Suppose, during the iteration, there is a bias. Then, in the next step of sampling Z, we will do the right thing:
Pr["01" | pop. 1] = p_{1,1,0} * p_{1,2,1} = 0.7*0.7 = 0.49
Pr["01" | pop. 2] = p_{2,1,0} * p_{2,2,1} = 0.3*0.3 = 0.09
Pr[Z_1 = 1] = 0.49/(0.49 + 0.09) ≈ 0.85
Pr[Z_6 = 1] = 0.49/(0.49 + 0.09) ≈ 0.85
Eventually all "01" will become one population, and all "10" will become a second population.
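The simplified two-step iteration (count allele frequencies given Z, then resample Z given the frequencies) can be sketched in Python as below. This is an illustrative sketch of the simplified scheme on these slides, not the published Structure program: the function names are hypothetical, alleles are assumed to be 0/1, populations are indexed from 0, and a pseudocount is added to the counts (an assumption not on the slides) so that no frequency is ever exactly 0 or 1.

```python
import random


def allele_freqs(X, Z, k, pseudo=1.0):
    """P[pop][j] = estimated frequency of allele 1 at locus j in population pop."""
    P = []
    for pop in range(k):
        members = [x for x, z in zip(X, Z) if z == pop]
        P.append([
            (sum(x[j] for x in members) + pseudo) / (len(members) + 2 * pseudo)
            for j in range(len(X[0]))
        ])
    return P


def membership_probs(x, P):
    """Pr[Z_i = pop | x, P], treating loci as independent (linkage equilibrium)."""
    likes = []
    for freqs in P:
        like = 1.0
        for allele, f1 in zip(x, freqs):
            like *= f1 if allele == 1 else 1.0 - f1
        likes.append(like)
    total = sum(likes)
    return [like / total for like in likes]


def structure_sketch(X, k, iterations=50, seed=None):
    """Alternate between estimating P from Z and sampling Z from Pr(Z | X, P)."""
    rng = random.Random(seed)
    Z = [rng.randrange(k) for _ in X]          # random initial guess Z^(0)
    for _ in range(iterations):
        P = allele_freqs(X, Z, k)
        Z = [rng.choices(range(k), weights=membership_probs(x, P))[0] for x in X]
    return Z
```

For the biased case on this slide, `membership_probs((0, 1), [[0.3, 0.7], [0.7, 0.3]])` reproduces the ratio 0.49/(0.49 + 0.09) for the first population.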

Allowing for admixture
Define q_{i,k} as the fraction of individual i that originated from population k.
Iteration:
– Guess Z^(0)
– For m = 1, 2, ..
   Sample P^(m), Q^(m) from Pr(P,Q | X, Z^(m-1))
   Sample Z^(m) from Pr(Z | X, P^(m), Q^(m))

Estimating Z (admixture case)
Instead of estimating Pr(Z(i) = k | X,P,Q) (the origin of individual i is k), we estimate Pr(Z(i,j,l) = k | X,P,Q).
[Figure: the two chromosome copies (i,1) and (i,2) of individual i at position j]

Results on admixture prediction

Results: Thrush data
For each individual, q(i) is plotted as the distance to the opposite side of the triangle. The assignment is reliable, and there is evidence of admixture.

Population Structure
377 locations (loci) were sampled in 1000 people from 52 populations. 6 genetic clusters were obtained, which corresponded to 5 geographic regions (Rosenberg et al. Science 2003).
[Figure: clusters by region: Africa, Eurasia, East Asia, America, Oceania]

Population sub-structure: research problem
– Systematically explore the effect of admixture. Can admixture be predicted for a locus, or for an individual?
– The sampling approach may or may not be appropriate. Formulate as an optimization/learning problem:
   (w/out admixture) Assign individuals to sub-populations so as to maximize linkage equilibrium and Hardy-Weinberg equilibrium in each of the sub-populations.
   (w/ admixture) Assign (individuals, loci) to sub-populations.