CSE280b: Population Genetics
Vineet Bafna/Pavel Pevzner
March 2006
Simulating population data
Generate a coalescent (topology + branch lengths).
On each branch, drop mutations at a fixed rate per unit of branch length.
Generate the sequence data.
Note that the resulting sequences form a perfect phylogeny.
Given such sequence data, can you reconstruct the coalescent tree? (Only the topology, not the branch lengths.)
Also note that all pairs of positions are correlated (they should show high LD).
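The three steps above (coalescent, mutations, sequences) can be sketched as follows. This is a minimal illustration, not the course's own code: the function names, the `rate` parameter (the mutation rate per unit of scaled time), and the infinite-sites assumption (every mutation creates a new column) are assumptions made for the example.

```python
import random

def simulate_coalescent(n, rng):
    """Return one (leaf_set, branch_length) pair per branch of a random coalescent tree."""
    # Each active lineage is (set of sampled leaves below it, time at which it appeared).
    active = [(frozenset([i]), 0.0) for i in range(n)]
    branches = []
    t = 0.0
    while len(active) > 1:
        k = len(active)
        # Waiting time to the next coalescence, in scaled time: Exp(k choose 2).
        t += rng.expovariate(k * (k - 1) / 2)
        a, b = rng.sample(range(k), 2)
        (la, ta), (lb, tb) = active[a], active[b]
        branches.append((la, t - ta))
        branches.append((lb, t - tb))
        active = [x for i, x in enumerate(active) if i not in (a, b)]
        active.append((la | lb, t))
    return branches

def drop_mutations(branches, n, rate, rng):
    """Drop mutations along each branch as a Poisson process; return an n-row 0/1 matrix."""
    columns = []
    for leaves, length in branches:
        # Exponential gaps along the branch give a Poisson number of mutations.
        pos = rng.expovariate(rate)
        while pos < length:
            # Assumed infinite-sites model: each mutation is a new column,
            # carried by exactly the leaves below this branch.
            columns.append([1 if i in leaves else 0 for i in range(n)])
            pos += rng.expovariate(rate)
    # Transpose so that rows are individuals and columns are sites.
    return [[col[i] for col in columns] for i in range(n)]
```

Because every column is the leaf set of some branch of a single tree, any two columns are either nested or disjoint, which is why the output is a perfect phylogeny.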
Coalescent with Recombination
An individual may have one parent or two parents.
ARG: Coalescent with recombination
Given: mutation rate, recombination rate, population size 2N (diploid), sample size n.
How can you generate the ARG (topology + branch lengths) efficiently?
How will you generate sequences for n individuals?
Given sequence data, can you reconstruct the ARG (topology)?
Recombination
Define r as the probability of recombining per generation.
Assume k individuals in a generation. The following might happen:
1. An individual arises because of a recombination event between two individuals (it will have 2 parents).
2. Two individuals coalesce.
3. Neither (each individual has a distinct parent).
4. Multiple events (low probability).
Recombination
We ignore the case of multiple (> 1) events in one generation.
Pr(no recombination) = $1 - kr$
Pr(no coalescence) $\approx 1 - \binom{k}{2}\frac{1}{2N}$
Consider scaled time in units of 2N generations. Thus the number of individuals increases with rate $2rNk$ and decreases with rate $\binom{k}{2}$.
The value 2rN is usually small, and the coalescence rate grows quadratically in k while the recombination rate grows only linearly. Therefore the process will ultimately coalesce to a single individual (the MRCA).
ARG
Let k = n.
Iterate until k = 1:
  – Choose a time from an exponential distribution with rate $2rNk + \binom{k}{2}$.
  – Pick the event as recombination with probability $\frac{2rNk}{2rNk + \binom{k}{2}}$.
  – If the event is a recombination, choose an individual to recombine and a position; else choose a pair to coalesce.
  – Update k, and continue.
What is the flaw in this procedure?
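The iteration above can be sketched directly. The function name and the parameter `rho` (standing for the scaled recombination rate 2rN) are assumptions for the example. The sketch mirrors the slide's procedure literally, so it inherits whatever flaw the slide asks about.

```python
import random

def simulate_arg_events(n, rho, rng):
    """Simulate the sequence of ARG events; rho plays the role of 2rN."""
    k, t, events = n, 0.0, []
    while k > 1:
        coal = k * (k - 1) / 2   # coalescence rate, k choose 2
        rec = k * rho            # recombination rate, 2rN * k
        # Waiting time to the next event of either kind.
        t += rng.expovariate(coal + rec)
        if rng.random() < rec / (coal + rec):
            # Recombination: one lineage gets two parents, split at a random position.
            lineage = rng.randrange(k)
            position = rng.random()
            events.append((t, "recombination", lineage, position))
            k += 1
        else:
            # Coalescence: a random pair of lineages merges.
            pair = tuple(sorted(rng.sample(range(k), 2)))
            events.append((t, "coalescence", pair, None))
            k -= 1
    return events
```

A call such as `simulate_arg_events(5, 0.1, random.Random(0))` returns a time-ordered event list; the number of coalescences always exceeds the number of recombinations by exactly n - 1.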
Ancestral Recombination Graph
Simulating sequences on the ARG
Generate topology and branch lengths as before.
For each recombination, generate a position.
Next, generate mutations at random on the branches.
  – For each mutation, select a position as well.
Generate the sequence data.
  – The program ms (Hudson) is a commonly used coalescent simulator.
Coalescent theory applications
Coalescent simulations allow us to test various hypotheses.
The coalescent/ARG is usually not inferred, unlike in phylogenies.
Coalescent theory: example
Ex: ~1400 bp at the Sod locus in Drosophila.
  – 10 taxa
  – 5 were identical. The other 5 had 55 mutations.
  – Q: Is this a chance event, or is there selection for this haplotype?
Coalescent application
  – Coalescent simulations were performed on 10 taxa.
  – 55 mutations were placed on the coalescent branches.
  – Count the number of times 5 lineages are identical.
  – The event happened in 1.1% of the cases.
  – Conclusion: selection, or some other mechanism, explains this data.
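A simulation test of this kind can be sketched as below. This is an illustrative reading of the slide, not the original analysis: the function names are hypothetical, the model is the neutral coalescent without recombination, mutations are placed on branches in proportion to branch length, and "5 lineages are identical" is interpreted as "some group of at least 5 samples carries identical haplotypes". The fraction it reports should not be expected to reproduce the 1.1% quoted above.

```python
import random
from collections import Counter

def largest_identical_group(n, n_mut, rng):
    """One replicate: size of the largest set of samples with identical haplotypes."""
    active = [(frozenset([i]), 0.0) for i in range(n)]
    branches, t = [], 0.0
    while len(active) > 1:
        k = len(active)
        t += rng.expovariate(k * (k - 1) / 2)
        a, b = sorted(rng.sample(range(k), 2))
        lb, tb = active.pop(b)   # pop the larger index first so index a stays valid
        la, ta = active.pop(a)
        branches += [(la, t - ta), (lb, t - tb)]
        active.append((la | lb, t))
    # Condition on the total number of mutations: each one lands on a branch
    # with probability proportional to that branch's length.
    hit = rng.choices(branches, weights=[length for _, length in branches], k=n_mut)
    haplotypes = [[] for _ in range(n)]
    for site, (leaves, _) in enumerate(hit):
        for i in leaves:
            haplotypes[i].append(site)
    counts = Counter(tuple(h) for h in haplotypes)
    return max(counts.values())

def fraction_with_identical_group(n=10, n_mut=55, group=5, reps=2000, seed=0):
    """Fraction of replicates in which at least `group` samples are identical."""
    rng = random.Random(seed)
    hits = sum(largest_identical_group(n, n_mut, rng) >= group for _ in range(reps))
    return hits / reps
```

The returned fraction plays the role of an empirical p-value: a small value suggests that the observed pattern is unlikely under neutrality.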
Coalescent example: Out of Africa hypothesis
Looking at lineage-specific mutations might help discard the candelabra model. How?
How do we decide between the multi-regional and Out-of-Africa models?
How do we decide if the ancestor was African?
Human Samples
We look at data from human samples (Gabriel et al., Science).
  – 3 populations were sampled at multiple regions spanning the genome.
54 regions (average size 250 kb)
SNP density: about 1 SNP per 2 kb
Samples:
  – 90 individuals from Nigeria (Yoruba)
  – 93 Europeans
  – 42 Asians
  – 50 African Americans
Population specific recombination
D′ was used as the LD measure between SNP pairs.
SNP pairs were classified into one of the following:
  – Strong LD
  – Strong evidence for recombination
  – Others (13% of cases)
This roughly favors the Out-of-Africa hypothesis.
A coalescent simulation can help give confidence values on this.
(Gabriel et al., Science 2002)
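For reference, the point estimate of D′ for one SNP pair can be computed as in the sketch below. The function name and the 0/1 haplotype encoding are assumptions for the example. Gabriel et al. classify pairs using confidence bounds on D′, which this sketch does not compute.

```python
def d_prime(haplotypes):
    """Return (D, |D'|) for two biallelic SNPs, given (a, b) allele pairs coded 0/1."""
    n = len(haplotypes)
    p = sum(a for a, _ in haplotypes) / n    # frequency of allele 1 at the first SNP
    q = sum(b for _, b in haplotypes) / n    # frequency of allele 1 at the second SNP
    p11 = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p11 - p * q
    # D is normalized by its largest attainable magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p * (1 - q), (1 - p) * q)
    else:
        d_max = min(p * q, (1 - p) * (1 - q))
    return d, (abs(d) / d_max if d_max > 0 else 0.0)
```

When only three of the four possible haplotypes are present, |D′| equals 1, which is why values of |D′| below 1 are read as evidence of historical recombination between the two SNPs.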
Haplotype Blocks
A haplotype block is a region of low recombination.
  – Define a region as a block if fewer than 5% of the SNP pairs show strong evidence for recombination.
Much of the genome is in blocks.
The distribution of block sizes varies across populations.
Testing Out-of-Africa
Generate simulations with and without migration.
Check the size of haplotype blocks.
  – Does it vary when migrations are allowed?
  – When the 'new' population has a bottleneck?
If there was a bottleneck that created the European and Asian populations, can we say anything about the frequency of alleles that are 'African specific'?
  – Should they be high frequency or low frequency in African populations?
Haplotype Block: implications
The genome is mostly partitioned into haplotype blocks.
Within a block, there is extensive LD.
  – Is this good or bad for association mapping?
Coalescent reconstruction
Reconstructing likely coalescents
Reconstructing history in the absence of recombination
An algorithm for constructing a perfect phylogeny
We will consider the case where 0 is the ancestral state and 1 is the mutated state. This restriction will be lifted later.
In any tree, each node (except the root) has a single parent.
  – It is sufficient to construct a parent for every node.
In each step, we add a column and refine some of the nodes containing multiple children.
Stop when all columns have been considered.
Inclusion Property
For any column i, let $i_1$ denote the set of rows that have a 1 in column i.
For any pair of columns i, j:
  – $i < j$ if and only if $i_1 \supseteq j_1$
Note that if $i < j$, then the edge containing i is an ancestor of the edge containing j.
Example
Taxa: A, B, C, D, E; root r.
Initially, there is a single clade r, and each node has r as its parent.
Sort columns
Sort the columns according to the inclusion property (note that the columns are already sorted here).
This can be achieved by considering the columns as binary representations of numbers (most significant bit in row 1) and sorting in decreasing order.
Add first column
In adding column i:
  – Check each edge and decide on which side of it the column belongs.
  – Finally, add a node (u in the example) if the column resolves a clade.
Adding other columns
Add the other columns on edges using the ordering property.
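The column-by-column construction can be sketched as follows for the rooted case (0 ancestral). The function name, the node labels (`c<j>` for the node created by column j, `t<i>` for taxon i), and the consistency check (all carriers of a column must currently hang from the same node) are choices made for this illustration; the slides' own example matrix is not recoverable, so the usage example below uses an assumed matrix.

```python
def perfect_phylogeny(matrix):
    """Return {child: parent}, or None if the columns violate the inclusion property."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    # Sort columns as binary numbers (row 0 is the most significant bit), largest first.
    order = sorted(range(n_cols),
                   key=lambda j: [matrix[i][j] for i in range(n_rows)],
                   reverse=True)
    parent = {}
    last = ["root"] * n_rows   # deepest node so far on the root-to-taxon path
    for j in order:
        rows = [i for i in range(n_rows) if matrix[i][j] == 1]
        if not rows:
            continue           # a column with no carriers adds no edge
        anchors = {last[i] for i in rows}
        if len(anchors) != 1:
            return None        # carriers sit in different clades: no perfect phylogeny
        node = f"c{j}"
        parent[node] = anchors.pop()
        for i in rows:
            last[i] = node
    # Each taxon hangs below the last column node on its path.
    for i in range(n_rows):
        parent[f"t{i}"] = last[i]
    return parent

# Assumed example with taxa A..E as rows 0..4.
example = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 0, 1],  # C
    [0, 0, 1, 1, 0],  # D
    [1, 0, 0, 0, 0],  # E
]
tree = perfect_phylogeny(example)
```

Identical columns end up as consecutive nodes on the same path, which corresponds to several mutations on one edge.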
Unrooted case
The important point is that the perfect phylogeny condition does not change when you interchange 1s and 0s in a column.
Switch the values in each column so that 0 is the majority element.
Apply the algorithm for the rooted case.
Homework: show that this is a correct algorithm.
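The preprocessing step for the unrooted case is small enough to show in full. The function name is hypothetical, and leaving tied columns unchanged is an arbitrary choice made for this sketch.

```python
def to_majority_zero(matrix):
    """Flip every column in which 1 is the strict majority, so that 0 becomes the majority."""
    n_rows = len(matrix)
    flipped = [row[:] for row in matrix]   # work on a copy
    for j in range(len(matrix[0])):
        ones = sum(row[j] for row in matrix)
        if 2 * ones > n_rows:
            for row in flipped:
                row[j] = 1 - row[j]
    return flipped
```

The result can then be passed to any rooted perfect phylogeny routine that treats 0 as the ancestral state.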
Population Sub-structure
Population sub-structure can increase LD
Consider two populations that were isolated and evolving independently.
They might have different allele frequencies in some regions.
Pick two regions that are far apart (LD is very low, close to 0).
With $p_1$ and $q_1$ the allele frequencies at the two regions and $P_{11}$ the haplotype frequency, $D = P_{11} - p_1 q_1$:
Pop. A: $p_1 = 0.1$, $q_1 = 0.9$, $P_{11} = 0.1$, $D = 0.01$
Pop. B: $p_1 = 0.9$, $q_1 = 0.1$, $P_{11} = 0.1$, $D = 0.01$
Recent admixing of populations
If the populations came together recently (e.g. African and European populations), artificial LD might be created.
Pop. A+B: $p_1 = 0.5$, $q_1 = 0.5$, $P_{11} = 0.1$, $D = 0.1 - 0.25 = -0.15$
$|D| = 0.15$ (instead of 0.01), a 15-fold increase in magnitude.
This spurious LD might lead to false associations.
Other genetic events can cause LD to arise, and one needs to be careful.
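The arithmetic of this example can be checked with a few lines. The helper name is hypothetical, and the mixture is assumed to contain equal numbers of chromosomes from the two populations.

```python
def ld_coefficient(p1, q1, p11):
    """D = P11 - p1 * q1 for two biallelic loci."""
    return p11 - p1 * q1

d_a = ld_coefficient(0.1, 0.9, 0.1)   # population A
d_b = ld_coefficient(0.9, 0.1, 0.1)   # population B

# Equal-proportion mixture: allele and haplotype frequencies are simple averages.
p1_mix = (0.1 + 0.9) / 2
q1_mix = (0.9 + 0.1) / 2
p11_mix = (0.1 + 0.1) / 2
d_mix = ld_coefficient(p1_mix, q1_mix, p11_mix)
```

For an equal mixture, the same value follows from $D_{mix} = \frac{D_A + D_B}{2} + \frac{(p_A - p_B)(q_A - q_B)}{4}$: the second term depends only on the allele frequency differences between the populations, which is why admixture creates LD even between unlinked regions.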
Determining population sub-structure
Given a mix of people, can you sub-divide them into ethnic populations?
Turn the 'problem' of spurious LD into a clue.
  – Find markers that are too far apart to show LD.
  – If they do show LD (correlation), that shows the existence of multiple populations.
  – Sub-divide them into populations so that the LD disappears.
Determining population sub-structure
Same example as before: the two markers are too far apart to show any LD, yet they do show LD.
However, if you split the individuals so that all 0..1 are in one population and all 1..0 are in another, the LD disappears.
Iterative algorithm for population sub-structure
Define:
N = number of individuals (each has a single chromosome)
k = number of sub-populations
$Z \in \{1..k\}^N$ is a vector giving the sub-population of each individual.
  – $Z_i = k'$ means individual i is assigned to population k'
$X_{i,j}$ = allelic value for individual i in position j
$P_{k,j,l}$ = frequency of allele l at position j in population k
Example
Ex: consider the following assignment P 1,1,0 = 0.9 P 2,1,0 =
Goal
X is known. P and Z are unknown.
The goal is to estimate $\Pr(P, Z \mid X)$.
Various learning techniques can be employed:
  – $\max_{P,Z} \Pr(X \mid P, Z)$ (maximum likelihood estimate)
  – $\max_{P,Z} \Pr(X \mid P, Z) \Pr(P, Z)$ (MAP)
  – Sample P, Z from $\Pr(P, Z \mid X)$
Here a Bayesian (MCMC) scheme is employed to sample from $\Pr(P, Z \mid X)$. We will only consider a simplified version.
Algorithm: Structure
Iteratively estimate
  – $(Z^{(0)}, P^{(0)}), (Z^{(1)}, P^{(1)}), \ldots, (Z^{(m)}, P^{(m)})$
After 'convergence', $Z^{(m)}$ is the answer.
Iteration:
  – Guess $Z^{(0)}$
  – For m = 1, 2, ..
      Sample $P^{(m)}$ from $\Pr(P \mid X, Z^{(m-1)})$
      Sample $Z^{(m)}$ from $\Pr(Z \mid X, P^{(m)})$
How is this sampling done?
Example
Choose Z at random, so each individual is assigned to one of 2 populations. See example.
Now we need to sample $P^{(1)}$ from $\Pr(P \mid X, Z^{(0)})$.
Simply count:
$N_{k,j,l}$ = number of people in population k who have allele l in position j
$p_{k,j,l} = N_{k,j,l} / N_{k,j,*}$, where $N_{k,j,*} = \sum_l N_{k,j,l}$
Example
$N_{k,j,l}$ = number of people in population k who have allele l in position j
$p_{k,j,l} = N_{k,j,l} / N_{k,j,*}$
$N_{1,1,0} = 4$, $N_{1,1,1} = 6$
$p_{1,1,0} = 4/10$, $p_{1,2,0} = 4/10$
Thus, we can sample $P^{(m)}$.
Sampling Z
Pr[$Z_1 = 1$] = Pr["01" belongs to population 1]?
We know that each position should be in linkage equilibrium and independent (assuming HWE and LE).
Pr["01" | population 1] = $p_{1,1,0} \cdot p_{1,2,1} = (4/10)(6/10) = 0.24$
Pr["01" | population 2] = $p_{2,1,0} \cdot p_{2,2,1} = (6/10)(4/10) = 0.24$
Pr[$Z_1 = 1$] = $0.24 / (0.24 + 0.24) = 0.5$
Sampling
Suppose that, during the iteration, there is a bias. Then, in the next step of sampling Z, we will do the right thing:
Pr["01" | pop. 1] = $p_{1,1,0} \cdot p_{1,2,1} = 0.7 \cdot 0.7 = 0.49$
Pr["01" | pop. 2] = $p_{2,1,0} \cdot p_{2,2,1} = 0.3 \cdot 0.3 = 0.09$
Pr[$Z_1 = 1$] = $0.49 / (0.49 + 0.09) \approx 0.85$
Pr[$Z_6 = 1$] = $0.49 / (0.49 + 0.09) \approx 0.85$
Eventually all "01" will become one population, and all "10" will become a second population.
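The simplified iteration (no admixture) can be sketched as below. This is an illustration under stated assumptions, not the STRUCTURE program itself: the function name is hypothetical, P is set to the count-based frequencies rather than drawn from a posterior, and the `pseudo` pseudocount (an addition not in the slides) keeps frequencies away from 0 so that no population gets probability exactly zero.

```python
import random

def structure_sketch(X, k, iterations, rng, pseudo=0.5):
    """X: list of haplotypes (lists of 0/1). Returns population labels in 0..k-1."""
    n, m = len(X), len(X[0])
    Z = [rng.randrange(k) for _ in range(n)]   # guess Z(0) at random
    for _ in range(iterations):
        # Step 1: estimate P[pop][j][l] from allele counts within each population.
        P = []
        for pop in range(k):
            members = [X[i] for i in range(n) if Z[i] == pop]
            freqs = []
            for j in range(m):
                ones = sum(h[j] for h in members)
                p1 = (ones + pseudo) / (len(members) + 2 * pseudo)
                freqs.append((1 - p1, p1))
            P.append(freqs)
        # Step 2: sample each Z[i] in proportion to Pr(haplotype i | population),
        # assuming positions are independent within a population (linkage equilibrium).
        for i in range(n):
            weights = []
            for pop in range(k):
                w = 1.0
                for j in range(m):
                    w *= P[pop][j][X[i][j]]
                weights.append(w)
            Z[i] = rng.choices(range(k), weights=weights)[0]
    return Z

# Ten "01" and ten "10" haplotypes, as in the running example.
X = [[0, 1]] * 10 + [[1, 0]] * 10
Z = structure_sketch(X, 2, 50, random.Random(0))
```

On data like this, any chance imbalance in the initial assignment is amplified by step 2, which is the feedback the slide describes.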
Allowing for admixture
Define $q_{i,k}$ as the fraction of individual i that originated from population k.
Iteration:
  – Guess $Z^{(0)}$
  – For m = 1, 2, ..
      Sample $P^{(m)}, Q^{(m)}$ from $\Pr(P, Q \mid X, Z^{(m-1)})$
      Sample $Z^{(m)}$ from $\Pr(Z \mid X, P^{(m)}, Q^{(m)})$
Estimating Z (admixture case)
Instead of estimating $\Pr(Z(i) = k \mid X, P, Q)$ (the origin of individual i is k), we estimate $\Pr(Z(i,j,l) = k \mid X, P, Q)$: the origin of copy l (chromosome i,1 or i,2) of individual i at position j.
Results on admixture prediction
Results: Thrush data
For each individual, q(i) is plotted as the distance to the opposite side of the triangle.
The assignment is reliable, and there is evidence of admixture.
Population Structure
377 locations (loci) were sampled in 1000 people from 52 populations.
6 genetic clusters were obtained, which corresponded to 5 geographic regions (Rosenberg et al. Science 2003).
Regions: Africa, Eurasia, East Asia, America, Oceania
Population sub-structure: research problem
Systematically explore the effect of admixture.
Can admixture be predicted for a locus, or for an individual?
The sampling approach may or may not be appropriate.
Formulate as an optimization/learning problem:
  – (Without admixture) Assign individuals to sub-populations so as to maximize linkage equilibrium and Hardy-Weinberg equilibrium in each of the sub-populations.
  – (With admixture) Assign (individual, locus) pairs to sub-populations.