The coalescent with recombination (Chapter 5, Part 1)

Six assumptions of the Wright-Fisher model
1. Discrete and non-overlapping generations.
2. Haploid individuals or two subpopulations.
3. The population size is constant.
4. All individuals are equally fit.
5. The population has no geographical or social structure.
6. The genes are not recombining.
Each assumption falls into one of three groups: no need to be relaxed, already relaxed in Chapter 4, or to be relaxed soon.
2019/1/13 Comp 790-Coalescent with recombination

No recombination: the last assumption
This is the last assumption that needs to be relaxed.
Why does it need to be relaxed? Recombination occurs in most real data sets.
Why is it the last one to be relaxed? It is mathematically more complex to analyze: the sampled sequences are no longer related by a tree, but by a graph or a collection of trees.

Outline
1. What is recombination?
2. An example of recombination
3. Hudson’s model of recombination
4. Wright-Fisher model with recombination
5. ARG simulation algorithm

What is recombination?
Recall the slides in lecture 5.
Recombination is a process in which new gene combinations are introduced, e.g. crossover and gene conversion.

What is the result of recombination?
[Figure: inheritance across the grandparents, parents, and children layers, shown without recombination and with recombination.]

An example of recombination
The Apolipoprotein E gene: 31 different haplotypes (rows) and 21 segregating sites (columns).
Some pairs of sites cannot be fitted on a single tree, so there must have been recombination.
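
One common way to detect pairs of sites that cannot share a tree is the four-gamete test; the slide does not name the test it uses, so this is an assumption about the kind of incompatibility meant. Assuming each site mutated at most once (infinite sites), two biallelic sites that show all four allele combinations cannot be explained by a single tree. The sketch below uses made-up toy haplotypes, not the Apolipoprotein E data, and the function name is hypothetical.

```python
from itertools import combinations


def incompatible_pairs(haplotypes):
    """Return pairs of site indices (i, j) that show all four gametes.

    `haplotypes` is a list of equal-length strings over {"0", "1"}
    (rows = haplotypes, columns = segregating sites).
    """
    n_sites = len(haplotypes[0])
    pairs = []
    for i, j in combinations(range(n_sites), 2):
        # Collect the distinct two-site allele combinations seen in the sample.
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:
            pairs.append((i, j))
    return pairs


# Toy data (illustrative only): sites 0 and 1 show 00, 01, 10 and 11.
toy = ["000", "010", "100", "110", "111"]
print(incompatible_pairs(toy))
```

Only the pair of sites 0 and 1 is reported for the toy data; the other two pairs show three combinations each and are compatible with a single tree.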

Pair-wise LD measure
Linkage disequilibrium (LD) is an indirect measure of the correlation of genealogical trees for different segregating sites. The higher the LD, the more correlated the pair of sites.
The color denotes the significance.
There is a weak tendency for highly significant LD to be found between sites that are close together.

LD at different distances
LD is smaller the further apart the sites are.
Recombination leads to this pattern: sites far apart experience more recombination events between them.
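
The slides do not say which LD statistic is plotted. As an illustration only, the sketch below computes r², one widely used pairwise measure, for two biallelic sites. The haplotype encoding and the function name are assumptions made for this example.

```python
def r_squared(haplotypes, i, j):
    """Squared correlation (r^2) between biallelic sites i and j.

    `haplotypes` is a list of equal-length strings over {"0", "1"}.
    """
    n = len(haplotypes)
    p_a = sum(h[i] == "1" for h in haplotypes) / n   # frequency of allele 1 at site i
    p_b = sum(h[j] == "1" for h in haplotypes) / n   # frequency of allele 1 at site j
    p_ab = sum(h[i] == "1" and h[j] == "1" for h in haplotypes) / n
    d = p_ab - p_a * p_b                             # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be segregating in the sample")
    return d * d / denom


print(r_squared(["00", "11", "00", "11"], 0, 1))  # perfectly correlated sites
print(r_squared(["00", "01", "10", "11"], 0, 1))  # uncorrelated sites
```

Applied to every pair of sites and plotted against physical distance, a statistic like this would be expected to show the decay described above.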

A summary of the example
We cannot use the previous model without recombination to fit these sequences; recombination is the cause.
Recombination can generate incompatibilities between pairs of sites.
Segregating sites that are far apart experience more recombination events, so they become less correlated.

Hudson’s model of recombination
Forward perspective: a parental chromosome is inherited directly from two grandparental chromosomes, A and B.
1. Choose a random point uniformly along the chromosome.
2. Copy the genetic material from chromosome A to the left of that point.
3. Copy the genetic material from chromosome B to the right of that point.
[Figure: chromosomes A and B recombining.]
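
The forward step can be sketched in a few lines. This is a minimal illustration that assumes chromosomes are equal-length strings and that the breakpoint falls between two adjacent positions; the model itself treats the point as uniform along the chromosome. The function name is hypothetical.

```python
import random


def recombine(chrom_a, chrom_b, rng=random):
    """Build a recombinant: A to the left of a uniform breakpoint, B to the right."""
    if len(chrom_a) != len(chrom_b) or len(chrom_a) < 2:
        raise ValueError("chromosomes must have equal length of at least 2")
    # Breakpoint between positions point-1 and point, chosen uniformly.
    point = rng.randrange(1, len(chrom_a))
    return chrom_a[:point] + chrom_b[point:], point


child, point = recombine("AAAAAA", "BBBBBB", random.Random(1))
print(child, point)
```

Passing a seeded `random.Random` instance makes the draw reproducible; by default the module-level generator is used.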

Hudson’s model of recombination (cont.)
Reversed perspective:
1. Choose a chromosome from a parent.
2. The chromosome splits into two grandparental chromosomes.
[Figure: recombination.]

Modeling recombination and coalescence
Looking backwards in time, recombination events are the opposite of coalescent events: coalescence is a combining event, and recombination is a splitting event.
How can we model both kinds of event? Use an idea similar to the one used before, when adding mutation events to the coalescent.
Question 1: What is this idea?

Another exponential distribution
We model the waiting time until a recombination event as exponentially distributed.
This distribution is independent of the coalescent process.
Its parameter (the intensity of recombination) is proportional to the recombination rate ρ of a sequence times the number of ancestral lineages.

From Hudson’s model to the Wright-Fisher model
Hudson’s model simplifies the biology of recombination. The mechanisms of recombination are complicated, differ greatly among eukaryotes, bacteria, and viruses, and are still not well understood at the molecular level.
Even so, the model forms the basis for most applications of coalescent theory to recombining sequences.
We now modify the Wright-Fisher model to include this simplified model of recombination.

Wright-Fisher model with recombination
Diploid Wright-Fisher model: an individual perspective.

Wright-Fisher model with recombination (cont.)
Haploid Wright-Fisher model: a sequence perspective.
We can ignore the existence of individuals under some conditions.

Discrete time formulation
In the discrete model, let r be the recombination rate per sequence per generation, and let T_R denote the number of generations until the first recombination event.
The probability that a sequence was created by recombination j generations ago is
P(T_R = j) = r(1 − r)^(j − 1),
that is, T_R is geometrically distributed.
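
A quick numerical check of the geometric distribution, whose probability mass function is P(T_R = j) = r(1 − r)^(j − 1): the probabilities should sum to one and the mean waiting time should be 1/r generations. The value r = 0.01 is an arbitrary illustration, and the function name is hypothetical.

```python
def prob_first_recombination(j, r):
    """P(T_R = j): no recombination for j - 1 generations, then one."""
    if j < 1:
        raise ValueError("j counts generations and must be at least 1")
    return r * (1 - r) ** (j - 1)


r = 0.01  # illustrative per-generation recombination probability
js = range(1, 5001)  # far enough into the tail for the sums to converge
total = sum(prob_first_recombination(j, r) for j in js)
mean = sum(j * prob_first_recombination(j, r) for j in js)
print(total, mean)  # close to 1 and to 1/r = 100
```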
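
Continuous time approximation
Let the scaled recombination rate be ρ = 4Nr, analogous to θ for mutation.
With j = 2Nt, so that time t is measured in units of 2N generations, the waiting time until recombination is approximately exponentially distributed:
P(T_R ≤ 2Nt) = 1 − (1 − r)^(2Nt) ≈ 1 − e^(−ρt/2), i.e. Exp(ρ/2).
Note that this probability is for a single sequence only.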
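
The limit behind this approximation can be written out as a short derivation. It assumes r = ρ/(4N) with ρ held fixed as N grows, and uses the geometric tail P(T_R > j) = (1 − r)^j.

```latex
\begin{align*}
P(T_R > 2Nt) &= (1 - r)^{2Nt}
              = \left(1 - \frac{\rho}{4N}\right)^{2Nt} \\
             &\longrightarrow e^{-\rho t/2}
              \qquad (N \to \infty),
\end{align*}
```

so in units of 2N generations the waiting time for a single sequence converges to an exponential distribution with parameter ρ/2.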

Continuous time approximation (cont.)
If there are k sequences, the parameter of the exponential distribution is kρ/2.
Question 2: Why?
The waiting times until recombination in the individual sequences are independent and exponentially distributed, i.e. Exp(ρ/2).
The intensity of recombination in any of the k sequences equals the sum of the intensities of the individual sequences.
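
The answer to Question 2 can be checked by simulation: the minimum of k independent Exp(ρ/2) waiting times should have mean 1/(kρ/2). The sketch below is a Monte Carlo illustration; the values of k and ρ, the number of draws, and the function name are arbitrary choices for this example.

```python
import random


def mean_time_to_first_recombination(k, rho, n_draws=20000, seed=0):
    """Average, over many draws, of the minimum of k Exp(rho/2) waiting times."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        # One independent exponential waiting time per sequence; keep the earliest.
        total += min(rng.expovariate(rho / 2) for _ in range(k))
    return total / n_draws


k, rho = 5, 2.0
print(mean_time_to_first_recombination(k, rho), 1 / (k * rho / 2))
```

The two printed numbers should agree up to Monte Carlo error, which is consistent with the first recombination among k sequences arriving at rate kρ/2.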

Continuous time approximation (cont.)
Again, the waiting times until a coalescence event and until a recombination event among k sequences are independent and exponentially distributed, with parameters k(k − 1)/2 and kρ/2.
The waiting time until the first of these events is Exp(k(k − 1)/2 + kρ/2).
The probability that the first event is a coalescence is (k − 1)/(k − 1 + ρ).
The probability that it is a recombination is ρ/(k − 1 + ρ).
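
As a small worked example, the two competing intensities k(k − 1)/2 (coalescence) and kρ/2 (recombination) give the total event rate and the probability of each event type. The helper name below is hypothetical.

```python
def event_rates(k, rho):
    """Return (total rate, P(coalescence first), P(recombination first))."""
    coalescence = k * (k - 1) / 2
    recombination = k * rho / 2
    total = coalescence + recombination
    if total == 0:
        raise ValueError("no event can occur (k = 1 and rho = 0)")
    return total, coalescence / total, recombination / total


total, p_coal, p_rec = event_rates(k=5, rho=1.5)
print(total, p_coal, p_rec)
# p_coal simplifies to (k - 1) / (k - 1 + rho) after cancelling k/2.
```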

ARG simulation algorithm
1. Start with k = n genes.
2. For k sequences with ancestral material, draw a random number from the exponential distribution with parameter k(k − 1)/2 + kρ/2. This is the time to the next event.
3. With probability (k − 1)/(k − 1 + ρ) the event is a coalescence event; otherwise it is a recombination event.
4. If it is a coalescence event, choose two of the ancestral sequences at random and merge them into one sequence that inherits the ancestral material of both. Decrease k by one. If k = 1, end the process; otherwise go to step 2.

ARG simulation algorithm (cont.)
5. If it is a recombination event, draw a random sequence and a random point on that sequence. Create one ancestor sequence with the ancestral material to the left of the chosen point and a second ancestor with the ancestral material to the right of it. Increase the number of ancestral sequences k by one and go to step 2.
Question 3: Where can we find the missing material of the ancestors?
[Figure: a sequence splitting at a random point.]
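
The five steps can be sketched as a short simulation. This is a minimal illustration, not a reference implementation: it assumes the sequence is the unit interval [0, 1), represents each lineage's ancestral material as a list of (start, end) intervals, and follows the steps literally, so a lineage left with no ancestral material after a split is still counted in k. All function names are hypothetical.

```python
import random


def split(intervals, point):
    """Split a list of (start, end) intervals into parts left and right of point."""
    left, right = [], []
    for start, end in intervals:
        if end <= point:
            left.append((start, end))
        elif start >= point:
            right.append((start, end))
        else:  # the interval straddles the breakpoint
            left.append((start, point))
            right.append((point, end))
    return left, right


def merge(a, b):
    """Union of two interval lists, joining overlapping or touching pieces."""
    merged = []
    for start, end in sorted(a + b):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def simulate_arg(n, rho, seed=None):
    """Return (time to the single ancestor, event list, ancestor's material)."""
    rng = random.Random(seed)
    lineages = [[(0.0, 1.0)] for _ in range(n)]              # step 1
    time, events = 0.0, []
    while len(lineages) > 1:
        k = len(lineages)
        time += rng.expovariate(k * (k - 1) / 2 + k * rho / 2)  # step 2
        if rng.random() < (k - 1) / (k - 1 + rho):           # step 3
            i, j = rng.sample(range(k), 2)                   # step 4: coalescence
            parent = merge(lineages[i], lineages[j])
            lineages = [x for idx, x in enumerate(lineages) if idx not in (i, j)]
            lineages.append(parent)
            events.append((time, "coalescence", None))
        else:                                                # step 5: recombination
            point = rng.random()
            left, right = split(lineages.pop(rng.randrange(k)), point)
            lineages += [left, right]
            events.append((time, "recombination", point))
    return time, events, lineages[0]


time, events, root = simulate_arg(n=5, rho=1.0, seed=42)
print(time, len(events), root)
```

Because every split partitions a lineage's material and every merge takes a union, the single ancestor ends up carrying the whole interval, and the number of coalescences exceeds the number of recombinations by exactly n − 1.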

Is the single ancestor ever reached?
A coalescence event decreases k by one. A recombination event increases k by one.
Question 4: Is there an end to the process? Yes. Why?
The number of lineages k follows a birth-death process. The recombination intensity kρ/2 is the birth rate, and the coalescence intensity k(k − 1)/2 is the death rate.
The death rate is quadratic in k while the birth rate is only linear, so k(k − 1)/2 ≥ kρ/2 whenever k ≥ ρ + 1, and the process is driven back towards k = 1.
The grand most recent common ancestor (GMRCA) is always found, but it may take a long time.
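
A short numerical illustration of why the process ends: recombination adds a lineage at rate kρ/2 and coalescence removes one at rate k(k − 1)/2, so the net rate of change in k turns negative once k exceeds ρ + 1. The function name and the value ρ = 5 are arbitrary choices for this sketch.

```python
def net_growth_rate(k, rho):
    """Rate of lineages gained by recombination minus rate lost by coalescence."""
    return k * rho / 2 - k * (k - 1) / 2


rho = 5.0
for k in (2, 6, 10, 50):
    print(k, net_growth_rate(k, rho))
# Positive for k < rho + 1, zero at k = rho + 1, increasingly negative beyond it.
```

For large ρ the number of lineages can hover around ρ + 1 for a long time before it happens to drop to one, which matches the remark that reaching the single ancestor may take a long time.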

Genealogical structure: from tree to graph
With recombination, we must use a graph rather than a tree to model the relations among the sequences.
The ancestral recombination graph (ARG) is the graph resulting from the simulation algorithm.

Genealogical structure: from graph to a collection of trees
However, if we focus on a single point on the sequence, there is no recombination.
Question 5: Why? A given point of a child sequence is always inherited from only one parent sequence.
Local tree: the tree relating the sequences at a single position.
The genealogical graph can be seen as a collection of local trees, one for each position.
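
Since the local tree can only change at a recombination breakpoint, all positions between two adjacent breakpoints share one local tree. The sketch below lists those stretches; it assumes positions on the interval [0, length) and a plain list of breakpoint positions, and the function name is hypothetical.

```python
def local_tree_intervals(breakpoints, length=1.0):
    """Return the (start, end) stretches within which the local tree is constant."""
    # Ignore duplicates and breakpoints at or outside the sequence ends.
    cuts = sorted({b for b in breakpoints if 0 < b < length})
    edges = [0.0] + cuts + [length]
    return list(zip(edges[:-1], edges[1:]))


print(local_tree_intervals([0.7, 0.2, 0.2]))
```

The count of stretches is an upper bound on the number of distinct local trees, because a recombination event does not always change the tree on either side of its breakpoint.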

Next time
1. More on the simulation algorithm
2. Effect of a single recombination event
3. Coalescent events with gene conversion