Genome evolution
Lecture 3: Population genetics I: mutation and recombination
Population genetics
Drift: the process by which allele frequencies change through generations
Mutation: the process by which new alleles are introduced
Recombination: the process by which multi-allelic genomes are mixed
Selection: the effect of fitness on the dynamics of allele drift
Epistasis: the effect of fitness dependencies among different alleles
“Organismal” effects: ecology, geography, behavior
Wright-Fisher model for genetic drift
[Diagram: N individuals → ∞ gametes → N individuals → ∞ gametes]
We follow the number of copies of an allele in the population until fixation ($X = 2N$) or loss ($X = 0$).
We can model this as a Markov process on a variable $X$ (the number of A alleles) with transition probabilities
$P_{ij} = \binom{2N}{j} \left(\frac{i}{2N}\right)^{j} \left(1 - \frac{i}{2N}\right)^{2N-j}$,
that is, sampling $j$ copies of A among $2N$ draws from a population of $2N$ alleles that holds $i$ copies of A.
In a larger population the frequency changes more slowly: the variance of the binomially sampled frequency is $pq/2N$ (with $p = i/2N$, $q = 1 - p$), so one round of sampling changes it less.
[Diagram: Markov chain on the states 0 (loss), 1, ..., 2N-1, 2N (fixation)]
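The binomial sampling step can be written out directly. The following is a minimal sketch, assuming the standard binomial form of the Wright-Fisher transition probabilities; the function name `wright_fisher_transition` and the choice 2N = 10 are illustrative only.

```python
from math import comb


def wright_fisher_transition(two_n):
    """Return P where P[i][j] = Pr(X_next = j | X_now = i)."""
    matrix = []
    for i in range(two_n + 1):
        p = i / two_n  # current frequency of allele A
        row = [comb(two_n, j) * p**j * (1 - p) ** (two_n - j)
               for j in range(two_n + 1)]
        matrix.append(row)
    return matrix


P = wright_fisher_transition(10)  # 2N = 10 allele copies

# Starting from i = 3 copies: the mean count stays at 3 (no drift in the
# mean), and the variance of the frequency is p*q / 2N.
mean_next = sum(j * P[3][j] for j in range(11))
var_freq = sum((j / 10 - 0.3) ** 2 * P[3][j] for j in range(11))
```

Rows 0 and 2N put all their mass on themselves, which is why loss and fixation are absorbing.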
Mutations vs. drift
Diversity ($\theta$): reflects the chance that two random individuals carry different genotypes
Mutations generate population diversity
Drift eliminates population diversity through fixation
Mutation happens at some biologically determined rate $\mu$ (more on that later in the course)
Fixation takes on the order of $4N$ generations
What will the population look like given both forces?
Stationary distribution when drift dominates
If mutation is slow compared to drift, the population is almost always fixed for one allele, so we can model it as a single random variable. Evolution is then a Markov process on two or more states of that variable.
Simplest model: assume two alleles, A and a, with mutation probabilities
$\Pr(A \to a) = \mu_{Aa}, \quad \Pr(a \to A) = \mu_{aA}$.
If the process runs long enough, it converges to the stationary distribution
$\pi_A = \frac{\mu_{aA}}{\mu_{Aa} + \mu_{aA}}, \quad \pi_a = \frac{\mu_{Aa}}{\mu_{Aa} + \mu_{aA}}$.
[Diagram: two states, A and a, with transitions in both directions]
Remember: under these assumptions, we are likely to find the entire sampled population in either the A state or the a state. What conditions on the mutation rate justify this model?
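A quick numerical check of the two-state picture: a sketch, assuming per-generation mutation probabilities named `mu_forward` (A to a) and `mu_back` (a to A). These names and the rate values are hypothetical, chosen only for illustration.

```python
def stationary_two_state(mu_forward, mu_back):
    """Stationary (pi_A, pi_a) for A -> a at mu_forward, a -> A at mu_back."""
    total = mu_forward + mu_back
    return mu_back / total, mu_forward / total


def iterate_chain(mu_forward, mu_back, steps, start_pr_A=1.0):
    """Push Pr(population is in state A) forward by `steps` generations."""
    pr_A = start_pr_A
    for _ in range(steps):
        pr_A = pr_A * (1 - mu_forward) + (1 - pr_A) * mu_back
    return pr_A


pi_A, pi_a = stationary_two_state(1e-3, 2e-3)   # illustrative rates
long_run = iterate_chain(1e-3, 2e-3, steps=10_000)
```

The distance to the stationary value shrinks by a factor of (1 - mu_forward - mu_back) per step, so convergence takes on the order of 1 / (mu_forward + mu_back) generations.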
What happens when mutations are rapid?
If mutation is rapid compared to drift, we lose all population structure.
This is just a random mixing process.
Evolution cannot work this way: information must be propagated.
In practice, populations maintain a non-trivial balance between mutation and drift.
But we do not know the mutation rate (or the effective population size).
A coalescent model approach: infinite alleles model
When alleles were measured at the protein level, it was reasonable to assume that mutations generate new variants (isozymes), never reversing or repeating a variant.
Adding mutations with probability $\mu$, the coalescent process is extended by killing lineages (time is sped up by a factor of $2N$). Going back in time with $k$ lineages:
Coalescence: rate $k(k-1)/2$
Mutation: rate $2N\mu k = \theta k/2$, with $\theta = 4N\mu$
“Coalescent with killing”
Hoppe’s urn
Probability model (Hoppe’s urn): draw from an urn that holds one black ball of mass $\theta$ and further balls of other colors, each of mass 1. Each time the black ball is drawn, a new ball with a new color is added to the urn. If a colored ball is drawn, it is returned to the urn together with another ball of the same color.
With $n$ colored balls in the urn: probability of drawing a given colored ball $= 1/(n+\theta)$; probability of drawing the black ball $= \theta/(n+\theta)$.
Theorem: Hoppe’s urn and the coalescent with killing are equivalent.
(Also known as the Chinese restaurant process.)
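The urn scheme is easy to simulate. The following is a minimal sketch, assuming the black ball has mass theta (the scaled mutation rate) and is never removed; `hoppe_urn` is a hypothetical helper name.

```python
import random


def hoppe_urn(n, theta, rng):
    """Run n draws of Hoppe's urn; return the size of each color class."""
    counts = []  # counts[c] = number of balls of color c
    for drawn in range(n):  # `drawn` colored balls are in the urn so far
        if rng.random() < theta / (drawn + theta):
            counts.append(1)  # black ball drawn: add a new color
        else:
            # colored ball drawn: pick a color with probability
            # proportional to its current count, then duplicate it
            color = rng.choices(range(len(counts)), weights=counts)[0]
            counts[color] += 1
    return counts


sample = hoppe_urn(50, 2.0, random.Random(1))  # allele counts, n = 50
```

Each entry of `sample` is the number of copies of one allele in the sample, so `len(sample)` is the number of distinct alleles.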
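Testing the infinite alleles model
Theorem (Ewens sampling formula): Let $a_i$ be the number of alleles present $i$ times in a sample of size $n$. When the scaled mutation rate is $\theta = 4N\mu$,
$\Pr(a_1, \ldots, a_n) = \frac{n!}{\theta_{(n)}} \prod_{j=1}^{n} \frac{(\theta/j)^{a_j}}{a_j!}$, where $\theta_{(n)} = \theta(\theta+1)\cdots(\theta+n-1)$.
A simpler statistic is the number of distinct alleles, $K_n$. Its expected value is
$E[K_n] = \sum_{i=1}^{n} \frac{\theta}{\theta+i-1}$.
Proof: At step $i$ of Hoppe’s urn process, we draw the black ball (adding a new allele) with probability $\theta/(\theta+i-1)$.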
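For small samples the sampling formula can be checked by hand. The following is a sketch, assuming the standard form of the Ewens formula with theta as the scaled mutation rate; the function names are illustrative. It enumerates all allele configurations for n = 3.

```python
from math import factorial


def ewens_probability(a, theta):
    """Pr of configuration a, where a[i-1] = number of alleles seen i times."""
    n = sum(i * a_i for i, a_i in enumerate(a, start=1))
    rising = 1.0
    for k in range(n):
        rising *= theta + k  # theta (theta+1) ... (theta+n-1)
    prob = factorial(n) / rising
    for j, a_j in enumerate(a, start=1):
        prob *= (theta / j) ** a_j / factorial(a_j)
    return prob


def expected_num_alleles(n, theta):
    """E[K_n]: one term per step of the urn process."""
    return sum(theta / (theta + i - 1) for i in range(1, n + 1))


# All configurations for n = 3: three singletons, one singleton plus
# one doubleton, or a single allele seen three times.
configs_n3 = [(3, 0, 0), (1, 1, 0), (0, 0, 1)]
probs = [ewens_probability(a, 1.0) for a in configs_n3]
mean_k = sum(sum(a) * p for a, p in zip(configs_n3, probs))
```

With theta = 1 the three probabilities are 1/6, 1/2 and 1/3, and the mean number of alleles agrees with the sum formula (11/6).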
Testing the infinite alleles model
Figures 7.16, 7.17:
Not quite neutral: VNTR locus in humans, observed (open columns) and Ewens-predicted allele counts.
Highly non-neutral: F computed from the number of Xdh alleles in 89 D. pseudoobscura lines (52 had a common allele, 8 singletons), compared to a simulation assuming the infinite alleles model.
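Infinite sites model
In the infinite sites model, mutations occur at distinct sites, each exactly once. This model is appropriate for long DNA sequences.
Theorem: Let $\mu$ be the mutation rate for the locus under consideration, and set $\theta = 4N\mu$. Under the infinite sites model, the expected number of segregating sites in a sample of size $n$ is
$E[S_n] = \theta \sum_{i=1}^{n-1} \frac{1}{i}$.
Proof: Let $t_j$ be the amount of time in the coalescent during which there are $j$ lineages. We showed earlier that $t_j$ has approximately an exponential distribution with mean $2/(j(j-1))$. The total amount of time in the tree for a sample of size $n$ is
$T_{total} = \sum_{j=2}^{n} j\, t_j$, so $E[T_{total}] = \sum_{j=2}^{n} \frac{2}{j-1} = 2\sum_{i=1}^{n-1} \frac{1}{i}$.
Mutations occur at rate $2N\mu$ per lineage, so $E[S_n] = 2N\mu \cdot E[T_{total}] = \theta \sum_{i=1}^{n-1} \frac{1}{i}$.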
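The two sums in the proof can be evaluated directly. The following is a small sketch, assuming time is measured in units of 2N generations and theta is the scaled mutation rate; the function names are illustrative.

```python
def harmonic(n):
    """h_n = 1 + 1/2 + ... + 1/(n-1)."""
    return sum(1.0 / i for i in range(1, n))


def expected_tree_length(n):
    """E[T_total] = sum over j of j * E[t_j], with E[t_j] = 2/(j(j-1))."""
    return sum(j * 2.0 / (j * (j - 1)) for j in range(2, n + 1))


def expected_segregating_sites(n, theta):
    """E[S_n] = theta * h_n."""
    return theta * harmonic(n)


# Mutations fall at rate theta/2 per unit of branch length, so the two
# routes to E[S_n] should agree.
via_tree = 5.0 / 2.0 * expected_tree_length(10)
via_formula = expected_segregating_sites(10, 5.0)
```

Note how slowly the harmonic sum grows: doubling the sample size adds only about theta * ln 2 expected segregating sites.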
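Infinite sites model
Theorem: Let $\theta = 4N\mu$. Under the infinite sites model, the number of segregating sites $S_n$ has
$\mathrm{var}(S_n) = \theta \sum_{i=1}^{n-1} \frac{1}{i} + \theta^2 \sum_{i=1}^{n-1} \frac{1}{i^2}$.
Proof: Let $s_j$ be the number of segregating sites created while there were $j$ lineages. While there are $j$ lineages, mutations occur at rate $2N\mu j$ and coalescence at rate $j(j-1)/2$. A mutation occurs before coalescence with probability
$\frac{2N\mu j}{2N\mu j + j(j-1)/2} = \frac{\theta}{\theta + j - 1}$.
For $k$ successes: $\Pr(s_j = k) = \left(\frac{\theta}{\theta+j-1}\right)^{k} \frac{j-1}{\theta+j-1}$.
This is a shifted geometric distribution, so $\mathrm{var}(s_j) = \frac{\theta}{j-1} + \frac{\theta^2}{(j-1)^2}$. Summing over the independent $s_j$, $j = 2, \ldots, n$, gives the result.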
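The geometric argument can be checked numerically by summing the distribution of each s_j term by term. The following is a sketch, assuming theta is the scaled mutation rate and truncating the infinite sum at a large `max_k`; both function names are illustrative.

```python
def geometric_moments(theta, j, max_k=5000):
    """Mean and variance of s_j from its pmf, truncated at max_k."""
    q = theta / (theta + j - 1)       # Pr(mutation before coalescence)
    stop = (j - 1) / (theta + j - 1)  # Pr(coalescence comes first)
    mean = sum(k * q**k * stop for k in range(max_k))
    second = sum(k * k * q**k * stop for k in range(max_k))
    return mean, second - mean**2


def variance_segregating_sites(n, theta):
    """Closed form: theta * sum(1/i) + theta^2 * sum(1/i^2), i = 1..n-1."""
    return sum(theta / i + (theta / i) ** 2 for i in range(1, n))


# Sum the per-level variances for a sample of n = 6 with theta = 2.
summed = sum(geometric_moments(2.0, j)[1] for j in range(2, 7))
closed_form = variance_segregating_sites(6, 2.0)
```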
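Watterson’s estimator, using the infinite sites model
We can estimate $\theta = 4N\mu$ from an empirical $S_n$:
$\hat{\theta}_W = S_n / h_n$, where $h_n = \sum_{i=1}^{n-1} \frac{1}{i}$.
Theorem: For Watterson’s estimator, $E[\hat{\theta}_W] = \theta$ and $\mathrm{var}(\hat{\theta}_W) = \frac{\theta}{h_n} + \frac{\theta^2 \sum_{i=1}^{n-1} 1/i^2}{h_n^2}$.
So we can build a model of the population from as little data as $S_n$.
What happens if we want to incorporate more complex models (e.g., expansion, migration)?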
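In practice the estimator is a one-liner. The following is a sketch using a hypothetical data set (14 segregating sites in a sample of 10 sequences); the numbers are invented for illustration and do not come from the lecture.

```python
def watterson_theta(num_segregating, n):
    """Watterson's estimate of theta from S_n and the sample size n."""
    h_n = sum(1.0 / i for i in range(1, n))
    return num_segregating / h_n


theta_hat = watterson_theta(14, 10)  # hypothetical S_n = 14, n = 10
```

Here h_10 is about 2.83, so the estimate comes out a little under 5.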
Finite alleles model
If we think of a single DNA base, we have only 4 possible alleles (A, G, T, C), so our model must include recurrent mutations.
Even if we assume neutrality, mutations can become dependent:
We may have different rates at different sites
We may have coupling between a base and the bases nearby
We may need to consider insertions and deletions
Importantly, if all these are neutral, the basic coalescent structure is not affected.
The Poisson process: the expected number of events over time $t$ is $\lambda t$.
Using simulations
The sampling procedure:
Generate a large number of populations (using the model we presented)
Compute the distribution of your statistic over these random cases
Compare it to the value you observe in your population
If you find a significant bias, some modeling assumption must be wrong
In principle, we can sample generation after generation for a sufficient time (how much?). Direct simulation using Wright-Fisher is painfully expensive (why?).
If you are only interested in the current population, most of your coin tossing will be useless. We can use the coalescent approach and sample only genealogies, going back in time, for example using the coalescent with killing.
Important: this is analogous to first sampling a tree and then scattering the mutations on it.
We can also think of simulating evolution while ignoring the population, based on the Markov process shown above (what are the limitations here?).
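The "sample a tree, then scatter mutations" recipe can be sketched in a few lines. This is a minimal illustration under the infinite sites assumptions, with time in units of 2N generations and theta the scaled mutation rate; only the total branch length is tracked (not the tree shape), which is enough for the number of segregating sites. The function name and parameter values are illustrative.

```python
import random


def simulate_segregating_sites(n, theta, rng):
    """One draw of S_n: sample coalescent times, then scatter mutations."""
    total_length = 0.0
    for k in range(n, 1, -1):
        # time with k lineages ~ Exp(rate k(k-1)/2), units of 2N generations
        t_k = rng.expovariate(k * (k - 1) / 2.0)
        total_length += k * t_k
    # mutations fall on the tree as a Poisson process of rate theta/2;
    # count the arrivals of that process before total_length is used up
    mutations = 0
    position = rng.expovariate(theta / 2.0)
    while position < total_length:
        mutations += 1
        position += rng.expovariate(theta / 2.0)
    return mutations


rng = random.Random(0)
draws = [simulate_segregating_sites(10, 5.0, rng) for _ in range(4000)]
mean_s = sum(draws) / len(draws)
```

The empirical distribution in `draws` is what an observed statistic would be compared against; its mean should sit close to theta times the harmonic sum for n = 10 (about 14.1 here).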
Recombination and linkage
Assume two loci have alleles $A_1, A_2$ and $B_1, B_2$.
Only double heterozygotes allow recombination to change haplotype frequencies.
[Diagram: the double heterozygote A1B1/A2B2 yields the recombinant gametes A1B2 and A2B1; the double heterozygote A1B2/A2B1 yields A1B1 and A2B2]
Linkage equilibrium: $\Pr(A_i B_j) = \Pr(A_i)\Pr(B_j)$
The recombination fraction $r$: the proportion of recombinant gametes generated from a double heterozygote.
For loci on different chromosomes: $r = 0.5$.
For loci on the same chromosome: $r$ is a function of the distance and possibly other factors.
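Linkage disequilibrium (LD)
[Diagram: gametes A1B1, A1B2, A2B1, A2B2; with probability 1-r there is no recombination, with probability r there is recombination between any A1- and -B1 chromosomes]
Next generation: $P'_{A_1B_1} = (1-r)\,P_{A_1B_1} + r\,p_{A_1} p_{B_1}$
Define the linkage disequilibrium parameter $D$ as $D = P_{A_1B_1} - p_{A_1} p_{B_1}$, so that $D_t = (1-r)^t D_0$.
[Figure: D against generation for r = 0.05, 0.2 and 0.5]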
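The decay of D can be traced by iterating the haplotype frequencies. The following is a sketch, assuming random mating and the usual definition of D as the A1B1 haplotype frequency minus the product of the A1 and B1 allele frequencies; function names and starting frequencies are illustrative.

```python
def next_generation(hap, r):
    """hap = (P11, P12, P21, P22): frequencies of A1B1, A1B2, A2B1, A2B2."""
    p11, p12, p21, p22 = hap
    d = p11 * p22 - p12 * p21  # equals P11 - p(A1) * p(B1)
    return (p11 - r * d, p12 + r * d, p21 + r * d, p22 - r * d)


def disequilibrium(hap):
    """D = P11 - p(A1) * p(B1)."""
    p11, p12, p21, p22 = hap
    return p11 - (p11 + p12) * (p11 + p21)


# Start with only A1B1 and A2B2 haplotypes (D = 0.25) and r = 0.2.
hap = (0.5, 0.0, 0.0, 0.5)
for _ in range(10):
    hap = next_generation(hap, 0.2)
d_after_10 = disequilibrium(hap)
```

Each generation removes a fraction r of the remaining disequilibrium, so after 10 generations D is 0.25 times 0.8 to the power 10, while the allele frequencies themselves stay at 0.5.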
Linkage disequilibrium (LD): example
Blood group genotypes M/N and S/s. Both loci are in Hardy-Weinberg equilibrium.
For M/N: $p_1 = 0.5425$, $p_2 = 0.4575$
For S/s: $q_1 = 0.3080$, $q_2 = 0.6920$
Haplotype   Observed   Expected if unlinked
MS          474        334.2
Ms          611        750.8
NS          142        281.8
Ns          773        633.2
Linkage equilibrium is highly unlikely!
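The expected counts and the strength of the departure can be recomputed from the observed haplotype counts. This sketch assumes 2000 sampled chromosomes with an observed MS count of 474, which is the value implied by the stated allele frequencies (p1 = 0.5425, q1 = 0.3080) and by the expected count 334.2; the chi-square statistic is an added illustration, not part of the slide.

```python
# Observed haplotype counts (MS taken as 474, see the note above).
observed = {"MS": 474, "Ms": 611, "NS": 142, "Ns": 773}
total = sum(observed.values())  # 2000 chromosomes

p1 = (observed["MS"] + observed["Ms"]) / total  # frequency of M
q1 = (observed["MS"] + observed["NS"]) / total  # frequency of S
freqs = {"M": p1, "N": 1 - p1, "S": q1, "s": 1 - q1}

# Under linkage equilibrium each haplotype frequency is a product.
expected = {h: freqs[h[0]] * freqs[h[1]] * total for h in observed}

D = observed["MS"] / total - p1 * q1
chi_square = sum((observed[h] - expected[h]) ** 2 / expected[h]
                 for h in observed)
```

With one degree of freedom, a chi-square value above about 10.8 already corresponds to p < 0.001, and the value here is far larger.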
Sources of linkage disequilibrium
LD in the original population that has not yet decayed because of low $r$
Genetic coadaptation: regions of the genome that are not subject to recombination (for example, inverted chromosomal fragments)
Admixture of populations with different allele frequencies
Recombination rates in the human population: LD blocks
Recombination rates in the human population Recombination rates are highly non uniform – with major effects on genome structure!
Selection
Fitness: the relative reproductive success of an individual (or genome).
Fitness is only defined with respect to the current population.
Fitness is unlikely to remain constant across all conditions and environments.
Sampling probability is multiplied by a selection factor $1+s$.
Mutations can change fitness:
A deleterious mutation decreases fitness. It is therefore selected against. This process is called negative or purifying selection.
An advantageous or beneficial mutation increases fitness. It is therefore subject to positive selection.
A neutral mutation is one that does not change fitness.
The Moran model
Instead of working with discrete generations, we replace at most one individual at each time step.
[Diagram: a population of A and a alleles; one individual (marked X) is removed and replaced by sampling from the current population]
We assume time steps are small. What kind of mathematical model describes this process?
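Continuous time Markov processes
Conditions on transitions:
$P_{ij}(t) \ge 0$, $\sum_j P_{ij}(t) = 1$
Markov (Chapman-Kolmogorov): $P_{ij}(t+s) = \sum_k P_{ik}(t) P_{kj}(s)$
$\lim_{t \to 0^+} P_{ij}(t) = \delta_{ij}$
Theorem:
$\lim_{h \to 0^+} \frac{1 - P_{ii}(h)}{h}$ exists (may be infinite)
$\lim_{h \to 0^+} \frac{P_{ij}(h)}{h}$, $i \ne j$, exists and is finite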
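Rates and transition probabilities
The process’s rate matrix: $Q = (q_{ij})$, with off-diagonal entries $q_{ij} \ge 0$ and diagonal entries $q_{ii} = -\sum_{j \ne i} q_{ij}$
Transition differential equations (backward form): $\frac{d}{dt} P_{ij}(t) = \sum_k q_{ik} P_{kj}(t)$, that is, $P'(t) = Q\,P(t)$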
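The Moran model
[Diagram: one individual (marked X) is replaced by sampling from the current population]
Assume the rate of replacement for each individual is 1. We derive a model similar to Wright-Fisher, but in continuous time: a process on a random variable counting the number of copies of allele A.
[Diagram: birth-death chain on the states 0 (loss), 1, ..., i-1, i, i+1, ..., 2N-1, 2N (fixation)]
Rates:
“Birth” ($i \to i+1$): $b_i = (2N - i)\cdot\frac{i}{2N}$
“Death” ($i \to i-1$): $d_i = i\cdot\frac{2N - i}{2N}$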
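The birth and death rates can be tabulated to see the structure of the chain. This is a sketch, assuming each of the 2N allele copies is replaced at rate 1 by a copy drawn uniformly from the current population; the function names are illustrative.

```python
def moran_rates(two_n):
    """Birth and death rates b[i], d[i] for i = 0 .. 2N copies of A."""
    # birth: one of the (2N - i) non-A copies is replaced by an A copy
    birth = [(two_n - i) * i / two_n for i in range(two_n + 1)]
    # death: one of the i copies of A is replaced by a non-A copy
    death = [i * (two_n - i) / two_n for i in range(two_n + 1)]
    return birth, death


def expected_drift(two_n):
    """Instantaneous rate of change of E[X] in each state i."""
    birth, death = moran_rates(two_n)
    return [b - d for b, d in zip(birth, death)]


birth, death = moran_rates(20)
drift = expected_drift(20)
```

Both rates vanish at i = 0 and i = 2N (the absorbing states), and they are equal everywhere, so the expected count does not change over time.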
Fixation probability
[Diagram: birth-death chain on the states 0 (loss), 1, ..., i-1, i, i+1, ..., 2N-1, 2N (fixation), with “birth” and “death” rates]
In fact, in the limit, the Moran model converges to the Wright-Fisher model, for example:
Theorem: Going backward in time, the Moran model generates the same distribution of genealogies as Wright-Fisher, except that time runs twice as fast.
Theorem: In the Moran model, the probability that A becomes fixed when there are initially $i$ copies is $i/2N$.
Proof: As in the proof for the Wright-Fisher model. The expected value of $X$ is unchanged, since births and deaths occur at the same rate.
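The fixation probability can be estimated by brute force. The following is a simulation sketch, assuming that at each replacement event one copy is removed uniformly at random and its replacement is drawn uniformly from the current population; the population size, starting count and number of trials are arbitrary illustrative choices.

```python
import random


def moran_fixes(two_n, start, rng):
    """Run Moran replacement events until A is lost or fixed."""
    copies = start
    while 0 < copies < two_n:
        dies_is_a = rng.random() < copies / two_n    # copy that is removed
        parent_is_a = rng.random() < copies / two_n  # sampled replacement
        copies += int(parent_is_a) - int(dies_is_a)
    return copies == two_n


rng = random.Random(7)
trials = 2000
fixed = sum(moran_fixes(20, 5, rng) for _ in range(trials))
estimate = fixed / trials  # compare with i / 2N = 5 / 20
```

Only the sequence of events matters for the fixation probability, so the simulation does not need to track the continuous waiting times between events.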
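Fixation time
Expected fixation time, conditional on fixation.
Theorem: In the Moran model, let $p = i/2N$ and let $\tau$ be the absorption time. Then
$E_i[\tau \mid \text{fixation}] \approx -\frac{2N(1-p)}{p}\,\log(1-p)$.
Proof: not here.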
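The approximation can be compared with an exact sum for a finite population. This sketch is not the proof from the lecture: it assumes Moran birth and death rates both equal to i(2N - i)/2N, and uses the occupation times of the chain conditioned on fixation (expected time 1 in each state at or above the start, and (2N - i) j / (i (2N - j)) in each state j below it), which is a working derivation added here for illustration. The approximate form is the standard -2N (1 - p) / p * log(1 - p).

```python
from math import log


def conditional_fixation_time(two_n, i):
    """Exact E[time to fixation | fixation] from i copies (Moran rates)."""
    # Conditioned on fixation, the chain spends expected time 1 in each
    # state j >= i and (2N - i) * j / (i * (2N - j)) in each state j < i.
    above = two_n - i
    below = sum((two_n - i) * j / (i * (two_n - j)) for j in range(1, i))
    return above + below


def approx_fixation_time(two_n, i):
    """Large-population approximation with p = i / 2N."""
    p = i / two_n
    return -two_n * (1 - p) / p * log(1 - p)


exact = conditional_fixation_time(2000, 200)   # p = 0.1
approx = approx_fixation_time(2000, 200)
```

As p goes to 0 the approximation tends to 2N Moran time units, which matches the order of 4N Wright-Fisher generations once the factor of two in time scale is taken into account.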