Detection of domestication genes and other loci under selection.


Search for genes that experienced artificial (and natural) selection. Akin in spirit to testing candidate genes for association or using genome scans to find QTLs. In linkage studies: use molecular markers to look for marker-trait associations (phenotypes). In tests for selection: use molecular markers to look for patterns of selection (patterns of within- and between-species variation).

Types of Genes that have experienced selection in crop/animal species Domestication genes: Alleles fixed in the course of the initial domestication Diversification/Improvement genes: Alleles fixed in the course of improvement following domestication. Adaptation genes: Alleles in natural populations responding to natural selection on environmental conditions (candidates to transfer into elite germplasms).

The general approaches for using sequence data to search for signs of selection:
Tests based on the pattern and amount of within-species polymorphism (departures from neutral predictions): on-going or recent selection.
Tests based on polymorphism plus between-species divergence: on-going or recent selection.
Tests based on phylogenetic comparisons between species: historical selection (won’t discuss these further).
Key: use features of variation at a marker locus to test for departures from strict neutrality.

A quick review of the neutral theory (expected patterns of variation under drift). Drift and the coalescence process (it's about time). Mutation-drift equilibrium (within-population variation): a function of population size and mutation rate, with expected variation $H = 4N_e\mu$. Divergence between populations (between-population variation): a function of time and mutation rate (but not population size), $d = 2t\mu$.

Mutation-Drift Equilibrium (Single Loci) Drift removes variation, while mutation introduces it. Thus, an equilibrium amount of genetic variance results While alleles change over time, heterozygosity remains roughly constant.

A very powerful way of thinking about drift is the Coalescent Process Instead of following alleles, think in terms of lineages. As a consequence of drift, eventually all current copies of alleles trace back to a single ancestral lineage. Hence, the current lineages coalesce as one moves back in time

From coalescent theory, the expected time back to the MRCA of two sequences is 2N generations. Hence, for two randomly chosen sequences, the expected number of mutations by which they differ is just $2\mu t = 2\mu(2N) = 4N\mu$. If $4N\mu \gg 1$, two random sequences will typically differ (and hence be heterozygotes). If $4N\mu \ll 1$, two random sequences will typically be identical (and hence be homozygotes).

Divergence Between Populations. Mutation and drift also generate a between-line variance, i.e., a population divergence. As lines separate, the initial heterozygosity is randomly partitioned, creating a between-line variance. More importantly, as new mutations arise in the separated lines, some of these are fixed by drift, and this drives a constant divergence between populations.

On average, for a population of size N, $2N\mu$ mutations arise each generation. For any of these, the probability of fixation is just $U(1/[2N]) = 1/(2N)$. Hence, the rate at which new mutations are fixed within a line is just (# new per generation) * Pr(fixation) $= 2N\mu \cdot 1/(2N) = \mu$. Each line therefore accumulates $\mu t$ substitutions in t generations, so the divergence between two lines separated for t generations is $d(t) = 2\mu t$. Independent of population size!

The major results from mutation-drift equilibrium:
Within-population variation: $4N_e\mu$
Rate of divergence per generation: $\mu$
Between-population variation: $2t\mu$
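The three results above can be checked numerically. This is a minimal sketch, not part of the original slides: the function name `neutral_expectations` and the parameter values in the usage lines are hypothetical, chosen only to show that the substitution rate does not change with population size while the within-population variation does.

```python
def neutral_expectations(n_e, mu, t):
    """Return (theta, substitution rate per generation, divergence after t generations)."""
    theta = 4 * n_e * mu              # within-population variation
    new_mutations = 2 * n_e * mu      # new mutations arising each generation
    p_fix = 1 / (2 * n_e)             # fixation probability of a new neutral mutation
    rate = new_mutations * p_fix      # equals mu, whatever n_e is
    divergence = 2 * t * rate         # two lines, each accumulating rate * t
    return theta, rate, divergence

# Hypothetical values: the rate and divergence stay the same as n_e changes.
for n_e in (100, 10_000):
    print(n_e, neutral_expectations(n_e, mu=1e-8, t=5000))
```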

Logic behind polymorphism-based tests. Key: time to MRCA relative to drift. If a locus is under positive selection, the MRCA is more recent (shorter coalescent). If a locus is under balancing selection, the MRCA is older relative to drift (deeper coalescent). Shorter coalescent = lower levels of variation, longer blocks of disequilibrium. Deeper coalescent = higher levels of variation, shorter blocks of disequilibrium.

Figure: selection changes coalescent times. The time to the MRCA for the individuals sampled is drawn from the present back to the past for three cases: balancing selection (longer time back to MRCA), neutral, and selective sweep (shorter time back to MRCA).

Selective sweeps result in a local decrease in $N_e$ around the selected site. This results in a shorter time to MRCA and a decrease in the amount of polymorphism. Note that this has no effect on the rate of divergence of neutral sites, as this is independent of $N_e$. Conversely, balancing selection increases the effective population size, increasing the amount of polymorphism.

A scan of levels of polymorphism can thus suggest sites under selection. Figure: variation plotted against map location. A local dip in variation suggests directional selection (a selective sweep) or a local region with a reduced mutation rate. A local peak suggests balancing selection or a local region with an elevated mutation rate.

Example: maize domestication gene tb1. There were major changes in plant architecture in the transition from teosinte to maize. The Doebley lab identified a gene, teosinte branched1 (tb1), involved in many of these architectural changes. Wang et al. (1999) observed a significant decrease in genetic variation in the 5’ NTR region of tb1, suggesting a selective sweep influenced this region. The sweep did not influence the coding region.

Wang et al (1999) Nature 398: 236.

Clark et al. (2004) examined the 5’ tb1 region in more detail, finding evidence for a sweep influencing a region of 60-90 kb.
Clark et al (2004) PNAS 101: 700.

Wang et al. and Clark et al. controlled for the reduction in neutral polymorphisms being due simply to reduced mutation rate by using a close relative (teosinte) as a control. The process of domestication itself is expected to reduce variation genome-wide because of the population bottleneck that is typically induced during domestication. In maize, the background level of polymorphism (genome wide) is only about 75% of that of teosinte.

Estimating strength of selection from size of sweep region. Kaplan, Hudson, and Langley (1989) showed that the distance d at which a neutral site can be influenced by a sweep is a function of the strength of selection s and the recombination fraction c, with $d \approx 0.01\, s/c$. Hence, $s \approx 100\, d\, c$. For tb1, $s \approx 0.05$. With s in hand, one can also estimate the expected time for selection to fix the allele, which Wang et al. estimated at 300 to 1000 years, indicating a fairly long period of domestication.
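The Kaplan et al. relation is easy to invert numerically. The sketch below assumes d is measured in base pairs and c is the recombination rate per base pair; the slides do not state the units, so treat that as an assumption. The function name and the input values are hypothetical and are not the values used by Wang et al.

```python
def selection_from_sweep(d_bp, c_per_bp):
    """Invert d ~ 0.01 * s / c to get s ~ 100 * d * c."""
    return 100.0 * d_bp * c_per_bp

# Hypothetical inputs, chosen only to illustrate the arithmetic.
s = selection_from_sweep(d_bp=50_000, c_per_bp=1e-8)
print(f"s is roughly {s:.3f}")
```

Because s scales linearly with both d and c, an error in the assumed local recombination rate carries straight through to the estimate of s.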

Example: Waxy gene in rice (Olsen et al. 2006). “Sticky” (glutinous) rice results from low amylose levels and is typical of temperate japonica variety groups. A number of groups showed this is due to a splice mutant in the Waxy gene. This is an example of an improvement (as opposed to domestication) gene. Olsen et al. observed a region 250 kb in size around Waxy with a greatly reduced level of polymorphism compared to control populations. Using the Kaplan et al. expression, this gives s = 4.6!

While the sweep around tb1 did not even influence the coding region of that gene, the Waxy sweep covers 39 rice genes! One evolutionary consequence of a sweep is that the reduction in effective population size (which produces the signal of a sweep) also reduces the efficiency of selection on linked genes within the region (the Hill-Robertson effect). Deleterious alleles have a higher probability of fixation. Favorable alleles have a reduced probability of fixation.

Accumulation of deleterious mutations in domesticated rice genomes? Lu et al. (2006) compared the genomes of Oryza sativa ssp. indica and japonica with their ancestral relative O. rufipogon. The $K_a/K_s$ ratio (the substitution rate of non-synonymous relative to synonymous changes) was much higher for indica vs. japonica (0.498) than for domesticated vs. wild rice (japonica vs. rufipogon, 0.259). Lu et al. suggest that roughly 25% of the amino acid differences between indica and japonica are deleterious. They suggest that excessive reductions in $N_e$, due to selective sweeps covering much of the genome during selection for domestication, greatly reduced the efficiency of natural selection in removing deleterious alleles.

Formal tests of selection:
Tajima’s D. Requires: single-locus, within-population polymorphism data.
McDonald-Kreitman test. Requires: coding region, data from 2 species (within-population variation, between-species divergence).
Hudson-Kreitman-Aguade (HKA) test. Requires: at least two loci, data from 2 species (within-population variation, between-species divergence).
Allele frequency vs. LD tests. Requires: dense marker scan around a single locus using within-population data.

Tests based on Within-Population Variation Two sequence evolution frameworks are typically used: infinite alleles vs. infinite sites. These tend to compare different measures of variation (such as number of alleles vs. pair-wise distances among alleles) Both assume each new mutation generates a new (unique) sequence. (such is not the case for STRs) How do these frameworks differ?

Consider the following five sequences:

Sequence 1: A A G A C C   (allele 1)
Sequence 2: A A G G C C   (allele 2)
Sequence 3: A A G A C C   (allele 1)
Sequence 4: A A G G C C   (allele 2)
Sequence 5: A A G G C A   (allele 3)
                  *   *

Infinite alleles model: treat each different haplotype as a different allele (look at rows). Here, there are three alleles.
Infinite sites model: treat each site (base position) separately. How many polymorphic sites are there? (look over columns) Here, there are 2 polymorphic sites (marked *).
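The two ways of reading the same five sequences can be sketched in a few lines. This is an illustrative example only; the function names are hypothetical.

```python
sequences = ["AAGACC", "AAGGCC", "AAGACC", "AAGGCC", "AAGGCA"]

def count_alleles(seqs):
    """Infinite alleles view: each distinct haplotype (row) is one allele."""
    return len(set(seqs))

def segregating_sites(seqs):
    """Infinite sites view: 0-based positions (columns) with more than one base."""
    return [i for i, column in enumerate(zip(*seqs)) if len(set(column)) > 1]

print(count_alleles(sequences))      # 3
print(segregating_sites(sequences))  # [3, 5]
```

The positions are 0-based, so `[3, 5]` corresponds to the fourth and sixth bases.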

Two typical classes of departures are seen with polymorphism data:
1: An excess of rare alleles, a deficiency of intermediate-frequency alleles (alleles younger than expected).
2: An excess of intermediate-frequency alleles, a deficiency of rare alleles (alleles older than expected).
Pattern 1 is expected under a selective sweep, when coalescent times are shorter than expected. Pattern 2 is expected under balancing selection, when coalescent times are longer than expected.

Summary Statistics for Infinite Sites Model. The key parameter is $\theta = 4N_e\mu$.
S, number of segregating sites: $E(S) = a_n\theta$, where $a_n = \sum_{i=1}^{n-1} \frac{1}{i}$
k, average number of pairwise differences: $E(k) = \theta$
$\eta$, number of singletons: $E(\eta) = \theta\, n/(n-1)$
These suggest the following three estimates for $\theta$:
$\hat{\theta}_S = S/a_n$; $\hat{\theta}_k = k$; $\hat{\theta}_\eta = \frac{n-1}{n}\,\eta$
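The three estimators can be computed directly from aligned sequences. This is a minimal sketch under one stated assumption: a singleton is counted as a polymorphic site whose rarest base appears in exactly one sequence. The function name `theta_estimates` is hypothetical.

```python
from itertools import combinations

def theta_estimates(seqs):
    """Estimate theta from S, k and eta for equal-length aligned sequences."""
    n = len(seqs)
    polymorphic = [col for col in zip(*seqs) if len(set(col)) > 1]
    S = len(polymorphic)                        # number of segregating sites
    a_n = sum(1.0 / i for i in range(1, n))     # a_n = sum_{i=1}^{n-1} 1/i
    pairs = list(combinations(seqs, 2))
    # k: average number of differences over all pairs of sequences
    k = sum(sum(x != y for x, y in zip(s1, s2)) for s1, s2 in pairs) / len(pairs)
    # eta: sites where the rarest base is carried by a single sequence (assumption)
    eta = sum(1 for col in polymorphic if min(col.count(b) for b in set(col)) == 1)
    return {"theta_S": S / a_n, "theta_k": k, "theta_eta": (n - 1) / n * eta}

print(theta_estimates(["AAGACC", "AAGGCC", "AAGACC", "AAGGCC", "AAGGCA"]))
```

Under neutrality the three values estimate the same quantity, so large disagreements among them are what the polymorphism-based tests pick up.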

Tajima’s D test. One of the first, and most popular, polymorphism tests was Tajima’s D test (Tajima 1989). D contrasts estimates of $\theta$ based on S vs. k:
$D = \dfrac{\hat{\theta}_k - \hat{\theta}_S}{\sqrt{\alpha_D S + \beta_D S^2}}$
where $\alpha_D$ and $\beta_D$ are constants that depend on the sample size n. Idea: for S we simply count sites, independent of their frequencies. Hence, S is rather sensitive to changes in the frequency of rare alleles.

On the other hand, k is a more frequency-weighted measure, and hence more sensitive to changes in the frequency of intermediate alleles.
D < 0: too many rare alleles. Selective sweep or population expansion. MRCA more recent than expected.
D > 0: too many intermediate-frequency alleles. Balancing selection or population subdivision. MRCA more ancient than expected.
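A worked version of the statistic may help. The sketch below is an assumption-laden illustration: it uses the variance constants from Tajima (1989), which write the denominator as $e_1 S + e_2 S(S-1)$ rather than the $\alpha_D S + \beta_D S^2$ shorthand on the slide. The function name `tajimas_d` is hypothetical.

```python
import math

def tajimas_d(n, S, k):
    """Tajima's D from sample size n, segregating sites S, mean pairwise differences k."""
    if S == 0:
        return float("nan")                 # D is undefined without polymorphism
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    # numerator: theta_k minus theta_S; denominator: estimated standard deviation
    return (k - S / a1) / math.sqrt(e1 * S + e2 * S * (S - 1))

# Five sequences with S = 2 and k = 1.0 give a small positive D.
print(round(tajimas_d(n=5, S=2, k=1.0), 3))
```

With only five sequences and two segregating sites the value is far too noisy to interpret; the example only shows the arithmetic.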

D tests whether the amount of heterozygosity (measured by k) is consistent with the number of polymorphic sites (S). Under selective sweeps or population expansion, heterozygosity should be significantly less than predicted from the number of polymorphic sites.

Major Complication With Polymorphism-based tests Demographic factors can also cause these departures from neutral expectations! Too many young alleles -> recent population expansion Too many old alleles -> population substructure Thus, there is a composite alternative hypothesis, so that rejection of the null does not imply selection. Rather, selection is just one option.

Can we overcome this problem? It is an important one, as only polymorphism-based tests can indicate on-going selection. Solution: demographic events should leave a constant signature across the genome. Essentially, all loci experience common demographic factors. Genome scan approach: look at a large number of markers. These generate a null distribution (most are not under selection); outliers = potentially selected loci (genome-wide polymorphism tests).
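The genome-scan idea can be sketched as an empirical-outlier filter: compute one statistic per locus (for example Tajima's D), treat the genome-wide values as the null distribution, and flag the tails. The function name, the locus names, and the 2.5% tail cutoff below are all hypothetical choices, not something specified in the slides.

```python
def flag_outliers(stats, tail=0.025):
    """Return loci whose statistic lies in the lower or upper empirical tail."""
    ranked = sorted(stats.values())
    n = len(ranked)
    cut = int(tail * n)            # number of loci allowed in each tail
    lower = ranked[cut]            # values strictly below this are flagged
    upper = ranked[n - 1 - cut]    # values strictly above this are flagged
    return {name: v for name, v in stats.items() if v < lower or v > upper}

# Hypothetical per-locus statistics for 100 loci.
scan = {f"locus{i}": float(i) for i in range(100)}
print(sorted(flag_outliers(scan)))
```

Flagged loci are only candidates: by construction a fixed fraction of loci is flagged even when nothing is under selection.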

Joint Polymorphism-Divergence tests. Under the neutral theory, heterozygosity is a function of $\theta = 4N_e\mu$, while divergence is a function of $\mu t$. Joint polymorphism-divergence tests use these two different expectations to look for concordance with neutral results. For example, under neutrality, levels of polymorphism and divergence should be positively correlated.

Under neutrality, the ratio of polymorphism to divergence at the i-th locus is just
$\dfrac{H_i}{d_i} = \dfrac{4N_e\mu_i}{2t\mu_i} = \dfrac{2N_e}{t}$
Hence, for a series of neutral loci compared in the same populations, this ratio should be very similar. The very popular Hudson, Kreitman and Aguade (1987), or HKA, test is based on this idea, using a series of control (neutral) loci to contrast with the locus of interest.

McDonald-Kreitman Test. One of the most straightforward tests of selection that jointly uses divergence and polymorphism data was proposed by McDonald and Kreitman (1991). Consider the replacement and synonymous sites at a single locus:
$\dfrac{d_{syn}}{d_{rep}} = \dfrac{2t\mu_{syn}}{2t\mu_{rep}} = \dfrac{\mu_{syn}}{\mu_{rep}}$
$\dfrac{H_{syn}}{H_{rep}} = \dfrac{4N_e\mu_{syn}}{4N_e\mu_{rep}} = \dfrac{\mu_{syn}}{\mu_{rep}}$
These ratios have the same expected value.

Since these ratios have the same expected value, the McDonald-Kreitman test proceeds via a simple contingency table contrasting polymorphism vs. divergence at replacement vs. synonymous sites. Key feature: The McDonald-Kreitman test is NOT affected by demography

Example: McDonald & Kreitman looked at the Adh (alcohol dehydrogenase) locus in D. melanogaster and D. simulans. There are 24 fixed differences (7 replacement, 17 synonymous) and 44 polymorphisms (2 replacement, 42 synonymous):

              Fixed   Polymorphic
Replacement     7          2
Synonymous     17         42

Fisher’s exact test gives p = 0.0073.
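The contingency-table test can be reproduced with the standard library alone. This is a minimal sketch of a two-sided Fisher's exact test (summing the probabilities of all tables no more likely than the observed one); the function name is hypothetical, and for the Adh counts it should land close to the p = 0.0073 quoted above.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c

    def weight(x):
        # Unnormalised hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    observed = weight(a)
    # Exact integer comparison: keep tables no more probable than the observed one.
    extreme = sum(weight(x) for x in range(lo, hi + 1) if weight(x) <= observed)
    return extreme / comb(row1 + row2, col1)

# Rows: replacement, synonymous. Columns: fixed, polymorphic.
p = fisher_exact_two_sided(7, 2, 17, 42)
print(f"p = {p:.4f}")
```

Working with integer binomial coefficients avoids floating-point ties when deciding which tables count as at least as extreme.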

Linkage Disequilibrium (LD). LD arises when allele frequencies alone cannot predict gametic (i.e., chromosomal) frequencies, $\mathrm{Freq}(AB) \neq \mathrm{freq}(A)\,\mathrm{freq}(B)$. It is measured by $D = \mathrm{Freq}(AB) - \mathrm{freq}(A)\,\mathrm{freq}(B)$. When a new mutation appears, it starts in complete LD with the haplotype within which it arose. Over time, recombination decays away much of this block of LD: $D(t) = (1-c)^t D(0)$, where c is the recombination fraction.
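The decay formula gives a quick feel for how fast recombination erodes LD. This is an illustrative sketch; the function names and the numeric inputs are hypothetical.

```python
import math

def ld_after(d0, c, t):
    """Disequilibrium after t generations: D(t) = (1 - c)**t * D(0)."""
    return (1.0 - c) ** t * d0

def generations_to_fraction(c, fraction):
    """Generations needed for D to fall to the given fraction of D(0)."""
    return math.log(fraction) / math.log(1.0 - c)

print(ld_after(d0=0.25, c=0.5, t=2))                      # unlinked loci: D halves each generation
print(round(generations_to_fraction(c=0.01, fraction=0.5), 1))
```

Tightly linked markers (small c) keep their LD for many generations, which is why a young, high-frequency allele sits on a long haplotype.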

Figure: the starting haplotype, with allele frequency (freq) plotted against time. Under pure drift, high-frequency alleles should have short haplotypes.

Linkage Disequilibrium Decay. One feature of a selective sweep is derived alleles at high frequency. Under neutrality, older alleles are at higher frequencies. Sabeti et al. (2002) note that under a sweep such high-frequency young alleles should (because of their recent age) have much longer regions of LD than expected. Wang et al. (2006) proposed a Linkage Disequilibrium Decay, or LDD, test that looks for excessive LD for high-frequency alleles. Wang et al. used this approach with 1.6 million human SNPs, finding that 1.6% of the markers showed some signatures of positive selection.

Simulation studies by Wang et al. showed that the LDD test effectively distinguishes selection from population bottlenecks and admixture. All genome-based tests have an important caveat. The large number of markers used are typically generated by looking for polymorphisms in a very small, and often not very ethnically-diverse, sample Results in a strong ascertainment bias, for example, an excess of intermediate-frequency markers If such biases are not accounted for, they can skew test results.

Caveats and Unanswered Questions Even if they have experienced very strong selection, domestication genes may not leave a strong signal at linked neutral markers. Must be sufficient background variation for the chance of a sweep being detected. Hamblin et al. (2006) found that the genome-wide background variation in Sorghum is too low to reliably detect signatures of selection. Likely from extreme bottleneck during domestication. If the ancestral species itself had low variation, would also be very difficult to detect selective sweeps.

A more subtle complication results from the frequency of favorable alleles at the start of the domestication process. A typical adaptive selective sweep is generally thought to occur following the introduction of a single favorable new mutation. Hence, there is only one founding haplotype at the time of selection. Selection on domestication alleles is akin to a sudden shift in the environment, with many of these alleles pre-existing in the population before domestication. If the frequency of any such allele is > 0.05, multiple haplotypes are likely present, resulting in considerable variation around the selected site even after fixation, and hence a very weak (if any) signal.

Hence, there is the very real possibility that many important domestication genes will not have left a detectable signature in the pattern of linked neutral variation.

Optimal conditions for detecting selection:
High levels of polymorphism at the start of selection.
High effective levels of recombination give a shorter window around the selected site.
High levels of selfing reduce the effective recombination rate (e.g., maize vs. rice).
Signatures of sweeps persist for roughly $N_e$ generations.

Domestication vs. improvement genes. Domestication genes will leave a signal in all lines, while improvement genes may leave a line-specific signal. Unresolved question: is selection stronger on domestication or improvement genes?
Maize, domestication gene tb1: 90 kb sweep, s = 0.05
Maize, improvement gene Y1: 600 kb sweep, s = 1.2

Summary Linkage mapping vs. detection of selected loci Linkage: Know the target phenotype Selection: Don’t know the target phenotype Both can suffer from low power and confounding from demographic effects Both can significantly benefit from high-density genomic scans, but these are also not without problems.

U of A Campus Farewell from the “desert”

Searches for regions under selection complement standard linkage-based approaches for QTL detection (line crosses, association mapping). Using QTL approaches to find domestication genes requires making crosses of wild progenitor x domesticated lines. Localizing adaptation genes to a particular environment via a standard QTL cross is very difficult, as one would miss potential pathways to adaptation by focusing only on candidate phenotypes thought of by the investigator.

If $N_e$ is the effective population size and $\mu$ the mutation rate, Crow & Kimura showed the equilibrium heterozygosity is given by
$H = \dfrac{4N_e\mu}{4N_e\mu + 1}$
Thus, H depends on population size and mutation rate only through their product. The parameter $4N_e\mu$ is a fundamental one in molecular evolution and is often denoted by $\theta$.
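A short numeric check shows how H behaves at the two extremes of $4N_e\mu$. This is a minimal sketch; the function name and the parameter values are hypothetical.

```python
def equilibrium_heterozygosity(n_e, mu):
    """Crow & Kimura equilibrium heterozygosity: H = theta / (theta + 1)."""
    theta = 4.0 * n_e * mu
    return theta / (theta + 1.0)

# Hypothetical values spanning theta << 1, theta = 1 and theta >> 1.
for n_e, mu in [(1_000, 1e-8), (2_500, 1e-4), (1_000_000, 1e-4)]:
    print(n_e, mu, round(equilibrium_heterozygosity(n_e, mu), 5))
```

When $4N_e\mu$ is small, H is close to $4N_e\mu$ itself and most pairs of sequences are identical; when it is large, H approaches 1 and most pairs differ.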

Genome-Wide Polymorphism Tests As mentioned, general problem with polymorphism tests is that demographic signals can also give the same pattern as selection. Cavalli-Sforza (1966) was among the first to note that demography effects all genomic locations (roughly) equally, while the effects of selection are unique to a particular locus With the advent of very dense marker sets, we are now seeing genome-wide scans over all markers. Idea: Most are not under selection and hence reflect the common demographic features. Outliers against this pattern suggest selection.

MRCA = most recent common ancestor

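Coalescent theory provides an easy way to see why $4N_e\mu$ appears. For two random sequences within a population, the expected time back to their common ancestor is $t = 2N_e$ generations. Each of the two lineages accumulates $t\mu$ mutations, so the expected number of mutations separating them is $2t\mu = 4N_e\mu$.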
