INSERM Workshop – La Londe Les Maures – May 2004
DETECTING SELECTION FROM DNA SEQUENCE POLYMORPHISM DATA
N. GALTIER
CNRS UMR 5171 – Génome, Populations, Interactions, Adaptation
Université Montpellier 2, France
galtier@univ-montp2.fr
SEQUENCE POLYMORPHISM DATA
A sample of 5 genes is drawn from a population (species) at one DNA fragment (locus):
....ACGGATAGTTAGTGACGATA...
....ACGTATAGCTAGTGACGATA...
....ACGGATAGCTAGTGACGATA...
....ACGGATAGCTAGTGACGATC...
site   *    *          *
- 3 polymorphic (segregating) sites
- 4 distinct sequences (haplotypes)
SEQUENCE POLYMORPHISM DATA
Sample of 5 genes from the population (species), one DNA fragment (locus):
....ACGGATAGTTAGTGACGATA...
....ACGTATAGCTAGTGACGATA...
....ACGTATAGCTAGTGACGATA...
....ACGGATAGCTAGTGACGATA...
....ACGGATAGCTAGTGACGATC...
Outgroup:
....CCAGCTAGCTACTGAAGTTG...
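The two summary counts used throughout (segregating sites and haplotypes) can be obtained directly from an alignment. The following is a minimal sketch, assuming the five ingroup sequences shown above are aligned and of equal length; the function names are illustrative and not taken from any package.

```python
def segregating_sites(seqs):
    """Return the 0-based positions at which the aligned sequences differ."""
    return [i for i, column in enumerate(zip(*seqs)) if len(set(column)) > 1]


def haplotypes(seqs):
    """Return the set of distinct sequences found in the sample."""
    return set(seqs)


# The five sampled genes from the slide (outgroup excluded).
sample = [
    "ACGGATAGTTAGTGACGATA",
    "ACGTATAGCTAGTGACGATA",
    "ACGTATAGCTAGTGACGATA",
    "ACGGATAGCTAGTGACGATA",
    "ACGGATAGCTAGTGACGATC",
]

sites = segregating_sites(sample)
print("segregating sites:", len(sites), "at positions", sites)
print("haplotypes:", len(haplotypes(sample)))
```

On this sample the sketch recovers the 3 segregating sites and 4 haplotypes quoted on the previous slide.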
MUTATIONS SEGREGATING IN A POPULATION (1): NEUTRAL
[Figure: mutant allele frequency (0 to 1) plotted against time; the sample is taken at the present]
N: effective population size
m: mutation rate
- Mutations (black dots) arise at rate 2N.m
- Under neutrality, a new mutation reaches fixation with probability 1/2N
- This results in a neutral substitution rate of 2N.m / 2N = m (red dots)
The amount of polymorphism in the population at mutation-drift equilibrium is determined by the N.m product, usually measured as θ = 4N.m
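The 1/2N fixation probability can be checked by simulation. The sketch below assumes a Wright-Fisher population (non-overlapping generations, binomial sampling of 2N chromosomes), which the slide does not specify; the population size, number of replicates and seed are arbitrary illustrative values.

```python
import random


def neutral_fixation_fraction(two_n, replicates, seed=0):
    """Fraction of new neutral mutations that reach fixation by drift alone."""
    rng = random.Random(seed)
    fixed = 0
    for _ in range(replicates):
        count = 1  # a new mutation starts as a single copy among 2N
        while 0 < count < two_n:
            freq = count / two_n
            # Next generation: each of the 2N chromosomes picks a parent at random.
            count = sum(rng.random() < freq for _ in range(two_n))
        fixed += count == two_n
    return fixed / replicates


two_n = 20
fraction = neutral_fixation_fraction(two_n, replicates=5000)
print("simulated:", fraction, "expected 1/2N:", 1 / two_n)
```

Multiplying this probability by the 2N.m new mutations entering per generation gives the substitution rate m, independent of N.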
MUTATIONS SEGREGATING IN A POPULATION (2)
[Figure: mutant allele frequency (0 to 1) plotted against time, NEUTRAL panel versus PURIFYING SELECTION panel]
Purifying (= negative) selection results in:
- a decreased substitution rate
- a decreased amount of polymorphism
- lower allele frequencies
MUTATIONS SEGREGATING IN A POPULATION (3)
[Figure: mutant allele frequency (0 to 1) plotted against time, NEUTRAL panel versus ADAPTIVE SELECTION panel]
Adaptive (= positive) selection results in:
- an increased substitution rate
- a decreased amount of polymorphism
- higher allele frequencies
LINKAGE AND HITCH-HIKING
[Figure: SELECTIVE SWEEP at a selected locus linked to the sampled neutral locus]
Directional selection decreases polymorphism at linked (neighbouring) neutral sites by increasing the apparent drift.
LINKAGE AND HITCH-HIKING
[Figure: SELECTIVE SWEEP at a selected locus linked to the sampled neutral locus, with recombination]
Recombination reduces the effect of selection at neighbouring loci.
DETECTING SELECTION BY SEEKING REGIONS OF "LOW" POLYMORPHISM
Selection reduces polymorphism, but the level of polymorphism is also determined by other factors, including population size and mutation rate. To conclude that selection is acting, one must control for these nuisance factors.
Example: the sliding window strategy
[Figure: polymorphism (π) plotted along the DNA fragment; a region of low π is annotated "selection or reduced mutation bias?"]
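A sliding window scan can be sketched as follows, assuming π is measured as the mean number of pairwise differences per site and reusing the five-gene sample from the opening slides. The window size and step are arbitrary illustrative values; a real scan would use far longer fragments.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Mean number of pairwise differences per site (pi)."""
    n_pairs = len(seqs) * (len(seqs) - 1) // 2
    diffs = sum(
        sum(a != b for a, b in zip(s, t)) for s, t in combinations(seqs, 2)
    )
    return diffs / n_pairs / len(seqs[0])


def sliding_window_pi(seqs, window, step):
    """Return (window start, pi) for each window along the alignment."""
    length = len(seqs[0])
    result = []
    for start in range(0, length - window + 1, step):
        chunk = [s[start:start + window] for s in seqs]
        result.append((start, nucleotide_diversity(chunk)))
    return result


sample = [
    "ACGGATAGTTAGTGACGATA",
    "ACGTATAGCTAGTGACGATA",
    "ACGTATAGCTAGTGACGATA",
    "ACGGATAGCTAGTGACGATA",
    "ACGGATAGCTAGTGACGATC",
]

for start, pi in sliding_window_pi(sample, window=10, step=5):
    print(f"window starting at {start}: pi = {pi:.3f}")
```

A dip in such a profile is only a candidate: as the slide stresses, it does not by itself separate selection from a locally lower mutation rate.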
HITCH-HIKING MAPPING
[Table: polymorphism levels for LOCI A to F (rows, distinct m's) in POPULATIONS 1 to 5 (columns, distinct N's). Cell positions were lost in extraction; values as extracted: 0.05, 0.07, 0.20, 0.13, 0.05, 0.06, 0.10, 0.11, 0.03]
A selective sweep occurred at locus D in population 3.
The low amount of polymorphism at locus D, pop 3 cannot be explained by:
- reduced population size (other loci show high polymorphism in pop 3)
- low mutation rate (other pops show high polymorphism at locus D)
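One way to formalise this reasoning is to remove a locus effect (mutation rate) and a population effect (population size) from the log of each polymorphism value, then look for the cell that falls furthest below expectation. The sketch below is an illustration of that idea, not the procedure used in the talk. All numbers are hypothetical (they are not the slide's table), and the multiplicative model is an assumption.

```python
import math

loci = ["A", "B", "C", "D", "E", "F"]
pops = [1, 2, 3, 4, 5]

# Hypothetical relative mutation rates (per locus) and population sizes (per pop).
locus_effect = {"A": 1.0, "B": 1.4, "C": 2.0, "D": 1.2, "E": 1.1, "F": 0.6}
pop_effect = {1: 1.0, 2: 0.5, 3: 1.5, 4: 0.6, 5: 0.9}

# Polymorphism proportional to N.m, with a sweep imposed at locus D, pop 3.
pi = {(l, p): 0.08 * locus_effect[l] * pop_effect[p] for l in loci for p in pops}
pi[("D", 3)] /= 10


def log_residuals(table, rows, cols):
    """Residuals of log(pi) after removing row, column and overall means."""
    logs = {key: math.log(value) for key, value in table.items()}
    row_mean = {r: sum(logs[(r, c)] for c in cols) / len(cols) for r in rows}
    col_mean = {c: sum(logs[(r, c)] for r in rows) / len(rows) for c in cols}
    grand = sum(logs.values()) / len(logs)
    return {
        (r, c): logs[(r, c)] - row_mean[r] - col_mean[c] + grand
        for r in rows
        for c in cols
    }


residuals = log_residuals(pi, loci, pops)
outlier = min(residuals, key=residuals.get)
print("most depleted locus/population:", outlier)
```

With real data the residuals would also contain sampling noise, so a threshold or a simulation-based null distribution would be needed before calling a sweep.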
THE HKA TEST
[Figure: genealogies of the focal species and the outgroup at Locus A and at Locus B]
The reduced amount of polymorphism at locus B cannot be explained by:
- reduced population size (locus A shows high polymorphism)
- low mutation rate (the distance to outgroup is not reduced)
Selection has influenced polymorphism at one of the two loci.
THE McDONALD-KREITMAN TEST
[Figure: focal species and outgroup, with changes classified as synonymous or non-synonymous and as polymorphic or fixed; counts shown: 5, 4, 2, 8]
The ratio of non-synonymous to synonymous changes is higher between species (divergence) than within species (polymorphism), whereas the two ratios should be equal under neutrality: positive selection has promoted the fixation of non-synonymous changes.
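The comparison of the two ratios amounts to a test of independence on a 2x2 table. The sketch below uses a two-sided Fisher exact test written with the standard library only. The assignment of the slide's four counts to cells (5 synonymous and 2 non-synonymous polymorphisms, 4 synonymous and 8 non-synonymous fixed differences) is an assumption, since the table layout did not survive extraction.

```python
from math import comb


def mk_test(ps, pn, ds, dn):
    """McDonald-Kreitman comparison.

    ps, pn: synonymous / non-synonymous polymorphisms within species
    ds, dn: synonymous / non-synonymous fixed differences between species
    Returns (pn/ps, dn/ds, two-sided Fisher exact p-value).
    Requires ps > 0 and ds > 0 for the ratios to be defined.
    """
    polymorphic = ps + pn
    synonymous = ps + ds
    total = ps + pn + ds + dn

    def weight(a):
        # Number of tables with `a` synonymous polymorphisms, margins fixed.
        return comb(synonymous, a) * comb(total - synonymous, polymorphic - a)

    lowest = max(0, polymorphic - (total - synonymous))
    highest = min(polymorphic, synonymous)
    observed = weight(ps)
    # Two-sided p-value: all tables no more probable than the observed one.
    extreme = sum(
        weight(a) for a in range(lowest, highest + 1) if weight(a) <= observed
    )
    return pn / ps, dn / ds, extreme / comb(total, polymorphic)


within, between, p_value = mk_test(ps=5, pn=2, ds=4, dn=8)
print(f"within species: {within:.2f}, between species: {between:.2f}, p = {p_value:.3f}")
```

The ratio is higher between species under this reading of the counts, but the p-value should be checked before drawing a conclusion from a table this small.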
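COALESCENCE THEORY: FOCUSING ON SAMPLE GENEALOGY
[Figure: a population of 2N chromosomes followed through generations 1, 2, 3, ..., k.N along the Time axis, highlighting the genealogy of the sample]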
COALESCENCE THEORY: THE STANDARD COALESCENT
The genealogy of a sample of size n at a neutral locus in a panmictic population of constant size 2N should be like:
[Figure: genealogy with successive coalescence intervals T5, T4, T3, T2; labels: 2N (on average) for T2, 4N (on average) for the depth of the tree]
where
- all topologies are equiprobable
- coalescence times Ti's are exponential random variables of expectation E(Ti) = 4N/(i.(i-1))
- mutations are superimposed onto the genealogy according to a Poisson process
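The expected coalescence times follow directly from E(Ti) = 4N/(i.(i-1)), and individual genealogies can be drawn by sampling each Ti from an exponential distribution with that mean. A minimal sketch, with N, n, the number of replicates and the seed chosen arbitrarily for illustration:

```python
import random


def expected_times(n, N):
    """Expected duration of each interval with i lineages, in generations."""
    return {i: 4 * N / (i * (i - 1)) for i in range(n, 1, -1)}


def simulate_times(n, N, rng):
    """Draw one realisation of the coalescence intervals Tn, ..., T2."""
    # expovariate takes a rate, the inverse of the mean 4N/(i(i-1)).
    return {i: rng.expovariate(i * (i - 1) / (4 * N)) for i in range(n, 1, -1)}


n, N = 5, 1000
expectation = expected_times(n, N)
print("E(T2) =", expectation[2], "generations (2N)")
print("expected tree depth =", sum(expectation.values()), "generations")

rng = random.Random(1)
draws = [simulate_times(n, N, rng)[2] for _ in range(20000)]
mean_t2 = sum(draws) / len(draws)
sd_t2 = (sum((t - mean_t2) ** 2 for t in draws) / len(draws)) ** 0.5
print(f"simulated T2: mean = {mean_t2:.0f}, standard deviation = {sd_t2:.0f}")
```

The expected depth sums to 4N(1 - 1/n), which approaches 4N for large samples. Because T2 is exponential, its standard deviation is as large as its mean, so single genealogies vary widely around these expectations.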
THE COALESCENCE PROCESS HAS A HIGH VARIANCE
[Figure: T2 distribution; two realisations of the coalescent with equal Tn, Tn-1, ..., T3, but distinct T2]
DEPARTURE FROM NEUTRALITY: THE SELECTIVE SWEEP EXAMPLE
[Figure: SELECTIVE SWEEP at a selected locus linked to the sampled neutral locus; neutral genealogy compared with the genealogy after the sweep]
"Complete" selective sweep: star-like genealogy
DEPARTURE FROM NEUTRALITY: THE SELECTIVE SWEEP EXAMPLE
[Figure: SELECTIVE SWEEP at a selected locus linked to the sampled neutral locus; neutral genealogy compared with the genealogy after the sweep]
"Partial" selective sweep: partly star-like genealogy
DEPAULIS' HAPLOTYPE TEST
[Figure: neutral genealogy with 9 polymorphic sites and 8 haplotypes; "partial" selective sweep (partly star-like genealogy) with 9 polymorphic sites and 3 haplotypes]
A partially star-like genealogy results in a number of haplotypes lower than expected given the number of polymorphic sites.
Other test statistics aiming at detecting non-neutral shapes of genealogy have been proposed: Tajima's D, Fu and Li's F, Fay and Wu's H, ...
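As an example of such a statistic, Tajima's D contrasts two estimates of θ: the mean number of pairwise differences, and the number of segregating sites S divided by a1 = 1 + 1/2 + ... + 1/(n-1). The sketch below follows the standard published formula for the normalising variance; the two toy samples are made up for illustration.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(seqs):
    """Tajima's D for a list of aligned, equal-length sequences."""
    n = len(seqs)
    s = sum(len(set(column)) > 1 for column in zip(*seqs))
    if s == 0:
        raise ValueError("D is undefined without segregating sites")
    n_pairs = n * (n - 1) / 2
    # Mean number of pairwise differences over the whole fragment.
    pi = sum(
        sum(a != b for a, b in zip(x, y)) for x, y in combinations(seqs, 2)
    ) / n_pairs
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# Hypothetical star-like sample: every mutation is carried by a single gene.
star_like = ["TAAA", "ATAA", "AATA", "AAAT", "AAAA"]
# Hypothetical sample with two divergent haplotypes at equal frequency.
two_clades = ["AAAA", "AAAA", "AAAA", "TTTT", "TTTT", "TTTT"]

print("star-like sample: D =", round(tajimas_d(star_like), 3))
print("two-clade sample: D =", round(tajimas_d(two_clades), 3))
```

An excess of rare variants, as in a star-like genealogy, pulls D below zero, while an excess of intermediate-frequency variants pushes it above zero.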
DEMOGRAPHY vs SELECTION
Detecting a departure from the standard coalescent means that at least one of its assumptions is wrong. Neutrality, unfortunately, is only one of them.
Demographic effects (departures from the constant population size assumption) can distort genealogies in a way very similar to selection. A bottleneck (a sudden decrease in population size, followed by a restoration of the former size), for example, has consequences highly similar to those of a selective sweep.
To distinguish: multi-locus analysis. Demography impacts the whole genome, while selection is locus-specific.
A LIKELIHOOD-BASED APPROACH
For p loci, each with its own θ = 4N.m:
- M1: neutral, constant size; p parameters (θ1, ..., θp)
- M2: bottleneck; p+2 parameters (T, S, θ1, ..., θp)
- M3: selective sweep; 3p parameters (T1, S1, θ1, ..., Tp, Sp, θp)
[Figure: example genealogies under each model, with time labels T (M2) and T1, T2 = T3 (M3)]
Calculate the likelihood (probability of the data) under the three models and compare them using likelihood ratio tests.
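The comparison step can be sketched as follows, assuming the usual chi-square approximation for the statistic 2.(lnL_alt - lnL_null), with degrees of freedom equal to the number of extra parameters: 2 for M2 against M1, and 2p for M3 against M1. Whether this approximation holds for these models is itself an assumption, and the log-likelihood values below are hypothetical. Both degrees of freedom are even, so the chi-square tail has a closed form and no external library is needed.

```python
from math import exp


def chi2_sf_even(x, df):
    """Upper tail probability of a chi-square with an even number of df."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k  # (x/2)^k / k!
        total += term
    return exp(-half) * total


def likelihood_ratio_test(lnl_null, lnl_alt, extra_params):
    """Return the LRT statistic and its approximate p-value."""
    statistic = 2 * (lnl_alt - lnl_null)
    return statistic, chi2_sf_even(max(statistic, 0.0), extra_params)


p = 3  # hypothetical number of loci
lnl = {"M1": -1234.5, "M2": -1230.1, "M3": -1225.0}  # hypothetical values

for model, extra in (("M2", 2), ("M3", 2 * p)):
    stat, p_value = likelihood_ratio_test(lnl["M1"], lnl[model], extra)
    print(f"{model} vs M1: 2.dlnL = {stat:.1f}, df = {extra}, p = {p_value:.4f}")
```

M2 and M3 are not nested in one another, so this sketch only compares each of them with M1.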
WHAT I DID NOT TALK ABOUT
- subdivided populations, migration, isolation by distance, hybrid zones, clines
- other forms of selection (e.g. balancing selection)
- weak selection applying at many loci (e.g. codon usage)
- (biased) gene conversion
- patterns of linkage disequilibrium, coalescent with recombination
- microsatellites and other non-sequence genetic markers