Natural Selection in Humans

Presentation transcript:

Natural Selection in Humans Sharareh Noorbaloochi CS 374 Oct 10, 2006

Papers to be presented… Science, 16 June 2006, Volume 312 PLoS Biology, March 2006, Volume 4

Overview: Pursuit of natural selection; Biological background; Methods for detecting positive selection; Genome-wide studies; From candidate to function. Note for the intro: motivation and why we care; put a screenshot of each paper as well.

Migration out of Africa. Images from: Voight et al. 2006

Adaptability of Modern Humans Humans have undergone tremendous cultural and environmental changes during the last ~40-50 KY: spread around the world (migration out of Africa ~100 KYA); global warming trend since the last ice age (~14 KYA); transition from hunter-gatherer to agricultural society (<~10 KYA); increase in pathogen load due to greater population density and proximity to livestock. The evolution of modern human populations has been accompanied by dramatic changes in environment and lifestyle. In the last 100,000 years, behaviorally modern humans have spread from Africa to colonize most of the globe. In that time, humans have been forced to adapt to a wide range of new habitats and climates. Following the end of the last ice age, 14,000 years ago, there was a major warming event that raised global temperatures to roughly their current levels. Further dramatic changes occurred with the transition from hunter-gatherer to agricultural societies, starting about 10,000–12,000 years ago in the Fertile Crescent, and a little later elsewhere. This was also a period marked by rapid increases in human population densities. Increased population density promoted the spread of infectious diseases, as did the new proximity of farmers to animal pathogens [1,2]. Voight et al. (2006)

Pursuit of Natural Selection Each of these kinds of changes likely resulted in powerful selective pressures for new genotypes that were better suited to the novel environments. Indeed, there are a number of recent reports of genes that show signals of very strong and recent selection in favor of new alleles: for example, in response to malaria [3–5]; at the lactase gene in response to dairy farming [6]; at a salt sensitivity variant in response to climate [7]; and in genes involved in brain development [8,9]. To date, the best examples of recent selection in humans have all been discovered in studies of candidate genes where there was a prior hypothesis of selection. Hence, very little is known about how widespread such signals are; nor is there unbiased information about what kinds of genes or biological processes are most involved in the adaptation of modern humans. It is unclear whether these genes are the same kinds of genes that were most important in the earlier evolution of the Homo lineage, as identified from comparisons with chimpanzees [10–12]. It is also not known to what extent recent selective events have been geographically restricted, as opposed to taking place in all populations. A number of recent studies have detected more signals of adaptation in non-African populations than in Africans [13–17], and some of those studies have conjectured that non-Africans might have experienced greater pressures to adapt to new environments than Africans have. In this study, we use newly

Candidate Genes: Selection G6PD [malaria resistance] Tishkoff et al 01, Sabeti et al 02 Lactase [dairy farming] Bersaglieri et al 04 MCPH1, ASPM [brain development] Evans et al 05, Mekel-Bobrov et al 05 Spend 0.1 s on this slide

Some Facts In human beings, 99.9 percent of bases are the same. The remaining 0.1 percent makes a person unique: different attributes, characteristics, and traits, such as how a person looks and which diseases he or she develops. These variations can be: Harmless (change in phenotype); Harmful (diabetes, cancer, heart disease, Huntington's disease, and hemophilia); Latent (variations found in coding and regulatory regions that are not harmful on their own; the change in each gene only becomes apparent under certain conditions, e.g. susceptibility to lung cancer). The genetic sequences of different people are remarkably similar. When the chromosomes of two humans are compared, their DNA sequences can be identical for hundreds of bases. But at about one in every 1,200 bases, on average, the sequences will differ (Figure 1). One person might have an A at that location, while another person has a G, or a person might have extra bases at a given location or a missing segment of DNA. Each distinct "spelling" of a chromosomal region is called an allele, and a collection of alleles in a person's chromosomes is known as a genotype.

Human Genetic Variations Two types of genetic mutation events for today: (1) Single base mutation, which substitutes one nucleotide for another: Single Nucleotide Polymorphisms (SNPs). (2) Insertion or deletion of one or more nucleotides: Tandem Repeat Polymorphisms and Insertion/Deletion Polymorphisms. Structural variations (copy numbers) are also important. SNPs are the most common type of genetic variation: differences in individual bases are by far the most common type of genetic variation. These genetic differences are known as single nucleotide polymorphisms, or SNPs (pronounced "snips"). By identifying most of the approximately 10 million SNPs estimated to occur commonly in the human genome, the International HapMap Project is identifying the basis for a large fraction of the genetic diversity in the human species.

What is a SNP? A SNP is defined as a single base change in a DNA sequence that occurs in a significant proportion (more than 1 percent) of a large population. For example, a SNP might change the DNA sequence AAGGCTAA to ATGGCTAA.
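The slide's example can be checked with a short Python sketch that lists the positions where two aligned sequences differ. The function name `find_single_base_differences` is made up for illustration, and the sketch assumes the two sequences are already aligned and of equal length. A difference found this way is only a candidate SNP; the 1 percent population-frequency criterion still has to be checked separately.

```python
def find_single_base_differences(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (position, reference_base, sample_base) for every mismatch.

    Positions are 0-based. Both sequences are assumed to be aligned already.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (position, ref_base, sample_base)
        for position, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


# The example from the slide: the second base changes from A to T.
differences = find_single_base_differences("AAGGCTAA", "ATGGCTAA")
print(differences)  # [(1, 'A', 'T')]
```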

SNP facts SNPs are found in coding and (mostly) noncoding regions. They occur with a very high frequency, about 1 in 1,200 bases on average. Approximately 10 million SNPs occur commonly in the human genome. One person might have an A at a given location, while another person has a G, or a person might have extra bases at a given location or a missing segment of DNA. Each distinct "spelling" of a chromosomal region is called an allele, and a collection of alleles in a person's chromosomes is known as a genotype. Say that a spelling change in a gene increases the risk of suffering from high blood pressure, but researchers do not know where in our chromosomes that gene is located. They could compare the SNPs in people who have high blood pressure with the SNPs of people who do not. If a particular SNP is more common among people with hypertension, that SNP could be used as a pointer to locate and identify the gene involved in the disease.

Allele Allele: Any one of a number of viable DNA codings occupying a given locus (position) on a chromosome. Usually alleles are DNA sequences that code for a gene, but sometimes the term is used to refer to a non-gene sequence. In a diploid organism, like humans, one that has two copies of each chromosome, two alleles make up the individual's genotype. For example, all of the people who have an A rather than a G at a particular location in a chromosome can have identical genetic variants at other SNPs in the chromosomal region surrounding the A.

Haplotype Haplotype is a set of SNPs on a single chromatid that are statistically associated.

SNP Maps Sequence genomes of a large number of people Compare the base sequences to discover SNPs. Generate a single map of the human genome containing all possible SNPs => SNP maps

SNP Maps

The HapMap Project The DNA samples for the HapMap come from a total of 270 people: Yoruba people in Ibadan, Nigeria (30 both-parent-and-adult-child trios), Japanese in Tokyo (45 unrelated individuals), Han Chinese in Beijing (45 unrelated individuals), CEPH (European) (30 trios). These numbers of samples will allow the Project to find almost all haplotypes with frequencies of 5% or higher. Ascertainment Bias (not enough samples to look at lower frequencies than 5%) The HapMap is a catalog of common genetic variants that occur in human beings. It describes what these variants are, where they occur in our DNA, and how they are distributed among people within populations and among populations in different parts of the world. The International HapMap Project is not using the information in the HapMap to establish connections between particular genetic variants and diseases. Rather, the Project is designed to provide information that other researchers can use to link genetic variants to the risk for specific illnesses, which will lead to new methods of preventing, diagnosing, and treating disease. The DNA in our cells contains long chains of four chemical building blocks -- adenine, thymine, cytosine, and guanine, abbreviated A, T, C, and G. More than 6 billion of these chemical bases, strung together in 23 pairs of chromosomes, exist in a human cell. (See http://www.dnaftb.org/dnaftb/ for basic information about genetics.) These genetic sequences contain information that influences our physical traits, our likelihood of suffering from disease, and the responses of our bodies to substances that we encounter in the environment. The genetic sequences of different people are remarkably similar. When the chromosomes of two humans are compared, their DNA sequences can be identical for hundreds of bases. But at about one in every 1,200 bases, on average, the sequences will differ (Figure 1). 
One person might have an A at that location, while another person has a G, or a person might have extra bases at a given location or a missing segment of DNA. Each distinct "spelling" of a chromosomal region is called an allele, and a collection of alleles in a person's chromosomes is known as a genotype. http://www.hapmap.org/index.html.en

Hapmap, SNPs, Haplotype, Tag SNPs The construction of the HapMap occurs in three steps. (a) Single nucleotide polymorphisms (SNPs) are identified in DNA samples from multiple individuals. (b) Adjacent SNPs that are inherited together are compiled into "haplotypes." (c) "Tag" SNPs within haplotypes are identified that uniquely identify those haplotypes. By genotyping the three tag SNPs shown in this figure, researchers can identify which of the four haplotypes shown here are present in each individual. In a given population, 55 percent of people may have one version of a haplotype, 30 percent may have another, 8 percent may have a third, and the rest may have a variety of less common haplotypes. The International HapMap Project is identifying these common haplotypes in four populations from different parts of the world. It also is identifying "tag" SNPs that uniquely identify these haplotypes. By testing an individual's tag SNPs (a process known as genotyping), researchers will be able to identify the collection of haplotypes in a person's DNA. The number of tag SNPs that contain most of the information about the patterns of genetic variation is estimated to be about 300,000 to 600,000, which is far fewer than the 10 million common SNPs. Once the information on tag SNPs from the HapMap is available, researchers will be able to use them to locate genes involved in medically important traits. Consider the researcher trying to find genetic variants associated with high blood pressure. Instead of determining the identity of all SNPs in a person's DNA, the researcher would genotype a much smaller number of tag SNPs to determine the collection of haplotypes present in each subject. The researcher could focus on specific candidate genes that may be associated with a disease, or even look across the entire genome to find chromosomal regions that may be associated with a disease. 
If people with high blood pressure tend to share a particular haplotype, variants contributing to the disease might be somewhere within or near that haplotype.
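The tag-SNP idea can be sketched in Python as a small search for the fewest SNP positions that still tell every haplotype apart. The four haplotypes below are made up for illustration (they are not the ones in the HapMap figure), and `find_tag_snps` is a hypothetical helper that uses brute force, which is only practical for tiny examples like this one.

```python
from itertools import combinations


def find_tag_snps(haplotypes: list[str]) -> tuple[int, ...]:
    """Return the smallest set of positions whose alleles uniquely identify each haplotype."""
    n_sites = len(haplotypes[0])
    for size in range(1, n_sites + 1):
        for positions in combinations(range(n_sites), size):
            # Project every haplotype onto the candidate tag positions.
            signatures = {"".join(h[i] for i in positions) for h in haplotypes}
            if len(signatures) == len(haplotypes):
                return positions
    raise ValueError("haplotypes are not all distinct")


# Hypothetical haplotypes over six SNP positions.
haplotypes = ["ACGTAC", "ATGTGC", "GCGCAC", "GCTCAG"]
tags = find_tag_snps(haplotypes)
print(tags)  # (0, 1, 2)

# Genotyping only the tag positions is then enough to name the haplotype.
lookup = {"".join(h[i] for i in tags): index for index, h in enumerate(haplotypes)}
print(lookup["GCT"])  # 3
```

With two alleles per site, four haplotypes need at least two tag SNPs; in this made-up example no pair of sites separates all four, so three are required.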

SNPs may or may not alter protein structure Genetic variants that alter protein function are usually deleterious and are therefore less likely to become common or reach fixation. Synonymous (also known as silent) mutations have no functional effect on the protein. Non-synonymous mutations alter an amino acid; example: sickle cell anemia. Degeneracy of the genetic code: synonymous versus non-synonymous changes.
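The distinction follows directly from the degeneracy of the genetic code, which a short Python sketch can make concrete. The table below is the standard genetic code; the helper name `classify_codon_change` is made up for illustration. The sickle cell change is the well-known GAG to GTG substitution (glutamate to valine) in the beta-globin gene.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def classify_codon_change(original: str, mutated: str) -> str:
    """Label a codon substitution as synonymous or non-synonymous."""
    if CODON_TABLE[original] == CODON_TABLE[mutated]:
        return "synonymous"
    return "non-synonymous"


print(classify_codon_change("GAG", "GAA"))  # synonymous: both encode glutamate (E)
print(classify_codon_change("GAG", "GTG"))  # non-synonymous: glutamate (E) to valine (V)
```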

Alleles, polymorphism,… Allele: One of two or more alternative forms of a gene or DNA sequence at a specific chromosomal location. Example: A or a. Polymorphism: Arbitrarily defined as existing when at least two alleles are in the population and the minor allele is at a 1% or greater frequency. Variant: An allele below the 1% frequency is sometimes called a variant. Fixation: The process by which one allele increases in a population until all other alleles go extinct and the locus becomes monomorphic. Simply: 100% frequency.

Haplotype, hitchhiking, selective sweep Hitchhiking: An allele that rises to a high frequency through positive selection at a linked locus is said to be "hitchhiking". Selective sweep: The reduction in diversity at loci linked to a recently fixed allele is called a "selective sweep". Ascertainment bias: Distortion in a dataset caused by the way the markers or samples are collected.

How does human history affect genetic variation? A genome-wide survey of Linkage Disequilibrium Linkage disequilibrium is a phenomenon whereby genetic variants are associated: people who have one variant tend to have a second variant as well. Slide by: David Reich, Broad Institute

Variation Over Time Figure: emergence of variations over time, from a common ancestor to the present; mutations produce variations in chromosomes within a population. Slide by: David Reich, Broad Institute

Linkage Disequilibrium (LD) Linkage disequilibrium (LD): the non-random association of alleles at two or more loci, not necessarily on the same chromosome; for nearby loci it reflects their tendency to be co-inherited because of reduced recombination between them. LD is a term used in the study of population genetics. It is not the same as linkage, which describes the association of two or more loci on a chromosome with limited recombination between them. LD describes a situation in which some combinations of alleles or genetic markers occur more or less frequently in a population than would be expected from a random formation of haplotypes from alleles based on their frequencies. Linkage disequilibrium is caused by fitness interactions between genes or by such non-adaptive processes as population structure, inbreeding, and stochastic effects. In population genetics, linkage disequilibrium is said to characterize the haplotype distribution at two or more loci. Formally, to define pairwise LD, we consider indicator variables on alleles at two loci, say I1 and I2. We define the LD parameter δ as: δ = cov(I1, I2) = h12 - p1 p2. Here p1 and p2 denote the marginal allele frequencies at the two loci and h12 denotes the haplotype frequency in the joint distribution of both alleles. Various derivatives of this parameter have been developed. In the genetic literature the wording "two alleles are in LD" usually means δ ≠ 0. Conversely, linkage equilibrium denotes the case δ = 0. Note: show the formulas on page 130. Pathogenic: Relating to the causation of disease.

LD Continued LD describes a situation in which some combinations of alleles or genetic markers occur more or less frequently in a population than would be expected from a random formation of haplotypes from alleles based on their frequencies. Illustrative example: Consider two neighboring loci, one with alleles A and a and the other with alleles B and b. If there is no association: P(AB) = P(A) x P(B). If there is association: D = P(AB) - P(A) x P(B). Tracking LD over time: D_t = (1 - r)^t x D_0, where r is the recombination rate and t is the number of generations.
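The two formulas on this slide can be sketched in a few lines of Python. The haplotype frequencies and the recombination rate below are made-up values chosen for illustration, and the function names are not from the presentation.

```python
def linkage_disequilibrium(freq_AB: float, freq_Ab: float, freq_aB: float, freq_ab: float) -> float:
    """Compute D = P(AB) - P(A) * P(B) from the four haplotype frequencies."""
    total = freq_AB + freq_Ab + freq_aB + freq_ab
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    p_A = freq_AB + freq_Ab  # marginal frequency of allele A
    p_B = freq_AB + freq_aB  # marginal frequency of allele B
    return freq_AB - p_A * p_B


def ld_after_generations(d0: float, recombination_rate: float, generations: int) -> float:
    """Expected decay of LD: D_t = (1 - r)^t * D_0."""
    return (1.0 - recombination_rate) ** generations * d0


# Hypothetical population in which A always travels with B, and a with b.
d0 = linkage_disequilibrium(freq_AB=0.5, freq_Ab=0.0, freq_aB=0.0, freq_ab=0.5)
print(d0)  # 0.25

# With r = 0.1 per generation (an illustrative value), D shrinks each generation.
for t in (0, 10, 50):
    print(t, ld_after_generations(d0, recombination_rate=0.1, generations=t))
```

The smaller the recombination rate between two loci, the longer the association survives, which is why long-range LD points to a young haplotype.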

Take each oval and enlarge them and make it interactive

What Determines Extent of LD? Figure timeline: 2,000 generations ago (mutation), 1,000 generations ago, present. Recombination is the key! Slide by: David Reich, Broad Institute

Positive Natural Selection Neutral Evolution Versus Positive Natural Selection

Neutral Evolution Genetic drift: a slow process. The frequency of neutral mutations in the population changes randomly. Most of the genome is thought to be evolving neutrally; that is, most mutations have no effect on fitness. When a neutral mutation occurs, its frequency in the population changes randomly. This slow process is known as genetic drift. Figure: allele frequency across generations 1 to 10. Reproduced from Sabeti et al.
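Genetic drift can be illustrated with a minimal Wright-Fisher simulation in Python. This model is a standard textbook simplification and is not named on the slide; the population size, starting frequency, and seed below are arbitrary illustrative choices.

```python
import random


def simulate_drift(initial_frequency: float, population_size: int,
                   generations: int, seed: int = 0) -> list[float]:
    """Wright-Fisher drift for a neutral allele in a diploid population.

    Each generation, 2N gene copies are drawn at random from the previous
    generation, so the allele frequency changes by chance alone.
    """
    rng = random.Random(seed)
    copies = 2 * population_size
    frequency = initial_frequency
    trajectory = [frequency]
    for _ in range(generations):
        # Binomial sampling: each new gene copy carries the allele with probability `frequency`.
        carriers = sum(rng.random() < frequency for _ in range(copies))
        frequency = carriers / copies
        trajectory.append(frequency)
    return trajectory


trajectory = simulate_drift(initial_frequency=0.1, population_size=100, generations=10)
print([round(f, 3) for f in trajectory])
```

Running it with different seeds shows the frequency wandering up and down with no consistent direction, and usually changing only a little over ten generations.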

Positive Natural Selection Positive selection: A selective regime that favors the fixation of an allele that increases the fitness of its carrier. Fixation: The process by which one allele increases in a population until all other alleles go extinct and the locus becomes monomorphic. Simply: 100% frequency. The situation is different when selection operates: when a mutation makes it more likely for an individual to survive and reproduce, it can spread quickly. In fact, it can go all the way to fixation, that is, 100% frequency. Figure: allele frequency across generations 1 to 10. Reproduced from Sabeti et al.
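For contrast with drift, the rise of a favored allele can be sketched with the standard deterministic haploid selection recursion p' = p(1 + s) / (1 + s p), where s is the selective advantage. This model and the parameter values below are illustrative assumptions, not taken from the presentation.

```python
def simulate_selection(initial_frequency: float, advantage: float, generations: int) -> list[float]:
    """Deterministic frequency of an allele with relative fitness 1 + s (haploid model)."""
    frequency = initial_frequency
    trajectory = [frequency]
    for _ in range(generations):
        # Carriers leave (1 + s) offspring for every 1 left by non-carriers.
        frequency = frequency * (1.0 + advantage) / (1.0 + advantage * frequency)
        trajectory.append(frequency)
    return trajectory


# Hypothetical allele starting at 1% with a 10% fitness advantage.
trajectory = simulate_selection(initial_frequency=0.01, advantage=0.1, generations=200)
for generation in (0, 50, 100, 200):
    print(generation, round(trajectory[generation], 4))
```

With no advantage (s = 0) the frequency never moves in this deterministic model, whereas any positive s drives the allele steadily toward fixation.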

Methods for detecting selection Difference between species High proportion of function altering mutations Within-species variation Low diversity Excess of derived alleles Differences between populations Long unbroken haplotypes

Methods for detecting selection Test 1: Function altering mutations Age: many millions of years

P. C. Sabeti et al., Science 312, 1614-1620 (2006) Test 1: High proportion of function-altering mutations. Example: excess of function-altering mutations in PRM1 exon 2. Over a prolonged period, positive selection can increase the fixation rate of beneficial function-altering mutations. The signature is detected by comparing rates of mutations. Power is limited: multiple selected changes are needed before a gene stands out from the background neutral rate. Genetic variants that alter protein function are usually deleterious and are thus less likely to become common or reach fixation (i.e., 100% frequency) than are mutations that have no functional effect on the protein (i.e., silent mutations). Positive selection over a prolonged period, however, can increase the fixation rate of beneficial function-altering mutations (20, 21), and such changes can be measured by comparison of DNA sequence between species. The increase can be detected by comparing the rate of nonsynonymous (amino acid–altering) changes with the rate of synonymous (silent) or other presumed neutral changes, by comparison with the rate in other lineages, or by comparison with intraspecies diversity. One extreme example of this kind of signature is found in the gene PRM1, mentioned earlier, which has 13 nonsynonymous and 1 synonymous differences between human and chimpanzee. This signature can be detected over a large range of evolutionary time scales. Moreover, it focuses on the beneficial alleles themselves, eliminating ambiguity about the target of selection. Its power is limited, however, because multiple selected changes are required before a gene will stand out against the background neutral rate of change. It is thus typically possible to detect only ongoing or recurrent selection. In practice, when the human genome is surveyed in this manner, few individual genes will give statistically significant signals, after correction for the large number of genes tested. However, the signature can readily be used to detect positive selection across sets of multiple genes (25). For example, genes involved in gametogenesis clearly stand out as a class having a high proportion of nonsynonymous substitutions (25–27). Common statistical tests: Ka/Ks test; relative rate test; McDonald-Kreitman test.

Ka/Ks test (Li et al. 1985) Main idea: contrast two types of substitution events. Goal of the test: calculate the synonymous rate (Ks) and the non-synonymous rate (Ka) at each codon site. Purifying (negative) selection: Ka decreases; Ka/Ks < 1 is indicative of purifying selection. Positive selection: Ka increases (replacement of the amino acid is beneficial to the organism); Ka/Ks > 1. In the rare cases where Ka/Ks > 1 is observed, "positive Darwinian selection" is usually invoked as a possible explanation. The idea is that at such positively selected sites, rapid replacement of an amino acid is advantageous to the organism; hence mutations are fixed at a rate higher than that of neutral mutations.
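A deliberately simplified Python sketch of the ratio follows. It uses uncorrected proportions (differences divided by available sites) rather than the multiple-substitution corrections used in the published method, so it shows the idea only. The difference counts (13 and 1) are the PRM1 human-chimpanzee counts quoted earlier in this presentation; the site counts are invented placeholders, not real PRM1 values, and the function names are hypothetical.

```python
def ka_ks_ratio(nonsyn_differences: int, nonsyn_sites: float,
                syn_differences: int, syn_sites: float) -> float:
    """Uncorrected Ka/Ks: per-site non-synonymous rate over per-site synonymous rate."""
    ka = nonsyn_differences / nonsyn_sites
    ks = syn_differences / syn_sites
    if ks == 0:
        raise ValueError("Ks is zero; the ratio is undefined")
    return ka / ks


def interpret(ratio: float) -> str:
    """Map the ratio onto the qualitative reading given on the slide."""
    if ratio > 1:
        return "consistent with positive selection"
    if ratio < 1:
        return "consistent with purifying selection"
    return "consistent with neutral evolution"


# Site counts (120 and 40) are placeholders made up for this sketch.
ratio = ka_ks_ratio(nonsyn_differences=13, nonsyn_sites=120, syn_differences=1, syn_sites=40)
print(round(ratio, 2), interpret(ratio))
```

Dividing by the number of sites matters because a coding sequence offers roughly three times as many non-synonymous sites as synonymous ones, so raw counts alone would overstate Ka.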

Within-species Tests: Sweep Signatures Test 2: Low diversity, many rare alleles (age < 250,000 years). Test 3: Many high-frequency derived alleles (age < 80,000 years). Test 4: Long common haplotypes unbroken by recombination (age < 30,000 years). Things to clarify: low diversity, rare alleles, derived alleles, haplotypes.

Within-species Test 5: Population differences Age < 50,000 to 70,000 years. Example: extreme population differences (PD) in FY*O allele frequency. The FY*O allele, which confers resistance to P. vivax malaria, is prevalent and even fixed in many African populations, but virtually absent outside Africa.
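One common way to put a number on population differences, not named on the slide, is Wright's fixation index F_ST = (H_T - H_S) / H_T, where H_S is the average expected heterozygosity within populations and H_T is the expected heterozygosity of the pooled population. The Python sketch below assumes two equally sized populations and a biallelic locus; the allele frequencies are illustrative, not measured FY*O values.

```python
def fst_two_populations(freq_pop1: float, freq_pop2: float) -> float:
    """F_ST for one biallelic locus in two equally weighted populations."""
    # Expected heterozygosity within each population: 2p(1 - p).
    h_within = (2 * freq_pop1 * (1 - freq_pop1) + 2 * freq_pop2 * (1 - freq_pop2)) / 2
    # Expected heterozygosity if both populations were pooled.
    mean_freq = (freq_pop1 + freq_pop2) / 2
    h_total = 2 * mean_freq * (1 - mean_freq)
    if h_total == 0:
        return 0.0  # the allele is fixed or absent everywhere: no differentiation to measure
    return (h_total - h_within) / h_total


# Illustrative extremes: fixed in one population and absent in the other.
print(fst_two_populations(1.0, 0.0))  # 1.0
# Identical frequencies give no differentiation.
print(fst_two_populations(0.3, 0.3))  # 0.0
```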

P. C. Sabeti et al., Science 312, 1614-1620 (2006) 5 Signatures of Selection Tests based on human genetic variation (the last 4 tests) cover: development of anatomically modern humans; migration out of Africa; development of agriculture. These different signatures of selection probe different periods of evolutionary history. For example, differences between species are persistent, so they let us look for selection that occurred many millions of years ago. Tests based on human genetic variation probe much shorter time periods, from a few tens to a few hundreds of thousands of years. These are much shorter but still long enough to include the development of anatomically modern humans, the migration out of Africa, and the development of agriculture. Published by AAAS

Genome-wide Studies New data sets make genome surveys possible: full sequence for human, chimpanzee, and mouse; dense surveys of human genetic variation. Newly available data sets have fundamentally changed how we study natural selection in humans, moving toward genome-wide surveys. These key data sets include full sequences for humans and other species as well as dense surveys of human genetic variation.

Between-species results Limited power to detect selection at single genes. Powerful for functional classes of genes that are rapidly changing: sperm-related genes; olfactory (sense of smell) receptors. The rich sequence data have made between-species comparisons of function-altering mutations possible. These studies have limited power to detect selection at individual genes, but they provide insight into the evolution of functional classes of genes. Rapidly evolving classes of genes that have been identified include sperm-related genes and genes involved in olfaction (sense of smell).

Finding selective sweeps Statistical tests distinguish the pattern of genetic variation expected under neutrality from that expected under natural selection. Steps: pick a statistical test to detect sweeps; apply the statistic across the genome. In studies of selective sweeps, and in all studies of genome-wide variation, scientists have developed statistical tests that distinguish the pattern of genetic variation expected under neutrality from that expected under natural selection. They then apply that statistical test to the whole genome, identifying candidate regions for selection. Scientists develop these statistical tests by using theoretical expectations and simulations of neutral variation. If the neutral variation in the genome were completely predicted by simulations, we could in principle simply identify the variants that are outliers from this distribution as possible candidates for selection. The problem, however, is that we do not fully know the shape of the neutral distribution and how it is affected by other factors such as demographic history. Demographic: a single vital or social statistic of a human population, as the number of births or deaths.

Finding selective sweeps Problem: We do not fully know the shape of the neutral distribution and how it is affected by other factors such as demographic history. The best we can do: use a statistic based on simulations; apply it to empirical genome-wide data sets; identify the loci in the extreme tail, which are the most likely candidates for natural selection.
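The outlier approach can be sketched in Python as follows: score every locus with the chosen statistic, rank the loci, and keep those in the extreme tail. The locus names, the scores, and the 1% cutoff are all placeholders invented for illustration, and `extreme_tail` is a hypothetical helper.

```python
def extreme_tail(scores: dict[str, float], tail_fraction: float = 0.01) -> list[str]:
    """Return the loci whose statistic falls in the top `tail_fraction` of the empirical distribution."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Always report at least one locus, even for very small data sets.
    n_candidates = max(1, round(tail_fraction * len(ranked)))
    return ranked[:n_candidates]


# Placeholder genome-wide scores: 200 loci whose statistic simply equals their index.
scores = {f"locus_{i}": float(i) for i in range(200)}
print(extreme_tail(scores, tail_fraction=0.01))  # ['locus_199', 'locus_198']
```

Because the cutoff is relative, this procedure always returns candidates, even from a genome with no selection at all; an outlier is a candidate for follow-up and not proof of selection.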

Test based on the relationship between allele frequency and extent of linkage disequilibrium. No selection: young alleles have low frequency and long-range LD (long haplotypes); old alleles have low or high frequency and short-range LD. Positive selection: young alleles have high frequency and long-range LD. Slide by: David Reich, Broad Institute

The signal of selection Figure: linkage disequilibrium (homozygosity) plotted against frequency of the core haplotype, contrasting neutrality with positive selection. Slide by: David Reich, Broad Institute

Let us understand these methods better …

Methods for detecting selection Within-species Tests Test 2: Low genetic diversity/many rare alleles Age < 250,000 years

Test 2: Low genetic diversity/many rare alleles. As an allele increases in population frequency, variants at nearby locations on the same chromosome (linked variants) rise in frequency. Such so-called "hitchhiking" leads to a "selective sweep". Most common type of variant used: SNPs. Figure: low diversity and many rare alleles at the Kell blood antigen cluster. Common statistical tests: Tajima's D, Hudson-Kreitman-Aguadé (HKA), Fu and Li's D*. Reduction in genetic diversity (age <250,000 years). As an allele increases in population frequency, variants at nearby locations on the same chromosome (linked variants) also rise in frequency. Such so-called "hitchhiking" leads to a "selective sweep," which alters the typical pattern of genetic variation in the region. In a complete selective sweep, the selected allele rises to fixation, bringing with it closely linked variants; this eliminates diversity in the immediate vicinity and decreases it in a larger region. New mutations eventually restore diversity, but these appear slowly (because mutation is rare) and are initially at low frequency. Positive selection thus creates a signature consisting of a region of low overall diversity, with an excess of rare alleles. Unlike excess functional changes, which involve differences between species, selective sweeps are detected in genetic variation within a species. The most common type of variant used is the single-nucleotide polymorphism (SNP). As an example, Akey et al. identified a 115-kb region containing four genes including the Kell blood antigen, which showed an overall reduction in diversity and more rare alleles in Europeans than expected under neutrality (Fig. 3) (28). Statistical tests commonly used to detect this signal include Tajima's D, the Hudson-Kreitman-Aguadé (HKA) test, and Fu and Li's D* (29–32). Figure: on the basis of three different statistical tests, the 115-kb region (containing four genes) shows evidence of a selective sweep in Europeans. P. C. Sabeti et al., Science 312, 1614-1620 (2006)
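To make the "low diversity, excess of rare alleles" signature concrete, here is a minimal sketch of Tajima's D, one of the tests named on this slide. It uses the standard textbook formula; the input layout (one list of 0/1 alleles per chromosome) and the function name `tajimas_d` are assumptions made for illustration, not code from the cited studies.

```python
from math import sqrt


def tajimas_d(haplotypes):
    """Tajima's D for equal-length 0/1 sequences, one row per chromosome.

    Returns None when the statistic is undefined (no segregating sites,
    or too few sequences for a positive variance).
    """
    n = len(haplotypes)
    if n < 2:
        return None
    n_sites = len(haplotypes[0])
    counts = [sum(h[j] for h in haplotypes) for j in range(n_sites)]
    segregating = [c for c in counts if 0 < c < n]
    s = len(segregating)
    if s == 0:
        return None
    # pi: mean number of pairwise differences between sequences
    pi = sum(c * (n - c) for c in segregating) / (n * (n - 1) / 2)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * s + e2 * s * (s - 1)
    if variance <= 0:
        return None
    # Compare pi with Watterson's estimate S / a1
    return (pi - s / a1) / sqrt(variance)


# Ten chromosomes, five sites: every variant is a singleton (rare alleles).
rare = [[1 if i == j else 0 for j in range(5)] for i in range(10)]
# Ten chromosomes, five sites: every variant is at 50% frequency.
common = [[1] * 5 if i < 5 else [0] * 5 for i in range(10)]
d_rare = tajimas_d(rare)
d_common = tajimas_d(common)
```

In the toy data, the all-singleton sample gives a negative D (the sweep-like excess of rare alleles), and the sample with intermediate-frequency variants gives a positive D.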

Methods for detecting selection Within-species Tests Test 3: High-frequency derived alleles Age < 80,000 years

Test 3: Many high-frequency derived alleles. Derived alleles: non-ancestral alleles; arise by new mutations; typically have lower allele frequencies than ancestral alleles. However, in a selective sweep, derived alleles linked to the beneficial allele can hitchhike to high frequency. Derived (that is, nonancestral) alleles arise by new mutation, and they typically have lower allele frequencies than ancestral alleles (33). In a selective sweep, however, derived alleles linked to the beneficial allele can hitchhike to high frequency. Because many of these derived alleles will not reach complete fixation (as a result of an incomplete sweep or recombination of the selected allele during the sweep), positive selection creates a signature of a region containing many high-frequency derived alleles. A good example of this kind of signature is the 10-kb region around the Duffy red cell antigen (FY), which has an excess of high-frequency derived alleles in Africans, thought to be the result of selection for resistance to P. vivax malaria (Fig. 4) (34, 35). The most commonly used test for derived alleles is Fay and Wu's H (36). Figure: excess of high-frequency derived alleles at the Duffy red cell antigen (FY) gene (34). The 10-kb region near the gene has a far greater prevalence of derived alleles (represented by red dots) than of ancestral alleles (represented by gray dots). P. C. Sabeti et al., Science 312, 1614-1620 (2006)
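Fay and Wu's H, the test named above, can be sketched as the difference between two diversity estimates: the pairwise diversity and an estimate that gives extra weight to high-frequency derived alleles. The sketch below assumes the ancestral state of each SNP is already known; the function name `fay_wu_h` and the input (a list of derived-allele counts, one per site) are hypothetical.

```python
def fay_wu_h(derived_counts, n):
    """Fay and Wu's H = pi - theta_H for a sample of n chromosomes.

    `derived_counts` holds, for each segregating site, the number of
    chromosomes carrying the derived allele (between 1 and n - 1).
    """
    pairs = n * (n - 1)
    # pi: average pairwise diversity
    pi = sum(2 * k * (n - k) for k in derived_counts) / pairs
    # theta_H: weights each site by the squared derived-allele count
    theta_h = sum(2 * k * k for k in derived_counts) / pairs
    return pi - theta_h


# Ten chromosomes; derived alleles at high frequency, as after hitchhiking.
h_sweep_like = fay_wu_h([9, 9, 8], n=10)
# Ten chromosomes; derived alleles rare, as typically expected for new mutations.
h_rare = fay_wu_h([1, 1, 2], n=10)
```

A strongly negative H reflects an excess of high-frequency derived alleles, which is the signature this slide describes.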

Methods for detecting selection Within-species Tests Test 5: Differences between populations Age < 50,000 to 70,000 years

Test 5: Differences between populations. When geographically separate populations are subject to distinct environmental or cultural pressures, allele frequency can change in one population and not the other. Such differences can only arise when populations are at least partially isolated reproductively; for humans, that means after the major human migrations out of Africa some 50,000 to 70,000 years ago. Weakness of the test: as with other population genetic signatures, it can be hard to distinguish genuine selection from the effect of demographic history (especially a population bottleneck) on genetic variation. Population bottleneck: a reduction in size of a single, previously larger, population and a loss of prior diversity. Common statistical tests: FST, pexcess. Differences between populations (age <50,000 to 75,000 years). When geographically separate populations are subject to distinct environmental or cultural pressures, positive selection may change the frequency of an allele in one population but not in another. Relatively large differences in allele frequencies between populations (at the selected allele itself or in surrounding variation) may therefore signal a locus that has undergone positive selection. For example, the FY*O allele at the Duffy locus is at or near fixation in sub-Saharan Africa but rare in other parts of the world, an extreme case of population differentiation (Fig. 5) (34, 38). Similarly, the region around the LCT locus demonstrates large population differentiation between Europeans and non-Europeans, reflecting strong selection for the lactase persistence allele in Europeans (6). Commonly used statistics for population differentiation include FST and pexcess (39–41).
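As a rough illustration of population differentiation, the sketch below computes a simple Wright-style FST for one biallelic SNP in two populations, as (H_T - H_S) / H_T with expected heterozygosity 2p(1 - p). This is an assumption-laden toy: it treats the two populations as equally sized and ignores sampling error, so it is not the estimator used in the cited studies, and the function names are hypothetical.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity 2p(1 - p) for allele frequency p."""
    return 2 * p * (1 - p)


def simple_fst(p1, p2):
    """Wright-style F_ST for one biallelic SNP and two equally weighted
    populations with allele frequencies p1 and p2."""
    h_within = (expected_heterozygosity(p1) + expected_heterozygosity(p2)) / 2
    h_total = expected_heterozygosity((p1 + p2) / 2)
    if h_total == 0:
        return 0.0  # allele fixed or absent in both populations
    return (h_total - h_within) / h_total


# An allele fixed in one population and absent in the other,
# in the spirit of the FY*O example on this slide.
fst_extreme = simple_fst(1.0, 0.0)
# The same frequency in both populations.
fst_none = simple_fst(0.3, 0.3)
```

The extreme case gives an FST of 1 and identical frequencies give 0; real loci fall in between, and candidates are those in the upper tail of the genome-wide distribution.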

Extreme population difference, Example 1: extreme population differences (PD) in FY*O allele frequency. The FY*O allele, which confers resistance to P. vivax malaria, is prevalent and even fixed in many African populations, but virtually absent outside Africa.

Extreme population difference, Example 2: the region around the LCT locus demonstrates large PD between Europeans and non-Europeans, reflecting strong selection for the lactase persistence allele in Europeans.

Genome-wide survey using tests: low diversity and population separation. Several studies have now produced genome-wide distributions for various tests of selection and identified the outliers of each distribution. Here, we present plots of diversity versus population differentiation. Signals of a selective sweep occur in regions of low diversity and high population differentiation. These surveys have made it possible to find many novel candidates, and they also allow us to re-examine previously published candidates against genome-wide distributions. Some candidates, such as lactase and immunoglobulin A (IgA), remain among the strongest signals, while others do not. Outliers: low diversity with high population differentiation.

A Little Break?

Interesting fact: Pardis Sabeti is a rock star!

Back to work now…

Methods for detecting selection Within-species Tests Test 4: Long Haplotypes Age < 30,000 years

[Figure: haplotypes over a 0.1 to 0.3 Mb region at three time points: advent of a beneficial allele (recent past), a short while later, and the present.] Under positive selection, a selected allele may rise in prevalence rapidly enough that recombination does not substantially break down the association with alleles at nearby loci on the ancestral chromosome. Such a collection of alleles in a chromosomal region that tend to occur together in individuals is termed a haplotype. Selective sweeps can produce a distinctive signature that would not be expected under neutral drift: namely, an allele that has both high frequency (typical of an old allele) and long-range associations with other alleles (typical of a young allele). The long-range associations are seen as a long haplotype that has not been broken down by recombination. For example, the lactase persistence allele at the LCT locus lies on a haplotype that is common (~77%) in Europeans but that extends largely undisrupted for more than 1 million base pairs (1 Mb) (Fig. 6) (6), much farther than is typical for an allele of that frequency. This signature can be detected with the long-range haplotype (LRH) test, haplotype similarity, and other haplotype-sharing methods (42–45). Developing such tests is an area of vigorous current investigation (46, 47). Model: Maynard Smith and Haigh, 1974. Simulation by SelSim, Coop and Spencer, 2004.

Haplotypes. [Figure: a gene with SNPs 1 to 5 and the core haplotypes marked.] Adjacent SNPs that are inherited together are compiled into "haplotypes." Slide by: David Reich, Broad Institute

Long-range multi-SNP haplotypes. [Figure: a gene with core markers and long-range markers 1 to 5 (SNP alleles C/T and A/G), showing the decay of LD.] Slide by: David Reich, Broad Institute

Powerful general approach for detecting selection. [Figure: markers 1 to 4.] Slide by: David Reich, Broad Institute

Long-range multi-SNP haplotypes. [Figure: a gene with core markers (C/T, A/G) and long-range markers; the labels 18%, 35%, 75%, and 100% appear in the figure.] EHH estimates the level of haplotype splitting due to recombination and mutation at extended regions on both sides of a specified core region. Decay of homozygosity: the probability, at any distance, that any two haplotypes that start out the same have all the same SNP genotypes. Slide by: David Reich, Broad Institute

Test statistic: Extended Haplotype Homozygosity (EHH). (A) Decay of haplotypes in a single region in which a new selected allele (red) is sweeping to fixation, replacing the ancestral allele (blue). (B) Decay of haplotype homozygosity for ten replicate simulations. Right side: derived alleles are favored, $\sigma = 2Ns = 250$. Figure 1. Decay of EHH in Simulated Data for an Allele at Frequency 0.5. (A) Decay of haplotypes in a single region in which a new selected allele (red, center column) is sweeping to fixation, replacing the ancestral allele (blue). Horizontal lines are haplotypes; SNP positions are marked below the haplotype plot using blue for SNPs with intermediate allele frequencies (minor allele > 0.2), and red otherwise. For a given SNP, adjacent haplotypes with the same color carry identical genotypes everywhere between that SNP and the central (selected) site. The left- and right-hand sides are sorted separately. Haplotypes are no longer plotted beyond the points at which they become unique. (B) Decay of haplotype homozygosity for ten replicate simulations. When the core SNP is neutral ($\sigma = 0$; left side), the haplotype homozygosity decays at similar rates for both ancestral and derived alleles. When the derived alleles are favored ($\sigma = 2Ns = 250$; right side), the haplotype homozygosity decays much more slowly for the derived alleles than for the ancestral alleles. The discrepancy in the overall areas spanned by these two curves forms the basis of our test for selection (iHS). DOI: 10.1371/journal.pbio.0040072.g001. Haplotype homozygosity (HH) is an effective measure of linkage disequilibrium (LD) for more than 2 markers. If we want to see how LD breaks down with increasing distance from a specified core region, we can calculate HH in a stepwise manner as extended HH (EHH, Sabeti et al. 2002). HH is evaluated as $HH = \left(\sum_i p_i^2 - 1/n\right) / \left(1 - 1/n\right)$, with $p_i$ being the relative frequency of haplotype $i$ and $n$ the sample size.
EHH estimates the level of haplotype splitting due to recombination and mutation at extended regions on both sides of a specified core region. In combination with the core haplotype frequency, it may also serve as an indicator of recent positive selection. Frequent core haplotypes with unusually high long-range LD are candidates for positive selection. The various core haplotypes can serve as internal controls.
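The haplotype homozygosity formula above, and its stepwise extension away from the core, can be sketched as follows. The input layout is an assumption for illustration: haplotypes are equal-length strings, all carrying the same core allele at index `core`, and the extension runs only to the right. The function names are hypothetical.

```python
from collections import Counter


def haplotype_homozygosity(haplotypes):
    """HH = (sum(p_i^2) - 1/n) / (1 - 1/n), with p_i the relative
    frequency of each distinct haplotype and n the sample size."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two haplotypes")
    counts = Counter(haplotypes)
    sum_p2 = sum((c / n) ** 2 for c in counts.values())
    return (sum_p2 - 1 / n) / (1 - 1 / n)


def ehh_decay(haplotypes, core, max_steps):
    """HH of the window [core, core + step] for step = 0..max_steps,
    that is, EHH at increasing distance to the right of the core."""
    return [
        haplotype_homozygosity([h[core:core + step + 1] for h in haplotypes])
        for step in range(max_steps + 1)
    ]


# Four chromosomes sharing the core allele "A" at index 0;
# recombination or mutation splits them further from the core.
sample = ["AAAA", "AAAA", "AAAT", "AAGT"]
decay = ehh_decay(sample, core=0, max_steps=3)
```

For the toy sample, EHH starts at 1 at the core and falls to 0.5 and then to about 0.17 as the window grows, which matches the "probability that two haplotypes are still identical" reading: 3 of 6 pairs, then 1 of 6 pairs.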

Simulations of decay of haplotype homozygosity. Have to explain EHH here. Sharareh: As shown in the figure, when an allele rises rapidly in frequency due to strong selection, it tends to have high levels of haplotype homozygosity extending much further than expected under a neutral model. Hence, in plots of EHH versus distance, the area under the EHH curve will usually be much greater for a selected allele than for a neutral allele. What is sigma? 10 simulations: SelSim. Haplotype homozygosity: Sabeti et al. 2002

iHS: measures the extent of haplotypes along alleles at a given SNP. [Figure: EHH against genetic distance for the derived allele and the ancestral allele; the value 0.05 is marked.] iHHA: iHH with respect to the ancestral core allele. iHHD: iHH with respect to the derived core allele. Perhaps triage this slide?

iHS score. Useful for variants that have not yet reached fixation. Large negative iHS: the derived allele has swept up in frequency. Large positive iHS: the ancestral allele hitchhikes with the selected site. Hence, both cases are considered interesting!
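A minimal sketch of how the score relates to the two EHH curves: integrate EHH over genetic distance for each core allele (iHH) and take the log ratio of the ancestral to the derived area. This covers only the unstandardized score; normalizing scores against SNPs of similar frequency and truncating the integral at a low EHH value are left out, and the function names and toy numbers are hypothetical.

```python
from math import log


def integrated_ehh(positions, ehh_values):
    """Area under an EHH curve by the trapezoid rule (iHH)."""
    area = 0.0
    for i in range(1, len(positions)):
        width = positions[i] - positions[i - 1]
        area += width * (ehh_values[i] + ehh_values[i - 1]) / 2
    return area


def unstandardized_ihs(ihh_ancestral, ihh_derived):
    """ln(iHH_A / iHH_D): negative when haplotypes on the derived
    allele are unusually long, positive when those on the ancestral are."""
    return log(ihh_ancestral / ihh_derived)


# Toy EHH curves at three genetic positions moving away from the core SNP.
positions = [0.0, 0.01, 0.02]
ehh_derived = [1.0, 0.9, 0.8]    # decays slowly: long haplotypes
ehh_ancestral = [1.0, 0.5, 0.2]  # decays quickly
score = unstandardized_ihs(
    integrated_ehh(positions, ehh_ancestral),
    integrated_ehh(positions, ehh_derived),
)
```

Here the derived allele sits on longer haplotypes, so the score is negative, matching the "large negative iHS" case on the slide.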

The data: HapMap Project. 860,000 SNPs genome-wide. 60 unrelated individuals per population: Europeans (CEPH), CEU; Nigerians from Ibadan (Yoruba), YRI. 89 unrelated individuals: Han Chinese from Beijing and Japanese from Tokyo, ASN.

[Figure: plots with axes labeled |iHS| and -iHS.]

Lines of evidence for selection: enrichment of signal in genic relative to non-genic regions (p < 10^-20). Replication of previously published candidates: LCT (Bersaglieri et al. 2004, Coelho et al. 2005); ADH (Osier et al. 2002); 17q21 inversion (Stefansson et al. 2005); CYP3A5 (Thompson et al. 2004); Chr. 11 olfactory gene cluster (Gilad et al. 2003). Correlates with departures in the frequency spectrum (Fay and Wu, 2000).

SYT1, Yoruba. Haplotype decay at SYT1, iHS = -4.7. Binds Ca2+; implicated in release of neurotransmitters [OMIM]. Plot of high iHS scores on chromosome 12.

SPAG4, East Asians. Haplotype decay at SPAG4, iHS = -3.1. Interacts with ODF27, gene found in mammalian sperm tails [OMIM]. Plot of high iHS scores on chromosome 20. Describe scatter plot; perhaps cut one of these slides (if time constraints).

How much do regions overlap across populations?

Distribution of Regions Across Populations

Do signals of selection correlate with known biological processes?

Enriched Ontological Categories Olfaction [CEU/YRI] Gametogenesis/Fertilization [ASN/CEU] MHC-I related Immunity [CEU/YRI] Metabolism [All]

Other Interesting Stories... Skin Pigmentation MYO5A, OCA2, DTNBP1, TYRP1, SLC24A5* [CEU] Sugar Metabolism MAN2A1 [ASN/YRI]; SI [ASN]; LCT [CEU] Processing of Fatty Acids SLC27A4, PPARD [CEU]; LEPR [ASN]; NCOA1 [YRI] Add MCPHX – XX; Tell stories, don’t say where they are from (already mentioned it).

Conclusion on iHS Method Pervasive signals of positive selection across the human genome Both population specific and signals shared between populations Strong evidence of selection in Africa (unlike other reports) Putative medical relevance because: Have phenotypic consequences Differences between populations

Haplotter http://hg-wen.uchicago.edu/selection/haplotter.html

That's all folks!