Multifactorial traits and complex genetics I. Genome-wide association studies in humans. gavin.band@well.ox.ac.uk, Wellcome Trust Centre for Human Genetics. The title of my talk is genome-wide association studies, as this has been the main focus of my research over the last 3 years.
Overview Describe studies aiming to find genetic differences between individuals that influence susceptibility to diseases (or other traits).
Why find disease genes? Identify putative drug targets. Identify high risk individuals. Gene therapy? Personalised medicine (e.g. stratifying cancer) Understand the biology of disease.
How do genetic factors influence traits? Two somewhat competing views. (1) Genetic influence on traits is inherited in big, discrete lumps: “Mendelian inheritance” - Gregor Mendel (1865) - Morgan (1915) - e.g. discovery of ABO blood group (1924). (2) Genetic influence on traits is inherited in essentially continuous quantities: the “biometrical”, “multifactorial”, “polygenic” viewpoint - Darwin 1859 - Galton 1886 (e.g. human height). The modern evolutionary synthesis: Fisher, Haldane, Wright 1920s-1930s. There were originally two competing views about inheritance (before we knew anything about genetics). Although these were synthesised by Fisher and others in the early-to-mid 1900s, the two views, in combination with technological factors, have still somewhat influenced disease genetics studies.
Genomics timeline
1950s - structure of DNA
1970s - ‘Sanger sequencing’
1980s - RFLP (genetic barcode / inexpensive genotyping of marker loci)
1990s - Linkage studies using RFLPs
2000s - Human Genome Project completed; International HapMap project; first genotyping microarrays; first large-scale association studies
2010s - 1000 Genomes Project; direct-to-consumer genetic testing
Present - Massively large-scale biobank / population sequencing projects (UK Biobank); 100,000 Genomes Project (UK); Precision Medicine Initiative (US), …
Finding Disease Genes 1 (linkage). Stages: Familial Aggregation, Segregation Analysis, Genome-wide Linkage Analysis. Until recently the standard approach to finding disease genes consisted of several key stages common to the majority of studies. First, for a disease or phenotype to be suitable for genetic analysis it has to be shown to be heritable. For example, it may be observed that cases of a disease tend to cluster in families. Analysis of familial aggregation is used to demonstrate that an individual's chance of having a disease phenotype increases when a close relative also exhibits the same phenotype. The analysis of twins or of adopted individuals can be used to rule out common environmental factors as being responsible, leaving a genetic component to disease as the most likely explanation. Once this has been established, segregation analysis can help assess the relative likelihood of different models of disease inheritance. Analysis of the segregation of disease phenotypes within pedigrees provides the first hint of the number of genes likely to underlie the phenotype, and how penetrant they might be; but it cannot provide information on which genes are responsible. Up to this point no genetic data has been collected, and most of the statistical analysis involves summing over the possible genotypes of individuals within families. By genotyping genetic markers within families, linkage analysis aims to infer the physical location of disease-causing genes along chromosomes by comparing patterns of disease status within families to the patterns of inheritance of genetic markers. Strong correlations between disease phenotypes and marker inheritance provide evidence that the causal mutation lies nearby. For example…
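Linkage Mapping. Small number of typed markers (A/a, B/b, C/c, …) along a chromosome. [Figure: pedigree showing the marker haplotypes carried by each family member (e.g. ABC/abc, abc/abc), including recombinant haplotypes such as aBC, Abc, abC and ABc. Legend: symbols mark affected and unaffected individuals; males are squares, females are circles.]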
Linkage Mapping. Typical result if successful: a strong signal (good), but not well localised within a chromosome. As there are only a handful of recombination events within a family along each chromosome, the region in which the disease-causing mutation could lie is often a sizeable fraction of the chromosome, potentially containing hundreds of genes. This study collected 32 extended families (!) and localised a signal to somewhere on chromosome 19, but more work was needed. This initial discovery led to the finding of APOE variants affecting risk of Alzheimer's. [Figure: linkage signal plotted along the chromosome.] Pericak-Vance et al, Am. J. Hum. Gen. (1991)
Finding Disease Genes 1 (linkage). Stages: Familial Aggregation, Segregation Analysis, Genome-wide Linkage Analysis, Candidate Gene Studies + Fine Mapping, Gene Characterization. For this reason, the next step in the process is to type more markers, either in larger pedigrees or in unrelated individuals, and either across the region or in genes which are candidates due to their function. If successful, this fine-scale analysis will identify the genes responsible for the linkage peak, and efforts would then focus on understanding the underlying biology, often in animal models, with the aim of developing therapeutics.
Finding Disease Genes 1 (linkage) Familial Aggregation Segregation Analysis We aren’t very good at this! Genome-wide Linkage Analysis Candidate Gene Studies + Fine Mapping Gene Characterization
Successes and Failures. Linkage mapping has been successful in identifying the genetic basis of many human diseases in which the disease penetrance resembles a simple Mendelian model, e.g. Huntington’s disease (HD 1993), Cystic Fibrosis, some forms of breast cancer (BRCA1 1993), Alzheimer's (APOE 1991)… But “the literature is now replete with linkage screens for an array of common ‘complex’ disorders such as schizophrenia, manic depression, autism, asthma, type I and type II diabetes, Multiple Sclerosis, Lupus. Although many of these studies have reported significant linkage findings, none has led to convincing replication” - Risch (2000). The approach we have just outlined has proven successful in identifying the genetic basis of many diseases in which a single gene, or a small number of genes, explains a large proportion of the incidence of disease in the population. Good examples include … The same approach has been applied to many more complex diseases, but with much less success. Complex diseases differ from so-called Mendelian diseases in that they are thought to be caused by multiple genes which potentially interact with one another and/or with the environment. As a consequence, the disease risk attributable to carrying a susceptibility allele is much smaller than for more Mendelian diseases. Although the literature includes many studies which claim success, attempts at replicating the results for complex diseases in other sample collections have often failed, suggesting that many are false positives and that the approach is flawed. Risch wrote…
Successes and Failures. Why? It’s because linkage studies aren’t the right study design for detecting non-Mendelian-like effects. These so-called ‘complex’ traits have fundamentally different genetic architectures. Relative risk: $\mathrm{RR} = \frac{P(\text{disease} \mid \text{risk allele})}{P(\text{disease} \mid \text{non-risk allele})}$. ‘Mendelian’-like trait => RR > 4 or so, i.e. you are many times more likely to get the disease if you are a risk allele carrier. Typically for common disease, RRs are thought to be below 1.5. (But there may be many such variants.)
Complex diseases. [Figure: relative risk (RR) against allele frequency, from rare (e.g. <1%) to common (e.g. 5-50%).] Common complex disease is thought to be underpinned by multiple mutations of modest effect, typically RR < 1.5.
Successes and Failures. Linkage studies aren’t the right study design for detecting complex trait effects. [Figure: number of families / case-control pairs needed, for a linkage study versus a case/control (GWAS) study. Risch (2000)] Relative risk: $\mathrm{RR} = \frac{P(\text{disease} \mid \text{risk allele})}{P(\text{disease} \mid \text{non-risk allele})}$.
Finding Disease Genes 2 - GWAS. Stages: Familial Aggregation (still want a heritable trait!), Segregation Analysis, Genome-wide Linkage Analysis, Genome-wide Association Analysis, Candidate Gene Studies + Fine Mapping, Gene Characterization. Relatively recently, a new approach to identifying the genes underlying common disease has become popular and has had much more success. The shift away from linkage analysis in favour of the genome-wide association study has been facilitated and motivated by advances in three areas: statistical theory; genotyping technology; and human population genetics, namely the International HapMap project. In the next few slides we will look at these developments, as they are important for understanding the key features of the genome-wide association study.
Association mapping. 1. Collect a set of unrelated affected individuals (cases) and unaffected individuals (controls). [Figure: chromosomes of cases (D) and controls (U).]
Association mapping. [Figure: chromosomes of cases (D) and controls (U).] The red variant is what we’re looking for. E.g. in this toy example, $\mathrm{RR} = \frac{P(D \mid \text{red})}{P(D \mid \text{not red})} = \frac{P(\text{red} \mid D)}{P(\text{not red} \mid D)} \cdot \frac{P(\text{not red})}{P(\text{red})} = \frac{(5/6)(5/6)}{(1/6)(1/6)} = 25$. So real effects, e.g. RR < 1.5, are much more subtle than this!
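The toy calculation can be checked with exact fractions. This is a minimal sketch, not part of the original slides: the function name `relative_risk` is hypothetical, and it assumes the population frequency of the red variant is known (1/6, as in the toy example).

```python
from fractions import Fraction

def relative_risk(freq_in_cases, freq_in_population):
    """RR = [P(allele|D) / P(other|D)] * [P(other) / P(allele)], by Bayes' rule."""
    odds_in_cases = freq_in_cases / (1 - freq_in_cases)
    odds_in_population = freq_in_population / (1 - freq_in_population)
    return odds_in_cases / odds_in_population

# Red variant: carried on 5/6 of case chromosomes, 1/6 of population chromosomes.
rr_red = relative_risk(Fraction(5, 6), Fraction(1, 6))
print(rr_red)  # 25
```

Writing RR as a ratio of allelic odds makes it clear why only case and population allele frequencies are needed, which is exactly what a case/control design measures.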
Association mapping. 2. Genotype many thousands of genetic markers (but probably not the causal, functional mutations themselves). [Figure: chromosomes of cases (D) and controls (U), with typed markers indicated by asterisks.]
Association mapping. 3. Hope to rely on correlations between typed markers and the causal mutations. [Figure: chromosomes of cases (D) and controls (U), with typed markers indicated by asterisks.]
Association mapping, e.g. in our toy example:

            Not white   White   Frequency of white
Cases           5         1           1/6
Controls        2         4           2/3

=> Estimate RR = 10 at this marker SNP. Perform a statistical test for evidence of a difference in allele frequencies between cases and controls (e.g. chi-squared test). In this toy example P = 0.24, so there is not enough data even for this strong effect. P < (a stringent threshold) => success!
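The test on the toy table can be sketched in a few lines of standard-library Python. The slide does not say which version of the test was used; this sketch assumes a chi-squared test with Yates' continuity correction, because that reproduces the quoted P = 0.24 (without the correction the same table gives roughly P = 0.08). The function name is hypothetical.

```python
import math

def chi2_2x2_yates(a, b, c, d):
    """Chi-squared statistic (1 d.f.) with Yates' continuity correction
    for the 2x2 table [[a, b], [c, d]], plus its p-value."""
    n = a + b + c + d
    diff = max(abs(a * d - b * c) - n / 2, 0.0)
    stat = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom the upper tail is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Toy table: rows are cases/controls, columns are not-white/white alleles.
stat, p_value = chi2_2x2_yates(5, 1, 2, 4)
print(round(stat, 2), round(p_value, 2))  # 1.37 0.24
```

With only 12 chromosomes, even an allele with an estimated RR of 10 is nowhere near significant, which is the point of the slide: power comes from sample size.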
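(Aside - association studies - TDT) Collect (lots of) trios of individuals. Condition on the phenotype of the offspring (case). High-risk alleles should be over-transmitted. An internal control is formed by the untransmitted alleles. [Figure: a trio with parental alleles A and a, showing which alleles are transmitted to the offspring.]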
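As a rough illustration of the over-transmission idea, the standard TDT statistic compares how often heterozygous (A/a) parents transmit A versus a to affected offspring. The counts below are made up for illustration and the function name `tdt` is hypothetical; the slides do not give the formula.

```python
import math

def tdt(transmitted, untransmitted):
    """McNemar-type TDT statistic (1 d.f.) from heterozygous parents:
    `transmitted` = times the candidate allele was passed to an affected child,
    `untransmitted` = times the other allele was passed instead."""
    stat = (transmitted - untransmitted) ** 2 / (transmitted + untransmitted)
    # For 1 degree of freedom the upper tail is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical data: 100 heterozygous parents, allele A transmitted 60 times.
stat, p_value = tdt(transmitted=60, untransmitted=40)
print(round(stat, 2))  # 4.0
```

Because the untransmitted alleles come from the same parents, cases and "controls" are matched for ancestry by construction.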
Difference between linkage and association Linkage studies - Collect set of families with individuals carrying disease or phenotype - Look for co-segregation of small number of markers with disease status. Association Studies - Collect unrelated individuals and look at allele frequency differences between cases and controls (or cases and parents for TDT) - Requires genotyping many thousands of markers. - Exploits correlations between nearby genetic diversity along chromosomes within the population
Theory: Association studies provide more power, allowing us to detect the small effect sizes of the genes underlying common disease. Questions: How many SNPs would actually be needed to cover the genome? Can we actually type enough SNPs, and cheaply enough, for the large sample sizes required?
Tagging genetic diversity How many markers are actually required to tag the diversity? - To understand this, must first understand patterns of diversity in natural populations - Identify catalogue of variants to type Can we design experiments to analyse such large numbers of SNPs?
Correlation between SNPs. [Figure: correlation between SNPs against physical distance along the chromosome, comparing real data with the previous prediction (the expectation based on overall recombination rates).] Reich et al, Nature 2001
Why? - recombination hotspots. Count the number of recombination events in (lots of) sperm in the MHC region of chromosome 6. Jeffreys et al 1998
Hotspots are a genome wide feature More than 80% of recombination in less than 10% of the genome
Recombination gives LD a block-like structure
HapMap project. A consortium of a large number of scientists, conducting a study to catalogue and describe human genetic diversity. Discovery of over 5 million SNPs across the genome.
HapMap project. Estimate that 200,000 to 500,000 SNPs are required to tag the genome (at least in European and Asian populations).
Competition drove technology improvements. Affymetrix: 100K, 500K, 6.0 (~1M SNPs), … Illumina: 650Y, 1M, 2.5M, 5M. Which one to buy? Consider coverage and cost. So a key decision is which SNPs to look at: there are currently 4 or 5 chips, which differ in their strategies for choosing SNPs and in cost.
Costing a GWA. Competition, and anticipation of the power of GWA studies, drove the cost of genotyping chips down. Cost per genotype: 2003 ~ $1; 2005 ~ $0.1; 2006 ~ $0.001; 2009 ~ $0.0005 (ish). High-throughput microchip arrays. Main players: Affymetrix and Illumina.
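To see what these per-genotype prices mean for a whole study, here is a back-of-the-envelope sketch. The study size (2,000 cases plus 2,000 controls typed at 500,000 SNPs) is a made-up example, not a figure from the slides, and the calculation ignores every cost other than genotyping.

```python
# Approximate cost per genotype (USD) by year, as quoted on the slide.
cost_per_genotype = {2003: 1.0, 2005: 0.1, 2006: 0.001, 2009: 0.0005}

def study_cost(n_individuals, n_snps, price):
    """Genotyping cost only: one genotype per individual per SNP."""
    return n_individuals * n_snps * price

# Hypothetical study: 4,000 individuals typed at 500,000 SNPs.
costs = {
    year: study_cost(4_000, 500_000, price)
    for year, price in cost_per_genotype.items()
}
for year, cost in costs.items():
    print(f"{year}: ~${cost:,.0f}")
```

Under these assumptions the same study falls from roughly $2 billion at 2003 prices to about $1 million at 2009 prices, which is the difference between impossible and routine.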
Power to find weak effects. [Figure: power against sample size (number of cases and controls) for a relative risk of 1.2, with one curve per chip: Illumina 650k, Illumina 550k, Illumina 300k, Affymetrix 500k, Affymetrix 100k.]
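The shape of such power curves can be approximated with a simple normal-approximation calculation. This is a sketch under stated assumptions, not the method used to draw the figure: it assumes the causal SNP is typed directly (so it ignores differences in chip coverage), treats the control allele frequency as the population frequency, and uses a significance threshold of 5e-8, which is my assumption rather than a value from the slides. Function names are hypothetical.

```python
from statistics import NormalDist

def case_allele_freq(control_freq, rr):
    """Risk-allele frequency in cases when allelic odds are multiplied by rr."""
    odds = rr * control_freq / (1 - control_freq)
    return odds / (1 + odds)

def approx_power(n_cases, n_controls, control_freq, rr, alpha=5e-8):
    """Approximate power of a two-sided z-test comparing allele frequencies
    (the negligible contribution of the opposite tail is ignored)."""
    p0, p1 = control_freq, case_allele_freq(control_freq, rr)
    # Each individual contributes two alleles.
    se = (p1 * (1 - p1) / (2 * n_cases) + p0 * (1 - p0) / (2 * n_controls)) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p1 - p0) / se - z_crit)

# Relative risk of 1.2, risk allele at 30% frequency in controls (assumed).
for n in (1_000, 5_000, 20_000):
    print(n, round(approx_power(n, n, control_freq=0.3, rr=1.2), 3))
```

The qualitative message matches the figure: for RR = 1.2, a thousand cases and controls give almost no power, and thousands more are needed before detection becomes likely.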
Theory: Association studies provide more power, allowing us to detect the small effect sizes of the genes underlying common disease. HapMap: Strong correlations between neighbouring SNPs due to hotspots mean that we don’t necessarily need to type the causal variant. Technology: Competition and commercial drive have meant that we can now affordably type the necessary number of SNPs in large numbers of individuals.
GWAS recipe
1. Collect large numbers of case individuals (1000s)
2. Collect large numbers of controls (perhaps randomly from the population)
(3. Get consent)
4. Extract DNA
5. Genotype individuals at lots of markers
6. At each SNP do a test for allele frequency difference between cases and controls (chi-squared, logistic regression)
7. Look for small p-values (how small?)
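One common, conservative answer to "how small?" in step 7 is a Bonferroni correction: divide the overall false-positive rate you will tolerate by the number of tests. The sketch below is an assumption of mine rather than something stated on the slide; the SNP names and p-values are invented for illustration.

```python
def bonferroni_threshold(n_tests, family_alpha=0.05):
    """Per-test p-value threshold that keeps the overall error rate at family_alpha."""
    return family_alpha / n_tests

# Roughly a million SNPs tested gives a threshold of about 5e-8.
threshold = bonferroni_threshold(1_000_000)

# Hypothetical p-values keyed by made-up SNP names.
p_values = {"snp_a": 3e-12, "snp_b": 4e-6, "snp_c": 0.02}
hits = [snp for snp, p in p_values.items() if p < threshold]
print(threshold, hits)
```

Note that a p-value of 4e-6, which would look overwhelming in a single-hypothesis study, does not pass when a million SNPs are tested.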
It works! Study of ulcerative colitis (inflammatory bowel disease): 2,321 cases, 4,818 controls typed on Affy 6.0 array (~1M SNPs). This is a major endpoint of many analyses: a Manhattan plot. There are now (2016) over 160 common SNPs with effects RR < 2 associated with IBD, accounting for ~20% of disease heritability.
It works! Study of multiple sclerosis (2011): 9,772 cases, 17,376 controls from across Europe. www.well.ox.ac.uk/wtccc2/ms
www.genome.gov/gwastudies/
What can possibly go wrong?
Association mapping. [Figure: chromosomes of cases (D) and controls (U); asterisks indicate the genetic markers genotyped.]
Potential confounders. We are testing for small differences in allele frequency in large samples at around a million different SNPs in the genome. Statistical tests are sensitive to possible confounding, e.g. ?? Large amounts of data make it difficult to visually inspect the data.
Some potential problems. Population structure: population differentiation tends to affect all parts of the genome; natural selection has a pronounced effect at particular loci. Experimental biases: subtle differences in DNA collection, storage or analysis can lead to both consistent and sporadic differences.
Confounding by population structure. [Figure: genotype (aa, Aa, AA) distributions in cases and controls for Subpopulation A and Subpopulation B, annotated with $\chi^2$ = 2.1 (p = 0.34), $\chi^2$ = 1.57 (p = 0.46) and $\chi^2$ = 16.3 (p < 0.001).]
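The effect can be reproduced with invented numbers. In this sketch (hypothetical counts, not the data behind the figure) the genotype proportions are identical in cases and controls within each subpopulation, but the subpopulations differ in allele frequency and cases are drawn mostly from one of them, so pooling creates a spurious association.

```python
import math

def chi2_genotypes(cases, controls):
    """Chi-squared test (2 d.f.) on a 2x3 table of genotype counts (aa, Aa, AA)."""
    n_cases, n_controls = sum(cases), sum(controls)
    total = n_cases + n_controls
    stat = 0.0
    for case_count, control_count in zip(cases, controls):
        column = case_count + control_count
        for observed, row in ((case_count, n_cases), (control_count, n_controls)):
            expected = row * column / total
            stat += (observed - expected) ** 2 / expected
    # For 2 degrees of freedom the upper tail is exp(-x / 2).
    return stat, math.exp(-stat / 2)

# Hypothetical counts (aa, Aa, AA); no association within either subpopulation.
sub_a = {"cases": (64, 32, 4), "controls": (32, 16, 2)}
sub_b = {"cases": (2, 16, 32), "controls": (4, 32, 64)}
pooled_cases = tuple(x + y for x, y in zip(sub_a["cases"], sub_b["cases"]))
pooled_controls = tuple(x + y for x, y in zip(sub_a["controls"], sub_b["controls"]))

for label, cases, controls in (
    ("A", sub_a["cases"], sub_a["controls"]),
    ("B", sub_b["cases"], sub_b["controls"]),
    ("pooled", pooled_cases, pooled_controls),
):
    stat, p = chi2_genotypes(cases, controls)
    print(f"{label}: chi2 = {stat:.2f}, p = {p:.2g}")
```

Nothing in either subpopulation links genotype to disease, yet the pooled test is highly significant purely because ancestry is correlated with case status.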
SNP genotyping. SNP genotyping is achieved by measuring the evidence for the presence of the two alleles at each SNP in each individual independently. Genotypes are then obtained by “clustering” the data. This is hard! [Figure: intensity of probe B against intensity of probe A.]
Differences in genotype calling. [Figure: genotype calling in cases and in controls.] The experimental process is not perfect, and slight differences can lead to apparent allele frequency differences.
An embarrassing example. Plausible hypothesis, big study, genome-wide markers, very small P-value (< 1×10^-10). In a respected journal (Science)... But not real, and now retracted. Why? Because of genotyping errors!
A quick example to demonstrate some of the analytical and statistical challenges…