Multifactorial traits and complex genetics I

Presentation transcript:

Multifactorial traits and complex genetics I Genome-wide association studies in humans gavin.band@well.ox.ac.uk Wellcome Trust Centre for Human Genetics The title of my talk is genome-wide studies as this has been the main focus of my research over the last 3 years.

Overview Describe studies aiming to find genetic differences between individuals that influence susceptibility to diseases (or other traits).

Why find disease genes? Identify putative drug targets. Identify high risk individuals. Gene therapy? Personalised medicine (e.g. stratifying cancer) Understand the biology of disease.

How do genetic factors influence traits? Two somewhat competing views. (1) Genetic influence on traits is inherited in big, discrete lumps: “Mendelian inheritance” - Gregor Mendel (1865) - Morgan (1915) - e.g. discovery of ABO blood group (1924). (2) Genetic influence on traits is inherited in essentially continuous quantities: the “biometrical”, “multifactorial” or “polygenic” viewpoint - Darwin 1859 - Galton 1886 (e.g. human height). The modern evolutionary synthesis: Fisher, Haldane, Wright, 1920s-1930s. There were two original competing views about inheritance (before we knew anything about genetics). Although these were synthesised in the mid-1900s by Fisher and others, these views, in combination with technological aspects, have still somewhat influenced disease genetics studies.

Genomics timeline
1950s - structure of DNA
1970s - ‘Sanger sequencing’
1980s - RFLP (genetic barcode / inexpensive genotyping of marker loci)
1990s - Linkage studies using RFLPs
2000s - Human Genome Project completed; International HapMap project; first genotyping microarrays; first large-scale association studies
2010s - 1000 Genomes Project; direct-to-consumer genetic testing
Present - Massively large-scale biobank / population sequencing projects (UK Biobank); 100 000 genomes project (UK); Precision Medicine Initiative (US); …

Finding Disease Genes 1 (linkage). Steps: Familial Aggregation; Segregation Analysis; Genome-wide Linkage Analysis. Until recently the standard approach to finding disease genes consisted of several key stages common to the majority of studies. First, for a disease or phenotype to be suitable for genetic analysis it has to be shown to be heritable. For example, it may be observed that cases of a disease tend to cluster in families. Analysis of familial aggregation is used to demonstrate that an individual's chance of having a disease phenotype increases when a close relative also exhibits the same phenotype. The analysis of twins or of adopted individuals can be used to rule out common environmental factors as being responsible, leaving a genetic component to disease as the most likely explanation. Once this has been established, segregation analysis can help assess the relative likelihood of different models of disease inheritance. Analysis of the segregation of disease phenotypes within pedigrees provides the first hint of the number of genes likely to be underlying the phenotype, and how penetrant they might be; but it cannot provide information on which genes are responsible. Up to this point no genetic data have been collected, and most of the statistical analysis involves summing over the possible genotypes of individuals within families. By genotyping genetic markers within families, linkage analysis aims to infer the physical location of disease-causing genes along chromosomes by comparing patterns of disease status within families to the patterns of inheritance of genetic markers. Strong correlation between disease phenotypes and marker inheritance provides evidence that the causal mutation lies nearby. For example…

Linkage Mapping. Small number of typed markers (A/a, B/b, C/c, …) along a chromosome. [Figure: pedigree in which each individual is labelled with the marker haplotypes they carry (e.g. ABC, abc, aBC, Abc, abC, ABc). Key: symbols distinguish affected from unaffected individuals; males are squares, females are circles.]

Linkage Mapping. Typical result if successful: a strong signal (good) but not well localised within a chromosome. The initial discovery led to the finding of APOE variants affecting risk of Alzheimer's. However, as there are only a handful of recombination events within a family along each chromosome, the region in which the disease-causing mutation could lie is often a sizable fraction of the chromosome, potentially containing hundreds of genes. This study collected 32 extended families (!) and localised a signal to somewhere on chromosome 19. But more work was needed. [Figure: linkage signal along the chromosome.] Pericak-Vance et al, Am. J. Hum. Gen (1991)

Finding Disease Genes 1 (linkage). Steps: Familial Aggregation; Segregation Analysis; Genome-wide Linkage Analysis; Candidate Gene Studies + Fine Mapping; Gene Characterization. For this reason, the next step in the process is to type more markers, in larger pedigrees or in unrelated individuals, either across the region or in genes which are candidates due to their function. If successful, this fine-scale analysis will elucidate the genes responsible for the linkage peak, and efforts would then focus on understanding the biology of the genetics, often in animal models, with the aim of developing therapeutics.

Finding Disease Genes 1 (linkage). Steps: Familial Aggregation; Segregation Analysis; Genome-wide Linkage Analysis; Candidate Gene Studies + Fine Mapping; Gene Characterization. Slide annotation: “We aren’t very good at this!”

Successes and Failures. Linkage mapping has been successful in identifying the genetic basis of many human diseases in which the disease penetrance resembles a simple Mendelian model, e.g. Huntington’s disease (HD 1993), Cystic Fibrosis, some forms of breast cancer (BRCA1 1993), Alzheimer's (APOE 1991)… But “the literature is now replete with linkage screens for an array of common ‘complex’ disorders such as schizophrenia, manic depression, autism, asthma, type I and type II diabetes, Multiple Sclerosis, Lupus. Although many of these studies have reported significant linkage findings, none has led to convincing replication” – Risch (2000). The approach we have just outlined has proven successful in identifying the genetic basis of many diseases in which a single gene, or a small number of genes, explains a large proportion of the incidence of disease in the population. Good examples include … The same approach has been applied to many more complex diseases, but with much less success. The difference between complex diseases and so-called Mendelian diseases is that complex diseases are thought to be caused by multiple genes which potentially interact with one another and/or with the environment. As a consequence, the disease risk attributable to carrying a susceptibility allele is much smaller than for more Mendelian diseases. Although the literature includes many studies which claim success, attempts at replicating the results for complex diseases in other sample collections have often failed, suggesting that many are false positives and the approach is flawed. Risch wrote…

Successes and Failures. Why? It’s because linkage studies aren’t the right study design for detecting non-Mendelian-like effects. These so-called ‘complex’ traits have fundamentally different genetic architectures. Relative risk (RR) = P( disease | risk allele ) / P( disease | non-risk allele ). ‘Mendelian’-like trait => RR > 4 or so, i.e. you are many times more likely to get the disease if you are a risk allele carrier. For common diseases, RRs are typically thought to be below 1.5. (But there may be many such variants.)

Complex diseases. [Figure: relative risk (RR) against allele frequency, from rare (e.g. <1%) to common (e.g. 5-50%).] Common complex disease is underlain by multiple mutations of modest effect, typically RR < 1.5.

Successes and Failures. Linkage studies aren’t the right study design for detecting complex trait effects. [Figure: number of families / case-control pairs needed, comparing a linkage study with a case/control (GWAS) study, as a function of relative risk, where relative risk = P( disease | risk allele ) / P( disease | non-risk allele ).] Risch (2000)

Finding Disease Genes 2 - GWAS. Steps: Familial Aggregation (still want a heritable trait!); Segregation Analysis; Genome-wide Linkage Analysis, now replaced by Genome-wide Association Analysis; Candidate Gene Studies + Fine Mapping; Gene Characterization. Relatively recently a new approach to identifying the genes underlying common disease has become popular and has had much more success. The shift away from linkage analysis in favour of the genome-wide association study has been facilitated and motivated by advances in three areas: statistical theory, genotyping technology, and human population genetics (namely the International HapMap project). In the next few slides we will look at these developments, as they are important for understanding the key features of the genome-wide association study.

Association mapping Cases (D) Chromosomes Controls (U) 1. Collect a set of unrelated affected individuals (cases) and unaffected individuals (controls).

Association mapping. Cases (D), chromosomes, controls (U). The red variant is what we’re looking for. E.g. in this toy example, RR = P(D | red) / P(D | not red) = [P(red | D) × P(not red)] / [P(not red | D) × P(red)] = (5/6 × 5/6) / (1/6 × 1/6) = 25. So real effects, e.g. RR < 1.5, are much more subtle than this!
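The relative-risk arithmetic on this slide can be checked with a short sketch. It rewrites RR with Bayes' rule so that only P(variant | disease) and the population frequency P(variant) are needed; the function name `relative_risk` is a label chosen here for illustration, not something from the talk.

```python
from fractions import Fraction


def relative_risk(p_var_given_disease, p_var):
    """RR = P(D | variant) / P(D | no variant), rewritten with Bayes' rule.

    P(D | v) / P(D | not v) = [P(v | D) * P(not v)] / [P(not v | D) * P(v)]
    Assumes 0 < p_var < 1 and p_var_given_disease < 1.
    """
    p_not_var_given_disease = 1 - p_var_given_disease
    p_not_var = 1 - p_var
    return (p_var_given_disease * p_not_var) / (p_not_var_given_disease * p_var)


# Toy example from the slide: P(red | D) = 5/6 and P(red) = 1/6.
rr = relative_risk(Fraction(5, 6), Fraction(1, 6))
print(rr)  # 25
```

Using exact fractions keeps the toy numbers free of rounding; with real data the inputs would be estimated frequencies and the result an estimate with its own uncertainty.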

Association mapping. Cases (D), controls (U); typed markers are marked with asterisks in the figure. 2. Genotype many thousands of genetic markers (but probably not the causal, functional mutations themselves).

Association mapping. Cases (D), controls (U); typed markers are marked with asterisks in the figure. 3. Hope to rely on correlations between typed markers and the causal mutations.

Association mapping, e.g. in our toy example at a typed marker SNP:

            Not white   White   Frequency of white
Cases           5         1           1/6
Controls        2         4           2/3

=> Estimate RR = 10 at this marker SNP. Perform a statistical test for evidence of a difference in allele frequencies between cases and controls (e.g. a chi-squared test). In this toy example P = 0.24, so there is not enough data even for this strong effect. P < (a stringent threshold) => success!
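As a rough check of the toy numbers (5 and 1 chromosomes in cases, 2 and 4 in controls), here is a minimal sketch of a 2x2 chi-squared test using only the standard library. It assumes a Yates continuity correction, which gives a value consistent with the slide's P = 0.24; whether the talk used that correction or an exact test is not stated. The function name `chi2_2x2` is illustrative.

```python
import math


def chi2_2x2(a, b, c, d, yates=True):
    """Pearson chi-squared test (1 d.f.) for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value). With yates=True the continuity
    correction |ad - bc| - n/2 is applied (floored at zero).
    """
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(0.0, diff - n / 2)
    stat = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Rows: cases, controls. Columns: not white, white.
stat, p_value = chi2_2x2(5, 1, 2, 4)
print(round(stat, 2), round(p_value, 2))  # 1.37 0.24
```

Without the correction (`yates=False`) the same table gives a P-value of roughly 0.08, still far from any stringent threshold, which is the point the slide makes: twelve chromosomes are not enough even for a strong effect.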

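(Aside - association studies - TDT) Collect lots of trios of individuals (two parents and an affected offspring). Condition on the phenotype of the offspring (case). High-risk alleles should be over-transmitted. An internal control is formed by the untransmitted alleles. [Figure: a trio showing the A and a alleles carried by the parents and offspring.]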
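The idea of over-transmission can be made concrete with a small sketch. It assumes the usual McNemar-type form of the TDT statistic, (b - c)^2 / (b + c), where b and c count how often heterozygous parents transmit the A or the a allele to an affected child. The counts below are hypothetical and chosen only to illustrate the calculation.

```python
import math


def tdt(transmitted_a_upper, transmitted_a_lower):
    """McNemar-type TDT statistic from heterozygous (A/a) parents.

    transmitted_a_upper: times the A allele was transmitted to an affected child
    transmitted_a_lower: times the a allele was transmitted instead
    Returns (statistic, p_value) using a chi-squared distribution with 1 d.f.
    """
    b, c = transmitted_a_upper, transmitted_a_lower
    stat = (b - c) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts: 100 heterozygous parents, A transmitted 60 times.
tdt_stat, tdt_p = tdt(60, 40)
print(tdt_stat, round(tdt_p, 4))  # 4.0 0.0455
```

Under no association each allele is transmitted half the time, so the untransmitted alleles act as the internal control mentioned on the slide.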

Difference between linkage and association. Linkage studies: - Collect a set of families with individuals carrying the disease or phenotype. - Look for co-segregation of a small number of markers with disease status. Association studies: - Collect unrelated individuals and look at allele frequency differences between cases and controls (or cases and parents for TDT). - Requires genotyping many thousands of markers. - Exploits correlations between nearby genetic variants along chromosomes within the population.

Theory: association studies provide more power, allowing us to detect the small effect sizes of the genes responsible for common disease. Questions: How many SNPs would actually be needed to cover the genome? Can we actually type enough SNPs, and cheaply enough, for the large sample sizes required?

Tagging genetic diversity How many markers are actually required to tag the diversity? - To understand this, must first understand patterns of diversity in natural populations - Identify catalogue of variants to type Can we design experiments to analyse such large numbers of SNPs?

Correlation between SNPs. [Figure: correlation between SNPs against physical distance along the chromosome, comparing real data with the previous prediction (the expectation based on overall recombination rates).] Reich et al Nature 2001
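One common way to quantify the correlation between two SNPs is the squared correlation coefficient r² computed from phased haplotypes. The sketch below assumes that measure (the slide does not say which statistic the figure plots) and uses made-up haplotypes coded as 0/1 pairs.

```python
def ld_r2(haplotypes):
    """r^2 between two biallelic SNPs from phased haplotypes.

    haplotypes: list of (allele_at_snp1, allele_at_snp2) pairs coded 0/1.
    Assumes both SNPs are polymorphic in the sample (otherwise the
    denominator is zero).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at SNP 1
    p_b = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at SNP 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


# Hypothetical samples: perfectly correlated SNPs, then independent SNPs.
perfect = [(1, 1)] * 3 + [(0, 0)] * 5
independent = [(1, 1), (1, 0), (0, 1), (0, 0)]
print(ld_r2(perfect), ld_r2(independent))
```

An r² near 1 means a typed marker is almost as informative as the untyped variant it is correlated with, which is what association mapping relies on.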

Why? - recombination hotspots. Count the number of recombination events in lots of sperm in the MHC region of chromosome 6. Jeffreys et al 1998

Hotspots are a genome wide feature More than 80% of recombination in less than 10% of the genome

Recombination gives LD a block-like structure

HapMap project. A consortium of a large number of scientists conducting a study to catalogue and describe human genetic diversity. Discovery of over 5 million SNPs across the genome.

HapMap project. Estimated that 200,000 to 500,000 SNPs are required to tag the genome (at least in European and Asian populations).

Competition drove technology improvements. Affymetrix: 100K, 500K, 6.0 (~1M SNPs), … Illumina: 650Y, 1M, 2.5M, 5M. Which one to buy? Consider coverage and cost. So a key decision is which SNPs to look at: currently there are 4 or 5 chips, which differ in their strategies for choosing SNPs and in cost.

Costing a GWA. Competition and anticipation of the power of GWA studies drove the cost of genotyping chips down. Cost per genotype: 2003 ~ $1; 2005 ~ $0.1; 2006 ~ $0.001; 2009 ~ $0.0005 (ish). High-throughput microchip arrays; the main players are Affymetrix and Illumina.

Power to find weak effects. [Figure: power against sample size (number of cases and controls) for a relative risk of 1.2, with curves for the Illumina 650k, Illumina 550k, Illumina 300k, Affymetrix 500k and Affymetrix 100k chips.]

Summary: Theory, HapMap, Technology. Theory: association studies provide more power, allowing us to detect the small effect sizes of the genes responsible for common disease. HapMap: strong correlations between neighbouring SNPs, due to hotspots, mean that we don’t necessarily need to type the causal variant. Technology: competition and commercial drive have meant that we can now affordably type the necessary number of SNPs in large numbers of individuals.

GWAS recipe
1. Collect large numbers of case individuals (1000s)
2. Collect large numbers of controls (perhaps randomly from the population)
(3. Get consent)
4. Extract DNA
5. Genotype individuals at lots of markers
6. At each SNP do a test for allele frequency difference between cases and controls (chi-squared, logistic regression)
7. Look for small p-values (how small?)
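Steps 6 and 7 of the recipe can be sketched as a loop over SNPs. The example below assumes an allele-count chi-squared test and a Bonferroni-style threshold of 0.05 divided by the number of tests, which is one common convention and not necessarily the one used in the talk. The SNP names and counts are hypothetical.

```python
import math


def allelic_p_value(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """P-value of a 1 d.f. chi-squared test on a 2x2 table of allele counts."""
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    num = n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2
    den = ((case_alt + case_ref) * (ctrl_alt + ctrl_ref)
           * (case_alt + ctrl_alt) * (case_ref + ctrl_ref))
    return math.erfc(math.sqrt(num / den / 2))


# Hypothetical allele counts per SNP: (case_alt, case_ref, ctrl_alt, ctrl_ref),
# for 1000 cases and 1000 controls (2000 chromosomes each).
snps = {
    "snp1": (300, 1700, 300, 1700),  # identical frequencies
    "snp2": (500, 1500, 300, 1700),  # large frequency difference
    "snp3": (340, 1660, 300, 1700),  # small frequency difference
}

n_tests = 1_000_000            # assumed number of SNPs tested genome-wide
threshold = 0.05 / n_tests     # Bonferroni-style threshold (an assumption)

hits = [name for name, counts in snps.items()
        if allelic_p_value(*counts) < threshold]
print(hits)  # ['snp2']
```

With around a million tests, a nominal 5% level would flag tens of thousands of SNPs by chance alone, which is why the threshold has to be so stringent.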

It works! Study of ulcerative colitis (inflammatory bowel disease): 2321 cases, 4,818 controls typed on the Affy 6.0 array (~1M SNPs). This is a major endpoint of many analyses: a Manhattan plot. There are now (2016) over 160 common SNPs with effects RR < 2 associated with IBD, accounting for ~20% of disease heritability.

It works! Study of multiple sclerosis (2011): 9772 cases, 17,376 controls from across Europe. www.well.ox.ac.uk/wtccc2/ms

www.genome.gov/gwastudies/

What can possibly go wrong?

Association mapping. Cases (D), controls (U). [Figure: the genetic markers genotyped are marked with asterisks.]

Potential confounders. We are testing for small differences in allele frequency in large samples at around a million different SNPs in the genome. Statistical tests are sensitive to possible confounding, e.g. ?? Large amounts of data make it difficult to visually inspect the data.

Some potential problems. Population structure: population differentiation tends to affect all parts of the genome; natural selection has a pronounced effect at particular loci. Experimental biases: subtle differences in DNA collection, storage or analysis can lead to both consistent and sporadic differences.

Confounding by population structure. [Figure: genotype counts (aa, Aa, AA) in cases and controls for Subpopulation A and Subpopulation B.] Test results shown on the slide, in order: χ² = 2.1 (p = 0.34); χ² = 1.57 (p = 0.46); χ² = 16.3 (p < 0.001).
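A minimal sketch of how population structure can create a spurious signal: in the hypothetical counts below, the allele frequency is the same in cases and controls within each subpopulation, but the subpopulations differ both in allele frequency and in the proportion of cases, so pooling them produces a large test statistic. All numbers are invented for illustration and are not the data behind the slide.

```python
def allelic_chi2(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-squared statistic (1 d.f., no continuity correction)."""
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    num = n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2
    den = ((case_alt + case_ref) * (ctrl_alt + ctrl_ref)
           * (case_alt + ctrl_alt) * (case_ref + ctrl_ref))
    return num / den


# Hypothetical allele counts: (case_alt, case_ref, ctrl_alt, ctrl_ref).
# Subpopulation A: allele frequency 0.2, mostly cases.
# Subpopulation B: allele frequency 0.8, mostly controls.
subpop_a = (160, 640, 40, 160)
subpop_b = (160, 40, 640, 160)
pooled = tuple(x + y for x, y in zip(subpop_a, subpop_b))

print(allelic_chi2(*subpop_a))  # 0.0
print(allelic_chi2(*subpop_b))  # 0.0
print(allelic_chi2(*pooled))    # 259.2
```

The pooled association is produced entirely by the sampling imbalance between subpopulations, not by any effect of the allele on disease.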

SNP genotyping. SNP genotyping is achieved by measuring the evidence for the presence of each of the two alleles at each SNP, in each individual independently. Genotypes are then obtained by “clustering” the data. This is hard! [Figure: intensity of probe B against intensity of probe A.]

Differences in genotype calling. [Figure: genotype calls in cases and in controls.] The experimental process is not perfect and slight differences can lead to apparent allele frequency differences.

An embarrassing example. Plausible hypothesis, big study, genome-wide markers, very small P-value (< 1 × 10^-10). In a respected journal (Science)... But not real, and now retracted. Why? Because of genotyping errors!

A quick example to demonstrate some of the analytical and statistical challenges…