Introduction to Genome Wide Association Studies Saharon Rosset Tel Aviv University.

Slides:



Advertisements
Similar presentations
Statistical methods for genetic association studies
Advertisements

What is an association study? Define linkage disequilibrium
Review of main points from last week Medical costs escalating largely due to new technology This is an ethical/social problem with major conseq. Many new.
GWAS: Installing and Testing
Association Tests for Rare Variants Using Sequence Data
Admixture in Horse Breeds Illustrated from Single Nucleotide Polymorphism Data César Torres, Yaniv Brandvain University of Minnesota, Department of Plant.
Single Nucleotide Polymorphism And Association Studies Stat 115 Dec 12, 2006.
Genetic research designs in the real world Vishwajit L Nimgaonkar MD, PhD University of Pittsburgh
Meta-analysis for GWAS BST775 Fall DEMO Replication Criteria for a successful GWAS P
Genetic Analysis in Human Disease
Perspectives from Human Studies and Low Density Chip Jeffrey R. O’Connell University of Maryland School of Medicine October 28, 2008.
MALD Mapping by Admixture Linkage Disequilibrium.
THE RELATIONSHIP BETWEEN GENOTYPE AND PHENOTYPE – WHAT WE KNOW AND WHAT WE DON’T KNOW. PAUL SCHLIEKELMAN DEPARTMENT OF STATISTICS UNIVERSITY OF GEORGIA.
Office hours Wednesday 3-4pm 304A Stanley Hall. Fig Association mapping (qualitative)
The role of variation in finding functional genetic elements Andy Clark – Cornell Dave Begun – UC Davis.
Genetica per Scienze Naturali a.a prof S. Presciuttini Human and chimpanzee genomes The human and chimpanzee genomes—with their 5-million-year history.
MSc GBE Course: Genes: from sequence to function Genome-wide Association Studies Sven Bergmann Department of Medical Genetics University of Lausanne Rue.
FINAL EXAM: TAKE-HOME Assessment of Significance in Cancer Gene SNPs.
Chapter 5 Human Heredity by Michael Cummings ©2006 Brooks/Cole-Thomson Learning Chapter 5 Complex Patterns of Inheritance.
Give me your DNA and I tell you where you come from - and maybe more! Lausanne, Genopode 21 April 2010 Sven Bergmann University of Lausanne & Swiss Institute.
Review for Exam 2 Some important themes from Chapters 6-9 Chap. 6. Significance Tests Chap. 7: Comparing Two Groups Chap. 8: Contingency Tables (Categorical.
Kaitlyn Cook Carleton College Northfield Undergraduate Mathematics Symposium October 7, 2014 A METHOD FOR COMBINING FAMILY-BASED RARE VARIANT TESTS OF.
Presentation 12 Chi-Square test.
Chapter 7 Multifactorial Traits
Genetic Analysis in Human Disease. Learning Objectives Describe the differences between a linkage analysis and an association analysis Identify potentially.
Rare and common variants: twenty arguments G.Gibson Homework 3 Mylène Champs Marine Flechet Mathieu Stifkens 1 Bioinformatics - GBIO K.Van Steen.
Modes of selection on quantitative traits. Directional selection The population responds to selection when the mean value changes in one direction Here,
Epigenome 1. 2 Background: GWAS Genome-Wide Association Studies 3.
Introduction to BST775: Statistical Methods for Genetic Analysis I Course master: Degui Zhi, Ph.D. Assistant professor Section on Statistical Genetics.
Multifactorial Traits
Association study of 5-HT2A genes with schizophrenia in the Malaysian population: A Multiethnic Meta- analysis Study Shiau Foon Tee* 1, Tze Jen Chow 1,
The Pie of Schizophrenia (Theoretical: Early Molecular Biology)
Case(Control)-Free Multi-SNP Combinations in Case-Control Studies Dumitru Brinza and Alexander Zelikovsky Combinatorial Search (CS) for Disease-Association:
National Taiwan University Department of Computer Science and Information Engineering Haplotype Inference Yao-Ting Huang Kun-Mao Chao.
Biology 101 DNA: elegant simplicity A molecule consisting of two strands that wrap around each other to form a “twisted ladder” shape, with the.
Pharmacogenetics & Pharmacogenomics Personalized Medicine.
CS177 Lecture 10 SNPs and Human Genetic Variation
From Genome-Wide Association Studies to Medicine Florian Schmitzberger - CS 374 – 4/28/2009 Stanford University Biomedical Informatics
Input: A set of people with/without a disease (e.g., cancer) Measure a large set of genetic markers for each person (e.g., measurement of DNA at various.
BGRS 2006 SEARCH FOR MULTI-SNP DISEASE ASSOCIATION D. Brinza, A. Perelygin, M. Brinton and A. Zelikovsky Georgia State University, Atlanta, GA, USA 123.
Personalized Medicine Dr. M. Jawad Hassan. Personalized Medicine Human Genome and SNPs What is personalized medicine? Pharmacogenetics Case study – warfarin.
genetic variation is meaningful only in the context of a population
Jianfeng Xu, M.D., Dr.PH Professor of Public Health and Cancer Biology Director, Program for Genetic and Molecular Epidemiology of Cancer Associate Director,
Linear Reduction Method for Tag SNPs Selection Jingwu He Alex Zelikovsky.
An quick overview of human genetic linkage analysis
Comp. Genomics Recitation 10 4/7/09 Differential expression detection.
Genome wide association studies (A Brief Start)
The International Consortium. The International HapMap Project.
An quick overview of human genetic linkage analysis Terry Speed Genetics & Bioinformatics, WEHI Statistics, UCB NWO/IOP Genomics Winterschool Mathematics.
Marshall University School of Medicine Department of Biochemistry and Microbiology BMS 617 Lecture 6 –Multiple hypothesis testing Marshall University Genomics.
Admixture Mapping Controlled Crosses Are Often Used to Determine the Genetic Basis of Differences Between Populations. When controlled crosses are not.
An atlas of genetic influences on human blood metabolites Nature Genetics 2014 Jun;46(6)
Power and Meta-Analysis Dr Geraldine M. Clarke Wellcome Trust Advanced Courses; Genomic Epidemiology in Africa, 21 st – 26 th June 2015 Africa Centre for.
Brendan Burke and Kyle Steffen. Important New Tool in Genomic Medicine GWAS is used to estimate disease risk and test SNPs( the most common type of genetic.
Estimating the False Discovery Rate in Genome-wide Studies BMI/CS 576 Colin Dewey Fall 2008.
University of Colorado at Boulder
Interpreting exomes and genomes: a beginner’s guide
SNPs and complex traits: where is the hidden heritability?
Gil McVean Department of Statistics
Bioinformatics challenges for “genome-wide association” studies
Genome Wide Association Studies using SNP
Human Cells Human genomics
POSC 202A: Lecture Lecture: Substantive Significance, Relationship between Variables 1.
Genome-wide Associations
Beyond GWAS Erik Fransen.
Complex Traits Qualitative traits. Discrete phenotypes with direct Mendelian relationship to genotype. e.g. black or white, tall or short, sick or healthy.
Chapter 7 Multifactorial Traits
Statistical Analysis and Design of Experiments for Large Data Sets
Discovery From Data Repositories H Craig Mak  Nature Biotechnology 29, 46–47 (2011) 2013 /06 /10.
Six W’s of Genetic Testing
Presentation transcript:

Introduction to Genome Wide Association Studies Saharon Rosset Tel Aviv University

Genome wide association studies Goal: find connections between: A phenotype: height, type-I diabetes, etc., known to be heritable Whole-genome genotype Specific goals are distinct: 1.Identify statistical connections between points (or areas) in the genome and the phenotype Drive hypotheses for biological studies of specific genes/regions in specific context 2.Generate insights on genetic architecture of phenotype Many small genetic effects dispersed across the genome? Few large effects concentrated in one area (MHC?) 3.Build statistical models to predict phenotype from genotype “Show me your genome and I will tell you what diseases you will get”

Methodology Collect n subjects with known phenotype (usually n in range ) Measure each one in m genomic locations (“representing common variation in the whole genome”) Usually SNPs: Single Nucleotide Polymorphisms Typically m in range Recently moving to whole genome sequencing (m = 3*10 9 but realistically same information) Now we can think of our data as X n*m matrix with subjects as rows, SNPs as columns, X ij is in {0,1,2} (genotype at single locus) Also given extra vector Y n of phenotypes Our first task: association testing Find SNPs (columns in X) that are statistically associated with Y Can be thought of as m separate statistical tests run on this matrix

AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC AGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTC AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC AGAGCAGTCGACAGGTATAGCCTACATGAGATCAACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGCC AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCAACATGATAGCC AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGTC AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC AGAGCAGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC AGAGCAGTCGACATGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC AGAGCAGTCGACATGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGTC AGAGCCGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGCC AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC AGAGCCGTCGACAGGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGTC Cases: Controls: Associated SNP Can you find the associated SNP?

Disease association analysis of a single SNP TotalGenotype 2Genotype 1Genotype 0 N0N0 N 02 N 01 N 00 Y=0 (healthy) N1N1 N 12 N 11 N 10 Y=1 (sick) nM2M2 M1M1 M0M0 Total Now our problem is one of testing: H 0 : No connection between disease and SNP  the rows and columns of the table are independent Obvious approach: χ 2 test for 3x2 table (2-df) Other alternatives: logistic regression, trend test,… (dealing with genotype as numeric) This approach generates m (≈10 6 ) total hypotheses tests and p values

“Manhattan plot” of GWAS results What happens if we use a p-value threshold of α=0.05 (black line) to declare results as significant? We would get about 10 6 x0.05 = 50K false discoveries Solution: be very selective in what results we declare as significant. In this plot the threshold is the orange line at α=10 -5  Declaring only one association in Chr7

The multiplicity problem in GWAS What is a statistically sound choice of a threshold for declaring an association? Family wise error rate (FWER): the probability of making even one false discovery out of our m tests Controlling FWER: the well known Bonferroni correction, perform each test at level α = 0.05/m For m = 10 6 this gives α = 5 x Leading journals (Nature Genetics) require a p value smaller than 5 x to publish GWAS results Implicitly require Bonferroni for 10 6 – super conservative! Lesson learned in blood, from findings that did not replicate and were eventually deemed false!

GWAS promise and history We know of many highly heritable traits and diseases including Height Heart Disease Many cancers The GWAS promise: we will identify the genetic basis for this heritability First GWAS in 2005, since then: Thousands of studies, hundreds of thousands of individuals, hundreds of billions of SNPs genotyped, many billions of $$$ invested Was the promise fulfilled?

Was the promise fulfilled? Yes and no!

Yes: we found a lot of associations, learned some biology Lessons learned: A few of strongest associations are in coding regions Most associations are in regulatory elements Some are in gene deserts

Results of famous WTCCC study of seven diseases on 14,000 cases and 3,000 shared controls (Nature, 2007) Total found: 13 significant findings at level 5*10 -8

Not al all: where is all the heritability?

Our GWAS findings do not explain heritability Height: From twins and family study, about 80% of height variability is heritable Huge height GWAS (n>40K ) found SNPs explaining ~10% of height variability Diseases: Schizophrenia, heart disease, cancers,… Heritability: 30%-80% For none of these, GWAS gives more than 5%-10% Basically, for all complex traits investigated a major gap remains!

Results of famous WTCCC study of seven diseases on 14,000 cases and 3000 shared controls (Nature, 2007) Total found: 13 significant findings at level 5*10 -8 Heritability explained: small for all except T1D

Where is the missing heritability? Theories: 1.Rare variants not covered by GWAS : Every family has its own mutation We know some examples in cancer (BRCA) 2.Complex associations/epistasis: combinations of SNPs Problem: 10 6 SNPs is pairs 3.Lack of power: the effects are weak, we need much more data Or statistical approaches that aggregate more smartly 4.Epigenetic effects: heritability is not in the genome at all To some extent, all these theories have been tested, some have provided interesting answers (still hotly debated)

The importance of genetic structure Genetic structure: not everyone in the population is from same genetic background Some people are more genetically similar than others Israel: Ashkenazi Jews, Mizachi Jews, Arabs,… US: Caucasian, Black, Hispanic Particularly interesting: admixed populations African/Hispanic Americans: mixture of African, European and Native American ancestry Proportions may vary significantly between “African American” individuals Many SNPs in the genome have different distribution between Africans and Europeans Most not due to selection/adaptation but due to random drift

Genetic structure and GWAS Many traits have strong population association In the US, diabetes much more common among blacks In Israel, Crohn’s disease is much more common among Ashkenazi Jews Now, say that we sampled diabetes cases in some hospitals in US + controls in the same hospitals, performed GWAS % of blacks in cases will be higher than in controls (because of high prevalence) What will our GWAS show? Every SNP which differs in distribution between Europeans and Africans will be statistically associated with the disease Only because of structure/stratification in our sample!

J Novembre et al. Nature 000, 1-4 (2008) doi: /nature07331 Even homogeneous population has some structure: Genes mirror geography within Europe

GWAS: controlling for structure, using structure We are seeking associations that are not “due to structure” How can we eliminate ones that are due to it? If we know who is white and who is black, we can do an analysis that controls for the “race” variable For example, logistic regression with both the race and the SNP as predictors What happens if we don’t know? We just saw that structure can be automatically extracted even from “homogeneous” data We can extract it, then control for it This is what modern GWAS analyses do

Using structure in a cool way: Admixture mapping Assume we have: A disease that is more common in Africans than Europeans, say: early onset kidney disease (<40) A population that is an admixture of European and African, like African Americans Suggestion: find the genetic cause by examining genomes of sick admixed individuals The area of the genome where the genetic cause resides will be more “African” in the cases than the rest of their genome We don’t necessarily need controls for this analysis

Figure 2 Discovering associations with a disease through admixture mapping Rosset, S. et al. (2011) The population genetics of chronic kidney disease: insights from the MYH9–APOL1 locus Nat. Rev. Nephrol. doi: /nrneph Permission obtained from Nature Publishing Group. Darvasi, A. & Shifman, S. Nat. Genet. 37, 118–119 (2005)

Genetic risk prediction from GWAS The vision, the doctor will have a “desktop predictor” Input: patient’s genome Output: risk for one (or many) diseases Building prediction models is a very different use of GWAS information Non-genetic risk factors that are correlated with the genome (like diet) are also legitimate for prediction Don’t need to name the SNPs that are responsible for risk (  can use structure) Don’t necessarily need a biologist in the loop We have accumulating evidence that we may be able to do much better prediction than our identified significant associations only can offer Advanced methods can take advantage of weaker associations, signal from rare variants, environmental effects, etc.

Summary GWAS is a modern “Big Data” challenge Proper analysis is a major statistical/methodological challenge, e.g.: Controlling and using structure Finding complex associations We have learned a lot – but not as much as we hoped We are still improving on both major fronts: Size and extent of data available Advanced statistical methods

Thank you!