Ingredients for a successful genome-wide association studies: A statistical view Scott Weiss and Christoph Lange Channing Laboratory Pulmonary and Critical.

Presentation transcript:

Ingredients for a successful genome-wide association study: A statistical view
Scott Weiss and Christoph Lange
Channing Laboratory, Pulmonary and Critical Care Medicine, Brigham and Women’s Hospital, Boston, Massachusetts
Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts

Overview:
– What are genome-wide association studies?
– What are the statistical requirements for a successful genome-wide association study?
  – Sufficient sample sizes
  – LD coverage
  – Genotype quality
  – Design of genome-wide association studies / handling of the multiple testing problem

The human genome
– 22 pairs of autosomes plus the sex chromosomes
– Many possible genes: ~30,000-50,000
– ~8,000,000 SNPs
How can we find disease genes?

The human genome: How can we find disease genes?
Genotyping all loci is not possible (not yet!) => Utilization of 2 concepts:
1.) Linkage disequilibrium (LD): correlation of alleles at two loci
2.) Genetic association: a particular form of a DNA polymorphism occurs more frequently in subjects with a phenotype of interest

Genetic association
DSL: disease susceptibility locus
– Disease phenotype and DSL: test for genetic association between the phenotype and the DSL
– DSL and marker: LD / correlation
– Disease phenotype and marker: test for association between the phenotype and the marker locus

Genome-wide association study
Definition: Association analysis performed with a panel of polymorphic markers adequately spaced to capture most of the linkage disequilibrium information in the entire genome in the study population.
Usually: 100,000 SNPs or more
Human genome and disease phenotype => Test for association

Sample size requirements:
DSL: disease susceptibility locus
– Disease phenotype and DSL: test for genetic association between the phenotype and the DSL
– DSL and marker: LD / correlation
– Disease phenotype and marker: test for association between the phenotype and the marker locus
Sufficient statistical power is needed to detect the association.

Example for required sample sizes
Required sample sizes to achieve 80% power in a case/control study for a significance level of 10^-7.
[Table: required sample sizes by allele frequency and odds ratio; the values are not recoverable from the transcript]
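Since the table values did not survive, the kind of calculation behind such a table can be sketched as follows. This is a minimal approximation, not the method used for the slide: it assumes a two-sided allelic (two-proportion) z-test, a multiplicative per-allele odds ratio, equal numbers of cases and controls, and that the control allele frequency equals the population frequency. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def case_allele_freq(p0, odds_ratio):
    """Allele frequency in cases implied by control frequency p0 and a per-allele odds ratio."""
    return odds_ratio * p0 / (1.0 - p0 + odds_ratio * p0)


def cases_needed(p0, odds_ratio, alpha=1e-7, power=0.80):
    """Approximate number of cases (= number of controls) for an allelic z-test.

    Assumption: normal approximation for comparing two allele proportions,
    with 2 alleles contributed by each individual.
    """
    p1 = case_allele_freq(p0, odds_ratio)
    z_alpha = -NormalDist().inv_cdf(alpha / 2.0)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p0 + p1) / 2.0
    numerator = (
        z_alpha * math.sqrt(2.0 * p_bar * (1.0 - p_bar))
        + z_beta * math.sqrt(p0 * (1.0 - p0) + p1 * (1.0 - p1))
    ) ** 2
    alleles_per_group = numerator / (p1 - p0) ** 2
    return math.ceil(alleles_per_group / 2.0)


if __name__ == "__main__":
    for freq in (0.05, 0.1, 0.2, 0.3):
        print(freq, [cases_needed(freq, o) for o in (1.25, 1.5, 2.0)])
```

Under these assumptions the required sample size falls as the odds ratio grows and, for common alleles below 50%, as the allele frequency grows.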

Linkage disequilibrium (LD):
DSL: disease susceptibility locus
– Disease phenotype and DSL: test for genetic association between the phenotype and the DSL
– DSL and marker: LD / correlation
– Disease phenotype and marker: test for association between the phenotype and the marker locus
The set of markers has to contain a marker that is “sufficiently” correlated with the DSL so that the genetic association at the DSL is also visible at the marker locus.

Measures of genetic correlation between markers
(Name, measure: formula)
– Lewontin’s D′: $D' = D_{AB} / D_{\max}$
– Hill & Weir (1994), $r^2$ or $\Delta^2$: $r^2 = D_{AB}^2 / \{p_A\, p_B\, (1-p_A)(1-p_B)\}$
– Levin (1953), $\delta$: $\delta = D_{AB} / \{p_B\, p_{ab}\}$
– Yule’s Q (1900), $Q$: $Q = D_{AB} / \{p_{AB}\, p_{ab} + p_{Ab}\, p_{aB}\}$
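The measures above can be computed directly from the four haplotype frequencies. The sketch below assumes two biallelic loci with alleles A/a and B/b, haplotype frequencies that sum to 1, and both loci polymorphic; it takes $D_{AB} = p_{AB} p_{ab} - p_{Ab} p_{aB}$ and returns a signed D′. The function name is hypothetical.

```python
def ld_measures(p_AB, p_Ab, p_aB, p_ab):
    """Return D, D', r^2, Levin's delta and Yule's Q from haplotype frequencies.

    Assumes the four frequencies sum to 1 and that no denominator is zero
    (both loci polymorphic, haplotype ab present).
    """
    p_A = p_AB + p_Ab
    p_B = p_AB + p_aB
    d = p_AB * p_ab - p_Ab * p_aB  # equivalently p_AB - p_A * p_B
    # D_max depends on the sign of D
    if d >= 0:
        d_max = min(p_A * (1.0 - p_B), (1.0 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1.0 - p_A) * (1.0 - p_B))
    return {
        "D": d,
        "D_prime": d / d_max,
        "r2": d * d / (p_A * p_B * (1.0 - p_A) * (1.0 - p_B)),
        "delta": d / (p_B * p_ab),
        "Q": d / (p_AB * p_ab + p_Ab * p_aB),
    }


if __name__ == "__main__":
    print(ld_measures(0.4, 0.1, 0.1, 0.4))
```

Note that D′ can equal 1 while r^2 is small when allele frequencies at the two loci differ, which is one reason r^2 is the measure tied to power.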

The interpretation of r^2
– r^2 N is the “effective sample size”
– If a marker M and causal gene G are in LD, then a study with N cases and controls that measures M (but not G) will have the same power to detect an association as a study with r^2 N cases and controls that directly measured G
– Goal: The markers that are genotyped should be selected so that they have high r^2 values (preferably at least 80%) with the markers that are not genotyped
– Good SNP selection will be key for the success of GWAs
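As a small worked illustration of the effective-sample-size rule (a sketch with hypothetical function names): the sample size needed when only the marker is typed is the directly-typed sample size divided by r^2.

```python
def effective_sample_size(n, r2):
    """Sample size of a direct study of G with the same power as typing marker M in n subjects."""
    return r2 * n


def marker_sample_size(n_direct, r2):
    """Subjects needed when typing marker M to match a direct study of G with n_direct subjects."""
    return n_direct / r2


if __name__ == "__main__":
    # With r^2 = 0.8, typing the marker costs 25% more subjects than typing G directly.
    print(effective_sample_size(1000, 0.8))
    print(marker_sample_size(1000, 0.8))
```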

SNP Selection for GWA Studies
Really a challenge for industry development, not an investigator’s laboratory. However, the investigator needs to select a panel with adequate LD coverage for the study population.
Assessment of Illumina Sentrix HumanHap300 BeadChip (R. Lazarus)
–Studied LD coverage of ENCODE regions: ten 500 kb regions that were completely sequenced in HapMap in 60 CEPH parents
–Assessed LD coverage of 6226 common SNPs in the ENCODE regions (MAF > 0.1)
–Found the maximum r^2 of each ENCODE SNP with a SNP on the HumanHap300 panel
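The coverage assessment described above (maximum r^2 of each reference SNP with any panel SNP) can be sketched as below. This is an illustration, not the analysis that was run: it assumes unphased genotypes coded as 0/1/2 allele counts and uses the squared Pearson correlation of those counts as a stand-in for haplotype-based r^2. All names and the toy data are hypothetical.

```python
def genotype_r2(x, y):
    """Squared Pearson correlation between two 0/1/2 genotype vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return sxy * sxy / (sxx * syy)


def max_r2_per_target(targets, panel):
    """For each reference SNP, the best r^2 achieved by any SNP on the panel."""
    return {
        name: max(genotype_r2(geno, p) for p in panel.values())
        for name, geno in targets.items()
    }


def coverage(targets, panel, threshold=0.8):
    """Fraction of reference SNPs tagged by the panel at r^2 >= threshold."""
    best = max_r2_per_target(targets, panel)
    return sum(v >= threshold for v in best.values()) / len(best)


if __name__ == "__main__":
    targets = {"t1": [0, 1, 2, 1, 0, 2], "t2": [0, 0, 1, 2, 2, 1]}
    panel = {"p1": [0, 1, 2, 1, 0, 2], "p2": [2, 1, 0, 1, 2, 0]}
    print(max_r2_per_target(targets, panel), coverage(targets, panel))
```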

Genotyping quality (QC):
DSL: disease susceptibility locus
– Disease phenotype and DSL: test for genetic association between the phenotype and the DSL
– DSL and marker: LD / correlation
– Disease phenotype and marker: test for association between the phenotype and the marker locus
The genotype quality has to be sufficient so that the genetic association at the DSL is also visible at the marker loci that are in LD with the DSL.

For example, the dependence of the power of a GWA on the call rate
Scenario:
– Case/control study: 1,500 cases & controls
– Odds ratio: 1.5
– Overall significance level: 5%
– Adjustment for multiple comparisons: Bonferroni, 5%/500,000 = 10^-7
=> Power as a function of allele frequency and call rates
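The Bonferroni threshold and a rough power figure for this scenario can be sketched as follows. The power calculation is an approximation under stated assumptions, not the one used for the slides: a two-sided allelic z-test with perfect call rates, a multiplicative per-allele odds ratio, and 1,500 cases plus 1,500 controls (one reading of “1,500 cases & controls”). Function names are hypothetical.

```python
import math
from statistics import NormalDist


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the overall level at alpha."""
    return alpha / n_tests


def allelic_test_power(p0, odds_ratio, n_cases, n_controls, alpha):
    """Approximate power of a two-sided allelic z-test (normal approximation).

    p0 is the allele frequency in controls; the opposite-tail rejection
    probability is ignored because it is negligible here.
    """
    p1 = odds_ratio * p0 / (1.0 - p0 + odds_ratio * p0)
    se = math.sqrt(
        p1 * (1.0 - p1) / (2 * n_cases) + p0 * (1.0 - p0) / (2 * n_controls)
    )
    z_crit = -NormalDist().inv_cdf(alpha / 2.0)
    return NormalDist().cdf((p1 - p0) / se - z_crit)


if __name__ == "__main__":
    alpha = bonferroni_threshold(0.05, 500_000)
    for freq in (0.05, 0.1, 0.2, 0.3):
        print(freq, round(allelic_test_power(freq, 1.5, 1500, 1500, alpha), 3))
```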

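Power levels and avg number of false positives
Avg call rate by genotype: 100%, 100%, 100%
[Table: power and average number of false positives by allele frequency; values not recoverable from the transcript, except the final false-positive entry, 0.18]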

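Power levels and avg number of false positives
Avg call rate by genotype: 99%, 99%, 99%
[Table: power and average number of false positives by allele frequency; values not recoverable from the transcript, except the final false-positive entry, 908.12]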

Power levels and avg number of false positives
Avg call rate by genotype: 98%, 98%, 98%
[Table: power and average number of false positives by allele frequency; values not recoverable from the transcript]

Power levels and avg number of false positives
Avg call rate by genotype: 99%, 95%, 99%
[Table: power and average number of false positives by allele frequency; values not recoverable from the transcript]

For example, the dependence of the power of a GWA on the call rate
Conclusion:
– Call rate has a moderate effect on power (for nearly perfect call rates)
– Call rate has a large effect on the number of false positives (for nearly perfect call rates)
– The situation is even worse for multi-stage designs!

Genotyping quality (QC):
DSL: disease susceptibility locus
– Disease phenotype and DSL: test for genetic association between the phenotype and the DSL
– DSL and marker: LD / correlation
– Disease phenotype and marker: test for association between the phenotype and the marker locus
The genotype quality has to be sufficient so that the false-positive rate does not dilute the “real” signals.

Design of genome-wide association studies / Handling of the multiple testing problem:

“Using the same data set for screening and testing”: An approach for family-based designs
Balance false negatives with false positives
We don’t want to test all SNPs
–“You break it, you buy it”
–Genomic screening and testing using the same data set
Test the “promising” SNPs
Ignore the “less-promising” SNPs

PBAT
PBAT* screening approach
–Family-based studies, quantitative traits
–Address multiple comparisons
–Screen and test using the same dataset
*Van Steen K, McQueen MB, Herbert A et al. (2005). Genomic screening and replication using the same data set in family-based association testing. Nat Genet 37:

PBAT: Screening
Step 1. Screen
–Use ‘between-family’ information, E(X|S), to estimate the strength of the genetic association
–Based on this estimate, calculate the conditional power of the FBAT statistic
–Select the top N SNPs on the basis of power

PBAT: Testing
Step 2. Test
–Use ‘within-family’ information: the FBAT statistic (independent of ‘between-family’ info)
–Adjust for N tests (not 500K!)

The 3 steps of the screening technique (Nature Genetics (2005)):
Setting: one trait and six SNPs (SNP 1 to SNP 6), with expected genotypes E(X1|P), ..., E(X6|P)
Step 1: Replace X by E(X) and estimate power/effect size for each SNP (values shown on the slide: 15%, 89%, 35%, 23%, 15%)
Step 2: Select the combination with maximal power (85%)
Step 3: Replace E(X) by X and compute the FBAT test statistic for SNP 2 and the trait. P-value for the FBAT statistic: 0.5%
This p-value does not need to be adjusted for multiple comparisons!
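The screen-then-test idea can be sketched for parent-offspring trios as below. This is a simplified illustration under stated assumptions, not PBAT’s implementation: genotypes are additive allele counts (0/1/2), the trait is quantitative and mean-centred, and SNPs are ranked in the screening step by the absolute correlation between the trait and E(X|P) rather than by a full conditional power calculation. All function names and the toy data are hypothetical.

```python
import math
from statistics import NormalDist, mean


def expected_genotype(father, mother):
    """E(X|P): each parent transmits half of its allele count on average."""
    return (father + mother) / 2.0


def mendelian_variance(father, mother):
    """Var(X|P): each heterozygous parent contributes 1/4."""
    return ((father == 1) + (mother == 1)) / 4.0


def screen_score(trait, fathers, mothers):
    """Screening step: uses only parental genotypes (between-family information)."""
    ex = [expected_genotype(f, m) for f, m in zip(fathers, mothers)]
    mx, my = mean(ex), mean(trait)
    sxy = sum((x - mx) * (y - my) for x, y in zip(ex, trait))
    sxx = sum((x - mx) ** 2 for x in ex)
    syy = sum((y - my) ** 2 for y in trait)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def fbat_z(trait, offspring, fathers, mothers):
    """Testing step: uses only transmissions, X - E(X|P) (within-family information)."""
    mu = mean(trait)
    u = v = 0.0
    for y, x, f, m in zip(trait, offspring, fathers, mothers):
        u += (y - mu) * (x - expected_genotype(f, m))
        v += (y - mu) ** 2 * mendelian_variance(f, m)
    return u / math.sqrt(v) if v > 0 else 0.0


def screen_and_test(trait, snps, top_n, alpha=0.05):
    """Rank all SNPs by screening score, then test only the top_n of them.

    snps maps a SNP name to (offspring, fathers, mothers) genotype lists.
    The significance level is adjusted for top_n tests, not for len(snps).
    """
    ranked = sorted(
        snps,
        key=lambda name: screen_score(trait, snps[name][1], snps[name][2]),
        reverse=True,
    )
    results = {}
    for name in ranked[:top_n]:
        offspring, fathers, mothers = snps[name]
        z = fbat_z(trait, offspring, fathers, mothers)
        p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
        results[name] = (z, p, p < alpha / top_n)
    return results
```

Because the screening score never looks at which alleles were actually transmitted, the test statistic computed afterwards on the selected SNPs is not biased by the selection, which is why only the N selected tests need adjusting.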

PBAT
Software implementation
–Family-based studies
–Quantitative traits & dichotomous traits
–Single marker, haplotype, multi-marker
–Time-to-onset, multivariate data, time-series data
–Professional version distributed by Golden Helix…

Golden Helix Software for Illumina Whole Genome Analysis
Golden Helix is Harvard’s PBAT commercialization partner
–Easy-to-use, user-friendly graphical interface
–Professional PBAT training and consulting
–Rapid customer support
“Accelerating the Quest for Significance”
–Powerful methods for both family and unrelated individuals
–Run on hundreds of processors with distributed computing
–Illumina data import directly supported
–“I was able to do in 3 days what it has taken our lab 2 years to try and do with [other] collaborations.” – Golden Helix customer