Statistical Issues in Genetic Association Studies


Statistical Issues in Genetic Association Studies Eleanor Feingold, Ph.D. University of Pittsburgh March, 2011

Underlying Principle of Genetic Mapping People who have similar traits (phenotypes) should have greater than expected sharing of genetic material near the genes that influence those traits.

Basic study designs for gene mapping
- families: linkage analysis (or family-based association)
- unrelated individuals: association analysis
- semi-related individuals in an inbred population: ?

Association analysis (circa 2000)
1) Collect cases and controls.
2) Genotype everyone at a marker (genotypes AA, Aa, aa).
3) Test genotype/phenotype association.
4) Call it a day and go out for a beer with your co-investigators.
            AA    Aa    aa
cases       65   133   202
controls    16    81   316

GWAS study (circa 2010)
Repeat 1,000,000 times!
1) Collect cases and controls.
2) Genotype everyone at a marker (genotypes AA, Aa, aa).
3) Test genotype/phenotype association.
4) Call it a day and go out for a beer with your co-investigators.
            AA    Aa    aa
cases       65   133   202
controls    16    81   316
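Step 3 on the slide can be sketched in a few lines. This is a minimal illustration, not the presenter's code: it runs a Pearson chi-squared test on the 2 x 3 genotype table shown above, using only the standard library. The function name `chi_square_2x3` is made up for this example.

```python
import math

def chi_square_2x3(cases, controls):
    """Pearson chi-squared test of independence for a 2 x 3 genotype table.

    cases and controls are (AA, Aa, aa) counts. Returns (statistic, p-value).
    """
    rows = [cases, controls]
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # A 2 x 3 table has 2 degrees of freedom, and for 2 df the
    # chi-squared survival function is exactly exp(-x / 2).
    p_value = math.exp(-stat / 2.0)
    return stat, p_value

# Genotype counts from the slide.
stat, p_value = chi_square_2x3([65, 133, 202], [16, 81, 316])
print(f"chi-squared = {stat:.2f}, p = {p_value:.2e}")
```

In a GWAS the same function would simply be called once per SNP, which is where the multiple-testing issues discussed later come from.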

So what’s the BIG DEAL? Well, not much, until you get into 1) the complexities of array data, and 2) the real science of genetics.

One important genetic subtlety
Even in a GWAS, we can’t test every variant in the genome. So at the design phase, we have to pick markers (SNPs) that we hope will “cover” the genome as well as possible, and at the testing phase, we do not expect that the marker we are testing is actually the “causal variant”. We are usually hoping (at best) that it is correlated with the true causal variant.

Gene in here somewhere


After many generations ...

Within a population, genotypes at nearby SNPs are correlated due to population history. This correlation is called linkage disequilibrium.
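One common way to put a number on this correlation is the r² measure between two SNPs. The sketch below assumes two biallelic SNPs and that haplotype frequencies are known (in practice they usually have to be estimated from unphased genotypes). The function name and the example frequencies are hypothetical.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, r^2) for two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A at SNP 1 and B at SNP 2
    p_a:  frequency of allele A at SNP 1
    p_b:  frequency of allele B at SNP 2
    Assumes both SNPs are polymorphic (0 < p_a < 1 and 0 < p_b < 1).
    """
    d = p_ab - p_a * p_b  # departure from independence
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r2

# Hypothetical frequencies: the A-B haplotype is much more common
# than the 0.3 * 0.3 = 0.09 expected if the SNPs were independent.
d_val, r2_val = ld_stats(p_ab=0.25, p_a=0.3, p_b=0.3)
print(f"D = {d_val:.3f}, r^2 = {r2_val:.3f}")
```

An r² near 1 means one SNP is an almost perfect proxy for the other; an r² near 0 means testing one tells you little about the other.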

“Tag” SNPs
Find a set of SNPs that captures most information at least cost.
How? Find clusters of SNPs that are highly correlated and then choose one representative from each cluster to genotype.
Easily available, relatively idiot-proof software (e.g. Tagger).
Caveat 1: You need a database that knows lots of SNPs in your gene and has genotyped them in a fair number of people in the population you are studying (HapMap, SeattleSNPs).
Caveat 2: Beware of overly aggressive “tagging.”
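The "choose one representative per cluster" idea can be sketched as a greedy selection over a pairwise r² matrix. This is a generic illustration and is not claimed to be what Tagger actually implements. The function name, the 0.8 threshold, and the toy matrix are all assumptions made for the example.

```python
def greedy_tag_snps(r2, threshold=0.8):
    """Greedy tag-SNP selection from a symmetric pairwise r^2 matrix.

    Repeatedly picks the SNP that captures the most still-untagged SNPs
    (r^2 >= threshold, or the SNP itself) until every SNP is captured.
    Returns the indices of the chosen tag SNPs in the order they were picked.
    """
    n = len(r2)
    untagged = set(range(n))
    tags = []
    while untagged:
        best_snp, best_cover = None, set()
        for candidate in range(n):
            cover = {s for s in untagged
                     if s == candidate or r2[candidate][s] >= threshold}
            if len(cover) > len(best_cover):
                best_snp, best_cover = candidate, cover
        tags.append(best_snp)
        untagged -= best_cover
    return tags

# Toy matrix: SNPs 0-2 form one tight cluster, SNPs 3-4 another.
r2 = [
    [1.00, 0.90, 0.95, 0.10, 0.10],
    [0.90, 1.00, 0.90, 0.10, 0.10],
    [0.95, 0.90, 1.00, 0.10, 0.10],
    [0.10, 0.10, 0.10, 1.00, 0.85],
    [0.10, 0.10, 0.10, 0.85, 1.00],
]
tags = greedy_tag_snps(r2)
print(tags)
```

Lowering the threshold yields fewer tags but weaker proxies, which is one way to read the "overly aggressive tagging" caveat.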

Conventional association vs. candidate gene sequencing
Candidate gene sequencing study:
1) Expensive - fewer genes and fewer people, so lower power overall.
2) Find both common and rare variation.
3) Find functional variants.
GWAS (tag SNP) study:
1) Cheaper - more genes and more people, so higher power.
2) Find only common variation.
3) Probably do not find functional variants.

GWAS Analysis
- Genotype calling
- Data cleaning
- Single-SNP analysis
- Other analyses
- CNVs

Genotype “calling”
Generally done before you see the data. But plenty of open questions about how to do it.
- best clustering methods?
- salvage data from messy clusters?
(Figure: genotype cluster plot with clusters labeled BB, AB, AA.)

Data cleaning
Somewhat dependent on which chip you are using.
- Throw out “bad” SNPs and “bad” samples (% of genotypes “called” for each person and each SNP).
- Hardy-Weinberg testing
- Relationship testing
- Find major chromosomal anomalies.
- Look for population stratification.
- Look for signs of systematic problems (e.g. allele frequencies differ by sample processing date).
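The first item, call rate per person and per SNP, is simple to compute. The sketch below assumes a small genotype matrix (rows are samples, columns are SNPs) with `None` marking a missing call. The function name, the toy data, and the 0.5 cutoff are illustrative only; real studies use much stricter thresholds that depend on the chip and the lab.

```python
def call_rates(genotypes):
    """Return (per-sample call rates, per-SNP call rates).

    genotypes: list of rows, one per sample; each entry is a genotype
    code (e.g. 0, 1, 2) or None when the genotype was not called.
    """
    n_samples = len(genotypes)
    n_snps = len(genotypes[0])
    sample_rates = [sum(g is not None for g in row) / n_snps
                    for row in genotypes]
    snp_rates = [sum(row[j] is not None for row in genotypes) / n_samples
                 for j in range(n_snps)]
    return sample_rates, snp_rates

# Toy data: sample 2 and SNP 3 are mostly missing.
genotypes = [
    [0,    1,    2, None],
    [1,    1,    0, None],
    [None, None, 2, None],
    [2,    0,    1, 1],
]
sample_rates, snp_rates = call_rates(genotypes)

MIN_CALL_RATE = 0.5  # hypothetical cutoff, chosen only for the toy data
good_samples = [i for i, r in enumerate(sample_rates) if r >= MIN_CALL_RATE]
good_snps = [j for j, r in enumerate(snp_rates) if r >= MIN_CALL_RATE]
print(good_samples, good_snps)
```

Note that the two filters interact: dropping bad SNPs first changes the per-sample rates and vice versa, so the order of filtering is itself a design choice.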

Data cleaning examples

Plate effect on missing call rate per sample
ANOVA p-value = 6e-48
But no significant association between plate and case status (p = 0.20).

Gender Check

Chromosomal anomalies

Testing Hardy-Weinberg
Hardy-Weinberg Equilibrium (HWE) means that your three genotype groups occur in the expected p², 2pq, q² proportions.
Departure from HWE most often indicates genotyping problems. But it can also indicate an actual genetic effect. (Check for case-control differences.)
Do your HWE tests by ethnicity, but don’t expect admixed groups (Hispanics, African-Americans) to be in HWE.
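A basic version of the HWE check compares observed genotype counts with the p², 2pq, q² expectations using a 1-degree-of-freedom chi-squared test. The sketch below is one simple way to do it; exact tests are often preferred when counts are small. The function name and the example counts are made up.

```python
import math

def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-squared test (1 df) for departure from Hardy-Weinberg proportions.

    Assumes both alleles are present in the sample.
    Returns (statistic, p-value).
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)  # estimated frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_AA, n_Aa, n_aa)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of chi-squared with 1 df.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical counts: the first set is in perfect HWE,
# the second has far too few heterozygotes.
print(hwe_chi_square(25, 50, 25))
print(hwe_chi_square(40, 20, 40))
```

Following the slide's advice, one would typically run this separately within each ethnic group, and look at cases and controls separately before discarding a SNP.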

HWE 10⁻⁴ < p < 0.5

HWE p < 10⁻⁴

Population stratification via principal components

Analysis

Simple association test at every SNP
Case-control association test by allele: 2 x 2 table of case/control status by allele (Fisher’s exact test or chi-squared test).
And by genotype: 2 x 3 table of case/control status by genotype AA, Aa, aa (Fisher’s exact test or chi-squared test or Armitage trend test).
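As a rough sketch of two of the tests named here, the code below computes the allelic 2 x 2 chi-squared test and the Cochran-Armitage trend test from (AA, Aa, aa) counts. The function names are invented for the example, the trend weights (2, 1, 0) assume an additive coding of allele A, and the counts are the ones from the earlier case-control slide.

```python
import math

def p_value_1df(stat):
    """Survival function of the chi-squared distribution with 1 df."""
    return math.erfc(math.sqrt(stat / 2.0))

def allelic_test(cases, controls):
    """Chi-squared test on the 2 x 2 table of allele counts."""
    a = 2 * cases[0] + cases[1]        # A alleles in cases
    b = cases[1] + 2 * cases[2]        # a alleles in cases
    c = 2 * controls[0] + controls[1]  # A alleles in controls
    d = controls[1] + 2 * controls[2]  # a alleles in controls
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return stat, p_value_1df(stat)

def armitage_trend_test(cases, controls, weights=(2, 1, 0)):
    """Cochran-Armitage trend test with the given genotype weights."""
    totals = [r + s for r, s in zip(cases, controls)]
    n = sum(totals)
    n_cases, n_controls = sum(cases), sum(controls)
    sum_wr = sum(w * r for w, r in zip(weights, cases))
    sum_wn = sum(w * t for w, t in zip(weights, totals))
    sum_w2n = sum(w * w * t for w, t in zip(weights, totals))
    numerator = n * (n * sum_wr - n_cases * sum_wn) ** 2
    denominator = n_cases * n_controls * (n * sum_w2n - sum_wn ** 2)
    stat = numerator / denominator
    return stat, p_value_1df(stat)

cases, controls = (65, 133, 202), (16, 81, 316)
allelic_stat, allelic_p = allelic_test(cases, controls)
trend_stat, trend_p = armitage_trend_test(cases, controls)
print(f"allelic: {allelic_stat:.2f}, trend: {trend_stat:.2f}")
```

The allelic test treats each chromosome as an independent observation, which leans on HWE; the trend test works on individuals and does not need that assumption.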

Or use logistic regression
Lets you incorporate other predictors (age, sex, diet, whatever).
- G + E (genotype + environment model)
- G + E + GxE (interaction model)

GWAS results: Manhattan plot and Q-Q plot

What’s the best single-SNP association test? Not as “solved” a problem as you’d think. If you knew the true model for the gene effect, you’d just fit that model. But you don’t. So which tests are robust over lots of models?

Chia-Ling Kuo’s work (figure labels: MIN 2P, MIN 3P, MIN 4P)

Scan with Covariates
Which logistic regression model is best for testing GENETIC EFFECT?
- G: LR(G, NULL) ~ χ²(1)
- G+E: LR(G+E, E) ~ χ²(1)
- G+E+GE: LR(G+E+GE, E) ~ χ²(2)
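For reference, the LR notation on this slide can be written out as follows. This is a sketch under two assumptions that the slide does not state explicitly: G enters the model as a single coded term (for example an additive allele count), and the interaction GE is a single product term, which is what gives 1 and 2 degrees of freedom. Here \ell denotes the maximized log-likelihood of a fitted logistic model.

```latex
\mathrm{LR}(M_1, M_0) = 2\left[\ell(M_1) - \ell(M_0)\right] \;\sim\; \chi^2(k)
\quad \text{under } M_0, \qquad
k = \#\text{parameters}(M_1) - \#\text{parameters}(M_0)

\begin{aligned}
\mathrm{LR}(G, \mathrm{NULL})  &= 2\left[\ell(G) - \ell(\mathrm{NULL})\right] \sim \chi^2(1) \\
\mathrm{LR}(G+E, E)            &= 2\left[\ell(G+E) - \ell(E)\right] \sim \chi^2(1) \\
\mathrm{LR}(G+E+GE, E)         &= 2\left[\ell(G+E+GE) - \ell(E)\right] \sim \chi^2(2)
\end{aligned}
```

All three compare a model containing genetic terms against the same model with those terms removed, so each is a test of "any genetic effect" under a different working model.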

Results
- Combination statistics (best of several statistics) are most robust, even after correction for multiple comparisons, but the linear trend test is also a good choice.
- To test for a genetic effect, the G + E model is almost never advantageous. Just test G, or fit G + E + GxE if you’re pretty sure there’s an interaction.
- BIG CAVEAT: This assumes G and E are independent. If you are worried about confounding, you DO need to control for E when testing G.

More generally, should you use the same statistics you used for a small-scale study?
Maybe not.
Problem: Need to worry about the statistical properties of the extreme values of the test statistics.
What do I mean?
- Statisticians develop tests that behave sensibly on average.
- But in genomic problems, we do 10,000 or 500,000 of the same test and then follow up the top 100 results.
- So we need test statistics for which the extreme values are well-behaved, not so much the averages.

Example from expression arrays: “10,000 t-tests” analysis Compute t-statistic for each gene. Rank by absolute value of t-statistic.

Problem
Ranked list is dominated by small-variance genes.
2 ways to get a large t-statistic:
- large difference between the means
- small SE
With a small sample size, the SE estimates are very poor. If you estimate an SE poorly 10,000 times, some of the estimates will come out very small.

Solution Shrinkage estimator! (Add a fudge factor to the denominator of the t-statistic.)
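The fudge-factor idea can be shown with two made-up genes. In this sketch, `t_like` is a hypothetical helper and `s0 = 0.5` is an arbitrary constant picked for the toy data; in practice the constant is usually estimated from the distribution of SEs across all genes.

```python
import math
import statistics

def t_like(group1, group2, s0=0.0):
    """Two-sample t-like statistic with a constant s0 added to the SE.

    s0 = 0 gives the ordinary unequal-variance t-statistic; s0 > 0 shrinks
    statistics whose SE happens to be tiny.
    """
    diff = statistics.mean(group1) - statistics.mean(group2)
    se = math.sqrt(statistics.variance(group1) / len(group1)
                   + statistics.variance(group2) / len(group2))
    return diff / (se + s0)

# Gene with a tiny difference but an even tinier variance estimate.
small_var = ([1.00, 1.01, 1.02], [0.90, 0.91, 0.92])
# Gene with a large difference and an ordinary variance.
big_effect = ([3.0, 4.0, 5.0], [1.0, 2.0, 3.0])

for s0 in (0.0, 0.5):
    print(s0,
          round(t_like(*small_var, s0=s0), 2),
          round(t_like(*big_effect, s0=s0), 2))
```

Without the fudge factor the small-variance gene ranks first; with it, the gene with the large mean difference moves to the top, which is the behavior the slide is after.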

Back to association studies ... Whatever statistic you are using (1,000,000 times), you need to know the statistical behavior of the 1st - 50th highest order statistics, not the statistical behavior on average. This issue has not really been dealt with in the association study literature.

A few other open statistical issues

Multiple testing
The problem: If you do 1,000,000 tests, you will produce a lot of false positives.
The solution: There isn’t one!
- Be realistic about hypothesis generating vs. hypothesis testing.
- False discovery rate: controls the percent of genes on the list that are false.
- Permutation testing: controls for lots of correlated tests.
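To make the scale of the problem concrete, the sketch below shows a Bonferroni-style per-test threshold for 1,000,000 tests and a Benjamini-Hochberg procedure as one standard way to control the false discovery rate. The slide does not name a specific FDR method, so Benjamini-Hochberg is an assumption here, and the p-values are invented.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error at alpha."""
    return alpha / n_tests

def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of tests declared significant at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Largest rank k with p_(k) <= q * k / m defines the rejection set.
        if p_values[idx] <= q * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])

print(bonferroni_threshold(0.05, 1_000_000))

# Invented p-values for ten tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
            0.060, 0.074, 0.205, 0.212, 0.216]
significant = benjamini_hochberg(p_values, q=0.05)
print(significant)
```

Neither procedure uses the correlation between nearby SNPs, which is the gap permutation testing is meant to fill.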

“Imputation” at untyped SNPs
The idea: Use the HapMap database to impute genotypes for your samples at all the SNPs in between the ones you genotyped. Do a test at each of those SNPs in addition to the typed ones. Should increase overall study power even if multiple comparisons are correctly controlled for.
(Figure: typed and untyped SNPs; “blue” at the typed SNP => “blue” at the untyped one as well.)

“Imputation” at untyped SNPs
The best thing: Allows joint analyses of datasets that were genotyped with different chips!
Limitations:
- Only helpful if the correlation structure in HapMap is valid for your population.
- Only helpful for SNPs in the database (contrast to haplotype analysis).
Open questions:
- Best imputation methods in theory and practice?
- What populations should you base the imputation on?
- Imputed SNPs have different statistical properties (e.g. slightly higher variance). How do we account for that?

Meta-analysis
Typical GWAS papers now combine results from many studies. What are the best meta-analysis methods for doing this?
- What if the same SNPs are not typed in all studies?
- What if the phenotype is not measured the same way?
- What if some SNPs are imputed?

Software for genetic association studies
- PLINK is the primary tool. Bioinformatics is incorporated.
- There are some useful R packages as well. Need R for fancier analyses; typically integrate it with PLINK.
- Lots of new stuff constantly under development for large-scale data management and viewing: WGAViewer, LocusZoom.
- Lots of specialty packages for: HWE, haplotype analysis, family association, other stuff.