Association Analysis University of Louisville Center for Genetics and Molecular Medicine January 11, 2008 Dana Crawford, PhD Vanderbilt University Center.

Presentation transcript:

Association Analysis University of Louisville Center for Genetics and Molecular Medicine January 11, 2008 Dana Crawford, PhD Vanderbilt University Center for Human Genetics Research

Association Analysis Outline Study Design SNPs versus Haplotypes Analysis Methods Candidate Gene Whole Genome Analysis Replication and Function

Study Design Does your trait or phenotype have a genetic component? Segregation analysis Recurrence risks Heritability Other sources of evidence for a genetic component

Classic Segregation Analysis Determines if a major gene is involved Compares data to Mendelian models, such as Autosomal dominant Autosomal recessive X-linked Results can be used as parameters for linkage analysis (e.g. parametric LOD) Subject to ascertainment bias Note: More complex methods needed for complex traits

Recurrence Risks The chance that a disease present in the family will recur in that family “Lightning striking twice” If recurrence risk is greater in the family compared with unrelated individuals, the disease has a “genetic” component Suggests familial aggregation

Recurrence Risks Measured using the risk ratio (λ)
Sibling risk ratio: $\lambda_s = \dfrac{\text{sibling recurrence risk}}{\text{population prevalence}}$
Cystic fibrosis: $\lambda_s = 0.25 / 0.0004 = 625$
Huntington disease: $\lambda_s = 0.50 / 0.0001 = 5000$
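As a small worked sketch of the sibling risk ratio, the function below divides the sibling recurrence risk by the population prevalence. The name `sibling_risk_ratio` is illustrative and does not come from the slides; the inputs are the cystic fibrosis and Huntington disease values quoted on the slide.

```python
def sibling_risk_ratio(sibling_recurrence_risk, population_prevalence):
    """Return lambda_s = sibling recurrence risk / population prevalence."""
    if population_prevalence <= 0:
        raise ValueError("population prevalence must be positive")
    return sibling_recurrence_risk / population_prevalence


# Inputs taken from the slide (recurrence risk, population prevalence)
cystic_fibrosis = sibling_risk_ratio(0.25, 0.0004)
huntington = sibling_risk_ratio(0.50, 0.0001)

print(f"Cystic fibrosis lambda_s: {cystic_fibrosis:.0f}")
print(f"Huntington disease lambda_s: {huntington:.0f}")
```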

Recurrence Risks: Complex traits λ here is for first degree relative Merikangas and Risch (2003) Science 302:

Heritability Think “twin studies” (can also be family studies) The proportion of phenotypic variation in a population attributable to genetic variation Quantitative traits Heritability measured as $h^2$

Heritability and Quantitative Traits Determined by genes and environment Example: Height, NHANES versus NHANES [Figure: height in boys and girls, by group: Mexican Americans, Blacks, Whites] Freedman et al (2006) Obesity 14:

Heritability and Quantitative Traits
Trait variation = genetic + environment: $\sigma_T^2 = \sigma_G^2 + \sigma_E^2$
Genetic variation = additive + dominant: $\sigma_G^2 = \sigma_a^2 + \sigma_d^2$
Environmental variation = familial/household + random/individual: $\sigma_E^2 = \sigma_f^2 + \sigma_e^2$
Broad sense heritability: $h_B^2 = \sigma_G^2 / \sigma_T^2$
Narrow sense heritability: $h_N^2 = \sigma_a^2 / \sigma_T^2$
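A minimal sketch of the variance decomposition, assuming the four variance components are already estimated. The function name and the numeric inputs are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def heritability(var_additive, var_dominance, var_familial, var_random):
    """Return (broad-sense, narrow-sense) heritability from variance components."""
    var_genetic = var_additive + var_dominance        # sigma_G^2
    var_environment = var_familial + var_random       # sigma_E^2
    var_total = var_genetic + var_environment         # sigma_T^2
    if var_total <= 0:
        raise ValueError("total variance must be positive")
    broad = var_genetic / var_total                   # h_B^2
    narrow = var_additive / var_total                 # h_N^2
    return broad, narrow


# Hypothetical variance components (arbitrary units)
broad, narrow = heritability(4.0, 1.0, 2.0, 3.0)
print(f"Broad-sense h2: {broad:.2f}, narrow-sense h2: {narrow:.2f}")
```

Narrow-sense heritability can never exceed broad-sense heritability, because the additive variance is one part of the genetic variance.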

Heritability and Twins Studies $h^2 = 2(r_{MZ} - r_{DZ})$, where r is the correlation coefficient Monozygotic = same genetic material (~100% shared) Dizygotic = half genetic material (~50% shared)
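The twin-study estimate can be sketched in a few lines. The correlations below are hypothetical placeholders, not values from any of the cited studies, and the function name is illustrative.

```python
def twin_heritability(r_mz, r_dz):
    """Estimate h^2 = 2 * (r_MZ - r_DZ) from twin-pair trait correlations."""
    for r in (r_mz, r_dz):
        if not -1.0 <= r <= 1.0:
            raise ValueError("correlations must lie between -1 and 1")
    return 2.0 * (r_mz - r_dz)


# Hypothetical correlations for monozygotic and dizygotic twin pairs
h2 = twin_heritability(r_mz=0.8, r_dz=0.5)
print(f"Estimated h2: {h2:.2f}")
```

With noisy correlations the estimate can fall outside the 0 to 1 range, so this sketch does not clip the result.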

Heritability and Twins Studies Table columns: Trait, r(MZ), r(DZ), Reference. Rows: Cholesterol (Fenger et al); SBP (Evans et al); BMI (Schousboe et al); Perceived pitch (Drayna et al)

Heritability: Is everything genetic? Table columns: Trait, r(MZ), r(DZ), Reference. Rows: Vote choice (Hatemi et al); Religiousness (Koenig et al)

Other Evidence For A Genetic Component Monogenic disorders Example: Phenotype of interest is sensitivity to warfarin dosing, but there are no heritability estimates Solution: Rare, familial disorder of warfarin resistance

Other Evidence For A Genetic Component Case Reports Example: Phenotype of interest is susceptibility to Neisseria meningitidis (prevalence: 1/100,000) Solution: Case report of recurrent N. meningitidis in patient

Other Evidence For A Genetic Component Animal models Biochemistry or biological pathways Expression data Previous genetic association studies Other good arguments…

Study Design How well can you diagnose the disease or measure the trait? Narrow definitions better than all-inclusive definitions There are many paths that lead to the same phenotype Avoid misclassification and measurement error Direct measurement versus recall/survey data or indirect proxies Be aware of age of onset Can your control become a case over time? Arguably most important step in study design

Target Phenotypes Disease or Quantitative trait? Carlson et al. (2004) Nature 429: MI CRP LDL-C IL6 LDLR Acute Illness Diet Note: SNPs associated with quantitative traits may not be associated with clinical endpoint

Study Design How many cases and controls will you need to detect an association? Statistical Power Null hypothesis: all alleles are equal risk Given that a risk allele exists, how likely is a study to reject the null? Study sample size ideally determined before you begin to recruit and genotype

Study Design What are the thresholds/variables in a general power calculation? Statistical significance – Significance = p(false positive) – Traditional threshold 5% (Note: significance threshold for 1 SNP tested) Statistical power – Power = 1 − p(false negative) – Traditional threshold 80% Traditional thresholds balance confidence in results against reasonable sample size

Study Design Power Calculation Resources Quanto (hydra.usc.edu/gxe/) Supports quantitative, discrete traits (unrelated and family based) Genetic Power Calculator (pngu.mgh.harvard.edu/~purcell/gpc/) Supports discrete traits, variance components, quantitative traits for linkage and association studies (List of other software: linkage.rockefeller.edu/soft/)

Study Design How can you maximize power for your study?
– Large sample size: better estimate of variability or risk; chance of misclassification / measurement error
– Large genetic effect size: SNP risk allele with large odds ratio or explains a lot of trait variance (this is unknown at beginning of study)
– Risk SNP is common (this is unknown at beginning of study): calculate power for a range of common MAFs (5-45%)
– Genotype the risk SNP directly (risk SNP is unknown at beginning of study): remember tagSNPs are imperfect proxies; adjust sample size by $1/r^2$
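The two calculations mentioned on this slide (power across a range of MAFs, and the 1/r² sample size adjustment for tagSNPs) can be sketched as follows. This is a rough normal approximation for an allelic case/control test, not the method implemented in Quanto or the Genetic Power Calculator. The function names, the sample sizes, and the odds ratio of 1.5 are all assumptions made for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def case_allele_freq(control_freq, odds_ratio):
    """Risk-allele frequency in cases implied by control frequency and allelic OR."""
    odds = odds_ratio * control_freq / (1.0 - control_freq)
    return odds / (1.0 + odds)


def allelic_power(n_cases, n_controls, control_freq, odds_ratio, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test on allele counts."""
    p0 = control_freq
    p1 = case_allele_freq(p0, odds_ratio)
    # Each person contributes two alleles, hence the factor of 2
    se = sqrt(p1 * (1 - p1) / (2 * n_cases) + p0 * (1 - p0) / (2 * n_controls))
    z_effect = abs(p1 - p0) / se
    normal = NormalDist()
    z_crit = normal.inv_cdf(1 - alpha / 2)
    return normal.cdf(z_effect - z_crit) + normal.cdf(-z_effect - z_crit)


def inflate_for_tagging(n_direct, r_squared):
    """Sample size needed when a tagSNP with the given r^2 replaces the risk SNP."""
    if not 0 < r_squared <= 1:
        raise ValueError("r_squared must be in (0, 1]")
    return ceil(n_direct / r_squared)


# Hypothetical design: 1000 cases, 1000 controls, allelic OR of 1.5
for maf in (0.05, 0.15, 0.25, 0.35, 0.45):
    print(f"MAF {maf:.2f}: power ~ {allelic_power(1000, 1000, maf, 1.5):.2f}")

print("Samples needed at r^2 = 0.5:", inflate_for_tagging(1000, 0.5))
```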

Study Design Calculated using Quanto MAF Power calculation example: Cases: Adverse reaction (wheezing) to flu vaccination Controls: Vaccinated children with no adverse reactions

Study Design Power calculation example: Immunogenicity to influenza A (H5N1) vaccine Calculated using Quanto 1.1.1

Study Design Why are you considering an association study instead of linkage? Linkage analysis is powerful for disorders with – Discernible pattern of inheritance – Rare alleles w/ large genetic effect sizes – High penetrance Not powerful for disorders that – have complex pattern of inheritance – are common – involve many risk alleles with small effect sizes – have low penetrance

Study Design Common variant/common disease hypothesis Common genetic variants confer susceptibility Risk-conferring alleles ancient; common across most populations Risk-conferring allele has small effect Multiple risk alleles expected for common disease; also environment

Should you design a candidate gene or whole genome study? Candidate gene association study – Interrogate specific genes or regions – Based on previous knowledge or biological plausibility – Hypothesis testing Whole genome association study – Interrogate the “entire” genome – No previous knowledge required – Hypothesis generation

Candidate gene association studies Choose gene based on previous knowledge – Gene function – Biological pathway – Previous linkage or association study Choose DNA variations for genotyping – Direct association approach – Indirect association approach

Direct Candidate Gene Association Study Genotype “functional” SNPs Collins et al (1997) Science 278: Example: Nonsynonymous SNPs

Direct Candidate Gene Association Study Botstein and Risch (2003) Nat Genet 33 Suppl: Problem: We don’t know what is functional and what is not functional

Direct Candidate Gene Association Study What would we miss? Functional synonymous SNPs in MDR1 alter P-glycoprotein activity Komar (2007) Science 315:

Direct Candidate Gene Association Study What would we miss? 99% human genome is non-coding Non-coding SNPs or DNA variations in – Introns – Intergenic regulatory regions

Indirect Candidate Gene Association Study Genotype a fraction of all SNPs regardless of “function” Rely on SNP-SNP correlations (linkage disequilibrium) to capture information for SNPs not genotyped Kruglyak (2005) Nat Genet 37:

Indirect Candidate Gene Association Study Linkage disequilibrium (LD) Measured by $r^2$
$r^2 = \dfrac{[f(A_1B_1) - f(A_1)f(B_1)]^2}{f(A_1)f(A_2)f(B_1)f(B_2)}$
$r^2 = 0$: SNPs are independent
$r^2 = 1$: SNPs are perfectly correlated AND have the same minor allele frequency
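A minimal sketch of the r² calculation for two biallelic SNPs, assuming the haplotype frequency f(A1B1) and the allele frequencies f(A1) and f(B1) are known. The function name and the example frequencies are hypothetical.

```python
def ld_r_squared(f_a1b1, f_a1, f_b1):
    """Return r^2 between two biallelic SNPs.

    f_a1b1: frequency of the haplotype carrying alleles A1 and B1
    f_a1, f_b1: frequencies of allele A1 and allele B1
    """
    f_a2 = 1.0 - f_a1
    f_b2 = 1.0 - f_b1
    denominator = f_a1 * f_a2 * f_b1 * f_b2
    if denominator <= 0:
        raise ValueError("both SNPs must be polymorphic")
    d = f_a1b1 - f_a1 * f_b1  # disequilibrium coefficient D
    return d * d / denominator


# Independent SNPs: haplotype frequency equals the product of allele frequencies
print(f"{ld_r_squared(0.12, 0.3, 0.4):.3f}")
# Perfectly correlated SNPs: A1 always travels with B1, same allele frequency
print(f"{ld_r_squared(0.3, 0.3, 0.3):.3f}")
```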

Indirect Candidate Gene Association Study Using LD to pick “tagSNPs” CRP European-descent: 10 SNPs >5% MAF CRP European-descent: 4 tagSNPs, $r^2 > 0.80$

Indirect Candidate Gene Association Study “tagSNPs” are population specific CRP European-descent 4 tagSNPs CRP African-descent 10 tagSNPs

Indirect Candidate Gene Association Study “tagSNPs” are population specific Merge sets for “cosmopolitan” set

Indirect Candidate Gene Association Study Multiple testing Testing many SNPs for association with disease status No consensus on correcting p-value – Bonferroni – False Discovery Rate Need to replicate findings in independent study

Indirect Candidate Gene Association Study: Pros and Cons Can interrogate all common SNPs in gene SNPs must be known and genotypes available to calculate LD and pick tagSNPs Multiple testing within a gene Limited to previous knowledge

Whole Genome Association Study Can now genotype 100K – 1 million SNPs Coverage depends on platform and chip – tagSNPs capturing HapMap common SNPs – Genic SNPs overrepresented – Conserved non-coding SNPs represented – Evenly spaced across genome Illumina Infinium assay Affymetrix GeneChips

Whole Genome Association Study Same study design and challenges as candidate gene – Mostly case-control (retrospective) – Multiple testing Data storage and higher-order interaction testing issues Hypothesis generation tool (replication)

Manolio et al. Nature Reviews Genetics 7, 812–820 (October 2006) Case/Control Study Designs For either candidate gene or whole genome

Case/Control Study Designs: Pros and Cons
Case/Control: Pros: easier to collect, less expensive. Cons: subject to bias, no risk estimates.
Prospective: Pros: risk estimates. Cons: harder to collect, more expensive, subject to bias.
For rare outcomes, case/control design may be only option

Case/Control Study Designs: Pros and Cons Types of bias Bias in selection of cases Those that are currently living Miss fatal or short episodes of disease Might miss mild diseases Referral/admission bias Non-response bias Exposure suspicion bias Family information bias Recall bias Manolio et al. Nature Reviews Genetics 7, 812–820 (October 2006) Often ignored in genetic association studies

Analysis Methods Genotype QC Test for departures of Hardy-Weinberg Equilibrium Test for gender inconsistencies Eliminate very rare SNPs (no power) Eliminate SNPs with low genotyping efficiency Eliminate samples with low genotyping efficiency
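One of the QC steps above, the test for departure from Hardy-Weinberg equilibrium, can be sketched as a 1-degree-of-freedom chi-square test on genotype counts. This is a simple textbook version; exact tests are often preferred for small counts. The function name and the genotype counts are hypothetical.

```python
from math import erfc, sqrt


def hwe_chi_square(n_major_hom, n_het, n_minor_hom):
    """Chi-square test (1 df) for departure from Hardy-Weinberg equilibrium.

    Returns (chi-square statistic, p-value).
    """
    n = n_major_hom + n_het + n_minor_hom
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_major_hom + n_het) / (2 * n)  # major allele frequency
    q = 1.0 - p
    if p == 0 or q == 0:
        raise ValueError("SNP is monomorphic; HWE test is undefined")
    observed = (n_major_hom, n_het, n_minor_hom)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: one SNP in equilibrium, one with no heterozygotes
print(hwe_chi_square(25, 50, 25))
print(hwe_chi_square(50, 0, 50))
```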

Analysis Methods What statistical methods do you use to analyze your data? SNP by SNP (borrowed from epidemiology) Chi-square and Fisher’s exact 2x2 table 2x3 table Logistic and linear regression Covariates Haplotypes Haplo.stats and regression Interactions Traditional regression MDR (Ritchie et al)

Analysis Methods The Case/Control Study
               Case   Control
Minor allele    A       B
Major allele    C       D
Odds ratio (OR) = ratio of odds of minor allele in Cases (A/C) and Controls (B/D): OR = (A*D)/(B*C)

Analysis Methods For genotypes, set homozygous for major allele (A) as “referent” genotype, and calculate 2 odds ratios:
Aa versus AA:
       Case   Control
Aa      A       B
AA      C       D
aa versus AA:
       Case   Control
aa      A       B
AA      C       D

Case/control: Interpretation of Odds Ratio 1.0 – Referent >1.0 – Greater odds of disease compared with the referent group <1.0 – Lesser odds of disease compared with the referent group Confidence Intervals: probably contain true OR OR does not measure risk*
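The odds ratio from a 2x2 table and its confidence interval can be sketched as below. The interval uses the common log-OR (Woolf) approximation, which is an assumption here since the slides do not say how the interval is computed. The cell labels follow the slide's A, B, C, D layout; the counts are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def odds_ratio_with_ci(a, b, c, d, confidence=0.95):
    """Return (OR, lower, upper) for a 2x2 table.

    a: cases with minor allele      b: controls with minor allele
    c: cases with major allele      d: controls with major allele
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the Woolf approximation
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical allele counts
or_value, ci_low, ci_high = odds_ratio_with_ci(30, 20, 70, 80)
print(f"OR = {or_value:.2f}, 95% CI {ci_low:.2f} to {ci_high:.2f}")
```

An interval that spans 1.0 means the data are compatible with no difference from the referent.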

Analysis Methods Prospective cohort Disease free at beginning of study Followed over time for disease (“incident”) Follow “exposed” and “unexposed” groups Gold-standard study design

Prospective cohort
              Case   Control   Total
Exposed        A       B       (A+B)
Unexposed      C       D       (C+D)
Risk Ratio (RR) = incidence of disease in Exposed, A/(A+B), divided by incidence in Unexposed, C/(C+D): RR = [A/(A+B)] / [C/(C+D)]

Analysis Methods Prospective Study: Interpretation of Risk Ratio 1.0 – Referent >1.0 – Risk for disease increases <1.0 – Risk for disease decreases Confidence Intervals: probably contain true RR *For rare diseases, OR ~ RR
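The note that OR approximates RR for rare diseases can be illustrated with two hypothetical cohorts, one with a rare outcome and one with a common outcome. The function names and all counts are assumptions made for illustration; the A, B, C, D cells follow the exposed/unexposed table layout.

```python
def risk_ratio(a, b, c, d):
    """RR = [A/(A+B)] / [C/(C+D)] with rows exposed/unexposed."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """OR = (A*D) / (B*C) for the same table."""
    return (a * d) / (b * c)


# Hypothetical rare outcome: 0.1% incidence in exposed, 0.05% in unexposed
rare = (10, 9990, 5, 9995)
# Hypothetical common outcome: 50% incidence in exposed, 25% in unexposed
common = (500, 500, 250, 750)

for label, table in (("rare", rare), ("common", common)):
    print(f"{label}: RR = {risk_ratio(*table):.3f}, OR = {odds_ratio(*table):.3f}")
```

Both cohorts have the same risk ratio, but only in the rare-outcome cohort does the odds ratio stay close to it.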

Analysis Methods Case/control: Matching (Age, Gender, Race) Warning: Can “over match” and miss describing an interesting factor Bad Example: Cases: Adults with heart disease; Controls: Newborns without heart disease

Analysis Methods Case/control: Stratifying (Age, Gender, Race) Warning: Need sufficient sample size to stratify or split the data into males and females Ex. Cases with heart disease; Age-matched controls without heart disease (Exposure: smoking status); Stratify for Gender Specific Risks

Analysis Methods Problems in Case/Control genetic association studies – “Confounding” by race or ancestry, AKA population stratification Solutions: Match; Stratify; Adjust (using genetic markers); “Trios” Cardon and Palmer (2003) Lancet 361:

Analysis Methods Regression Given – Height as “target” or “dependent” variable – Sex as “explanatory” or “independent” variable Fit regression model $\text{height} = \beta \cdot \text{sex} + \varepsilon$

Analysis Methods Regression Given – Quantitative “target” or “dependent” variable y – Quantitative or binary “explanatory” or “independent” variables $x_i$ Fit regression model $y = \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i + \varepsilon$

Analysis Methods Regression Works best for normal y and x Can include covariates Fit regression model $y = \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i + \varepsilon$ Estimate errors on β’s Use t-statistic to evaluate significance of β’s Use F-statistic to evaluate model overall Use $R^2$ to evaluate variance explained by model
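The regression steps above (fit, estimate the error on the coefficient, t-statistic, R²) can be sketched for the single-predictor case, for example a quantitative trait regressed on an additive genotype code. This sketch includes an intercept term, which the slide's formula does not show explicitly, and it is written in plain Python rather than with a statistics package. The data are hypothetical.

```python
from math import sqrt


def simple_linear_regression(x, y):
    """Least-squares fit of y = intercept + slope * x with basic diagnostics."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must both vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se_slope = sqrt(residual_ss / (n - 2) / sxx)      # standard error of the slope
    t_statistic = slope / se_slope if se_slope > 0 else float("inf")
    r_squared = 1.0 - residual_ss / syy               # variance explained
    return {"slope": slope, "intercept": intercept, "se_slope": se_slope,
            "t": t_statistic, "r_squared": r_squared}


# Hypothetical data: additive genotype code (0, 1, 2) and a quantitative trait
genotype = [0, 0, 1, 1, 2, 2]
trait = [1.0, 1.2, 2.9, 3.1, 5.0, 5.2]
fit = simple_linear_regression(genotype, trait)
print({name: round(value, 3) for name, value in fit.items()})
```

The t-statistic would be compared against a t distribution with n − 2 degrees of freedom; with covariates, a multiple regression routine from a package such as R or PLINK would be used instead.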

Analysis Methods Coding Genotypes
Genotype   Dominant   Additive   Recessive
GG            0          0          0
AG            1          1          0
AA            1          2          1
Genotype can be re-coded in any number of ways for regression analysis
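The three codings for genotypes GG, AG and AA (counting the A allele) can be sketched as a small helper. The function name and the default coded allele are illustrative choices, not part of the slides.

```python
def code_genotype(genotype, coded_allele="A"):
    """Return recessive, additive and dominant codes for a two-letter genotype."""
    if len(genotype) != 2:
        raise ValueError("genotype must have exactly two alleles, e.g. 'AG'")
    copies = genotype.count(coded_allele)
    return {
        "recessive": int(copies == 2),   # 1 only with two copies
        "additive": copies,              # number of copies: 0, 1 or 2
        "dominant": int(copies >= 1),    # 1 with at least one copy
    }


for g in ("GG", "AG", "AA"):
    print(g, code_genotype(g))
```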

Example of gene-environment Interaction and traditional regression

Analysis Methods Statistical Packages for Genetic Association Studies Candidate gene association study SAS/Genetics STATA SPSS R PLINK Whole genome association study R PLINK

Analysis Methods Whole genome in PLINK (pngu.mgh.harvard.edu/~purcell/plink/) Can adjust for population stratification Can add covariates [Figure: genome-wide association results, MHC removed; genome-wide significance P = 5x10^-8] Plenge et al 2007 NEJM

SNPs versus Haplotypes There is no right answer: explore both The only thing that matters is the correlation between the assayed variable and the causal variable Sometimes the best assayed variable is a SNP, sometimes a haplotype

SNPs versus Haplotypes Statistical Packages for Genetic Association Studies with haplotypes
– Haplo.stats (haplotype regression): Lake et al, Hum Hered. 2003;55(1):
– PHASE (case/control haplotype): Stephens et al, Am J Hum Genet Mar;76(3):
– Haploview (case/control SNP analysis): Barrett et al, Bioinformatics Jan 15;21(2):
– SNPHAP (haplotype regression?): Sham et al Behav Genet Mar;34(2):

Analysis Methods Multiple testing
– Bonferroni correction: too conservative b/c each SNP tested may not be independent (LD). How many independent tests did you do? See Conneely and Boehnke AJHG (in press)
– False Discovery Rate: also has arbitrary threshold
– Best bet is replication
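The two corrections named on this slide can be sketched as follows. The false discovery rate procedure shown is the Benjamini-Hochberg step-up method, which is one common choice and an assumption here since the slide does not name a specific procedure. The p-values are hypothetical.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject each test whose p-value is at most alpha / number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k with p_(k) <= (k / m) * alpha
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            largest_rank = rank
    rejected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= largest_rank:
            rejected[index] = True
    return rejected


# Hypothetical p-values for 10 SNPs
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print("Bonferroni rejections:", sum(bonferroni_reject(p_values)))
print("Benjamini-Hochberg rejections:", sum(benjamini_hochberg_reject(p_values)))

# A Bonferroni threshold for one million tests at alpha = 0.05
print("Threshold for 1,000,000 tests:", 0.05 / 1_000_000)
```

Neither correction accounts for LD between SNPs, which is the reason the slide calls Bonferroni too conservative.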

Statistical Replication Carlson et al. AJHG 2005;77:64-77 Results Consistent with CARDIA CRP SNPs and CRP levels in NHANES III Crawford et al Circulation 2006; 114:

Functional Replication Statistical replication is not always possible Association may imply mechanism Test for mechanism at the bench – Is predicted effect in the right direction? – Dissect haplotype effects to define functional SNPs

CRP Evolutionary Conservation TATA box: 1697 Transcript start: 1741 CRP Promoter region (bp ) >75% conserved in mouse

Functional Replication Low CRP Levels Associated with H1-4 USF1 (Upstream Stimulating Factor) –Polymorphism at 1440 alters USF1 binding site H1-4 gcagctacCACGTGcacccagatggcCACTCGtt H7-8 gcagctacCACGTGcacccagatggcCACTAGtt H5-6 gcagctacCACGTGcacccagatggcCACTTGtt

High CRP Levels Associated with H6 USF1 (Upstream Stimulating Factor) –Polymorphism at 1421 alters another USF1 binding site H1-4 gcagctacCACGTGcacccagatggcCACTCGtt H7-8 gcagctacCACGTGcacccagatggcCACTAGtt H5 gcagctacCACGTGcacccagatggcCACTTGtt H6 gcagctacCACATGcacccagatggcCACTTGtt Functional Replication

CRP Promoter Luciferase Assay Carlson et al, AJHG v77 p64 Functional Replication

Association Analysis Outline Study Design SNPs versus Haplotypes Analysis Methods Candidate Gene Whole Genome Analysis Replication and Function