Genome-wide Association Studies John S. Witte
Association Studies Hirschhorn & Daly, Nat Rev Genet 2005 Candidate Gene or GWAS
Affymetrix Array Genome-wide Association Studies Altshuler & Clark, Science 2005
Genome-wide Association Studies (GWAS)
GWAS+ Strategy. Discovery: Multi-stage GWAS+. Confirmation / Characterization: Follow-up Genotyping+. Clarification: Sequencing+. [Figure axes: # Markers, # Samples, Time]
One- and Two-Stage GWA Designs. [Figure: SNPs (1,2,3,…,M) by samples (1,2,3,…,N). One-Stage Design: all SNPs on all samples. Two-Stage Design: Stage 1 and Stage 2 differ in the samples and markers genotyped.]
[Figure: SNPs by samples for the One-Stage Design and for the Two-Stage Design (Stage 1, Stage 2) under replication-based analysis and under joint analysis.]
Multistage Designs
– Joint analysis has more power than replication-based analysis.
– The p-value threshold in Stage 1 must be liberal.
– Multistage designs lower cost; they do not gain power.
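One common way to carry out a joint analysis is to combine the per-stage test statistics, weighted by the share of samples in each stage. The sketch below assumes each stage yields a z-score for the same SNP and effect direction; the function names and the weighting scheme are illustrative assumptions, not taken from the slides.

```python
import math


def joint_z(z_stage1, z_stage2, frac_stage1):
    """Combine two stage-specific z-scores into one joint z-score.

    frac_stage1 is the fraction of all samples genotyped in Stage 1
    (assumed weighting: square root of each stage's sample fraction).
    """
    if not 0.0 < frac_stage1 < 1.0:
        raise ValueError("frac_stage1 must be strictly between 0 and 1")
    return (math.sqrt(frac_stage1) * z_stage1
            + math.sqrt(1.0 - frac_stage1) * z_stage2)


def two_sided_p(z):
    """Two-sided p-value for a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical SNP: moderate evidence in each stage, half the samples in each.
z = joint_z(3.0, 3.0, 0.5)
p = two_sided_p(z)
```

With these hypothetical inputs the joint z-score is larger than either stage's own z-score, which is the intuition behind the power advantage of joint analysis.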
QC Steps
– Filter SNPs and individuals: low MAF, low call rates.
– Test for HWE among controls & within ethnic groups. Use a conservative alpha level.
– Check for relatedness using identity-by-state calculations.
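The per-SNP filters above can be sketched in a few lines. This is a minimal illustration, assuming genotypes are coded as 0/1/2 copies of one allele with None for a missing call; the function names are hypothetical, and the HWE test shown is the 1-df chi-square approximation (exact tests are usually preferred when genotype counts are small).

```python
import math


def call_rate(genotypes):
    """Fraction of samples with a non-missing genotype call."""
    called = [g for g in genotypes if g is not None]
    return len(called) / len(genotypes)


def minor_allele_freq(genotypes):
    """Minor allele frequency among called genotypes (0/1/2 coding)."""
    called = [g for g in genotypes if g is not None]
    freq = sum(called) / (2 * len(called))
    return min(freq, 1.0 - freq)


def hwe_chi2(genotypes):
    """Chi-square test (1 df) for Hardy-Weinberg equilibrium.

    Returns (chi2, p_value). Intended for controls only, as on the slide.
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)
    observed = [called.count(k) for k in (0, 1, 2)]
    p = (observed[1] + 2 * observed[2]) / (2 * n)  # frequency of counted allele
    q = 1.0 - p
    expected = [n * q * q, 2 * n * p * q, n * p * p]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical control genotypes for one SNP.
controls = [0] * 25 + [1] * 50 + [2] * 25
chi2, p_hwe = hwe_chi2(controls)
```

A SNP would then be dropped if its call rate or MAF falls below a chosen threshold, or if the HWE p-value is below the conservative alpha level; the thresholds themselves are study-specific choices.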
Analysis of GWAS
– Most common approach: look at each SNP one at a time.
– Possibly add in multi-marker information.
– Further investigate / report top SNPs only (by P-values).
– Or backwards replication…
GWAS Analysis
– Most commonly a trend test: log-additive model, fit with logistic regression.
– Adjust for potential population stratification.
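For a single SNP, the trend test can be computed directly from the case and control genotype counts. The sketch below uses the Cochran-Armitage trend test with additive weights (0, 1, 2); it is an unadjusted illustration with hypothetical counts, and does not include the covariate adjustment for population stratification that a logistic regression would allow.

```python
import math


def trend_test(case_counts, control_counts, weights=(0, 1, 2)):
    """Cochran-Armitage trend test for a 2 x 3 genotype table.

    case_counts / control_counts: numbers of subjects carrying
    0, 1, and 2 copies of the tested allele.
    Returns (chi2, p_value) with 1 degree of freedom.
    """
    totals = [r + s for r, s in zip(case_counts, control_counts)]
    n_cases = sum(case_counts)
    n_controls = sum(control_counts)
    n = n_cases + n_controls

    sum_t_cases = sum(t * r for t, r in zip(weights, case_counts))
    sum_t_all = sum(t * m for t, m in zip(weights, totals))
    sum_t2_all = sum(t * t * m for t, m in zip(weights, totals))

    numerator = n * (n * sum_t_cases - n_cases * sum_t_all) ** 2
    denominator = n_cases * n_controls * (n * sum_t2_all - sum_t_all ** 2)
    chi2 = numerator / denominator
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical genotype counts (0, 1, 2 copies of the risk allele).
chi2, p = trend_test([10, 20, 30], [30, 20, 10])
```

The statistic is undefined when the SNP is monomorphic (zero denominator), so such SNPs should be filtered out beforehand.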
Quantile-Quantile (QQ) Plot
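A QQ plot compares the sorted observed -log10 p-values with the values expected if no SNP were associated. The sketch below computes the plotted points and, as an added illustration not on the slide, the genomic inflation factor (lambda), a common summary of residual population stratification. The plotting position (i - 0.5)/n is one conventional choice and is an assumption here.

```python
import math
from statistics import NormalDist, median


def qq_points(pvalues):
    """Return (expected, observed) -log10 p-value pairs for a QQ plot."""
    n = len(pvalues)
    observed = sorted(pvalues)
    return [(-math.log10((i - 0.5) / n), -math.log10(p))
            for i, p in enumerate(observed, start=1)]


def genomic_inflation(pvalues):
    """Median observed 1-df chi-square divided by its null median."""
    norm = NormalDist()
    # Convert each two-sided p-value (0 < p <= 1) to a 1-df chi-square.
    chi2 = [norm.inv_cdf(p / 2.0) ** 2 for p in pvalues]
    null_median = norm.inv_cdf(0.25) ** 2  # about 0.455
    return median(chi2) / null_median


# Hypothetical p-values that are exactly uniform, i.e. no inflation.
pvals = [(i - 0.5) / 1000 for i in range(1, 1001)]
points = qq_points(pvals)
lam = genomic_inflation(pvals)
```

Points falling along the diagonal with lambda near 1 suggest little inflation; a systematic upward departure across the whole range points to stratification or other bias rather than true associations.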
Example: GWAS of Prostate Cancer. Multiple prostate cancer loci on 8q24 (Witte, Nat Genet 2007). [Figure: association results by chromosome]
Prostate Cancer Replications (Witte, Nat Rev Genet 2009): modest ORs.
Columns: Chr region; SNP; alleles (A); control freq; case freq; OR; p value; nearby genes / function. [Frequencies, ORs, most rs numbers, and p-value mantissas are missing in the source.]
2p15; rs721048; G/A; p x10^-9; EHBP1: endocytic trafficking
3p12; rs; C/T; p x10^-8; Intergenic
6q25; rs; C/T; SLC22A3: drugs and toxins
7q21; rs; T/C; p x10^-9; LMTK2: endosomal trafficking
8q24 (2); rs; C/A; Intergenic
8q24 (3); rs; T/G; Intergenic
8q24 (1); rs; C/A; Intergenic
10q11; rs; C/T; MSMB: suppressor prop.
10q26; rs; T/C; p x10^-8; CTBP2: antiapoptotic activity
11q13; rs; T/G; Intergenic
17q12; rs; G/A; HNF1B: suppressor properties
17q24; rs; T/G; Intergenic
19q13; rs; A/G; KLK2/KLK3: PSA
Xp11; rs; T/C; p x10^-9; NUDT10, NUDT11: apoptosis
SNPs Missed in Replication? Witte, Nat Rev Genet, ,223 smallest P-value! [Same prostate cancer locus table as on the preceding slide]
Manolio et al. Clin Invest 2008; www.genome.gov/gwastudies. Prostate Cancer
Population Attributable Risks for GWAS (Jorgenson & Witte, 2009). [Figure reference points: smoking & lung cancer; BRCA1 & breast cancer]
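The population attributable risk depends on both how common an exposure is and how strongly it raises risk. A minimal sketch using Levin's formula, PAR = p(RR - 1) / (1 + p(RR - 1)), follows; the prevalences and relative risks below are hypothetical illustrations, not values from Jorgenson & Witte.

```python
def population_attributable_risk(prevalence, relative_risk):
    """Levin's formula for the population attributable risk (a fraction).

    prevalence: proportion of the population carrying the exposure/variant.
    relative_risk: risk in carriers divided by risk in non-carriers.
    """
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must be between 0 and 1")
    excess = prevalence * (relative_risk - 1.0)
    return excess / (1.0 + excess)


# Hypothetical contrast: a common variant with a modest effect versus
# a rare variant with a large effect.
par_common = population_attributable_risk(0.30, 1.2)
par_rare = population_attributable_risk(0.001, 10.0)
```

Under these assumed inputs the common, weak variant accounts for a larger share of cases than the rare, strong one, which is why modest GWAS odds ratios can still correspond to sizeable attributable risks.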
Limitations of GWAS
– Not very predictive (Witte, Nat Rev Genet 2009).
– Example: AUC for breast cancer risk (Wacholder et al. NEJM 2010): Gail model = 58%; SNPs = 58.9%; Gail + SNPs = 61.8%.
Limitations of GWAS
– Not very predictive
– Explain little heritability
– Focus on common variation
– Many associated variants are not causal
Where’s the Heritability? (McCarthy et al., 2008)
– Many more of these? See: NEJM, April 30, 2009.
– Common disease rare variant (CDRV) hypothesis: diseases are due to multiple rare variants with intermediate penetrances (allelic heterogeneity).
Will GWAS results explain more heritability? Possibly, if…
1. Causal SNPs not yet detected due to power / practical issues (e.g., not yet included in replication studies).
2. Stronger effects for causal SNPs: the associated SNP may only serve as a marker for multiple different causal SNPs.
Imputation of SNP Genotypes
– Estimate unmeasured or missing genotypes.
– Based on measured SNPs and external info (e.g., haplotype structure of HapMap).
– Increase GWAS power.
– Allow for combining data across different platforms (e.g., Affy & Illumina) for replication / meta-analysis.
Imputation Example Study Sample HapMap/ 1K genomes Gonçalo Abecasis
Identify Match with Reference Gonçalo Abecasis
Phase chromosomes, impute missing genotypes Gonçalo Abecasis
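The match-then-fill idea in the preceding slides can be illustrated with a toy example. This sketch assumes already-phased haplotypes coded as 0/1 alleles and simply copies missing alleles from the single best-matching reference haplotype; real imputation software uses probabilistic haplotype models over mosaics of reference haplotypes, so treat this only as an intuition aid with hypothetical data.

```python
def impute_from_reference(study_hap, reference_haps):
    """Fill missing alleles (None) from the closest reference haplotype.

    study_hap: list of 0/1 alleles, with None where the SNP was not typed.
    reference_haps: list of fully typed haplotypes of the same length.
    """
    def mismatches(ref_hap):
        # Count disagreements at positions typed in the study sample.
        return sum(1 for s, r in zip(study_hap, ref_hap)
                   if s is not None and s != r)

    best = min(reference_haps, key=mismatches)
    return [r if s is None else s for s, r in zip(study_hap, best)]


# Hypothetical reference panel and a study haplotype typed at 3 of 5 SNPs.
reference = [
    [0, 0, 1, 1, 0],
    [1, 1, 0, 0, 1],
    [0, 1, 1, 0, 0],
]
study = [1, None, 0, None, 1]
imputed = impute_from_reference(study, reference)
```

Observed alleles are never overwritten; only the untyped positions take values from the matched reference haplotype.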
Imputation Application: TCF7L2 gene region & T2D from the WTCCC data (Marchini, Nature Genetics). [Figure: association by chromosomal position; observed genotypes in black, imputed genotypes in red]
Genome-wide Sequence Studies: Trade-off between number of samples, depth, and genomic coverage. (Gonçalo Abecasis)
Columns: sample size; depth; MAF 0.5-1%; MAF 2-5%. [Some cells are missing in the source.]
1,000; 20x; perfect
2,000; 10x; r^2 = 0.98; r^2 =
,000; 5x; r^2 = 0.90; r^2 = 0.98
Near-term Design Choices. For example, between:
1. Sequencing few subjects with extreme phenotypes: e.g., 200 cases, 200 controls, 4x coverage. Then follow-up in larger population.
2. 10M SNP chip based on 1,000 genomes. 5K cases, 5K controls.
Which design will work best…?
Polygenic Models (ISC / Purcell et al. Nature 2009)
Many weak associations combine to risk?
Score model: $x_j = \sum_{i=1}^{m} \ln(\mathrm{OR}_i)\,\mathrm{SNP}_{ij}$, where
– $\ln(\mathrm{OR}_i)$ = ‘score’ for SNP i from ‘discovery’ sample
– $\mathrm{SNP}_{ij}$ = # of alleles (0,1,2) for SNP i, person j in ‘validation’ sample.
– Large number of SNPs (m)
Is $x_j$ associated with disease?
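The score model on this slide can be sketched as a weighted allele count: each SNP's log odds ratio from the discovery sample multiplies the number of alleles a person in the validation sample carries. The odds ratios and genotypes below are hypothetical, and the function name is an illustrative assumption.

```python
import math


def polygenic_score(odds_ratios, allele_counts):
    """Sum of ln(OR_i) * SNP_ij over all SNPs i for one person j.

    odds_ratios: per-SNP odds ratios estimated in the discovery sample.
    allele_counts: the person's allele counts (0, 1, or 2) at the same SNPs.
    """
    if len(odds_ratios) != len(allele_counts):
        raise ValueError("need one allele count per SNP")
    return sum(math.log(odds_ratio) * count
               for odds_ratio, count in zip(odds_ratios, allele_counts))


# Hypothetical discovery-sample odds ratios for m = 3 SNPs.
discovery_ors = [1.2, 0.9, 1.1]
# Hypothetical validation-sample genotypes for two people.
score_a = polygenic_score(discovery_ors, [2, 0, 1])
score_b = polygenic_score(discovery_ors, [0, 2, 0])
```

In the validation sample, the scores would then be tested for association with disease status, for example by regressing case status on the score.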
Purcell / ISC et al. Nature 2009 Application of Model
Application to CGEMs PCa GWAS (Witte & Hoffman 2010)
– 1,172 cases, 1,157 controls from PLCO Trial. Oversampled more aggressive cases.
– Illumina 550K array.
– Outcomes: PCa, and PCa stratified by disease aggressiveness.
– Split into halves, resampling: one as ‘discovery’ sample; other as ‘validation’.
– LD filter: r^2 = 0.5.
Results for Prostate Cancer
Common Polygenic Model for Prostate and Breast Cancer? (Nat Rev Cancer 2010;10:)
– CGEMs GWAS data on prostate and breast cancer.
– Use one cancer as ‘discovery’ sample, the other as ‘validation’.
Results for PCa & BrCa
Complex diseases: Many causes = many causal pathways! [Figure: causal diagram for MI with nodes for genetic susceptibility, diet, physical activity, obesity, diabetes, hypertension, hyperlipidemia, atherosclerosis, and vulnerable plaques]
Pathways
– Many websites / companies provide ‘dynamic’ graphic models of molecular and biochemical pathways. Example: BioCarta.
– May be interested in potential joint and/or interaction effects of multiple genes in one pathway.
Moving Beyond Genome
– Transcriptome: all messenger RNA molecules (‘transcripts’).
– Proteome: all proteins in a cell or organism.
– Metabolome: all metabolites in a biological organism (end products of its gene expression).
– Systems Biology