Considerations for Planning a Candidate Gene Association Study With TagSNPs. Shehnaz K. Hussain, PhD, ScM (skhussain@ucla.edu). Epidemiology 243: Molecular Epidemiology
Objectives: molecular genetics primer; databases and tools to conduct in silico analyses for tagSNP selection/prioritization; factors influencing statistical power.
Central dogma: DNA (A, T, C, G) → mRNA → protein.
What are SNPs? More than 99% of all nucleotides are the same in all humans; fewer than 1% of nucleotides are polymorphic. SNPs are far more common than insertions-deletions. Most SNPs have two alternative nucleotides (bi-allelic), for example T (80%) and A (20%). Where do SNPs occur? Exons, introns, and flanking regions.
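What are haplotypes? A haplotype is the pattern of nucleotides on a single chromosome. Each person carries two "copies" of each chromosome. The haplotype inference problem: genotyping reports an unordered pair of nucleotides at each site (TA TT CG GG TA AA), so only the homozygous sites are resolved on each chromosome (? T ? G ? A) and the phase at the heterozygous sites must be inferred, for example T T C G T A on one chromosome and A T G G A A on the other.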
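To make the haplotype inference problem concrete, the sketch below enumerates every pair of haplotypes that is consistent with a set of unphased genotypes. The function name `compatible_haplotype_pairs` and the string encoding of genotypes are illustrative assumptions, not part of any tool named in these slides; real phasing software goes further and uses population data to choose the most likely pair.

```python
from itertools import product


def compatible_haplotype_pairs(genotypes):
    """Enumerate unordered haplotype pairs consistent with unphased genotypes.

    genotypes: list of 2-character strings, one per SNP, e.g. ["TA", "TT"].
    Returns a sorted list of (haplotype_1, haplotype_2) tuples.
    """
    per_site = []
    for g in genotypes:
        a, b = g[0], g[1]
        # A homozygous site has one orientation; a heterozygous site has two.
        per_site.append([(a, b)] if a == b else [(a, b), (b, a)])

    pairs = set()
    for combo in product(*per_site):
        h1 = "".join(alleles[0] for alleles in combo)
        h2 = "".join(alleles[1] for alleles in combo)
        # Sort so that (h1, h2) and (h2, h1) count as the same pair.
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


# Genotypes from the slide: three heterozygous sites, three homozygous sites.
slide_genotypes = ["TA", "TT", "CG", "GG", "TA", "AA"]
for h1, h2 in compatible_haplotype_pairs(slide_genotypes):
    print(h1, h2)
```

With k heterozygous sites there are 2^(k-1) distinct pairs, which is why genotype data alone cannot settle the phase once more than one site is heterozygous.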
What is linkage disequilibrium? Linkage disequilibrium (LD) describes the non-random association of nucleotides on the same chromosome in a population: one nucleotide at one position (locus) predicts the occurrence of another nucleotide at another locus. LD is a population measure, not something that is unique to an individual. [Figure: two panels, "No LD" and "LD".] In each panel, four chromosomes are drawn as blue lines, each carrying two SNPs; the variant (minor) allele is marked by a purple dot at position 1 or a red dot at position 2. In the "No LD" panel, the four possible allele combinations occur at equal frequencies, which indicates that there is no LD. In the "LD" panel there is high LD, because whenever the variant allele at position 1 is present, so is the variant allele at position 2.
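The two scenarios can be put into numbers with the standard pairwise LD statistics D, D' and r2. The sketch below is a minimal illustration for two bi-allelic SNPs; the function name `ld_measures` and its argument names are assumptions made for this example, and the haplotype frequencies used are idealized values, not data from the study.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, |D'|, r2) for two bi-allelic loci.

    p_ab: frequency of the haplotype carrying the variant allele at both loci
    p_a:  frequency of the variant allele at locus 1
    p_b:  frequency of the variant allele at locus 2
    """
    # D is the gap between observed and expected (independent) frequency.
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# No LD: all four allele combinations are equally frequent (0.25 each).
print(ld_measures(p_ab=0.25, p_a=0.5, p_b=0.5))

# High LD: the variant at position 2 appears only with the variant at position 1.
print(ld_measures(p_ab=0.5, p_a=0.5, p_b=0.5))
```

r2 is the statistic used later for tagSNP selection and for sample size adjustment, so it is the most useful of the three to keep in mind.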
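What are markers? [Figure: a candidate gene containing marker loci (SNPs) and a disease susceptibility locus (DSL), connected to the disease phenotype.] We test for association between the phenotype and the marker loci. Because the marker loci are in LD with the DSL, this indirectly tests for genetic association between the phenotype and the DSL.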
What are tagSNPs? TagSNPs are a subset of all SNPs in a gene that mark groups of SNPs in LD. Genotyping only the tagSNPs avoids redundant genotyping. [Figure: marker loci (SNPs) in LD with one another and with the disease susceptibility locus.]
The joint effect of tagSNPs in cytokine genes and cigarette smoking in cervical cancer risk
T-cell proliferation. [Figure labels: activated T cell; IL-2 gene; IL-2 receptor; IFNγ gene; proliferation of TH1 cells.]
Background. Cigarette smoking: ↑ 1.5- to 3-fold cancer risk; ↓ levels of IL-2 and IFNγ (cervical and circulating). ↓ levels of IL-2 and IFNγ are associated with: HPV persistence in the cervix; cervical neoplasia; decreased survival from invasive cervical cancer.
Model: the joint effect of cigarette smoking and SNPs in IL-2, IL-2R, and IFNG on HPV-associated squamous cell cervical cancer.
Methods. Study design: population-based case-only study. Subjects: 308 Caucasian squamous cell cervical cancer cases diagnosed 1986-2004, residing in 3 western Washington counties. Data collection: structured in-person interviews; DNA isolated from buffy coats.
Multi-stage tagSNP design: 1. Select reference panel. 2. Re-sequence panel and identify SNPs (many markers, few subjects). 3. Choose tagSNPs. 4. Genotype tagSNPs in main study (few markers, many subjects).
1. Select reference panel. Options: a sample of your study population (most representative), or samples from the Coriell Repository (ability to integrate your data with other resources). [Figure legend: candidate gene SNPs; HapMap SNPs.]
2. Re-sequence reference panel. Amplify and sequence the gene's DNA, then analyze the sequence with PolyPhred (Nickerson, 1997) and Phred/Phrap (Ewing, 1998).
Alternatives to re-sequencing. Programs for Genomic Applications (PGA): SeattleSNPs (inflammation), NIEHS SNPs (environmental response), and Innate Immunity. International HapMap Project: 5 million SNPs in four ethnically distinct populations.
3. Choose tagSNPs (LD). Options compared: r2 threshold (0.80), SNP exclusions/inclusions, and SNP design score. LDSelect (Carlson, 2002): r2 threshold, yes; SNP exclusions/inclusions, no. Tagger (de Bakker, 2005): r2 threshold, yes.
LDSelect output for IL-2 (SeattleSNPs, r2 ≥ 0.80, MAF ≥ 0.05, Caucasians). Bin / Total Number of Sites / TagSNPs: 1 2 rs2069763 rs2069772 rs2069776 rs2069778; 3 rs2069777 rs2069779; 4 rs2069762.
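The binning idea behind r2-threshold tagSNP selection can be sketched as a greedy loop: repeatedly pick the SNP that exceeds the threshold with the most remaining SNPs, put those SNPs in one bin, and report as tagSNPs the bin members that exceed the threshold with every other member. This is a simplified illustration, not the LDSelect or Tagger implementation; the function name `greedy_ld_bins`, the SNP labels `s1` to `s4`, and the r2 values are all hypothetical.

```python
def greedy_ld_bins(snps, r2, threshold=0.80):
    """Group SNPs into LD bins with a greedy r2-threshold rule.

    snps: list of SNP identifiers
    r2:   dict mapping frozenset({snp_a, snp_b}) to pairwise r2
    """
    def linked(a, b):
        # A SNP is always "linked" to itself; missing pairs count as r2 = 0.
        return a == b or r2.get(frozenset((a, b)), 0.0) >= threshold

    remaining = list(snps)
    bins = []
    while remaining:
        # Choose the SNP that captures the most remaining SNPs.
        best = max(remaining, key=lambda s: sum(linked(s, t) for t in remaining))
        members = [t for t in remaining if linked(best, t)]
        # Any member linked to all other members can tag the whole bin.
        tags = [s for s in members if all(linked(s, t) for t in members)]
        bins.append({"members": members, "tags": tags})
        remaining = [t for t in remaining if t not in members]
    return bins


# Hypothetical pairwise r2 values for four made-up SNPs.
example_r2 = {
    frozenset(("s1", "s2")): 0.95,
    frozenset(("s1", "s3")): 0.85,
    frozenset(("s2", "s3")): 0.60,
    frozenset(("s3", "s4")): 0.10,
}
for i, b in enumerate(greedy_ld_bins(["s1", "s2", "s3", "s4"], example_r2), 1):
    print(i, len(b["members"]), b["tags"])
```

Genotyping one tagSNP per bin is what avoids redundant genotyping: every other SNP in the bin is predicted by it at r2 at or above the threshold.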
Genomic context: exons (cSNPs), evaluated with SIFT (Ng, 2002) and PolyPhen (Ramensky, 2002); upstream flanking region; intron-exon junctions.
Sequence conservation: UCSC Genome Browser, PhastCons score (Siepel, 2005). [Figure: conservation score across a repeat region and a unique region.]
Minor allele frequency and genetic model. [Figure: 300 cases, 300 controls, alpha = 0.05.]
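How minor allele frequency drives power can be illustrated with a normal approximation to a two-sided allelic test (a comparison of allele frequencies between cases and controls). This is a rough sketch under stated assumptions, not the method used by the calculators named in these slides: it assumes a log-additive (per-allele) model, Hardy-Weinberg equilibrium, and that the control allele frequency equals the population MAF. The function name `allelic_test_power` and the odds ratio of 1.5 are chosen for illustration only.

```python
from statistics import NormalDist


def allelic_test_power(maf, odds_ratio, n_cases, n_controls, alpha=0.05):
    """Approximate power of a two-sided test comparing allele frequencies."""
    p0 = maf  # assumed allele frequency in controls
    odds1 = odds_ratio * p0 / (1 - p0)
    p1 = odds1 / (1 + odds1)  # implied allele frequency in cases
    # Each subject contributes two alleles.
    m1, m0 = 2 * n_cases, 2 * n_controls
    se = (p1 * (1 - p1) / m1 + p0 * (1 - p0) / m0) ** 0.5
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    z = abs(p1 - p0) / se
    # Probability that the test statistic falls in either rejection region.
    return nd.cdf(z - z_alpha) + nd.cdf(-z - z_alpha)


# Same design as the slide: 300 cases, 300 controls, alpha = 0.05.
for maf in (0.05, 0.10, 0.20, 0.40):
    power = allelic_test_power(maf, odds_ratio=1.5, n_cases=300, n_controls=300)
    print(f"MAF={maf:.2f} power={power:.2f}")
```

Other genetic models (dominant, recessive) change the implied case frequencies and therefore the power; the log-additive model is used here only because it keeps the sketch short.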
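Sample size requirement and LD (Pritchard, 2001). When a SNP is not genotyped, the sample size requirement becomes N/r2, where r2 is the LD between the genotyped SNP and the SNP that is not genotyped. S1 and S2 both genotyped (r2 = 1.00): sample size requirement 600. S1 genotyped, S2 not genotyped (r2 = 0.85): sample size requirement 706.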
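The N/r2 rule is simple enough to check by hand, and the short sketch below reproduces the two rows of the slide. The function name `required_sample_size` is an assumption for this example; rounding up to a whole subject is a choice made here.

```python
import math


def required_sample_size(n_direct, r2):
    """Sample size needed when the causal SNP is tagged rather than typed.

    n_direct: sample size needed if the causal SNP were genotyped directly
    r2:       LD between the genotyped tagSNP and the untyped SNP
    """
    if not 0 < r2 <= 1:
        raise ValueError("r2 must be in the interval (0, 1]")
    # Round up: a fraction of a subject cannot be recruited.
    return math.ceil(n_direct / r2)


print(required_sample_size(600, 1.00))  # both SNPs genotyped -> 600
print(required_sample_size(600, 0.85))  # only S1 genotyped   -> 706
```

An r2 threshold of 0.80 for tagSNP selection therefore caps the inflation at 1/0.80, or 25% more subjects than direct genotyping would need.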
Genotype error: generally non-differential; reduces your power. Every 1% increase in the genotyping error rate requires a 2-8% increase in sample size, depending on the error model (Zou et al, 2004, Genetic Epidemiology).
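As a back-of-the-envelope illustration of that rule of thumb, the sketch below applies the 2% and 8% bounds to a study of 600 subjects with a 1% genotyping error rate. The function name `inflated_sample_size` is hypothetical, and treating the increase as linear in the error rate is an assumption of this sketch; for real planning, an error-aware calculator such as PAWE is the appropriate tool.

```python
def inflated_sample_size(n, error_rate_pct, inflation_per_pct):
    """Scale a sample size for genotyping error.

    n:                 sample size needed with error-free genotypes
    error_rate_pct:    genotyping error rate in percent (e.g. 1.0 for 1%)
    inflation_per_pct: fractional increase per 1% error (0.02 to 0.08)
    Assumes the increase is linear in the error rate.
    """
    return n * (1 + error_rate_pct * inflation_per_pct)


low = inflated_sample_size(600, 1.0, 0.02)
high = inflated_sample_size(600, 1.0, 0.08)
print(f"1% error: between {low:.0f} and {high:.0f} subjects")
```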
Power calculators. Quanto: G, E, G x E, G x G; case-control, case-sibling, case-parent, and case-only designs; quantitative or binary outcome. htPowercc: r2. Power for Association With Error (PAWE): genotyping errors.
TagSNP summary. TagSNPs give efficient yet comprehensive coverage of the genetic variation in our candidate genes and reduce costs. Preference should be given to putatively functional variants, identified from the literature, gene context, and sequence conservation. Influences on statistical power: MAF, genetic model, LD, and genotyping error.
Programs for Genomic Applications: SeattleSNPs, http://pga.mbt.washington.edu ; NIEHS, http://egp.gs.washington.edu/ ; Innate Immunity, http://innateimmunity.net/ . International HapMap, http://www.hapmap.org/ . Coriell cell repository, www.coriell.org . cSNP predictive analysis: SIFT, http://blocks.fhcrc.org/sift/SIFT.html ; PolyPhen, http://coot.embl.de/PolyPhen . Vista, http://genome.lbl.gov/vista/index.shtml . The following programs can be found at the Rockefeller site, http://linkage.rockefeller.edu/soft/ : Tagger, LDSelect, PAWE, Quanto.