Published by Gonca Akçam. Modified over 5 years ago.
1
New Statistical Methods for Family-Based Sequencing Studies
Correcting for confounding introduced by batch effects and imputation in family-based designs with whole genome sequence data
Ellen M. Wijsman
New Statistical Methods for Family-Based Sequencing Studies, BIRS, August 5-10, 2018
2
Motivation
Interest in rare variation motivates family designs and next generation sequencing (NGS)
  Focus here on association analyses (LMM) accounting for kinship
NGS, especially whole genome sequencing (WGS), is expensive
Missing data is the norm: not all family members can or will be sequenced
  Subjects sequenced may favor cases/extreme phenotypes
  May want to re-use existing data from other studies, e.g., as controls
  Genotype imputation can augment observed data
WGS takes time & may be carried out at different locations
  Additional samples may be added later
3
Challenges
Multiple potential factors can lead to biased association tests/confounding
  Sequencing Platform (HiSeq 2000, HiSeq X)
  Sequencing Center (BROAD, Baylor, Wash U)
  Software & protocols (Atlas, GATK, filters)
  Reference materials (Genome ver., SV reference files)
  Batches within Center
  Versions of software, reference info, etc.
  Differential genotype quality
  Read depth variation
  Genotype quality choices differ in calling
  Observed vs. imputed genotypes
[Figure: pipeline stages. Raw reads (fastq files) -> Aligned reads (BAM files) -> Called genotypes (VCF files) -> Data analysis]
4
The Alzheimer’s Disease Sequencing Project (ADSP) – family data (WGS)
Investigate the role of rare variation in Late Onset AD (LOAD)
Learn about importance of coding vs. non-coding variation in LOAD
Big enough to learn something about new proposed analysis approaches
110 multiplex families, selected to avoid families with APOE e4/e4 subjects
  ~3400 total subjects, ~30% with some genotype data
  2-5 generations/family
  40% White, 60% Dominican Republic Hispanic
Supported by the NIH National Institute on Aging and the National Human Genome Research Institute
Data available to qualified investigators via dbGaP/NIAGADS
5
ADSP family data
Phases
  Discovery, GRCh37 ( ): 513 cases : 65 controls
  Extension, GRCh38 ( ): 30 cases : 221 controls
Subjects/samples/trait data
  ADGC consortium (19 sites)
  CHARGE consortium (6 sites)
  NCRAD
Genomic data
  SNP arrays: individual sites
  WGS: national sequencing centers
  Centers: Baylor University, Broad Institute, Washington University
  Platforms: HiSeq 2000 (Discovery), HiSeq X 10 (Extension)
  Genotype calling
6
Genotyping in ADSP families
[Figure: genotyping status of 27 subjects. Labels: 12 GWAS; no genos; 6 WGS+; 6 no WGS; 7 no phenos. A chromosome schematic contrasts GWAS SNPs with WGS SNVs.]
Speaker note: point out subjects with phenotype data that might get good imputed data.
7
Imputation from Inheritance Vectors (IVs)
MORGAN/gl_auto & ~6000 SNPs
  uses MCMC or exact computation: pedigrees can be very large, and (somewhat) complex
  obtains samples of IVs at the SNPs in intact pedigrees
IV1: [ ]
IV2: [ ]
IV3: [ ]
Cheung et al 2013 AJHG 92:
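To make the inheritance-vector idea concrete, here is a minimal sketch. It is not MORGAN's implementation. It treats an IV as one binary meiosis indicator per parent-to-child transmission and passes founder allele labels down a toy trio. The function name `descend_alleles`, the trio, and the bit convention (0 = the parent's first, paternally derived copy; 1 = the second, maternally derived copy) are assumptions chosen for illustration.

```python
def descend_alleles(pedigree, founders, iv):
    """Pass founder allele labels down a pedigree according to an inheritance vector.

    pedigree: list of (child, father, mother), parents listed before their children.
    founders: {founder: (label_of_first_copy, label_of_second_copy)}.
    iv:       {child: (bit_from_father, bit_from_mother)}; each bit selects which of
              the parent's two copies was transmitted (assumed convention).
    """
    alleles = dict(founders)
    for child, father, mother in pedigree:
        from_father, from_mother = iv[child]
        alleles[child] = (alleles[father][from_father], alleles[mother][from_mother])
    return alleles


# Hypothetical trio with founder allele labels 1-4.
founders = {"dad": (1, 2), "mom": (3, 4)}
pedigree = [("kid", "dad", "mom")]
iv = {"kid": (0, 1)}
print(descend_alleles(pedigree, founders, iv)["kid"])  # (1, 4)
```

Once each founder label is assigned an allelic type (a or A), every descendant's genotype follows from the IV, which is what makes sampled IVs usable for imputation.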
8
Imputation from Inheritance Vectors (IVs)
Use sampled IVs for imputation from WGS with GIGI, averaging across multiple sampled sets of IVs
  Sample an IV at the position of a WGS variant, given IVs at flanking SNPs
  Realize genotypes at the WGS variant, given observed data
  Average across sampled realizations
  Result for subject j:
Association testing with LMM (GEMINI), average imputed genotypes across sampled IVs (dose)
[Figure: pedigree sketch labeled "infer WGS geno", with alleles a and A]
Cheung et al 2013 AJHG 92:
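The averaging step can be sketched as follows. This is an illustration under stated assumptions, not GIGI's code: each sampled IV is assumed to yield one realized genotype for subject j, coded as the number of copies of allele A (0, 1 or 2), and the names `genotype_probs` and `dose` are hypothetical.

```python
from collections import Counter


def genotype_probs(realizations):
    """Fraction of sampled realizations giving 0, 1 and 2 copies of allele A."""
    counts = Counter(realizations)
    n = len(realizations)
    return [counts.get(g, 0) / n for g in (0, 1, 2)]


def dose(probs):
    """Expected number of copies of allele A under the genotype probabilities."""
    return probs[1] + 2 * probs[2]


# Four hypothetical realizations for one subject at one WGS variant.
probs = genotype_probs([0, 1, 1, 2])
print(probs, dose(probs))  # [0.25, 0.5, 0.25] 1.0
```

The dose is what enters the association model as the SNV term for subjects without observed WGS genotypes.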
9
Initial association analysis: discovery data with imputed genotypes
[Figure: QQ plot, Whites, genomewide. Model: AD ~ SNV + PCs]
Oh my gosh – what is going on?
When your QQ plot looks like a giraffe, there is clearly a problem somewhere!
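For readers who want to reproduce this kind of diagnostic, the coordinates of a QQ plot can be computed as in the sketch below. The function name `qq_points` and the plotting positions (i - 0.5)/n are assumptions; the talk does not say how its plots were drawn. P-values are assumed to be strictly positive.

```python
import math


def qq_points(pvalues):
    """Return (expected, observed) pairs of -log10 p-values for a QQ plot.

    Expected quantiles use the plotting positions (i - 0.5) / n, one common
    choice for uniformly distributed p-values under the null.
    """
    n = len(pvalues)
    points = []
    for i, p in enumerate(sorted(pvalues), start=1):
        expected = (i - 0.5) / n
        points.append((-math.log10(expected), -math.log10(p)))
    return points


# Hypothetical p-values from three variants.
for expected, observed in qq_points([0.9, 0.01, 0.5]):
    print(round(expected, 3), round(observed, 3))
```

Points far above the diagonal across most of the range, rather than only in the tail, suggest systematic inflation such as confounding rather than true signal.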
10
It took a while to figure out...
The problem was: There wasn’t just one problem.
11
Simplify the data
Use reduced number of WGS genotypes
  Reduce computational burden
  Focus on variants that have “behaved” in the past
  Avoid ultra-rare variants
  Use previously vetted variants
  Use ~250,000 variants from the WGS also on SNP arrays
12
Association testing with imputed data: discovery data
[Figure: initial QQ plot, Hispanic data. Model: AD ~ SNV + PCs. Subject groups: WGS+GWAS, GWAS only, no obs. data. Scale: imputation deviance, 0.0 to 0.6.]
WGS+GWAS: there is *some* missing data since not every WGS genotype is called
13
Measuring imputation (un)certainty
Original purpose: provide a “vote” for best genotype when merging imputation from two sources
High variance among genotype probabilities for a SNP indicates high certainty: V(p) =
  observed: V(p) = 1/3
  no info: V(p) = 0
  incomplete info:
Saad & Wijsman, 2014 GEPI
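The transcript does not preserve the formula for V(p). One definition that reproduces both endpoint values on the slide is the sample variance (divisor n - 1) of the three genotype probabilities, so the sketch below uses that as an assumption. The function name `certainty` is hypothetical.

```python
def certainty(probs):
    """Sample variance (divisor n - 1) of a subject's genotype probabilities at one SNP.

    Assumption: this matches the slide's V(p) because it gives 1/3 for an
    observed genotype and 0 for three equal probabilities.
    """
    mean = sum(probs) / len(probs)
    return sum((p - mean) ** 2 for p in probs) / (len(probs) - 1)


print(certainty([1.0, 0.0, 0.0]))        # observed genotype: about 1/3
print(certainty([1 / 3, 1 / 3, 1 / 3]))  # no information: about 0
print(certainty([0.8, 0.2, 0.0]))        # incomplete information: in between
```

Under this reading, "no info" corresponds to three equal genotype probabilities, and any partially informative imputation falls strictly between 0 and 1/3.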
14
Imputation deviance
Imputation certainty is a data-driven proxy for accuracy
Transform to imputation deviance:
Average over markers for a subject:
Range: 0 (best) to 1 (worst)
Low deviance implies high accuracy
[Figure: plot with axis "Imputation deviance"]
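The transformation itself is missing from the transcript. Since V(p) runs from 0 (no information) to 1/3 (observed) and the deviance is stated to run from 0 (best) to 1 (worst), one rescaling that fits is 1 - 3 V(p). The sketch below adopts that as an explicit assumption, with V(p) taken as the sample variance of the three genotype probabilities; `imputation_deviance` is a hypothetical name.

```python
def imputation_deviance(marker_probs):
    """Mean of 1 - 3 * V(p) over a subject's markers (assumed transformation).

    marker_probs: list of [P(0 copies), P(1 copy), P(2 copies)], one per marker.
    V(p) is the sample variance (divisor 2) of the three probabilities.
    """
    total = 0.0
    for probs in marker_probs:
        mean = sum(probs) / 3
        v = sum((p - mean) ** 2 for p in probs) / 2
        total += 1.0 - 3.0 * v
    return total / len(marker_probs)


# Hypothetical subject: one observed marker, one marker with no information.
print(imputation_deviance([[1.0, 0.0, 0.0], [1 / 3, 1 / 3, 1 / 3]]))  # about 0.5
```

A fully sequenced subject gets a deviance near 0, and a subject whose imputation carries no information gets a deviance near 1, which gives a continuous per-subject covariate rather than a yes/no batch label.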
15
Association testing: discovery sample
Fixing the bad QQ plots
Cause: confounding
  ~8:1 case:control ratio with WGS
  Higher fraction of controls with (imperfect) imputed WGS
Solution: adjust for
  admixture (or PCs)
  imputation status
  imputation deviance
[Figure: QQ plots, all Hispanic families, cases vs. controls. Panel A: original QQ plot (AD ~ SNV + PCs), bad. Panel B: + imputation deviance, better. Panel C: + imp. deviance & status, best!]
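The mechanism can be illustrated with a small simulation. This is a sketch under loud assumptions, not the analysis in the talk: it swaps the kinship-aware LMM (GEMINI) for ordinary least squares on a 0/1 outcome, omits PCs, and every rate and bias below is invented. The variant has no true effect; controls are imputed more often than cases, and imputed doses are shrunk and shifted in proportion to deviance.

```python
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols(rows, y):
    """Least-squares coefficients via the normal equations (rows include the intercept)."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


def simulate(n=20000, seed=1):
    """Return (ad, dose, imputed, deviance) tuples; all parameters are invented."""
    rng = random.Random(seed)
    subjects = []
    for _ in range(n):
        ad = 1 if rng.random() < 0.5 else 0
        # Controls are imputed far more often than cases (assumed rates).
        imputed = 1 if rng.random() < (0.1 if ad else 0.6) else 0
        # True genotype is independent of AD: the variant has no effect.
        g = (rng.random() < 0.3) + (rng.random() < 0.3)
        if imputed:
            dev = 0.2 + 0.6 * rng.random()
            # Imperfect imputation: shrink toward the mean and add a bias.
            dose = (1 - dev) * g + dev * 0.6 + 0.5 * dev
        else:
            dev, dose = 0.0, float(g)
        subjects.append((ad, dose, imputed, dev))
    return subjects


def snv_effects(subjects):
    """SNV coefficient without and with imputation status and deviance as covariates."""
    y = [s[0] for s in subjects]
    unadjusted = ols([[1.0, s[1]] for s in subjects], y)[1]
    adjusted = ols([[1.0, s[1], s[2], s[3]] for s in subjects], y)[1]
    return unadjusted, adjusted


if __name__ == "__main__":
    print(snv_effects(simulate()))
```

In this toy setup the unadjusted SNV coefficient is clearly negative even though the variant is null, while the coefficient adjusted for imputation status and deviance sits near zero, mirroring the move from panel A to panel C.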
16
Association: Discovery+extension
[Figure: QQ plot, Whites, WGS subjects only. Cov: PCs]
Extension data:
  New sequencing machines
  Called on new sequence build
  Heavily tilted towards controls
Discovery+extension re-called together
Sequencing platform: a new source of error needs additional correction
17
Association: Discovery+extension
[Figure: two QQ plots, all Whites (obs. & imputed). First panel: Cov: PCs, imp.dev, imp.status. Second panel: Cov: PCs, imp.dev, imp.status, platform.]
Not perfect yet...
18
Summary: New data, new problems
Multiple sources of differences in data sources can lead to confounding in analysis of NGS in pedigree data
Simple batch effects are not always sufficient corrections
A continuous measure related to the final product is helpful
This issue is going to become more common
We may have identified an approach to deal with data that is pertinent in other situations
  case-control samples with population imputation – varies across the genome
19
Thank you!