Inferring Missing Genotypes in Large SNP Panels

Adam Roberts, Leonard McMillan, Wei Wang, Joel Parker, Ivan Rusyn, and David Threadgill
University of North Carolina at Chapel Hill, USA

Motivation and Overview
- High-throughput genotyping techniques yield many missing calls.
- We have developed fast algorithms for inferring missing genotypes.
- Tested on isogenic animals (recombinant-inbred lines), where phasing is not a confounding issue.
- Our method delivers accuracy competitive with the best imputation algorithms at a cost of only a few µs per imputation.

Mouse SNP Data
[Table: SNP genotypes for Strain A (A/J), Strain B (BALB), Strain C (B6), Strain D (C3), and Strain E (DBA), shown first as nucleotide calls and then recoded as binary values]

Realistic SNP Data
Typical genotyping technologies give “no-calls” for approximately 5%-10% of a SNP dataset.
Four options:
- Modify tools to accommodate missing data
- Throw away SNPs
- Resequence: prohibitively expensive
- Impute: less accurate but “free”
[Figure: SNP-by-strain matrix for strains A-E with missing calls marked “?”]

Previous Imputation Approaches
- Hidden Markov Models (Stephens et al., 2001; Lin et al., 2002; Niu et al., 2002)
- Entropy Measures (Su et al., 2005)
- Expectation Maximization (Qin et al., 2002)
- Tree-Based Perfect Phylogeny (Eskin et al., 2003)
Despite their methodological differences, they have two things in common: they are complex and slow.

NPUTE
- A simple method for imputing missing genotypes based on a “nearest-neighbor” approach within arbitrary windows
- An efficient data structure for finding pairwise haplotype similarity
This simplicity leads to benefits in:
- Speed
- Exhaustive searches over multiple parameters
The result is a fast imputation approach with competitive accuracy.

Imputation Approach
Ideal Method: Within a haplotype block, find the nearest neighbor to the strain missing a genotype and fill it in with the neighbor’s value.
Problem: Finding haplotype blocks is a very difficult and time-consuming problem on its own.
Our Method: Find the nearest neighbor within a window extending L SNPs above and below the missing value.
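
The windowed nearest-neighbor idea can be sketched in a few lines. This is a minimal illustration, not the NPUTE implementation: the function names, the string encoding of SNPs, the example data, and the scoring (0 for a match, 1 for a mismatch, 0.5 when either call is unknown) are all assumptions made for the sketch.

```python
def mismatch(a, b):
    """Score one pair of calls: 0 = match, 1 = mismatch, 0.5 if either is unknown."""
    if a == "?" or b == "?":
        return 0.5
    return 0.0 if a == b else 1.0


def impute_naive(snps, row, strain, L):
    """Guess snps[row][strain] from the nearest neighbor within L SNPs either side."""
    lo, hi = max(0, row - L), min(len(snps), row + L + 1)
    best_score, best_call = None, "?"
    for other in range(len(snps[row])):
        call = snps[row][other]
        if other == strain or call == "?":
            continue  # a neighbor must have a known call at the target SNP
        score = sum(
            mismatch(snps[r][strain], snps[r][other])
            for r in range(lo, hi)
            if r != row  # the target SNP itself is not part of the comparison
        )
        if best_score is None or score < best_score:
            best_score, best_call = score, call
    return best_call


# Each string is one SNP; characters are the calls for strains A-E (illustrative data).
snps = ["10010", "10001", "011?0", "00101", "0?100"]
print(impute_naive(snps, row=2, strain=3, L=2))  # prints 0
```

Here strain A differs from strain D at only one of the four surrounding SNPs, so A is the nearest neighbor and its call is copied. Ties are broken in favor of the first strain encountered, which is an arbitrary choice in this sketch.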

How to Find the Best Window
We consider all symmetric windows of size 2L+1 for each practical L across the genome and use the closest match to “impute” all known values. Accuracy is estimated by imputing values of every known site for each L. The best L is an estimate of the average haplotype block size and is used for the imputation of “no-calls”.
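
The window search can be sketched as a leave-one-out loop: hide each known call, impute it with a given L, and keep the L that recovers the most calls. The helper names, the scoring constants, and the tie-breaking toward the smallest L are assumptions of this sketch, and the toy data are chosen only to make the behavior easy to follow.

```python
def window_score(snps, row, a, b, L):
    """Mismatch score between strains a and b around row (row itself excluded)."""
    total = 0.0
    for r in range(max(0, row - L), min(len(snps), row + L + 1)):
        if r == row:
            continue
        x, y = snps[r][a], snps[r][b]
        if x == "?" or y == "?":
            total += 0.5
        elif x != y:
            total += 1.0
    return total


def nearest_call(snps, row, strain, L):
    """Call of the lowest-scoring strain that has a known value at this SNP."""
    candidates = [
        (window_score(snps, row, strain, s, L), s)
        for s in range(len(snps[row]))
        if s != strain and snps[row][s] != "?"
    ]
    return snps[row][min(candidates)[1]] if candidates else "?"


def estimated_accuracy(snps, L):
    """Fraction of known calls recovered when each one is re-imputed."""
    correct = total = 0
    for row, snp in enumerate(snps):
        for strain, truth in enumerate(snp):
            if truth == "?":
                continue
            total += 1
            # nearest_call never reads snps[row][strain], so the truth stays hidden
            correct += nearest_call(snps, row, strain, L) == truth
    return correct / total


def best_window(snps, candidates):
    """Return the L with the highest estimated accuracy (smallest L wins ties)."""
    return max(candidates, key=lambda L: (estimated_accuracy(snps, L), -L))


# Toy panel: strains A and B share one haplotype, C and D share the other.
block = ["0011", "1100", "0011", "0011"]
print(best_window(block, [1, 2, 3]))  # prints 1
```

On this toy panel every window of at least one flanking SNP recovers all calls, so the smallest candidate is returned. On real data the accuracy curve would be expected to peak near the typical haplotype block size.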

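Naïve Approach
[Figure: SNP-by-strain matrix for strains A-E with a missing call “?” and a window of L = 2 SNPs above and below it. A scoring function compares the strain with the missing call against each other strain inside the window (1 per mismatch, 0.5 per unknown); example scores shown for the candidate strains include 1.5, 2, and 3.5.]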

NPUTE Data Structures
- Begin with ternary SNPs: S_ij ∈ {0, 1, ?}
- Build a Pairwise Mismatch Vector (PMV) for each SNP (scaled by 2 to allow integer arithmetic): 0 = Match, 1 = Unknown, 2 = Mismatch
- Sum PMVs to make the Mismatch Accumulator Array (MAA)
- Constant-time lookup for the PMV over any window using row subtraction
[Figure: example SNPs 10010, 10001, 011?0, 00101, 0?100, 0??01 with their mismatch vectors and the MAA; the PMV of SNP 10010 is 2202 020 20 2.]
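
The PMV and MAA can be illustrated with a short sketch using the example SNPs from this slide. The function names and the list-of-lists layout are assumptions; the leading row of zeros is a convenience added here so that any window reduces to one subtraction.

```python
from itertools import combinations


def pmv(snp):
    """Pairwise mismatch vector for one SNP: 0 = match, 1 = unknown, 2 = mismatch."""
    out = []
    for a, b in combinations(range(len(snp)), 2):
        if snp[a] == "?" or snp[b] == "?":
            out.append(1)
        else:
            out.append(0 if snp[a] == snp[b] else 2)
    return out


def build_maa(snps):
    """Running column sums of the PMVs, starting from a row of zeros."""
    pairs = len(snps[0]) * (len(snps[0]) - 1) // 2
    maa = [[0] * pairs]
    for snp in snps:
        maa.append([acc + v for acc, v in zip(maa[-1], pmv(snp))])
    return maa


def window_mismatch(maa, lo, hi):
    """Summed PMV over SNP rows lo..hi-1, obtained from a single row subtraction."""
    return [top - bottom for top, bottom in zip(maa[hi], maa[lo])]


# Each string is one SNP; characters are the calls for strains A-E.
snps = ["10010", "10001", "011?0", "00101", "0?100", "0??01"]
maa = build_maa(snps)
print(pmv(snps[0]))                # [2, 2, 0, 2, 0, 2, 0, 2, 0, 2]
print(window_mismatch(maa, 0, 2))  # [4, 4, 2, 2, 0, 2, 2, 2, 2, 4]
```

The first printed vector matches the slide's PMV for 10010 (pairs in the order AB, AC, AD, AE, BC, BD, BE, CD, CE, DE). The cost of `window_mismatch` depends on the number of strain pairs but not on how many SNPs the window spans.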

NPUTE Approach
[Figure: imputing the missing call “?” in the SNP-by-strain matrix for strains A-E using rows of the MAA]

NPUTE on Real Data
Perlegen Data (http://mouse.perlegen.com)
- 8.3 million SNPs
- 16 mouse strains
- 11.1% missing calls
150K Data
- 140K Broad/MIT mouse dataset + 10K GNF mouse dataset
- 46 mouse strains
- 4.2% missing calls

NPUTE on Perlegen Data
- 8.3 million SNPs with 16 strains
- We estimate that it would take 88 days for fastPhase to impute the data
- NPUTE: 60 µs per imputation, ~135 minutes for the entire dataset

NPUTE on 150K Data
- 150K SNPs with 46 strains
- 65 µs per imputation, ~7.5 minutes for the entire dataset

Extensions to NPUTE
- We can establish a measure of confidence in our calls based on the fraction of matching values of the nearest neighbor.
- A threshold can be set to impute only high-confidence calls.
- Imputation can proceed iteratively, allowing high-confidence calls to aid in the imputation of lower-confidence calls.
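
One way these extensions might fit together is sketched below. It is an assumption-laden illustration rather than the NPUTE code: confidence is taken as the fraction of window SNPs at which the best-matching strain agrees with the target strain, the nearest neighbor is the strain with the highest such fraction, and the function names and threshold values are hypothetical.

```python
def call_with_confidence(snps, row, strain, L):
    """Return (call, confidence) from the best-matching strain in the window."""
    rows = [r for r in range(max(0, row - L), min(len(snps), row + L + 1)) if r != row]
    best = ("?", 0.0)
    for other in range(len(snps[row])):
        if other == strain or snps[row][other] == "?":
            continue
        # count window SNPs where both strains have the same known call
        matches = sum(
            snps[r][strain] != "?" and snps[r][strain] == snps[r][other] for r in rows
        )
        confidence = matches / len(rows) if rows else 0.0
        if confidence > best[1]:
            best = (snps[row][other], confidence)
    return best


def impute_iteratively(snps, L, threshold):
    """Fill no-calls whose confidence meets the threshold; repeat until nothing changes."""
    grid = [list(snp) for snp in snps]
    changed = True
    while changed:
        changed = False
        for row in range(len(grid)):
            for strain in range(len(grid[row])):
                if grid[row][strain] != "?":
                    continue
                call, confidence = call_with_confidence(grid, row, strain, L)
                if call != "?" and confidence >= threshold:
                    grid[row][strain] = call  # later calls may now use this value
                    changed = True
    return ["".join(r) for r in grid]


print(impute_iteratively(["0011", "0011", "0?11", "0011"], L=1, threshold=1.0))
```

Each pass can only reduce the number of no-calls, so the loop terminates. Calls that never reach the threshold are left as “?”.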

Summary
- Available at http://compgen.unc.edu
- Better or competitive accuracy compared with alternative approaches
- Orders of magnitude faster
- O(NS²) space, where N is the number of SNPs and S is the number of strains
- O(S) time per imputation
- O(NS²) time for the whole genome
- Enables genome-wide imputation
- Further optimization and extension

MAA is Versatile
Small tweaks to the Mismatch Accumulator Array (MAA) support a variety of queries:
- Finding local regions of identity-by-descent
- Counting the number of unique haplotypes within arbitrary windows
Query speed is independent of window size.

Acknowledgement: EPA STAR RD832720, NSF IIS 0448392, NSF IIS 0534580
Questions?
