Inferring Missing Genotypes in Large SNP Panels Adam Roberts, Leonard McMillan, Wei Wang, Joel Parker, Ivan Rusyn, and David Threadgill University of North Carolina at Chapel Hill, USA
Motivation and Overview: High-throughput genotyping techniques yield many missing calls. We have developed fast algorithms for inferring missing genotypes. They were tested on isogenic animals (recombinant-inbred lines), where phasing is not a confounding issue. Our method delivers accuracy competitive with the best imputation algorithms but costs only a few µs per imputation.
Mouse SNP Data: [Table: nucleotide calls (e.g. C, T, G, A) per SNP and strain. Columns: SNP, Strain A (A/J), Strain B (BALB), Strain C (B6), Strain D (C3), Strain E (DBA). Rows: 1.2830, 1.3201132, 1.122926781, 2.58304197, 2.166182685, 3.3026173, Y.277893. Individual cell values were lost in extraction.]
Mouse SNP Data: [Table: the same SNPs and strains with cell values shown as numeric codes (e.g. 1). Individual cell values were lost in extraction.]
Realistic SNP Data: Typical genotyping technologies give “no-calls” for approximately 5%-10% of a SNP dataset. [Figure: SNP-by-strain matrix for strains A-E.]
Realistic SNP Data: Typical genotyping technologies give “no-calls” for approximately 5%-10% of a SNP dataset. Four options: (1) modify tools to accommodate missing data; (2) throw away SNPs; (3) resequence (prohibitively expensive); (4) impute (less accurate but “free”). [Figure: SNP-by-strain matrix for strains A-E with missing calls marked “?”.]
Previous Imputation Approaches: Hidden Markov Models (Stephens et al., 2001; Lin et al., 2002; Niu et al., 2002); Entropy Measures (Su et al., 2005); Expectation Maximization (Qin et al., 2002); Tree-Based Perfect Phylogeny (Eskin et al., 2003). Despite their methodological differences, they have two things in common: they are complex and slow.
NPUTE: A simple method for imputing missing genotypes based on a “nearest-neighbor” approach within arbitrary windows, together with an efficient data structure for finding pairwise haplotype similarity. This simplicity leads to benefits in speed and in exhaustive searches over multiple parameters. The result is a fast imputation approach with competitive accuracy.
Imputation Approach: Ideal Method: Within a haplotype block, find the nearest neighbor to the strain missing a genotype and fill it in with the neighbor’s value. Problem: Finding haplotype blocks is a very difficult and time-consuming problem on its own. Our Method: Find the nearest neighbor within a window extending L SNPs above and below the missing value.
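The windowed nearest-neighbor idea can be sketched as follows. This is a minimal illustration, not the NPUTE implementation: the function names, the encoding of each strain as a string over "0", "1" and "?", the exclusion of the missing site itself from the window, and the skipping of neighbors whose own call at that site is unknown are all assumptions. The scores follow the slides' scoring function (mismatch counts 1, unknown counts 0.5, match counts 0).

```python
def window_mismatch(a, b, center, L):
    """Mismatch score between two strains over SNPs center-L..center+L.

    The center SNP is skipped (assumption): it is the value being imputed.
    """
    lo, hi = max(0, center - L), min(len(a), center + L + 1)
    score = 0.0
    for i in range(lo, hi):
        if i == center:
            continue
        if a[i] == "?" or b[i] == "?":
            score += 0.5          # unknown call
        elif a[i] != b[i]:
            score += 1.0          # mismatch
    return score


def impute_naive(strains, target, snp, L):
    """Return (imputed value, neighbor name) for strains[target][snp]."""
    best_name, best_score = None, float("inf")
    for name, calls in strains.items():
        # A neighbor with no call at this SNP cannot donate a value.
        if name == target or calls[snp] == "?":
            continue
        s = window_mismatch(strains[target], calls, snp, L)
        if s < best_score:
            best_name, best_score = name, s
    if best_name is None:
        return "?", None
    return strains[best_name][snp], best_name


# Hypothetical toy data: five strains, five SNPs, one missing call in D.
toy = {"A": "10010", "B": "10011", "C": "01100", "D": "10?10", "E": "01101"}
print(impute_naive(toy, "D", 2, 2))
```

Each call scans up to 2L SNPs for every candidate strain, which is the per-imputation cost that a precomputed lookup structure is meant to avoid.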
How to Find the Best Window: For each practical L, we consider the symmetric window of 2L+1 SNPs around every site across the genome and use the closest match within it to “impute” each known value. Comparing these imputed values with the known calls gives an accuracy estimate for each L. The most accurate L serves as an estimate of the average haplotype block size and is used to impute the “no-calls”.
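The window search can be illustrated with a small leave-one-out loop. This is a sketch under stated assumptions rather than the authors' code: the function names are hypothetical, strains are strings over "0", "1" and "?", ties go to the first strain examined, and the brute-force scan stands in for whatever faster lookup a real implementation would use.

```python
def pair_score(a, b, lo, hi, skip):
    """Windowed score: mismatch = 1, unknown = 0.5, skipping index `skip`."""
    total = 0.0
    for i in range(lo, hi):
        if i == skip:
            continue
        if a[i] == "?" or b[i] == "?":
            total += 0.5
        elif a[i] != b[i]:
            total += 1.0
    return total


def impute_site(rows, t, snp, L):
    """Value of the nearest neighbor of strain t at `snp`, window half-width L."""
    lo, hi = max(0, snp - L), min(len(rows[t]), snp + L + 1)
    best, best_score = None, float("inf")
    for k, row in enumerate(rows):
        if k == t or row[snp] == "?":
            continue
        s = pair_score(rows[t], row, lo, hi, snp)
        if s < best_score:
            best, best_score = k, s
    return rows[best][snp] if best is not None else "?"


def estimated_accuracy(rows, L):
    """Fraction of known calls recovered when each is re-imputed with window L."""
    correct = total = 0
    for t, row in enumerate(rows):
        for snp, value in enumerate(row):
            if value == "?":
                continue
            total += 1
            correct += impute_site(rows, t, snp, L) == value
    return correct / total if total else 0.0


def best_window(rows, candidates):
    """Candidate L with the highest estimated accuracy (first one on ties)."""
    return max(candidates, key=lambda L: estimated_accuracy(rows, L))


# Hypothetical toy panel: two pairs of identical strains.
panel = ["0101", "0101", "1010", "1010"]
print(best_window(panel, [0, 1, 2]))
```

With L = 0 the window is empty and every strain looks equally close, so the accuracy estimate drops; that is the signal the search uses to prefer a larger window.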
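Naïve Approach: [Figure: SNP-by-strain matrix for strains A-E with a missing call “?”.]
Naïve Approach: [Figure: a window of L = 2 SNPs on either side of the missing call and the scoring function used to compare strains; surviving labels: “?”, 1, 0.5.]
Naïve Approach: [Figure: with L = 2, window scores for the other strains (A, B, C, E); surviving values: 1.5, 2, 3.5.]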
NPUTE Data Structures: Begin with ternary SNPs S_ij ∈ {0, 1, ?}. Build a Pairwise Mismatch Vector (PMV) for each SNP, scaled by 2 to allow integer arithmetic: 0 = match, 1 = unknown, 2 = mismatch. Sum the PMVs to make the Mismatch Accumulator Array (MAA), which gives constant-time lookup of the summed PMV over any window using row subtraction. [Figure: example SNPs, their PMVs and the resulting MAA. With strains ordered A-E and pairs ordered AB AC AD AE, BC BD BE, CD CE, DE: 10010 → 2202 020 20 2; 10001 → 2220 002 02 2; 011?0 → 2210 012 12 1; 00101 → 0202 202 20 2; 0?100 → 1200 111 22 0; 0??01 → 1102 111 11 2.]
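A minimal sketch of the PMV and MAA construction described on this slide, using the slide's 0/1/2 scaling. The function names, the leading all-zero MAA row, and the pair ordering (the order produced by `itertools.combinations`) are assumptions made for illustration; the slide does not specify the layout used in NPUTE.

```python
from itertools import combinations


def build_maa(snps):
    """Build the Mismatch Accumulator Array from per-SNP call strings.

    snps: list of strings, one per SNP, one character per strain ("0", "1", "?").
    Returns (pairs, maa) where maa[i] is the sum of the PMVs of SNPs 0..i-1,
    so maa has len(snps) + 1 rows and maa[0] is all zeros.
    """
    n_strains = len(snps[0])
    pairs = list(combinations(range(n_strains), 2))
    maa = [[0] * len(pairs)]
    for calls in snps:
        pmv = []
        for i, j in pairs:
            if calls[i] == "?" or calls[j] == "?":
                pmv.append(1)      # unknown
            elif calls[i] == calls[j]:
                pmv.append(0)      # match
            else:
                pmv.append(2)      # mismatch
        maa.append([acc + m for acc, m in zip(maa[-1], pmv)])
    return pairs, maa


def window_mismatch(maa, lo, hi):
    """Summed PMV over SNPs lo..hi-1 via one row subtraction."""
    return [b - a for a, b in zip(maa[lo], maa[hi])]


# SNP columns taken from the slide's example (strains A-E).
snps = ["10010", "10001", "011?0", "00101", "0?100", "0??01"]
pairs, maa = build_maa(snps)
print(window_mismatch(maa, 0, 1))  # [2, 2, 0, 2, 0, 2, 0, 2, 0, 2]
print(window_mismatch(maa, 0, 3))  # [6, 6, 3, 2, 0, 3, 4, 3, 4, 5]
```

The subtraction touches one entry per strain pair regardless of how many SNPs the window spans, which is why window size does not affect lookup cost. Reading off only the entries that involve one strain gives that strain's distance to every other strain.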
NPUTE Approach: [Figure: worked example of imputing a missing call “?” in the SNP-by-strain matrix (strains A-E) using MAA row subtraction.]
NPUTE on Real Data: Perlegen Data (http://mouse.perlegen.com): 8.3 million SNPs, 16 mouse strains, 11.1% missing calls. 150K Data: 140K Broad/MIT mouse dataset + 10K GNF mouse dataset, 46 mouse strains, 4.2% missing calls.
NPUTE on Perlegen Data
NPUTE on Perlegen Data: 8.3 million SNPs with 16 strains. We estimate that it will take 88 days for fastPhase to impute the data. NPUTE takes 60 µs per imputation, ~135 minutes for the entire dataset.
NPUTE on 150K Data
NPUTE on 150K Data: 150K SNPs with 46 strains. 65 µs per imputation, ~7.5 minutes for the entire dataset.
Extensions to NPUTE: We can establish a measure of confidence in our calls based on the fraction of matching values of the nearest neighbor. A threshold can be set to impute only high-confidence calls. Imputation can proceed iteratively, allowing high-confidence calls to aid in the imputation of lower-confidence calls.
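One way the confidence threshold might look in code is sketched below. The slide does not define the measure precisely, so the details here are assumptions: confidence is taken as the fraction of window positions (excluding the imputed site) where the target and its nearest neighbor both have calls and agree, and the function names and the threshold values are hypothetical.

```python
def match_fraction(target, neighbor, center, L):
    """Fraction of window positions where target and neighbor agree.

    Positions with an unknown call in either strain count as non-matching
    (assumption). The center position is excluded.
    """
    lo, hi = max(0, center - L), min(len(target), center + L + 1)
    positions = [i for i in range(lo, hi) if i != center]
    if not positions:
        return 0.0
    matches = sum(
        1 for i in positions if target[i] != "?" and target[i] == neighbor[i]
    )
    return matches / len(positions)


def impute_if_confident(target, neighbor, center, L, threshold):
    """Return (value, confidence); value stays "?" below the threshold."""
    conf = match_fraction(target, neighbor, center, L)
    if neighbor[center] != "?" and conf >= threshold:
        return neighbor[center], conf
    return "?", conf


print(impute_if_confident("10?10", "10010", 2, 2, 0.9))
print(impute_if_confident("10?10", "10011", 2, 2, 0.9))
```

An iterative scheme could call this repeatedly with a gradually lowered threshold, writing accepted calls back into the data so that later passes see fewer unknowns.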
Summary: Available at http://compgen.unc.edu. Accuracy better than or competitive with alternative approaches. Orders of magnitude faster. O(NS²) space, where N is the number of SNPs and S is the number of strains. O(S) time per imputation; O(NS²) time for the whole genome. Enables genome-wide imputation. Further optimization and extension.
MAA is Versatile: Small tweaks to the Mismatch Accumulator Array (MAA) support a variety of queries, such as finding local regions of identity-by-descent and counting the number of unique haplotypes within arbitrary windows. Query speed is independent of window size.
Acknowledgement: EPA STAR RD832720, NSF IIS 0448392, NSF IIS 0534580. Questions?