Inferring Missing Genotypes in Large SNP Panels


Inferring Missing Genotypes in Large SNP Panels Adam Roberts, Leonard McMillan, Wei Wang, Joel Parker, Ivan Rusyn, and David Threadgill University of North Carolina at Chapel Hill, USA

Motivation and Overview: High-throughput genotyping techniques yield many missing calls. We have developed fast algorithms for inferring missing genotypes, tested on isogenic animals (recombinant-inbred lines), where phasing is not a confounding issue. Our method delivers accuracy competitive with the best imputation algorithms but costs only a few µs per imputation.

Mouse SNP Data: [Table: SNPs (rows) by strains (columns), shown first as nucleotide calls (C, T, G, A) and then recoded as binary values. Columns: SNP, Strain A (A/J), Strain B (BALB), Strain C (B6), Strain D (C3), Strain E (DBA). SNP rows: 1.2830, 1.3201132, 1.122926781, 2.58304197, 2.166182685, 3.3026173, Y.277893.]

Realistic SNP Data: Typical genotyping technologies give “no-calls” for approximately 5%-10% of a SNP dataset. Four options: (1) modify tools to accommodate missing data; (2) throw away SNPs; (3) resequence (prohibitively expensive); (4) impute (less accurate but “free”). [Figure: SNPs-by-strains matrix for strains A-E, with missing values marked “?”.]

Previous Imputation Approaches: Hidden Markov Models (Stephens et al., 2001; Lin et al., 2002; Niu et al., 2002); Entropy Measures (Su et al., 2005); Expectation Maximization (Qin et al., 2002); Tree-Based Perfect Phylogeny (Eskin et al., 2003). Despite their methodological differences, they have two things in common: they are complex and slow.

NPUTE: A simple method for imputing missing genotypes based on a “nearest-neighbor” approach within arbitrary windows, with an efficient data structure for finding pairwise haplotype similarity. This simplicity leads to benefits in speed and in exhaustive searches over multiple parameters. The result is a fast imputation approach with competitive accuracy.

Imputation Approach: Ideal Method: Within a haplotype block, find the nearest neighbor to the strain missing a genotype and fill it in with the neighbor’s value. Problem: Finding haplotype blocks is a very difficult and time-consuming problem on its own. Our Method: Find the nearest neighbor within a window extending L SNPs above and below the missing value.
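The windowed nearest-neighbor idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, an unknown genotype is assumed to count as half a mismatch, neighbors with a no-call at the target SNP are assumed to be skipped, and ties go to the first strain encountered. The six example SNPs are the ones that appear in the data-structure slide.

```python
def window_distance(snps, s, t, lo, hi):
    """Mismatch score between strains s and t over SNP rows lo..hi (inclusive)."""
    score = 0.0
    for row in snps[lo:hi + 1]:
        a, b = row[s], row[t]
        if a == "?" or b == "?":
            score += 0.5          # assumption: an unknown counts as half a mismatch
        elif a != b:
            score += 1.0
    return score


def impute_naive(snps, i, s, L):
    """Impute strain s at SNP i from its nearest neighbor within +/- L SNPs."""
    lo, hi = max(0, i - L), min(len(snps) - 1, i + L)
    best_score, best_value = None, "?"
    for t in range(len(snps[i])):
        if t == s or snps[i][t] == "?":
            continue              # assumption: a neighbor needs a known value at SNP i
        score = window_distance(snps, s, t, lo, hi)
        if best_score is None or score < best_score:
            best_score, best_value = score, snps[i][t]
    return best_value, best_score


# Rows are SNPs, columns are strains A-E.
snps = ["10010", "10001", "011?0", "00101", "0?100", "0??01"]
print(impute_naive(snps, 2, 3, 2))  # ('0', 1.5): strain A is nearest to strain D
```

Each call rescans the whole window for every candidate strain, so the cost grows with L. That is the overhead the accumulator array described later is designed to avoid.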

How to Find the Best Window: We consider all symmetric windows of size 2L+1 for each practical L across the genome and use the closest match to “impute” all known values. Accuracy is estimated by imputing values of every known site for each L. The best L is an estimate of the average haplotype block size and is used for the imputation of “no-calls”.
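A rough sketch of this window search, under stated assumptions: the helper names are hypothetical, the target SNP itself is left out of the distance so the known value cannot vote for itself, mismatches use the doubled integer scores (0 match, 1 unknown, 2 mismatch), and ties between window sizes go to the first candidate listed.

```python
def nearest_value(snps, i, s, L):
    """Value at SNP i of the strain closest to s within +/- L SNPs, or None."""
    lo, hi = max(0, i - L), min(len(snps) - 1, i + L)
    best = None
    for t in range(len(snps[i])):
        if t == s or snps[i][t] == "?":
            continue
        d = 0
        for r in range(lo, hi + 1):
            if r == i:
                continue          # assumption: hide the target genotype itself
            a, b = snps[r][s], snps[r][t]
            d += 1 if "?" in (a, b) else (0 if a == b else 2)
        if best is None or d < best[0]:
            best = (d, snps[i][t])
    return None if best is None else best[1]


def window_accuracy(snps, L):
    """Fraction of known genotypes recovered when each is re-imputed with window L."""
    correct = total = 0
    for i, row in enumerate(snps):
        for s, truth in enumerate(row):
            if truth == "?":
                continue
            guess = nearest_value(snps, i, s, L)
            if guess is not None:
                total += 1
                correct += guess == truth
    return correct / total if total else 0.0


def best_window(snps, candidates):
    """Candidate L with the highest estimated accuracy (first one wins ties)."""
    return max(candidates, key=lambda L: window_accuracy(snps, L))
```

For example, `best_window(snps, range(1, 50))` would scan L = 1..49 on a list of SNP strings; the chosen L is then reused to impute the actual no-calls.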

Naïve Approach: [Figure: SNPs-by-strains matrix for strains A-E with a missing value (“?”) in strain D.] With L = 2, the window around the missing value is compared against each other strain with a scoring function in which an unknown (“?”) scores 0.5. Resulting mismatch scores: A = 1.5, B = 2, C = 3.5, E = 3.5.

NPUTE Data Structures: Begin with ternary SNPs, S_ij ∈ {0, 1, ?}. Build a Pairwise Mismatch Vector (PMV) for each SNP, scaled by 2 to allow integer arithmetic: 0 = match, 1 = unknown, 2 = mismatch. Sum the PMVs to make the Mismatch Accumulator Array (MAA). The summed PMV over any window is then a constant-time lookup using row subtraction.

Example for strains A-E (PMV entries ordered AB AC AD AE, BC BD BE, CD CE, DE):

SNP      PMV              MAA row
(before)                  10 54 32 19 62 55 16 52 50 45
10010    2202 020 20 2    12 56 32 21 62 57 16 54 50 47
10001    2220 002 02 2    14 58 34 21 62 57 18 54 52 49
011?0    2210 012 12 1    16 60 35 21 62 58 20 55 54 50
00101    0202 202 20 2    16 62 35 23 64 58 22 57 54 52
0?100    1200 111 22 0    17 64 35 23 65 59 23 59 56 52
0??01    1102 111 11 2    18 65 35 25 66 60 24 60 57 54
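The PMV and MAA construction can be sketched in a few lines. This is an illustrative reading of the slide rather than the NPUTE source: the function names are hypothetical, the pair ordering simply follows `itertools.combinations`, and the accumulator here starts from a row of zeros.

```python
from itertools import combinations


def pmv(snp):
    """Pairwise mismatch vector for one SNP: 0 = match, 1 = unknown, 2 = mismatch."""
    return [1 if "?" in (a, b) else (0 if a == b else 2)
            for a, b in combinations(snp, 2)]


def build_maa(snps):
    """Prefix sums of PMVs; row k holds the sum over the first k SNPs."""
    n_pairs = len(snps[0]) * (len(snps[0]) - 1) // 2
    maa = [[0] * n_pairs]
    for snp in snps:
        maa.append([acc + m for acc, m in zip(maa[-1], pmv(snp))])
    return maa


def window_mismatch(maa, lo, hi):
    """Summed PMV over SNP rows lo..hi (inclusive) via a single row subtraction."""
    return [b - a for a, b in zip(maa[lo], maa[hi + 1])]


snps = ["10010", "10001", "011?0", "00101", "0?100", "0??01"]
maa = build_maa(snps)
print(pmv("10010"))                # [2, 2, 0, 2, 0, 2, 0, 2, 0, 2]
print(window_mismatch(maa, 0, 4))  # [7, 10, 3, 4, 3, 4, 7, 7, 6, 7]
```

The subtraction touches two rows regardless of how many SNPs lie between them, which is why the window size does not affect lookup cost.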

NPUTE Approach: To impute the missing value of strain D at SNP 011?0 with L = 2, subtract the MAA row just before the window from the MAA row at the end of the window, reading only the columns that pair D with another strain. A-D: 35 - 32 = 3; B-D: 59 - 55 = 4; C-D: 59 - 52 = 7; D-E: 52 - 45 = 7. Strain A has the smallest mismatch count (3), so it is the nearest neighbor and its value at that SNP is used.

NPUTE on Real Data: Perlegen Data (http://mouse.perlegen.com): 8.3 million SNPs, 16 mouse strains, 11.1% missing calls. 150K Data: 140K Broad/MIT mouse dataset + 10K GNF mouse dataset, 46 mouse strains, 4.2% missing calls.

NPUTE on Perlegen Data


NPUTE on Perlegen Data: 8.3 million SNPs with 16 strains. We estimate that it would take 88 days for fastPHASE to impute the data. NPUTE: 60 µs per imputation, ~135 minutes for the entire dataset.

NPUTE on 150K Data


NPUTE on 150K Data: 150K SNPs with 46 strains. 65 µs per imputation, ~7.5 minutes for the entire dataset.
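As a quick consistency check on the reported timings, the arithmetic below reads the per-imputation figures as microseconds and assumes, as an interpretation rather than something the slides state, that the dataset totals count one imputation per genotype cell (SNPs × strains). The function name is hypothetical.

```python
def total_minutes(n_snps, n_strains, microseconds_per_imputation):
    """Minutes needed if every SNP-by-strain cell is imputed once."""
    cells = n_snps * n_strains
    return cells * microseconds_per_imputation * 1e-6 / 60


print(round(total_minutes(150_000, 46, 65), 1))    # 7.5   (150K data)
print(round(total_minutes(8_300_000, 16, 60), 1))  # 132.8 (Perlegen data)
```

Under that assumption both totals land close to the figures on the slides (~7.5 and ~135 minutes).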

Extensions to NPUTE: We can establish a measure of confidence in our calls based on the fraction of matching values of the nearest neighbor. A threshold can be set to impute only high-confidence calls. Imputation can proceed iteratively, allowing high-confidence calls to aid in the imputation of lower-confidence calls.
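One way these extensions could look in code is sketched below. The details are assumptions made for illustration: confidence is taken as the fraction of mutually known window positions (excluding the target SNP) on which the two strains agree, the neighbor with the highest such fraction is used, and all function names are hypothetical.

```python
def neighbor_confidence(snps, i, s, L):
    """Return (value, confidence) from the best-matching neighbor of strain s at SNP i."""
    lo, hi = max(0, i - L), min(len(snps) - 1, i + L)
    best = None
    for t in range(len(snps[i])):
        if t == s or snps[i][t] == "?":
            continue
        known = matches = 0
        for r in range(lo, hi + 1):
            a, b = snps[r][s], snps[r][t]
            if r != i and a != "?" and b != "?":
                known += 1
                matches += a == b
        frac = matches / known if known else 0.0
        if best is None or frac > best[1]:
            best = (snps[i][t], frac)
    return best if best is not None else ("?", 0.0)


def impute_confident(snps, L, threshold):
    """One pass: fill only the no-calls whose confidence reaches the threshold."""
    out = [list(row) for row in snps]
    for i, row in enumerate(snps):
        for s, g in enumerate(row):
            if g == "?":
                value, conf = neighbor_confidence(snps, i, s, L)
                if conf >= threshold:
                    out[i][s] = value
    return ["".join(row) for row in out]


def impute_iteratively(snps, L, threshold):
    """Repeat passes so earlier high-confidence calls can inform later ones."""
    while True:
        filled = impute_confident(snps, L, threshold)
        if filled == snps:
            return filled
        snps = filled


snps = ["10010", "10001", "011?0", "00101", "0?100", "0??01"]
print(neighbor_confidence(snps, 2, 3, 2))  # ('0', 0.75)
```

Each pass only replaces no-calls, so the loop stops once a pass fills nothing new; calls that never reach the threshold stay marked as “?”.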

Summary: NPUTE (available at http://compgen.unc.edu) offers accuracy better than or competitive with alternative approaches and is orders of magnitude faster. It uses O(NS^2) space, where N is the number of SNPs and S is the number of strains, O(S) time per imputation, and O(NS^2) time for the whole genome. This enables genome-wide imputation. Further optimization and extension.

MAA is Versatile: Small tweaks to the Mismatch Accumulator Array (MAA) support a variety of queries, such as finding local regions of identity-by-descent and counting the number of unique haplotypes within arbitrary windows. Query speed is independent of window size.

Acknowledgement: EPA STAR RD832720, NSF IIS 0448392, NSF IIS 0534580. Questions?