Optimal Tag SNP Selection for Haplotype Reconstruction
Jin Jun and Ion Mandoiu
Computer Science & Engineering Department, University of Connecticut


Motivation and Contributions

To reduce prohibitively expensive haplotyping costs, a two-stage methodology has recently been proposed [3]:
- Pilot Study
  - All SNPs of interest are genotyped in a small sample of the population
  - Common haplotypes are inferred using statistical methods
  - A set of tag SNPs is selected for the population study
- Population Study
  - Tag SNPs are genotyped in the remaining population
  - Statistical methods are used to infer haplotypes over the tag SNPs
  - Haplotypes over the tag SNPs are extrapolated to full haplotypes

We propose novel tag SNP selection methods based on integer linear programming. Our methods:
- Allow computing the complete tradeoff curve between genotyping cost and reconstruction accuracy
- Yield improved reconstruction accuracy by taking haplotype frequencies into account

Background

A Single Nucleotide Polymorphism (SNP) is a position in the genome at which exactly two of the four possible nucleotides occur in a large percentage of the population. SNPs account for most of the genetic variability between individuals.

In diploid organisms such as humans, there are two non-identical copies of each chromosome. A description of the SNPs in each chromosome is called a haplotype. At present, it is prohibitively expensive to determine the haplotypes of an individual directly, but it is comparatively easy to obtain the conflated SNP information of the two copies, the so-called genotype.

The genotyping cost depends on the number of SNPs typed. To reduce this cost, a small number of SNPs (tag SNPs) that predict the remaining SNPs is needed.
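The conflation of two haplotypes into a genotype can be illustrated with a minimal sketch. The 0/1/2 encoding (0 or 1 for a homozygous site, 2 for a heterozygous site) and the function names are assumptions made for this example, not something the slides specify.

```python
from itertools import product

def genotype(h1, h2):
    """Conflate two haplotypes (strings over '0'/'1') into a genotype.

    Assumed encoding: '0' or '1' at a homozygous site, '2' at a
    heterozygous site.
    """
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))

def compatible_pairs(g):
    """List all unordered haplotype pairs that conflate to genotype g."""
    het = [i for i, c in enumerate(g) if c == "2"]
    pairs = set()
    for bits in product("01", repeat=len(het)):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, bits):
            h1[i] = b
            h2[i] = "1" if b == "0" else "0"  # the other copy carries the other allele
        pairs.add(tuple(sorted(("".join(h1), "".join(h2)))))
    return sorted(pairs)

g = genotype("0101", "0011")
print(g)                    # 0221
print(compatible_pairs(g))  # [('0001', '0111'), ('0011', '0101')]
```

A genotype with k heterozygous sites (k >= 1) is compatible with 2^(k-1) haplotype pairs, which is why haplotypes have to be inferred statistically from genotypes.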

Previous Work on Tag SNPs

- Bafna et al. [1]: Informative SNP Set Problem
  - Find a set of k SNPs with maximum "informativeness"
- Sebastiani et al. [5]: Best Enumeration SNP Tags (BEST)
  - Generates all optimum fully informative tag SNP sets
  - Limitation: worst-case runtime grows exponentially
- Barzuza et al. [2]: Phasing Tagging SNP problem
  - Find the minimum number of SNPs for which every two distinct haplotype pairs yield distinct (XOR) genotypes
  - Limitation: in practice, many pairs of haplotypes will give the same genotype even if all SNPs are used as tags
- Halperin et al. [4]: Genotype Tagging SNPs
  - Find a set of k SNPs allowing the most accurate genotype reconstruction

Running time of BEST on the n x n identity matrix (the corresponding values of n are missing from the transcript):
BEST time: <.01s | 2s | 29s | 14m8s | 6h4m | 4d18h

Optimum Fully Informative Tag SNP Sets by Integer Programming

Given: haplotypes h_1, h_2, …, h_m over n SNPs
Find: minimum number of tag SNPs
Such that: every two distinct haplotypes differ in at least one tag SNP

Integer Program Formulation
- 0/1 variable x_j for every SNP j
  - x_j = 1 if SNP j is selected as a tag SNP
  - x_j = 0 otherwise
- Can be solved efficiently using general-purpose solvers such as CPLEX
  - In practice significantly faster than BEST
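The problem statement above can be made concrete with a small sketch. It is not the integer program solved with CPLEX: it enumerates SNP subsets by increasing size, which is only practical for a handful of SNPs, and the function names are hypothetical.

```python
from itertools import combinations

def distinguishing_snps(h1, h2):
    """Set of SNP indices at which two haplotypes differ."""
    return {j for j, (a, b) in enumerate(zip(h1, h2)) if a != b}

def min_tag_set(haplotypes):
    """Smallest set of SNP indices on which all distinct haplotypes differ.

    Brute force over subsets of increasing size (exponential in the
    number of SNPs); an ILP solver would replace this loop in practice.
    """
    distinct = sorted(set(haplotypes))
    n = len(distinct[0])
    diff_sets = [distinguishing_snps(a, b) for a, b in combinations(distinct, 2)]
    for k in range(n + 1):
        for tags in combinations(range(n), k):
            chosen = set(tags)
            # every pair must differ in at least one chosen SNP
            if all(chosen & d for d in diff_sets):
                return list(tags)

print(min_tag_set(["0000", "0011", "1100", "1111"]))  # [0, 2]
print(min_tag_set(["1000", "0100", "0010", "0001"]))  # [0, 1, 2]
```

The second call uses the rows of the 4 x 4 identity matrix, where every pair of haplotypes differs in exactly two SNPs, so n - 1 tags are required.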

Tag SNP Selection and Haplotype Reconstruction Flow

[Flow diagram. Pilot Study: Population Sample -> Phasing -> Sample haplotypes (with frequencies) -> Tag Selection -> Tag SNP Set. Population Study: Remaining Population -> Genotype (tag SNPs) -> Phasing -> Haplotype pairs (tag SNPs) -> Extrapolation -> Haplotype pairs (all SNPs).]

Tag SNP Selection for Haplotype Reconstruction

Reconstruction Errors
- Haplotypes not represented in the sample population
  - Cannot be reconstructed!
  - Minimized by choosing a large enough sample
- Incorrectly inferred haplotypes over tag SNPs
  - Minimized by using accurate haplotype inference (phasing) methods
  - We use PHASE [6] for phasing sample genotypes as well as population genotypes over tag SNPs
- Incorrect haplotype extrapolation
  - Our extrapolation procedure:
    - Find the sample haplotype with minimum Hamming distance
    - Break ties according to the frequency of sample haplotypes (the most frequent haplotypes are given preference)

Informal Problem Definition
Given: sample haplotypes and frequencies
Find: K tag SNPs maximizing reconstruction accuracy
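The extrapolation procedure described above can be sketched as follows. The sketch assumes that the Hamming distance is measured over the tag SNP positions only and that the sample is given as a mapping from full haplotypes to estimated frequencies; both the data layout and the function name are illustrative assumptions.

```python
def extrapolate(tag_hap, tags, sample):
    """Extend a haplotype over the tag SNPs to a full sample haplotype.

    tag_hap: alleles at the tag SNPs, in the order given by `tags`
    tags:    indices of the tag SNPs within the full haplotypes
    sample:  dict mapping full sample haplotype -> estimated frequency
    """
    def key(item):
        hap, freq = item
        # Hamming distance restricted to the tag positions (assumption)
        dist = sum(hap[j] != a for j, a in zip(tags, tag_hap))
        return (dist, -freq)  # closest first, then most frequent

    return min(sample.items(), key=key)[0]

sample = {"00110": 0.5, "01011": 0.3, "01110": 0.2}
tags = [0, 1]
print(extrapolate("01", tags, sample))  # 01011
print(extrapolate("10", tags, sample))  # 00110
```

In the first call two sample haplotypes match the tag alleles exactly, and the more frequent one is returned.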

ILP Formulation (1)

ILP1
- 0/1 variable x_j set to 1 iff SNP j is selected as a tag SNP
- Only K SNPs can be selected
- 0/1 variable y_{i,i'} set to 1 iff haplotypes h_i, h_i' are distinguished by at least one selected SNP
- Objective is to maximize informativeness, i.e., the number of pairs of haplotypes distinguished by the selected SNPs
- Integer program formulation similar to that for the fully informative tag SNP problem
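The formulas from this slide are not part of the transcript. One natural way to write the constraints described above is sketched below; the set D(i,i') of SNPs at which h_i and h_i' differ is notation introduced here, and the authors' exact formulation may differ.

```latex
% D(i,i') = \{ j : h_i \text{ and } h_{i'} \text{ differ at SNP } j \}  (assumed notation)
\begin{align*}
\text{maximize}\quad   & \sum_{1 \le i < i' \le m} y_{i,i'} \\
\text{subject to}\quad & \sum_{j=1}^{n} x_j \le K, \\
                       & y_{i,i'} \le \sum_{j \in D(i,i')} x_j
                         && \text{for all } 1 \le i < i' \le m, \\
                       & x_j \in \{0,1\},\quad y_{i,i'} \in \{0,1\}.
\end{align*}
```

Under the same notation, the fully informative problem would instead minimize the sum of the x_j subject to the sum of x_j over D(i,i') being at least 1 for every pair of distinct haplotypes.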

ILP Formulation (2)

ILPf: ILP with frequencies
- Select K tag SNPs maximizing the total probability of the distinguished pairs of haplotypes
- The probability of each haplotype in the population is estimated from the initial sample using the frequencies computed by PHASE
- Reconstruction accuracy can be improved by considering haplotype frequencies
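A small sketch of the frequency-weighted objective, under two assumptions: the weight of a haplotype pair is taken to be the product of the two estimated frequencies, and the K tags are found by brute-force enumeration rather than by an ILP solver. The function name is hypothetical.

```python
from itertools import combinations

def select_tags(sample, K):
    """Pick K SNP indices maximizing the total weight of distinguished pairs.

    sample: dict mapping haplotype string -> estimated frequency.
    Pair weight = product of the two frequencies (assumption of this sketch).
    """
    haps = sorted(sample)
    n = len(haps[0])
    pairs = [
        ({j for j in range(n) if a[j] != b[j]}, sample[a] * sample[b])
        for a, b in combinations(haps, 2)
    ]

    def score(tags):
        chosen = set(tags)
        # a pair counts if at least one chosen SNP separates it
        return sum(w for d, w in pairs if chosen & d)

    return max(combinations(range(n), K), key=score)

skewed = {"00": 0.45, "01": 0.45, "10": 0.05, "11": 0.05}
uniform = {h: 0.25 for h in skewed}
print(select_tags(skewed, 1))   # (1,)
print(select_tags(uniform, 1))  # (0,)
```

With uniform weights both SNPs distinguish four pairs, so the first one is returned. With the skewed frequencies, SNP 1 separates the two common haplotypes and is preferred.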

Experimental Setup

Datasets and Parameters:
- We used synthetic datasets generated following the methodology in [3] for 2 populations (European and West African) on 2 regions (IL8 and 5q31).
- For each of the 4 population/region combinations, we used the haplotypes and frequencies inferred in [3] from the real data to generate 5 datasets containing between 200 and 1000 individuals.
- For each dataset, we picked 5 random samples of size 5 times the number of SNPs (we ran our algorithm using predetermined block sizes of 10 and 20).
- Random selections of tag SNPs (Rand) were performed for comparison.

[Figure: Phasing Accuracy (%)]

Error Analysis

Correct haplotype pairs
- Single-Correct: inferred haplotype pair over tag SNPs compatible with a single pair of sample haplotypes
- Multi-Correct: inferred haplotype pair over tag SNPs compatible with multiple pairs of sample haplotypes, and the most frequent is correct

Incorrect haplotype pairs
- Missing: one or both real haplotypes not present in the sample population
- Wrong Short: incorrectly inferred haplotypes over tag SNPs
- Multi-Wrong: inferred haplotype pair over tag SNPs compatible with multiple pairs of sample haplotypes, and the most frequent is incorrect

Conclusions
- Preliminary experiments show that using haplotype frequencies improves reconstruction accuracy compared to random selection and ILP1.
- In ongoing work, we are extending our methods to the reconstruction of long haplotypes by using integer program formulations based on overlapping blocks, and are comparing them to other reconstruction flows, including tag SNP based genotype reconstruction as in [4] followed by phasing.

References:
1. V. Bafna, B.V. Halldórsson, R.S. Schwartz, A.G. Clark, and S. Istrail, Haplotypes and informative SNP selection algorithms: Don't block out information, RECOMB'03.
2. T. Barzuza, J.S. Beckmann, R. Shamir, and I. Pe'er, Computational Problems in Perfect Phylogeny Haplotyping: Xor-Genotypes and Tag SNPs, CPM 2004, LNCS 3109, pp. 14–31.
3. J. Forton, D. Kwiatkowski, K. Rockett, G. Luoni, M. Kimber, and J. Hull, Accuracy of haplotype reconstruction from haplotype-tagging single-nucleotide polymorphisms, American Journal of Human Genetics, 76(3).
4. E. Halperin, G. Kimmel, and R. Shamir, Tag SNP Selection in Genotype Data for Maximizing SNP Prediction Accuracy, Proc. ISMB.
5. P. Sebastiani, R. Lazarus, S.T. Weiss, L.M. Kunkel, I.S. Kohane, and M.F. Ramoni, Minimal haplotype tagging, Proc. National Academy of Sciences, 100(17).
6. M. Stephens, N. Smith, and P. Donnelly, A new statistical method for haplotype reconstruction from population data, American Journal of Human Genetics, 68.