Optimal Tag SNP Selection for Haplotype Reconstruction Jin Jun and Ion Mandoiu Computer Science & Engineering Department University of Connecticut
Motivation and Contributions
To reduce prohibitively expensive haplotyping costs, a two-stage methodology has recently been proposed [3]:
-Pilot Study
 -All SNPs of interest are genotyped in a small sample of the population
 -Common haplotypes are inferred using statistical methods
 -A set of tag SNPs is selected for the population study
-Population Study
 -Tag SNPs are genotyped in the remaining population
 -Statistical methods are used to infer haplotypes over the tag SNPs
 -Haplotypes over the tag SNPs are extrapolated to full haplotypes
We propose novel tag SNP selection methods based on integer linear programming. Our methods
-Allow computing the complete tradeoff curve between genotyping cost and reconstruction accuracy
-Yield improved reconstruction accuracy by taking haplotype frequencies into account
Background
A Single Nucleotide Polymorphism (SNP) is a position in the genome at which exactly two of the four possible nucleotides occur in a large percentage of the population. SNPs account for most of the genetic variability between individuals. Diploid organisms such as humans carry two non-identical copies of each chromosome. A description of the SNPs on each chromosome is called a haplotype. At present, it is prohibitively expensive to determine the haplotypes of an individual directly, but it is fairly easy to obtain the conflated SNP information in the so-called genotype. The genotyping cost grows with the number of SNPs typed. To reduce this cost, one needs a small set of SNPs (tag SNPs) that predicts the remaining SNPs.
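To illustrate why a genotype is "conflated" haplotype information, here is a minimal Python sketch. The encoding is an assumption made for illustration, not taken from the poster: haplotypes are strings over 0/1 (one character per SNP), and a genotype records 0 or 1 at homozygous SNPs and 2 at heterozygous SNPs. The function name `genotype` is hypothetical.

```python
def genotype(h1, h2):
    """Conflate two haplotypes (strings over '0'/'1') into one genotype.

    Assumed encoding: '0' or '1' where both copies agree, '2' where they differ.
    """
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


# Two different haplotype pairs collapse to the same genotype,
# so the genotype alone does not determine the haplotypes.
print(genotype("01", "10"))  # 22
print(genotype("00", "11"))  # 22
print(genotype("01", "01"))  # 01
```

Under this encoding, any individual with two or more heterozygous SNPs has an ambiguous genotype, which is why a statistical phasing step is needed.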
Previous Work on Tag SNPs
Bafna et al. [1]: Informative SNP Set Problem
-Find a set of k SNPs with maximum “informativeness”
Sebastiani et al. [5]: Best Enumeration SNP Tags (BEST)
-Generates all optimum fully informative tag SNP sets
-Limitation: worst-case runtime grows exponentially
Barzuza et al. [2]: Phasing Tagging SNP problem
-Find the minimum number of SNPs for which every two distinct haplotype pairs yield distinct (XOR) genotypes
-Limitation: in practice, many pairs of haplotypes will give the same genotype even if all SNPs are used as tags
Halperin et al. [4]: Genotype Tagging SNPs
-Find a set of k SNPs allowing the most accurate genotype reconstruction
BEST time* as n grows: <.01s, 2s, 29s, 14m8s, 6h4m, 4d18h
* running BEST on the n x n identity matrix
Optimum Fully Informative Tag SNP Sets by Integer Programming
Given: haplotypes h_1, h_2, …, h_m over n SNPs
Find: minimum number of tag SNPs
Such that: every two distinct haplotypes differ in at least one tag SNP
Integer Program Formulation
-0/1 variable x_j for every SNP j
 -x_j = 1 if SNP j is selected as a tag SNP
 -x_j = 0 otherwise
-Can be solved efficiently using general-purpose solvers such as CPLEX
 -In practice significantly faster than BEST
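The integer program itself is not legible in this text. One formulation consistent with the Given/Find/Such that statement above is the following set-cover style program. It is a reconstruction under assumptions, not necessarily the exact form used on the poster; the notation h_i(j) for the allele of haplotype h_i at SNP j is introduced here for illustration.

```latex
\begin{align*}
\min\ & \sum_{j=1}^{n} x_j \\
\text{s.t.}\ & \sum_{j \,:\, h_i(j) \neq h_{i'}(j)} x_j \;\ge\; 1
  && \text{for all } 1 \le i < i' \le m \\
& x_j \in \{0,1\} && \text{for all } 1 \le j \le n
\end{align*}
```

Each constraint requires that, for one pair of distinct haplotypes, at least one SNP at which the two differ is chosen as a tag.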
Tag SNP Selection and Haplotype Reconstruction Flow
[Figure: flow diagram. Stages: Pilot Study, Population Study. Steps: Phasing, Tag Selection, Extrapolation. Data: Population Sample, Sample haplotypes (with frequencies), Tag SNP Set, Remaining Population, Genotype (tag SNPs), Haplotype pairs (tag SNPs), Haplotype pairs (all SNPs).]
Tag SNP Selection for Haplotype Reconstruction
Reconstruction Errors
-Haplotypes not represented in the sample population
 -Cannot be reconstructed!
 -Minimized by choosing a large enough sample
-Incorrectly inferred haplotypes over tag SNPs
 -Minimized by using accurate haplotype inference (phasing) methods
 -We use PHASE [6] for phasing sample genotypes as well as population genotypes over tag SNPs
-Incorrect haplotype extrapolation
 -Our extrapolation procedure:
  -Find the sample haplotype with minimum Hamming distance
  -Break ties according to the frequency of sample haplotypes (the most frequent haplotypes are given preference)
Informal Problem Definition
Given: sample haplotypes and frequencies
Find: K tag SNPs maximizing reconstruction accuracy
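The extrapolation procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: haplotypes are 0/1 strings, the Hamming distance is measured only at the tag positions, and `sample` maps each full sample haplotype to its estimated frequency. The names `hamming`, `extrapolate`, `tags` and `sample` are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def extrapolate(tag_hap, tags, sample):
    """Extend a haplotype over the tag SNPs to a full sample haplotype.

    tag_hap: inferred haplotype restricted to the tag positions
    tags:    list of tag SNP indices (0-based), in the same order as tag_hap
    sample:  dict mapping full sample haplotype -> frequency
    """
    def key(item):
        hap, freq = item
        restricted = "".join(hap[j] for j in tags)
        # Smaller distance first; among ties, higher frequency first.
        return (hamming(restricted, tag_hap), -freq)

    best_hap, _ = min(sample.items(), key=key)
    return best_hap


sample = {"00000": 0.5, "01011": 0.3, "11010": 0.2}
tags = [0, 1]
print(extrapolate("01", tags, sample))  # 01011 (exact match on the tags)
print(extrapolate("10", tags, sample))  # 00000 (distance tie, most frequent wins)
```

In the second call, "00000" and "11010" are both at distance 1 from the tag haplotype, so the tie is broken in favour of the more frequent sample haplotype.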
ILP Formulation (1)
ILP1
-0/1 variable x_j set to 1 iff SNP j is selected as a tag SNP
-Only K SNPs can be selected
-0/1 variable y_{i,i'} set to 1 iff haplotypes h_i, h_{i'} are distinguished by at least one selected SNP
-Objective is to maximize informativeness, i.e., the number of pairs of haplotypes distinguished by the selected SNPs
-Integer program formulation similar to that for the fully informative tag SNP problem
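The program is described only in words here. One formulation matching that description is given below, as a reconstruction under assumptions rather than the poster's exact form; h_i(j) denotes the allele of haplotype h_i at SNP j, a notation introduced here for illustration.

```latex
\begin{align*}
\max\ & \sum_{1 \le i < i' \le m} y_{i,i'} \\
\text{s.t.}\ & \sum_{j=1}^{n} x_j \;\le\; K \\
& y_{i,i'} \;\le\; \sum_{j \,:\, h_i(j) \neq h_{i'}(j)} x_j
  && \text{for all } 1 \le i < i' \le m \\
& x_j,\; y_{i,i'} \in \{0,1\}
\end{align*}
```

The second constraint lets y_{i,i'} equal 1 only if some selected SNP separates h_i from h_{i'}, and because the objective is maximized, y_{i,i'} is set to 1 whenever that is allowed.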
ILP Formulation (2)
ILPf: ILP with frequency
-Select K tag SNPs maximizing the total probability of distinguished pairs of haplotypes
-The probability of each haplotype in the population is estimated from the initial sample using PHASE-computed frequencies
-Reconstruction accuracy can be improved by considering haplotype frequencies
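The difference between counting distinguished pairs and weighting them by probability can be seen on a toy instance. The Python sketch below uses exhaustive search over all K-subsets in place of an ILP solver, which is only practical for very small inputs. It also assumes the probability of a haplotype pair is the product of the two haplotype frequencies; the poster does not spell out the weight, so this is an illustrative choice. All names are hypothetical.

```python
from itertools import combinations


def distinguished_weight(haps, tags, weight):
    """Total weight of haplotype pairs that differ in at least one tag SNP."""
    total = 0.0
    for a, b in combinations(range(len(haps)), 2):
        if any(haps[a][j] != haps[b][j] for j in tags):
            total += weight(a, b)
    return total


def best_tags(haps, K, weight):
    """Exhaustively pick the K SNP indices maximizing distinguished weight."""
    n = len(haps[0])
    return max(combinations(range(n), K),
               key=lambda tags: distinguished_weight(haps, tags, weight))


haps = ["000", "001", "100", "110"]
freqs = [0.4, 0.4, 0.1, 0.1]

# Unweighted objective (every pair counts 1): SNP 0 separates 4 pairs.
print(best_tags(haps, 1, lambda a, b: 1.0))                # (0,)
# Frequency-weighted objective (assumed weight p_a * p_b):
# SNP 2 separates the two most frequent haplotypes.
print(best_tags(haps, 1, lambda a, b: freqs[a] * freqs[b]))  # (2,)
```

With one tag allowed, the unweighted objective prefers the SNP that splits the most pairs, while the weighted objective prefers the SNP that splits the pair most likely to occur in the population.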
Experimental Setup
Datasets and Parameters: We used synthetic datasets generated following the methodology in [3] for 2 populations (European and West African) on 2 regions (IL8 and 5q31). For each of the 4 population-region combinations, we used the haplotypes and frequencies inferred in [3] from the real data to generate 5 datasets containing between 200 and 1000 individuals. For each dataset, we picked 5 random samples of size 5 times the number of SNPs (we ran our algorithm using predetermined block sizes of 10 and 20). Random selections of tag SNPs (Rand) were performed for comparison.
Phasing Accuracy (%)
Error Analysis
Correct haplotype pairs
-Single-Correct: inferred haplotype pair over tag SNPs is compatible with a single pair of sample haplotypes
-Multi-Correct: inferred haplotype pair over tag SNPs is compatible with multiple pairs of sample haplotypes, and the most frequent is correct
Incorrect haplotype pairs
-Missing: one or both real haplotypes are not present in the sample population
-Wrong Short: incorrectly inferred haplotypes over tag SNPs
-Multi-Wrong: inferred haplotype pair over tag SNPs is compatible with multiple pairs of sample haplotypes, and the most frequent is incorrect
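The five categories can be expressed as a small decision procedure. The Python sketch below is one possible reading, with several assumptions that the poster does not state: "compatible" is taken to mean an exact match on the tag SNPs, the frequency of a haplotype pair is taken to be the product of its haplotype frequencies, and the checks are applied in the order Missing, Wrong Short, then the Single/Multi cases. All names are hypothetical.

```python
from itertools import combinations_with_replacement


def restrict(hap, tags):
    """Restrict a full haplotype to the tag SNP positions."""
    return "".join(hap[j] for j in tags)


def classify(real_pair, inferred_tag_pair, tags, sample):
    """Assign one reconstructed individual to an error-analysis category.

    real_pair:         the two true full haplotypes
    inferred_tag_pair: the two haplotypes inferred over the tag SNPs
    sample:            dict mapping full sample haplotype -> frequency
    """
    if any(h not in sample for h in real_pair):
        return "Missing"
    real_tags = sorted(restrict(h, tags) for h in real_pair)
    if sorted(inferred_tag_pair) != real_tags:
        return "Wrong Short"
    # All unordered pairs of sample haplotypes matching the inferred tag pair.
    candidates = [
        pair for pair in combinations_with_replacement(sorted(sample), 2)
        if sorted(restrict(h, tags) for h in pair) == real_tags
    ]
    if len(candidates) == 1:
        return "Single-Correct"
    best = max(candidates, key=lambda p: sample[p[0]] * sample[p[1]])
    return "Multi-Correct" if sorted(best) == sorted(real_pair) else "Multi-Wrong"


sample = {"000": 0.5, "001": 0.3, "110": 0.2}
tags = [0]
print(classify(("000", "110"), ("0", "1"), tags, sample))  # Multi-Correct
print(classify(("001", "110"), ("0", "1"), tags, sample))  # Multi-Wrong
print(classify(("110", "110"), ("1", "1"), tags, sample))  # Single-Correct
```

With a single tag SNP, the haplotypes "000" and "001" are indistinguishable, so any pair involving one of them is ambiguous and is resolved in favour of the more frequent "000".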
Conclusions
Preliminary experiments show that using haplotype frequencies improves reconstruction accuracy compared to random selection and ILP1. In ongoing work, we are extending our methods to the reconstruction of long haplotypes by using integer program formulations based on overlapping blocks, and are comparing them to other reconstruction flows, including tag SNP based genotype reconstruction as in [4] followed by phasing.
References:
1. V. Bafna, B.V. Halldórsson, R.S. Schwartz, A.G. Clark, and S. Istrail, Haplotypes and informative SNP selection algorithms: Don’t block out information, RECOMB’03.
2. T. Barzuza, J.S. Beckmann, R. Shamir, and I. Pe’er, Computational Problems in Perfect Phylogeny Haplotyping: Xor-Genotypes and Tag SNPs, CPM 2004, LNCS 3109, pp. 14–31.
3. J. Forton, D. Kwiatkowski, K. Rockett, G. Luoni, M. Kimber, and J. Hull, Accuracy of haplotype reconstruction from haplotype-tagging single-nucleotide polymorphisms, American Journal of Human Genetics, 76(3).
4. E. Halperin, G. Kimmel, and R. Shamir, Tag SNP Selection in Genotype Data for Maximizing SNP Prediction Accuracy, Proc. ISMB.
5. P. Sebastiani, R. Lazarus, S.T. Weiss, L.M. Kunkel, I.S. Kohane, and M.F. Ramoni, Minimal haplotype tagging, Proc. National Academy of Sciences, 100(17).
6. M. Stephens, N. Smith, and P. Donnelly, A new statistical method for haplotype reconstruction from population data, American Journal of Human Genetics, 68.