Linear Reduction Method for Tag SNPs Selection
Jingwu He, Alex Zelikovsky
Outline
– SNPs, haplotypes and genotypes
– Haplotype tagging problem
– Linear reduction method for tagging
– Maximizing tagging separability
– Conclusions & future work
Human Genome and SNPs
– Length of the human genome: 3 × 10^9 base pairs
– Difference between any two people: 0.1% of the genome, i.e. 3 × 10^6 SNPs
– Total number of single nucleotide polymorphisms (SNPs): 1 × 10^7
– SNPs are mostly bi-allelic, e.g., alleles A and C
– Minor allele frequency should be considerable, e.g. > 1%
– Diploid = two different copies of each chromosome
– Haplotype = description of a single copy (0, 1)
– Genotype = description of the two copies mixed (0 = 00, 1 = 11, 2 = 01)
[Figure: two haplotypes per individual and the resulting genotype for the individual]
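The genotype encoding on this slide can be illustrated with a short Python sketch. The function name genotype and the example haplotypes are made up for illustration; only the encoding (0 = 00, 1 = 11, 2 = 01) comes from the slide.

```python
def genotype(h1, h2):
    """Combine two 0/1 haplotypes into a 0/1/2 genotype, site by site.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNP sites")
    # Equal alleles keep their value (00 -> 0, 11 -> 1);
    # different alleles (01 or 10) are encoded as 2.
    return [a if a == b else 2 for a, b in zip(h1, h2)]


# Two made-up haplotypes of one individual over four SNP sites.
g = genotype([0, 1, 1, 0], [0, 1, 0, 1])
print(g)
```

Note that the mapping is not invertible: a 2 in the genotype does not say which copy carries which allele, which is why haplotypes and genotypes are treated as different descriptions.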
Haplotype and Disease Association
– Haplotypes/genotypes define our individuality
– Genetically engineered athletes might win at the Beijing Olympics (Time, 07/2004)
– Haplotypes contribute to risk factors of complex diseases (e.g., diabetes)
– International HapMap project:
  – Disease-causing SNPs are hidden among 10 million SNPs
  – Searching all of them is too expensive
  – HapMap tries to identify 1 million tag SNPs that provide almost as much mapping information as the entire 10 million SNPs
Tagging Reduces Cost
– Decrease SNP haplotyping cost:
  – sequence only a small number of SNPs (the tag SNPs)
  – infer the rest of the (certain) SNPs from the sequenced tag SNPs
– Number of SNPs: m; number of tags: k
– Cost-saving ratio = m / k (infinite population)
– Traditional tagging based on linkage disequilibrium (LD) needs too many tag SNPs, so its cost-saving ratio is too small (≈ 2)
– Proposed linear reduction method: cost-saving ratio ≈ 20
Haplotype Tagging Problem
– Given: the full pattern of all SNPs for a sample
– Find: the minimum number of tag SNPs that allow reconstructing the complete haplotype of each individual
Linear Rank of Recombinations
– Human haplotype evolution =
  – Mutations: introduce SNPs
  – Recombinations: propagate SNPs over the entire population
– Replace notation (0, 1) with (–1, 1)
– Theorem: a haplotype population generated from l haplotypes with recombinations at k spots has linear rank (l – 1)(k + 2)
– This is much less than the number of all haplotypes = l^k
– Conclusion: use only linearly independent SNPs as tags
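The rank statement can be checked on a toy case with the following Python sketch. Everything in it is an assumption made for illustration: the two founder haplotypes, the single recombination spot, the helper names (to_pm1, recombinants, rank), and the recombination model in which each segment between spots is copied from one founder. It only compares the computed rank with the expression (l – 1)(k + 2) from the slide for this one example and proves nothing in general.

```python
from fractions import Fraction
from itertools import product


def to_pm1(h):
    """Map 0/1 alleles to -1/+1, as on the slide."""
    return [2 * v - 1 for v in h]


def recombinants(founders, spots):
    """All haplotypes built by copying each segment from one founder (assumed model)."""
    bounds = [0] + list(spots) + [len(founders[0])]
    segments = list(zip(bounds, bounds[1:]))
    population = set()
    for choice in product(range(len(founders)), repeat=len(segments)):
        h = []
        for f, (lo, hi) in zip(choice, segments):
            h.extend(founders[f][lo:hi])
        population.add(tuple(h))
    return sorted(population)


def rank(rows):
    """Matrix rank by exact Gaussian elimination over the rationals."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0
    for c in range(len(m[0])):
        p = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if p is None:
            continue  # no pivot in this column
        m[r], m[p] = m[p], m[r]
        for i in range(r + 1, len(m)):
            f = m[i][c] / m[r][c]
            m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r


# Made-up example: l = 2 founders, k = 1 recombination spot after site 3.
founders = [[0, 1, 0, 0, 1, 1], [1, 1, 0, 1, 0, 1]]
spots = [3]
population = [to_pm1(h) for h in recombinants(founders, spots)]

l, k = len(founders), len(spots)
print(len(population), rank(population), (l - 1) * (k + 2))
```

Exact rational arithmetic (fractions.Fraction) is used so that the rank is not affected by floating-point round-off; for realistic matrix sizes a numerical library would be the more practical choice.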
Tag SNPs Selection
– Tag selecting algorithm:
  – Using Gauss-Jordan elimination, find the row reduced echelon form (RREF) X of the sample matrix S
  – Extract the basis T of sample S
  – Factorize the sample as S = T X
  – Output the set of tags T
– Fact: in the sample, each SNP is a linear combination of tag SNPs
– Conjecture: in the entire population, each SNP is the same linear combination of tags as in the sample
[Figure: sample S = tags T × RREF X]
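The tag selecting steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (rref, select_tags, matmul) and the toy sample are hypothetical, rows of S are taken to be haplotypes and columns SNPs, the tags are taken to be the pivot columns of the RREF, and X is taken to be its nonzero rows.

```python
from fractions import Fraction


def rref(rows):
    """Gauss-Jordan elimination; returns (reduced matrix, pivot column indices)."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    pivots, r = [], 0
    for c in range(n_cols):
        # Look for a row at or below r with a nonzero entry in column c.
        p = next((i for i in range(r, n_rows) if m[i][c] != 0), None)
        if p is None:
            continue
        m[r], m[p] = m[p], m[r]
        pv = m[r][c]
        m[r] = [v / pv for v in m[r]]  # scale the pivot to 1
        for i in range(n_rows):
            if i != r and m[i][c] != 0:
                f = m[i][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        pivots.append(c)
        r += 1
        if r == n_rows:
            break
    return m, pivots


def select_tags(S):
    """Return (tag column indices, T, X) with S = T X."""
    reduced, tags = rref(S)
    X = reduced[:len(tags)]                       # nonzero rows of the RREF
    T = [[row[c] for c in tags] for row in S]     # tag columns of the sample
    return tags, T, X


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Made-up sample: 4 haplotypes (rows) over 6 SNPs (columns), in -1/+1 notation.
S = [
    [-1, 1, -1, -1, 1, 1],
    [1, 1, -1, 1, -1, 1],
    [-1, 1, -1, 1, -1, 1],
    [1, 1, -1, -1, 1, 1],
]
tags, T, X = select_tags(S)
print(tags)
print(matmul(T, X) == S)
```

In this toy sample the non-tag SNPs are copies or negations of tag SNPs, so each non-tag column of X has a single nonzero entry; real samples generally give denser and fractional columns.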
Haplotype Reconstruction
– Given: tags t of an unknown haplotype h and the RREF X of the sample matrix S
– Find: the unknown haplotype h
– Predict h' = t X
– The prediction may contain errors, since the predicted h' may not equal the unknown haplotype h; we assign –1 if a predicted value is negative and +1 otherwise (RLRP)
– Variant: randomly reshuffle SNPs before choosing tags (RLR)
[Figure: predicted haplotype h' = tags t × RREF X]
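The prediction step h' = t X followed by the sign rule can be sketched in Python. The matrix X and the tag indices TAGS below are hard-coded, made-up values standing in for the output of a tag selection run on a small sample; the function names predict and count_errors are likewise hypothetical.

```python
# Assumed inputs: RREF rows X (3 tags x 6 SNPs) and the tag column indices.
X = [
    [1, 0, 0, 0, 0, 0],
    [0, 1, -1, 0, 0, 1],
    [0, 0, 0, 1, -1, 0],
]
TAGS = [0, 1, 3]


def predict(t, X):
    """Compute h' = t X, then assign -1 to negative values and +1 otherwise."""
    raw = [sum(ti * xi for ti, xi in zip(t, col)) for col in zip(*X)]
    return [-1 if v < 0 else 1 for v in raw]


def count_errors(h, X, tags):
    """Number of SNPs where the prediction from the tags differs from h."""
    t = [h[j] for j in tags]
    return sum(1 for a, b in zip(h, predict(t, X)) if a != b)


# A haplotype that follows the same linear relations as the sample ...
h_consistent = [1, 1, -1, -1, 1, 1]
# ... and one that breaks a relation at the third SNP.
h_deviating = [1, 1, 1, -1, 1, 1]

print(count_errors(h_consistent, X, TAGS))
print(count_errors(h_deviating, X, TAGS))
```

The second haplotype shows where the conjecture about the entire population matters: its tags are identical to those of the first, so any SNP that does not follow the sample's linear combination is necessarily mispredicted.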
Results for Simulated Data
– P = 1000 different haplotypes
– m = 25000 sites
– Sample size = k (number of tag SNPs) = 50, 100, …, 750
– Cost-saving ratio for 2% error: 3.9 for LR and 13 for RLRP
Results for Real Data
– P = 158 different haplotypes (Daly et al.)
– m = 103 sites
– Sample size = k (number of tag SNPs) = 10, 15, 20, …, 90
– Cost-saving ratio for 5% error: 2.1 for LR and 2.8 for RLRP
Tag Separability
– Correlation between the number of zeros for SNPs in RREF X and the number of errors in the prediction column
– Greedy heuristic gives a more separable basis
– For 5% error, cost-saving ratio 2.8 vs 3.3 for RLRP
Conclusions and Future Work
– Our contributions:
  – new SNP tagging problem formulation
  – linear reduction method for SNP tagging
  – enhancement of linear reduction using a separable basis
– Future work:
  – application of tagging to genotype and haplotype disease association
Thank you