Published by Meghan Robinson. Modified over 9 years ago.
1
Linear Reduction for Haplotype Inference
Alex Zelikovsky, joint work with Jingwu He
WABI 2004
2
Outline
SNP, haplotypes and genotypes
Haplotype Inference
Linear reduction method
Improvements
Experimental results
Conclusions & future work
3
Human Genome and SNP
Length of the human genome: 3 × 10^9 base pairs
Difference between any two people: 0.1% of the genome = 3 × 10^6 base pairs
Total number of single nucleotide polymorphisms (SNP): 1 × 10^7 base pairs
SNP’s are mostly bi-allelic, e.g.,
–two variants (alleles) out of 4 possible (A,C,T,G) = A/C
–having a nucleotide in a certain position or missing it = A/-
Major allele = more frequent allele = wild type, as opposed to the SNP
Minor allele (snip) frequency should be biologically considerable, e.g., over 1%
There are more SNP’s with lower frequency
4
Haplotype and Disease Association
Deafness inheritance, moral problems
SNP’s contribute to risk factors of complex diseases:
–having a certain SNP increases the chances of having diabetes 10 times
–but the association is too “fragile” for doctors: 3 × 10^-6 vs 30 × 10^-6
–combinations of SNP’s = haplotypes are responsible for diseases
International HapMap project: http://www.hapmap.org
–SNP maps are constructed across the human genome with a density of about one SNP per thousand nucleotides
–HapMap tries to identify 1 million tag SNP’s providing almost as much mapping information as the entire 10 million SNP’s
–Unfortunately, not as much is known about SNP combinations
5
Haplotypes and Genotypes
Diploid organisms = two different “copies” of each chromosome = recombined copies of parents’ chromosomes
Too expensive to examine the two versions of a chromosome separately
Much cheaper to obtain genotype (mixed) data than haplotype (separated) data
Haplotype = description of a single copy (0 = wild type, 1 = minor allele)
Genotype = description of the two copies mixed (0 = 00, 1 = 11, 2 = 01)
[Figure: two haplotypes per individual and the genotype for the individual]
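The 0/1/2 genotype encoding above can be illustrated with a minimal sketch. The function name `genotype` and the sample vectors are illustrative assumptions, not taken from the slides.

```python
def genotype(h1, h2):
    """Mix two 0/1 haplotypes into one 0/1/2 genotype.

    Per site: 0 = both copies 0, 1 = both copies 1, 2 = copies differ.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    return [a if a == b else 2 for a, b in zip(h1, h2)]


# Hypothetical haplotypes of one individual
h1 = [0, 1, 1, 1, 0]
h2 = [1, 1, 0, 1, 0]
print(genotype(h1, h2))  # [2, 1, 2, 1, 0]
```

Note that the genotype does not record which copy carries which allele at a site marked 2, which is exactly the information haplotype inference has to recover.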
6
Haplotype Inference Problem
Haplotype Inference (HI) Problem:
–Given: n genotype vectors (entries 0, 1 or 2)
–Find: n pairs of haplotype vectors, one pair per genotype, explaining the genotypes
For an individual genotype with h heterozygous sites there are 2^(h-1) possible haplotype pairs explaining this genotype
This is hopeless without a genetic model
Parsimonious models minimize the number of haplotypes
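The 2^(h-1) count can be checked by brute force on a small genotype. This is only a sketch for illustration; `explaining_pairs` is a hypothetical helper name, and the enumeration is exponential, so it is not a practical inference method.

```python
from itertools import product


def explaining_pairs(g):
    """List all unordered haplotype pairs (0/1 vectors) explaining a 0/1/2 genotype."""
    het = [i for i, x in enumerate(g) if x == 2]
    if not het:
        h = tuple(g)
        return [(h, h)]  # homozygous everywhere: a single pair of equal haplotypes
    first, rest = het[0], het[1:]
    pairs = []
    for bits in product((0, 1), repeat=len(rest)):
        h1, h2 = list(g), list(g)
        # Fix the orientation at the first heterozygous site so that
        # each unordered pair is produced exactly once.
        h1[first], h2[first] = 0, 1
        for i, b in zip(rest, bits):
            h1[i], h2[i] = b, 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


g = [2, 1, 2, 0, 2]              # h = 3 heterozygous sites
print(len(explaining_pairs(g)))  # 4
```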
7
Computational Haplotype Inference Problem
Assumptions:
–small number of repeated mutations
–small number of recombinations
If the data allow, explain them with mutations only (perfect phylogeny)
This is possible when there are no 4-gamete rule violations:
–for any pair of SNP’s at most 3 combinations out of 4 (00/01/10/11) are present
Fastest implemented algorithm: DPPH
Known programs for general data (with possible 4-gamete rule violations):
–PHASE, HAPLOTYPER, HAP, set-cover based, etc.
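As a small illustration of the 4-gamete rule, the sketch below checks every pair of sites in a set of already known 0/1 haplotypes. The function name and the sample data are assumptions for illustration; tools such as DPPH work on genotypes, which requires a more involved test than this one.

```python
from itertools import combinations


def four_gamete_violations(haplotypes):
    """Return all site pairs (i, j) at which all four combinations 00/01/10/11 occur."""
    n_sites = len(haplotypes[0])
    bad = []
    for i, j in combinations(range(n_sites), 2):
        seen = {(h[i], h[j]) for h in haplotypes}
        if len(seen) == 4:
            bad.append((i, j))
    return bad


# Hypothetical haplotype set: sites 0 and 1 show all four gametes
H = [(0, 0, 1), (0, 1, 0), (1, 0, 0), (1, 1, 0)]
print(four_gamete_violations(H))  # [(0, 1)]
```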
8
Reducing the Set of SNP’s
Often many columns corresponding to SNP sites are analogous: one column can be obtained from another by swapping 0’s and 1’s
One of such columns can be dropped, same as for two equal columns
What would be a generalization?
–If one site is “dependent” on (can be reconstructed from) k other sites, then drop this dependent site: it does not carry any useful additional information
General reduction method:
–Encoding: reduce the number of sites by removing dependent sites
–Infer site-reduced haplotypes for the site-reduced genotypes using a known haplotype inference method
–Decoding: reconstruct the dependent SNP’s from the sites of the reduced haplotypes
Main requirement for the reduction method: it should be fast
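The simplest case of the reduction, dropping equal and analogous columns, might look like the sketch below. It assumes a 0/1/2 genotype matrix in which the heterozygous code 2 is left unchanged by the swap; the function name and the bookkeeping format are illustrative choices, not part of the slides.

```python
def drop_analogous_columns(G):
    """Keep one representative of every group of equal or 0/1-swapped columns.

    Returns (kept, source): kept lists the retained column indices, and
    source[j] = (k, swapped) says dropped column j equals kept column k,
    with 0's and 1's swapped if swapped is True.
    """
    swap = {0: 1, 1: 0, 2: 2}   # assumption: heterozygous sites stay 2
    kept, source, first_seen = [], {}, {}
    for j in range(len(G[0])):
        col = tuple(row[j] for row in G)
        flipped = tuple(swap[x] for x in col)
        if col in first_seen:
            source[j] = (first_seen[col], False)
        elif flipped in first_seen:
            source[j] = (first_seen[flipped], True)
        else:
            first_seen[col] = j
            kept.append(j)
    return kept, source


# Hypothetical genotype matrix: column 1 is column 0 swapped, column 2 equals column 0
G = [[0, 1, 0, 2],
     [1, 0, 1, 2],
     [2, 2, 2, 0]]
kept, source = drop_analogous_columns(G)
print(kept)    # [0, 3]
print(source)  # {1: (0, True), 2: (0, False)}
```

After haplotypes are inferred on the kept columns, each dropped column would be restored by copying (or swapping) the corresponding kept column.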
9
Linear Dependence of SNP’s
Consider linear dependence:
–To make analogous sites linearly dependent, change notation: 0/1 → -1/1
–Also for genotypes: 0/1/2 → -1/1/0, so a genotype is the half-sum of its two explaining haplotypes (and thus linearly dependent on them)
Keep only linearly independent SNP’s (tag SNP’s); all other SNP’s can be reconstructed using linear combinations
Equivalent factorization problem: find a representation G = I_X × H
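A short sketch of the change of notation, showing that after recoding a genotype is the half-sum of its two haplotypes and that an analogous site becomes the negated site. Function names and sample vectors are illustrative assumptions.

```python
def recode_haplotype(h):
    """0/1 -> -1/1."""
    return [2 * x - 1 for x in h]


def recode_genotype(g):
    """0/1/2 -> -1/1/0 (heterozygous sites become 0)."""
    return [{0: -1, 1: 1, 2: 0}[x] for x in g]


def half_sum(h1, h2):
    """Half-sum of two -1/1 haplotypes; every entry is -1, 0 or 1."""
    return [(a + b) // 2 for a, b in zip(h1, h2)]


h1, h2 = [0, 1, 1, 1, 0], [1, 1, 0, 1, 0]
g = [2, 1, 2, 1, 0]   # genotype of h1 and h2 in 0/1/2 notation

lhs = recode_genotype(g)
rhs = half_sum(recode_haplotype(h1), recode_haplotype(h2))
print(lhs == rhs)     # True

# Swapping 0's and 1's at a site becomes multiplication by -1
swapped = [1 - x for x in h1]
print(recode_haplotype(swapped) == [-x for x in recode_haplotype(h1)])  # True
```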
10
Factorization Problem
Factorization problem:
–Given a 0/1/-1 genotype matrix G
–Find a representation G = I_X × H, where I_X = graph incidence matrix (exactly two 1’s in each row) and H = -1/1 haplotype matrix
Solution:
–Factorize G = T × (E_T | C), where T = tags = basis of columns of G
–Solve the factorization for T: T = I_X × H’
–Finally G = (I_X × H’) × (E_T | C) = I_X × (H’ × (E_T | C)) = I_X × H
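The solution steps above can be sketched with exact Gaussian elimination. This is a minimal illustration, not the authors' encoding and decoding algorithms: the function names, the left-to-right choice of tag columns and the sample matrix are assumptions. The reduced haplotypes passed to `decode` stand in for the output of an external haplotype inference tool.

```python
from fractions import Fraction


def column_basis(G):
    """Pick tag columns of G left to right and express every column through them.

    Returns (tags, coeffs) with column j of G equal to
    sum over k of coeffs[j][k] * (column tags[k] of G).
    """
    rows = [[Fraction(x) for x in r] for r in G]
    n, m = len(rows), len(rows[0])
    tags, r = [], 0
    for c in range(m):
        p = next((i for i in range(r, n) if rows[i][c] != 0), None)
        if p is None:
            continue                      # column c depends on earlier tags
        rows[r], rows[p] = rows[p], rows[r]
        pv = rows[r][c]
        rows[r] = [x / pv for x in rows[r]]
        for i in range(n):
            if i != r and rows[i][c] != 0:
                f = rows[i][c]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[r])]
        tags.append(c)
        r += 1
        if r == n:
            break
    # In reduced row echelon form, row k holds the coefficient of tag k
    coeffs = {j: [rows[k][j] for k in range(len(tags))] for j in range(m)}
    return tags, coeffs


def decode(H_reduced, coeffs, n_sites):
    """Rebuild all sites of each reduced haplotype: H = H' x (E_T | C)."""
    return [[sum(c * h[k] for k, c in enumerate(coeffs[j])) for j in range(n_sites)]
            for h in H_reduced]


# Hypothetical -1/1/0 genotype matrix; column 3 = -(column 0 + column 1 + column 2)
G = [[1, 0, 0, -1],
     [-1, 0, 1, 0],
     [0, 1, 0, -1]]
tags, coeffs = column_basis(G)
print(tags)  # [0, 1, 2]

# Reduced haplotypes explaining the first reduced genotype (1, 0, 0)
H_reduced = [(1, 1, -1), (1, -1, 1)]
H = decode(H_reduced, coeffs, len(G[0]))
print(all(x in (-1, 1) for h in H for x in h))  # True
```

A decoded entry outside -1/1 would mean the linear combination does not yield a valid haplotype for that vertex, which is the situation the graph-based decoding slide deals with.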
11
Linear Encoding Algorithm
12
Linear Decoding Algorithm
13
Graph-Based Decoding
Extend the haplotype graph X_r obtained from the HI algorithm to X_m for all m sites
Very often the graphs X_r and X_m are isomorphic, but not always
Consider an example:
–g1 = (1, 0, 1) and g2 = (0, -1, -1)
–reduced set = (1, 0) and (0, -1)
The corresponding reduced haplotype graph has 3 vertices, while X_m has 4 vertices
The simple way is to split the vertices if we find an error
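The vertex counts in this example can be reproduced by enumerating the explaining haplotype pairs directly. The sketch below is only a check of the example in -1/1/0 notation, with hypothetical helper names; it is not the graph-based decoding procedure itself.

```python
from itertools import product


def pairs(g):
    """All unordered pairs of -1/1 haplotypes whose half-sum is the -1/1/0 genotype g."""
    het = [i for i, x in enumerate(g) if x == 0]
    out = set()
    for signs in product((1, -1), repeat=len(het)):
        h1, h2 = list(g), list(g)
        for i, s in zip(het, signs):
            h1[i], h2[i] = s, -s
        out.add(frozenset([tuple(h1), tuple(h2)]))
    return out


def vertices(genotypes):
    """Haplotypes (graph vertices) used when every genotype has a unique explaining pair."""
    vs = set()
    for g in genotypes:
        (pair,) = pairs(g)   # each genotype here has a single heterozygous site
        vs |= pair
    return vs


reduced = [(1, 0), (0, -1)]
full = [(1, 0, 1), (0, -1, -1)]
print(len(vertices(reduced)))  # 3
print(len(vertices(full)))     # 4
```

In the reduced graph both genotypes share the haplotype (1, -1); over all three sites it corresponds to two different haplotypes, (1, -1, 1) and (1, -1, -1), so that vertex has to be split.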
14
Handling Imperfect Phylogeny
The genotype data may show inconsistency with the perfect phylogeny model, i.e. 4-gamete rule violations
We could choose h independent columns without such violations
The algorithm proceeds in a greedy manner
15
Experimental Results
In Table 1, our results show that the runtime advantage of Linearly Reduced (LR) DPPH grows fast with test case size and reaches a factor of 60 for the largest instances.
In all test cases, if DPPH finds a unique solution, so does LR DPPH, and the solution is identical.
In Tables 2 and 3, the running time is drastically reduced compared to the original PHASE, while the measured quality value is not larger.
In Tables 4 and 5, we see the same advantage when using Linearly Reduced HAPLOTYPER instead of the original HAPLOTYPER.
The last two data sets are real data: Drosophila haplotypes and a human chromosome.
16
Experimental Results
17
Experimental Results
18
Conclusions and Future Work
Our method significantly speeds up popular haplotype inference tools such as DPPH, HAPLOTYPER and PHASE in all cases without compromising the quality
We even reach runs 50 times faster than DPPH
Future work includes implementing the algorithm for handling imperfect phylogeny
We are going to investigate an application of the suggested linear reduction to finding a small number of representative sites sufficient to distinguish all haplotypes