Linear Reduction for Haplotype Inference Alex Zelikovsky joint work with Jingwu He WABI 2004.


Presentation transcript:

Linear Reduction for Haplotype Inference Alex Zelikovsky joint work with Jingwu He WABI 2004

Outline
– SNP, haplotypes and genotypes
– Haplotype Inference
– Linear reduction method
– Improvements
– Experimental results
– Conclusions & future work

Human Genome and SNP
Length of the human genome ≈ 3 × 10^9 base pairs
Difference between any two people ≈ 0.1% of the genome ≈ 3 × 10^6 base pairs
Total number of single nucleotide polymorphisms (SNPs) ≈ 1 × 10^7 base pairs
SNPs are mostly bi-allelic, e.g.,
– two variants (alleles) out of 4 possible (A, C, T, G) = A/C
– having a nucleotide in a certain position or missing it = A/-
Major allele = more frequent allele = wild type, vs. SNP
Minor allele (snip) frequency should be biologically considerable, e.g., over 1%
There are more SNPs at lower frequencies

Haplotype and Disease Association
Deafness inheritance → moral problems
SNPs contribute to risk factors of complex diseases:
– having a certain SNP increases the chance of having diabetes 10 times
– but the association is too “fragile” for doctors: 3 … 30
– combinations of SNPs = haplotypes are responsible for diseases
International HapMap project:
– SNP maps are constructed across the human genome with a density of about one SNP per thousand nucleotides
– HapMap tries to identify 1 million tag SNPs providing almost as much mapping information as the entire 10 million SNPs
– Unfortunately, not as much is known about SNP combinations

Haplotypes and Genotypes
Diploid organisms = two different “copies” of each chromosome = recombined copies of parents’ chromosomes
Too expensive to examine the two versions of a chromosome separately
Much cheaper to obtain genotype (mixed) data than haplotype (separated) data
Haplotype = description of a single copy (0 = wild type, 1 = minor allele)
Genotype = description of the two copies mixed (0 = 00, 1 = 11, 2 = 01)
[Figure: two haplotypes per individual and the genotype for the individual]
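
The 0/1/2 genotype coding above can be illustrated with a few lines of Python. This is only a sketch of the coding rule stated on the slide; the function name `genotype` and the sample vectors are made up for illustration.

```python
def genotype(h1, h2):
    """Mix two 0/1 haplotypes into a 0/1/2 genotype.

    Sites where both copies agree keep that value (0 = 00, 1 = 11);
    sites where they differ are heterozygous and coded 2 (2 = 01).
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same sites")
    return [a if a == b else 2 for a, b in zip(h1, h2)]


# Hypothetical individual with four SNP sites.
h1 = [0, 1, 0, 1]
h2 = [0, 1, 1, 0]
g = genotype(h1, h2)
print(g)  # [0, 1, 2, 2]
```

The order of the two haplotypes does not matter, which is exactly the information a genotype loses.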

Haplotype Inference Problem
Haplotype Inference (HI) Problem:
– Given: n genotype vectors (entries 0, 1 or 2)
– Find: n pairs of haplotype vectors, one pair per genotype, explaining the genotypes
For an individual genotype with h heterozygous sites there are 2^(h-1) possible haplotype pairs explaining it
This is hopeless without a genetic model
Parsimonious models → minimize the number of haplotypes
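
The count of 2^(h-1) explaining pairs can be checked by brute force. The sketch below enumerates unordered pairs for one genotype in the 0/1/2 coding; `explaining_pairs` is a hypothetical helper written for this illustration, not part of any of the tools named in the talk.

```python
from itertools import product


def explaining_pairs(g):
    """Yield every unordered pair of 0/1 haplotypes that explains genotype g."""
    het = [i for i, x in enumerate(g) if x == 2]
    if not het:
        h = tuple(g)
        yield h, h
        return
    first, rest = het[0], het[1:]
    for bits in product((0, 1), repeat=len(rest)):
        h1, h2 = list(g), list(g)
        # Fixing the first heterozygous site avoids listing (a, b) and (b, a) twice.
        h1[first], h2[first] = 0, 1
        for i, b in zip(rest, bits):
            h1[i], h2[i] = b, 1 - b
        yield tuple(h1), tuple(h2)


pairs = list(explaining_pairs([2, 0, 2, 1, 2]))  # h = 3 heterozygous sites
print(len(pairs))  # 4, i.e. 2**(3 - 1)
```

The number of candidates doubles with every additional heterozygous site, which is why a genetic model is needed to choose among them.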

Computational Haplotype Inference Problem
Assumptions:
– small number of repeated mutations
– small number of recombinations
If the data allow, explain them with mutations only (perfect phylogeny)
This is possible when there are no 4-gamete rule violations:
– for any pair of SNPs, at most 3 of the 4 combinations (00/01/10/11) are present
Fastest implemented algorithm: DPPH
Known programs for general data (with possible 4-gamete rule violations):
– PHASE, HAPLOTYPER, HAP, set-cover based, etc.
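
A minimal sketch of the 4-gamete rule, assuming the haplotypes are already known and coded 0/1. It checks each pair of sites for all four combinations; it is not how DPPH or the other named programs work internally, and the sample data are invented.

```python
from itertools import combinations


def four_gamete_violations(haplotypes):
    """Return the site pairs (i, j) at which all four of 00/01/10/11 occur."""
    n_sites = len(haplotypes[0])
    violations = []
    for i, j in combinations(range(n_sites), 2):
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:
            violations.append((i, j))
    return violations


# Hypothetical haplotype sets over three SNP sites.
consistent = [(0, 0, 0), (0, 1, 0), (1, 1, 0), (1, 1, 1)]
conflicting = consistent + [(1, 0, 0)]

print(four_gamete_violations(consistent))   # []
print(four_gamete_violations(conflicting))  # [(0, 1)]
```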

Reducing the Set of SNPs
Often many columns corresponding to SNP sites are analogous: one column can be obtained from another by swapping 0’s and 1’s
One of such columns can be dropped, the same as for two equal columns
What would be the generalization?
– If one site is “dependent” on (can be reconstructed from) k other sites, then drop this dependent site: it does not carry any useful additional information
General reduction method:
– Encoding: reduce the number of sites by removing dependent sites
– Infer site-reduced haplotypes for the site-reduced genotypes using a known haplotype inference method
– Decoding: reconstruct dependent SNPs from the sites of the reduced haplotypes
Main requirement for the reduction method: it should be fast
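
The simplest case of the reduction, dropping equal and analogous columns, might look as follows. This is a sketch under the assumption that genotypes use the 0/1/2 coding, so swapping 0's and 1's leaves heterozygous 2's unchanged; the function name, the `recipe` structure and the sample matrix are illustrative only.

```python
def drop_analogous_columns(rows):
    """Keep one representative of each group of equal or 0/1-swapped columns.

    Returns (kept, recipe): kept lists the retained site indices, and
    recipe[j] = (k, flipped) says that dropped site j equals kept site k,
    with 0 and 1 swapped when flipped is True.
    """
    swap = {0: 1, 1: 0, 2: 2}
    seen = {}    # column contents -> index of the retained column
    kept = []
    recipe = {}
    for j in range(len(rows[0])):
        col = tuple(r[j] for r in rows)
        flipped = tuple(swap[x] for x in col)
        if col in seen:
            recipe[j] = (seen[col], False)
        elif flipped in seen:
            recipe[j] = (seen[flipped], True)
        else:
            seen[col] = j
            kept.append(j)
    return kept, recipe


# Hypothetical genotype matrix: 3 individuals, 4 SNP sites.
G = [
    [0, 1, 0, 2],
    [1, 0, 1, 2],
    [2, 2, 2, 0],
]
kept, recipe = drop_analogous_columns(G)
print(kept)    # [0, 3]
print(recipe)  # {1: (0, True), 2: (0, False)}
```

The recipe is what a decoding step would use to restore the dropped sites after haplotypes have been inferred on the kept sites.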

Linear Dependence of SNPs
Consider linear dependence:
– To make analogous sites linearly dependent, change notation: 0/1 → -1/1
– Also for genotypes 0/1/2 → -1/1/0, so that a genotype is the half-sum of its two explaining haplotypes (and thus linearly dependent on them)
Keep only linearly independent SNPs (tag SNPs); all other SNPs can be reconstructed using linear combinations
Equivalent factorization problem: find a representation G = I_X × H
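
The effect of the change of notation can be verified on a small example. The sketch below recodes haplotypes to -1/1 and genotypes to -1/1/0, then checks the half-sum property and that a 0/1-swapped column becomes the negation of the original; all names and vectors are chosen for illustration.

```python
def recode_haplotype(h):
    """0/1 -> -1/1."""
    return [2 * x - 1 for x in h]


def recode_genotype(g):
    """0/1/2 -> -1/1/0 (heterozygous sites become 0)."""
    table = {0: -1, 1: 1, 2: 0}
    return [table[x] for x in g]


# Hypothetical individual: two haplotypes and the genotype they explain.
h1, h2 = [0, 1, 0, 1], [0, 1, 1, 0]
g = [0, 1, 2, 2]

r1, r2 = recode_haplotype(h1), recode_haplotype(h2)
half_sum = [(a + b) // 2 for a, b in zip(r1, r2)]
print(half_sum)            # [-1, 1, 0, 0]
print(recode_genotype(g))  # [-1, 1, 0, 0]

# An analogous column (0's and 1's swapped) turns into -1 times the original.
column = [0, 1, 1]
swapped = [1 - x for x in column]
print(recode_haplotype(swapped) == [-x for x in recode_haplotype(column)])  # True
```

Since each genotype row is a half-sum of two haplotype rows, any linear relation among haplotype columns carries over to the genotype columns.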

Factorization Problem
Factorization problem:
– Given a 0/1/-1 genotype matrix G
– Find a representation G = I_X × H, where I_X = graph incidence matrix (exactly two 1’s in each row) and H = -1/1 haplotype matrix
Solution:
– Factorize G = T × (E_T | C), where T = tags = basis of the columns of G
– Solve the factorization for T: T = I_X × H’
– Finally, G = (I_X × H’) × (E_T | C) = I_X × (H’ × (E_T | C)) = I_X × H
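
One way to obtain the tag columns and the coefficient matrix is Gaussian elimination over the rationals. The sketch below is written under that assumption and is not the authors' encoding algorithm. The matrix `M` plays the role of the second factor in G = T × M, with unit columns at the tag positions. The reduced haplotypes `H_reduced` stand in for the output of some haplotype inference tool and are assumed, not computed.

```python
from fractions import Fraction


def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def factorize(G):
    """Return (tags, T, M) with G == T x M, where T holds the tag columns of G.

    Columns are scanned left to right; a column becomes a tag when it is
    linearly independent of the tags chosen so far (exact arithmetic).
    """
    n, m = len(G), len(G[0])
    tags = []           # indices of the tag columns
    echelon = []        # (pivot row, reduced vector, its coefficients over the tags)
    column_coeffs = []  # per column: coefficients over the tags
    for j in range(m):
        v = [Fraction(G[i][j]) for i in range(n)]
        c = {}          # invariant: column j == v + sum(c[k] * tag column k)
        for pivot, vec, comb in echelon:
            f = v[pivot]
            if f != 0:
                v = [a - f * b for a, b in zip(v, vec)]
                for k, x in comb.items():
                    c[k] = c.get(k, 0) + f * x
        if any(v):      # nonzero remainder: column j is independent, so it is a tag
            k = len(tags)
            tags.append(j)
            pivot = next(i for i, a in enumerate(v) if a != 0)
            p = v[pivot]
            comb = {kk: -x / p for kk, x in c.items()}
            comb[k] = Fraction(1) / p
            echelon.append((pivot, [a / p for a in v], comb))
            column_coeffs.append({k: Fraction(1)})
        else:           # dependent column: c expresses it through the tags
            column_coeffs.append(c)
    T = [[G[i][j] for j in tags] for i in range(n)]
    M = [[column_coeffs[j].get(k, 0) for j in range(m)] for k in range(len(tags))]
    return tags, T, M


# Hypothetical genotype matrix in -1/1/0 coding; site 2 is the negation of site 0.
G = [[1, 0, -1],
     [0, -1, 0],
     [0, 0, 0]]
tags, T, M = factorize(G)
print(tags)  # [0, 1]

# Assumed reduced haplotypes on the tag sites, decoded to all sites via M.
H_reduced = [[1, 1], [1, -1], [-1, -1]]
H = [[int(x) for x in row] for row in matmul(H_reduced, M)]
print(H)  # [[1, 1, -1], [1, -1, -1], [-1, -1, 1]]
```

In this example the decoded rows are valid -1/1 haplotypes. Nothing in the sketch guarantees that in general, so a real decoder would have to check the entries.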

Linear Encoding Algorithm

Linear Decoding Algorithm

Graph-Based Decoding
Extend the haplotype graph X_r obtained from the HI algorithm to X_m for all m sites
Very often the graphs X_r and X_m are isomorphic, but not always
Consider an example:
– g1 = (1, 0, 1) and g2 = (0, -1, -1)
– reduced set = (1, 0) and (0, -1)
The corresponding reduced haplotype graph has 3 vertices, while X_m has 4 vertices
The simple way is to split the vertices if we find an error
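
The vertex counts in the example can be reproduced with a short sketch. It assumes the first two sites are the tags and that each reduced genotype is phased in the only possible way, since each has a single heterozygous site. The helper `extend` only handles dependent sites that are homozygous in the full genotype, which is all this example needs; it is not a general decoder.

```python
# Genotypes in -1/1/0 coding; sites 0 and 1 are assumed to be the tags.
g_full = {"g1": (1, 0, 1), "g2": (0, -1, -1)}
tags = (0, 1)

# Reduced haplotype pairs on the tag sites.
reduced_pairs = {
    "g1": ((1, 1), (1, -1)),
    "g2": ((1, -1), (-1, -1)),
}


def vertices(pairs):
    """Distinct haplotypes (graph vertices) used by all genotypes."""
    return {h for pair in pairs.values() for h in pair}


def extend(h_reduced, g, tags):
    """Extend a reduced haplotype to all sites of genotype g.

    Non-tag sites copy the genotype value, which is correct only when the
    site is homozygous (-1 or 1) in g.
    """
    full = list(g)
    for site, value in zip(tags, h_reduced):
        full[site] = value
    if 0 in full:
        raise ValueError("heterozygous dependent site: phase is not determined here")
    return tuple(full)


full_pairs = {
    name: tuple(extend(h, g_full[name], tags) for h in pair)
    for name, pair in reduced_pairs.items()
}

print(len(vertices(reduced_pairs)))  # 3
print(len(vertices(full_pairs)))     # 4
```

The reduced haplotype (1, -1) is shared by both genotypes but must take the value 1 at the third site for g1 and -1 for g2, so the shared vertex is split in two.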

Handling Imperfect Phylogeny
The genotype data may show inconsistency with the perfect phylogeny model, i.e., 4-gamete rule violations
We could choose h independent columns without such violations
The algorithm proceeds in a greedy manner

Experimental Results
Table 1 shows that the runtime advantage of Linearly Reduced (LR) DPPH grows fast with test case size and reaches a factor of 60 for the largest instances. In all test cases, if DPPH finds a unique solution, so does LR DPPH, and the solution is identical.
Tables 2 and 3 show that the running time is drastically reduced compared to the original PHASE, while the measured quality value is not larger.
Tables 4 and 5 show the same advantage for Linearly Reduced HAPLOTYPER over the original HAPLOTYPER.
The last two data sets are real data: Drosophila haplotypes and a human chromosome.

Experimental Results

Experimental Results

Conclusions and Future Work
Our method significantly speeds up popular haplotype inference tools such as DPPH, HAPLOTYPER and PHASE in all cases without compromising quality. We even reach a 50× speedup over DPPH.
Future work includes implementing the algorithm for handling imperfect phylogeny. We are going to investigate an application of the suggested linear reduction to finding a small number of representative sites sufficient to distinguish all haplotypes.