National Taiwan University Department of Computer Science and Information Engineering Haplotype Inference Yao-Ting Huang Kun-Mao Chao.

Presentation transcript:

Genetic Variations
Genetic variations in DNA sequences (e.g., insertions, deletions, and mutations) have a major impact on genetic diseases and phenotypic differences.
- All humans share 99% of the same DNA sequence.
- Genetic variations in a coding region may change a codon and thereby alter the amino acid sequence.

Single Nucleotide Polymorphism
A single nucleotide polymorphism (SNP), pronounced "snip," is a genetic variation in which a single nucleotide (i.e., A, T, C, or G) is altered and the change is kept through heredity.
- SNP: a single DNA base variation found in more than 1% of the population.
- Mutation: a single DNA base variation found in less than 1% of the population.
[Figure: SNP example, CTTAGCTT (94%) versus CTTAGTTT (6%); mutation example, CTTAGCTT (99.9%) versus CTTAGTTT (0.1%).]

Mutations and SNPs
[Figure: a timeline from the common ancestor to the present; the genetic variations observed at present are labeled as mutations and SNPs.]

Single Nucleotide Polymorphism
SNPs are the most frequent form of genetic variation.
- 90% of human genetic variations come from SNPs.
- SNPs occur about every 300~600 base pairs.
- Millions of SNPs have been identified (e.g., by HapMap and Perlegen).
SNPs have become the preferred markers for association studies because of their high abundance and the availability of high-throughput SNP genotyping technologies.

Single Nucleotide Polymorphism
A SNP is usually assumed to be a binary variable.
- The probability of a repeat mutation at the same SNP locus is quite small.
- Tri-allelic cases are usually attributed to genotyping errors.
The nucleotide at a SNP locus is called
- a major allele (if its allele frequency is > 50%), or
- a minor allele (if its allele frequency is < 50%).
[Figure: ACTTAGCTT (94%) versus ACTTAGCTC (6%); at the last position, T is the major allele and C is the minor allele.]
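The major/minor distinction above can be illustrated with a small sketch. The function name `classify_alleles` and the list-of-alleles input are assumptions made for illustration; the slides do not prescribe a data format, and they do not say how a 50/50 tie should be labeled.

```python
from collections import Counter


def classify_alleles(column):
    """Return (major, minor, minor_frequency) for one biallelic SNP column.

    `column` holds one observed allele per chromosome. A 50/50 split is
    not covered by the slide's definition; here the tie is broken by the
    order in which the alleles first appear.
    """
    counts = Counter(column)
    if len(counts) != 2:
        raise ValueError("expected exactly two alleles at a biallelic SNP")
    (major, _), (minor, minor_n) = counts.most_common(2)
    return major, minor, minor_n / len(column)


# 94 chromosomes carry T and 6 carry C, matching the slide's example.
column = ["T"] * 94 + ["C"] * 6
major, minor, minor_frequency = classify_alleles(column)
```

With the slide's 94%/6% split, T comes out as the major allele and C as the minor allele.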

Haplotypes
A haplotype is an ordered list of SNP alleles on the same chromosome.
- A haplotype can be treated simply as a binary string, since each SNP is binary.
[Figure: three chromosomes with SNP 1, SNP 2, and SNP 3 marked, and the haplotype read from each.]
Sequence       SNP 1  SNP 2  SNP 3  Haplotype
-ACTTAGCTT-    C      A      T      Haplotype 1
-AATTTGCTC-    A      T      C      Haplotype 2
-ACTTTGCTC-    C      T      C      Haplotype 3
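As a rough sketch of the "binary string" view, the three haplotypes from the figure can be encoded site by site. The convention used here (0 for the more common allele in the sample, 1 for the other) and the name `encode_binary` are assumptions; the slides only state that each SNP is binary.

```python
from collections import Counter


def encode_binary(haplotypes):
    """Encode equal-length haplotype strings as binary strings.

    Assumed convention: at each site the most common allele in the
    sample becomes '0' and the other allele becomes '1'.
    """
    n_sites = len(haplotypes[0])
    majors = []
    for j in range(n_sites):
        counts = Counter(h[j] for h in haplotypes)
        majors.append(counts.most_common(1)[0][0])
    return [
        "".join("0" if h[j] == majors[j] else "1" for j in range(n_sites))
        for h in haplotypes
    ]


# Haplotypes 1-3 read from the figure: CAT, ATC, CTC.
encoded = encode_binary(["CAT", "ATC", "CTC"])
```

Any fixed 0/1 labeling per site would serve equally well; only consistency within a site matters.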

Genotypes
The use of haplotype information has been limited because the human genome is diploid.
- In large sequencing projects, genotypes rather than haplotypes are collected because of cost.
[Figure: haplotype data records the two haplotypes AT and CG across SNP 1 and SNP 2; genotype data records only the allele pair at each locus, A/C at SNP 1 and G/T at SNP 2.]

Problems of Genotypes
Genotypes only tell us the alleles at each SNP locus.
- We do not know how the alleles at different SNP loci are connected.
- Several haplotype pairs may be consistent with the same genotype.
[Figure: the genotype A/C at SNP 1 and G/T at SNP 2 can be explained by the haplotype pair AT and CG, or by the pair AG and CT.]
We do not know which haplotype pair is real.
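The ambiguity can be made concrete by enumerating every haplotype pair that is consistent with a genotype. This is a minimal sketch; representing a genotype as a list of two-letter strings (one per SNP locus) and the name `resolutions` are illustrative assumptions.

```python
from itertools import product


def resolutions(genotype):
    """Return all unordered haplotype pairs compatible with a genotype.

    `genotype` is a list of two-character strings, one per SNP locus,
    giving the two (unordered) alleles observed at that locus.
    """
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        # `choice` picks which allele at each locus goes to the first
        # haplotype; the second haplotype receives the other allele.
        h1 = "".join(site[c] for site, c in zip(genotype, choice))
        h2 = "".join(site[1 - c] for site, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return pairs


ambiguous = resolutions(["AC", "GT"])    # heterozygous at both loci
unambiguous = resolutions(["AA", "TT"])  # homozygous at both loci
```

A genotype with k heterozygous loci (k >= 1) has 2^(k-1) candidate pairs, which is why the phase cannot be read off the genotype alone.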

Research Directions of SNPs and Haplotypes in Recent Years
- Haplotype Inference: Maximum Parsimony, Perfect Phylogeny, Statistical Methods
- Tag SNP Selection: Haplotype block, LD bin, Prediction Accuracy
- SNP Database
- …

Haplotype Inference
The problem of inferring haplotypes from a set of genotypes is called haplotype inference.
- This problem is known to be not only NP-hard but also APX-hard.
Most combinatorial methods use the maximum parsimony model to solve this problem.
- This model assumes that the number of distinct haplotypes in a natural population is small.
- The solution is a minimum set of haplotypes that can explain the given genotypes.

Maximum Parsimony
Find a minimum set of haplotypes to explain the given genotypes.
[Figure: genotype G1 (A/C at SNP 1, G/T at SNP 2) can be resolved by h1 = AT and h2 = CG, or by h3 = AG and h4 = CT. Genotype G2 (A/A at SNP 1, T/T at SNP 2) is resolved by h1 = AT and h1 = AT.]
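For tiny inputs the maximum parsimony objective can be checked by exhaustive search. The sketch below is a brute-force illustration only, not the authors' algorithm; it runs in exponential time, and the function names and the list-of-strings genotype format are assumptions.

```python
from itertools import combinations, product


def resolutions(genotype):
    """All unordered haplotype pairs compatible with one genotype."""
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(site[c] for site, c in zip(genotype, choice))
        h2 = "".join(site[1 - c] for site, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return pairs


def min_haplotype_set(genotypes):
    """Smallest haplotype set that resolves every genotype (brute force)."""
    options = [resolutions(g) for g in genotypes]
    candidates = sorted({h for opts in options for pair in opts for h in pair})
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            chosen = set(subset)
            # A genotype is resolved if both members of at least one of
            # its compatible pairs were chosen.
            if all(any(a in chosen and b in chosen for a, b in opts)
                   for opts in options):
                return chosen
    return set()


g1 = ["AC", "GT"]  # G1 from the figure
g2 = ["AA", "TT"]  # G2 from the figure
solution = min_haplotype_set([g1, g2])
```

For the two genotypes in the figure, choosing AT and CG explains both with two haplotypes, whereas resolving G1 with AG and CT would need AT as a third haplotype for G2.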

Related Work
Statistical methods:
- Niu, et al. (2002) developed a PL-EM algorithm called HAPLOTYPER.
- Stephens and Donnelly (2003) designed an MCMC algorithm based on Gibbs sampling called PHASE.
Combinatorial methods:
- Gusfield (2003) proposed an integer linear programming algorithm.
- Wang and Xu (2003) developed a branch-and-bound algorithm called HAPAR to find the optimal solution.
- Brown and Harrower (2004) proposed a new integer linear formulation of this problem.

Our Results
We formulated this problem as an integer quadratic programming (IQP) problem.
We proposed an iterative semidefinite programming (SDP) relaxation algorithm to solve the IQP problem.
- This algorithm finds an O(log n)-approximate solution.
We implemented this algorithm in MATLAB and compared it with existing methods.
- Huang, Y.-T., Chao, K.-M., and Chen, T., 2005, "An Approximation Algorithm for Haplotype Inference by Maximum Parsimony," Journal of Computational Biology, 12:

Problem Formulation
Input:
- A set of n genotypes and m possible haplotypes.
Output:
- A minimum set of haplotypes that can explain the given genotypes.
[Figure: G1 (A/C at SNP 1, G/T at SNP 2) is resolved by h1 = AT and h2 = CG; G2 (A/A at SNP 1, T/T at SNP 2) is resolved by h1 = AT and h1 = AT; the output set is {h1 = AT, h2 = CG}.]

Integer Quadratic Programming (IQP)
Define x_i as an integer variable with value 1 or -1.
- x_i = 1 if the i-th haplotype is selected.
- x_i = -1 if the i-th haplotype is not selected.
Minimizing the number of selected haplotypes is to minimize the following integer quadratic function:
[Equation not preserved in the transcript.]

Integer Quadratic Programming (IQP)
Each genotype must be resolved by at least one pair of haplotypes.
- For genotype G1, the following integer quadratic function must be satisfied.
[Equation not preserved in the transcript.]
[Figure: G1 (A/C at SNP 1, G/T at SNP 2) is resolved by h1 = AT and h2 = CG, or by h3 = AG and h4 = CT. The worked example supposes that h1 and h2 are selected.]
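Because the slide equations are not preserved in the transcript, the sketch below uses one plausible form that is consistent with the ±1 definition of x_i, and should be read as an assumption rather than the authors' exact formulation. It takes ((1 + x_i) / 2)^2 as the per-haplotype objective term, which is 1 when x_i = 1 and 0 when x_i = -1, and requires the products ((1 + x_j) / 2)((1 + x_k) / 2), summed over the haplotype pairs that can resolve a genotype, to be at least 1.

```python
from itertools import product


def selected_count(x):
    """Assumed objective: each term is 1 if x_i = 1 and 0 if x_i = -1."""
    return sum(((1 + xi) / 2) ** 2 for xi in x)


def resolves(x, pairs):
    """Assumed constraint: at least one resolving pair is fully selected."""
    return sum(((1 + x[j]) / 2) * ((1 + x[k]) / 2) for j, k in pairs) >= 1


# Indices 0..3 stand for h1 = AT, h2 = CG, h3 = AG, h4 = CT.
G1_PAIRS = [(0, 1), (2, 3)]  # G1 is resolved by (h1, h2) or (h3, h4)
G2_PAIRS = [(0, 0)]          # G2 is resolved by (h1, h1)

feasible = [
    x for x in product((-1, 1), repeat=4)
    if resolves(x, G1_PAIRS) and resolves(x, G2_PAIRS)
]
best = min(feasible, key=selected_count)
```

Under this assumed form, selecting h1 and h2 makes the (h1, h2) product equal to 1 · 1 = 1, so the G1 constraint holds, and the exhaustive search over all sixteen assignments picks exactly those two haplotypes.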

Integer Quadratic Programming (IQP)
Maximum parsimony as an IQP:
- Objective function: find a minimum set of haplotypes.
- Constraint functions: resolve all genotypes.
We use the SDP-relaxation technique to solve this IQP problem.

The Flow of the Iterative SDP Relaxation Algorithm
1. Start from the integer quadratic program (NP-hard).
2. Relax the integer constraint to obtain the vector formulation.
3. Reformulate the vector formulation as a semidefinite program (in P).
4. Run an existing SDP solver to obtain the SDP solution.
5. Apply incomplete Cholesky decomposition to obtain the vector solution.
6. Apply randomized rounding to obtain an integral solution.
7. Check whether all genotypes are resolved. If yes, done; if no, repeat this algorithm.
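The rounding step can be sketched with generic random-hyperplane rounding. This is a swapped-in textbook technique and may differ from the rounding the authors use; the reference vector that fixes the meaning of +1, the function name, and the toy vectors are all assumptions for illustration.

```python
import random


def hyperplane_round(vectors, reference, rng):
    """Round unit vectors to +1/-1 with one random hyperplane.

    A vector is mapped to +1 when it falls on the same side of the
    random hyperplane as `reference`, and to -1 otherwise.
    """
    normal = [rng.gauss(0.0, 1.0) for _ in range(len(reference))]

    def side(u):
        return 1 if sum(a * b for a, b in zip(u, normal)) >= 0 else -1

    ref_side = side(reference)
    return [ref_side * side(v) for v in vectors]


rng = random.Random(0)
reference = [1.0, 0.0]
vectors = [
    [1.0, 0.0],   # aligned with the reference: always rounds to +1
    [-1.0, 0.0],  # opposite to the reference: rounds to -1
    [0.0, 1.0],   # orthogonal: +1 or -1 with equal probability
]
rounded = hyperplane_round(vectors, reference, rng)
```

In an iterative scheme like the one on the slide, a rounding of this kind would be followed by the check in step 7, with the procedure repeated while some genotype remains unresolved.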