International Workshop on Bioinformatics Research and Applications, May 2005 Phasing and Missing data recovery in Family Trios D. Brinza J. He W. Mao A.

Presentation transcript:

International Workshop on Bioinformatics Research and Applications, May 2005 Phasing and Missing data recovery in Family Trios D. Brinza J. He W. Mao A. Zelikovsky CS Department

Overview
SNP, Genotypes and Haplotypes
Phasing & Missing Data Recovery for Trios
Family trios & trio constraints
ILP for Pure Parsimony
Trio phasing without recombinations

SNP, Genotypes and Haplotypes
Length of the human genome ≈ 3 × 10^9
Number of single nucleotide polymorphisms (SNPs) ≈ 1 × 10^7
SNPs are mostly biallelic, e.g., A ↔ C
Minor allele frequency should be considerable, e.g., >0.1%
Difference between ALL people ≈ 0.25% (between any 2 ≈ 0.1%)
Diploid = two different copies of each chromosome
Haplotype = description of a single copy (expensive)
–example: (0 is for major, 1 is for minor allele)
Genotype = description of the mixed two copies
–example (0=00, 1=11, 2=01)
International HapMap project:

Population Phasing Problem
Given an n × m genotype matrix G
–n genotype rows with m SNP columns
Find a 2n × m haplotype matrix H
–2n haplotype rows with m SNP columns
–each genotype g is explained by two haplotypes h1, h2
h1 = h2 = g =
Remarks:
–For an individual with k heterozygous sites (2's), any of 2^(k-1) haplotype pairs is a possible solution
–This is hopeless without a genetic model
–Programs: PHASE, HAPLOTYPER, HAP, GERBIL, DPPH, etc.
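The 2^(k-1) count in the remark can be checked on small inputs. The following is a minimal sketch, not the authors' code: it assumes genotypes and haplotypes are stored as strings over 0/1/2 and 0/1 respectively, and the function names `explains` and `phasings` are hypothetical.

```python
from itertools import product


def explains(h1, h2, g):
    """True if haplotypes h1, h2 (strings over 0/1) explain genotype g (string over 0/1/2)."""
    if not (len(h1) == len(h2) == len(g)):
        return False
    # Equal alleles give a homozygous site (0 or 1); different alleles give 2.
    return all(s == (a if a == b else "2") for a, b, s in zip(h1, h2, g))


def phasings(g):
    """All unordered haplotype pairs that explain genotype g (brute force)."""
    het = [i for i, s in enumerate(g) if s == "2"]
    pairs = set()
    for bits in product("01", repeat=len(het)):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, bits):
            h1[i] = b
            h2[i] = "1" if b == "0" else "0"
        pairs.add(tuple(sorted(("".join(h1), "".join(h2)))))
    return sorted(pairs)


if __name__ == "__main__":
    for pair in phasings("0212"):
        print(pair)
```

For a genotype with k = 4 heterozygous sites the sketch returns 8 = 2^(4-1) pairs, which is why a single unrelated genotype cannot be phased without a genetic model.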

Family Trios & Trio Constraints
Genotype data commonly come in family trios consisting of two parents and one offspring
Trio data allow offspring haplotypes to be recovered with higher confidence
Haplotype reconstruction should satisfy the trio constraints
Example:
–If the genotypes are f=22 m=02 k=01
–Then the haplotypes are f1=10 m1=01 k1=01 f2=01 m2=00 k2=01
The ambiguity remains only where father, mother, and offspring are all heterozygous (e.g., f=m=k=22)
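The trio constraints can be illustrated site by site. The sketch below is an assumption-laden illustration rather than the method from the slides: it ignores recombination and missing values, the names `site_options` and `phase_trio` are hypothetical, and it orders each quartet so that h1 and h3 are the haplotypes passed to the offspring (so h1 corresponds to f2 and h3 to m1 in the example above).

```python
from itertools import product


def site_options(f, m, k):
    """Allele tuples (h1, h2, h3, h4) consistent with one site of a trio.

    h1, h2 are the father's alleles, h3, h4 the mother's;
    h1 and h3 are the alleles inherited by the offspring.
    """
    def geno(a, b):
        return str(a) if a == b else "2"

    return [
        q for q in product((0, 1), repeat=4)
        if geno(q[0], q[1]) == f and geno(q[2], q[3]) == m and geno(q[0], q[2]) == k
    ]


def phase_trio(f, m, k):
    """Return (h1, h2, h3, h4) if every site is resolved uniquely, else None."""
    columns = []
    for fs, ms, ks in zip(f, m, k):
        opts = site_options(fs, ms, ks)
        if len(opts) != 1:
            return None  # ambiguous or Mendelian-inconsistent site
        columns.append(opts[0])
    return tuple("".join(str(col[i]) for col in columns) for i in range(4))


if __name__ == "__main__":
    print(phase_trio("22", "02", "01"))
    ambiguous = [t for t in product("012", repeat=3) if len(site_options(*t)) > 1]
    print(ambiguous)
```

Enumerating all 27 single-site combinations confirms that the only one with more than one consistent assignment is the site where all three genotypes are 2.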

Family Trio Phasing
Parental Trio Phasing Problem
–Given a set of genotypes partitioned into family trios
–Find for each trio a quartet of parental haplotypes that agree with all three genotypes:
Parental haplotypes agree with the parental genotypes
Inherited parental haplotypes agree with the offspring genotype
General Trio Phasing Problem
–Find (additionally) for each offspring the "true" recombination of the inherited parental haplotypes

ILP for Parental Trio Phasing
Introduce four template haplotypes over {0,1,2,?}
Variables:
–x: one for each possible haplotype
–y: one for each 2
Objective:
Constraints:

Results

Trio Phasing w/o Crossovers
Three phasing methods on the real and simulated data sets
Error = % of sites where (the best choice of) inherited paternal and maternal haplotypes disagree with the offspring genotype
D = Hamming distance in % between the phased haplotypes and the closest feasible haplotypes
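One possible reading of the Error measure is sketched below. It is an interpretation, not the authors' evaluation script: it assumes no recombination, no missing values, and that "best choice" means trying all four combinations of one paternal and one maternal haplotype. The name `trio_error_percent` is hypothetical.

```python
from itertools import product


def implied_genotype(h_a, h_b):
    """Genotype (string over 0/1/2) produced by combining two haplotypes."""
    return "".join(a if a == b else "2" for a, b in zip(h_a, h_b))


def trio_error_percent(father_haps, mother_haps, child):
    """Percent of sites where the best inherited pair disagrees with the child genotype.

    father_haps and mother_haps are pairs of phased haplotypes;
    all four (paternal, maternal) combinations are tried.
    """
    best = min(
        sum(g != c for g, c in zip(implied_genotype(fh, mh), child))
        for fh, mh in product(father_haps, mother_haps)
    )
    return 100.0 * best / len(child)


if __name__ == "__main__":
    print(trio_error_percent(("10", "01"), ("01", "00"), "01"))
    print(trio_error_percent(("11", "00"), ("01", "00"), "01"))
```

A phasing that satisfies the trio constraints scores 0; a parental phasing from which no pair reproduces the offspring genotype scores above 0.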

Trio Phasing w/o Crossovers
[Figure labels: parent/offspring-feasible phasings; trio-feasible phasings; PHASE; pure parsimonious = no recombinations; random; projections = closest trio-feasible]

Missing Data Recovery Problem
Real data often miss some SNPs
–Daly et al. data (Crohn's disease): 10%-16%
–Gabriel et al. data (HapMap): 7%-10%
How to reconstruct missing values?
How to verify a reconstruction method?
–Scramble an extra 10% and reconstruct them
Halperin and Karp (2004) report an error rate of 2.8%
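The "scramble an extra 10%" check can be sketched as a small harness. This is a hedged illustration only: the helper names are hypothetical, the genotypes are assumed to be strings over 0/1/2/?, and the column-majority filler is a stand-in baseline, not any of the recovery methods compared in the slides.

```python
import random
from collections import Counter


def mask_entries(genotypes, fraction, seed=0):
    """Hide a random fraction of the known entries; return (masked rows, hidden positions)."""
    rng = random.Random(seed)
    known = [(r, c) for r, row in enumerate(genotypes)
             for c, s in enumerate(row) if s != "?"]
    hidden = rng.sample(known, round(fraction * len(known)))
    masked = [list(row) for row in genotypes]
    for r, c in hidden:
        masked[r][c] = "?"
    return ["".join(row) for row in masked], hidden


def fill_with_column_mode(genotypes):
    """Baseline recovery: replace each '?' with the most common known value in its column."""
    modes = []
    for col in zip(*genotypes):
        counts = Counter(s for s in col if s != "?")
        modes.append(counts.most_common(1)[0][0] if counts else "0")
    return ["".join(modes[c] if s == "?" else s for c, s in enumerate(row))
            for row in genotypes]


def error_rate_percent(original, recovered, hidden):
    """Percent of hidden entries that were recovered incorrectly."""
    if not hidden:
        return 0.0
    wrong = sum(original[r][c] != recovered[r][c] for r, c in hidden)
    return 100.0 * wrong / len(hidden)


if __name__ == "__main__":
    data = ["0120", "0120", "0110", "0120", "2120"]
    masked, hidden = mask_entries(data, 0.10)
    print(error_rate_percent(data, fill_with_column_mode(masked), hidden))
```

Only the deliberately hidden entries are scored, because the true values of the originally missing ones are unknown.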

Results for Trio Missing Data Recovery

Missing Data Recovery Problem

Diploid - two haplotypes (different copies of each chromosome)
SNP - single nucleotide site where two or more different nucleotides occur in a large percentage of the population
–0 = wild type/major (frequency) allele
–1 = mutation/minor (frequency) allele
Haplotype - description of a single copy
–Example: (0 is for major, 1 is for minor allele)
Genotype - description of the mixed two copies
–Example: (0=00, 1=11, 2=01)

Formulating the Pure-Parsimony Trio Phasing Problem (PTPP) and the Trio Missing Data Recovery Problem (TMDRP)
Two new greedy and integer linear programming (ILP) based methods for solving PTPP and TMDRP
New 2-SNP Statistics (2SNP) phasing method for unrelated individuals
Extensive experimental validation of the proposed methods and comparison with previously known methods

PHASE – Bayesian statistical method (Stephens et al., 2001, 2003)
HAPLOTYPER – Monte Carlo approach (Niu et al., 2002)
Phamily – phases trio families based on PHASE (Ackerman et al., 2003)
Greedy method for phasing and missing data recovery (Halperin and Karp, 2004)
GERBIL – statistical method using maximum likelihood (ML), MST and expectation-maximization (EM) (Kimmel and Shamir, 2005)
SNPHAP – uses ML/EM assuming Hardy-Weinberg equilibrium (Clayton et al., 2004)

Given a set of family trios of genotypes, each with m sites corresponding to m SNPs:
–0 – homozygote with major allele, 1 – homozygote with minor allele, 2 – heterozygote, ? – missing SNP value
Find for each trio four haplotypes h1, h2, h3, h4, each with m 0-1 sites, such that:
–h1 and h2 explain the father's genotype, h3 and h4 explain the mother's genotype, h1 and h3 explain the offspring's genotype
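The definition above translates directly into a feasibility check. The following is a minimal sketch under the stated encoding (0, 1, 2, ?), with '?' treated as matching any value; the function names are hypothetical. It also shows how a feasible quartet implies values for the missing entries.

```python
def genotype_of(h_a, h_b):
    """Genotype (string over 0/1/2) implied by two 0-1 haplotypes."""
    return "".join(a if a == b else "2" for a, b in zip(h_a, h_b))


def explains(h_a, h_b, g):
    """True if h_a and h_b explain genotype g; '?' in g is unconstrained."""
    if not (len(h_a) == len(h_b) == len(g)):
        return False
    return all(s == "?" or s == t for s, t in zip(g, genotype_of(h_a, h_b)))


def is_trio_solution(h1, h2, h3, h4, father, mother, child):
    """h1,h2 explain the father, h3,h4 the mother, h1,h3 the offspring."""
    return (explains(h1, h2, father)
            and explains(h3, h4, mother)
            and explains(h1, h3, child))


def fill_missing(h1, h2, h3, h4):
    """Complete (father, mother, child) genotypes implied by a quartet."""
    return genotype_of(h1, h2), genotype_of(h3, h4), genotype_of(h1, h3)


if __name__ == "__main__":
    quartet = ("01", "10", "01", "00")
    print(is_trio_solution(*quartet, "2?", "02", "01"))
    print(fill_missing(*quartet))
```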

It is easy to find a feasible solution to TPP (there is an exponential number of feasible solutions)
We pursue a parsimonious objective, i.e., minimization of the total number of haplotypes
A drawback of pure parsimony (PP) is that as the number of SNPs (and the number of recombinations) grows, the quality of pure-parsimony phasing diminishes
Partition the genotypes into blocks
In the case of trio data we do not have the block-joining problem
Pure-Parsimony Trio Phasing (PPTP): given 3n genotypes corresponding to n family trios, find the minimum number of distinct haplotypes explaining all trios
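On toy inputs the pure-parsimony objective can be evaluated by exhaustive search. The sketch below is exponential and is meant only to make the objective concrete; it is not the ILP or greedy method from the slides, and the names `trio_quartets` and `pure_parsimony` are hypothetical.

```python
from itertools import product


def geno(a, b):
    """Genotype symbol at one site for alleles a, b (characters '0'/'1')."""
    return a if a == b else "2"


def trio_quartets(f, m, k):
    """All quartets (h1, h2, h3, h4) feasible for one trio; '?' is unconstrained."""
    per_site = []
    for fs, ms, ks in zip(f, m, k):
        per_site.append([
            q for q in product("01", repeat=4)
            if fs in ("?", geno(q[0], q[1]))
            and ms in ("?", geno(q[2], q[3]))
            and ks in ("?", geno(q[0], q[2]))
        ])
    return [
        tuple("".join(site[i] for site in choice) for i in range(4))
        for choice in product(*per_site)
    ]


def pure_parsimony(trios):
    """Return (set of haplotypes, chosen quartets) using the fewest distinct haplotypes.

    Returns None if some trio has no feasible quartet.
    """
    best = None
    for combo in product(*(trio_quartets(*t) for t in trios)):
        used = {h for quartet in combo for h in quartet}
        if best is None or len(used) < len(best[0]):
            best = (used, combo)
    return best


if __name__ == "__main__":
    used, combo = pure_parsimony([("22", "02", "01"), ("22", "22", "22")])
    print(sorted(used))
```

In the usage example the first trio is fully determined and uses haplotypes 00, 01, 10; the ambiguous second trio is then resolved with 01 and 10 rather than 00 and 11, so three distinct haplotypes suffice.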

Proposed by Halperin et al. in "Perfect phylogeny and haplotype assignment" (2004)
For each trio we introduce four partial haplotypes with SNP values 0, 1 and ?
The algorithm iteratively finds the complete haplotype that covers the maximum possible number of partial haplotypes, removes this set of resolved partial haplotypes, and continues in that manner
The drawback of this method is that it introduces errors into the trio constraints
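The greedy covering step described above can be sketched as follows. This is a simplified illustration, not the published algorithm: it enumerates all 2^m complete haplotypes, so it only works for small m, it breaks ties by enumeration order, and the names `compatible` and `greedy_cover` are hypothetical. Because each partial haplotype is resolved independently of the other three in its trio, the result may violate the trio constraints, which is the drawback noted on the slide.

```python
from itertools import product


def compatible(complete, partial):
    """True if the complete 0-1 haplotype agrees with the partial one ('?' matches anything)."""
    return all(p == "?" or p == c for c, p in zip(complete, partial))


def greedy_cover(partials):
    """Map each partial haplotype index to a complete haplotype, greedily.

    Each round picks the complete haplotype compatible with the most
    still-unresolved partial haplotypes and resolves all of them to it.
    """
    if not partials:
        return {}
    m = len(partials[0])
    candidates = ["".join(bits) for bits in product("01", repeat=m)]
    remaining = list(range(len(partials)))
    assignment = {}
    while remaining:
        best = max(
            candidates,
            key=lambda c: sum(compatible(c, partials[i]) for i in remaining),
        )
        for i in remaining:
            if compatible(best, partials[i]):
                assignment[i] = best
        remaining = [i for i in remaining if i not in assignment]
    return assignment


if __name__ == "__main__":
    partials = ["0?1", "?11", "00?", "1?0"]
    result = greedy_cover(partials)
    for i, p in enumerate(partials):
        print(p, "->", result[i])
```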

For each trio we introduce four template haplotypes over {0,1,2,?}
–0, 1 – correspond to fully resolved sites; 2 – appears at SNPs where the genotype is 2; ? – unconstrained SNPs
Variables:
–for each possible haplotype i, x_i ∈ {0,1}
–for each heterozygous SNP j in each template, y_j ∈ {0,1}
