Published by Andrew Walsh. Modified over 8 years ago.
1
International Workshop on Bioinformatics Research and Applications, May 2005
Phasing and Missing Data Recovery in Family Trios
D. Brinza, J. He, W. Mao, A. Zelikovsky
CS Department
2
Overview
SNP, Genotypes and Haplotypes
Phasing & Missing Data Recovery for Trios
Family trios & trio constraints
ILP for Pure Parsimony
Trio phasing without recombinations
3
SNP, Genotypes and Haplotypes
Length of the human genome: 3 × 10^9
Number of single nucleotide polymorphisms (SNPs): 1 × 10^7
SNPs are mostly biallelic, e.g., A ↔ C
Minor allele frequency should be considerable, e.g., > 0.1%
Difference between all people: 0.25% (between any two: 0.1%)
Diploid = two different copies of each chromosome
Haplotype = description of a single copy (expensive to obtain)
–example: 00110101 (0 is for major, 1 is for minor allele)
Genotype = description of the mixed two copies
–example: 01122110 (0 = 00, 1 = 11, 2 = 01)
International HapMap Project: www.hapmap.org
4
Population Phasing Problem
Given an n × m genotype matrix G
–n genotype rows with m SNP columns
Find a 2n × m haplotype matrix H
–2n haplotype rows with m SNP columns
–each genotype g is explained by two haplotypes h1, h2
   h1 = 0011010
   h2 = 0110110
   g  = 0212210
Remarks:
–For an individual with k heterozygous sites (2's), 2^(k-1) haplotype pairs are possible solutions
–This is hopeless without a genetic model
–Programs: PHASE, HAPLOTYPER, HAP, GERBIL, DPPH, etc.
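The 2^(k-1) count in the remarks can be checked by direct enumeration. The following is a minimal sketch, not taken from the slides; the function names `genotype_of` and `explaining_pairs` are illustrative, and genotypes are assumed to contain no missing values.

```python
from itertools import product

def genotype_of(h1: str, h2: str) -> str:
    """Combine two 0/1 haplotypes into a 0/1/2 genotype (2 = heterozygous)."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))

def explaining_pairs(g: str) -> set:
    """All unordered haplotype pairs whose combination equals genotype g."""
    het = [i for i, c in enumerate(g) if c == "2"]
    pairs = set()
    # Try every 0/1 assignment of the first haplotype at heterozygous sites;
    # the second haplotype must carry the complementary allele there.
    for bits in product("01", repeat=len(het)):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, bits):
            h1[i] = b
            h2[i] = "1" if b == "0" else "0"
        # Sorting makes (h1, h2) and (h2, h1) count as the same pair.
        pairs.add(tuple(sorted(("".join(h1), "".join(h2)))))
    return pairs

pairs = explaining_pairs("0212210")   # k = 3 heterozygous sites
print(len(pairs))                     # 2^(3-1) pairs
```

The slide's pair (0011010, 0110110) is one of the enumerated candidates; nothing in the genotype alone distinguishes it from the others, which is why a genetic model is needed.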
5
Family Trios & Trio Constraints
Genotype data commonly come in family trios consisting of two parents and one offspring
Trio data allow offspring haplotypes to be recovered with higher confidence
Haplotype reconstruction should satisfy trio constraints
Example:
–If the genotypes are f = 22, m = 02, k = 01
–Then the haplotypes are
   f1 = 10   m1 = 01   k1 = 01
   f2 = 01   m2 = 00   k2 = 01
Ambiguity remains only at sites where all three genotypes are heterozygous (f = m = k = 2)
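The trio constraint can be applied one site at a time. The sketch below is an illustration under stated assumptions rather than the authors' implementation: no missing values, no recombination, and the names `site_options` and `phase_trio` as well as the `*` marker for an unresolved site are hypothetical.

```python
# Ordered (transmitted, untransmitted) allele splits for each genotype code.
SPLIT = {"0": [("0", "0")], "1": [("1", "1")], "2": [("0", "1"), ("1", "0")]}

def site_options(f: str, m: str, k: str) -> list:
    """Allele assignments (f_sent, f_kept, m_sent, m_kept) consistent at one site."""
    options = []
    for f_sent, f_kept in SPLIT[f]:
        for m_sent, m_kept in SPLIT[m]:
            child = f_sent if f_sent == m_sent else "2"
            if child == k:
                options.append((f_sent, f_kept, m_sent, m_kept))
    return options

def phase_trio(f: str, m: str, k: str):
    """Return the four parental haplotypes, '*' where the site stays ambiguous.

    Returns None if some site violates Mendelian inheritance.
    """
    columns = []
    for a, b, c in zip(f, m, k):
        options = site_options(a, b, c)
        if not options:
            return None
        columns.append(options[0] if len(options) == 1 else ("*",) * 4)
    return tuple("".join(col[i] for col in columns) for i in range(4))

# Slide example: father 22, mother 02, offspring 01.
print(phase_trio("22", "02", "01"))  # ('01', '10', '01', '00')
```

Looping `site_options` over all 27 genotype combinations confirms that more than one assignment survives only for f = m = k = 2.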
6
Family Trio Phasing
Parental Trio Phasing Problem
–Given a set of genotypes partitioned into family trios
–Find for each trio a quartet of parental haplotypes that agree with all three genotypes:
   Parental haplotypes agree with the parental genotypes
   Inherited parental haplotypes agree with the offspring genotype
General Trio Phasing Problem
–Find (additionally) for each offspring the "true" recombination of the inherited parental haplotypes
7
ILP for Parental Trio Phasing
Introduce four template haplotypes over {0,1,2,?}
Variables:
–x for each possible haplotype
–y for each 2
Objective: (formula not preserved in the transcript)
Constraints: (formulas not preserved in the transcript)
8
Results
9
Trio Phasing w/o Crossovers
Three phasing methods on the real and simulated data sets
Error = % of sites where the (best choice of) inherited paternal and maternal haplotypes disagree with the offspring genotype
D = Hamming distance, in %, between the phased haplotypes and the closest feasible haplotypes
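One plausible reading of the Error measure, sketched for a single trio: try each of the four (paternal, maternal) haplotype choices and keep the one that disagrees with the offspring genotype at the fewest sites. The slides do not spell out the exact computation, so the function name `trio_error` and the details (whole haplotypes inherited, no missing values) are assumptions.

```python
def trio_error(father: tuple, mother: tuple, child_genotype: str) -> float:
    """Percentage of mismatching sites under the best inherited-haplotype choice.

    father, mother: pairs of phased 0/1 haplotypes.
    child_genotype: 0/1/2 string of the same length, with no missing values.
    """
    best = None
    for hf in father:
        for hm in mother:
            # Genotype implied by inheriting hf and hm without recombination.
            mismatches = sum(
                (a if a == b else "2") != g
                for a, b, g in zip(hf, hm, child_genotype)
            )
            best = mismatches if best is None else min(best, mismatches)
    return 100.0 * best / len(child_genotype)

print(trio_error(("10", "01"), ("01", "00"), "01"))  # 0.0
print(trio_error(("00", "00"), ("00", "00"), "01"))  # 50.0
```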
10
Trio Phasing w/o Crossovers
[Figure: parent/offspring-feasible phasings and trio-feasible phasings. Labels: PHASE; pure parsimonious = no recombinations; random; projections = closest trio-feasible]
11
Missing Data Recovery Problem
Real data are often missing some SNP values
–Daly et al. data (Crohn's disease): 10%-16%
–Gabriel et al. data (HapMap): 7%-10%
How to reconstruct missing values?
How to verify a reconstruction method?
–Scramble an extra 10% and reconstruct them
Halperin and Karp (2004) report an error rate of 2.8%
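The verification protocol (hide an extra 10% of known values, reconstruct, compare) can be sketched as follows. This is an illustration only: `mask_entries`, `error_rate`, and especially `column_mode_fill` are hypothetical names, and the column-mode filler is a trivial stand-in for a recovery method, not one of the methods compared in the talk.

```python
import random
from collections import Counter

def mask_entries(genotypes, fraction, rng):
    """Hide `fraction` of the known entries; return masked rows and hidden truth."""
    known = [(r, c) for r, row in enumerate(genotypes)
             for c, v in enumerate(row) if v != "?"]
    hidden = rng.sample(known, round(fraction * len(known)))
    rows = [list(row) for row in genotypes]
    truth = {}
    for r, c in hidden:
        truth[(r, c)] = rows[r][c]
        rows[r][c] = "?"
    return ["".join(row) for row in rows], truth

def column_mode_fill(genotypes):
    """Stand-in recovery: replace '?' by the most common known value in its column."""
    filled = [list(row) for row in genotypes]
    for c in range(len(filled[0])):
        known = [row[c] for row in filled if row[c] != "?"]
        mode = Counter(known).most_common(1)[0][0] if known else "0"
        for row in filled:
            if row[c] == "?":
                row[c] = mode
    return ["".join(row) for row in filled]

def error_rate(recovered, truth):
    """Percentage of hidden entries recovered incorrectly (truth must be non-empty)."""
    wrong = sum(recovered[r][c] != v for (r, c), v in truth.items())
    return 100.0 * wrong / len(truth)

data = ["01201"] * 20                       # toy data: 20 identical genotypes
masked, truth = mask_entries(data, 0.10, random.Random(0))
print(error_rate(column_mode_fill(masked), truth))
```

Any real recovery method can be plugged in where `column_mode_fill` is called; only the masking and scoring steps correspond to the protocol on the slide.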
12
Results for Trio Missing Data Recovery
13
Missing Data Recovery Problem
14
Diploid - two haplotypes (different copies of each chromosome)
SNP - single nucleotide site where two or more different nucleotides occur in a large percentage of the population
–0 = wild type/major (frequency) allele
–1 = mutation/minor (frequency) allele
Haplotype - description of a single copy
–Example: 00110101 (0 is for major, 1 is for minor allele)
Genotype - description of the mixed two copies
–Example: 01122110 (0 = 00, 1 = 11, 2 = 01)
15
Formulation of the Pure-Parsimony Trio Phasing Problem (PTPP) and the Trio Missing Data Recovery Problem (TMDRP)
Two new methods, one greedy and one based on integer linear programming (ILP), for solving PTPP and TMDRP
New 2-SNP statistics (2SNP) phasing method for unrelated individuals
Extensive experimental validation of the proposed methods and comparison with previously known methods
16
PHASE – Bayesian statistical method (Stephens et al., 2001, 2003)
HAPLOTYPER – Monte Carlo approach (Niu et al., 2002)
Phamily – phases trio families based on PHASE (Ackerman et al., 2003)
Greedy method for phasing and missing data recovery (Halperin and Karp, 2004)
GERBIL – statistical method using maximum likelihood (ML), MST and expectation-maximization (EM) (Kimmel and Shamir, 2005)
SNPHAP – uses ML/EM assuming Hardy-Weinberg equilibrium (Clayton et al., 2004)
17
Given a set of family trios of genotypes, each with m sites corresponding to m SNPs:
–0 = homozygote with major allele, 1 = homozygote with minor allele, 2 = heterozygote, ? = missing SNP value
Find for each trio four haplotypes h1, h2, h3, h4, each with m 0-1 sites, such that:
–h1 and h2 explain the father's genotype
–h3 and h4 explain the mother's genotype
–h1 and h3 explain the offspring's genotype
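The three "explain" conditions translate directly into a feasibility check. A minimal sketch, assuming that a missing value `?` in a genotype is compatible with any pair of alleles; the names `explains` and `is_feasible_quartet` are illustrative.

```python
def explains(h1: str, h2: str, genotype: str) -> bool:
    """True if 0/1 haplotypes h1 and h2 explain a 0/1/2/? genotype."""
    for a, b, g in zip(h1, h2, genotype):
        if g == "?":
            continue  # a missing value constrains nothing
        if (a if a == b else "2") != g:
            return False
    return True

def is_feasible_quartet(h1, h2, h3, h4, father, mother, child) -> bool:
    """Check the trio conditions: (h1,h2)->father, (h3,h4)->mother, (h1,h3)->child."""
    return (explains(h1, h2, father)
            and explains(h3, h4, mother)
            and explains(h1, h3, child))

# Trio with father 22, mother 02, offspring 01; h1 and h3 are the inherited copies.
print(is_feasible_quartet("01", "10", "01", "00", "22", "02", "01"))  # True
print(is_feasible_quartet("10", "01", "01", "00", "22", "02", "01"))  # False
```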
18
It is easy to find a feasible solution to the trio phasing problem (TPP): the number of feasible solutions is exponential
We pursue a parsimonious objective, i.e., minimization of the total number of haplotypes
Drawback of pure parsimony (PP): as the number of SNPs (and with it the number of recombinations) grows, the quality of pure parsimony phasing degrades
–Remedy: partition the genotypes into blocks
–With trio data there is no problem of joining blocks
Pure-Parsimony Trio Phasing Problem (PTPP). Given 3n genotypes corresponding to n family trios, find the minimum number of distinct haplotypes explaining all trios
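For very small instances the pure-parsimony objective can be illustrated by exhaustive search. This is deliberately not the ILP or greedy method of the talk: it enumerates every feasible quartet per trio and every combination across trios, so it is exponential and only meant to make the objective concrete. It assumes no missing values and no recombination; `trio_quartets` and `min_haplotypes` are hypothetical names.

```python
from itertools import product

# Ordered allele splits (first copy, second copy) for each genotype code.
SPLIT = {"0": ["00"], "1": ["11"], "2": ["01", "10"]}

def trio_quartets(f: str, m: str, k: str) -> list:
    """All (h1, h2, h3, h4): h1,h2 explain f; h3,h4 explain m; h1,h3 explain k."""
    per_site = []
    for a, b, c in zip(f, m, k):
        options = [(x[0], x[1], y[0], y[1])
                   for x in SPLIT[a] for y in SPLIT[b]
                   if (x[0] if x[0] == y[0] else "2") == c]
        per_site.append(options)
    # Combine one option per site into four full-length haplotypes.
    return [tuple("".join(site[i] for site in combo) for i in range(4))
            for combo in product(*per_site)]

def min_haplotypes(trios):
    """Smallest set of distinct haplotypes explaining all trios, or None."""
    best = None
    for choice in product(*(trio_quartets(*t) for t in trios)):
        used = set().union(*choice)
        if best is None or len(used) < len(best):
            best = used
    return best

trios = [("22", "02", "01"), ("22", "22", "22")]
print(sorted(min_haplotypes(trios)))  # ['00', '01', '10']
```

In this toy input the first trio has a unique phasing, and parsimony resolves the fully heterozygous second trio by reusing the haplotypes 01 and 10 that the first trio already requires.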
19
Proposed by Halperin and Karp in "Perfect phylogeny and haplotype assignment" (2004)
For each trio we introduce four partial haplotypes with SNP values 0, 1 and ?
The algorithm iteratively finds the complete haplotype that covers the maximum possible number of partial haplotypes, removes this set of resolved partial haplotypes, and continues in that manner
The drawback of this method is that it can introduce violations of the trio constraints
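The greedy covering loop described above can be sketched as follows. The candidate search here is a brute-force scan over all 2^m complete haplotypes, which is an assumption made for brevity and is only practical for small m; the published method may choose candidates differently. The names `covers` and `greedy_cover` are illustrative.

```python
from itertools import product

def covers(complete: str, partial: str) -> bool:
    """True if a complete 0/1 haplotype agrees with a partial one ('?' = free)."""
    return all(p == "?" or p == c for c, p in zip(complete, partial))

def greedy_cover(partials: list) -> list:
    """Repeatedly pick the complete haplotype covering the most remaining partials."""
    remaining = list(partials)
    m = len(partials[0])
    chosen = []
    while remaining:
        candidates = ("".join(bits) for bits in product("01", repeat=m))
        best = max(candidates,
                   key=lambda h: sum(covers(h, p) for p in remaining))
        chosen.append(best)
        # Every partial is covered by at least one candidate, so this shrinks.
        remaining = [p for p in remaining if not covers(best, p)]
    return chosen

partials = ["0?1", "?11", "00?", "1?0"]
solution = greedy_cover(partials)
print(len(solution))
```

Because each partial haplotype is resolved on its own, nothing in this loop keeps the four resolved haplotypes of a trio consistent with its genotypes, which is the drawback the slide names.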
20
For each trio we introduce four template haplotypes over {0,1,2,?}
–0, 1 correspond to fully resolved sites; 2 appears at SNPs where the genotype is 2; ? marks unconstrained SNPs
Variables:
–for each possible haplotype i, x_i ∈ {0,1}
–for each heterozygous SNP j in each template, y_j ∈ {0,1}