June 2, Combinatorial methods in Bioinformatics: the haplotyping problem Paola Bonizzoni DISCo Università di Milano-Bicocca
Content
Motivation: biological terms
Combinatorial methods in haplotyping
Haplotyping via perfect phylogeny: the PPH problem
Inference of incomplete perfect phylogeny: algorithms
Incomplete pph and missing data
Other models: open problems
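Biological terms
Diploid organism: two haplotypes, e.g. maternal A A A and paternal G C A at sites i, i+1, i+2.
Genotype: homozygous at a site where the two haplotypes carry the same allele, heterozygous where they differ.
Biallelic site i: |Value(i)| ≤ 2, with Value(i) ⊆ {A,C,G,T}.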
Motivations
Human genetic variations are related to diseases (cancers, diabetes, osteoporosis).
The most common variation is the Single Nucleotide Polymorphism (SNP) on haplotypes in chromosomes.
The human genome project produces genotype sequences of humans.
Computational methods to derive haplotypes from genotype data are in demand.
Ongoing international HapMap project: find haplotype differences on a large scale (HapMap population data).
Combinatorial methods: graphs, set-cover problems, optimization problems.
Haplotyping: the formal model
Haplotype: m-vector h = (h(1), …, h(m)) over {0,1}.
Genotype: m-sequence g = (g(1), …, g(m)) over {0,1,*}.
Def. Haplotypes h, k solve genotype g iff:
g(i) = * implies h(i) ≠ k(i);
h(i) = k(i) = g(i) otherwise.
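The definition above can be checked mechanically. The following is a minimal sketch, assuming haplotypes and genotypes are written as strings over "0", "1" and "*"; the function name `solves` is hypothetical and not part of the slides.

```python
def solves(h, k, g):
    """Return True if the haplotype pair (h, k) solves genotype g.

    Assumed encoding: h and k are strings over "01", g is a string
    over "01*", where "*" marks a heterozygous site.
    """
    if not (len(h) == len(k) == len(g)):
        return False
    for hi, ki, gi in zip(h, k, g):
        if gi == "*":
            # Heterozygous site: the two haplotypes must differ.
            if hi == ki:
                return False
        elif not (hi == ki == gi):
            # Homozygous site: both haplotypes must equal the genotype.
            return False
    return True


print(solves("01101", "01011", "01**1"))
print(solves("01101", "01101", "01**1"))
```

The first call uses a genotype and haplotype pair that appear later in the deck (g1 = 0,1,*,*,1); the second fails because the pair does not differ at the * sites.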
Examples
A genotype g solved by a pair of haplotypes h, k.
Clark inference rule applied to genotypes g1, g2, g3 and haplotypes h1, h2, h3.
[Figure: the example vectors were not recovered.]
Haplotype inference: the general problem
Problem HI:
Instance: a set G = {g1, …, gm} of genotypes and a set H = {h1, …, hn} of haplotypes.
Solution: a set H' of haplotypes that solves each genotype g in G s.t. H ⊆ H'.
H' derives from an inference RULE.
Types of inference rules
Clark's rule: haplotypes solve g by an iterative rule.
Gusfield coalescent model: haplotypes are related to genotypes by a tree model.
Pedigree data: haplotypes are related to genotypes by a directed graph.
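As an illustration of the iterative rule, here is a minimal sketch of Clark-style inference in the 0/1/* model. It assumes the usual statement of the rule: if a known haplotype h agrees with g on every non-* site, then g is solved by h and its complement on the * sites. The function names and the greedy processing order are assumptions for illustration; different orders can give different results.

```python
def compatible(h, g):
    """True if haplotype h agrees with genotype g on every non-* site."""
    return len(h) == len(g) and all(gi == "*" or gi == hi for hi, gi in zip(h, g))


def complement(h, g):
    """The haplotype k such that the pair (h, k) solves g."""
    return "".join(
        ("1" if hi == "0" else "0") if gi == "*" else gi
        for hi, gi in zip(h, g)
    )


def clark_inference(genotypes, known):
    """Greedy sketch: repeatedly resolve genotypes with known haplotypes."""
    haplotypes = list(known)
    unresolved = list(genotypes)
    resolved = {}
    progress = True
    while progress and unresolved:
        progress = False
        for g in list(unresolved):
            for h in haplotypes:
                if compatible(h, g):
                    k = complement(h, g)
                    resolved[g] = (h, k)
                    if k not in haplotypes:
                        haplotypes.append(k)  # k may resolve later genotypes
                    unresolved.remove(g)
                    progress = True
                    break
    return resolved, haplotypes, unresolved


resolved, haplotypes, unresolved = clark_inference(["0*0*", "***1"], ["0000"])
print(resolved, haplotypes, unresolved)
```

In this run the second genotype can only be resolved after the first one has contributed a new haplotype, which is the iterative aspect of the rule.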
Mendelian law and Recombination
[Figure: Mendelian law: Father (B, A) and Mother (C, D); children C1, C2, C3, C4 labelled AC, AD, BC, DB. Recombination: Parent labelled ACAC and BDBD; Child labelled ADAD and BCBC.]
Pedigree
Terms: pedigree, nuclear family, founder.
[Figure: pedigree diagram labelled with father, mother, children, ID numbers, genotypes, founders, nuclear family, family trio, loop, mating node.]
Haplotyping from genotypes: the problem & methods
Problem:
Input: genotype data (possibly with missing entries). Output: haplotypes.
Input data:
Data with pedigree (dependent).
Data without pedigree info (independent).
Statistical methods: find the most likely haplotypes based on genotype data.
Adv: solid theoretical basis. Disadv: computationally intensive.
Rule-based methods: define rules based on some plausible assumptions and find the haplotypes consistent with these rules.
Adv: usually simple and thus very fast. Disadv: no numerical assessment of the reliability of the results.
HI by the perfect phylogeny model
IDEA: genotypes are the mating of haplotypes in a tree. Given G, find H and T that explain G!
G: g1 = 0,1,*,*,1 is solved in H by 0,1,1,0,1 and 0,1,0,1,1.
G: g2 = *,0,0,0,1 is solved in H by 1,0,0,0,1 and 0,0,0,0,1.
[Figure: tree T; the label 00000 appears in the figure.]
Perfect Phylogeny models
Input data: 0-1 matrix A (characters, species).
Output data: phylogeny for A.
[Figure: matrix with rows s1..s4 and columns c1..c5, and its tree with root R, leaves s1..s4 and edges labelled c3, c4, c2 and c1, c5; the path c3, c4 is marked.]
Perfect phylogeny
Def. A pp T for a 0-1 matrix A:
each row s_i labels exactly one leaf of T;
each column c_j labels exactly one edge of T;
each internal edge is labelled by at least one column c_j;
row s_i gives the 0,1 path from the root to s_i.
[Figure: tree with leaves s1..s4 and edges labelled c1..c5.]
pp model: another view
L(x), the cluster of x: the set of leaves of T_x, the subtree of T rooted at x.
[Figure: tree with leaves s1..s4 and an internal node x.]
A pp is associated to a tree-family (S,C) with S = {s1, …, sn} and C = {S' ⊆ S : S' is a cluster} s.t. for all X, Y in C, if X ∩ Y ≠ ∅ then X ⊆ Y or Y ⊆ X.
pp: another view
A tree-family (S,C) is represented by a 0-1 matrix: each column c_i corresponds to a set S' in C, with s_j ∈ S' iff b_ji = 1; for each set in C there is at least one column.
Lemma. A 0-1 matrix has a pp iff it represents a tree-family.
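The tree-family condition can be tested directly on the columns of a matrix. The sketch below assumes the matrix is a list of rows of 0/1 integers and reads each column as the set of rows holding a 1; the function names are hypothetical. It checks every pair of columns, so it is a quadratic illustration rather than an efficient test.

```python
from itertools import combinations


def column_sets(matrix):
    """Map each column c_i to the set of row indices j with b_ji = 1."""
    if not matrix:
        return []
    return [
        frozenset(i for i, row in enumerate(matrix) if row[j] == 1)
        for j in range(len(matrix[0]))
    ]


def is_tree_family(sets):
    """True if any two sets are either disjoint or nested."""
    return all(
        not (x & y) or x <= y or y <= x
        for x, y in combinations(sets, 2)
    )


# Hypothetical 4 x 5 example: columns are {s1,s3}, {s1,s3}, {s2,s4}, {s4}, {s3}.
A = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
]
print(is_tree_family(column_sets(A)))
```

A matrix such as [[1, 1], [1, 0], [0, 1]] fails the test, because its two columns share the first row while neither contains the other.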
Haplotyping by the pp
A 0-1 matrix B represents the phylogenetic tree for a set H of haplotypes: each row s_i is a haplotype and each column c_i is a SNP site.
The 0-1 switch in position i occurs only once in the tree!
Haplotyping and the pp: observations
The root of T may not be the haplotype: 0-1 switch or 1-0 switch.
(Directed case): 0-1 switch.
HI problem in the pp model
Input data: a 0-1-* matrix B (n × m) of genotypes G.
Output data: a 0-1 matrix B' (2n × m) of haplotypes s.t.
(1) each g ∈ G is solved by a pair of rows in B';
(2) B' has a pp (tree family).
DECISION problem.
[Figure: example genotype rows with unknown haplotype rows marked "???".]
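To make the decision problem concrete, here is a brute-force sketch that tries every way of resolving the * sites and keeps the first haplotype matrix whose columns form a tree family. It is exponential and is not one of the algorithms cited in this deck. It assumes the directed setting (all-zero root), genotypes as strings over "0", "1", "*", and hypothetical function names.

```python
from itertools import combinations, product


def resolutions(g):
    """Yield each unordered haplotype pair (h, k) that solves genotype g."""
    stars = [i for i, s in enumerate(g) if s == "*"]
    if not stars:
        yield g, g
        return
    # Fix h to 0 at the first * site so that (h, k) and (k, h) are not both tried.
    for bits in product("01", repeat=len(stars) - 1):
        h, k = list(g), list(g)
        for pos, b in zip(stars, ("0",) + bits):
            h[pos] = b
            k[pos] = "1" if b == "0" else "0"
        yield "".join(h), "".join(k)


def is_tree_family(rows):
    """True if the columns of the 0-1 rows are pairwise disjoint or nested."""
    if not rows:
        return True
    cols = [
        frozenset(i for i, r in enumerate(rows) if r[j] == "1")
        for j in range(len(rows[0]))
    ]
    return all(not (x & y) or x <= y or y <= x for x, y in combinations(cols, 2))


def pph_brute_force(genotypes):
    """Return 2n haplotype rows with a pp, or None if no resolution works."""
    for choice in product(*(list(resolutions(g)) for g in genotypes)):
        rows = [hap for pair in choice for hap in pair]
        if is_tree_family(rows):
            return rows
    return None


print(pph_brute_force(["**", "0*", "10"]))
print(pph_brute_force(["11", "10", "01"]))
```

The first call succeeds; the second returns None because the three homozygous genotypes already force two columns that overlap without being nested.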
An example
Genotypes:
a   * *
b   0 *
c   1 0
Haplotypes:
a   1 0
a'  0 1
b   0 1
b'  0 0
c   1 0
c'  1 0
[Figure: tree with leaves a, c, c', b', a', b.]
The pph problem: solutions
An undirected algorithm: Gusfield, Recomb 2002.
An O(nm^2) algorithm: Karp et al., Recomb 2003.
A linear time O(nm) algorithm ?? Optimal algorithm.
A related problem: the incomplete directed pp (IDP), inferring a pp from a 0-1-* matrix.
O(nm + k log^2(n+m)) algorithm: Peer, T. Pupko, R. Shamir, R. Sharan, SIAM 2004.
IDP problem
Instance: a 0-1-? matrix A.
Solution: resolve each ? into 0 or 1 to obtain a matrix A' and a pp for A', or say "no pp exists".
OPEN PROBLEM: find an optimal algorithm ??
[Figure: example 0-1-? matrix with rows S1, S2, S3 and columns C1..C5; the entries were not recovered.]
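The instance/solution pair can be illustrated with an exhaustive search over the ? entries. This is only a sketch of the problem statement, exponential in the number of ? entries, and not the algorithm cited in this deck. It assumes the directed setting, a matrix given as a list of rows with entries 0, 1 or "?", and hypothetical function names.

```python
from itertools import combinations, product


def has_tree_family(matrix):
    """True if the columns of a complete 0-1 matrix are pairwise disjoint or nested."""
    if not matrix:
        return True
    cols = [
        frozenset(i for i, row in enumerate(matrix) if row[j] == 1)
        for j in range(len(matrix[0]))
    ]
    return all(not (x & y) or x <= y or y <= x for x, y in combinations(cols, 2))


def solve_idp_brute_force(matrix):
    """Return a completed matrix A' that has a pp, or None if no pp exists."""
    holes = [
        (i, j)
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value == "?"
    ]
    for values in product((0, 1), repeat=len(holes)):
        filled = [list(row) for row in matrix]
        for (i, j), v in zip(holes, values):
            filled[i][j] = v
        if has_tree_family(filled):
            return filled
    return None


print(solve_idp_brute_force([[1, "?"], [1, 1], ["?", 1]]))
```

Returning None plays the role of the "no pp exists" answer.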
Decision algorithms for incomplete pp
Based on: characterization of a 0-1 matrix A that has a pp:
tree family;
forbidden submatrix (gives a "no" certificate).
Bipartite graph G(A) = (S,C,E) with E = {(s_i, c_j) : b_ij = 1}.
[Figure: forbidden subgraph on vertices c, c', s1, s2, s3.]
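A "no" certificate can be produced by searching G(A) for the forbidden subgraph. The sketch below assumes the forbidden subgraph is the induced path s1, c, s2, c', s3, that is, two columns that share a row while each also has a row the other lacks; this reading and the function name are assumptions, since the figure itself was not recovered.

```python
from itertools import combinations


def find_forbidden_subgraph(matrix):
    """Return (s1, c, s2, c2, s3) as row/column indices, or None.

    c is adjacent to s1 and s2 only, c2 is adjacent to s2 and s3 only
    (within the five returned vertices), so the five vertices induce a path.
    """
    if not matrix:
        return None
    m = len(matrix[0])
    neighbors = [
        {i for i, row in enumerate(matrix) if row[j] == 1} for j in range(m)
    ]
    for c, c2 in combinations(range(m), 2):
        shared = neighbors[c] & neighbors[c2]
        only_c = neighbors[c] - neighbors[c2]
        only_c2 = neighbors[c2] - neighbors[c]
        if shared and only_c and only_c2:
            return (min(only_c), c, min(shared), c2, min(only_c2))
    return None


print(find_forbidden_subgraph([[1, 0], [1, 1], [0, 1]]))
```

Under this reading, a matrix with no such certificate is exactly one whose columns form a tree family.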
Test: does a 0-1 matrix A have a pp? O(nm) algorithm (Gusfield 1991)
Steps:
1. Given A, order the columns {c1, …, cm} as decreasing binary numbers to obtain A'.
2. For each (i,j) with A'[i,j] = 1, let L(i,j) = max{l < j : A'[i,l] = 1}, or 0 if there is no such l.
3. Let index(j) = max{L(i,j) : i}.
4. Then apply the theorem.
TH. A' has a pp iff L(i,j) = index(j) for each (i,j) s.t. A'[i,j] = 1.
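The steps of this test can be sketched as follows. The sketch assumes a matrix given as a list of rows of 0/1 integers, reads each column top to bottom as a binary number, and numbers the sorted columns from 1 so that 0 can stand for "no earlier 1 in this row". It uses a comparison sort for brevity, so it does not reach the O(nm) bound, which would need a radix sort of the columns. The function name is hypothetical.

```python
def has_perfect_phylogeny(matrix):
    """Apply the L(i,j) / index(j) test to a 0-1 matrix."""
    if not matrix or not matrix[0]:
        return True
    n, m = len(matrix), len(matrix[0])

    # Step 1: sort columns as decreasing binary numbers (row 0 = most significant bit).
    columns = sorted(
        (tuple(row[j] for row in matrix) for j in range(m)), reverse=True
    )

    # Steps 2 and 3: L(i,j) is the closest earlier column with a 1 in row i;
    # index(j) is the maximum of L(i,j) over the rows i with a 1 in column j.
    L = {}
    index = [0] * m
    for i in range(n):
        last = 0  # 1-based position of the previous 1 in row i, 0 if none
        for j in range(m):
            if columns[j][i] == 1:
                L[(i, j)] = last
                index[j] = max(index[j], last)
                last = j + 1

    # Step 4: the theorem.
    return all(L[(i, j)] == index[j] for (i, j) in L)


A = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
]
print(has_perfect_phylogeny(A))
print(has_perfect_phylogeny([[1, 1], [1, 0], [0, 1]]))
```

The matrix A here is a hypothetical example, not the one from the slides. In the second call the two columns share a row without being nested, so one entry has L(i,j) different from index(j) and the test fails.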
Idea:
[Figure: not recovered.]
The IDP algorithm
[Figure: graph on vertices c, c', s1, s2, s3.]
Other HI problems via the pp model
Incomplete 0-1-*-? matrix because of missing data:
haplotype pp (Ihpp): haplotype rows;
genotype pp (Igpp): genotype rows.
Algorithms:
Ihpp = IDP given a row as a root (polynomial time); NP-complete otherwise.
Igpp has a polynomial solution under the rich data hypothesis (Karp et al., Recomb 2004, Icalp 2004); NP-complete otherwise.
HI problem and other models
Haplotype inference in pedigree data under the recombination model.
[Figure: maternal and paternal haplotypes, a recombination event, and the child.]
Pedigree graph
Single mating; pedigree tree; mating loop; nuclear family; pedigree graph.
[Figure: pedigree graph with father, mother and child nodes.]
Haplotype inference in pedigree
[Figure: pedigree whose members carry phased genotypes written as paternal|maternal allele pairs such as 0|1, 1|0, 0|0, 1|1; the layout was not recovered.]
Problems:
MPT-MRHI (pedigree tree, multi-mating, minimum recombination HI)
SPT-MRHI (pedigree tree, single-mating, minimum recombination HI)
OPEN
NP-complete even if the graph is acyclic, but with an unbounded number of children…
Conclusions