Fast Elimination of Redundant Linear Equations and Reconstruction of Recombination-free Mendelian Inheritance on a Pedigree. Authors: Lan Liu and Tao Jiang (Univ. California, Riverside); Jing Xiao and Lirong Xia (Tsinghua Univ., China)
Outline: Introduction and problem definition; A new system of linear equations for ZRHC; An O(mn^3) time algorithm for ZRHC; An improved algorithm for ZRHC; Conclusion
Pedigree An example: British Royal Family
Biological Background Basic concepts. Mendelian law: one haplotype comes from the father and the other comes from the mother. Example (Mendelian experiment): genotype 12 is heterozygous; genotypes 11 and 22 are homozygous. [Figure labels: paternal, maternal; haplotype pairs 2|1 and 1|2.]
Notations and Recombinant [Figure: a genotype, a haplotype, and a haplotype configuration on Father-Mother-Child trios; one trio is labelled "0 recombinant" and the other marks its recombinants.]
Haplotype Configuration Reconstruction Haplotypes: useful, but expensive to obtain. Genotypes: not so informative, but cheaper to obtain. In biological applications, genotypes rather than haplotypes are collected. How can haplotypes be reconstructed from genotypes? Under the recombination-free assumption.
The ZRHC problem Problem definition Given a pedigree and the genotype information for each member, find a recombination-free haplotype configuration for each member that obeys the Mendelian law of inheritance.
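The problem definition can be made concrete with a small check on a single father-mother-child trio. This is only a sketch under assumptions that are not in the slides: haplotypes are stored as tuples of alleles, each member is a (paternal, maternal) pair, and the function names are hypothetical.

```python
def is_zero_recombinant(father, mother, child):
    """Check recombination-free Mendelian inheritance on one trio.

    Each argument is a (paternal, maternal) pair of haplotype tuples.
    With zero recombinants, the child's paternal haplotype is a whole copy
    of one of the father's haplotypes, and the child's maternal haplotype
    is a whole copy of one of the mother's haplotypes.
    """
    child_pat, child_mat = child
    return child_pat in father and child_mat in mother


def genotype_of(hap_pair):
    """Unordered allele pair at each locus, e.g. (1, 2) for a heterozygous locus."""
    pat, mat = hap_pair
    return [tuple(sorted(pair)) for pair in zip(pat, mat)]


# Made-up three-locus example.
father = ((1, 2, 1), (2, 2, 1))
mother = ((1, 1, 2), (2, 1, 1))
child_ok = ((2, 2, 1), (1, 1, 2))
child_bad = ((1, 2, 1), (1, 1, 1))  # maternal haplotype matches neither of the mother's
```

ZRHC asks for the reverse direction: only the output of something like `genotype_of` is known for every member, and haplotype pairs passing this check on every trio have to be found.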
Previous Work Li and Jiang introduced a system of linear equations over F_2 and presented an algorithm for ZRHC whose running time is polynomial in m and n [LJ03], where m is the number of loci and n is the number of members in the pedigree. Several attempts have been made recently, but the authors failed to prove the correctness of their algorithms in all cases, especially when the input pedigree has mating loops [CZ04] [LCL06]. Recently, Chan et al. proposed a linear-time algorithm [CCC+06], which only works for pedigrees without mating loops.
Related work Methods based on fast matrix multiplication algorithms could achieve an asymptotic running time of O(k^ω) on k equations with k unknowns, where ω < 3 is the exponent of matrix multiplication. The Lanczos and conjugate gradient algorithms are only heuristics [GV96]. The Wiedemann algorithm has expected quadratic running time [W86].
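Our Result We present a much faster algorithm for ZRHC with running time O(mn^2 + n^3 log^2 n log log n). [Figure: the system Ax = b passes through a transformation step and then a redundancy elimination step; labels: O(n), O(n log^2 n log log n).]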
The New Linear System Parameters: m is #loci, n is #members in the pedigree. Unknowns: p_j, the paternal haplotype vector of a member j; h_{j_1,j}, the scalar encoding the inheritance information between a parent j_1 and a child j.
The New Linear System [Figure: a child j with parents j_1 and j_2 over four loci. Each member k has a paternal haplotype p_k = (p_{k,1}, p_{k,2}, p_{k,3}, p_{k,4}) and a maternal haplotype p_k + w_k. In the example, w_{j_1} = (1, 0, 0, 1), w_{j_2} = (0, 1, 1, 1) and w_j = (1, 1, 0, 0). The parent-child edges are labelled h_{j_1,j} and h_{j_2,j}, and the pre-determined values are p_{j_1,2} = 1 and p_{j_1,3} = 0.]
The Linear System O(mn) equations on O(mn) unknowns. Given a homozygous locus i on a member j (with a child j_1), p_j[i] and p_{j_1}[i] are pre-determined.
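A minimal sketch of how the known quantities could be read off one member's genotype. The 0/1 allele encoding (allele 1 as 0, allele 2 as 1), the 0-based locus indices and the function names are assumptions made for illustration; the slides only state that w marks the loci where the two haplotypes differ and that p is pre-determined at homozygous loci.

```python
def w_vector(genotype):
    """w[i] = 1 if locus i is heterozygous, else 0."""
    return [0 if a == b else 1 for a, b in genotype]


def predetermined(genotype):
    """Map locus index -> p[i] for homozygous loci.

    Assumed encoding: allele 1 is stored as 0 and allele 2 as 1.
    """
    return {i: a - 1 for i, (a, b) in enumerate(genotype) if a == b}


def maternal(p, w):
    """Maternal haplotype p + w, computed modulo 2."""
    return [(x + y) % 2 for x, y in zip(p, w)]


# A four-locus genotype chosen so that w = (1, 0, 0, 1).
genotype = [(1, 2), (2, 2), (1, 1), (1, 2)]
w = w_vector(genotype)
known = predetermined(genotype)
```

Here `known` fixes p at the two homozygous loci (indices 1 and 2), while the entries at the heterozygous loci stay unknown.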
Pedigree Graph [Figure: a pedigree with genotypes and its pedigree graph G.] #edges ≤ 2n
Locus Graph Locus graph G_i = (V, E_i), where E_i = {(k, j) | k is a parent of j, w_k[i] = 1}. Example: locus graph for the 3rd locus. [Figure: (a) genotype information; (b) the locus graph, with edges labelled h_{1,4}, h_{4,9}, h_{8,9}, h_{6,8}; legend entries "Zero-weight" and "?".]
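The edge set E_i can be built directly from its definition. The sketch below assumes a pedigree stored as a dictionary from each child to its parents and 0-based locus indices; the function name and the toy data are hypothetical.

```python
def locus_graph_edges(parents, w, i):
    """Edge set E_i = {(k, j) : k is a parent of j and w_k[i] = 1}.

    parents : dict mapping each child j to a tuple of its parents
    w       : dict mapping each member to its 0/1 vector w
    i       : 0-based locus index
    """
    return [(k, j) for j, ps in parents.items() for k in ps if w[k][i] == 1]


# Toy pedigree: members 1 and 2 are the parents of 3; members 3 and 4 are the parents of 5.
parents = {3: (1, 2), 5: (3, 4)}
w = {1: [1, 0], 2: [0, 1], 3: [1, 1], 4: [0, 0], 5: [1, 0]}

edges_locus0 = locus_graph_edges(parents, w, 0)
edges_locus1 = locus_graph_edges(parents, w, 1)
```

A parent-child edge appears in G_i only when the parent is heterozygous at locus i, so each locus generally gives a different subgraph of the pedigree graph.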
An Observation
For any cycle, or any path connecting two pre-determined vertices, in a locus graph, the sum of the h-variables along the path is a constant. We can use paths to denote constraints!
(Proof sketch) Assume the path in locus graph G_i connects two pre-determined vertices j_0 and j_k. Each edge of the path gives one equation modulo 2:
  p_{j_0}[i] + d_{j_0,j_1} + h_{j_0,j_1} = p_{j_1}[i]
  p_{j_1}[i] + d_{j_1,j_2} + h_{j_1,j_2} = p_{j_2}[i]
  p_{j_2}[i] + d_{j_2,j_3} + h_{j_2,j_3} = p_{j_3}[i]
  ...
  p_{j_{k-1}}[i] + d_{j_{k-1},j_k} + h_{j_{k-1},j_k} = p_{j_k}[i]
Summing these equations cancels the intermediate p-values, leaving
  h_{j_0,j_1} + h_{j_1,j_2} + ... + h_{j_{k-1},j_k} = p_{j_0}[i] + p_{j_k}[i] + d_{j_0,j_1} + ... + d_{j_{k-1},j_k},
which is a constant because p_{j_0}[i] and p_{j_k}[i] are pre-determined and the d-terms are constants.
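The path case of this observation can be sketched as follows. The data layout is an assumption for illustration: the d-constants are keyed by unordered vertex pairs, the endpoint values are given in a dictionary, and the function name is hypothetical. The d-values in the example are made up.

```python
def path_constraint(path, d, p_known):
    """Return (edges, constant) such that the sum of h over edges = constant (mod 2).

    path    : vertices j_0, ..., j_k along a path of one locus graph
    d       : known constant d for each edge, keyed by frozenset({a, b})
    p_known : pre-determined p-value at this locus for the two endpoints
    """
    edges = [frozenset(e) for e in zip(path, path[1:])]
    constant = (p_known[path[0]] + p_known[path[-1]] + sum(d[e] for e in edges)) % 2
    return edges, constant


# Toy path 6 - 8 - 9 with made-up d-values and endpoint values.
d = {frozenset({6, 8}): 1, frozenset({8, 9}): 0}
edges, constant = path_constraint([6, 8, 9], d, {6: 0, 9: 0})
```

The resulting constraint mentions only h-variables, which is what lets the h-variables be solved separately from the p-variables.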
Examples of Linear Constraints
(a) 1st locus graph (edges h_{6,8}, h_{8,9}): h_{6,8} + h_{8,9} = [constant illegible in source]
(b) 2nd locus graph (edges h_{2,5}, h_{3,5}, h_{3,6}, h_{2,6}): h_{3,5} + h_{3,6} + h_{2,5} + h_{2,6} = 0
(c) 3rd locus graph (edges h_{6,8}, h_{2,4}, h_{2,5}, h_{3,5}, h_{3,6}, h_{4,9}): h_{4,9} + h_{2,4} + h_{2,5} + h_{3,5} + h_{3,6} + h_{6,8} = 0
Linear Constraints The linear constraints are clearly necessary. We can also show that they are sufficient. Moreover, the number of constraints in each locus graph can be bounded by O(n), whereas the trivial analysis gives an upper bound of O(n^2). Total #constraints = O(mn).
The ZRHC-PHASE algorithm
Algorithm ZRHC_PHASE
input: a pedigree G = (V, E) and genotypes {g_j}
output: a general solution of {p_j}
begin
  Step 1. Preprocessing
  Step 2. Linear constraint generation on h-variables
  Step 3. Solve h-variables by Gaussian elimination
  Step 4. Solve the p-variables by propagation from pre-determined p-variables to others
end
Our method: solve h-variables and p-variables separately; O(mn) linear equations on O(n) h-variables.
Traditional method: solve h-variables and p-variables together; O(mn) equations on O(mn) unknowns (O(mn) p-variables and O(n) h-variables).
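Step 3 is ordinary Gaussian elimination modulo 2. The sketch below is a generic textbook version, not the authors' implementation: each equation is a hypothetical (bitmask, right-hand side) pair, where bit c of the mask is set when variable c appears, and free variables are set to 0 to produce one particular solution.

```python
def solve_gf2(rows, n_vars):
    """Solve a linear system modulo 2.

    rows   : list of (mask, rhs); bit c of mask is the coefficient of variable c
    n_vars : number of variables
    Returns one solution as a list of 0/1 values, or None if inconsistent.
    """
    pivots = {}  # pivot column -> (mask, rhs), kept in fully reduced form
    for mask, rhs in rows:
        # Eliminate every existing pivot column from the incoming row.
        for col, (pmask, prhs) in pivots.items():
            if (mask >> col) & 1:
                mask ^= pmask
                rhs ^= prhs
        if mask == 0:
            if rhs:
                return None  # 0 = 1: the system is inconsistent
            continue         # redundant equation, nothing new
        col = mask.bit_length() - 1
        # Clear the new pivot column from the rows stored so far.
        for c in list(pivots):
            pmask, prhs = pivots[c]
            if (pmask >> col) & 1:
                pivots[c] = (pmask ^ mask, prhs ^ rhs)
        pivots[col] = (mask, rhs)
    solution = [0] * n_vars  # free variables stay 0
    for col, (_, prhs) in pivots.items():
        solution[col] = prhs
    return solution


# h_0 + h_1 = 1, h_1 + h_2 = 0, h_0 + h_2 = 1 (the third equation is redundant).
rows = [(0b011, 1), (0b110, 0), (0b101, 1)]
solution = solve_gf2(rows, 3)
```

Every redundant equation still costs a full reduction pass here, which is the overhead that removing redundant equations beforehand avoids.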
Redundant Equation Elimination
An observation: given a cycle on vertices j_0, j_1, ..., j_k, assume that there are constraints among each pair of vertices. Originally, there are O(k^2) constraints. Notice that they are not independent. However, we can replace the original constraints by an equivalent set of constraints with size O(k).
[Figure: the cycle j_0, j_1, j_2, ..., j_{k-2}, j_{k-1}, j_k, with the constraints j_0 ~ j_2, j_2 ~ j_{k-1} and j_0 ~ j_{k-1} highlighted.]
Remove the redundant equations without solving them! (Key lemma)
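The counting claim can be checked numerically on a small cycle. This sketch makes several assumptions for illustration: the cycle has k vertices numbered 0 to k-1, edge e joins vertex e and vertex e+1, and the constraint for a pair a < b covers the edges a, ..., b-1. It uses elimination only to measure the rank, whereas the lemma on the slide removes redundant equations without solving anything.

```python
def gf2_rank(masks):
    """Rank modulo 2 of a list of bitmask vectors (standard XOR basis)."""
    basis = {}  # highest set bit -> basis vector
    for v in masks:
        while v:
            hb = v.bit_length() - 1
            if hb not in basis:
                basis[hb] = v
                break
            v ^= basis[hb]
    return len(basis)


def arc_mask(a, b):
    """Bitmask of the edges a, a+1, ..., b-1 between vertices a < b."""
    return ((1 << b) - 1) ^ ((1 << a) - 1)


k = 6
all_pairs = [arc_mask(a, b) for a in range(k) for b in range(a + 1, k)]
rank = gf2_rank(all_pairs)
```

For k = 6 there are 15 pairwise constraints but their left-hand sides span only 5 dimensions, so a linear number of them already carries all the information.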
Redundant Equation Elimination Given a spanning tree, the stretch of an edge (k, j) is defined as the length of the unique path between k and j on the tree. Elkin, Emek, Spielman and Teng show that any graph has a spanning tree with average stretch O(log^2 n log log n). The number of non-redundant constraints can be bounded by the sum of cycle lengths, which is in turn bounded by the sum of stretches, O(n log^2 n log log n).
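The definition of stretch can be illustrated with a short computation. Note the substitution: this sketch uses a plain breadth-first spanning tree, which does not carry the low-stretch guarantee of the Elkin, Emek, Spielman and Teng construction; it only shows how stretch is measured. The adjacency-list layout and function names are assumptions.

```python
from collections import deque


def bfs_tree(adj, root):
    """Return parent and depth maps of a breadth-first spanning tree."""
    parent = {root: None}
    depth = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                depth[v] = depth[u] + 1
                queue.append(v)
    return parent, depth


def stretch(u, v, parent, depth):
    """Length of the unique tree path between u and v."""
    length = 0
    while u != v:
        if depth[u] < depth[v]:
            u, v = v, u      # always step up from the deeper endpoint
        u = parent[u]
        length += 1
    return length


# A cycle on 6 vertices: five edges end up in the tree, one does not.
adj = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
parent, depth = bfs_tree(adj, 0)
graph_edges = [(i, (i + 1) % 6) for i in range(6)]
total_stretch = sum(stretch(u, v, parent, depth) for u, v in graph_edges)
```

Each tree edge has stretch 1, and the single non-tree edge closes the whole cycle, so its stretch is 5.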
Conclusion We present an efficient algorithm for ZRHC with running time O(mn^2 + n^3 log^2 n log log n). It remains open whether the time complexity for ZRHC on general pedigrees can be improved to O(mn^2 + n^3) or lower. Another open question is how to use the algorithm to find haplotype configurations on pedigrees that require only a small (constant) number of recombinants.
Thanks for your time and attention!