Algorithms for locating extremely conserved elements in multiple sequence alignments
  Huei-Hun E Tseng and Martin Tompa
  BMC Bioinformatics 2009
  Presenter: Seyed Ali Rokni
Introduction
  Hundreds of long genomic sequences are extraordinarily conserved across human, mouse, and rat
  Ultraconserved element:
    At least 200 consecutive alignment columns
    100% perfectly conserved in human, mouse, and rat
  481 such elements across the human genome
  Some fractions are also well conserved in dog and in chicken
  Human-mouse-dog, human-chicken
  Percent of perfectly conserved columns
  Phylogeny
  Phylogenetic hidden Markov model
Phylogenetic tree
Multiple Sequence Alignment
Dynamic Programming
  For three sequences, cell (i,j,k) is computed from its seven predecessor cells:
  (i-1,j-1,k-1), (i,j-1,k-1), (i-1,j,k-1), (i-1,j-1,k), (i,j,k-1), (i,j-1,k), (i-1,j,k)
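The cell dependencies above can be illustrated with a small three-sequence alignment scorer. This is only a sketch: the sum-of-pairs scoring scheme (`pair_score`, `column_score`) and the function name `align3_score` are assumptions chosen for illustration and do not come from the slides.

```python
from itertools import product

# Hypothetical scoring scheme (an assumption, not from the slides):
# match = +1, mismatch = -1, residue against gap = -1, gap against gap = 0.
def pair_score(a, b):
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1

def column_score(x, y, z):
    # Sum-of-pairs score of one alignment column.
    return pair_score(x, y) + pair_score(x, z) + pair_score(y, z)

# The seven moves into cell (i, j, k): every nonzero 0/1 offset vector.
MOVES = [d for d in product((0, 1), repeat=3) if any(d)]

def align3_score(a, b, c):
    """Optimal sum-of-pairs score of a global alignment of three strings."""
    la, lb, lc = len(a), len(b), len(c)
    neg_inf = float("-inf")
    dp = [[[neg_inf] * (lc + 1) for _ in range(lb + 1)] for _ in range(la + 1)]
    dp[0][0][0] = 0
    # Lexicographic order guarantees every predecessor is filled first.
    for i, j, k in product(range(la + 1), range(lb + 1), range(lc + 1)):
        for di, dj, dk in MOVES:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            # A sequence that does not advance contributes a gap.
            x = a[pi] if di else "-"
            y = b[pj] if dj else "-"
            z = c[pk] if dk else "-"
            cand = dp[pi][pj][pk] + column_score(x, y, z)
            if cand > dp[i][j][k]:
                dp[i][j][k] = cand
    return dp[la][lb][lc]
```

The table has (|a|+1)(|b|+1)(|c|+1) cells, and for k sequences the number of moves per cell grows to 2^k - 1, which is why this direct approach does not extend to many species.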
Generalization of Ultraconserved Elements
  Limited to 2 or 3 species
  Current 44-vertebrate whole-genome alignment (UCSC Genome Browser)
  Goal: find long regions of this alignment that are extraordinarily well conserved across all or most of the 44 species
  Example:
    Min length: at least 100 consecutive alignment columns
    Min size of subset: for some subset S of at least 40 of the 44 species
    Min percentage: at least 80% of the columns are perfectly conserved
  Approximately 250 GB
Problem Formal Definition
  Inputs: an m × n alignment matrix M with entries from {A, C, G, T, -}, an integer s ≤ m, an integer t ≤ n, and a real number 0 < c ≤ 1.
  Problem: Determine whether M has
    a subset S of rows, |S| ≥ s,
    a subset T of consecutive columns (ignoring columns that contain the gap character “-” in every row of S), |T| ≥ t, and
    a subset U of T, |U| ≥ c|T|,
  such that in M restricted to S × U, every column is perfectly conserved.
  Example: m = 44, n ≈ 3.8 × 10^9, s = 40, t = 100, and c = 0.8
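To make the definition concrete, here is a minimal sketch of a checker for one candidate solution. The function names and the half-open column range `[start, end)` are assumptions for illustration. It also assumes that a "perfectly conserved" column is one in which all rows of S carry the same non-gap character. It only verifies a given choice of S and column range; it does not search for one.

```python
def is_conserved(column):
    # Assumed reading: every row shows the same nucleotide (no gaps).
    return column[0] != "-" and all(ch == column[0] for ch in column)

def check_candidate(M, rows, start, end, s, t, c):
    """Check whether rows S and columns [start, end) witness a yes-answer.

    M    : list of equal-length strings over {A, C, G, T, -}
    rows : indices of the candidate subset S
    """
    cols = []
    for j in range(start, end):
        col = [M[r][j] for r in rows]
        if all(ch == "-" for ch in col):
            continue  # all-gap columns within S are ignored, so not part of T
        cols.append(col)
    conserved = sum(is_conserved(col) for col in cols)  # size of the best U
    return len(rows) >= s and len(cols) >= t and conserved >= c * len(cols)
```

For example, with `M = ["ACGT-A", "ACGT-A", "ACCT-A", "TTTTTT"]` and `rows = [0, 1, 2]`, the range `[0, 6)` yields five counted columns, four of which are perfectly conserved.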
Theorem 1
  If s = m, the Extremely Conserved Element problem can be solved in time O(mn).
  Proof:
  Assume no column contains the gap character “-” in every row (such columns are ignored by the problem definition).
  For 1 ≤ i ≤ n, let q_i = 1 if column i is perfectly conserved, and q_i = 0 otherwise.
  The result then follows from Theorem 2.
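The reduction in this proof can be sketched as follows. `conservation_indicator` builds the sequence q in O(mn) time after dropping all-gap columns. `longest_window_naive` is a deliberately simple quadratic baseline for finding the longest window whose fraction of ones is at least c; it is not the linear-time method of Theorem 2 and is included only so the example is complete. Both function names are hypothetical.

```python
def conservation_indicator(M):
    """Return q, with q[i] = 1 iff the i-th non-all-gap column is conserved."""
    m, n = len(M), len(M[0])
    q = []
    for j in range(n):
        col = [M[r][j] for r in range(m)]
        if all(ch == "-" for ch in col):
            continue  # column is a gap in every row: ignored
        q.append(1 if all(ch == col[0] for ch in col) else 0)
    return q

def longest_window_naive(q, c):
    """Length of the longest window of q whose fraction of ones is >= c.

    Quadratic baseline using prefix sums, for illustration only.
    """
    prefix = [0]
    for bit in q:
        prefix.append(prefix[-1] + bit)
    best = 0
    for i in range(len(q)):
        for j in range(i + 1, len(q) + 1):
            if prefix[j] - prefix[i] >= c * (j - i):
                best = max(best, j - i)
    return best
```

With s = m, the answer to the decision problem is yes exactly when `longest_window_naive(conservation_indicator(M), c) >= t`.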
Proof
Proof (Cont.)
  X and Y: non-increasing sequences
  Merge X and Y
  When Y_j and X_i are adjacent in the merge: maximal interval q_{i+1} ... q_j
  During merging, the maximum interval can be found
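A linear-time procedure in the spirit of this merge can be sketched with prefix sums. This is a reconstruction using a standard two-pointer technique, not necessarily the exact construction on the slides, and it assumes Theorem 2 asks for the longest interval of q whose fraction of ones is at least c. Define P_0 = 0 and P_i = P_{i-1} + q_i - c; the interval q_{i+1} ... q_j qualifies exactly when P_j ≥ P_i. The name `longest_dense_interval` is hypothetical, and c should be passed as a `Fraction` (or a string such as `"0.8"`) so that comparisons are exact.

```python
from fractions import Fraction

def longest_dense_interval(q, c):
    """Return (length, start) of the longest q[start:start+length] with
    fraction of ones >= c, in O(n) time. Returns (0, 0) if none exists."""
    c = Fraction(c)
    n = len(q)
    P = [Fraction(0)]
    for bit in q:
        P.append(P[-1] + bit - c)

    # X: candidate left ends, i.e. strict prefix minima of P (left to right).
    X = [0]
    for i in range(1, n + 1):
        if P[i] < P[X[-1]]:
            X.append(i)

    # Y: candidate right ends, i.e. strict suffix maxima of P (right to left).
    # Indices in Y decrease while their P values increase.
    Y = [n]
    for j in range(n - 1, -1, -1):
        if P[j] > P[Y[-1]]:
            Y.append(j)

    # Merge: as i advances through X, P[i] decreases, so the pointer k into Y
    # only ever moves toward larger right ends.
    best_len, best_start = 0, 0
    k = len(Y) - 1
    for i in X:
        while k > 0 and P[Y[k - 1]] >= P[i]:
            k -= 1
        j = Y[k]
        if j - i > best_len:
            best_len, best_start = j - i, i
    return best_len, best_start
```

Each index enters X and Y at most once and the pointer k never moves backward, so the total work is linear in n.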
Dual of Theorem 2
  The dual of Theorem 2, maximizing c subject to a lower bound on j - i, can also be solved in time O(n)
  For s = m:
    the maximum value of c can be determined in time O(mn)
    the maximum value of t can be determined in time O(mn)
Theorem 3
  If c = 1, the Extremely Conserved Element problem can be solved in polynomial time. In fact, the maximum value of s can be determined in this time.
  Proof:
  For every choice T of at least t consecutive columns, sort the rows of M restricted to T lexicographically
  Find s identical rows with at least t nongap characters
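The proof idea can be sketched as a brute-force search over all column windows. One substitution to note: instead of sorting the rows lexicographically, this sketch groups identical row segments with a dictionary, which finds the same groups. The function name `max_rows_fully_conserved` is hypothetical, and the running time is polynomial (roughly O(n^3 m) here) but far from practical on a whole-genome alignment.

```python
from collections import defaultdict

def max_rows_fully_conserved(M, t):
    """Largest s such that some window has s identical rows containing
    at least t non-gap characters (the c = 1 case). Returns 0 if none."""
    n = len(M[0])
    best = 0
    for start in range(n):
        for end in range(start + 1, n + 1):
            # Group rows by their content within the window [start, end).
            groups = defaultdict(int)
            for row in M:
                groups[row[start:end]] += 1
            for segment, count in groups.items():
                # Columns that are gaps in every row of the group are ignored,
                # so only the non-gap characters count toward t.
                if len(segment) - segment.count("-") >= t:
                    best = max(best, count)
    return best
```

The decision problem with c = 1 then has a yes-answer exactly when the returned value is at least s.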
Theorem 4
  The general Extremely Conserved Element problem is NP-hard
  Idea (reduction from A to B):
    Want a solution of A, knowing how to solve B
    A → B: transform the instance of A into an instance of B, then solve B
    Knowing A is NP-hard, B is NP-hard
Results
  177 EC(40, 100, 0.8) elements
  Partially coding: overlaps a human coding exon
  The longest element is 355 columns long, with 80% of its columns perfectly conserved across 41 of the species, missing only gorilla, shrew, and lamprey
  Lamprey: missing from 170
  Gorilla: missing from 41
  Cat: missing from 35
  Zebrafish: missing from 30
  Fugu: missing from only 4
  Zebra finch: missing from only 3
  Chicken: missing from only 2
  Lizard: missing from only 1
Questions