Using PQ Trees For Comparative Genomics - CPM
Gad M. Landau – Univ. of Haifa
Laxmi Parida – IBM T.J. Watson
Oren Weimann – Univ. of Haifa
Gene Clusters
Genes that appear together consistently across genomes are believed to be functionally related; however, their order does not have to be the same in every genome.
[Figure: the same gene cluster in Genome 1 through Genome 5]
What is a Pattern? [WABI04]
Given a string S = “s1 s2 s3 … sn” and an integer K, a set of characters P = {p1, p2, p3, …, pm} is a K-pattern if P occurs (possibly permuted) in at least K places in S.
Example: S = a b c d b a c d a b a c b, P = {a,b,c}, K = 4. P is a 4-pattern with location-list = {1,5,10,11}.
For the moment we will assume that every character appears once in the pattern.
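The definition can be checked on the example with a minimal sketch. The function name `pattern_locations` is hypothetical, and the sliding-window comparison is a naive illustration of the definition, not the algorithm from the talk. Locations are 1-based, as on the slide.

```python
from collections import Counter


def pattern_locations(s, pattern):
    """Return the 1-based start positions where `pattern` occurs in `s`
    as a contiguous block, in any order (naive window comparison)."""
    m = len(pattern)
    target = Counter(pattern)
    return [
        i + 1
        for i in range(len(s) - m + 1)
        if Counter(s[i:i + m]) == target
    ]


locations = pattern_locations("abcdbacdabacb", "abc")
print(locations)            # [1, 5, 10, 11]
print(len(locations) >= 4)  # True: {a,b,c} is a 4-pattern in S
```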
Maximal Patterns
Maximal notation: a representation of a maximal pattern p that illustrates all the patterns that are non-maximal with respect to p.
Our goal: find all patterns p and their maximal notation.
Our solution: a linear-time algorithm based on PQ trees.
Example: S = a b c d e b a d c e. {a,b} is non-maximal with respect to {a,b,c,d,e}. The maximal notation of {a,b,c,d,e} is ((a,b)-(c,d)-e).
PQ trees: Definitions
PQ trees [Booth, Lueker, 1976] have character-labeled leaves and two kinds of internal nodes.
P-nodes: represent “truly permuted” components; their children may be permuted arbitrarily.
Q-nodes: represent bi-connected components; the order of their children may only be reversed.
[Figure: example PQ tree over the leaves A to K]
PQ trees: Definitions
Equivalent PQ trees (≡): T ≡ T' when T' can be obtained from T by permuting the children of P-nodes and reversing the children of Q-nodes.
[Figure: two equivalent PQ trees over the leaves A to K]
PQ trees: Definitions
FRONTIER(T): the leaves of T read from left to right.
C(T): the set of frontiers of all trees equivalent to T.
Example: FRONTIER(T) = “A B C D E F G H I J K”, and for an equivalent tree T', FRONTIER(T') = “A B C G H I J K E F D”.
Theorem: If C(T1) = C(T2) then T1 ≡ T2.
Our Use of the PQ tree
Suppose the pattern {a,b,c,d} appears in 4 locations as Π = { abcd, acbd, dbca, dcba }.
Our goal: a tree T with C(T) = { abcd, acbd, dbca, dcba }.
Write the P-nodes as ‘,’ and the Q-nodes as ‘-’ and get (a-(b,c)-d), which is exactly the maximal notation of the pattern {a,b,c,d}.
[Figure: PQ tree with a Q-node root over a, a P-node with children b and c, and d]
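To make C(T) concrete, here is a small sketch that enumerates every frontier of a PQ tree by brute force. The nested-tuple encoding (`("P", [...])`, `("Q", [...])`, leaves as strings) and the name `frontiers` are assumptions made for this illustration only; the enumeration is exponential and is meant just to confirm the example.

```python
from itertools import permutations, product


def frontiers(node):
    """Return C(T): all leaf orders reachable by permuting the children
    of P-nodes and reversing the children of Q-nodes."""
    if isinstance(node, str):          # a leaf
        return {node}
    kind, children = node
    if kind == "P":                    # any order of the children
        orders = permutations(children)
    else:                              # "Q": given order or its reverse
        orders = [tuple(children), tuple(reversed(children))]
    result = set()
    for order in orders:
        for parts in product(*(frontiers(child) for child in order)):
            result.add("".join(parts))
    return result


# (a-(b,c)-d): a Q-node whose middle child is a P-node over b and c.
tree = ("Q", ["a", ("P", ["b", "c"]), "d"])
print(sorted(frontiers(tree)))  # ['abcd', 'acbd', 'dbca', 'dcba']
```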
The minimal Consensus PQ tree
It is not always possible to find a tree T where Π = C(T). Consider a pattern {a,b,c,d} that appears as Π = { abcd, bdac }: here { abcd, bdac } ⊂ C(T).
[Figure: PQ tree over the leaves a, b, c, d]
Given k permutations Π = { π1, π2, …, πk }, a consensus PQ tree T of Π is such that Π ⊆ C(T), and the consensus is minimal when there exists no other T' such that Π ⊆ C(T') and |C(T')| < |C(T)|.
The problem of obtaining a maximal notation for a pattern is the same as obtaining a minimal consensus PQ tree of all its k occurrences.
Theorem: The minimal consensus PQ tree T is unique.
The original use of the PQ Tree
The consecutive 1’s problem, with the restriction sets F = { {a,b,c}, {b,c}, {b,c,d}, {b} }.
The solution [Booth, Lueker, 1976]: Reduce(F) = T.
[Figure: the restriction matrix over the columns a, b, c, d and the resulting PQ tree]
The result will be C(T); in our case C(T) = { abcd, acbd, dbca, dcba }, and the tree is constructed in O(n^2) time for an n x n matrix.
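The set C(T) for this example can be confirmed without building a PQ tree at all. The sketch below is a brute-force check over all orderings, not the Reduce procedure of Booth and Lueker; the names `is_consecutive` and `admissible_orders` are hypothetical.

```python
from itertools import permutations


def is_consecutive(order, subset):
    """True if the elements of `subset` occupy adjacent positions in `order`."""
    positions = sorted(order.index(x) for x in subset)
    return positions[-1] - positions[0] == len(positions) - 1


def admissible_orders(universe, restrictions):
    """All orderings of `universe` in which every restriction set is consecutive."""
    return {
        "".join(order)
        for order in permutations(universe)
        if all(is_consecutive(order, r) for r in restrictions)
    }


F = [{"a", "b", "c"}, {"b", "c"}, {"b", "c", "d"}, {"b"}]
print(sorted(admissible_orders("abcd", F)))  # ['abcd', 'acbd', 'dbca', 'dcba']
```

This check takes factorial time in the number of characters, which is the cost that Reduce avoids.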
Obtaining the Minimal Consensus PQ tree
Some definitions [Heber, Stoye, 2001]:
Common interval: an interval that appears as a consecutive sequence in all the appearances, e.g. [4-8] in the example.
[Figure: the example permutations π1, π2, π3]
We denote C_Π = all common intervals = { [1-2],[2-3],[1-3],[1-9],[1-8],[4-5],[4-6],[4-7],[4-8],[5-6] }.
A list p of common intervals is a chain if every two successive intervals in p have a non-trivial overlap, for example p = ([1-2],[2-3]).
A common interval is called reducible if there is a chain that generates it; otherwise it is called irreducible. [1-3] is a reducible interval since it can be generated by the irreducible intervals [1-2],[2-3].
We denote I_Π = all irreducible intervals of C_Π = { [1-2],[1-8],[2-3],[4-5],[4-7],[4-8],[5-6] }.
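The two definitions can be illustrated with a brute-force sketch on a tiny example. The two permutations below are hypothetical and are not the ones from the slide's figure. The code assumes the first permutation is the identity 1..n, and it treats an interval as reducible when it is the union of two shorter common intervals that overlap without one containing the other. This is not the Heber and Stoye algorithm, which is far more efficient.

```python
def common_intervals(perms):
    """Intervals [lo, hi] of values (length >= 2) whose elements are
    contiguous in every permutation; perms[0] is assumed to be 1..n."""
    n = len(perms[0])
    positions = [{value: i for i, value in enumerate(p)} for p in perms]
    found = []
    for lo in range(1, n + 1):
        for hi in range(lo + 1, n + 1):
            def contiguous(pos):
                idx = [pos[v] for v in range(lo, hi + 1)]
                return max(idx) - min(idx) == hi - lo
            if all(contiguous(pos) for pos in positions):
                found.append((lo, hi))
    return found


def irreducible(intervals):
    """Keep the intervals that are not the union of two overlapping
    shorter intervals [lo, x] and [y, hi] from the same collection."""
    known = set(intervals)

    def reducible(lo, hi):
        return any(
            (lo, x) in known and (y, hi) in known
            for x in range(lo + 1, hi)
            for y in range(lo + 1, x + 1)
        )

    return [c for c in intervals if not reducible(*c)]


perms = [(1, 2, 3, 4), (3, 2, 1, 4)]   # hypothetical example
C = common_intervals(perms)
print(C)               # [(1, 2), (1, 3), (1, 4), (2, 3)]
print(irreducible(C))  # [(1, 2), (1, 4), (2, 3)]
```

As in the slide's example, [1-3] drops out because the chain ([1-2],[2-3]) generates it.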
Obtaining the Minimal Consensus PQ tree
Theorem: Reduce(C_Π) = Reduce(I_Π) = minimal consensus tree.
The Algorithm:
1. Compute I_Π = { [1-2],[1-8],[2-3],[4-5],[4-7],[4-8],[5-6] }.
2. Compute Reduce(I_Π) to get the minimal consensus tree of Π.
The pattern notation is: ((1-2-3)-(((4-5-6),7),8)-9)
Time Complexity: for a pattern of size n that appears in k places, it takes a total of O(kn + n^2) time to compute the maximal notation.
Improving the Time Complexity to O(kn)
In Heber & Stoye’s algorithm for obtaining I_Π, a data structure S is maintained to hold the chains of the irreducible intervals: I_Π = { [1-2],[1-8],[2-3],[4-5],[4-7],[4-8],[5-6] }.
REPLACE(S):
1. Replace every chain by a Q-node.
2. Replace every element that is not a leaf or a Q-node and is pointed to by a vertical link with a P-node.
[Figure: the data structure S for the example permutations π1, π2, π3]
Maximal Patterns and Sub-Trees
A sub-tree of the PQ tree T is obtained by picking a P-node in T with all its descendants, or by picking a Q-node in T with any number of consecutive children and their descendants.
Suppose the pattern {a,b,c,d} appears in 4 locations as Π = { abcd, acbd, dbca, dcba }.
[Figure: the PQ tree (a-(b,c)-d)]
Theorem: If p1 and p2 are patterns, and p1 is non-maximal with respect to p2, then the PQ tree T1 that represents p1 is a sub-tree of the PQ tree T2 that represents p2.
So what did we achieve?
The first algorithm, optimal in time, that generates the maximal notation of a pattern.
A “bottom-up” construction of a PQ tree.
A visualization of the inner structure of a pattern.
Filtering of meaningful clusters from apparently meaningless (non-maximal) ones.
Experimental results that show this tool can aid in predicting gene functions.
Clustering for the various genome models.
Using Our Tool for Various Genome Models
Genome model I (orthologs only): a sequence is a permutation of the set {1,2,…,n}.
There is only one maximal pattern, {1,2,…,n}.
In O(kn) time we get a PQ tree that describes all patterns of all sizes and their non-maximal relations.
Using Our Tool for Various Genome Models
Genome model II: a gene may appear once in a sequence or not appear at all in that sequence.
We can extend the algorithm to work on sequences that are not permutations of the same set.
Example: given 2 sequences, add characters as needed, then build the PQ tree on the new sequences. The sub-trees that have no red leaves are all the maximal patterns.
[Figure: the two example sequences, the sequences with added characters, and the resulting PQ tree]
Using Our Tool for Various Genome Models
Genome model III (paralogs and orthologs): a gene may appear any number of times in a sequence (including zero).
The minimal consensus PQ tree is not necessarily unique.
Solution:
Example: consider 2 appearances of the pattern {a,a,b} as Π = { aab, baa }:
1. Π = { a1a2b, ba2a1 } gives C(T) = { a1a2b, ba2a1 }
2. Π = { a1a2b, ba1a2 } gives C(T) = { a1a2b, ba2a1, a2a1b, ba1a2 }
[Figure: the two PQ trees over the leaves a1, a2, b]
It