1
Using PQ Trees For Comparative Genomics - CPM 2005
Gad M. Landau – Univ. of Haifa
Laxmi Parida – IBM T.J. Watson
Oren Weimann – Univ. of Haifa
2
Gene Clusters
Genes that appear together consistently across genomes are believed to be functionally related; however, the ordering doesn't have to be the same.
[Figure: Genome 1, Genome 2, Genome 3, Genome 4, Genome 5]
3
What is a Pattern? [WABI04]
Given a string S = s1 s2 s3 … sn and an integer K, a set P = {p1, p2, p3, …, pm} is a K-pattern if P occurs (possibly permuted) in at least K places in S.
Example: S = a b c d b a c d a b a c b, P = {a,b,c}, K = 4
P is a 4-pattern with location-list = {1,5,10,11}.
For the moment we will assume that every character appears once in the pattern.
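The location list in the example can be checked with a short sliding-window sketch. The function name `location_list` is hypothetical, and this is a naive illustration of the definition, not the algorithm from the talk.

```python
from collections import Counter

def location_list(s, pattern):
    """Return the 1-based positions where some permutation of `pattern` occurs in `s`."""
    m = len(pattern)
    target = Counter(pattern)
    hits = []
    for i in range(len(s) - m + 1):
        # A window matches if it holds exactly the pattern's characters, in any order.
        if Counter(s[i:i + m]) == target:
            hits.append(i + 1)
    return hits

S = "abcdbacdabacb"
locations = location_list(S, "abc")
print(locations)
```

With K = 4, the set {a,b,c} qualifies as a pattern whenever the returned list has at least 4 entries.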
4
Maximal Patterns
S = a b c d e b a d c e
{a,b} is non-maximal with respect to {a,b,c,d,e}.
Maximal notation: a representation of a maximal pattern p that illustrates all the patterns that are non-maximal with respect to p.
The maximal notation of {a,b,c,d,e} is ((a,b)-(c,d)-e).
Our goal: find all patterns p and their maximal notation.
Our solution: a linear-time algorithm based on PQ trees.
5
PQ trees: Definitions
PQ trees [Booth, Lueker, 1976]
Character-labeled leaves.
P-nodes: represent "truly permuted" components; their children may be permuted arbitrarily.
Q-nodes: represent bi-connected components; their children may only be reversed.
[Figure: example PQ tree over the leaves A to K]
6
PQ trees: Definitions
Equivalent PQ trees (≡).
[Figure: two equivalent PQ trees over the leaves A to K]
7
PQ trees: Definitions
FRONTIER: the leaf labels of a tree read from left to right.
[Figure: PQ tree T over the leaves A to K]
FRONTIER(T) = "A B C D E F G H I J K"
For an equivalent tree T': FRONTIER(T') = "A B C G H I J K E F D"
C(T) = the set of frontiers of all trees equivalent to T.
Theorem: If C(T1) = C(T2) then T1 ≡ T2.
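A minimal sketch of these definitions, assuming a PQ tree is encoded as nested tuples `(kind, children)` with `kind` equal to "P" or "Q" and with strings as leaves. The encoding and the small tree below are illustrative assumptions; the tree is not the one drawn on the slide. The count of |C(T)| assumes all leaves are distinct and that every internal node has at least two children.

```python
from math import factorial

def frontier(node):
    """Leaf labels of the tree, read left to right."""
    if isinstance(node, str):
        return [node]
    _kind, children = node
    out = []
    for child in children:
        out.extend(frontier(child))
    return out

def count_frontiers(node):
    """Size of C(T): k! orders for a P-node with k children, 2 for a Q-node."""
    if isinstance(node, str):
        return 1
    kind, children = node
    total = factorial(len(children)) if kind == "P" else 2
    for child in children:
        total *= count_frontiers(child)
    return total

# Hypothetical tree: a Q-node root over A, a P-node {B, C, D}, and E.
tree = ("Q", ["A", ("P", ["B", "C", "D"]), "E"])
print(" ".join(frontier(tree)))
print(count_frontiers(tree))
```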
8
Our Use of the PQ tree
Suppose the pattern {a,b,c,d} appears in 4 locations as Π = { abcd, acbd, dbca, dcba }.
Our goal: a tree T with C(T) = Π = { abcd, acbd, dbca, dcba }.
Write the P-nodes as ',' and the Q-nodes as '-' and get (a-(b,c)-d), which is exactly the maximal notation of the pattern {a,b,c,d}.
[Figure: the PQ tree over the leaves a, b, c, d]
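The notation rule can be sketched as a short recursive printer. The nested-tuple encoding `(kind, children)` and the name `notation` are assumptions made for this illustration.

```python
def notation(node):
    """Write P-nodes with ',' and Q-nodes with '-' between their children."""
    if isinstance(node, str):
        return node
    kind, children = node
    sep = "," if kind == "P" else "-"
    return "(" + sep.join(notation(child) for child in children) + ")"

# Q-node root over a, a P-node {b, c}, and d.
tree = ("Q", ["a", ("P", ["b", "c"]), "d"])
print(notation(tree))
```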
9
The minimal Consensus PQ tree
It is not always possible to find a tree T where Π = C(T). Consider a pattern {a,b,c,d} that appears as Π = { abcd, bdac }: here { abcd, bdac } ⊂ C(T).
[Figure: PQ tree over the leaves b, c, a, d]
Given permutations Π = { π1, π2, …, πk }, a consensus PQ tree T of Π is one such that Π ⊆ C(T), and the consensus is minimal when there exists no other T' such that Π ⊆ C(T') and |C(T')| < |C(T)|.
The problem of obtaining a maximal notation for a pattern is the same as obtaining a minimal consensus PQ tree of all the k occurrences.
Theorem: The minimal consensus PQ tree T is unique.
10
The original use of the PQ Tree
The consecutive 1's problem:
a 1 0 0 0
b 1 1 1 1
c 1 1 1 0
d 0 0 1 0
The restriction sets (one per column): F = { {a,b,c}, {b,c}, {b,c,d}, {b} }
The solution [Booth, Lueker, 1976]: Reduce(F) =
[Figure: the PQ tree over the leaves a, b, c, d]
The result will be C(T), in our case C(T) = { abcd, acbd, dbca, dcba }, and the tree was constructed in O(n²) time (for an n x n matrix).
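The set C(T) for this example can be confirmed by brute force: list every ordering of the rows and keep those in which each restriction set is consecutive. This is only a check of the result and runs in exponential time; it is not the Reduce procedure of Booth and Lueker. The function names are hypothetical.

```python
from itertools import permutations

def consecutive(order, subset):
    """True if the members of `subset` occupy adjacent positions in `order`."""
    positions = sorted(order.index(x) for x in subset)
    return positions[-1] - positions[0] + 1 == len(positions)

def valid_orders(elements, restrictions):
    """All orderings of `elements` that keep every restriction set consecutive."""
    return {
        "".join(p)
        for p in permutations(elements)
        if all(consecutive(p, r) for r in restrictions)
    }

F = [{"a", "b", "c"}, {"b", "c"}, {"b", "c", "d"}, {"b"}]
orders = valid_orders("abcd", F)
print(sorted(orders))
```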
11
Obtaining the Minimal Consensus PQ tree
Some definitions [Heber, Stoye, 2001]:
Example permutations:
π1 = 1 2 3 4 5 6 7 8 9
π2 = 9 8 4 5 6 7 1 2 3
π3 = 1 2 3 8 7 4 5 6 9
Common interval: an interval that appears as a consecutive sequence in all the permutations, e.g. [4-8] in the example.
We denote C_Π = all common intervals = { [1-2],[2-3],[1-3],[1-9],[1-8],[4-5],[4-6],[4-7],[4-8],[4-9],[5-6] }
A list p of common intervals is a chain if every two successive intervals in p have a non-trivial overlap. For example p = ([1-2],[2-3]).
A common interval is called reducible if there is a chain that generates it; otherwise it is called irreducible. [1-3] is a reducible interval since it can be generated by the irreducible intervals [1-2],[2-3].
We denote I_Π = all irreducible intervals of C_Π = { [1-2],[1-8],[2-3],[4-5],[4-7],[4-8],[4-9],[5-6] }
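The definitions can be tried out on the three example permutations with a brute-force sketch. It assumes the first permutation is the identity 1..n and is not Heber and Stoye's O(kn) algorithm. The reducibility test is a simplification: it relies on the fact that the union of two overlapping common intervals is again a common interval, so an interval is generated by a chain exactly when two smaller common intervals, one sharing its left end and one sharing its right end, intersect. Function names are hypothetical.

```python
def common_intervals(perms):
    """Intervals [lo, hi] of values (size >= 2) that are contiguous in every permutation."""
    n = len(perms[0])
    pos = [{v: i for i, v in enumerate(p)} for p in perms]
    result = []
    for lo in range(1, n + 1):
        for hi in range(lo + 1, n + 1):
            contiguous = True
            for where in pos:
                idx = [where[v] for v in range(lo, hi + 1)]
                if max(idx) - min(idx) != hi - lo:
                    contiguous = False
                    break
            if contiguous:
                result.append((lo, hi))
    return result

def irreducible_intervals(intervals):
    """Drop every interval that two smaller, overlapping common intervals cover."""
    known = set(intervals)
    out = []
    for lo, hi in intervals:
        reducible = any(
            (lo, x) in known and (y, hi) in known
            for x in range(lo + 1, hi)
            for y in range(lo + 1, x + 1)
        )
        if not reducible:
            out.append((lo, hi))
    return out

PI = [
    [1, 2, 3, 4, 5, 6, 7, 8, 9],
    [9, 8, 4, 5, 6, 7, 1, 2, 3],
    [1, 2, 3, 8, 7, 4, 5, 6, 9],
]
common = common_intervals(PI)
irreducible = irreducible_intervals(common)
print(common)
print(irreducible)
```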
12
Obtaining the Minimal Consensus PQ tree
Theorem: Reduce(C_Π) = Reduce(I_Π) = minimal consensus tree.
The Algorithm:
1. Compute I_Π = { [1-2],[1-8],[2-3],[4-5],[4-7],[4-8],[4-9],[5-6] }.
2. Compute Reduce(I_Π) to get the minimal consensus tree of Π.
The pattern notation is: ((1-2-3)-(((4-5-6),7),8)-9)
Time complexity: for a pattern of size n that appears in k places it takes a total of O(kn + n²) time to compute the maximal notation.
[Figure: the permutations π1 = 1 2 3 4 5 6 7 8 9, π2 = 9 8 4 5 6 7 1 2 3, π3 = 1 2 3 8 7 4 5 6 9 and the resulting PQ tree]
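As a sanity check on the pattern notation ((1-2-3)-(((4-5-6),7),8)-9), the sketch below tests whether each of the three example permutations is a frontier of some tree equivalent to the one the notation describes. The nested-tuple encoding and the name `in_consensus` are assumptions for illustration. The check confirms that the tree is a consensus tree for the example, but does not by itself show that it is minimal.

```python
def in_consensus(tree, perm):
    """True if `perm` is the frontier of some tree equivalent to `tree`."""
    pos = {v: i for i, v in enumerate(perm)}

    def span(node):
        # Returns (first, last) position covered by the subtree, or None if invalid.
        if not isinstance(node, tuple):
            return (pos[node], pos[node])
        kind, children = node
        spans = []
        for child in children:
            s = span(child)
            if s is None:
                return None
            spans.append(s)
        lo = min(s[0] for s in spans)
        hi = max(s[1] for s in spans)
        # The children must tile one contiguous block of the permutation.
        if hi - lo + 1 != sum(s[1] - s[0] + 1 for s in spans):
            return None
        if kind == "Q":
            # Q-node children may appear only in the given order or reversed.
            starts = [s[0] for s in spans]
            if starts != sorted(starts) and starts != sorted(starts, reverse=True):
                return None
        return (lo, hi)

    return span(tree) is not None

# Tree for ((1-2-3)-(((4-5-6),7),8)-9)
T = ("Q", [("Q", [1, 2, 3]), ("P", [("P", [("Q", [4, 5, 6]), 7]), 8]), 9])
PI = [
    [1, 2, 3, 4, 5, 6, 7, 8, 9],
    [9, 8, 4, 5, 6, 7, 1, 2, 3],
    [1, 2, 3, 8, 7, 4, 5, 6, 9],
]
print([in_consensus(T, p) for p in PI])
```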
13
Improving the Time Complexity to O(kn)
In Heber & Stoye's algorithm for obtaining I_Π, a data structure S is maintained to hold the chains of the irreducible intervals:
I_Π = { [1-2],[1-8],[2-3],[4-5],[4-7],[4-8],[4-9],[5-6] }
REPLACE(S):
1. Replace every chain by a Q node.
2. Replace every element that is not a leaf or a Q node and is pointed to by a vertical link with a P node.
[Figure: the permutations π1 = 1 2 3 4 5 6 7 8 9, π2 = 9 8 4 5 6 7 1 2 3, π3 = 1 2 3 8 7 4 5 6 9 and the resulting PQ tree]
14
Maximal Patterns and Sub-Trees
A sub-tree of the PQ tree T is obtained by picking a P-node in T with all its descendants, or by picking a Q-node in T with any number of consecutive children and their descendants.
Suppose the pattern {a,b,c,d} appears in 4 locations as Π = { abcd, acbd, dbca, dcba }.
[Figure: the PQ tree over the leaves a, b, c, d]
Theorem: If p1 and p2 are patterns, and p1 is non-maximal with respect to p2, then the PQ tree T1 that represents p1 is a sub-tree of the PQ tree T2 that represents p2.
15
So what did we achieve?
The first algorithm, optimal in time, that generates the maximal notation of a pattern.
A "bottom-up" construction of a PQ tree.
A visualization of the inner structure of a pattern.
Filtering of meaningful clusters from apparently meaningless (non-maximal) ones.
Experimental results that show this tool can aid in predicting gene functions.
Clustering for the various genome models.
16
Using Our Tool for Various Genome Models
Genome model I (orthologs only): A sequence is a permutation of the set {1,2,…,n}.
There is only one maximal pattern: {1,2,…,n}.
In O(kn) time we get a PQ tree that describes all patterns of all sizes and their non-maximal relations.
17
Using Our Tool for Various Genome Models
Genome model II: A gene may appear once in a sequence or not appear at all in that sequence.
We can extend the algorithm to work on sequences that are not permutations of the same set.
Example: consider the 2 sequences
1 2 3 4 5 6 7 and 1 8 2 4 3 7 6
Add characters as needed:
8 1 2 3 4 5 5' 6 7 8' and 5 1 8 8' 2 4 3 7 6 5'
Build the PQ tree on the new sequences:
[Figure: PQ tree built on the new sequences, with some leaves colored red]
The sub-trees that have no red leaves are all the maximal patterns.
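The padding step in the example can be reproduced with the sketch below. The rule is inferred from this single example and is an assumption: a character x that is absent from some sequence is followed by x' wherever it does occur, and a sequence that lacks x gets x at its start and x' at its end. The order in which several missing characters are added is also an arbitrary choice here, and `pad_sequences` is a hypothetical name.

```python
def pad_sequences(seqs):
    """Extend the sequences so that all become permutations of one common set."""
    all_chars = set().union(*map(set, seqs))
    # Characters that are missing from at least one sequence.
    partial = sorted(c for c in all_chars if not all(c in s for s in seqs))
    padded = []
    for s in seqs:
        present = set(s)
        body = []
        for c in s:
            body.append(c)
            if c in partial:
                body.append(c + "'")  # pair each partial character with its primed copy
        missing = [c for c in partial if c not in present]
        padded.append(missing + body + [c + "'" for c in missing])
    return padded

seqs = [list("1234567"), list("1824376")]
padded = pad_sequences(seqs)
for p in padded:
    print(" ".join(p))
```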
18
Using Our Tool for Various Genome Models
Genome model III (paralogs and orthologs): A gene may appear any number of times in a sequence (including zero).
The minimal consensus PQ tree is not necessarily unique.
Solution:
Example: consider 2 appearances of the pattern {a,a,b} as Π = { aab, baa }:
1. Π = { a1 a2 b, b a2 a1 }, C(T) = { a1 a2 b, b a2 a1 }
2. Π = { a1 a2 b, b a1 a2 }, C(T) = { a1 a2 b, b a2 a1, a2 a1 b, b a1 a2 }
[Figure: the two PQ trees over the leaves a1, a2, b]
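The two sets C(T) in the example can be reproduced by enumerating every frontier of each tree. The two trees below are assumptions chosen to match the listed sets (a Q-node over a1, a2, b for the first labeling; a P-node pairing a1 and a2, placed next to b, for the second), since the original figure is not available. The tuple encoding and the name `frontiers` are also hypothetical.

```python
from itertools import permutations, product

def frontiers(node):
    """All frontiers (as tuples of leaves) of trees equivalent to `node`."""
    if isinstance(node, str):
        return {(node,)}
    kind, children = node
    if kind == "P":
        orders = list(permutations(range(len(children))))
    else:
        idx = tuple(range(len(children)))
        orders = [idx, idx[::-1]]  # a Q-node allows only the order and its reverse
    child_sets = [frontiers(child) for child in children]
    result = set()
    for order in orders:
        for combo in product(*(child_sets[i] for i in order)):
            result.add(sum(combo, ()))  # concatenate the children's frontiers
    return result

tree1 = ("Q", ["a1", "a2", "b"])
tree2 = ("P", [("P", ["a1", "a2"]), "b"])
c1 = frontiers(tree1)
c2 = frontiers(tree2)
print(sorted(c1))
print(sorted(c2))
```

The different sizes of the two sets show how the labeling of the repeated gene changes the resulting tree.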