1 Efficient Discovery of Conserved Patterns Using a Pattern Graph Inge Jonassen Pattern Discovery Arwa Zabian 13/07/2015
2 Goal
The goal is to use a pattern graph to discover conserved patterns in a set of related protein sequences.
3 Pratt
* Pratt is a tool that lets the user search for patterns conserved in a set of protein sequences.
* The user must specify what kind of patterns to search for and how many sequences a pattern must match to be reported.
* The tool implements an algorithm proposed by Jonassen in 1995 for discovering PROSITE-type patterns, allowing both ambiguous pattern positions and variable-length gaps.
* Pratt searches for patterns that match at least a specified number of the given sequences, then ranks the discovered patterns by a scoring function, highest score first.
4 PROSITE
PROSITE is a database of protein families containing more than 1,100 entries. For each family it gives a pattern or a profile that can be used to identify new members of the family. The results indicate that Pratt is able to discover useful patterns for some protein families.
5 Efficient Discovery of Conserved Patterns Using a Pattern Graph
In 1996 Jonassen proposed an alternative approach for finding patterns common to at least k out of n given sequences, which introduces the concept of a pattern graph. The approach:
- assumes that patterns have a fixed form;
- defines transformation operations that generate one pattern from another;
- given a sequence S = s_1, s_2, ..., s_l of length l, represents its patterns as a graph;
- once the graph is constructed, uses depth-first search (DFS) to find all the patterns that can be derived from a path in the graph;
- among all these patterns, selects the most significant ones, that is, those with the highest score.
6 Terminology and definitions (1)
The algorithm finds the most interesting patterns that match at least some minimum number of sequences from a given set. The sequences are strings over an alphabet $\Sigma$, for example the amino acid alphabet for proteins or the nucleotide alphabet.
Definition (class of patterns): a pattern P in the class C has the form
$$P = A_1 - x(i_1, j_1) - A_2 - x(i_2, j_2) - \dots - x(i_{p-1}, j_{p-1}) - A_p \qquad (1)$$
where:
- $A_1, A_2, \dots, A_p$ are the pattern components of P. A pattern component is either an identity component (a single symbol) or an ambiguous component (a set of symbols).
- $x(i_1, j_1), \dots, x(i_{p-1}, j_{p-1})$ are the wildcard regions, where the $i_k$ and $j_k$ are non-negative integers with $i_k \le j_k$. A wildcard region is fixed if $i_k = j_k$ and flexible otherwise; its flexibility is defined as $j_k - i_k$.
7 Example
P = A-[DE]-x(3)-G-x(3,4)-L
- $A_1$ = A, $i_1 = j_1 = 0$
- $A_2$ = {D, E}, $i_2 = j_2 = 3$
- $A_3$ = G, $i_3 = 3$, $j_3 = 4$
- $A_4$ = L
The length of the pattern is 6 and p = 4.
For a pattern P' to match P, each pattern component of P' must match the corresponding pattern component of P.
For this pattern: P = 4, L = 6, W = 4, F = 1, N = 1, FP = 2.
8 Definitions (2)
The class of patterns C that the algorithm explores is defined by a set of bounds (A, P, L, W, F, N, FP), where:
- $A \subseteq 2^{\Sigma}$ is the set of allowed pattern components, with $\Sigma$ the sequence alphabet
- P is the maximum number of components
- L is the maximum length of a pattern
- W is the maximum length of a wildcard region
- F is the maximum flexibility of a wildcard region
- N is the maximum number of flexible wildcard regions
- FP is the maximum product of flexibilities, $\prod_{k=1}^{p-1} (j_k - i_k + 1)$
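The bounds can be checked mechanically against a concrete pattern. The sketch below is illustrative only: the representation (a list of symbol sets plus a list of `(i, j)` wildcard pairs) and the function name `pattern_bounds` are assumptions, not part of Pratt. It computes the smallest values of P, W, F, N and FP that admit the pattern A-[DE]-x(3)-G-x(3,4)-L; the length bound L is left out because the slides do not spell out how pattern length is counted.

```python
from math import prod

# Assumed representation: components are sets of symbols, and wildcards[k]
# is the region (i_k, j_k) between component k and component k + 1.
components = [{"A"}, {"D", "E"}, {"G"}, {"L"}]
wildcards = [(0, 0), (3, 3), (3, 4)]  # x(0), x(3), x(3,4)


def pattern_bounds(components, wildcards):
    """Return the tightest P, W, F, N, FP values that admit the pattern."""
    flex = [j - i for i, j in wildcards]
    return {
        "P": len(components),                           # number of components
        "W": max((j for _, j in wildcards), default=0),  # longest wildcard region
        "F": max(flex, default=0),                       # largest flexibility
        "N": sum(1 for f in flex if f > 0),              # flexible regions
        "FP": prod(f + 1 for f in flex),                 # product of (j - i + 1)
    }


bounds = pattern_bounds(components, wildcards)
print(bounds)
```

A pattern belongs to the class C only if each of these values is no larger than the corresponding bound (and every component is in A).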
9 Generalisation of patterns
Definition: a pattern P' is a generalisation of a pattern P if every sequence matching P also matches P'.
The concept of generalisation: given a class of patterns C, we define a family of transformation operators $\Rightarrow^{i}_{C}$, $i \in \{1, 2, 3\}$, each of which can be applied to a pattern P in C to produce another pattern P' in C. The operators are defined as follows:
1- $P \Rightarrow^{1}_{C} P'$: P' is obtained from P by deleting a pattern component c. Formally, one of the following holds:
- $P = c - x(i, j) - P'$
- $P = P' - x(i, j) - c$
- $P = P_1 - x(i_1, j_1) - c - x(i_2, j_2) - P_2$ and $P' = P_1 - x(i_1 + i_2 + 1,\ j_1 + j_2 + 1) - P_2$
for $P, P', P_1, P_2 \in C$.
10 Generalisation of patterns (2)
2- $P \Rightarrow^{2}_{C} P'$: a component c in P is substituted with a less restrictive one c':
$P = P_1 - x(i_1, j_1) - c - x(i_2, j_2) - P_2$
$P' = P_1 - x(i_1, j_1) - c' - x(i_2, j_2) - P_2$
where $P_1, P_2 \in C$ and $c \subset c' \in A$.
3- $P \Rightarrow^{3}_{C} P'$: more flexibility is allowed in a wildcard region of P:
$P = P_1 - x(i, j) - P_2$
$P' = P_1 - x(i', j') - P_2$
where $i' \le i$, $j' \ge j$ and $(i', j') \ne (i, j)$, for some $P_1, P_2, P' \in C$.
More generally, $P \Rightarrow_{C} P'$ if and only if $P \Rightarrow^{i}_{C} P'$ for some $i \in \{1, 2, 3\}$.
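The three operators are easy to state as functions on a pattern. The following is a minimal sketch under assumed conventions: a pattern is a list of `frozenset` components plus a list of `(i, j)` wildcard pairs, and the function names are hypothetical. It does not check that the result stays inside the class C (for instance that the new component is in A or that W and F are respected).

```python
def delete_component(comps, gaps, k):
    """Operator 1: delete component k; its two neighbouring wildcards merge."""
    if k == 0:
        return comps[1:], gaps[1:]
    if k == len(comps) - 1:
        return comps[:-1], gaps[:-1]
    (i1, j1), (i2, j2) = gaps[k - 1], gaps[k]
    merged = (i1 + i2 + 1, j1 + j2 + 1)
    return comps[:k] + comps[k + 1:], gaps[:k - 1] + [merged] + gaps[k + 1:]


def widen_component(comps, gaps, k, new_comp):
    """Operator 2: replace component k by a strict superset of it."""
    if not comps[k] < new_comp:
        raise ValueError("new component must strictly contain the old one")
    return comps[:k] + [new_comp] + comps[k + 1:], list(gaps)


def relax_wildcard(comps, gaps, k, i_new, j_new):
    """Operator 3: replace x(i, j) by x(i', j') with i' <= i and j' >= j."""
    i, j = gaps[k]
    if not (i_new <= i and j_new >= j and (i_new, j_new) != (i, j)):
        raise ValueError("new wildcard must strictly contain the old one")
    return list(comps), gaps[:k] + [(i_new, j_new)] + gaps[k + 1:]


def show(comps, gaps):
    """Render a pattern in PROSITE-like notation."""
    parts = []
    for k, comp in enumerate(comps):
        sym = "".join(sorted(comp))
        parts.append(sym if len(comp) == 1 else f"[{sym}]")
        if k < len(gaps) and gaps[k] != (0, 0):
            i, j = gaps[k]
            parts.append(f"x({i})" if i == j else f"x({i},{j})")
    return "-".join(parts)


# Illustrative pattern C-x(2)-H-x(1,2)-G
comps = [frozenset("C"), frozenset("H"), frozenset("G")]
gaps = [(2, 2), (1, 2)]
print(show(*delete_component(comps, gaps, 1)))
print(show(*widen_component(comps, gaps, 2, frozenset("AG"))))
print(show(*relax_wildcard(comps, gaps, 0, 1, 3)))
```

Each call returns a new pattern and leaves its input untouched, so the operators can be chained in any order during a search.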
11 Example
The pattern A-B-C-D can be generalised to [AB]-B-x(1,3)-D:
A-B-C-D $\Rightarrow^{2}_{C}$ [AB]-B-C-D $\Rightarrow^{1}_{C}$ [AB]-B-x-D $\Rightarrow^{3}_{C}$ [AB]-B-x(1,3)-D
Pattern scoring function: the score of a pattern P of the form (1) is
$$I(P) = \sum_{i=1}^{p} I'(A_i) - c \cdot \sum_{k=1}^{p-1} (j_k - i_k)$$
where c is a constant and $I'(A_i)$ is the information content of the pattern component $A_i$. A pattern that carries more information gets a higher score and is ranked nearer the top. Pratt uses this function to rank all the patterns it discovers.
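To see how the score behaves along a chain of generalisations, here is a small sketch. Two things in it are assumptions made for illustration and are not taken from the slides: the information content is modelled as $I'(A_i) = \log_2(20 / |A_i|)$ (20 equally likely symbols), and the constant c is set to 0.5.

```python
from math import log2

ALPHABET_SIZE = 20  # assumption: 20 equally likely symbols
C_PENALTY = 0.5     # assumption: the value of the constant c is not given


def info(component):
    """Hypothetical I'(A_i): bits gained by restricting a position to this set."""
    return log2(ALPHABET_SIZE / len(component))


def score(components, wildcards, c=C_PENALTY):
    """I(P) = sum_i I'(A_i) - c * sum_k (j_k - i_k)."""
    return sum(info(a) for a in components) - c * sum(j - i for i, j in wildcards)


# Each component is written as a string of its allowed symbols.
chain = [
    (["A", "B", "C", "D"], [(0, 0), (0, 0), (0, 0)]),   # A-B-C-D
    (["AB", "B", "C", "D"], [(0, 0), (0, 0), (0, 0)]),  # [AB]-B-C-D
    (["AB", "B", "D"], [(0, 0), (1, 1)]),               # [AB]-B-x-D
    (["AB", "B", "D"], [(0, 0), (1, 3)]),               # [AB]-B-x(1,3)-D
]
scores = [score(comps, gaps) for comps, gaps in chain]
for (comps, gaps), value in zip(chain, scores):
    print(comps, gaps, round(value, 3))
```

Under these assumptions every generalisation step lowers the score, which is why the more specific of two conserved patterns is ranked higher.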
12 Pattern graph
A pattern graph is a directed graph G = (V, E) in which the nodes V represent pattern components and the edges E represent wildcard regions. $\lambda(u)$ is the label of a node $u \in V$; each edge $e \in E$ is labelled with the minimum and the maximum number of residues that may match the wildcard region.
13 Example
For the pattern P = A-B-x(0,2)-C-x(3)-D we can construct the following graph, with $\lambda(u)$ = A, $\lambda(v)$ = B, $\lambda(w)$ = C, $\lambda(x)$ = D.
A path $p = u_1, u_2, \dots, u_n$ in G defines a pattern $\pi(p)$. For example, the path u, w, x defines the pattern $\pi(p)$ = A-x(0,1)-C-x(3)-D.
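14 Definition (3)
Let $\pi(p)$ denote the pattern defined by a path p in G. We define $\Pi(G, C)$ to be the set of all patterns that can be obtained by C-generalisation from the patterns in C defined by paths in G:
$$\Pi(G, C) = \bigcup_{p \text{ path in } G,\ \pi(p) \in C} \{ P \mid \pi(p) \Rightarrow^{*}_{C} P \}$$
We define $\Pi'(G, C)$ to be the set of patterns in C that can be derived from a pattern defined by a path in G using only the restricted transformation $\Rightarrow'_{C}$, where $P \Rightarrow'_{C} P'$ if and only if $P \Rightarrow^{i}_{C} P'$ for some $i \in \{2, 3\}$:
$$\Pi'(G, C) = \bigcup_{p \text{ path in } G,\ \pi(p) \in C} \{ P \mid \pi(p) \Rightarrow'^{*}_{C} P \}$$
The goal is to find $\Pi'(G, C)$ and to prove that $\Pi'(G, C) = \Pi(G, C)$.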
15 Constructing a pattern graph
Input:
- a set of sequences $S = s_1, s_2, \dots, s_n$, where each sequence $s_i$ is a string of characters $s_i^1 s_i^2 \dots$
- bounds specifying a class C
- the minimum number of sequences k < n that a pattern should match
Output:
1- a pattern graph G
2- $\Pi'(G, C)$
3- the highest-scoring patterns matching at least k sequences
4- pruning of the search for the highest-scoring patterns
16 Constructing a pattern graph from a sequence
Given a set of bounds defining a class of patterns C and a sequence $s = s_1, s_2, \dots, s_l$, the algorithm works in two phases: the first defines the nodes, starting from the root, and the second defines the edges.
Phase 1:
- G contains one node $u_i$ for each character $s_i$ in s such that $\{s_i\}$ is a pattern component in A
- label $u_i$ with $s_i$: $\lambda(u_i) = s_i$
Phase 2:
- for each node $u_i$, make an edge to every node $u_j$ with $i < j \le \min(i + W + 1, l)$
- label the edge $(u_i, u_j)$ with $(j - i - 1,\ j - i - 1)$
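The two phases translate almost directly into code. This is a minimal sketch, not Pratt's implementation: the function name `build_pattern_graph`, the dictionary-based graph and the optional `allowed` argument (standing in for the single-symbol components of A) are all assumptions. Positions are numbered from 1 as on the slide, and the example value W = 2 is chosen arbitrarily.

```python
def build_pattern_graph(s, W, allowed=None):
    """Build a pattern graph from one sequence.

    Returns (nodes, edges): nodes maps position i -> label s_i, and edges maps
    (i, j) -> (min, max) wildcard length, which is fixed at j - i - 1.
    """
    l = len(s)
    # Phase 1: one node per character that is an allowed pattern component.
    nodes = {
        i: ch
        for i, ch in enumerate(s, start=1)
        if allowed is None or ch in allowed
    }
    # Phase 2: edge from u_i to every u_j with i < j <= min(i + W + 1, l).
    edges = {}
    for i in nodes:
        for j in range(i + 1, min(i + W + 1, l) + 1):
            if j in nodes:
                edges[(i, j)] = (j - i - 1, j - i - 1)
    return nodes, edges


nodes, edges = build_pattern_graph("ABCDEFG", W=2)
for (i, j), label in sorted(edges.items()):
    print(f"{nodes[i]} -> {nodes[j]}: x{label}")
```

Because every edge carries a fixed wildcard length, flexible wildcards only appear later, when operator 3 is applied to patterns read off the paths.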
17 Example
S = ABCDEFG
Algorithm properties:
- $\Pi'(G, C)$ contains all patterns in C matching S
- each pattern in $\Pi'(G, C)$ matches S
- $\Pi'(G, C) = \Pi(G, C)$
18 Constructing a pattern graph from a multiple alignment
The goal is to construct a pattern graph G with $\Pi'(G, C) = \Pi(G, C)$.
Input: an alignment of the sequences $S = M^1, \dots, M^m$, where l is the length of the alignment.
- A sequence is $M^i = M^i_1, \dots, M^i_{l_i}$, where $M^i_j$ is the j-th character of sequence $M^i$.
- The alignment columns are numbered from left to right. Column i is represented by a vector $c^i_1, \dots, c^i_m$, where $c^i_j = k$ if column i contains the k-th character of sequence $M^j$, and $c^i_j = 0$ if column i contains a gap in that sequence.
- The graph is constructed from the ungapped columns only.
19 Constructing a pattern graph from a multiple alignment (2)
The algorithm works in two steps: the first defines the nodes of the graph and the second defines the edges.
Step 1:
- for each ungapped column $c_i$ of the alignment, make a node $u_i$
- the set of symbols present in that column determines the allowable pattern components
- label $u_i$ with the smallest set a in A that contains these symbols
Step 2:
- each pair of nodes $u_i, u_j$ with i < j corresponds to a pair of columns i, j of the alignment; take the minimum and the maximum number of sequence symbols that any sequence has between columns i and j
- label the edge $(u_i, u_j)$ with this (minimum, maximum) pair
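The two steps can be sketched as follows. The code makes simplifying assumptions that are not in the slides: the alignment is a list of equal-length strings with `-` as the gap character, each node is labelled with the exact set of symbols in its column (as if every subset of the alphabet were in A), the bound W is not applied, and columns are numbered from 0. The name `alignment_graph` is hypothetical.

```python
def alignment_graph(rows, gap="-"):
    """Build a pattern graph from the ungapped columns of an alignment.

    Returns (nodes, edges): nodes maps column index -> set of symbols in that
    column, and edges maps (ci, cj) -> (min, max) number of residues that the
    sequences have strictly between the two columns.
    """
    length = len(rows[0])
    # Step 1: one node per column in which no sequence has a gap.
    cols = [c for c in range(length) if all(r[c] != gap for r in rows)]
    nodes = {c: frozenset(r[c] for r in rows) for c in cols}
    # Step 2: label each pair of ungapped columns with the (min, max)
    # residue count between them, taken over all sequences.
    edges = {}
    for a, ci in enumerate(cols):
        for cj in cols[a + 1:]:
            counts = [sum(ch != gap for ch in r[ci + 1:cj]) for r in rows]
            edges[(ci, cj)] = (min(counts), max(counts))
    return nodes, edges


rows = [
    "AC--DE",
    "ACG-DE",
    "SCGTDD",
]
nodes, edges = alignment_graph(rows)
for (ci, cj), (lo, hi) in sorted(edges.items()):
    print(sorted(nodes[ci]), "->", sorted(nodes[cj]), f"x({lo},{hi})")
```

Unlike the single-sequence graph, edges here can be flexible from the start, because gapped columns between two nodes contribute different residue counts in different sequences.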
20 Example
21 Simple depth-first search using the graph
So far we have constructed the graph. The next step is to find the conserved patterns in $\Pi'(G, C)$ using DFS. This means constructing a search tree that is rooted in the empty pattern and contains all the k-patterns of $\Pi'(G, C)$ at depth k.
Definition: a k-pattern is a pattern obtained from a k-path in G by applying C-generalisation operations of type 2 and 3.
Input:
- a conserved k-pattern P
- the k-path p in G from which P was derived
Output: all the simple extensions of P that are in C and can be derived from an extension of the path p. Each generated pattern is then checked to see whether it is conserved.
22 How an extension of P is generated
Let $p = v_1, v_2, \dots, v_k$ and let there be edges from $v_k$ to $w_1, \dots, w_l$. Each path $p_t = v_1, \dots, v_k, w_t$ defines a pattern $P(p_t)$ that is a simple extension of P:
$$P(p_t) = P - x(i_t, j_t) - A_t$$
where $(i_t, j_t)$ is the label of the edge $(v_k, w_t)$ and $A_t$ is the label of the node $w_t$.
From each pattern $P(p_t)$ further simple extensions are generated by applying the type 2 operator to $A_t$ and the type 3 operator to $x(i_t, j_t)$.
Example: let G be a graph, and assume F = 1 and A = {{A}, {B}, {C}, {D}, {E}, {F}, {G}}. Assume that the pattern P = A-x-C-D was derived from the path p = A, C, D.
23 Example (continued)
The path p can be extended along any of the edges
24 Simple depth-first search using the graph (2)
Running the search procedure recursively generates all the patterns in $\Pi'(G, C)$; each one is then checked to see whether it is conserved.
Pruning the search:
- Finding the highest-scoring patterns means that, for every node u in G, we need to find the most expressive conserved pattern derived from a path starting in u.
- The pruning is described for three cases:
1- neither flexibility nor ambiguity is allowed
2- no flexibility, but ambiguity is allowed
3- the general case
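The recursive search can be illustrated for the simplest case (no flexibility, no ambiguity). The sketch below is deliberately reduced and is not Pratt's procedure: it walks the paths of the graph of the first sequence only, so every reported pattern occurs in that sequence; it tests conservation with Python's `re` module; and it stops extending a pattern as soon as fewer than k sequences match, since an extension cannot match more sequences than its prefix. The function name and the default bounds W = 2 and P = 4 are assumptions.

```python
import re


def conserved_patterns(seqs, k, W=2, P=4):
    """DFS over paths in the pattern graph of seqs[0] (positions from 0).

    Returns a dict mapping each conserved pattern to the number of sequences
    it matches; only fixed wildcards and single symbols are used.
    """
    s = seqs[0]
    found = {}

    def support(regex):
        return sum(1 for t in seqs if re.search(regex, t))

    def dfs(last, depth, regex, text):
        n = support(regex)
        if n < k:
            return  # not conserved: no extension can be conserved either
        found[text] = n
        if depth == P:
            return
        # Follow every edge out of the last node of the path.
        for nxt in range(last + 1, min(last + W + 1, len(s) - 1) + 1):
            gap = nxt - last - 1
            wild_re = f".{{{gap}}}" if gap else ""
            wild_tx = f"-x({gap})" if gap else ""
            dfs(nxt, depth + 1,
                regex + wild_re + re.escape(s[nxt]),
                f"{text}{wild_tx}-{s[nxt]}")

    for start in range(len(s)):
        dfs(start, 1, re.escape(s[start]), s[start])
    return found


found = conserved_patterns(["ACDEF", "GACDE", "ACXEF"], k=3)
for pattern in sorted(found, key=lambda p: (-len(p), p)):
    print(pattern, found[pattern])
```

In this restricted case the longest conserved path from a start node is also the highest-scoring one, which is what the first pruning rule exploits.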
25 Pruning the search (1): neither flexibility nor ambiguity is allowed
Here $A = \{\{a\} \mid a \in \Sigma\}$, with $\Sigma$ the sequence alphabet, and F = 0. In this case each pattern is defined directly by its path, and the longest path gives the highest-scoring pattern.
Property: in a given graph G = (V, E), if a node u has edges to v and w, where u < v < w, then there is an edge from v to w.
We define an ordering relation $<_1$ on the child nodes of a given node x of the search tree such that, if $x_i <_1 x_j$, then in the patterns $P_{x_i}$ and $P_{x_j}$ we have $w_i < w_j$.
Result: there is no need to explore all of the subtrees $x_{i+1}, \dots, x_l$ to find the highest-scoring pattern.
26 Pruning the search (2): no flexibility, but ambiguity is allowed
Let x' be a child of node x in the search tree, corresponding to the path $p_i = v_1, \dots, v_k, w_i$. The pattern derived from this path is $P_x - x(i, j) - A$, where $(i, j)$ is the label of the edge $(v_k, w_i)$. Let Ind(x') = index($w_i$) and Amb(x') = |A|.
We define a partial order $<_2$ on the children of x: $x' <_2 x''$ if
- Ind(x') < Ind(x''), or
- Ind(x') = Ind(x'') and Amb(x') < Amb(x'').
Two nodes x', x'' with (Ind(x'), Amb(x')) = (Ind(x''), Amb(x'')) are ordered arbitrarily.
If the pattern of a child x' matches the same number of segments as $P_x$, the children of x that come after x' are not analysed, because they cannot give a higher-scoring pattern.
27 Pruning the search (3): the general case
Each child x' of x defines a pattern $P' = P - x(i, j) - a$, determined by:
- the node $w_i$ that is appended to the path
- the component $a \in A$ such that $\lambda(w_i) \subseteq a$
- the flexibility of the wildcard region defined by the edge $(v_k, w_i)$
Given Ind(x'), Amb(x') and F(x') = j - i, we define a partial order $<_3$ on the children of x: $x' <_3 x''$ if
- Ind(x') < Ind(x''), or
- Ind(x') = Ind(x'') and Amb(x') < Amb(x''), or
- (Ind(x'), Amb(x')) = (Ind(x''), Amb(x'')) and F(x') < F(x'').
Two nodes x', x'' for which (Ind(x'), Amb(x'), F(x')) = (Ind(x''), Amb(x''), F(x'')) are ordered arbitrarily.
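The order $<_3$ is a lexicographic comparison of the triple (Ind, Amb, F), so sorting the children by that triple yields a valid visiting order. The sketch below uses a hypothetical `Child` record and made-up example children; Python's sort is stable, so children with equal triples simply keep their input order, which is one way of ordering them arbitrarily.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Child:
    pattern: str  # text of the extended pattern, for display only
    ind: int      # Ind(x'): index of the appended node w_i
    amb: int      # Amb(x'): size |a| of the appended component
    flex: int     # F(x') = j - i of the new wildcard region


def order_children(children):
    """Sort children by (Ind, Amb, F), i.e. according to the order <_3."""
    return sorted(children, key=lambda ch: (ch.ind, ch.amb, ch.flex))


children = [
    Child("A-x(1,2)-[CD]", ind=3, amb=2, flex=1),
    Child("A-x(1)-C", ind=3, amb=1, flex=0),
    Child("A-B", ind=2, amb=1, flex=0),
    Child("A-x(1)-[CD]", ind=3, amb=2, flex=0),
]
ordered = order_children(children)
for child in ordered:
    print(child.pattern, (child.ind, child.amb, child.flex))
```

The most specific extensions (nearest node, smallest component, least flexibility) are therefore explored first, and the more general ones can be skipped when an earlier child already matches enough segments.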
28 Pruning the search (3): the general case (continued)
Let P' be the pattern corresponding to a child x' of x. If P' matches at least a certain proportion of the segments matched by P, the other children of x are not analysed. The reason is that, if P is a real conserved pattern and the extension P' matches at least k segments, we would expect only a small proportion of the segments matching P to extend to segments matching P'. When a large proportion do extend, P' is taken to be the conserved pattern and no other extension of P needs to be explored.
29 Complexity analysis
Time complexity: the algorithm searches for all patterns conserved in at least k of n sequences with average length l, where the class of patterns C is given by a set of bounds (A, P, L, W, F, N, FP). The time needed to analyse a pattern graph G = (V, E) constructed from the n - k + 1 shortest sequences is $O(|E| \cdot P \cdot N)$, where P = O(L) and L = O(n · l) is the total length of all sequences. The worst-case time complexity is exponential in the maximum pattern length P, which is the maximum depth of the search tree.
Space complexity: the space needed to store the graph is $O(|E| \cdot g^2 / 8 + |V|) \cdot (W + 1 + N \cdot P)$ bytes, where g is the maximum number of generalisations of a pattern component.