Remote Homology detection: A motif based approach. CS 6890: Bioinformatics - Dr. Yan. Swati Adhau, 04/14/06.

1 Remote Homology detection: A motif based approach. CS 6890: Bioinformatics - Dr. Yan. Swati Adhau, 04/14/06

2 Outline of the Presentation: Motivation; Introduction; Description (Remote Homology Detection); Methods; Results & Discussion; Q and A.

3 Motivation Remote homology detection is the problem of detecting homology in cases of low sequence similarity. The paper presents a method for detecting remote homology based on the presence of discrete sequence motifs. The motif content of a pair of sequences is used to define a similarity that serves as a kernel for a support vector machine (SVM) classifier.

4 The method is tested on two remote homology detection tasks: 1) Prediction of a previously unseen SCOP (Structural Classification of Proteins) family. 2) Prediction of an enzyme class given other enzymes that have a similar function on other substrates.

5 Introduction Protein homology detection is one of the most important problems in computational biology. Homology is generally established by sequence similarity. Two established methods: 1) the Smith-Waterman algorithm 2) BLAST. Protein sequence motifs are an alternative method of detecting sequence similarity.

6 Intro (continued) By focusing on limited, highly conserved regions of proteins, motifs can often reveal important clues to a protein's role. Motifs often represent functionally important regions such as catalytic sites, binding sites and structural motifs. The BLOCKS+ database combines various databases such as Pfam, PRINTS, ProDom, DOMO and InterPro. The eMOTIF database contains discrete sequence motifs constructed from the blocks of BLOCKS+. This paper uses discrete sequence motifs extracted from the eBLOCKS database using the eMOTIF method.

7 Intro (continued) Based on the motif content of a pair of sequences, we introduce a sequence similarity measure. This paper uses an SVM method. The SVM method is shown to perform better than the Fisher-kernel method, SAM-T98 and PSI-BLAST. When a sequence similarity can be expressed as a dot product in some space, it is called a kernel. In this paper we use protein motifs to construct a kernel that can be computed efficiently and that performs better than a kernel based on BLAST or Smith-Waterman scores.

8 Remote Homology Detection This method was tested on the following two tasks: 1) Prediction of a SCOP family when trained on other families in that family's fold. 2) Prediction of the function of an enzyme when the training set contains enzymes that have the same general function but different substrates.

9 Background of the first dataset The first dataset is composed of sequences of domains from the SCOP database. Objective: to detect homology at the SCOP superfamily level, that is, to recognize a SCOP family when the training set contains other families in the family's superfamily.

10

11 …contd This specifies the positive examples in the test set and the training set. The negative examples are taken from outside the family's fold. A random family is chosen to form the negative test set, and the rest of the families in its superfamily are added to the negative training set.

12 The second dataset We use the classification of enzymes to simulate remote homology. The function of an enzyme is given by the EC number assigned to it by the Enzyme Commission. An EC number has the form n1.n2.n3.n4, for example 1.1.3.13 for alcohol oxidase. n1 (1-6) indicates the type of chemical reaction catalyzed by the enzyme. n2 specifies the donor molecule. n3 specifies the acceptor. n4 specifies the substrate.

13 …contd The paper concentrates on oxidoreductases (n1 = 1). A classifier is trained to predict oxidoreductases with a certain function (n2 and n3). The classifier is then tested on oxidoreductases with a different substrate (n4) than those it was trained on.

14 For example: EC class 1.14.13.8 → positive examples of the training set. EC class 1.14.13.39 → positive examples of the test set. The similarity between the positive training and test examples may therefore not be very high. The negative test and training sets are defined analogously.
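The positive split described for the enzyme task can be sketched as follows. This is only an illustration under stated assumptions: the helper names `parse_ec` and `positive_split`, the dictionary input format and the toy enzyme names are hypothetical and do not come from the paper.

```python
def parse_ec(ec):
    """Split an EC number 'n1.n2.n3.n4' into a tuple of four strings."""
    parts = ec.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four fields in EC number, got {ec!r}")
    return tuple(parts)


def positive_split(enzymes, test_ec):
    """Return (train, test) names of the positive examples.

    enzymes: dict mapping an enzyme name to its EC number (assumed format).
    test_ec: the EC class held out as the positive test set.
    Positives share n1.n2.n3 with test_ec; those with the same n4 go to
    the test set, those with a different n4 go to the training set.
    """
    target = parse_ec(test_ec)
    train, test = [], []
    for name, ec in enzymes.items():
        fields = parse_ec(ec)
        if fields[:3] != target[:3]:
            continue  # different function, so not a positive example
        if fields[3] == target[3]:
            test.append(name)
        else:
            train.append(name)
    return sorted(train), sorted(test)


# Toy data: the names are placeholders, the EC numbers are from the slides.
enzymes = {"enzA": "1.14.13.8", "enzB": "1.14.13.39", "enzC": "1.1.3.13"}
train, test = positive_split(enzymes, "1.14.13.39")
```

Here `enzA` ends up in the positive training set and `enzB` in the positive test set, while `enzC` has a different function and is not a positive example at all.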

15 Methods  The Motif kernel When the similarity is a dot product it is called a kernel. When the similarity is a dot product it is called a kernel. The method is as follows:- Each position in the motif represents the variability in the column in a block from multiple sequence alignment. For eg the motif [as].dkf[filmv]..[filmv]…l[ast]. [as].dkf[filmv]..[filmv]…l[ast]. [filmv] is a substitution group. [filmv] is a substitution group.

16 …contd If the pattern of amino acids that appear in a column of a block does not match any substitution group, then the motif contains the wildcard symbol '.'. A sequence matches the above motif if it has either an a or an s in some position, then any character, then d, k, f and so on, matching until the end of the motif.
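The matching rule just described can be sketched with Python's standard `re` module, since the motif notation (literal residues, bracketed substitution groups, `.` as wildcard) happens to be valid regular-expression syntax. The function names are hypothetical, and the example motif assumes that the ellipsis character in the slide stands for three wildcard positions.

```python
import re

# A motif is a run of lowercase residues, '.' wildcards and [..] groups.
_MOTIF_SYNTAX = re.compile(r"(?:[a-z.]|\[[a-z]+\])+")


def compile_motif(motif):
    """Validate a motif string and compile it as a regular expression."""
    if not _MOTIF_SYNTAX.fullmatch(motif):
        raise ValueError(f"not a valid motif: {motif!r}")
    return re.compile(motif)


def contains_motif(sequence, motif):
    """True if the motif matches the sequence starting at some position."""
    return compile_motif(motif).search(sequence.lower()) is not None


motif = "[as].dkf[filmv]..[filmv]...l[ast]"
hit = contains_motif("mmagdkfigglgggls", motif)   # matches from the 'a'
miss = contains_motif("mmagekfigglgggls", motif)  # 'd' replaced by 'e'
```

In the first sequence the match starts at the `a` after the two leading `m` residues; changing the required `d` to `e` removes the only possible match.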

17 A sequence x contains a motif m if x contains m at some position. A sequence x can be represented in a vector space indexed by a set of motifs M as follows: $\Phi(x) = (\phi_m(x))_{m \in M}$, where $\phi_m(x)$ is the number of occurrences of the motif m in x. We can define the motif kernel as $K(x, x') = \Phi(x) \cdot \Phi(x')$.
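A minimal sketch of this kernel, assuming motifs are written in the regex-compatible notation used on the earlier slides. Occurrences are counted by start position with a lookahead so that overlapping matches are included; the function names and the toy motif set are made up for illustration.

```python
import re
from collections import Counter


def motif_counts(sequence, motifs):
    """Sparse feature vector: motif -> number of occurrences in sequence."""
    counts = Counter()
    for m in motifs:
        # A zero-width lookahead yields one match per start position,
        # so overlapping occurrences are all counted.
        n = len(re.findall(f"(?={m})", sequence))
        if n:
            counts[m] = n
    return counts


def motif_kernel(x, y, motifs):
    """K(x, y) = dot product of the two motif-count vectors."""
    fx = motif_counts(x, motifs)
    fy = motif_counts(y, motifs)
    return sum(c * fy[m] for m, c in fx.items())


motifs = ["a.c", "[kr]d", "gg"]   # toy motif set, not from eBLOCKS
k_xy = motif_kernel("agckd", "rdatcggg", motifs)
k_yy = motif_kernel("rdatcggg", "rdatcggg", motifs)
```

For these toy sequences the two shared motifs each occur once in both, so `k_xy` is 2; `gg` occurs twice in the second sequence, which contributes 4 to `k_yy`.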

18 Since in most cases a motif appears only once in a sequence, this kernel essentially counts the number of motifs that are common to both sequences. Q: Why use the eBLOCKS database rather than other motif databases to define the motif kernel? Ans: 1) Using databases like PROSITE and eMOTIF presents a problem in evaluating the performance of the kernel. 2) The eBLOCKS database is generated in an unsupervised way. 3) The eBLOCKS set of blocks has increased coverage.

19 Computing the Motif Kernel

20 Computing the Motif kernel Computing the kernel first requires the motif content of each sequence; the subsequent computation of the kernel is simply a dot product between the vectors. To facilitate efficient computation of the motif content of a sequence, the motif database is stored in a trie, which is defined as follows. Let m be a motif over the alphabet A ∪ S ∪ {.} (amino acids A, substitution groups S and the wildcard). Every prefix of m has a node. Let m1 and m2 be prefixes of m; there is an edge from m1 to m2 if |m2| = |m1| + 1. To find all the motifs that are contained in x at any position, the trie search is started at each position of x.
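The trie lookup can be sketched roughly as below. This is an illustrative reading of the definition, not the authors' implementation: the class and function names are hypothetical, motifs are assumed to be written with lowercase residues, bracketed substitution groups and `.` wildcards, and each trie edge is labelled with one such symbol.

```python
import re
from collections import Counter

_TOKEN = re.compile(r"\[[a-z]+\]|[a-z.]")


def tokenize(motif):
    """Split a motif into symbols: residues, [..] groups and '.' wildcards."""
    tokens = _TOKEN.findall(motif)
    if "".join(tokens) != motif:
        raise ValueError(f"not a valid motif: {motif!r}")
    return tokens


def symbol_matches(token, residue):
    """True if one motif symbol accepts one residue of the sequence."""
    if token == ".":
        return True
    if token.startswith("["):
        return residue in token[1:-1]
    return token == residue


class _Node:
    def __init__(self):
        self.children = {}   # motif symbol -> child node (one more symbol)
        self.motifs = []     # motifs that end exactly at this prefix


class MotifTrie:
    def __init__(self, motifs):
        self.root = _Node()
        for motif in motifs:
            node = self.root
            for token in tokenize(motif):
                node = node.children.setdefault(token, _Node())
            node.motifs.append(motif)

    def _walk(self, node, seq, pos, counts):
        for motif in node.motifs:
            counts[motif] += 1
        if pos == len(seq):
            return
        # Several edges can accept the same residue (e.g. 'a' and '.').
        for token, child in node.children.items():
            if symbol_matches(token, seq[pos]):
                self._walk(child, seq, pos + 1, counts)

    def motif_content(self, seq):
        """Count occurrences of every stored motif, searching from each position."""
        counts = Counter()
        for start in range(len(seq)):
            self._walk(self.root, seq, start, counts)
        return counts


trie = MotifTrie(["a.c", "a.d", "[kr]d", "gg"])   # toy motifs
content = trie.motif_content("rdatcggg")
```

Motifs that share a prefix, such as `a.c` and `a.d` here, share the corresponding trie nodes, so the common prefix is matched once per start position rather than once per motif.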

21 The BLAST kernel A query sequence is represented by its BLAST scores against the training set. This representation, in conjunction with SVMs, was used to address the problem of remote homology detection. The results were better than those of the Fisher-kernel method.

22 Classification Methods We report results using two classification methods: 1) SVMs 2) k-nearest-neighbor (KNN). SVM: $f(x) = w \cdot x + b$, where w is the weight vector and b is a constant bias. A query is classified according to the sign of f.

23 As a consequence of the optimization process, the weight vector can be expressed as a weighted sum of the support vectors (SV): $w = \sum_{i \in SV} \alpha_i x_i$. The decision function is now written as $f(x) = \sum_{i \in SV} \alpha_i \, x_i \cdot x + b$. In terms of the kernel function, the decision function is expressed as $f(x) = \sum_{i \in SV} \alpha_i K(x_i, x) + b$.
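A small sketch of the kernel form of the decision function, assuming the support vectors, their weights and the bias have already been obtained by training. The coefficient name `alphas`, the default linear kernel and the toy numbers are illustrative assumptions.

```python
def dot(u, v):
    """Plain dot product, used here as a stand-in linear kernel."""
    return sum(a * b for a, b in zip(u, v))


def svm_decision(x, support_vectors, alphas, bias, kernel=dot):
    """f(x) = sum over support vectors of alpha_i * K(x_i, x), plus the bias."""
    return sum(a * kernel(sv, x) for sv, a in zip(support_vectors, alphas)) + bias


def classify(x, support_vectors, alphas, bias, kernel=dot):
    """Classify a query by the sign of the decision function."""
    return 1 if svm_decision(x, support_vectors, alphas, bias, kernel) >= 0 else -1


# Toy model: two support vectors with weights +1 and -1, bias 0.5.
svs = [(1, 0), (0, 1)]
alphas = [1.0, -1.0]
f_val = svm_decision((2, 1), svs, alphas, 0.5)

# With a linear kernel this equals w . x + b for w = sum_i alpha_i x_i.
w = tuple(sum(a * sv[d] for sv, a in zip(svs, alphas)) for d in range(2))
same = dot(w, (2, 1)) + 0.5
```

Passing a different `kernel` function, for example one that compares motif-count vectors, changes the similarity measure without changing the decision rule.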

24 KNN classifier We use a KNN classifier with a continuous-valued decision function. A score for class j is defined as $f_j(x) = \sum_{i \in kNN_j(x)} K(x_i, x)$, where $kNN_j(x)$ is the set of k nearest neighbors of x in class j.
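One possible reading of this score is sketched below, under two assumptions that the slide does not settle: "nearest" is taken to mean highest kernel value, and the k nearest neighbors are chosen over the whole training set before being grouped by class. The function name and the toy data are hypothetical.

```python
def dot(u, v):
    """Plain dot product, used here as a stand-in kernel."""
    return sum(a * b for a, b in zip(u, v))


def knn_scores(x, train, labels, k, kernel=dot):
    """Per-class score: sum of kernel values over the k nearest neighbors.

    Assumption: neighbors are ranked by kernel similarity (higher = nearer).
    """
    ranked = sorted(
        ((kernel(xi, x), yi) for xi, yi in zip(train, labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    scores = {}
    for similarity, label in ranked[:k]:
        scores[label] = scores.get(label, 0.0) + similarity
    return scores


train = [(2, 0), (1, 1), (0, 2), (0, 1)]
labels = ["A", "A", "B", "B"]
scores = knn_scores((3, 1), train, labels, k=3)
```

With these toy vectors the three nearest neighbors have kernel values 6, 4 and 2, giving a score of 10 for class A and 2 for class B.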

25 Metrics We consider two metrics for assessing the performance of a classifier: 1) ROC (area under the receiver operating characteristic curve). 2) RFP (the median rate of false positives). The ROC curve describes the tradeoff between sensitivity and specificity. More specifically we use the ROC50 curve, which counts true positives only up to the first 50 false positives. The RFP score of a positive test sequence x is the fraction of negative test sequences that have a value of the decision function that is at least as high as the value of the decision function of x.
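Both metrics can be sketched from lists of decision values. The RFP function follows the definition above directly. The ROC-n function uses the common normalization (sum of true positives counted at each of the first n false positives, divided by n times the number of positives); that normalization and the pessimistic tie-breaking are assumptions, since the slides do not spell them out.

```python
from statistics import median


def rfp_scores(pos_scores, neg_scores):
    """For each positive, the fraction of negatives scoring at least as high."""
    total = len(neg_scores)
    return [sum(n >= p for n in neg_scores) / total for p in pos_scores]


def median_rfp(pos_scores, neg_scores):
    """Median rate of false positives over all positive test sequences."""
    return median(rfp_scores(pos_scores, neg_scores))


def roc_n(pos_scores, neg_scores, n=50):
    """Area under the ROC curve up to the first n false positives, in [0, 1]."""
    ranked = [(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores]
    # Highest score first; on ties negatives come first (pessimistic choice).
    ranked.sort(key=lambda item: (-item[0], item[1]))
    true_pos = false_pos = area = 0
    for _, is_positive in ranked:
        if is_positive:
            true_pos += 1
        else:
            false_pos += 1
            area += true_pos   # positives ranked above this false positive
            if false_pos == n:
                break
    limit = min(n, len(neg_scores))
    return area / (limit * len(pos_scores))


pos = [0.9, 0.4]          # toy decision values of positive test sequences
neg = [0.8, 0.3, 0.1]     # toy decision values of negative test sequences
rfp = rfp_scores(pos, neg)
roc2 = roc_n(pos, neg, n=2)
```

In the toy data the positive scoring 0.9 outranks every negative (RFP 0), while the one scoring 0.4 is beaten by a single negative (RFP 1/3).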

26 Results The ASTRAL database was used to obtain the protein domain sequences of the SCOP database. Only superfamilies having at least two families with at least 10 members each were retained. This yielded a dataset with 1639 domains in 23 superfamilies and 56 families. Protein sequences annotated with EC numbers were extracted from the SwissProt database. The extracted dataset has 2187 enzymes in 65 classes. To generate the BLAST kernel, the authors ran an all-vs-all BLAST on the two datasets using default parameters and an E-value cutoff of 0.1. To generate the motif kernel, the motif content of the datasets was computed with eBLOCKS sequence motifs using the trie method.

27 Results …contd A family-by-family comparison of the classification performance of the motif-SVM and BLAST-SVM methods is provided in the figure on the next slide. On the SCOP task the motif-SVM method performs significantly better than the BLAST-SVM method, with a p-value of $3.9 \times 10^{-9}$ in a Wilcoxon signed-rank test for the ROC50 score. In the enzyme classification task there is no significant difference in ROC50 scores. Similar behavior is observed in the median RFP and RFP50. The results were similar when the Smith-Waterman algorithm was used instead of BLAST.

28 Results …contd

29

30 The motif kernel in figure 4 shows the similarity between the families in the superfamily, whereas none is detected by the BLAST kernel. This illustrates the increased sensitivity of the motif kernel.

31 Results …contd

32 Figure 5 shows the comparison of the SVM-based method to the one that uses KNN as a classifier. For both the motif and BLAST kernels, the SVM-based classifier performs significantly better than the corresponding KNN classifier.

33 Discussion This paper showed that an SVM classifier that uses the motif kernel performs significantly better than an SVM that uses a BLAST/Smith-Waterman kernel on a remote homology detection problem derived from the SCOP database. Both methods performed equally well on the enzyme classification task. The BLAST kernel and the motif kernel worked significantly better when used in conjunction with an SVM rather than a nearest-neighbor classifier. Despite the relative success of the motif method, there were many SCOP families and EC classes that were not detected using this method.

34 Questions?? Comments!!

35 Thank you !!!

