Computational Biology, Part C: Family Pairwise Search and Cobbling. Robert F. Murphy. Copyright © 2000, 2001. All rights reserved.


1 Computational Biology, Part C: Family Pairwise Search and Cobbling
Robert F. Murphy. Copyright © 2000, 2001. All rights reserved.

2 Overall Goals
- Find previously unrecognized members of a family
- Develop a model of a family

3 Possible Approaches
- Model-based
  - Motif-based (MEME/MAST)
  - Hidden Markov model-based (HMMER)
- Non-model-based
  - Family Pairwise Search (FPS)

4 PSSMs
- Motifs can be summarized and searched for using Position-Specific Scoring Matrices (PSSMs)
- A PSSM is calculated from a multiple alignment of a conserved region for members of a family
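
As a rough illustration of how a PSSM might be calculated from an alignment and then used to search, here is a minimal Python sketch. The alignment block, the uniform background frequency, the pseudocount of 1, and the function names `build_pssm` and `best_window` are all assumptions made for this example; real tools use better background models and sequence weighting.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(block, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Build log2-odds columns from an ungapped alignment block.

    Assumes a uniform background frequency, which is a simplification.
    """
    width = len(block[0])
    if any(len(row) != width for row in block):
        raise ValueError("all rows of the block must have the same length")
    background = 1.0 / len(alphabet)
    total = len(block) + pseudocount * len(alphabet)
    pssm = []
    for col in range(width):
        counts = Counter(row[col] for row in block)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in alphabet
        })
    return pssm

def best_window(pssm, sequence):
    """Slide the PSSM along the sequence and return (best score, start index)."""
    width = len(pssm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the PSSM")
    scored = [
        (sum(pssm[i][sequence[start + i]] for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(scored)

block = ["CPHC", "CPYC", "CGHC"]  # made-up conserved region, for illustration only
pssm = build_pssm(block)
score, start = best_window(pssm, "MKCPHCAA")
print(start, round(score, 2))
```

Each column is scored independently, which is why a single PSSM cannot represent variable spacing between sub-motifs.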

5 Learning PSSMs
- Unsupervised learning methods can be used to find motifs in unaligned sequences
- Best characterized algorithm is MEME
  - T.L. Bailey & C. Elkan (1995) Unsupervised Learning of Multiple Motifs in Biopolymers Using Expectation Maximization. Machine Learning J. 21:51-83

6 Problems with PSSMs
- Some families are characterized by two or more “sub”-motifs with variable spacing between them
- Deciding upon motif boundaries is difficult
- Possible information in intervening sequences is lost if only motifs are used

7 Cobbling
- Pick the “most representative” protein sequence from a family
- Convert it to a profile by replacing each amino acid with the corresponding column from a similarity matrix

8 Cobbling
- For each recognized “motif” in the family, replace the corresponding section of the profile with the profile of the motif
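
The two cobbling steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the five-letter alphabet, the identity-based `SIMILARITY` matrix (standing in for a real substitution matrix), the hand-written `motif_profile`, and all function names are hypothetical, and the sketch omits any scaling between motif scores and similarity-matrix scores.

```python
ALPHABET = "ACGHP"  # toy alphabet, not the full 20 amino acids

# Toy similarity matrix: +2 for identity, -1 otherwise (stand-in for a real matrix).
SIMILARITY = {a: {b: (2 if a == b else -1) for b in ALPHABET} for a in ALPHABET}

def sequence_to_profile(sequence, similarity=SIMILARITY):
    """Step 1: one profile column per residue, copied from the similarity matrix."""
    return [dict(similarity[residue]) for residue in sequence]

def embed_motif(profile, motif_profile, start):
    """Step 2: overwrite the columns covered by a motif with the motif's own columns."""
    end = start + len(motif_profile)
    if start < 0 or end > len(profile):
        raise ValueError("motif does not fit inside the profile")
    cobbled = [dict(column) for column in profile]
    cobbled[start:end] = [dict(column) for column in motif_profile]
    return cobbled

def score_ungapped(profile, sequence):
    """Score a sequence of the same length against the profile, without gaps."""
    return sum(column[residue] for column, residue in zip(profile, sequence))

representative = "ACPHCG"  # made-up "most representative" sequence
base_profile = sequence_to_profile(representative)

# Hand-written motif columns covering positions 1-4 ("CPHC") of the representative.
motif_profile = [
    {"A": -2, "C": 5, "G": -2, "H": -2, "P": -2},
    {"A": -1, "C": -2, "G": 1, "H": -2, "P": 3},
    {"A": -2, "C": -2, "G": -2, "H": 4, "P": -2},
    {"A": -2, "C": 5, "G": -2, "H": -2, "P": -2},
]
cobbled = embed_motif(base_profile, motif_profile, start=1)

base_score = score_ungapped(base_profile, "ACGHCG")
cobbled_score = score_ungapped(cobbled, "ACGHCG")
print(base_score, cobbled_score)
```

Columns outside the motif still come from the representative sequence, so the flanking positions keep contributing to the score.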

9 Cobbling
- Advantage: At least some sequence information between motifs is retained.
- S. Henikoff & J.G. Henikoff (1997) Embedding strategies for effective use of information from multiple sequence alignments. Protein Science 6:698-705

10 Cobbler Illustration
[Figure not reproduced. Labels: sequence of “most representative” family member; similarity scores for sequence from “most representative” family member; scores from profiles of conserved motifs.]

11 Family Pairwise Search
- For each known member of the family, calculate its pairwise similarity score against each sequence in the database (using BLAST), then sum those scores for each database sequence
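
The score-summing idea can be sketched in a few lines of Python. BLAST is not called here: `shared_kmer_score` is a deliberately crude, hypothetical stand-in for a pairwise BLAST score, and the sequences and the name `family_pairwise_search` are made up for the example.

```python
def shared_kmer_score(a, b, k=3):
    """Toy stand-in for a pairwise BLAST score: number of distinct shared k-mers."""
    def kmers(s):
        return {s[i:i + k] for i in range(len(s) - k + 1)}
    return len(kmers(a) & kmers(b))

def family_pairwise_search(family, database, pairwise_score=shared_kmer_score):
    """Sum each database sequence's score against every known family member.

    Returns (name, total score) pairs, best first.
    """
    totals = {
        name: sum(pairwise_score(member, sequence) for member in family)
        for name, sequence in database.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

family = ["MKCPHCAA", "LKCPHCGA"]                      # made-up known members
database = {"hit": "TTKCPHCAT", "miss": "GGGGSSSS"}    # made-up database
ranked = family_pairwise_search(family, database)
print(ranked)
```

No model is built at any point. The known members themselves act as the query, which is why the number of pairwise searches grows with family size.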

12 Family Pairwise Search
- Does not generate a model of the motif
- Analogous to k nearest neighbor classification

13 Which method is best?
- Compare BLAST using a randomly chosen family member, BLAST FPS, MEME, HMMER
- W.N. Grundy (1998) Homology Detection via Family Pairwise Search. J. Comput. Biol. 5:479-492

14 Comparison Protocol
- For each method
  - For each known protein family
    - Train with family members
    - Search database for matches
    - Rank by score from search
    - Determine how many known family members are ranked highly

15 Comparison Protocol
- Evaluation metric
  - average ROC 50
  - ROC 50 summarizes the fraction of true positives detected before the 50th false positive (formally, the normalized area under the ROC curve up to the first 50 false positives)
  - average over all families
  - Bigger is better!
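
A small Python sketch of the metric, assuming the usual definition of ROC n as the area under the ROC curve up to the n-th false positive, divided by n times the number of true positives. The function name `roc_n`, the input format (a ranked list of booleans), and the handling of lists that contain fewer than n false positives are assumptions of this example.

```python
def roc_n(ranked_labels, n=50):
    """Normalized area under the ROC curve up to the n-th false positive.

    ranked_labels: booleans ordered best score first; True marks a known
    family member (true positive), False marks a non-member (false positive).
    """
    total_positives = sum(ranked_labels)
    if total_positives == 0:
        raise ValueError("at least one true positive is required")
    true_seen = 0
    false_seen = 0
    area = 0
    for is_member in ranked_labels:
        if is_member:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen  # true positives ranked above this false positive
            if false_seen == n:
                break
    # Assumption: if the list holds fewer than n false positives, the missing
    # ones are treated as ranking below every true positive found so far.
    area += (n - false_seen) * true_seen
    return area / (n * total_positives)

perfect = roc_n([True, True, False, False], n=2)
mixed = roc_n([True, False, True, False], n=2)
worst = roc_n([False, False, True, True], n=2)
print(perfect, mixed, worst)
```

Averaging `roc_n` over all families gives the single number used to compare methods.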

16 Comparison Protocol
- Caution!
  - True positive defined as being listed as a member of the family in the PROSITE compilation
  - Some false positives could be actual family members that were missed during PROSITE compilation!
  - (Should be minor effect)

17 Results
[Figure not reproduced. Methods shown: BLAST FPS, BLAST, HMMER, MAST.]

18 Conclusion
- FPS better than single sequence BLAST
- FPS better than model-based methods

19 Which is best (part 2)?
- Compare BLAST, BLAST FPS, cobbled BLAST, cobbled BLAST FPS
- W.N. Grundy and T.L. Bailey (1999) Family pairwise search with embedded motif models. Bioinformatics 15:463-470

20 Comparison Protocol
- Evaluation metric
  - rank sum
  - calculate difference in ROC 50 for two methods for a given family
  - sort by absolute value of difference
  - sum ranks of families for which one method is better than the other
  - Bigger is better!
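
The rank-sum steps above can be sketched as follows. The procedure resembles the statistic of a Wilcoxon signed-rank test, though this sketch only computes the two rank sums and no significance level. Dropping families with identical scores and averaging tied ranks are assumptions of this example, as are the function name and the ROC 50 values.

```python
def signed_rank_sums(roc_a, roc_b):
    """Return (rank sum where method A is better, rank sum where B is better).

    roc_a and roc_b hold per-family ROC 50 values in the same family order.
    Families with identical scores are dropped; tied absolute differences
    share the average of the ranks they span.
    """
    diffs = sorted((a - b for a, b in zip(roc_a, roc_b) if a != b), key=abs)
    sum_a = 0.0
    sum_b = 0.0
    i = 0
    while i < len(diffs):
        j = i
        while j < len(diffs) and abs(diffs[j]) == abs(diffs[i]):
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for diff in diffs[i:j]:
            if diff > 0:
                sum_a += average_rank
            else:
                sum_b += average_rank
        i = j
    return sum_a, sum_b

# Made-up per-family ROC 50 values for two methods.
method_a = [0.9, 0.5, 0.7, 0.6]
method_b = [0.8, 0.7, 0.7, 0.2]
sums = signed_rank_sums(method_a, method_b)
print(sums)
```

Ranking by absolute difference means a method gets more credit for winning by a wide margin on a family than for winning narrowly.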

21 Results

22 Conclusion
- For the task of finding members of a family given a reasonable number of known members of that family, cobbled FPS is the best currently available method!

