
1 Predicting Protein Function Using Machine-Learned Hierarchical Classifiers
Roman Eisner, September 23, 2005
Supervisors: Duane Szafron and Paul Lu

2 Outline
- Introduction
- Predictors
- Evaluation in a Hierarchy
- Local Predictor Design
- Experimental Results
- Conclusion


4 Proteins
- Functional units in the cell
- Perform a variety of functions, e.g. catalysis of reactions, structural and mechanical roles, transport of other molecules
- Can take years to study a single protein, so any good leads would be helpful!

5 Protein Function Prediction and Protein Function Determination
- Prediction: an estimate of what function a protein performs
- Determination: work in a laboratory to observe and discover what function a protein performs
- Prediction complements determination

6 Proteins
- Chain of amino acids (20 amino acids)
- FASTA format:
  >P18077 – R35A_HUMAN
  MSGRLWSKAIFAGYKRGLRNQREHTALLKIEGVYARDETEFYLGKR
  CAYVYKAKNNTVTPGGKPNKTRVIWGKVTRAHGNSGMVRAKFRSNL
  PAKAIGHRIRVMLYPSRI
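To make the format concrete, here is a minimal sketch (not from the original talk) of reading FASTA records like the one above into identifier/sequence pairs; the file name in the usage comment is hypothetical.

```python
def read_fasta(path):
    """Parse a FASTA file into a dict mapping record IDs to sequences."""
    records = {}
    current_id = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Header line: take the first token after '>' as the record ID.
                current_id = line[1:].split()[0]
                records[current_id] = []
            elif current_id is not None:
                records[current_id].append(line)
    return {rid: "".join(chunks) for rid, chunks in records.items()}

# Hypothetical usage: proteins.fasta would hold records such as >P18077 above.
# sequences = read_fasta("proteins.fasta")
# print(len(sequences["P18077"]))  # length of the amino-acid chain
```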

7 Ontologies
- Standardized vocabularies (a common language)
- In the biological literature, different terms can be used to describe the same function, e.g. "peroxiredoxin activity" and "thioredoxin peroxidase activity"
- Can be structured in a hierarchy to show relationships

8 Gene Ontology
- Directed Acyclic Graph (DAG)
- Always changing
- Describes 3 aspects of protein annotation:
  - Molecular Function
  - Biological Process
  - Cellular Component


10 Hierarchical Ontologies
- Can help to represent a large number of classes
- Represent both general and specific data
- Some data is incomplete and could become more specific in the future
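As an illustration of how such a hierarchy can be handled in code, here is a small sketch, assuming a made-up GO-like DAG stored as a parent map; the ancestor closure it computes is what the true path rule (used later for evaluation) relies on.

```python
# Illustrative GO-like DAG: each term maps to its set of parent terms.
# Term names (A1, A2, T, S, C1) are made up for the example.
PARENTS = {
    "A1": set(),          # root-level terms
    "A2": set(),
    "T": {"A1", "A2"},    # a term may have several parents in a DAG
    "S": {"A1", "A2"},
    "C1": {"T"},
}

def ancestors(term, parents=PARENTS):
    """Return all ancestors of a term by walking parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# ancestors("C1") -> {"T", "A1", "A2"}
```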

11 Incomplete Annotations

12 Goal
To predict the function of proteins given their sequence

13 Data Set
- Protein sequences: UniProt database
- Ontology: Gene Ontology, Molecular Function aspect
- Experimental annotations: Gene Ontology Annotation project @ EBI
- Pruned ontology: 406 nodes (out of 7,399) with ≥ 20 proteins
- Final data set: 14,362 proteins

14 Outline: Introduction, Predictors, Evaluation in a Hierarchy, Local Predictor Design, Experimental Results, Conclusion

15 Predictors
- Global: BLAST NN
- Local: PA-SVM, PFAM-SVM, Probabilistic Suffix Trees

16 Predictors
- Global: BLAST NN
- Local: PA-SVM, PFAM-SVM, Probabilistic Suffix Trees
- The SVM-based local predictors are linear

17 Why Linear SVMs?
- Accurate
- Explainable: each term in the dot product is meaningful
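A hedged sketch of this explainability property, using scikit-learn's LinearSVC on toy data with made-up feature names (illustrative only, not the thesis's actual feature set): because the decision value of a linear model is a dot product, each feature's contribution w_i * x_i can be reported alongside the prediction.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy data: rows are proteins, columns are binary features
# (e.g. presence of a keyword or domain); names are made up.
feature_names = ["kinase_keyword", "membrane_keyword", "atp_binding_domain"]
X = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 0],
              [0, 1, 1]])
y = np.array([1, 1, 0, 0])  # 1 = has the function, 0 = does not

clf = LinearSVC(C=1.0).fit(X, y)

# For a linear model the decision value is w . x + b, so each feature's
# contribution w_i * x_i explains the prediction for a query protein.
query = np.array([1, 0, 1])
contributions = clf.coef_[0] * query
for name, value in zip(feature_names, contributions):
    print(f"{name}: {value:+.3f}")
print("decision value:", float(clf.coef_[0] @ query + clf.intercept_[0]))
```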

18 PA-SVM (Proteome Analyst)

19 PFAM-SVM (Hidden Markov Models)

20 PST: Probabilistic Suffix Trees
- Efficient Markov chains
- Model the protein sequences directly
- Prediction from the per-class sequence models (see the sketch below)
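The sketch below illustrates the idea with a fixed-order Markov chain standing in for a probabilistic suffix tree (a PST uses variable-length contexts); the class whose model gives the query sequence the highest log-likelihood is predicted. This is a simplification under stated assumptions, not the thesis implementation.

```python
import math
from collections import defaultdict

class MarkovChainModel:
    """Fixed-order Markov chain over amino-acid sequences.

    A stand-in for a probabilistic suffix tree: a PST keeps variable-length
    contexts, while this sketch uses a single fixed context length k.
    """

    def __init__(self, k=2, alphabet="ACDEFGHIKLMNPQRSTVWY"):
        self.k = k
        self.alphabet = alphabet
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, sequences):
        for seq in sequences:
            for i in range(self.k, len(seq)):
                self.counts[seq[i - self.k:i]][seq[i]] += 1

    def log_likelihood(self, seq):
        total = 0.0
        for i in range(self.k, len(seq)):
            ctx, nxt = seq[i - self.k:i], seq[i]
            ctx_counts = self.counts[ctx]
            # Laplace smoothing so unseen transitions get non-zero probability.
            prob = (ctx_counts[nxt] + 1) / (sum(ctx_counts.values()) + len(self.alphabet))
            total += math.log(prob)
        return total

# One model per function class; predict the class whose model scores highest.
# models = {"kinase": MarkovChainModel(), "transporter": MarkovChainModel()}
# prediction = max(models, key=lambda c: models[c].log_likelihood(query_seq))
```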

21 BLAST
- Protein sequence alignment of a query protein against any set of protein sequences
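A hedged sketch of a BLAST nearest-neighbour predictor, assuming the NCBI BLAST+ blastp command is installed, a database built with makeblastdb, and a hypothetical `annotations` lookup from hit IDs to GO terms.

```python
import subprocess

def blast_nn_predict(query_fasta, blast_db, annotations, evalue_cutoff=1e-3):
    """Nearest-neighbour prediction: transfer the GO terms of the best BLAST hit.

    Assumes NCBI BLAST+ (blastp) is installed and `blast_db` was built with
    makeblastdb; `annotations` is a hypothetical dict: hit ID -> set of GO terms.
    """
    result = subprocess.run(
        ["blastp", "-query", query_fasta, "-db", blast_db,
         "-outfmt", "6 sseqid evalue", "-max_target_seqs", "5"],
        capture_output=True, text=True, check=True)
    for line in result.stdout.splitlines():
        hit_id, evalue = line.split("\t")
        if float(evalue) <= evalue_cutoff and hit_id in annotations:
            return annotations[hit_id]     # top acceptable hit wins
    return set()                           # no good BLAST hit -> no prediction
```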

22 BLAST

23 Outline: Introduction, Predictors, Evaluation in a Hierarchy, Local Predictor Design, Experimental Results, Conclusion

24 Evaluating Predictions in a Hierarchy
- Not all errors are equivalent: an error to a sibling is different from an error to an unrelated part of the hierarchy
- Proteins can perform more than one function, so predictions of multiple functions must be combined into a single measure

25 Evaluating Predictions in a Hierarchy
- Semantics of the hierarchy: the True Path Rule
- Protein labelled with {T} -> {T, A1, A2}
- Predicted functions: {S} -> {S, A1, A2}
- Precision = 2/3 = 67%, Recall = 2/3 = 67%

26 Evaluating Predictions in a Hierarchy
- Protein labelled with {T} -> {T, A1, A2}
- Predicted: {C1} -> {C1, T, A1, A2}
- Precision = 3/4 = 75%, Recall = 3/3 = 100%
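A small sketch reproducing these two worked examples, with a toy ancestor map standing in for the real Gene Ontology; precision and recall are computed over the ancestor-propagated (true path) label sets.

```python
# Toy ancestor map consistent with the slide examples: the true path rule
# says an annotation at a node implies all of its ancestors.
ANCESTORS = {
    "T": {"A1", "A2"},
    "S": {"A1", "A2"},
    "C1": {"T", "A1", "A2"},
    "A1": set(), "A2": set(),
}

def propagate(terms):
    """Expand a set of terms with all ancestors (true path rule)."""
    full = set(terms)
    for t in terms:
        full |= ANCESTORS[t]
    return full

def hierarchical_precision_recall(true_terms, predicted_terms):
    truth, pred = propagate(true_terms), propagate(predicted_terms)
    overlap = len(truth & pred)
    return overlap / len(pred), overlap / len(truth)

print(hierarchical_precision_recall({"T"}, {"S"}))    # (0.666..., 0.666...)
print(hierarchical_precision_recall({"T"}, {"C1"}))   # (0.75, 1.0)
```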

27 Supervised Learning

28 Cross-Validation
- Used to estimate the performance of the classification system on future data
- 5-fold cross-validation
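A minimal sketch of a single global split (the reason for fold isolation is discussed on the "Local vs Global Cross-Validation" slides at the end): proteins are partitioned once and every local predictor shares the same folds. Function and variable names are illustrative.

```python
import random

def global_folds(protein_ids, n_folds=5, seed=0):
    """Split proteins once, globally, into n folds.

    All local (per-node) predictors share this single split, so a protein is
    never in the training data of one node predictor and the test data of
    another within the same fold (fold isolation).
    """
    ids = list(protein_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::n_folds] for i in range(n_folds)]

# folds = global_folds(all_protein_ids)
# For fold i: test = folds[i], train = everything else.
```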

29 Outline: Introduction, Predictors, Evaluation in a Hierarchy, Local Predictor Design, Experimental Results, Conclusion

30 Inclusive vs Exclusive Local Predictors
- In a system of local predictors, how should each local predictor behave?
- Two extremes:
  - A local predictor predicts positive only for proteins that belong exactly at that node
  - A local predictor predicts positive for proteins that belong at or below that node in the hierarchy
- No a priori reason to choose either

31 Exclusive Local Predictors

32 Inclusive Local Predictors

33 Training Set Design
- Proteins in the current fold's training set can be used in any way
- For each local predictor we must select:
  - Positive training examples
  - Negative training examples

34 Training Set Design

35 Training Set Design

Scheme          Positive Examples     Negative Examples
Exclusive       T                     Not [T]
Less Exclusive  T                     Not [T U Descendants(T)]
Less Inclusive  T U Descendants(T)    Not [T U Descendants(T)]
Inclusive       T U Descendants(T)    Not [T U Descendants(T) U Ancestors(T)]
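A sketch of how the four schemes could be turned into training sets, assuming a hypothetical `proteins` map of direct (most specific) annotations and precomputed descendant/ancestor sets; it mirrors the table above rather than the thesis code.

```python
def training_sets(node, proteins, scheme, descendants, ancestors):
    """Build positive/negative training examples for one local predictor.

    `proteins` is a hypothetical dict: protein ID -> set of directly annotated
    nodes (most specific annotations, not propagated); `descendants` and
    `ancestors` map a node to the corresponding sets of nodes.
    """
    below = {node} | descendants[node]
    if scheme in ("exclusive", "less_exclusive"):
        positives = {p for p, terms in proteins.items() if node in terms}
    else:  # "less_inclusive" or "inclusive"
        positives = {p for p, terms in proteins.items() if terms & below}

    if scheme == "exclusive":
        excluded = {node}
    elif scheme in ("less_exclusive", "less_inclusive"):
        excluded = below
    else:  # "inclusive"
        excluded = below | ancestors[node]
    negatives = {p for p, terms in proteins.items() if not (terms & excluded)}
    return positives, negatives
```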


39 Comparing Training Set Design Schemes (using PA-SVM)

Method          Precision  Recall  F1-Measure  Exceptions per Protein
Exclusive       75.8%      32.8%   45.8%       1.52
Less Exclusive  77.7%      40.4%   53.1%       1.74
Less Inclusive  77.3%      63.8%   69.9%       0.05
Inclusive       75.3%      65.2%   69.9%       0.09

40 Exclusive schemes have more exceptions

41 Lowering the Cost of Local Predictors
- Top-Down: compute local predictors from the top of the hierarchy downward until a negative prediction is reached
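A sketch of the top-down search, assuming a `children` map over the hierarchy and a hypothetical `predict_node` callable wrapping each local predictor.

```python
def top_down_predict(children, predict_node, root):
    """Top-down search: evaluate a node's local predictor only if its parent
    predicted positive; stop descending at the first negative prediction.

    `children` maps node -> child nodes; `predict_node(node)` is a hypothetical
    callable wrapping that node's local classifier and returning True/False.
    """
    predicted, visited = set(), set()
    frontier, evaluated = [root], 0
    while frontier:
        node = frontier.pop()
        if node in visited:      # a DAG node may be reachable via several parents
            continue
        visited.add(node)
        evaluated += 1
        if predict_node(node):
            predicted.add(node)
            frontier.extend(children.get(node, ()))
    return predicted, evaluated  # evaluated is typically far below the node count
```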


44 Top-Down Search

Method          Previous F1-Measure  Top-Down F1-Measure  Local Predictors Computed
Exclusive       45.8%                0.4%                 10
Less Exclusive  53.1%                2.7%                 10
Less Inclusive  69.9%                69.8%                32
Inclusive       69.9%                69.9%                32

45 Outline: Introduction, Predictors, Evaluation in a Hierarchy, Local Predictor Design, Experimental Results, Conclusion

46 Predictor Results

Predictor  Precision  Recall
PA-SVM     75.4%      64.8%
PFAM-SVM   74.0%      57.5%
PST        57.5%      63.6%
BLAST      76.7%      69.6%
Voting     76.3%      73.3%
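The "Voting" row combines the individual predictors; the slides do not state the exact combination rule, so the sketch below shows one plausible reading: a simple vote over the node sets proposed by each predictor, with an assumed threshold.

```python
from collections import Counter

def vote(predictions, min_votes=2):
    """Combine several predictors' node sets by simple voting.

    `predictions` is a list of sets of ontology nodes, one per predictor
    (e.g. PA-SVM, PFAM-SVM, PST, BLAST). A node is kept if at least
    `min_votes` predictors proposed it; the threshold is an assumption,
    not a value taken from the slides.
    """
    counts = Counter(node for pred in predictions for node in pred)
    return {node for node, c in counts.items() if c >= min_votes}

# voted = vote([pa_svm_nodes, pfam_svm_nodes, pst_nodes, blast_nodes])
```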

47 Similar and Dissimilar Proteins
- 89% of proteins have at least one good BLAST hit: proteins similar (often homologous) to the set of well studied proteins
- 11% of proteins have no good BLAST hit: proteins not similar to the set of well studied proteins

48 Coverage
- Coverage: percentage of proteins for which a prediction is made

Organism         Good BLAST Hit  No Good BLAST Hit
D. melanogaster  60%             40%
S. cerevisiae    62%             38%

49 Similar Proteins: Exploiting BLAST
- BLAST is fast and accurate when a good hit is found; this can be exploited to lower the cost of local predictors
- Generate candidate nodes, and only compute local predictors for candidate nodes
- The candidate node set should have high recall and minimal size

50 Similar Proteins: Exploiting BLAST
- Candidate node generating methods:
  - Searching outward from the BLAST hit
  - Taking the union of more than one BLAST hit's annotations (see the sketch below)
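A sketch of the union-based method, assuming "BLAST-2-Union" means taking the union of the top two BLAST hits' annotation sets; the hit list and annotation lookup are hypothetical.

```python
def blast_union_candidates(hits, annotations, n_hits=2):
    """Candidate nodes from the union of the top BLAST hits' annotations.

    `hits` is a list of hit IDs sorted by BLAST score (best first) and
    `annotations` a hypothetical dict: hit ID -> set of ontology nodes.
    With n_hits=2 this matches one reading of the "BLAST-2-Union" method;
    only these candidate nodes are then passed to the local predictors.
    """
    candidates = set()
    for hit in hits[:n_hits]:
        candidates |= annotations.get(hit, set())
    return candidates
```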

51 Similar Proteins: Exploiting BLAST

Method          Precision  Recall  Avg Cost per Protein
All             77%        80%     1219
Top-Down        77%        79%     111
BLAST-2-Union   79%        78%     20
BLAST-Search-3  78%        78%     221

52 Dissimilar Proteins (the more interesting case)

Method           Precision  Recall  Avg Cost per Protein
BLAST            19%        20%     1
Voting           55%        32%     812
Top-Down Voting  56%        32%     58

53 Comparison to Protfun
- On a pruned ontology (9 Gene Ontology classes)
- On 1,637 "no good BLAST hit" proteins

Method   Precision  Recall
Protfun  14%        13%
Voting   69%        29%

54 Future Work
- Try the other two ontology aspects: biological process and cellular component
- Use other local predictors
- More parameter tuning
- Predictor cost

55 Conclusion
- Protein function prediction provides good leads for protein function determination
- Hierarchical ontologies can represent incomplete data, allowing the prediction of more functions
- Considering the hierarchy makes predictions more accurate and less computationally intensive
- The methods presented have higher coverage than BLAST alone
- Results accepted to IEEE CIBCB 2005

56 Thanks to...
- Duane Szafron and Paul Lu
- Brett Poulin and Russ Greiner
- Everyone in the Proteome Analyst research group

57 Incomplete Data & Prediction
- Inclusive avoids using ambiguous (incomplete) training data; does this help?
- To test:
  - Train on more incomplete data: choose X% of proteins and move one annotation up
  - Evaluate predictions on the "complete" data
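A sketch of this perturbation, assuming per-protein annotation sets and a parent map; for a chosen fraction of proteins one annotation is replaced by one of its parents, making it less specific.

```python
import random

def make_more_incomplete(proteins, parents, fraction, seed=0):
    """Simulate incomplete data: for a random fraction of proteins, replace one
    annotation with one of its parents (i.e. make it less specific).

    `proteins` is a hypothetical dict: protein ID -> set of annotated nodes;
    `parents` maps a node to its parent nodes. Proteins annotated only at
    root-level nodes are left unchanged.
    """
    rng = random.Random(seed)
    perturbed = {pid: set(terms) for pid, terms in proteins.items()}
    chosen = rng.sample(sorted(perturbed), int(fraction * len(perturbed)))
    for pid in chosen:
        movable = [t for t in perturbed[pid] if parents.get(t)]
        if movable:
            term = rng.choice(movable)
            perturbed[pid].discard(term)
            perturbed[pid].add(rng.choice(sorted(parents[term])))
    return perturbed
```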

58 Robustness to Incomplete Data

59 Local vs Global Cross-Validation
- Some node predictors have as few as 20 positive examples
- How should cross-validation be done so that each predictor has enough positive training examples?

60 Local vs Global Cross-Validation
- Local cross-validation is invalid: predictions must be consistent, which requires fold isolation
- A single global split: global cross-validation

