Predicting Protein Function Using Machine-Learned Hierarchical Classifiers
Roman Eisner
Supervisors: Duane Szafron and Paul Lu
09/23/2005, eisner@cs.ualberta.ca
Outline
- Introduction
- Predictors
- Evaluation in a Hierarchy
- Local Predictor Design
- Experimental Results
- Conclusion
Proteins
- Functional units in the cell
- Perform a variety of functions, e.g. catalysis of reactions, structural and mechanical roles, transport of other molecules
- Can take years to study a single protein
- Any good leads would be helpful!
Protein Function Prediction and Protein Function Determination
- Prediction: an estimate of what function a protein performs
- Determination: work in a laboratory to observe and discover what function a protein performs
- Prediction complements determination
Proteins
- Chain of amino acids
- 20 amino acids
- FastA format:

>P18077 R35A_HUMAN
MSGRLWSKAIFAGYKRGLRNQREHTALLKIEGVYARDETEFYLGKR
CAYVYKAKNNTVTPGGKPNKTRVIWGKVTRAHGNSGMVRAKFRSNL
PAKAIGHRIRVMLYPSRI
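The FastA record above can be read with a few lines of Python. This parser is a hypothetical helper, not part of the talk's system; it simply joins the wrapped sequence lines under each `>` header.

```python
# Minimal FastA parser sketch (hypothetical helper, not from the talk):
# a record is a ">" header line followed by wrapped sequence lines.
def parse_fasta(text):
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()[0]  # e.g. accession "P18077"
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}

example = """>P18077 R35A_HUMAN
MSGRLWSKAIFAGYKRGLRNQREHTALLKIEGVYARDETEFYLGKR
CAYVYKAKNNTVTPGGKPNKTRVIWGKVTRAHGNSGMVRAKFRSNL
PAKAIGHRIRVMLYPSRI"""
seqs = parse_fasta(example)
```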
Ontologies
- Standardized vocabularies (a common language)
- In the biological literature, different terms can be used to describe the same function, e.g. "peroxiredoxin activity" and "thioredoxin peroxidase activity"
- Can be structured in a hierarchy to show relationships
Gene Ontology
- Directed Acyclic Graph (DAG)
- Always changing
- Describes 3 aspects of protein annotations: Molecular Function, Biological Process, Cellular Component
Hierarchical Ontologies
- Can help to represent a large number of classes
- Represent general and specific data
- Some data is incomplete, and could become more specific in the future
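A sketch of how annotations behave in such a hierarchy (the talk later calls this the True Path Rule): a label at a node implies every ancestor of that node in the DAG. The node names below are hypothetical.

```python
# Toy ontology DAG, edges child -> parents (hypothetical node names).
parents = {
    "T": ["A1"],
    "S": ["A1"],
    "A1": ["A2"],
    "A2": [],
    "C1": ["T"],
}

def ancestors(node, parents):
    """All ancestors of `node` in the DAG (not including the node itself)."""
    seen = set()
    stack = list(parents[node])
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents[p])
    return seen

def expand(labels, parents):
    """Close a label set upward: each annotation implies all its ancestors."""
    out = set(labels)
    for n in labels:
        out |= ancestors(n, parents)
    return out
```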
Incomplete Annotations
Goal
To predict the function of proteins given their sequence
Data Set
- Protein sequences: UniProt database
- Ontology: Gene Ontology, Molecular Function aspect
- Experimental annotations: Gene Ontology Annotation project @ EBI
- Pruned ontology: 406 nodes (out of 7,399) with ≥ 20 proteins
- Final data set: 14,362 proteins
Predictors
- Global: BLAST NN
- Local: PA-SVM (linear), PFAM-SVM (linear), Probabilistic Suffix Trees
Why Linear SVMs?
- Accurate
- Explainable: each term in the dot product is meaningful
PA-SVM
Proteome Analyst
PFAM-SVM
Hidden Markov Models
PST: Probabilistic Suffix Trees
- Efficient Markov chains
- Model the protein sequences directly
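As a rough illustration, a fixed-order Markov chain per class approximates what the variable-order probabilistic suffix trees do: score a sequence by the probability of each residue given its preceding context, then predict the class whose model gives the higher likelihood. This is a simplified sketch with toy data, not the talk's PST implementation.

```python
import math

# A fixed-order Markov chain per class, a simplified stand-in for the
# variable-order probabilistic suffix trees used in the talk.
AMINO = "ACDEFGHIKLMNPQRSTVWY"

def train_markov(seqs, k=1):
    counts = {}
    for s in seqs:
        for i in range(k, len(s)):
            ctx, a = s[i - k:i], s[i]
            counts.setdefault(ctx, {})
            counts[ctx][a] = counts[ctx].get(a, 0) + 1
    # add-one smoothing over the 20-letter amino-acid alphabet
    model = {}
    for ctx, nxt in counts.items():
        total = sum(nxt.values()) + len(AMINO)
        model[ctx] = {a: (nxt.get(a, 0) + 1) / total for a in AMINO}
    return model

def log_likelihood(model, s, k=1):
    ll = 0.0
    for i in range(k, len(s)):
        probs = model.get(s[i - k:i])
        # unseen context: back off to a uniform distribution
        ll += math.log(probs[s[i]]) if probs else math.log(1.0 / len(AMINO))
    return ll

# Predict by comparing class-conditional likelihoods (toy training data):
model_a = train_markov(["AAAAAAAAAA"])
model_c = train_markov(["CCCCCCCCCC"])
query = "AAAA"
```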
BLAST
Protein sequence alignment of a query protein against any set of protein sequences
Evaluating Predictions in a Hierarchy
- Not all errors are equivalent: an error to a sibling differs from an error to an unrelated part of the hierarchy
- Proteins can perform more than one function, so predictions of multiple functions must be combined into a single measure
Evaluating Predictions in a Hierarchy
- Semantics of the hierarchy: the True Path Rule
- Protein labelled with {T} -> {T, A1, A2}
- Predicted functions: {S} -> {S, A1, A2}
- Precision = 2/3 = 67%, Recall = 2/3 = 67%
Evaluating Predictions in a Hierarchy
- Protein labelled with {T} -> {T, A1, A2}
- Predicted: {C1} -> {C1, T, A1, A2}
- Precision = 3/4 = 75%, Recall = 3/3 = 100%
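The two examples above amount to ordinary set precision and recall computed on the upward-closed label sets. A minimal sketch, using the slide's node names:

```python
# Hierarchy-aware precision/recall: both the true and predicted label
# sets are first closed upward under the True Path Rule, then compared
# as plain sets.
def hier_precision_recall(true_set, pred_set):
    overlap = len(true_set & pred_set)
    return overlap / len(pred_set), overlap / len(true_set)

# Slide example: true {T} expands to {T, A1, A2};
# predicting the sibling S expands to {S, A1, A2}.
p1, r1 = hier_precision_recall({"T", "A1", "A2"}, {"S", "A1", "A2"})

# Predicting the child C1 expands to {C1, T, A1, A2}.
p2, r2 = hier_precision_recall({"T", "A1", "A2"}, {"C1", "T", "A1", "A2"})
```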
Supervised Learning
Cross-Validation
- Used to estimate the performance of a classification system on future data
- 5-fold cross-validation
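A minimal sketch of how 5-fold cross-validation partitions the data: each protein lands in exactly one test fold and in the training set of the other four folds. Plain Python, no ML library assumed.

```python
# k-fold index splitting: fold i holds every k-th item starting at i.
def kfold_indices(n, k=5):
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train, test))
    return splits

splits = kfold_indices(10, k=5)
```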
Inclusive vs Exclusive Local Predictors
- In a system of local predictors, how should each local predictor behave?
- Two extremes:
  - A local predictor predicts positive only for proteins that belong exactly at that node
  - A local predictor predicts positive for proteins that belong at or below that node in the hierarchy
- There is no a priori reason to choose either
Exclusive Local Predictors
Inclusive Local Predictors
Training Set Design
- Proteins in the current fold's training set can be used in any way
- For each local predictor we need to select positive and negative training examples
Training Set Design

Method          Positive Examples   Negative Examples
Exclusive       T                   Not [T]
Less Exclusive  T                   Not [T ∪ Descendants(T)]
Less Inclusive  T ∪ Descendants(T)  Not [T ∪ Descendants(T)]
Inclusive       T ∪ Descendants(T)  Not [T ∪ Descendants(T) ∪ Ancestors(T)]
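The four schemes can be sketched as set operations over each protein's most specific (unexpanded) annotations. All protein and node names below are hypothetical toy data, not the talk's data set.

```python
# Sketch of the four training-set schemes for the local predictor at node T.
# `annotations` holds each protein's most specific (unexpanded) labels.
def training_sets(T, scheme, annotations, descendants, ancestors):
    below = {T} | descendants[T]
    if scheme in ("exclusive", "less_exclusive"):
        pos = {p for p, ann in annotations.items() if T in ann}
    else:  # "less_inclusive", "inclusive"
        pos = {p for p, ann in annotations.items() if ann & below}
    if scheme == "exclusive":
        neg = {p for p, ann in annotations.items() if T not in ann}
    elif scheme in ("less_exclusive", "less_inclusive"):
        neg = {p for p, ann in annotations.items() if not ann & below}
    else:  # "inclusive": also exclude ancestors, whose data may be incomplete
        excluded = below | ancestors[T]
        neg = {p for p, ann in annotations.items() if not ann & excluded}
    return pos, neg

# Toy hierarchy: C1 below T, A1/A2 above T; X is unrelated.
descendants = {"T": {"C1"}}
ancestors = {"T": {"A1", "A2"}}
annotations = {"p1": {"T"}, "p2": {"C1"}, "p3": {"A1"}, "p4": {"X"}}
```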
Comparing Training Set Design Schemes Using PA-SVM

Method          Precision  Recall  F1-Measure  Exceptions per Protein
Exclusive       75.8%      32.8%   45.8%       1.52
Less Exclusive  77.7%      40.4%   53.1%       1.74
Less Inclusive  77.3%      63.8%   69.9%       0.05
Inclusive       75.3%      65.2%   69.9%       0.09
Exclusive schemes have more exceptions
Lowering the Cost of Local Predictors: Top-Down
Compute local predictors top to bottom until a negative prediction is reached
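The top-down strategy can be sketched recursively. A tree is shown for simplicity; in the Gene Ontology DAG a node can be reached through several parents, which this sketch ignores. Node names are made up.

```python
# Top-down evaluation: a node's local predictor runs only if its parent
# predicted positive; descent stops at the first negative prediction.
def top_down(node, children, predict, evaluated):
    evaluated.append(node)            # count this predictor as computed
    if not predict(node):
        return set()                  # stop descending on a negative
    positive = {node}
    for child in children.get(node, []):
        positive |= top_down(child, children, predict, evaluated)
    return positive

# Toy hierarchy; the predictor is positive only on R and A.
children = {"R": ["A", "B"], "A": ["C"], "B": ["D"]}
evaluated = []
positives = top_down("R", children, lambda n: n in {"R", "A"}, evaluated)
```

Note how D is never evaluated: its parent B predicted negative, which is where the cost saving in the next slide's table comes from.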
Top-Down Search

Method          Previous F1-Measure  Top-Down F1-Measure  Local Predictors Computed
Exclusive       45.8%                0.4%                 10
Less Exclusive  53.1%                2.7%                 10
Less Inclusive  69.9%                69.8%                32
Inclusive       69.9%                69.9%                32
Predictor Results

Predictor  Precision  Recall
PA-SVM     75.4%      64.8%
PFAM-SVM   74.0%      57.5%
PST        57.5%      63.6%
BLAST      76.7%      69.6%
Voting     76.3%      73.3%
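The Voting row combines the individual predictors. A plain per-node majority vote is one simple way to do this; the talk does not spell out its exact combination rule here, so this sketch is an assumption.

```python
# Majority vote over per-node predictions: a node is kept if more than
# half of the predictors include it in their predicted set.
def majority_vote(node_predictions):
    """node_predictions: one set of predicted nodes per predictor."""
    votes = {}
    for pred in node_predictions:
        for n in pred:
            votes[n] = votes.get(n, 0) + 1
    threshold = len(node_predictions) / 2
    return {n for n, v in votes.items() if v > threshold}
```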
Similar and Dissimilar Proteins
- 89% of proteins have at least one good BLAST hit: proteins that are similar (often homologous) to the set of well-studied proteins
- 11% of proteins have no good BLAST hit: proteins that are not similar to the set of well-studied proteins
Coverage
Coverage: percentage of proteins for which a prediction is made

Organism         Good BLAST Hit  No Good BLAST Hit
D. melanogaster  60%             40%
S. cerevisiae    62%             38%
Similar Proteins: Exploiting BLAST
- BLAST is fast and accurate when a good hit is found
- We can exploit this to lower the cost of local predictors: generate candidate nodes, and only compute local predictors for candidate nodes
- The candidate node set should have high recall and minimal size
Similar Proteins: Exploiting BLAST
Candidate node generating methods:
- Searching outward from a BLAST hit
- Taking the union of more than one BLAST hit's annotations
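The union method (BLAST-2-Union in the next table) can be sketched as pooling the annotation sets of the top two hits. The hit ids and annotations below are made up, and the annotation sets are assumed to be already closed upward.

```python
# Candidate nodes from the annotations of the top-n BLAST hits
# (hypothetical hit ids and annotation sets).
def blast_union_candidates(hits, annotations, n=2):
    """hits: list of hit protein ids, best hit first."""
    candidates = set()
    for h in hits[:n]:
        candidates |= annotations.get(h, set())
    return candidates

annotations = {"h1": {"T", "A1"}, "h2": {"S", "A1"}, "h3": {"X"}}
```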
Similar Proteins: Exploiting BLAST

Method          Precision  Recall  Avg Cost per Protein
All             77%        80%     1219
Top-Down        77%        79%     111
BLAST-2-Union   79%        78%     20
BLAST-Search-3  78%        78%     221
Dissimilar Proteins (the more interesting case)

Method           Precision  Recall  Avg Cost per Protein
BLAST            19%        20%     1
Voting           55%        32%     812
Top-Down Voting  56%        32%     58
Comparison to Protfun
- On a pruned ontology (9 Gene Ontology classes)
- On 1,637 "no good BLAST hit" proteins

Method   Precision  Recall
Protfun  14%        13%
Voting   69%        29%
Future Work
- Try the other two ontology aspects: Biological Process and Cellular Component
- Use other local predictors
- More parameter tuning
- Predictor cost
Conclusion
- Protein function prediction provides good leads for protein function determination
- Hierarchical ontologies can represent incomplete data, allowing the prediction of more functions
- Considering the hierarchy makes prediction more accurate and less computationally intensive
- The methods presented have higher coverage than BLAST alone
- Results accepted to IEEE CIBCB 2005
Thanks to…
- Duane Szafron and Paul Lu
- Brett Poulin and Russ Greiner
- Everyone in the Proteome Analyst research group
Incomplete Data & Prediction
- Inclusive avoids using ambiguous (incomplete) training data. Does this help?
- To test: train on more incomplete data (choose X% of proteins and move one annotation up), then evaluate predictions on the "complete" data
Robustness to Incomplete Data
Local vs Global Cross-Validation
- Some node predictors have as few as 20 positive examples
- How do we do cross-validation so that each predictor has enough positive training examples?
Local vs Global Cross-Validation
- Local cross-validation is invalid: predictions must be consistent, which requires fold isolation
- A single global split: global cross-validation