
1 Predicting Protein Function Using Machine-Learned Hierarchical Classifiers
Roman Eisner, September 23, 2005
Supervisors: Duane Szafron and Paul Lu

2 Outline
- Introduction
- Predictors
- Evaluation in a Hierarchy
- Local Predictor Design
- Experimental Results
- Conclusion


4 Proteins
- Functional units in the cell
- Perform a variety of functions, e.g. catalysis of reactions, structural and mechanical roles, transport of other molecules
- Can take years to study a single protein, so any good leads would be helpful!

5 Protein Function Prediction and Protein Function Determination
- Prediction: an estimate of what function a protein performs
- Determination: work in a laboratory to observe and discover what function a protein performs
- Prediction complements determination

6 Proteins
- Chain of amino acids (20 amino acids)
- FASTA format:
  >P18077 – R35A_HUMAN
  MSGRLWSKAIFAGYKRGLRNQREHTALLKIEGVYARDETEFYLGKR
  CAYVYKAKNNTVTPGGKPNKTRVIWGKVTRAHGNSGMVRAKFRSNL
  PAKAIGHRIRVMLYPSRI
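To make the format concrete, here is a minimal sketch (not from the original talk) of reading FASTA records like the one above into identifier/sequence pairs; the file name in the usage comment is hypothetical.

```python
def read_fasta(path):
    """Parse a FASTA file into a dict mapping record IDs to sequences."""
    records = {}
    current_id = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Header line: take the first token after '>' as the record ID.
                current_id = line[1:].split()[0]
                records[current_id] = []
            elif current_id is not None:
                records[current_id].append(line)
    return {rid: "".join(chunks) for rid, chunks in records.items()}

# Hypothetical usage: proteins.fasta would hold records such as >P18077 above.
# sequences = read_fasta("proteins.fasta")
# print(len(sequences["P18077"]))  # length of the amino-acid chain
```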

7 Ontologies
- Standardized vocabularies (a common language)
- In the biological literature, different terms can be used to describe the same function, e.g. "peroxiredoxin activity" and "thioredoxin peroxidase activity"
- Can be structured in a hierarchy to show relationships

8 Gene Ontology
- Directed Acyclic Graph (DAG)
- Always changing
- Describes 3 aspects of protein annotation:
  - Molecular Function
  - Biological Process
  - Cellular Component


10 Hierarchical Ontologies
- Can help to represent a large number of classes
- Represent both general and specific data
- Some data is incomplete and could become more specific in the future
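As an illustration of how such a hierarchy can be handled in code, here is a small sketch, assuming a made-up GO-like DAG stored as a parent map; the ancestor closure it computes is what the true path rule (used later for evaluation) relies on.

```python
# Illustrative GO-like DAG: each term maps to its set of parent terms.
# Term names (A1, A2, T, S, C1) are made up for the example.
PARENTS = {
    "A1": set(),          # root-level terms
    "A2": set(),
    "T": {"A1", "A2"},    # a term may have several parents in a DAG
    "S": {"A1", "A2"},
    "C1": {"T"},
}

def ancestors(term, parents=PARENTS):
    """Return all ancestors of a term by walking parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# ancestors("C1") -> {"T", "A1", "A2"}
```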

11 Incomplete Annotations

12 Goal
To predict the function of proteins given their sequence

13 Data Set
- Protein sequences: UniProt database
- Ontology: Gene Ontology, Molecular Function aspect
- Experimental annotations: Gene Ontology Annotation project @ EBI
- Pruned ontology: 406 nodes (out of 7,399) with ≥ 20 proteins
- Final data set: 14,362 proteins

14 Outline: Introduction, Predictors, Evaluation in a Hierarchy, Local Predictor Design, Experimental Results, Conclusion

15 Predictors
- Global: BLAST NN
- Local: PA-SVM, PFAM-SVM, Probabilistic Suffix Trees

16 Predictors
- Global: BLAST NN
- Local: PA-SVM, PFAM-SVM, Probabilistic Suffix Trees
- The SVM-based local predictors are linear

17 Why Linear SVMs?
- Accurate
- Explainable: each term in the dot product is meaningful
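A hedged sketch of this explainability property, using scikit-learn's LinearSVC on toy data with made-up feature names (illustrative only, not the thesis's actual feature set): because the decision value of a linear model is a dot product, each feature's contribution w_i * x_i can be reported alongside the prediction.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy data: rows are proteins, columns are binary features
# (e.g. presence of a keyword or domain); names are made up.
feature_names = ["kinase_keyword", "membrane_keyword", "atp_binding_domain"]
X = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 0],
              [0, 1, 1]])
y = np.array([1, 1, 0, 0])  # 1 = has the function, 0 = does not

clf = LinearSVC(C=1.0).fit(X, y)

# For a linear model the decision value is w . x + b, so each feature's
# contribution w_i * x_i explains the prediction for a query protein.
query = np.array([1, 0, 1])
contributions = clf.coef_[0] * query
for name, value in zip(feature_names, contributions):
    print(f"{name}: {value:+.3f}")
print("decision value:", float(clf.coef_[0] @ query + clf.intercept_[0]))
```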

18 PA-SVM (Proteome Analyst)

19 PFAM-SVM (Hidden Markov Models)

20 PST: Probabilistic Suffix Trees
- Efficient Markov chains
- Model the protein sequences directly
- Prediction from the per-class sequence models (see the sketch below)
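The sketch below illustrates the idea with a fixed-order Markov chain standing in for a probabilistic suffix tree (a PST uses variable-length contexts); the class whose model gives the query sequence the highest log-likelihood is predicted. This is a simplification under stated assumptions, not the thesis implementation.

```python
import math
from collections import defaultdict

class MarkovChainModel:
    """Fixed-order Markov chain over amino-acid sequences.

    A stand-in for a probabilistic suffix tree: a PST keeps variable-length
    contexts, while this sketch uses a single fixed context length k.
    """

    def __init__(self, k=2, alphabet="ACDEFGHIKLMNPQRSTVWY"):
        self.k = k
        self.alphabet = alphabet
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, sequences):
        for seq in sequences:
            for i in range(self.k, len(seq)):
                self.counts[seq[i - self.k:i]][seq[i]] += 1

    def log_likelihood(self, seq):
        total = 0.0
        for i in range(self.k, len(seq)):
            ctx, nxt = seq[i - self.k:i], seq[i]
            ctx_counts = self.counts[ctx]
            # Laplace smoothing so unseen transitions get non-zero probability.
            prob = (ctx_counts[nxt] + 1) / (sum(ctx_counts.values()) + len(self.alphabet))
            total += math.log(prob)
        return total

# One model per function class; predict the class whose model scores highest.
# models = {"kinase": MarkovChainModel(), "transporter": MarkovChainModel()}
# prediction = max(models, key=lambda c: models[c].log_likelihood(query_seq))
```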

21 BLAST
- Protein sequence alignment of a query protein against any set of protein sequences
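A hedged sketch of a BLAST nearest-neighbour predictor, assuming the NCBI BLAST+ blastp command is installed, a database built with makeblastdb, and a hypothetical `annotations` lookup from hit IDs to GO terms.

```python
import subprocess

def blast_nn_predict(query_fasta, blast_db, annotations, evalue_cutoff=1e-3):
    """Nearest-neighbour prediction: transfer the GO terms of the best BLAST hit.

    Assumes NCBI BLAST+ (blastp) is installed and `blast_db` was built with
    makeblastdb; `annotations` is a hypothetical dict: hit ID -> set of GO terms.
    """
    result = subprocess.run(
        ["blastp", "-query", query_fasta, "-db", blast_db,
         "-outfmt", "6 sseqid evalue", "-max_target_seqs", "5"],
        capture_output=True, text=True, check=True)
    for line in result.stdout.splitlines():
        hit_id, evalue = line.split("\t")
        if float(evalue) <= evalue_cutoff and hit_id in annotations:
            return annotations[hit_id]     # top acceptable hit wins
    return set()                           # no good BLAST hit -> no prediction
```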

22 BLAST

23 Outline: Introduction, Predictors, Evaluation in a Hierarchy, Local Predictor Design, Experimental Results, Conclusion

24 Evaluating Predictions in a Hierarchy
- Not all errors are equivalent: an error to a sibling is different from an error to an unrelated part of the hierarchy
- Proteins can perform more than one function, so predictions of multiple functions must be combined into a single measure

25 Evaluating Predictions in a Hierarchy
- Semantics of the hierarchy: the True Path Rule
- Protein labelled with {T} -> {T, A1, A2}
- Predicted functions: {S} -> {S, A1, A2}
- Precision = 2/3 = 67%, Recall = 2/3 = 67%

26 Evaluating Predictions in a Hierarchy
- Protein labelled with {T} -> {T, A1, A2}
- Predicted: {C1} -> {C1, T, A1, A2}
- Precision = 3/4 = 75%, Recall = 3/3 = 100%
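A small sketch reproducing these two worked examples, with a toy ancestor map standing in for the real Gene Ontology; precision and recall are computed over the ancestor-propagated (true path) label sets.

```python
# Toy ancestor map consistent with the slide examples: the true path rule
# says an annotation at a node implies all of its ancestors.
ANCESTORS = {
    "T": {"A1", "A2"},
    "S": {"A1", "A2"},
    "C1": {"T", "A1", "A2"},
    "A1": set(), "A2": set(),
}

def propagate(terms):
    """Expand a set of terms with all ancestors (true path rule)."""
    full = set(terms)
    for t in terms:
        full |= ANCESTORS[t]
    return full

def hierarchical_precision_recall(true_terms, predicted_terms):
    truth, pred = propagate(true_terms), propagate(predicted_terms)
    overlap = len(truth & pred)
    return overlap / len(pred), overlap / len(truth)

print(hierarchical_precision_recall({"T"}, {"S"}))    # (0.666..., 0.666...)
print(hierarchical_precision_recall({"T"}, {"C1"}))   # (0.75, 1.0)
```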

27 Supervised Learning

28 Cross-Validation
- Used to estimate the performance of the classification system on future data
- 5-fold cross-validation
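A minimal sketch of a single global split (the reason for fold isolation is discussed on the "Local vs Global Cross-Validation" slides at the end): proteins are partitioned once and every local predictor shares the same folds. Function and variable names are illustrative.

```python
import random

def global_folds(protein_ids, n_folds=5, seed=0):
    """Split proteins once, globally, into n folds.

    All local (per-node) predictors share this single split, so a protein is
    never in the training data of one node predictor and the test data of
    another within the same fold (fold isolation).
    """
    ids = list(protein_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::n_folds] for i in range(n_folds)]

# folds = global_folds(all_protein_ids)
# For fold i: test = folds[i], train = everything else.
```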

29 Outline: Introduction, Predictors, Evaluation in a Hierarchy, Local Predictor Design, Experimental Results, Conclusion

30 Inclusive vs Exclusive Local Predictors
- In a system of local predictors, how should each local predictor behave?
- Two extremes:
  - A local predictor predicts positive only for proteins that belong exactly at that node
  - A local predictor predicts positive for proteins that belong at or below that node in the hierarchy
- No a priori reason to choose either

31 Exclusive Local Predictors

32 Inclusive Local Predictors

33 Training Set Design
- Proteins in the current fold's training set can be used in any way
- For each local predictor we must select:
  - Positive training examples
  - Negative training examples

34 Training Set Design

35 Training Set Design

Scheme          Positive Examples     Negative Examples
Exclusive       T                     Not [T]
Less Exclusive  T                     Not [T U Descendants(T)]
Less Inclusive  T U Descendants(T)    Not [T U Descendants(T)]
Inclusive       T U Descendants(T)    Not [T U Descendants(T) U Ancestors(T)]
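A sketch of how the four schemes could be turned into training sets, assuming a hypothetical `proteins` map of direct (most specific) annotations and precomputed descendant/ancestor sets; it mirrors the table above rather than the thesis code.

```python
def training_sets(node, proteins, scheme, descendants, ancestors):
    """Build positive/negative training examples for one local predictor.

    `proteins` is a hypothetical dict: protein ID -> set of directly annotated
    nodes (most specific annotations, not propagated); `descendants` and
    `ancestors` map a node to the corresponding sets of nodes.
    """
    below = {node} | descendants[node]
    if scheme in ("exclusive", "less_exclusive"):
        positives = {p for p, terms in proteins.items() if node in terms}
    else:  # "less_inclusive" or "inclusive"
        positives = {p for p, terms in proteins.items() if terms & below}

    if scheme == "exclusive":
        excluded = {node}
    elif scheme in ("less_exclusive", "less_inclusive"):
        excluded = below
    else:  # "inclusive"
        excluded = below | ancestors[node]
    negatives = {p for p, terms in proteins.items() if not (terms & excluded)}
    return positives, negatives
```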


39 Comparing Training Set Design Schemes (using PA-SVM)

Method          Precision  Recall  F1-Measure  Exceptions per Protein
Exclusive       75.8%      32.8%   45.8%       1.52
Less Exclusive  77.7%      40.4%   53.1%       1.74
Less Inclusive  77.3%      63.8%   69.9%       0.05
Inclusive       75.3%      65.2%   69.9%       0.09

40 Exclusive schemes have more exceptions

41 Lowering the Cost of Local Predictors
- Top-Down: compute local predictors from the top of the hierarchy downward until a negative prediction is reached
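A sketch of the top-down search, assuming a `children` map over the hierarchy and a hypothetical `predict_node` callable wrapping each local predictor.

```python
def top_down_predict(children, predict_node, root):
    """Top-down search: evaluate a node's local predictor only if its parent
    predicted positive; stop descending at the first negative prediction.

    `children` maps node -> child nodes; `predict_node(node)` is a hypothetical
    callable wrapping that node's local classifier and returning True/False.
    """
    predicted, visited = set(), set()
    frontier, evaluated = [root], 0
    while frontier:
        node = frontier.pop()
        if node in visited:      # a DAG node may be reachable via several parents
            continue
        visited.add(node)
        evaluated += 1
        if predict_node(node):
            predicted.add(node)
            frontier.extend(children.get(node, ()))
    return predicted, evaluated  # evaluated is typically far below the node count
```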


44 Top-Down Search

Method          Previous F1-Measure  Top-Down F1-Measure  Local Predictors Computed
Exclusive       45.8%                0.4%                 10
Less Exclusive  53.1%                2.7%                 10
Less Inclusive  69.9%                69.8%                32
Inclusive       69.9%                69.9%                32

45 Outline: Introduction, Predictors, Evaluation in a Hierarchy, Local Predictor Design, Experimental Results, Conclusion

46 Predictor Results

Predictor  Precision  Recall
PA-SVM     75.4%      64.8%
PFAM-SVM   74.0%      57.5%
PST        57.5%      63.6%
BLAST      76.7%      69.6%
Voting     76.3%      73.3%
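The "Voting" row combines the individual predictors; the slides do not state the exact combination rule, so the sketch below shows one plausible reading: a simple vote over the node sets proposed by each predictor, with an assumed threshold.

```python
from collections import Counter

def vote(predictions, min_votes=2):
    """Combine several predictors' node sets by simple voting.

    `predictions` is a list of sets of ontology nodes, one per predictor
    (e.g. PA-SVM, PFAM-SVM, PST, BLAST). A node is kept if at least
    `min_votes` predictors proposed it; the threshold is an assumption,
    not a value taken from the slides.
    """
    counts = Counter(node for pred in predictions for node in pred)
    return {node for node, c in counts.items() if c >= min_votes}

# voted = vote([pa_svm_nodes, pfam_svm_nodes, pst_nodes, blast_nodes])
```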

47 Similar and Dissimilar Proteins
- 89% of proteins have at least one good BLAST hit: proteins similar (often homologous) to the set of well studied proteins
- 11% of proteins have no good BLAST hit: proteins not similar to the set of well studied proteins

48 Coverage
- Coverage: percentage of proteins for which a prediction is made

Organism         Good BLAST Hit  No Good BLAST Hit
D. melanogaster  60%             40%
S. cerevisiae    62%             38%

49 Similar Proteins: Exploiting BLAST
- BLAST is fast and accurate when a good hit is found; this can be exploited to lower the cost of local predictors
- Generate candidate nodes, and only compute local predictors for candidate nodes
- The candidate node set should have high recall and minimal size

50 Similar Proteins: Exploiting BLAST
- Candidate node generating methods:
  - Searching outward from the BLAST hit
  - Taking the union of more than one BLAST hit's annotations (see the sketch below)
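A sketch of the union-based method, assuming "BLAST-2-Union" means taking the union of the top two BLAST hits' annotation sets; the hit list and annotation lookup are hypothetical.

```python
def blast_union_candidates(hits, annotations, n_hits=2):
    """Candidate nodes from the union of the top BLAST hits' annotations.

    `hits` is a list of hit IDs sorted by BLAST score (best first) and
    `annotations` a hypothetical dict: hit ID -> set of ontology nodes.
    With n_hits=2 this matches one reading of the "BLAST-2-Union" method;
    only these candidate nodes are then passed to the local predictors.
    """
    candidates = set()
    for hit in hits[:n_hits]:
        candidates |= annotations.get(hit, set())
    return candidates
```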

51 Similar Proteins: Exploiting BLAST

Method          Precision  Recall  Avg Cost per Protein
All             77%        80%     1219
Top-Down        77%        79%     111
BLAST-2-Union   79%        78%     20
BLAST-Search-3  78%        78%     221

52 Dissimilar Proteins (the more interesting case)

Method           Precision  Recall  Avg Cost per Protein
BLAST            19%        20%     1
Voting           55%        32%     812
Top-Down Voting  56%        32%     58

53 Comparison to Protfun
- On a pruned ontology (9 Gene Ontology classes)
- On 1,637 "no good BLAST hit" proteins

Method   Precision  Recall
Protfun  14%        13%
Voting   69%        29%

54 Future Work
- Try the other two ontology aspects: biological process and cellular component
- Use other local predictors
- More parameter tuning
- Predictor cost

55 Conclusion
- Protein function prediction provides good leads for protein function determination
- Hierarchical ontologies can represent incomplete data, allowing the prediction of more functions
- Considering the hierarchy makes predictions more accurate and less computationally intensive
- The methods presented have higher coverage than BLAST alone
- Results accepted to IEEE CIBCB 2005

56 Thanks to...
- Duane Szafron and Paul Lu
- Brett Poulin and Russ Greiner
- Everyone in the Proteome Analyst research group

57 Incomplete Data & Prediction
- Inclusive avoids using ambiguous (incomplete) training data; does this help?
- To test:
  - Train on more incomplete data: choose X% of proteins and move one annotation up
  - Evaluate predictions on the "complete" data
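A sketch of this perturbation, assuming per-protein annotation sets and a parent map; for a chosen fraction of proteins one annotation is replaced by one of its parents, making it less specific.

```python
import random

def make_more_incomplete(proteins, parents, fraction, seed=0):
    """Simulate incomplete data: for a random fraction of proteins, replace one
    annotation with one of its parents (i.e. make it less specific).

    `proteins` is a hypothetical dict: protein ID -> set of annotated nodes;
    `parents` maps a node to its parent nodes. Proteins annotated only at
    root-level nodes are left unchanged.
    """
    rng = random.Random(seed)
    perturbed = {pid: set(terms) for pid, terms in proteins.items()}
    chosen = rng.sample(sorted(perturbed), int(fraction * len(perturbed)))
    for pid in chosen:
        movable = [t for t in perturbed[pid] if parents.get(t)]
        if movable:
            term = rng.choice(movable)
            perturbed[pid].discard(term)
            perturbed[pid].add(rng.choice(sorted(parents[term])))
    return perturbed
```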

58 Robustness to Incomplete Data

59 Local vs Global Cross-Validation
- Some node predictors have as few as 20 positive examples
- How should cross-validation be done so that each predictor has enough positive training examples?

60 Local vs Global Cross-Validation
- Local cross-validation is invalid: predictions must be consistent, which requires fold isolation
- A single global split: global cross-validation

