09 / 23 / Predicting Protein Function Using Machine-Learned Hierarchical Classifiers Roman Eisner Supervisors: Duane Szafron and Paul Lu
09 / 23 / Outline Introduction Predictors Evaluation in a Hierarchy Local Predictor Design Experimental Results Conclusion
09 / 23 / Proteins Functional Units in the cell Perform a Variety of Functions e.g. Catalysis of reactions, Structural and mechanical roles, transport of other molecules Can take years to study a single protein Any good leads would be helpful!
09 / 23 / Protein Function Prediction and Protein Function Determination Prediction: An estimate of what function a protein performs Determination: Work in a laboratory to observe and discover what function a protein performs Prediction complements determination
09 / 23 / Proteins Chain of amino acids 20 Amino Acids FastA Format: >P18077 – R35A_HUMAN MSGRLWSKAIFAGYKRGLRNQREHTALLKIEGVYARDETEFYLGKR CAYVYKAKNNTVTPGGKPNKTRVIWGKVTRAHGNSGMVRAKFRSNL PAKAIGHRIRVMLYPSRI
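As a rough illustration of the FASTA layout shown on this slide (a `>` header line followed by one or more sequence lines), here is a minimal parsing sketch. The function name `parse_fasta` is hypothetical, and the sketch assumes the sequence was wrapped over three lines as it appears on the slide.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence lines are concatenated without separators.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# The record from the slide, assumed to be wrapped over three sequence lines.
EXAMPLE = """>P18077 – R35A_HUMAN
MSGRLWSKAIFAGYKRGLRNQREHTALLKIEGVYARDETEFYLGKR
CAYVYKAKNNTVTPGGKPNKTRVIWGKVTRAHGNSGMVRAKFRSNL
PAKAIGHRIRVMLYPSRI
"""

records = parse_fasta(EXAMPLE)
header, sequence = records[0]
```

Every character of `sequence` is one of the 20 standard amino-acid letters, which is the only input the predictors in this talk start from.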
09 / 23 / Ontologies Standardized Vocabularies (Common Language) In biological literature, different terms can be used to describe the same function e.g. “peroxiredoxin activity” and “thioredoxin peroxidase activity” Can be structured in a hierarchy to show relationships
09 / 23 / Gene Ontology Directed Acyclic Graph (DAG) Always changing Describes 3 aspects of protein annotations: Molecular Function Biological Process Cellular Component
09 / 23 / Hierarchical Ontologies Can help to represent a large number of classes Represent General and Specific data Some data is incomplete – could become more specific in the future
09 / 23 / Incomplete Annotations
09 / 23 / Goal To predict the function of proteins given their sequence
09 / 23 / Data Set Protein Sequences UniProt database Ontology Gene Ontology Molecular Function aspect Experimental Annotations Gene Ontology Annotation EBI Pruned Ontology: 406 nodes (out of 7,399) with ≥ 20 proteins Final Data Set: 14,362 proteins
09 / 23 / Outline Introduction Predictors Evaluation in a Hierarchy Local Predictor Design Experimental Results Conclusion
09 / 23 / Predictors Global: BLAST NN Local: PA-SVM PFAM-SVM Probabilistic Suffix Trees Linear
09 / 23 / Why Linear SVMs? Accurate Explainability Each term in the dot product is meaningful
09 / 23 / PA-SVM Proteome Analyst
09 / 23 / PFAM-SVM Hidden Markov Models
09 / 23 / PST Probabilistic Suffix Trees Efficient Markov chains Model the protein sequences directly: Prediction:
09 / 23 / BLAST Protein Sequence Alignment for a query protein against any set of protein sequences
09 / 23 / BLAST
09 / 23 / Outline Introduction Predictors Evaluation in a Hierarchy Local Predictor Design Experimental Results Conclusion
09 / 23 / Evaluating Predictions in a Hierarchy Not all errors are equivalent Error to sibling different than error to unrelated part of hierarchy Proteins can perform more than one function Need to combine predictions of multiple functions into a single measure
09 / 23 / Evaluating Predictions in a Hierarchy Semantics of the hierarchy – True Path Rule Protein labeled with: {T} -> {T, A 1, A 2 } Predicted functions: {S} -> {S, A 1, A 2 } Precision = 2/3 = 67% Recall = 2/3 = 67%
09 / 23 / Evaluating Predictions in a Hierarchy Protein labelled with {T} -> {T, A 1, A 2 } Predicted: {C 1 } -> {C 1, T, A 1, A 2 } Precision = 3/4 = 75% Recall = 3/3 = 100%
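The two worked examples above can be reproduced with a short sketch. It expands both the true and the predicted label sets with their ancestors (the True Path Rule) and then computes set-based precision and recall. The exact shape of the toy hierarchy is an assumption: the slides only show that A1 and A2 are ancestors of T, that S shares those ancestors, and that C1 lies below T. The function names are hypothetical.

```python
def with_ancestors(labels, parents):
    """Expand a label set with all of its ancestors (True Path Rule)."""
    expanded = set()
    stack = list(labels)
    while stack:
        node = stack.pop()
        if node not in expanded:
            expanded.add(node)
            stack.extend(parents.get(node, ()))
    return expanded


def hierarchical_precision_recall(true_labels, predicted_labels, parents):
    """Precision and recall over ancestor-expanded label sets."""
    truth = with_ancestors(true_labels, parents)
    predicted = with_ancestors(predicted_labels, parents)
    overlap = len(truth & predicted)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(truth) if truth else 0.0
    return precision, recall


# Assumed toy hierarchy: A2 is the parent of A1, T and S are siblings
# under A1, and C1 is a child of T.
PARENTS = {"A1": ["A2"], "T": ["A1"], "S": ["A1"], "C1": ["T"]}

# Predicting the sibling S instead of T: precision 2/3, recall 2/3.
sibling_case = hierarchical_precision_recall({"T"}, {"S"}, PARENTS)

# Predicting the child C1 instead of T: precision 3/4, recall 3/3.
child_case = hierarchical_precision_recall({"T"}, {"C1"}, PARENTS)
```

Because shared ancestors count as correct, a near miss (a sibling or a child) is penalized less than a prediction in an unrelated part of the hierarchy.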
09 / 23 / Supervised Learning
09 / 23 / Cross-Validation Used to estimate the performance of a classification system on future data 5-Fold Cross-Validation:
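A minimal sketch of 5-fold cross-validation, assuming a simple shuffled split: the proteins are divided into five folds, and each fold serves once as the test set while the other four form the training set. The helper name `k_fold_splits` and the placeholder protein identifiers are illustrative only. The talk does not say how its folds were drawn.

```python
import random


def k_fold_splits(items, k=5, seed=0):
    """Yield (train, test) pairs so that every item is tested exactly once."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed for reproducible folds
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Placeholder identifiers standing in for real proteins.
proteins = [f"protein_{i}" for i in range(23)]
splits = list(k_fold_splits(proteins, k=5))
```

Averaging the evaluation measure over the five test folds gives the estimate of performance on unseen proteins.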
09 / 23 / Outline Introduction Predictors Evaluation in a Hierarchy Local Predictor Design Experimental Results Conclusion
09 / 23 / Inclusive vs Exclusive Local Predictors In a system of local predictors, how should each local predictor behave? Two extremes: A local predictor predicts positive only for those proteins that belong exactly at that node A local predictor predicts positive for those proteins that belong at or below them in the hierarchy No a priori reason to choose either
09 / 23 / Exclusive Local Predictors
09 / 23 / Inclusive Local Predictors
09 / 23 / Training Set Design Proteins in the current fold’s training set can be used in any way Need to select for each local predictor: Positive training examples Negative training examples
09 / 23 / Training Set Design
09 / 23 / Training Set Design
Scheme         | Positive Examples    | Negative Examples
Exclusive      | T                    | Not [T]
Less Exclusive | T                    | Not [T U Descendants(T)]
Less Inclusive | T U Descendants(T)   | Not [T U Descendants(T)]
Inclusive      | T U Descendants(T)   | Not [T U Descendants(T) U Ancestors(T)]
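The four schemes in the table can be sketched as set operations. This is one possible reading, not the authors' implementation: it assumes `T` means "proteins annotated directly at node T", and that `Descendants(T)` and `Ancestors(T)` mean "proteins annotated at any descendant or ancestor node". All names and the toy data are hypothetical.

```python
def reachable(node, edges):
    """Return every node reachable from `node` by following `edges`."""
    seen, stack = set(), list(edges.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(edges.get(current, ()))
    return seen


def invert(parents):
    """Build a node -> children map from a node -> parents map."""
    children = {}
    for child, node_parents in parents.items():
        for parent in node_parents:
            children.setdefault(parent, []).append(child)
    return children


def training_sets(node, scheme, annotations, parents):
    """Return (positives, negatives) for the local predictor at `node`."""
    below = reachable(node, invert(parents))
    above = reachable(node, parents)
    at_node = {p for p, labels in annotations.items() if node in labels}
    at_below = {p for p, labels in annotations.items() if labels & below}
    at_above = {p for p, labels in annotations.items() if labels & above}
    if scheme == "exclusive":
        positives, excluded = at_node, at_node
    elif scheme == "less exclusive":
        positives, excluded = at_node, at_node | at_below
    elif scheme == "less inclusive":
        positives = at_node | at_below
        excluded = positives
    elif scheme == "inclusive":
        positives = at_node | at_below
        excluded = positives | at_above
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    # Negatives are all remaining proteins outside the excluded set.
    return positives, set(annotations) - excluded


# Toy hierarchy (assumed): A is the parent of T and S; C is a child of T.
PARENTS = {"T": ["A"], "C": ["T"], "S": ["A"]}
# Each toy protein is annotated directly at exactly one node.
ANNOTATIONS = {"p_a": {"A"}, "p_t": {"T"}, "p_c": {"C"}, "p_s": {"S"}}

SCHEMES = ["exclusive", "less exclusive", "less inclusive", "inclusive"]
results = {s: training_sets("T", s, ANNOTATIONS, PARENTS) for s in SCHEMES}
```

In the toy data, the inclusive scheme leaves `p_a` out of both sets for node T. A protein annotated only at an ancestor may simply be incompletely annotated, so it is treated as neither a positive nor a negative example.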
09 / 23 / Comparing Training Set Design Schemes Using PA-SVM
Method         | Precision | Recall | F1-Measure | Exceptions per Protein
Exclusive      | 75.8%     | 32.8%  | 45.8%      | 1.52
Less Exclusive | 77.7%     | 40.4%  | 53.1%      | 1.74
Less Inclusive | 77.3%     | 63.8%  | 69.9%      | 0.05
Inclusive      | 75.3%     | 65.2%  | 69.9%      | 0.09
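The slide does not define the F1-measure. Assuming the standard definition (the harmonic mean of precision and recall), the reported values can be recomputed from the precision and recall columns, as in this small sketch. Small differences in the last digit are expected, since the published precision and recall are themselves rounded.

```python
def f1_measure(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) pairs copied from the table above.
ROWS = {
    "Exclusive": (75.8, 32.8),
    "Less Exclusive": (77.7, 40.4),
    "Less Inclusive": (77.3, 63.8),
    "Inclusive": (75.3, 65.2),
}

f1_scores = {name: f1_measure(p, r) for name, (p, r) in ROWS.items()}
```

The harmonic mean is dominated by the smaller of the two inputs, which is why the low recall of the exclusive schemes pulls their F1 well below their precision.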
09 / 23 / Exclusive schemes have more exceptions
09 / 23 / Lowering the Cost of Local Predictors Top-Down Compute local predictors top to bottom until a negative prediction is reached
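The top-down strategy can be sketched as a traversal that evaluates a node's local predictor and descends to its children only on a positive prediction. The predictor interface (a callable per node returning True or False), the toy hierarchy, and the function name are assumptions made for illustration.

```python
def top_down_predict(root, children, predictors, protein):
    """Evaluate local predictors from the root, stopping below negatives.

    Returns the set of predicted nodes and the number of predictors computed.
    """
    predicted, visited = set(), set()
    evaluated = 0
    frontier = [root]
    while frontier:
        node = frontier.pop()
        if node in visited:  # a DAG node may be reached through several parents
            continue
        visited.add(node)
        evaluated += 1
        if predictors[node](protein):
            predicted.add(node)
            frontier.extend(children.get(node, ()))
    return predicted, evaluated


# Toy hierarchy (assumed) with six nodes.
CHILDREN = {"root": ["A", "B"], "A": ["A1", "A2"], "B": ["B1"]}
ALL_NODES = ["root", "A", "B", "A1", "A2", "B1"]

# Stand-in predictors: each one answers from a fixed set of positive nodes.
POSITIVE_NODES = {"root", "A", "A1", "B1"}
PREDICTORS = {n: (lambda protein, n=n: n in POSITIVE_NODES) for n in ALL_NODES}

predicted, evaluated = top_down_predict("root", CHILDREN, PREDICTORS, "query")
```

In this toy run five of the six predictors are computed. The predictor at `B1` would have answered positive but is never reached, because its parent `B` answered negative. This shows why top-down search depends on predictors that also answer positive for proteins belonging below them.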
09 / 23 / Top-Down Search
Method         | Previous F1-Measure | Top-Down F1-Measure | Number of Local Predictors Computed
Exclusive      | 45.8%               | 0.4%                | 10
Less Exclusive | 53.1%               | 2.7%                | 10
Less Inclusive | 69.9%               | 69.8%               | 32
Inclusive      | 69.9%               |                     | 32
09 / 23 / Outline Introduction Predictors Evaluation in a Hierarchy Local Predictor Design Experimental Results Conclusion
09 / 23 / Predictor Results
Predictor | Precision | Recall
PA-SVM    | 75.4%     | 64.8%
PFAM-SVM  | 74.0%     | 57.5%
PST       | 57.5%     | 63.6%
BLAST     | 76.7%     | 69.6%
Voting    | 76.3%     | 73.3%
09 / 23 / Similar and Dissimilar Proteins 89% of proteins – at least one good BLAST hit Proteins which are similar (often homologous) to the set of well studied proteins 11% of proteins – no good BLAST hit Proteins which are not similar to the set of well studied proteins
09 / 23 / Coverage Coverage: Percentage of proteins for which a prediction is made
Organism        | Good BLAST Hit | No Good BLAST Hit
D. melanogaster | 60%            | 40%
S. cerevisiae   | 62%            | 38%
09 / 23 / Similar Proteins – Exploiting BLAST BLAST is fast and accurate when a good hit is found Can exploit this to lower the cost of local predictors Generate candidate nodes Only compute local predictors for candidate nodes Candidate node set should have: High Recall Minimal Size
09 / 23 / Similar Proteins – Exploiting BLAST Candidate node generation methods: Searching outward from the BLAST hit Taking the union of the annotations of more than one BLAST hit
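The slides do not spell out either candidate-generation method, so the following is only one plausible reading. `union_of_hits` pools the annotations of the top `k` BLAST hits, and `search_outward` collects every node within a given number of edges of the starting nodes, moving both up and down the hierarchy. The function names, the parameters `k` and `radius`, and the toy data are all hypothetical.

```python
from collections import deque


def union_of_hits(hit_annotations, k):
    """Pool the annotation sets of the top-k hits (best hit first)."""
    candidates = set()
    for labels in hit_annotations[:k]:
        candidates |= set(labels)
    return candidates


def undirected_neighbors(parents):
    """Build a node -> adjacent nodes map that ignores edge direction."""
    neighbors = {}
    for child, node_parents in parents.items():
        for parent in node_parents:
            neighbors.setdefault(child, set()).add(parent)
            neighbors.setdefault(parent, set()).add(child)
    return neighbors


def search_outward(start_nodes, neighbors, radius):
    """Breadth-first search up to `radius` edges away from the start nodes."""
    distance = {node: 0 for node in start_nodes}
    queue = deque(start_nodes)
    while queue:
        node = queue.popleft()
        if distance[node] == radius:
            continue  # do not expand beyond the radius
        for nxt in neighbors.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return set(distance)


# Toy hierarchy (assumed): root -> A -> {T, S}, and T -> C.
PARENTS = {"A": ["root"], "T": ["A"], "S": ["A"], "C": ["T"]}
NEIGHBORS = undirected_neighbors(PARENTS)

# Annotations of three hypothetical BLAST hits, best hit first.
HITS = [["T"], ["S", "T"], ["X"]]

union_candidates = union_of_hits(HITS, k=2)
near_candidates = search_outward({"T"}, NEIGHBORS, radius=1)
wider_candidates = search_outward({"T"}, NEIGHBORS, radius=2)
```

Only the local predictors for the candidate nodes then need to be computed, so a larger `k` or `radius` trades higher cost for a better chance that the candidate set contains the true functions.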
09 / 23 / Similar Proteins – Exploiting BLAST
Method         | Precision | Recall | Avg Cost per Protein
All            | 77%       | 80%    | 1219
Top-Down       | 77%       | 79%    | 111
BLAST-2-Union  | 79%       | 78%    | 20
BLAST-Search-3 | 78%       |        | 221
09 / 23 / Dissimilar Proteins The more interesting case
Method          | Precision | Recall | Avg Cost per Protein
BLAST           | 19%       | 20%    | 1
Voting          | 55%       | 32%    | 812
Top-Down Voting | 56%       | 32%    | 58
09 / 23 / Comparison to Protfun On a pruned ontology (9 Gene Ontology classes) On 1,637 “no good BLAST hit” proteins
Predictor | Precision | Recall
Protfun   | 14%       | 13%
Voting    | 69%       | 29%
09 / 23 / Future Work Try other two ontologies – biological process and cellular component Use other local predictors More parameter tuning Predictor cost
09 / 23 / Conclusion Protein Function Prediction provides good leads for Protein Function Determination Hierarchical ontologies can represent incomplete data allowing the prediction of more functions Considering the hierarchy: More accurate & Less Computationally Intensive Methods presented have a higher coverage than BLAST alone Results accepted to IEEE CIBCB 2005
09 / 23 / Thanks to… Duane Szafron and Paul Lu Brett Poulin and Russ Greiner Everyone in the Proteome Analyst research group
09 / 23 / Incomplete Data & Prediction Inclusive avoids using ambiguous (incomplete) training data Does this help? To test: Train on more incomplete data: choose X% of proteins and move one annotation up the hierarchy Evaluate predictions on “complete” data
09 / 23 / Robustness to Incomplete Data
09 / 23 / Local vs Global Cross-Validation Some node predictors have as few as 20 positive examples How can cross-validation ensure that each predictor has enough positive training examples?
09 / 23 / Local vs Global Cross-Validation Local cross-validation is invalid Predictions must be consistent Need fold isolation Use a single global split: global cross-validation