09 / 23 / Predicting Protein Function Using Machine-Learned Hierarchical Classifiers Roman Eisner Supervisors: Duane Szafron and Paul Lu

Presentation transcript:

09 / 23 / Predicting Protein Function Using Machine-Learned Hierarchical Classifiers Roman Eisner Supervisors: Duane Szafron and Paul Lu

09 / 23 / Outline Introduction Predictors Evaluation in a Hierarchy Local Predictor Design Experimental Results Conclusion

09 / 23 / Proteins
- Functional units in the cell
- Perform a variety of functions
  - e.g. catalysis of reactions, structural and mechanical roles, transport of other molecules
- Can take years to study a single protein
  - Any good leads would be helpful!

09 / 23 / Protein Function Prediction and Protein Function Determination
- Prediction:
  - An estimate of what function a protein performs
- Determination:
  - Work in a laboratory to observe and discover what function a protein performs
- Prediction complements determination

09 / 23 / Proteins
- Chain of amino acids
  - 20 Amino Acids
- FastA Format:
>P18077 – R35A_HUMAN
MSGRLWSKAIFAGYKRGLRNQREHTALLKIEGVYARDETEFYLGKR
CAYVYKAKNNTVTPGGKPNKTRVIWGKVTRAHGNSGMVRAKFRSNL
PAKAIGHRIRVMLYPSRI
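As a rough illustration of the FastA layout on this slide, the Python sketch below reads records into a dictionary keyed by header line. The function name `parse_fasta` is hypothetical and not part of the presented system; real pipelines would more likely rely on an established parser.

```python
from typing import Dict, Iterable

def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each header (the text after '>') to its concatenated sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequence lines are joined together
    return records

# The record shown on the slide.
example = """>P18077 – R35A_HUMAN
MSGRLWSKAIFAGYKRGLRNQREHTALLKIEGVYARDETEFYLGKR
CAYVYKAKNNTVTPGGKPNKTRVIWGKVTRAHGNSGMVRAKFRSNL
PAKAIGHRIRVMLYPSRI
"""
records = parse_fasta(example.splitlines())
sequence = next(iter(records.values()))
```

The resulting `sequence` string is the kind of raw input that the sequence-based predictors in the rest of the talk start from.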

09 / 23 / Ontologies
- Standardized vocabularies (common language)
- In biological literature, different terms can be used to describe the same function
  - e.g. “peroxiredoxin activity” and “thioredoxin peroxidase activity”
- Can be structured in a hierarchy to show relationships

09 / 23 / Gene Ontology
- Directed Acyclic Graph (DAG)
- Always changing
- Describes 3 aspects of protein annotations:
  - Molecular Function
  - Biological Process
  - Cellular Component

09 / 23 / Hierarchical Ontologies Can help to represent a large number of classes Represent General and Specific data Some data is incomplete – could become more specific in the future

09 / 23 / Incomplete Annotations

09 / 23 / Goal To predict the function of proteins given their sequence

09 / 23 / Data Set
- Protein Sequences
  - UniProt database
- Ontology
  - Gene Ontology Molecular Function aspect
- Experimental Annotations
  - Gene Ontology Annotation (EBI)
- Pruned Ontology: 406 nodes (out of 7,399) with ≥ 20 proteins
- Final Data Set: 14,362 proteins

09 / 23 / Outline Introduction Predictors Evaluation in a Hierarchy Local Predictor Design Experimental Results Conclusion

09 / 23 / Predictors
- Global:
  - BLAST NN
- Local:
  - PA-SVM (linear SVM)
  - PFAM-SVM (linear SVM)
  - Probabilistic Suffix Trees

09 / 23 / Why Linear SVMs?
- Accurate
- Explainable
  - Each term in the dot product is meaningful

09 / 23 / PA-SVM Proteome Analyst

09 / 23 / PFAM-SVM Hidden Markov Models

09 / 23 / PST
- Probabilistic Suffix Trees
  - Efficient Markov chains
- Model the protein sequences directly:
- Prediction:

09 / 23 / BLAST Protein Sequence Alignment for a query protein against any set of protein sequences

09 / 23 / BLAST

09 / 23 / Outline Introduction Predictors Evaluation in a Hierarchy Local Predictor Design Experimental Results Conclusion

09 / 23 / Evaluating Predictions in a Hierarchy
- Not all errors are equivalent
  - An error to a sibling is different from an error to an unrelated part of the hierarchy
- Proteins can perform more than one function
  - Need to combine predictions of multiple functions into a single measure

09 / 23 / Evaluating Predictions in a Hierarchy
- Semantics of the hierarchy: True Path Rule
- Protein labeled with: {T} -> {T, A1, A2}
- Predicted functions: {S} -> {S, A1, A2}
- Precision = 2/3 = 67%
- Recall = 2/3 = 67%

09 / 23 / Evaluating Predictions in a Hierarchy
- Protein labeled with: {T} -> {T, A1, A2}
- Predicted functions: {C1} -> {C1, T, A1, A2}
- Precision = 3/4 = 75%
- Recall = 3/3 = 100%
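The two worked examples on these slides can be reproduced with a short Python sketch. It assumes a toy hierarchy in which A1 is the root, A2 sits below A1, T and S are siblings below A2, and C1 is a child of T; the slides do not spell out this layout, so the `PARENTS` map and the function names are illustrative assumptions.

```python
from typing import Dict, Set, Tuple

# Hypothetical toy hierarchy (child -> parents), chosen to match the slide example.
PARENTS: Dict[str, Set[str]] = {
    "A1": set(),
    "A2": {"A1"},
    "T": {"A2"},
    "S": {"A2"},
    "C1": {"T"},
}

def with_ancestors(labels: Set[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Apply the True Path Rule: a label implies all of its ancestors."""
    closed: Set[str] = set()
    stack = list(labels)
    while stack:
        node = stack.pop()
        if node not in closed:
            closed.add(node)
            stack.extend(parents[node])
    return closed

def hierarchical_pr(true: Set[str], predicted: Set[str],
                    parents: Dict[str, Set[str]]) -> Tuple[float, float]:
    """Precision and recall computed on the ancestor-closed label sets."""
    t = with_ancestors(true, parents)
    p = with_ancestors(predicted, parents)
    overlap = len(t & p)
    precision = overlap / len(p) if p else 0.0
    recall = overlap / len(t) if t else 0.0
    return precision, recall

sibling_case = hierarchical_pr({"T"}, {"S"}, PARENTS)   # predicted a sibling of T
child_case = hierarchical_pr({"T"}, {"C1"}, PARENTS)    # predicted a child of T
```

Under these assumptions, `sibling_case` gives precision and recall of 2/3 each, and `child_case` gives precision 3/4 with recall 1, matching the slide figures.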

09 / 23 / Supervised Learning

09 / 23 / Cross-Validation Used to estimate the performance of a classification system on future data 5-Fold Cross-Validation:

09 / 23 / Outline Introduction Predictors Evaluation in a Hierarchy Local Predictor Design Experimental Results Conclusion

09 / 23 / Inclusive vs Exclusive Local Predictors
- In a system of local predictors, how should each local predictor behave?
- Two extremes:
  - Exclusive: a local predictor predicts positive only for those proteins that belong exactly at that node
  - Inclusive: a local predictor predicts positive for those proteins that belong at or below that node in the hierarchy
- No a priori reason to choose either

09 / 23 / Exclusive Local Predictors

09 / 23 / Inclusive Local Predictors

09 / 23 / Training Set Design
- Proteins in the current fold’s training set can be used in any way
- Need to select for each local predictor:
  - Positive training examples
  - Negative training examples

09 / 23 / Training Set Design

09 / 23 / Training Set Design
Scheme          Positive Examples    Negative Examples
Exclusive       T                    Not [T]
Less Exclusive  T                    Not [T U Descendants(T)]
Less Inclusive  T U Descendants(T)   Not [T U Descendants(T)]
Inclusive       T U Descendants(T)   Not [T U Descendants(T) U Ancestors(T)]
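The four schemes in this table can be sketched in Python as set operations. The sketch assumes each protein carries a set of most-specific annotation nodes and that a protein "belongs at" a node when that node is in its set; the helper names and the toy data are hypothetical.

```python
from typing import Dict, Set, Tuple

def ancestors(node: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """All nodes reachable by following parent links upward."""
    seen: Set[str] = set()
    stack = list(parents.get(node, ()))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(parents.get(n, ()))
    return seen

def descendants(node: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """All nodes that have `node` among their ancestors."""
    return {n for n in parents if node in ancestors(n, parents)}

def training_sets(node: str, scheme: str, annotations: Dict[str, Set[str]],
                  parents: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Return (positive, negative) protein sets for one local predictor."""
    desc, anc = descendants(node, parents), ancestors(node, parents)
    everyone = set(annotations)
    at_t = {p for p, labels in annotations.items() if node in labels}
    below = {p for p, labels in annotations.items() if labels & desc}
    above = {p for p, labels in annotations.items() if labels & anc}
    if scheme == "exclusive":
        return at_t, everyone - at_t
    if scheme == "less_exclusive":
        return at_t, everyone - (at_t | below)
    if scheme == "less_inclusive":
        return at_t | below, everyone - (at_t | below)
    if scheme == "inclusive":
        return at_t | below, everyone - (at_t | below | above)
    raise ValueError(f"unknown scheme: {scheme}")

# Toy data: A is the parent of T and S; C is a child of T.
parents = {"A": set(), "T": {"A"}, "S": {"A"}, "C": {"T"}}
annotations = {"p1": {"A"}, "p2": {"T"}, "p3": {"C"}, "p4": {"S"}}
pos, neg = training_sets("T", "inclusive", annotations, parents)
```

For node T in this toy example, the Inclusive scheme uses p2 and p3 as positives and only p4 as a negative: p1, annotated at the ancestor A, is left out of training entirely because it might really belong at or below T.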

09 / 23 / Comparing Training Set Design Schemes Using PA-SVM
Method          Precision  Recall  F1-Measure  Exceptions per Protein
Exclusive       75.8%      32.8%   45.8%       1.52
Less Exclusive  77.7%      40.4%   53.1%       1.74
Less Inclusive  77.3%      63.8%   69.9%       0.05
Inclusive       75.3%      65.2%   69.9%       0.09
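The slides do not define the F1-Measure; assuming the standard definition (the harmonic mean of precision and recall), a small Python sketch relates the columns of this table. Values recomputed from the rounded percentages may differ from the reported ones in the last digit.

```python
def f1_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Exclusive and Inclusive rows of the table, as fractions.
f1_exclusive = f1_measure(0.758, 0.328)
f1_inclusive = f1_measure(0.753, 0.652)
```

Because the harmonic mean is dominated by the smaller of the two values, the low recall of the Exclusive scheme pulls its F1 far below that of the Inclusive scheme even though their precision is nearly the same.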

09 / 23 / Exclusive predictors have more exceptions

09 / 23 / Lowering the Cost of Local Predictors
- Top-Down
  - Compute local predictors top to bottom until a negative prediction is reached
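A minimal Python sketch of the top-down idea: a node's children are only evaluated when the node itself predicts positive. The predictor callables, the `children` map, and the rule for nodes with several parents (evaluated if any parent is positive) are assumptions made for illustration.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Set, Tuple

def top_down_predict(roots: Iterable[str],
                     children: Dict[str, List[str]],
                     predictors: Dict[str, Callable[[str], bool]],
                     protein: str) -> Tuple[Set[str], int]:
    """Return the positive nodes and the number of local predictors evaluated."""
    positives: Set[str] = set()
    visited: Set[str] = set()
    evaluated = 0
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        if node in visited:
            continue  # a DAG node may be reached through several parents
        visited.add(node)
        evaluated += 1
        if predictors[node](protein):
            positives.add(node)
            queue.extend(children.get(node, ()))  # descend only below positives
    return positives, evaluated

# Toy hierarchy with fixed (hypothetical) predictor outputs.
children = {"root": ["A", "B"], "A": ["A1", "A2"], "B": ["B1"]}
outputs = {"root": True, "A": True, "B": False, "A1": False, "A2": True, "B1": True}
predictors = {n: (lambda seq, v=v: v) for n, v in outputs.items()}
positives, cost = top_down_predict(["root"], children, predictors, "MSGRLWSK")
```

In this toy run B1 is never evaluated because its parent B is negative, so five of the six predictors are computed. The same pruning explains why the scheme only works when parents reliably fire for proteins that belong below them.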

09 / 23 / Top-Down Search
Method          Previous F1-Measure  Top-Down F1-Measure  Number of Local Predictors Computed
Exclusive       45.8%                0.4%                 10
Less Exclusive  53.1%                2.7%                 10
Less Inclusive  69.9%                69.8%                32
Inclusive       69.9%                                     32

09 / 23 / Outline Introduction Predictors Evaluation in a Hierarchy Local Predictor Design Experimental Results Conclusion

09 / 23 / Predictor Results
Predictor  Precision  Recall
PA-SVM     75.4%      64.8%
PFAM-SVM   74.0%      57.5%
PST        57.5%      63.6%
BLAST      76.7%      69.6%
Voting     76.3%      73.3%

09 / 23 / Similar and Dissimilar Proteins
- 89% of proteins: at least one good BLAST hit
  - Proteins which are similar (often homologous) to the set of well-studied proteins
- 11% of proteins: no good BLAST hit
  - Proteins which are not similar to the set of well-studied proteins

09 / 23 / Coverage
Coverage: Percentage of proteins for which a prediction is made
Organism         Good BLAST Hit  No Good BLAST Hit
D. melanogaster  60%             40%
S. cerevisiae    62%             38%

09 / 23 / Similar Proteins – Exploiting BLAST
- BLAST is fast and accurate when a good hit is found
  - Can exploit this to lower the cost of local predictors
- Generate candidate nodes
- Only compute local predictors for candidate nodes
- Candidate node set should have:
  - High Recall
  - Minimal Size

09 / 23 / Similar Proteins – Exploiting BLAST
- Candidate node generation methods:
  - Searching outward from a BLAST hit
  - Taking the union of more than one BLAST hit’s annotations
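Both candidate-generation ideas can be sketched in Python. The slides do not give the exact procedures, so this is one plausible reading: the union method merges the annotation sets of the top k hits, and the search method walks outward a fixed number of parent/child edges from the hit's annotations. All names and the toy hierarchy are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set

def union_candidates(hit_annotations: List[Set[str]], k: int) -> Set[str]:
    """Union of the annotation sets of the top-k BLAST hits (best hit first)."""
    candidates: Set[str] = set()
    for labels in hit_annotations[:k]:
        candidates |= labels
    return candidates

def search_candidates(start: Set[str], parents: Dict[str, Set[str]],
                      radius: int) -> Set[str]:
    """Nodes within `radius` parent/child edges of any starting node."""
    # Build an undirected adjacency map from the child -> parents map.
    adjacent: Dict[str, Set[str]] = {n: set(ps) for n, ps in parents.items()}
    for node, ps in parents.items():
        for p in ps:
            adjacent.setdefault(p, set()).add(node)
    distance = {n: 0 for n in start}
    queue = deque(start)
    while queue:
        node = queue.popleft()
        if distance[node] == radius:
            continue  # do not expand past the search radius
        for nxt in adjacent.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return set(distance)

parents = {"root": set(), "A": {"root"}, "B": {"root"},
           "A1": {"A"}, "A2": {"A"}, "B1": {"B"}}
from_union = union_candidates([{"A1"}, {"B1"}, {"A2"}], k=2)
from_search = search_candidates({"A1"}, parents, radius=2)
```

Local predictors would then be computed only for the returned nodes, which trades a small risk of missing the true node for a much smaller number of predictor evaluations.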

09 / 23 / Similar Proteins – Exploiting BLAST
Method          Precision  Recall  Avg Cost per Protein
All             77%        80%     1219
Top-Down        77%        79%     111
BLAST-2-Union   79%        78%     20
BLAST-Search-3  78%                221

09 / 23 / Dissimilar Proteins
The more interesting case
Method           Precision  Recall  Avg Cost per Protein
BLAST            19%        20%     1
Voting           55%        32%     812
Top-Down Voting  56%        32%     58

09 / 23 / Comparison to Protfun
- On a pruned ontology (9 Gene Ontology classes)
- On 1,637 “no good BLAST hit” proteins
Predictor  Precision  Recall
Protfun    14%        13%
Voting     69%        29%

09 / 23 / Future Work
- Try the other two ontologies: biological process and cellular component
- Use other local predictors
- More parameter tuning
- Predictor cost

09 / 23 / Conclusion
- Protein Function Prediction provides good leads for Protein Function Determination
- Hierarchical ontologies can represent incomplete data, allowing the prediction of more functions
- Considering the hierarchy:
  - More accurate and less computationally intensive
- Methods presented have a higher coverage than BLAST alone
- Results accepted to IEEE CIBCB 2005

09 / 23 / Thanks to… Duane Szafron and Paul Lu Brett Poulin and Russ Greiner Everyone in the Proteome Analyst research group

09 / 23 / Incomplete Data & Prediction
- Inclusive avoids using ambiguous (incomplete) training data
- Does this help?
- To test:
  - Train on more incomplete data: choose X% of proteins, and move one annotation up
  - Evaluate predictions on “complete” data
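The degradation step ("choose X% of proteins, and move one annotation up") might look like the Python sketch below. How the original experiment chose the annotation and the parent is not stated on the slide, so the random choices, the function name, and the toy data here are assumptions.

```python
import random
from typing import Dict, Set

def make_incomplete(annotations: Dict[str, Set[str]],
                    parents: Dict[str, Set[str]],
                    fraction: float, seed: int = 0) -> Dict[str, Set[str]]:
    """Copy the annotations, moving one label up to a parent for a fraction of proteins."""
    rng = random.Random(seed)
    proteins = sorted(annotations)
    chosen = rng.sample(proteins, round(fraction * len(proteins)))
    result = {p: set(labels) for p, labels in annotations.items()}
    for protein in chosen:
        # Only labels that have a parent can be made less specific.
        movable = sorted(l for l in result[protein] if parents.get(l))
        if not movable:
            continue
        label = rng.choice(movable)
        parent = rng.choice(sorted(parents[label]))
        result[protein].discard(label)
        result[protein].add(parent)
    return result

# Toy data: C is below T, which is below A.
parents = {"A": set(), "T": {"A"}, "C": {"T"}}
annotations = {"p1": {"C"}, "p2": {"T"}}
degraded = make_incomplete(annotations, parents, fraction=1.0)
```

Training would use the degraded annotations while evaluation keeps the original ones, so any drop in performance reflects sensitivity to incomplete labels.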

09 / 23 / Robustness to Incomplete Data

09 / 23 / Local vs Global Cross-Validation
- Some node predictors have as few as 20 positive examples
- How to do cross-validation to make sure each predictor has enough positive training examples?

09 / 23 / Local vs Global Cross-Validation
- Local cross-validation is invalid
  - Predictions must be consistent
  - Need fold isolation
- A single global split -> global cross-validation
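A single global split can be sketched in Python as follows: proteins are partitioned into folds once, and every local predictor trains and tests on the same partition, so no predictor sees a test protein during training. The function names and protein identifiers are hypothetical, and the sketch does not by itself ensure that each node keeps enough positive examples per fold.

```python
import random
from typing import Iterable, List, Set, Tuple

def global_folds(proteins: Iterable[str], k: int = 5, seed: int = 0) -> List[List[str]]:
    """Partition the proteins once into k folds shared by all local predictors."""
    shuffled = sorted(proteins)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def train_test_split(folds: List[List[str]], test_index: int) -> Tuple[Set[str], Set[str]]:
    """Use one fold for testing and the remaining folds for training."""
    test = set(folds[test_index])
    train = {p for i, fold in enumerate(folds) if i != test_index for p in fold}
    return train, test

proteins = [f"P{i:05d}" for i in range(23)]  # hypothetical identifiers
folds = global_folds(proteins, k=5)
train, test = train_test_split(folds, test_index=0)
```

Every node's predictor in a given round would be trained on `train` and evaluated on `test`, which keeps the combined hierarchical prediction for a test protein free of training leakage.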