Proteome Analyst Transparent High-throughput Protein Annotation: Function, Localization and Custom Predictors.

Proteome Analyst
Duane Szafron, Paul Lu, Russell Greiner, David Wishart, Zhiyong Lu, Brett Poulin, Roman Eisner, John Anvik, Cam Macdonell

Proteome Analyst
Proteome: one of many ‘-omes’; the set of all proteins in an organism
Analysis: prediction of protein function or localization from sequence data

Analyze a Protein
We have examples of annotated proteins in various protein classes.
We have more examples of unannotated proteins.
What do we do?
  Find homologues to each protein and assume similar function.
  Find characteristics of each protein that affect function.

Analyzing Proteins
One protein? Just do it.
5 proteins? Post-doc familiar with protein classes.
50 proteins? Grad student.
5000 proteins? Summer students.

Proteome Analyst
High-throughput, transparent prediction of:
  Protein function
  Protein localization
  Custom classification

Machine Learning Task
Training
  INPUT: sequences, classes
  OUTPUT: Classifier
Analysis
  INPUT: sequences, Classifier
  OUTPUT: classes, explanation

Training
INPUT: sequences, classes
PA Tools: sequences → features
ML Algorithm: features, classes → Classifier
OUTPUT: Classifier

Training: INPUT
>class A<Training Seq 1
MVGSGLLWLALVSCILTQASAVQRGYGN
PIEASSYGL...
>class B<Training Seq 2
LLDEPFRSTENSAGSQGCDKNMSGWYRF
VGEGGVRMS...
>class B<Training Seq 3
EVIAYLRDPNCSSILQTEERNWVSVTSP
VQASACRNI....
(Each header line carries the class; the lines under it carry the protein sequence.)
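The training input shown above can be read with a few lines of Python. This is a minimal sketch, not Proteome Analyst's own reader: the function name `parse_labelled_fasta` is hypothetical, and it assumes every header has the form `>class<name` exactly as on the slide. The sequences in the example are the truncated fragments from the slide, without the trailing ellipsis.

```python
from typing import List, Tuple


def parse_labelled_fasta(text: str) -> List[Tuple[str, str, str]]:
    """Return (class_label, sequence_name, sequence) for each entry.

    Assumes headers look like ">class A<Training Seq 1" (format taken
    from the slide; the parser itself is an illustration only).
    """
    records = []
    label, name, chunks = None, None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if label is not None:
                records.append((label, name, "".join(chunks)))
            label, _, name = line[1:].partition("<")
            label, name, chunks = label.strip(), name.strip(), []
        else:
            # Sequence lines are concatenated into one string.
            chunks.append(line)
    if label is not None:
        records.append((label, name, "".join(chunks)))
    return records


# Truncated fragments from the slide, used only as sample data.
example = """\
>class A<Training Seq 1
MVGSGLLWLALVSCILTQASAVQRGYGN
PIEASSYGL
>class B<Training Seq 2
LLDEPFRSTENSAGSQGCDKNMSGWYRF
VGEGGVRMS
"""
records = parse_labelled_fasta(example)
```

Each record pairs a class label with a sequence, which is the shape of input the training step needs.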

Training: PA Tools
sequences → features
Homology Tools (BLAST)
  sequence → homologues
  homologues → annotations
  annotations → features

Homology Tool: sequence → features
sequence → [BLAST against seq DB] → homologues → [retrieve] → annotations → [parse] → features
Sample annotation:
DBSOURCE swissprot: locus MPPB_NEUCR,...
xrefs (non-sequence databases):... InterPro IPR001431,...
KEYWORDS Hydrolase; Metalloprotease; Zinc; Mitochondrion; Transit peptide; Oxidoreductase; Electron transport; Respiratory chain.
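The final "parse" step, turning a homologue's annotation into features, might look like the sketch below. It handles only the KEYWORDS line from the sample annotation. The function name `keyword_features` and the choice to lowercase tokens are assumptions for illustration; the slides do not say which annotation fields Proteome Analyst actually uses or how it normalizes them.

```python
from typing import Set


def keyword_features(annotation: str) -> Set[str]:
    """Collect the tokens of every KEYWORDS line as a set of features."""
    features = set()
    for line in annotation.splitlines():
        line = line.strip()
        if not line.startswith("KEYWORDS"):
            continue
        # Drop the field name and the closing period, then split on ';'.
        body = line[len("KEYWORDS"):].strip().rstrip(".")
        for token in body.split(";"):
            token = token.strip().lower()
            if token:
                features.add(token)
    return features


# KEYWORDS line copied from the slide.
annotation = (
    "KEYWORDS Hydrolase; Metalloprotease; Zinc; Mitochondrion; "
    "Transit peptide; Oxidoreductase; Electron transport; Respiratory chain."
)
features = keyword_features(annotation)
```

Representing each protein as a set of such tokens gives the classifier a simple bag-of-words style input.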

Training: PA Tools
sequences → features
Homology Tools (BLAST)
  sequence → homologues
  homologues → annotations
  annotations → features
Pattern Tools (PFAM, ProSite, …)
  sequences → motifs
  motifs → features

Pattern Tool: sequence → features
sequence → [find in pattern DB] → patterns → [parse] → features
Sample pattern hits:
Pfam; PF00234; tryp_alpha_amyl; 1.
PROSITE; PS00940; GAMMA_THIONIN; 1.
PROSITE; PS00305; 11S_SEED_STORAGE; 1.
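Pattern hits in the semicolon-separated form shown above can be turned into features in much the same way. The sketch below is an illustration under assumptions: `pattern_features` is a hypothetical name, the feature is built from the database name and accession only, and the trailing number (presumably a hit count) is ignored.

```python
from typing import List


def pattern_features(hit_lines: List[str]) -> List[str]:
    """Turn lines like 'Pfam; PF00234; tryp_alpha_amyl; 1.' into features.

    Each feature is '<DATABASE>:<accession>'. Lines with fewer than two
    fields are skipped.
    """
    features = []
    for line in hit_lines:
        fields = [field.strip() for field in line.rstrip(". \n").split(";")]
        if len(fields) < 2:
            continue
        database, accession = fields[0], fields[1]
        features.append(f"{database.upper()}:{accession}")
    return features


# Pattern hits copied from the slide.
hits = [
    "Pfam; PF00234; tryp_alpha_amyl; 1.",
    "PROSITE; PS00940; GAMMA_THIONIN; 1.",
    "PROSITE; PS00305; 11S_SEED_STORAGE; 1.",
]
motif_features = pattern_features(hits)
```

Prefixing the accession with its database keeps Pfam and PROSITE identifiers from colliding in a shared feature space.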

Pattern Tool: sequence → features
(not included in current results)

Training: ML Algorithm
features, classes → Classifier
Any ML algorithm may be used.
Default = naïve Bayes:
  consistently near-best accuracy (SVM, ANN slightly better)
  efficient (for high-throughput)
  easy to interpret
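To make the default concrete, here is a minimal sketch of training a naïve Bayes classifier on feature sets. It assumes a Bernoulli model (a feature is either present or absent) with add-one smoothing; the slides do not state which naïve Bayes variant or smoothing Proteome Analyst uses, and the toy training data is invented for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple


def train_naive_bayes(
    examples: List[Tuple[Set[str], str]]
) -> Tuple[Dict[str, float], Dict[str, Dict[str, float]]]:
    """Estimate class priors and P(feature present | class).

    Uses add-one (Laplace) smoothing: (count + 1) / (class size + 2).
    """
    class_counts = Counter(label for _, label in examples)
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for features, label in examples:
        feature_counts[label].update(features)
        vocabulary |= features

    total = len(examples)
    priors = {c: n / total for c, n in class_counts.items()}
    likelihoods = {
        c: {
            f: (feature_counts[c][f] + 1) / (class_counts[c] + 2)
            for f in vocabulary
        }
        for c in class_counts
    }
    return priors, likelihoods


# Invented toy data: (feature set, class label).
training = [
    ({"hydrolase", "zinc"}, "A"),
    ({"hydrolase"}, "A"),
    ({"transport"}, "B"),
]
priors, likelihoods = train_naive_bayes(training)
```

Training is a single counting pass over the data, which is one reason the method suits high-throughput use, and the resulting per-feature probabilities can be inspected directly.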

Training: OUTPUT Classifier

Analysis (Classification)
INPUT: sequences
PA Tools: sequences → features
Classifier: features → classes, explanation
OUTPUT: classes

Analysis: INPUT
>Seq 1
DTILNINFQCAYPLDMKVSLQAALQPIV
SSLNVSVDG...
>Seq 2
AVELSVESVLYVGAILEQGDTSRFNLVL
RNCYATPTE...
>Seq 3
HVEENGQSSESRFSVQMFMFAGHYDLVF
LHCEIHLCD....
(Protein sequences only; no classes are given.)

Analysis: PA Tools
sequences → features
Homology Tools (BLAST)
  sequence → homologues
  homologues → annotations
  annotations → features
Pattern Tools (PFAM, ProSite, …)
  sequences → motifs
  motifs → features

Analysis: Classification
features → classes
Naïve Bayes:
  returns probabilities of each class for each sequence
  efficient (for high-throughput)
  easy to interpret
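The per-class probabilities that naïve Bayes returns can be computed as in the sketch below. It assumes a Bernoulli model in which absent features also count as evidence, and it works in log space to avoid underflow. The model parameters are invented numbers, not values from Proteome Analyst.

```python
import math
from typing import Dict, Set


def posterior(
    features: Set[str],
    priors: Dict[str, float],
    likelihoods: Dict[str, Dict[str, float]],
) -> Dict[str, float]:
    """Return P(class | features) for every class."""
    log_scores = {}
    for c, prior in priors.items():
        score = math.log(prior)
        for f, p in likelihoods[c].items():
            # Present features contribute p, absent ones contribute 1 - p.
            score += math.log(p if f in features else 1.0 - p)
        log_scores[c] = score

    # Normalize with the log-sum-exp trick for numerical stability.
    top = max(log_scores.values())
    unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {c: v / z for c, v in unnormalized.items()}


# Invented parameters for two classes and two features.
priors = {"A": 0.5, "B": 0.5}
likelihoods = {
    "A": {"hydrolase": 0.8, "transport": 0.2},
    "B": {"hydrolase": 0.2, "transport": 0.8},
}
probs = posterior({"hydrolase"}, priors, likelihoods)
```

Because the output is a full probability distribution rather than a single label, a user can see how confident each prediction is.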

Analysis: Classification
features → classes, explanation
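The transcript does not preserve how the explanation is produced. One common way to explain a naïve Bayes prediction, shown here purely as an assumption, is to rank the features present in a protein by how strongly each one favours the predicted class over a rival class (the log-likelihood ratio). The function name `explain` and the numbers are hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple


def explain(
    features: Set[str],
    likelihoods: Dict[str, Dict[str, float]],
    predicted: str,
    rival: str,
) -> List[Tuple[str, float]]:
    """Rank present features by log(P(f | predicted) / P(f | rival)).

    Positive values support the predicted class, negative values the
    rival. Features unknown to the model are skipped.
    """
    contributions = []
    for f in sorted(features):
        p_pred = likelihoods[predicted].get(f)
        p_rival = likelihoods[rival].get(f)
        if p_pred is None or p_rival is None:
            continue
        contributions.append((f, math.log(p_pred / p_rival)))
    # Strongest evidence first, regardless of sign.
    contributions.sort(key=lambda item: abs(item[1]), reverse=True)
    return contributions


# Invented parameters for illustration.
likelihoods = {
    "A": {"hydrolase": 0.8, "zinc": 0.6},
    "B": {"hydrolase": 0.2, "zinc": 0.5},
}
evidence = explain({"hydrolase", "zinc"}, likelihoods, "A", "B")
```

Since naïve Bayes adds independent per-feature terms, this kind of breakdown is exact rather than an approximation, which is part of what makes the classifier easy to interpret.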

Results: General Function
GeneQuiz classification, 5-fold cross-validation accuracy on 14 classes
Data set          Accuracy
E. coli (2370)    82.5%
Yeast (2359)      78.8%
Fly (3842)        76.6%
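For readers unfamiliar with the evaluation protocol, 5-fold cross-validation can be sketched as follows. The learner here is a deliberately trivial majority-class placeholder, and the data is invented; neither reflects the classifier or data sets behind the reported figures. The sketch pools correct predictions over all folds, which is one of several ways to summarize accuracy.

```python
import random
from typing import Callable, FrozenSet, List, Tuple

Example = Tuple[FrozenSet[str], str]
Predictor = Callable[[FrozenSet[str]], str]


def train_majority(training: List[Example]) -> Predictor:
    """Placeholder learner: always predict the most common training class."""
    labels = [label for _, label in training]
    majority = max(set(labels), key=labels.count)
    return lambda features: majority


def k_fold_accuracy(
    examples: List[Example],
    train: Callable[[List[Example]], Predictor],
    k: int = 5,
    seed: int = 0,
) -> float:
    """Train on k-1 folds, test on the held-out fold, pool the results."""
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]

    correct = 0
    for held_out in folds:
        held = set(held_out)
        training = [examples[i] for i in indices if i not in held]
        predict = train(training)
        correct += sum(
            predict(examples[i][0]) == examples[i][1] for i in held_out
        )
    return correct / len(examples)


# Invented data: eight proteins of class A, two of class B.
examples = (
    [(frozenset({"hydrolase"}), "A")] * 8
    + [(frozenset({"transport"}), "B")] * 2
)
accuracy = k_fold_accuracy(examples, train_majority)
```

Every example is tested exactly once by a model that never saw it during training, so the estimate is not inflated by memorization.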

Results: Specific Function
K+ ion channel proteins, 5-fold cross-validation accuracy on 78 sequences, 4 classes
             Accuracy
1st effort   97.4%
2nd effort   100%

Results: Localization
Sub-cellular localization prediction, 3146 sequences from 10 classes
                   Accuracy   Coverage
Nair and Rost      81.5%      36.9%
Proteome Analyst   87.8%      100%
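The slide does not define its two columns. A common reading, assumed here, is that coverage is the fraction of sequences for which a predictor returns any answer, and accuracy is the fraction of those answers that are correct. The sketch below computes both under that assumption; the class names are made up.

```python
from typing import List, Optional, Tuple


def accuracy_and_coverage(
    predictions: List[Optional[str]], truth: List[str]
) -> Tuple[float, float]:
    """Return (accuracy, coverage).

    A prediction of None means the predictor declined to answer.
    Accuracy is measured only over answered cases (assumed definition).
    """
    answered = [(p, t) for p, t in zip(predictions, truth) if p is not None]
    coverage = len(answered) / len(truth)
    if not answered:
        return 0.0, coverage
    accuracy = sum(p == t for p, t in answered) / len(answered)
    return accuracy, coverage


# Invented example: one sequence gets no prediction.
predictions = ["nucleus", None, "cytoplasm", "nucleus"]
truth = ["nucleus", "nucleus", "cytoplasm", "cytoplasm"]
accuracy, coverage = accuracy_and_coverage(predictions, truth)
```

Under this reading, a method can look accurate while answering only a minority of cases, which is why the two numbers are reported side by side.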

Proteome Analyst
High-throughput, transparent prediction of:
  Protein function
  Protein localization
  Custom classification

Acknowledgements
Student developers: Cynthia Luk, Samer Nassar, Kevin McKee
Biologists: Warren Gallin, Kathy Magor
Data: Nair and Rost
Funding:
  PENCE - Protein Engineering Network of Centres of Excellence
  NSERC - Natural Sciences and Engineering Research Council
  Sun Microsystems
  AICML - Alberta Ingenuity Centre for Machine Learning
Many ‘-ome’ jokes: my wife, Jen

Contact