Proteome Analyst
Transparent High-throughput Protein Annotation: Function, Localization and Custom Predictors
Duane Szafron, Paul Lu, Russell Greiner, David Wishart, Zhiyong Lu, Brett Poulin, Roman Eisner, John Anvik, Cam Macdonell
Proteome Analyst
- Proteome: one of many ‘-omes’; the set of all proteins in an organism
- Analysis: prediction of protein function or localization from sequence data
Analyze a Protein
- We have examples of annotated proteins in various protein classes.
- We have more examples of unannotated proteins.
- What do we do?
  - Find homologues to each protein and assume similar function.
  - Find characteristics of each protein that affect function.
Analyzing Proteins
- One protein? Just do it.
- 5 proteins? A post-doc familiar with the protein classes.
- 50 proteins? A grad student.
- 5000 proteins? Summer students.
Proteome Analyst
- High-throughput
- Transparent
- Prediction of
  - Protein Function
  - Protein Localization
  - Custom Classification
Machine Learning Task
- Training
  - INPUT: sequences, classes
  - OUTPUT: Classifier
- Analysis
  - INPUT: sequences, Classifier
  - OUTPUT: classes, explanation
Training
- INPUT: sequences, classes
- PA Tools: sequences → features
- ML Algorithm: features, classes → Classifier
- OUTPUT: Classifier
Training: INPUT (classes and protein sequences)
    >class A<Training Seq 1
    MVGSGLLWLALVSCILTQASAVQRGYGN
    PIEASSYGL...
    >class B<Training Seq 2
    LLDEPFRSTENSAGSQGCDKNMSGWYRF
    VGEGGVRMS...
    >class B<Training Seq 3
    EVIAYLRDPNCSSILQTEERNWVSVTSP
    VQASACRNI....
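The training input above is a FASTA-like file in which each header carries the class before the `<` and the sequence name after it. A minimal sketch of reading that layout follows; the function name `parse_labelled_fasta` and the toy sequences are assumptions for illustration, not part of Proteome Analyst.

```python
def parse_labelled_fasta(text):
    """Parse records whose header looks like '>class A<Training Seq 1'.

    Returns a list of (class_label, sequence_name, sequence) tuples.
    Hypothetical helper; the real PA input parser is not shown in the slides.
    """
    records = []
    label, name, chunks = None, None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if label is not None:
                records.append((label, name, "".join(chunks)))
            raw_label, _, raw_name = line[1:].partition("<")
            label, name, chunks = raw_label.strip(), raw_name.strip(), []
        else:
            # Sequence lines are concatenated until the next header.
            chunks.append(line)
    if label is not None:
        records.append((label, name, "".join(chunks)))
    return records


# Toy input (made-up residues, not the sequences from the slide).
toy = ">class A<Seq 1\nACDE\nFGH\n>class B<Seq 2\nKLMN\n"
records = parse_labelled_fasta(toy)
```

For analysis input, where headers carry no class, the same loop would apply with an empty label.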
Training: PA Tools (sequences → features)
- Homology Tools (BLAST)
  - sequence → homologues
  - homologues → annotations
  - annotations → features
Homology Tool: sequence → features
- BLAST (against seq DB): sequence → homologues
- retrieve: homologues → annotations
- parse: annotations → features
- Example annotation:
    DBSOURCE swissprot: locus MPPB_NEUCR,...
    xrefs (non-sequence databases):... InterPro IPR001431,...
    KEYWORDS Hydrolase; Metalloprotease; Zinc; Mitochondrion; Transit peptide; Oxidoreductase; Electron transport; Respiratory chain.
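The "parse" step turns a homologue's annotation into features. A minimal sketch, assuming features are simply the lower-cased entries of the KEYWORDS line (the slides do not say exactly which fields PA uses, and `keyword_features` is a hypothetical name):

```python
def keyword_features(annotation_text):
    """Collect the semicolon-separated KEYWORDS entries as a set of features."""
    features = set()
    for line in annotation_text.splitlines():
        line = line.strip()
        if line.startswith("KEYWORDS"):
            # Drop the field name and the trailing full stop.
            body = line[len("KEYWORDS"):].strip().rstrip(".")
            features.update(k.strip().lower() for k in body.split(";") if k.strip())
    return features


annotation = (
    "KEYWORDS Hydrolase; Metalloprotease; Zinc; Mitochondrion; Transit peptide; "
    "Oxidoreductase; Electron transport; Respiratory chain."
)
features = keyword_features(annotation)
```

Representing each sequence as a set of such tokens is what lets a simple classifier later point to the specific keywords behind a prediction.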
Training: PA Tools (sequences → features)
- Homology Tools (BLAST)
  - sequence → homologues
  - homologues → annotations
  - annotations → features
- Pattern Tools (PFAM, ProSite, …)
  - sequences → motifs
  - motifs → features
Pattern Tool: sequence → features
- find (in pattern DB): sequence → patterns
- parse: patterns → features
- Example patterns:
    Pfam; PF00234; tryp_alpha_amyl; 1.
    PROSITE; PS00940; GAMMA_THIONIN; 1.
    PROSITE; PS00305; 11S_SEED_STORAGE; 1.
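Pattern hits like the ones above can be reduced to features in the same spirit. The sketch below assumes each line has the form `database; accession; name; count.` and keeps only `DATABASE:accession`; that feature encoding and the name `pattern_features` are illustrative assumptions.

```python
def pattern_features(xref_text):
    """Turn 'Pfam; PF00234; tryp_alpha_amyl; 1.' style lines into features."""
    features = set()
    for line in xref_text.splitlines():
        fields = [f.strip() for f in line.strip().rstrip(".").split(";")]
        # Need at least a database name and an accession.
        if len(fields) >= 2 and fields[0] and fields[1]:
            features.add(f"{fields[0].upper()}:{fields[1]}")
    return features


hits = (
    "Pfam; PF00234; tryp_alpha_amyl; 1.\n"
    "PROSITE; PS00940; GAMMA_THIONIN; 1.\n"
    "PROSITE; PS00305; 11S_SEED_STORAGE; 1.\n"
)
features = pattern_features(hits)
```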
Pattern Tool: sequence → features (not included in current results)
Training: ML Algorithm (features, classes → Classifier)
- Any ML algorithm may be used
- Default = naïve Bayes
  - consistently near-best accuracy (SVM, ANN slightly better)
  - efficient (for high-throughput)
  - easy to interpret
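To make the default concrete, here is a minimal sketch of training a naïve Bayes classifier on feature sets. It assumes a Bernoulli model (each feature is present or absent) with Laplace smoothing `alpha`; the slides do not specify which naïve Bayes variant PA uses, and `train_naive_bayes` is a hypothetical name.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples, alpha=1.0):
    """examples: list of (feature_set, class_label) pairs.

    Returns log priors and per-class log probabilities that each
    feature is present or absent (Bernoulli naive Bayes).
    """
    class_counts = Counter(label for _, label in examples)
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for features, label in examples:
        vocabulary.update(features)
        for f in features:
            feature_counts[label][f] += 1

    total = sum(class_counts.values())
    model = {"log_prior": {}, "log_present": {}, "log_absent": {},
             "vocabulary": vocabulary}
    for label, n in class_counts.items():
        model["log_prior"][label] = math.log(n / total)
        model["log_present"][label] = {}
        model["log_absent"][label] = {}
        for f in vocabulary:
            # Smoothed estimate of P(feature present | class).
            p = (feature_counts[label][f] + alpha) / (n + 2 * alpha)
            model["log_present"][label][f] = math.log(p)
            model["log_absent"][label][f] = math.log(1.0 - p)
    return model


# Toy training set with made-up features and classes.
model = train_naive_bayes([
    ({"zinc", "hydrolase"}, "A"),
    ({"zinc"}, "A"),
    ({"membrane"}, "B"),
])
```

Training is a single counting pass over the data, which is what makes this choice attractive for high-throughput use.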
Training: OUTPUT Classifier
Analysis (Classification)
- INPUT: sequences
- PA Tools: sequences → features
- Classifier: features → classes, explanation
- OUTPUT: classes
Analysis: INPUT (protein sequences)
    >Seq 1
    DTILNINFQCAYPLDMKVSLQAALQPIV
    SSLNVSVDG...
    >Seq 2
    AVELSVESVLYVGAILEQGDTSRFNLVL
    RNCYATPTE...
    >Seq 3
    HVEENGQSSESRFSVQMFMFAGHYDLVF
    LHCEIHLCD....
Analysis: PA Tools (sequences → features)
- Homology Tools (BLAST)
  - sequence → homologues
  - homologues → annotations
  - annotations → features
- Pattern Tools (PFAM, ProSite, …)
  - sequences → motifs
  - motifs → features
Analysis: Classification (features → classes)
- naïve Bayes returns probabilities of each class for each sequence
- efficient (for high-throughput)
- easy to interpret
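As an illustration of "probabilities of each class", the sketch below computes naïve Bayes posteriors from given priors and per-class feature probabilities. The Bernoulli form, the function name `class_probabilities`, and the toy numbers are assumptions, not values from Proteome Analyst.

```python
import math


def class_probabilities(features, priors, likelihoods):
    """Posterior P(class | features) under a Bernoulli naive Bayes model.

    priors:      {class: P(class)}
    likelihoods: {class: {feature: P(feature present | class)}}
    """
    log_scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for f, p in likelihoods[label].items():
            # Present features contribute p, absent ones 1 - p.
            score += math.log(p if f in features else 1.0 - p)
        log_scores[label] = score

    # Normalise in log space to avoid underflow with many features.
    top = max(log_scores.values())
    unnormalised = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnormalised.values())
    return {c: v / z for c, v in unnormalised.items()}


priors = {"A": 0.5, "B": 0.5}
likelihoods = {
    "A": {"zinc": 0.75, "membrane": 0.25},
    "B": {"zinc": 0.25, "membrane": 0.75},
}
probs = class_probabilities({"zinc"}, priors, likelihoods)
# probs["A"] is 0.9: 0.5*0.75*0.75 against 0.5*0.25*0.25 for class B.
```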
Analysis: Classification (features → classes, explanation)
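One way a naïve Bayes prediction can be explained is by ranking the features that favour the predicted class over an alternative. The sketch below uses the log-likelihood ratio of each present feature as its contribution; this is an illustrative assumption about what an explanation might contain, not a description of PA's actual explanation display, and `explain` is a hypothetical name.

```python
import math


def explain(features, likelihoods, predicted, alternative):
    """Rank present features by how strongly they favour `predicted`.

    likelihoods: {class: {feature: P(feature present | class)}}
    Returns [(feature, log-likelihood ratio)], strongest evidence first.
    """
    contributions = {}
    for f, p_pred in likelihoods[predicted].items():
        if f in features:
            p_alt = likelihoods[alternative][f]
            contributions[f] = math.log(p_pred / p_alt)
    return sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)


likelihoods = {
    "A": {"zinc": 0.75, "hydrolase": 0.5, "membrane": 0.25},
    "B": {"zinc": 0.25, "hydrolase": 0.5, "membrane": 0.75},
}
ranking = explain({"zinc", "hydrolase"}, likelihoods, "A", "B")
# "zinc" ranks first (ratio 3:1); "hydrolase" is neutral (ratio 1:1).
```

Because each feature contributes an independent additive term, the ranking can be read directly as evidence for or against a class.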
Results: General Function
GeneQuiz classification; 5-fold cross-validation accuracy on 14 classes
    Organism          Accuracy
    E. coli (2370)    82.5%
    Yeast (2359)      78.8%
    Fly (3842)        76.6%
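For readers unfamiliar with the evaluation protocol, the following is a minimal sketch of 5-fold cross-validation. It pools correct predictions over all folds and uses neither shuffling nor stratification; the slides do not state how PA's folds were built, and the majority-class `train`/`predict` pair is only a placeholder classifier.

```python
from collections import Counter


def k_fold_accuracy(examples, train, predict, k=5):
    """Hold out each of k folds in turn, train on the rest, pool accuracy."""
    folds = [examples[i::k] for i in range(k)]
    correct = total = 0
    for i, test_fold in enumerate(folds):
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train(training)
        for features, label in test_fold:
            correct += predict(model, features) == label
            total += 1
    return correct / total


# Placeholder classifier: always predicts the most common training class.
def train_majority(training):
    return Counter(label for _, label in training).most_common(1)[0][0]


def predict_majority(model, features):
    return model


toy = [(set(), "A")] * 8 + [(set(), "B")] * 2
accuracy = k_fold_accuracy(toy, train_majority, predict_majority)
# accuracy is 0.8: every fold predicts "A", which is right for 8 of 10.
```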
Results: Specific Function
K+ Ion Channel Proteins; 5-fold cross-validation accuracy on 78 sequences, 4 classes
                  Accuracy
    1st effort    97.4%
    2nd effort    100%
Results: Localization
Sub-cellular localization prediction; 3146 sequences from 10 classes
                        Accuracy    Coverage
    Nair and Rost       81.5%       36.9%
    Proteome Analyst    87.8%       100%
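The table reports accuracy alongside coverage. A minimal sketch of one common reading follows, assuming coverage is the fraction of sequences that receive any prediction and accuracy is measured only over those; the slides do not define either term, so treat both definitions as assumptions.

```python
def accuracy_and_coverage(predictions, truth):
    """predictions may contain None where the predictor abstains."""
    made = [(p, t) for p, t in zip(predictions, truth) if p is not None]
    coverage = len(made) / len(truth)
    accuracy = sum(p == t for p, t in made) / len(made) if made else 0.0
    return accuracy, coverage


# Toy labels, not data from the slide.
predictions = ["nucleus", None, "cytoplasm", None]
truth = ["nucleus", "nucleus", "membrane", "cytoplasm"]
accuracy, coverage = accuracy_and_coverage(predictions, truth)
# accuracy is 0.5 (1 of 2 predictions correct); coverage is 0.5 (2 of 4 predicted).
```

Under this reading, a predictor with 100% coverage never abstains, so its accuracy applies to every input sequence.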
Proteome Analyst
- High-throughput
- Transparent
- Prediction of
  - Protein Function
  - Protein Localization
  - Custom Classification
Acknowledgements
- Student developers: Cynthia Luk, Samer Nassar, Kevin McKee
- Biologists: Warren Gallin, Kathy Magor
- Data: Nair and Rost
- Funding
  - PENCE: Protein Engineering Network of Centres of Excellence
  - NSERC: Natural Sciences and Engineering Research Council
  - Sun Microsystems
  - AICML: Alberta Ingenuity Centre for Machine Learning
- Many ‘-ome’ jokes: my wife, Jen
Contact