1
Proteome Analyst Transparent High-throughput Protein Annotation: Function, Localization and Custom Predictors
2
Proteome Analyst Duane Szafron, Paul Lu, Russell Greiner, David Wishart, Zhiyong Lu, Brett Poulin, Roman Eisner, John Anvik, Cam Macdonell
3
Proteome Analyst. Proteome: one of many ‘-omes’; the set of all proteins in an organism. Analysis: prediction of protein function or localization from sequence data.
4
Analyze a Protein We have examples of annotated proteins in various protein classes. We have more examples of unannotated proteins.
5
Analyze a Protein We have examples of annotated proteins in various protein classes. We have more examples of unannotated proteins. What do we do?
6
Analyze a Protein We have examples of annotated proteins in various protein classes. We have more examples of unannotated proteins. What do we do? Find homologues to each protein and assume similar function.
7
Analyze a Protein We have examples of annotated proteins in various protein classes. We have more examples of unannotated proteins. What do we do? Find homologues to each protein and assume similar function. Find characteristics of each protein that affect function.
8
Analyzing Proteins One Protein?
9
Analyzing Proteins One Protein? Just do it.
10
Analyzing Proteins One Protein? Just do it. 5 Proteins?
11
Analyzing Proteins One Protein? Just do it. 5 Proteins? Post-doc familiar with protein classes.
12
Analyzing Proteins One Protein? Just do it. 5 Proteins? Post-doc familiar with protein classes. 50 Proteins?
13
Analyzing Proteins One Protein? Just do it. 5 Proteins? Post-doc familiar with protein classes. 50 Proteins? grad student
14
Analyzing Proteins One Protein? Just do it. 5 Proteins? Post-doc familiar with protein classes. 50 Proteins? grad student 5000 proteins?
15
Analyzing Proteins One Protein? Just do it. 5 Proteins? Post-doc familiar with protein classes. 50 Proteins? grad student 5000 proteins? summer students
16
Proteome Analyst
17
High-throughput Transparent Prediction of Protein Function Protein Localization Custom Classification
18
Machine Learning Task Training INPUT: sequences, classes OUTPUT: Classifier Analysis INPUT: sequences, Classifier OUTPUT: classes
19
Machine Learning Task Training INPUT: sequences, classes OUTPUT: Classifier Analysis INPUT: sequences, Classifier OUTPUT: classes, explanation
20
Training INPUT sequences, classes PA Tools sequences features ML Algorithm features, classes Classifier OUTPUT Classifier
21
Training: INPUT
>class A<Training Seq 1
MVGSGLLWLALVSCILTQASAVQRGYGN
PIEASSYGL...
>class B<Training Seq 2
LLDEPFRSTENSAGSQGCDKNMSGWYRF
VGEGGVRMS...
>class B<Training Seq 3
EVIAYLRDPNCSSILQTEERNWVSVTSP
VQASACRNI....
22
Training: INPUT
>class A<Training Seq 1
MVGSGLLWLALVSCILTQASAVQRGYGN
PIEASSYGL...
>class B<Training Seq 2
LLDEPFRSTENSAGSQGCDKNMSGWYRF
VGEGGVRMS...
>class B<Training Seq 3
EVIAYLRDPNCSSILQTEERNWVSVTSP
VQASACRNI....
Slide labels: classes, protein sequences
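The training input above could be read with a few lines of Python. This is only a sketch: it assumes each header has the layout `>CLASS<NAME` (inferred from the slide, not confirmed by it), and the function name `parse_labeled_fasta` is made up for illustration. Headers without a `<` are treated as unlabeled, which would also cover the analysis input.

```python
def parse_labeled_fasta(text):
    """Parse FASTA-like text into (label, name, sequence) tuples.

    Assumed header layout: '>CLASS<NAME'. A header without '<' yields
    label None, so unlabeled sequences can be read with the same function.
    """
    records = []
    label, name, chunks = None, None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((label, name, "".join(chunks)))
            header = line[1:]
            if "<" in header:
                label, name = (part.strip() for part in header.split("<", 1))
            else:
                label, name = None, header.strip()
            chunks = []
        else:
            # Sequence lines are concatenated without separators.
            chunks.append(line)
    if name is not None:
        records.append((label, name, "".join(chunks)))
    return records
```

Calling it on a two-record string returns one tuple per header, with the residue lines of each record joined into a single sequence.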
23
Training: PA Tools sequences features
24
Training: PA Tools sequences features Homology Tools (BLAST) sequence homologues homologues annotations annotations features
25
Homology Tool sequence features sequence homologues annotations features seq DB BLAST retrieve parse
26
Homology Tool sequence features sequence homologues annotations features seq DB BLAST retrieve parse DBSOURCE swissprot: locus MPPB_NEUCR,... xrefs (non-sequence databases):... InterPro IPR001431,... KEYWORDS Hydrolase; Metalloprotease; Zinc; Mitochondrion; Transit peptide; Oxidoreductase; Electron transport; Respiratory chain.
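The "parse" step on this slide turns a homologue's annotation into features. A minimal sketch of one way to do that for the KEYWORDS field shown above follows. The function name `keyword_features` and the choice to lowercase each keyword are assumptions for illustration; the slides do not say how Proteome Analyst tokenizes annotations, and keyword lists that wrap over several lines are not handled here.

```python
def keyword_features(annotation_text):
    """Collect the semicolon-separated KEYWORDS entries as a set of tokens."""
    features = set()
    for line in annotation_text.splitlines():
        line = line.strip()
        if line.startswith("KEYWORDS"):
            # Drop the field name and the trailing period, then split on ';'.
            body = line[len("KEYWORDS"):].strip().rstrip(".")
            features.update(k.strip().lower() for k in body.split(";") if k.strip())
    return features
```

Applied to the entry on the slide, this would yield tokens such as `hydrolase`, `zinc` and `transit peptide`.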
27
Homology Tool sequence features sequence homologues annotations features seq DB BLAST retrieve parse
28
Training: PA Tools sequences features Homology Tools (BLAST) sequence homologues homologues annotations annotations features Pattern Tools (PFAM, ProSite, …) sequences motifs motifs features
29
Pattern Tool sequence features sequence patterns features pattern DB find parse
30
Pattern Tool sequence features sequence patterns features pattern DB find parse Pfam; PF00234; tryp_alpha_amyl; 1. PROSITE; PS00940; GAMMA_THIONIN; 1. PROSITE; PS00305; 11S_SEED_STORAGE; 1.
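Pattern hits like the Pfam and PROSITE lines above can be turned into features in much the same way. The sketch below assumes one hit per line in the form `DATABASE; ACCESSION; NAME; COUNT.` and builds a `DATABASE:ACCESSION` token for each. Both the function name `pattern_features` and the token format are illustrative choices, not something the slides specify.

```python
def pattern_features(lines):
    """Map lines such as 'Pfam; PF00234; tryp_alpha_amyl; 1.' to 'Pfam:PF00234'."""
    features = set()
    for line in lines:
        # Strip the trailing period and whitespace, then split the fields.
        parts = [p.strip() for p in line.rstrip(". \n").split(";")]
        if len(parts) >= 2 and parts[0] and parts[1]:
            features.add(f"{parts[0]}:{parts[1]}")
    return features
```

Keying on the accession rather than the motif name keeps the feature stable if a database renames an entry.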
31
Pattern Tool sequence features not included in current results sequence patterns features pattern DB find parse
32
Training: ML Algorithm features, classes Classifier
33
Training: ML Algorithm features, classes Classifier. Any ML algorithm may be used. Default = naïve Bayes: consistently near-best accuracy (SVM, ANN slightly better); efficient (for high-throughput); easy to interpret.
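To make the "features, classes in, Classifier out" step concrete, here is a small sketch of naïve Bayes training over token sets. It uses the multinomial variant with add-one (Laplace) smoothing, which is an assumption: the slides name naïve Bayes but not the variant or the smoothing. The function name `train_nb` and the layout of the returned model are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """Train multinomial naive Bayes from (token_set, class_label) pairs.

    Returns log priors and add-one smoothed log likelihoods per class.
    """
    class_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        class_counts[label] += 1
        token_counts[label].update(tokens)
        vocab.update(tokens)
    n_examples = sum(class_counts.values())
    model = {"log_prior": {}, "log_like": {}}
    for label, count in class_counts.items():
        model["log_prior"][label] = math.log(count / n_examples)
        # Denominator: tokens seen in this class plus one pseudo-count per vocab entry.
        denom = sum(token_counts[label].values()) + len(vocab)
        model["log_like"][label] = {
            t: math.log((token_counts[label][t] + 1) / denom) for t in vocab
        }
    return model
```

Training is a single counting pass over the examples, which fits the efficiency point made on the slide.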
34
Training: OUTPUT Classifier
35
Analysis (Classification) INPUT sequences PA Tools sequences features Classifier features classes, explanation OUTPUT classes
36
Analysis: INPUT
>Seq 1
DTILNINFQCAYPLDMKVSLQAALQPIV
SSLNVSVDG...
>Seq 2
AVELSVESVLYVGAILEQGDTSRFNLVL
RNCYATPTE...
>Seq 3
HVEENGQSSESRFSVQMFMFAGHYDLVF
LHCEIHLCD....
37
Analysis: INPUT
>Seq 1
DTILNINFQCAYPLDMKVSLQAALQPIV
SSLNVSVDG...
>Seq 2
AVELSVESVLYVGAILEQGDTSRFNLVL
RNCYATPTE...
>Seq 3
HVEENGQSSESRFSVQMFMFAGHYDLVF
LHCEIHLCD....
Slide label: protein sequences
38
Analysis: PA Tools sequences features
39
Analysis: PA Tools sequences features Homology Tools (BLAST) sequence homologues homologues annotations annotations features Pattern Tools (PFAM, ProSite, …) sequences motifs motifs features
40
Analysis: Classification features classes
41
Analysis: Classification features classes. Naïve Bayes returns probabilities of each class for each sequence; efficient (for high-throughput); easy to interpret.
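The per-class probabilities mentioned here can be obtained by normalizing the naïve Bayes log scores. The sketch below assumes a model stored as two plain dictionaries (log priors and per-class log likelihoods). The numbers are invented for illustration and the function name `posterior` is hypothetical.

```python
import math

def posterior(log_prior, log_like, tokens):
    """Return P(class | tokens) for every class under naive Bayes."""
    scores = {}
    for label, prior in log_prior.items():
        table = log_like[label]
        # Tokens missing from the table are ignored in this sketch.
        scores[label] = prior + sum(table[t] for t in tokens if t in table)
    # Subtract the maximum before exponentiating to avoid underflow.
    top = max(scores.values())
    weights = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}

# Hypothetical two-class model with made-up probabilities.
log_prior = {"A": math.log(0.5), "B": math.log(0.5)}
log_like = {
    "A": {"zinc": math.log(0.6), "transport": math.log(0.1)},
    "B": {"zinc": math.log(0.2), "transport": math.log(0.5)},
}
probs = posterior(log_prior, log_like, {"zinc"})
```

With these invented numbers, class A receives three quarters of the probability mass for a sequence whose only feature is `zinc`.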
42
Analysis: Classification features classes, explanation
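One reason naïve Bayes is easy to interpret is that its score is a sum of per-feature terms, so each feature's pull toward one class over another can be listed. The sketch below ranks tokens by that log-likelihood difference. It illustrates the general idea only and is not necessarily what Proteome Analyst displays. The function name `explain` and all numbers are hypothetical.

```python
def explain(log_like, tokens, winner, rival):
    """Rank tokens by how strongly they favour `winner` over `rival`.

    Each contribution is log P(token | winner) - log P(token | rival).
    Positive values support the winner, negative values support the rival.
    """
    contributions = []
    for t in sorted(tokens):
        if t in log_like[winner] and t in log_like[rival]:
            contributions.append((t, log_like[winner][t] - log_like[rival][t]))
    contributions.sort(key=lambda item: item[1], reverse=True)
    return contributions

# Made-up log likelihoods for two classes.
log_like = {
    "A": {"zinc": -0.5, "mitochondrion": -1.0, "transport": -2.5},
    "B": {"zinc": -1.5, "mitochondrion": -1.2, "transport": -0.7},
}
ranked = explain(log_like, {"zinc", "mitochondrion", "transport"}, "A", "B")
```

Here `zinc` is the strongest evidence for class A, while `transport` argues against it.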
47
Results: General Function GeneQuiz classification 5-fold x-val accuracy on 14 classes
48
Results: General Function GeneQuiz classification 5-fold x-val accuracy on 14 classes
E. coli (2370): 82.5%
Yeast (2359): 78.8%
Fly (3842): 76.6%
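For readers unfamiliar with the "5-fold x-val" figure: the data are split into five parts, each part is held out once while a classifier is trained on the other four, and accuracy is the fraction of held-out examples predicted correctly. The sketch below shows the bookkeeping with a deterministic split (example i goes to fold i mod k) and a trivial majority-class classifier standing in for the real one. All names are hypothetical, and the slides do not say how the folds were formed.

```python
from collections import Counter

def k_fold_accuracy(examples, train_fn, predict_fn, k=5):
    """Estimate accuracy by k-fold cross-validation over (x, y) pairs."""
    correct = 0
    for fold in range(k):
        train = [ex for i, ex in enumerate(examples) if i % k != fold]
        held_out = [ex for i, ex in enumerate(examples) if i % k == fold]
        model = train_fn(train)
        correct += sum(predict_fn(model, x) == y for x, y in held_out)
    return correct / len(examples)

def majority_train(train):
    """Stand-in learner: remember the most common class label."""
    return Counter(y for _, y in train).most_common(1)[0][0]

def majority_predict(model, x):
    """Stand-in predictor: always answer with the remembered label."""
    return model
```

Any pair of training and prediction functions with the same shape can be swapped in for the stand-in classifier.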
49
Results: Specific Function K+ Ion Channel Proteins 5-fold x-val accuracy on 78 sequences, 4 classes
50
Results: Specific Function K+ Ion Channel Proteins 5-fold x-val accuracy on 78 sequences, 4 classes
Accuracy, 1st effort: 97.4%
Accuracy, 2nd effort: 100%
51
Results: Localization Sub-cellular localization prediction 3146 sequences from 10 classes
52
Results: Localization Sub-cellular localization prediction 3146 sequences from 10 classes
Method            Accuracy  Coverage
Nair and Rost     81.5%     36.9%
Proteome Analyst  87.8%     100%
54
Proteome Analyst High-throughput Transparent Prediction of Protein Function Protein Localization Custom Classification
55
Acknowledgements. Student developers: Cynthia Luk, Samer Nassar, Kevin McKee. Biologists: Warren Gallin, Kathy Magor. Data: Nair and Rost.
56
Acknowledgements. Funding: PENCE - Protein Engineering Network of Centres of Excellence; NSERC - Natural Sciences and Engineering Research Council; Sun Microsystems; AICML - Alberta Ingenuity Centre for Machine Learning.
57
Acknowledgements. Many ‘-ome’ jokes: my wife, Jen
58
Contact http://www.cs.ualberta.ca/~bioinfo/PA poulin@cs.ualberta.ca