Study of Protein Prediction Related Problems. Ph.D. candidate: Le-Yi WEI
Contents: 1. Background; 2. Methods; 3. Experiments
Background
Definition of protein
A protein sequence is written over 20 different amino acids: A, C, D, …, V, W, Y.
>Example
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQ
EFFPKFKGLTTADELKKSADVRWHAERIINAVDDAVASMDDTEKMS
MKLRNLSGKHAKSFQVDPEYFKVLAAVIADTVAAGDAGFEKLMSMI
Protein prediction related problems: protein structural class prediction; protein fold prediction; multi-functional enzyme prediction; protein remote homology detection; protein subcellular localization prediction; other protein-related problems, etc.
Common points: treat the protein-related problems as classification tasks. The framework of a classification task: query protein sequence, data presentation, classification algorithms, predicted results. The two major components are data presentation and classification algorithms.
Methods
Feature extraction methods
Primary sequence based, e.g. physicochemical features, N-gram, functional domain, PSSM-profile (auto-covariance), etc.
Secondary structure based, e.g. secondary sequence based and probability matrix based.
Sequence-structure based, e.g. triple-sequence-structure features.
Primary-sequence based: n-gram model. Given a query protein sequence: compute the n-gram features; obtain the feature vector.
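The slide's formulas are not preserved, so the following is only a minimal sketch of one common n-gram representation, assuming the features are normalized occurrence frequencies of all 20^n possible n-grams. The function name `ngram_features` and the alphabetical amino-acid ordering are illustrative choices, not taken from the talk.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids, in alphabetical one-letter order (an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ngram_features(sequence, n=2):
    """Return the normalized frequency of every possible n-gram (20**n values)."""
    total = len(sequence) - n + 1  # number of overlapping windows of length n
    counts = Counter(sequence[i:i + n] for i in range(total))
    keys = ("".join(p) for p in product(AMINO_ACIDS, repeat=n))
    # Sequences shorter than n produce an all-zero vector.
    return [counts[k] / total if total > 0 else 0.0 for k in keys]


features = ngram_features("ACAC", n=2)
```

With n = 1 this reduces to plain amino acid composition (20 values); n = 2 gives 400 values, so the dimensionality grows quickly with n.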
Primary-sequence based: functional domain. [Figure: a query protein sequence is searched with PSI-BLAST against a functional protein database (database sequences 1, 2, 3, …, n-2, n-1, n) to produce a feature vector.]
Primary-sequence based: evolution information. [Figure: PSI-BLAST is run against a protein database to produce a Position-Specific Score Matrix (PSSM).]
Primary-sequence based: AAC features. Computing them yields 20-D features.
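As an illustration of how a 20-D vector can be obtained here, the sketch below assumes the AAC features are the column-wise averages of an L x 20 PSSM (L is the sequence length). The slide's own formula is not preserved, so this definition and the name `pssm_aac` are assumptions.

```python
def pssm_aac(pssm):
    """Average each of the 20 PSSM columns over all sequence positions."""
    if not pssm or any(len(row) != 20 for row in pssm):
        raise ValueError("expected a non-empty L x 20 matrix")
    length = len(pssm)
    return [sum(row[j] for row in pssm) / length for j in range(20)]


# Toy 2 x 20 matrix standing in for a real PSSM (values are made up).
toy_pssm = [[1] * 20, [3] * 20]
aac = pssm_aac(toy_pssm)
```

The result has a fixed length of 20 regardless of the sequence length, which is what lets proteins of different lengths share one classifier input format.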
Primary-sequence based: auto-covariance (AC) transformation. Computing it yields 20*g-D features.
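The 20*g dimensionality is consistent with the usual auto-covariance definition, in which each of the 20 PSSM columns contributes one value per lag from 1 to g. The sketch below assumes that definition, AC(j, lag) = 1/(L - lag) * sum over i of (P[i][j] - mean_j) * (P[i + lag][j] - mean_j); the slide's exact formula is not preserved, and `pssm_auto_covariance` is a hypothetical name.

```python
def pssm_auto_covariance(pssm, g):
    """Return 20*g auto-covariance features from an L x 20 PSSM (lags 1..g)."""
    length = len(pssm)
    if g < 1 or g >= length:
        raise ValueError("g must satisfy 1 <= g < sequence length")
    n_cols = len(pssm[0])
    # Column means over all sequence positions.
    means = [sum(row[j] for row in pssm) / length for j in range(n_cols)]
    features = []
    for j in range(n_cols):
        for lag in range(1, g + 1):
            cov = sum(
                (pssm[i][j] - means[j]) * (pssm[i + lag][j] - means[j])
                for i in range(length - lag)
            )
            features.append(cov / (length - lag))
    return features


# Toy 4 x 20 matrix (made-up values): row i is filled with the value i.
toy_pssm = [[float(i)] * 20 for i in range(4)]
ac = pssm_auto_covariance(toy_pssm, g=2)
```

Features are ordered column by column, so the first g entries belong to the first PSSM column.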
Primary-sequence based: consensus sequence. [Figure: PSSM profile, frequency profile, and the consensus sequence shown against a query sequence.]
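One simple way to derive a consensus sequence from a frequency profile is to take the most frequent amino acid at each position. The sketch below assumes that rule and an alphabetical column order; neither is confirmed by the slide, and real PSI-BLAST output uses a different column order, so the mapping would need adjusting.

```python
# Assumed column order of the frequency profile (alphabetical one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def consensus_sequence(frequency_profile):
    """Pick the highest-frequency amino acid at every sequence position."""
    consensus = []
    for row in frequency_profile:
        best = max(range(len(AMINO_ACIDS)), key=lambda j: row[j])
        consensus.append(AMINO_ACIDS[best])
    return "".join(consensus)


# Toy 3 x 20 profile (made-up values) peaking at columns 0, 1 and 19.
toy_profile = []
for peak in (0, 1, 19):
    row = [0.01] * 20
    row[peak] = 0.81
    toy_profile.append(row)
consensus = consensus_sequence(toy_profile)
```

The consensus sequence has the same length as the query, so any primary-sequence feature extractor can be applied to it unchanged.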
Secondary structure based: secondary structure sequence. An example of a query protein sequence: SLFEQLGGQAAVQAVTAQFYANIQAD. Predicted secondary structure sequence (PSI-PRED): CCHEHEEEEECCCCHHHHHHEEEEECC. It has three states: C (coil), H (helix), E (strand).
Secondary structure based: structure state confidence matrix. An example of a structure state confidence matrix: a query protein sequence, the predicted structure sequence, and the predicted confidence.
Secondary structure based: global structural features, computed from the structure state confidence matrix.
Secondary structure based: local structural features, computed from the structure state confidence matrix.
Sequence-structure based: the framework of the triple sequence-structure feature extraction method.
Classification algorithms
Commonly used classification algorithms, e.g. Support Vector Machine (SVM), Random Forest (RF), SMO, Naive Bayes, etc.
Ensemble classification algorithms, e.g. Majority Vote, Average Probability, Selective Ensemble, etc.
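The two simplest ensemble rules named above can be sketched as follows. This is a generic illustration that assumes each base classifier returns either a class label or a probability vector over the same ordered class list; the function names and toy labels are hypothetical and not taken from the talk.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label predicted by the largest number of base classifiers."""
    return Counter(labels).most_common(1)[0][0]


def average_probability(prob_vectors, classes):
    """Average the per-class probabilities and return the top-scoring class."""
    n = len(prob_vectors)
    avg = [sum(p[k] for p in prob_vectors) / n for k in range(len(classes))]
    return classes[max(range(len(classes)), key=lambda k: avg[k])]


# Toy predictions from three base classifiers (made-up values).
vote = majority_vote(["A", "B", "A"])
avg_choice = average_probability([[0.6, 0.4], [0.1, 0.9]], ["A", "B"])
```

Majority vote only needs hard labels, while average probability uses each classifier's confidence, so a single very confident classifier can outweigh a less confident one.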
Experiments
The framework of RF_PSCP. Webserver site :
Protein structural class prediction: datasets. Three benchmark datasets; three updated large-scale datasets; sequence similarity.
Results: comparison with existing methods on three benchmark datasets.
Results: tests of the proposed method on three updated large-scale datasets.
Results: comparison of different combinations of feature subsets on three benchmark datasets.
Results: optimization of the Random Forest classifier.
Q&A!