Ubiquitination Sites Prediction
Dah Mee Ko
Advisor: Dr. Predrag Radivojac
School of Informatics, Indiana University
May 22, 2009
Outline
Ubiquitination
Machine Learning: Decision Tree, Support Vector Machines
Prediction of Ubiquitination Sites: influence of sequence, influence of structure, influence of evolutionary information
Ubiquitin
A small protein that occurs in all eukaryotic cells and is highly conserved among eukaryotic species.
Consists of 76 amino acids and has a molecular mass of 8.5 kDa.
Key features: its C-terminal tail and its Lys residues.
Human ubiquitin sequence:
MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG
http://en.wikipedia.org/wiki/Image:Ubiquitin_cartoon.png
Ubiquitination
A post-translational modification of a protein: the covalent attachment of one or more ubiquitin monomers to Lys residues.
The modification is reversible.
Ubiquitination targets proteins for degradation by the proteasome.
Functions of Ubiquitination
Monoubiquitination: histone regulation, DNA repair, endocytosis, budding of retroviruses from the plasma membrane.
Polyubiquitination: protein kinase activation.
Machine Learning
Machine learning is programming computers to optimize a performance criterion using data and past experience.
Learn general models from a data set of particular examples.
Build a model that is a good and useful approximation to the data.
Machine Learning
Supervised learning: learn input/output patterns from examples with known, correct outputs. Split the data into a training set and a test set, train the model on the training data, and evaluate its performance on the test data.
Unsupervised learning: learn patterns in the input without known outputs.
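The train/test protocol above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's actual evaluation code; the function name and split fraction are assumptions.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle the examples and split them into a training and a test set."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    n_test = int(len(items) * test_fraction)
    return items[n_test:], items[:n_test]  # (train, test)

data = list(range(10))
train, test = train_test_split(data)  # 8 training examples, 2 test examples
```

The model would then be fit on `train` only, and its accuracy reported on `test`.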
Machine Learning – Decision Tree
A classification algorithm. Each internal node tests the value of a feature and branches according to the result of the test; each leaf node assigns a class label.
Example training data (features X, class Y):
Outlook   Humidity  Wind    Play
Sunny     High      Weak    No
Sunny     Normal    Weak    Yes
Overcast  Normal    Weak    Yes
Rain      Normal    Strong  No
Rain      Normal    Weak    Yes
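A decision tree consistent with this table can be written out by hand. The slides do not show the learned tree, so the particular sequence of tests below is an assumption; it is only one tree that classifies all five rows correctly.

```python
def predict_play(outlook, humidity, wind):
    # Root node: test the Outlook feature.
    if outlook == "Overcast":
        return "Yes"          # leaf
    # Internal node: test the Humidity feature.
    if humidity == "High":
        return "No"           # leaf
    # Internal node: test the Wind feature.
    if wind == "Strong":
        return "No"           # leaf
    return "Yes"              # leaf

rows = [
    ("Sunny", "High", "Weak", "No"),
    ("Sunny", "Normal", "Weak", "Yes"),
    ("Overcast", "Normal", "Weak", "Yes"),
    ("Rain", "Normal", "Strong", "No"),
    ("Rain", "Normal", "Weak", "Yes"),
]
# The hand-built tree reproduces every label in the training table.
assert all(predict_play(o, h, w) == y for o, h, w, y in rows)
```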
Machine Learning – Random Forest
An ensemble classifier consisting of many decision trees.
Each tree is constructed using a bootstrap sample of the training data.
After a large number of trees are generated, each tree casts a unit vote for its predicted class, and the most popular class is taken as the prediction.
Machine Learning – Support Vector Machines
Viewing the input data as two sets of vectors in an n-dimensional space, a support vector machine constructs a separating hyperplane in that space.
The hyperplane is chosen to maximize the margin between the two data sets.
Machine Learning – Support Vector Machines
If a data set is not linearly separable, map it into a higher-dimensional space using the kernel approach.
In the figure, H3 does not separate the classes, H1 separates them with only a small margin, and H2 separates them with the maximum margin.
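The margin being maximized can be made concrete: the distance from a point x to the hyperplane w·x + b = 0 is |w·x + b| / ||w||, and the margin of a separating hyperplane is the smallest such distance over the data. A minimal sketch (the slides contain no code; names are illustrative):

```python
import math

def margin(w, b, points):
    """Smallest distance from any point to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(sum(wi * xi for wi, xi in zip(w, p)) + b)
               for p in points) / norm

# Vertical hyperplane x = 0 in 2-D; the closest point is at distance 2.
m = margin((1.0, 0.0), 0.0, [(2.0, 1.0), (-3.0, 5.0)])
```

An SVM solver searches over (w, b) for the separating hyperplane that makes this quantity as large as possible.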
Data Sets for Prediction
334 protein sequences from yeast.
Positive and negative sites are windows of 25 amino acid residues centered at a lysine, e.g. YEYEYDQTDPVAKDPYNPYYLDFAS.
All positive and negative sites sharing more than 40% identity within the data sets were removed.
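Extracting such sites from a sequence is straightforward. A sketch, assuming windows that run past either terminus are simply skipped (the slides do not say how termini were handled):

```python
def lysine_windows(seq, size=25):
    """Return each size-residue window centered at a lysine (K).

    Windows extending past either terminus are skipped in this sketch.
    """
    half = size // 2
    return [seq[i - half:i + half + 1]
            for i, aa in enumerate(seq)
            if aa == "K" and half <= i <= len(seq) - half - 1]

# The example site from the slides is itself a 25-mer centered on K.
site = "YEYEYDQTDPVAKDPYNPYYLDFAS"
windows = lysine_windows(site)
```

Each extracted window becomes one positive or negative example, depending on whether its central lysine is known to be ubiquitinated.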
Features – Sequence Information
Relative amino acid frequencies, entropy, net charge, total charge, aromatics, charge-hydrophobicity ratio, protein disorder probability, Vihinen's flexibility, hydrophobic moments, B-factors.
64 x 4 = 256 features.
Example: relative amino acid frequencies, window size = 11, for the site YEYEYDQTDPVAKDPYNPYYLDFAS:
A = 1/11  G = 0/11  M = 0/11  S = 0/11
C = 0/11  H = 0/11  N = 1/11  T = 1/11
D = 2/11  I = 0/11  P = 3/11  V = 1/11
E = 0/11  K = 1/11  Q = 0/11  W = 0/11
F = 0/11  L = 0/11  R = 0/11  Y = 1/11
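The frequency table above can be reproduced with a short function. A sketch, assuming the window is centered on the site's lysine (function name is illustrative):

```python
def relative_frequencies(site, window=11):
    """Relative amino acid frequencies in a window centered on the site's K."""
    center = site.index("K")  # the modification site (one K in this example)
    half = window // 2
    sub = site[center - half:center + half + 1]  # 11-residue window
    return {aa: sub.count(aa) / window for aa in "ACDEFGHIKLMNPQRSTVWY"}

freqs = relative_frequencies("YEYEYDQTDPVAKDPYNPYYLDFAS")
# The window is TDPVAKDPYNP, so e.g. P occurs 3/11 and D occurs 2/11 of the time.
```

These 20 values form one block of the sequence-derived feature vector; the other descriptors (entropy, net charge, disorder, etc.) are computed over windows in a similar way.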
Features – Evolutionary Information
Position Specific Scoring Matrix (PSSM), window size = 11: 21 x 4 = 84 features.
Combined: 256 (Seq) + 84 (Evol) = 340 features.
YEYEYDQTDPVAKDPYNPYYLDFAS
Features – Structure Information
BLAST each sequence against the PDB database and select alignments with greater than 30% identity.
For each mapped site, five shells with radial boundaries at 1.5, 3, 4.5, 6, and 7.5 Å are constructed around the residue's alpha-carbon atom, using X, Y, Z coordinates from the PDB.
Amino acid at the center site: 20 features, e.g. for K:
A C D E F G H I K L M N P Q R S T V W Y
0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
Each shell contains 24 features: 4 for counts of C, N, O, and S atoms and 20 for counts of amino acids.
20 + 24 x 5 = 140 features.
Structure could be mapped for 60 of 245 positive sites (~24%) and 3239 of 12906 negative sites (~25%); the remaining sites receive a 1 x 140 zero vector.
Combined: 256 (Seq) + 84 (Evol) + 140 (Str) = 480 features.
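The shell-counting step can be sketched as follows. The coordinates below are hypothetical, and the slides do not specify tie-breaking at shell boundaries, so this version assigns an atom to the innermost shell whose boundary it does not exceed:

```python
import math

SHELLS = [1.5, 3.0, 4.5, 6.0, 7.5]  # radial boundaries in angstroms

def shell_index(center, atom):
    """Index of the shell the atom falls into, or None if beyond 7.5 A."""
    d = math.dist(center, atom)  # Euclidean distance (Python 3.8+)
    for i, radius in enumerate(SHELLS):
        if d <= radius:
            return i
    return None

def count_atoms_per_shell(center, atoms):
    """Count how many atoms land in each of the five shells."""
    counts = [0] * len(SHELLS)
    for atom in atoms:
        i = shell_index(center, atom)
        if i is not None:
            counts[i] += 1
    return counts

# Hypothetical alpha-carbon at the origin and three nearby atoms.
counts = count_atoms_per_shell(
    (0.0, 0.0, 0.0),
    [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (5.0, 5.0, 5.0)])
```

In the actual features, separate counts like these are kept per element (C, N, O, S) and per amino acid type, giving the 24 features per shell.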
Prediction Results – Random Forest
Features          Accuracy        AUC
Seq + Evol + Str  65.2 +/- 22.8   71.5 +/- 25.3
Seq + Evol        63.9 +/- 23.4   69.8 +/- 24.9
Seq + Str         66.2 +/- 22.3   70.6 +/- 24.3
Evol + Str        56.7 +/- 23.1   59.2 +/- 27.7
Seq               64.6 +/- 22.4   70.1 +/- 24.3
Prediction Results – SVM
Features          Accuracy        AUC
Seq + Evol + Str  63.8 +/- 23.2   71.2 +/- 25.3
Seq + Evol        63.7 +/- 23.4   71.0 +/- 25.4
Seq + Str         65.1 +/- 23.5   71.3 +/- 25.0
Evol + Str        56.6 +/- 22.2   59.9 +/- 28.0
Seq               65.8 +/- 23.0   71.2 +/- 25.0
Feature Selection
Rank features using correlation coefficients.
Correlation  Feature
-0.0790      Net charge
-0.0585      K frequency
0.0530       E frequency
0.0513       D frequency
0.0481       Predicted B-factor
0.0471       Protein disorder
0.0448       Vihinen's flexibility
-0.0387      Hydrophobic moment
-0.0383      L frequency
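Ranking by correlation with the class label can be sketched with the standard Pearson formula. The feature values below are made up purely for illustration; only the formula is the point:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between a feature and the class labels."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Labels: 1 = ubiquitinated, 0 = not; feature values are hypothetical.
r = pearson([0.9, 0.1, 0.8, 0.2], [1, 0, 1, 0])  # strongly positive here
```

Each feature gets one such coefficient against the positive/negative labels, and features are ranked by magnitude; note that in the table the absolute values are small, which is consistent with the modest overall accuracy.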
Conclusions
Ubiquitination sites are predictable, although the accuracy is modest.
Likely limiting factors: long-range interactions, flexibility of structure, noise in the positive sites, and the small data set.
The sequence features are the most important.
Acknowledgements Prof. Predrag Radivojac Wyatt Clark Arunima Ram Nils Schimmelmann Prof. Sun Kim Linda Hostetter School of Informatics
Thank you!