
1 Ubiquitination Sites Prediction Dah Mee Ko Advisor: Dr. Predrag Radivojac School of Informatics Indiana University May 22, 2009

2 Outline  Ubiquitination  Machine Learning Decision Tree Support Vector Machines  Prediction of ubiquitination sites Influence of sequence Influence of structure Influence of evolutionary information

3 Ubiquitin  A small protein that occurs in all eukaryotic cells.  Highly conserved among eukaryotic species.  Consists of 76 amino acids and has a molecular mass of 8.5 kDa.  Key features: its C-terminal tail and Lys residues.  Human ubiquitin sequence: MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG http://en.wikipedia.org/wiki/Image:Ubiquitin_cartoon.png

4 Ubiquitination  Post-translational modification of a protein  Covalent attachment of one or more ubiquitin monomers to Lys residues  Reversible  Targets proteins for degradation by the proteasome

5 Functions of Ubiquitination  Monoubiquitination Histone regulation DNA repair Endocytosis Budding of retroviruses from the plasma membrane  Polyubiquitination Protein kinase activation

6 Machine Learning  Machine learning is programming computers to optimize a performance criterion using data and past experience. Learn general models from a data set of particular examples. Build a model that is a good and useful approximation to the data.

7 Machine Learning  Supervised learning Learn input/output patterns from given, correct output. Split data into training and test sets.  Train the model on the training data.  Evaluate performance on the test data.  Unsupervised learning Learn input/output patterns without known output.
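Below is a minimal sketch of the supervised workflow described above. scikit-learn, the placeholder data, and the classifier choice are assumptions; the slides do not name an implementation.

```python
# Minimal sketch of supervised learning: split, train, evaluate.
# scikit-learn and the random placeholder data are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 10)          # feature vectors (placeholder)
y = np.random.randint(0, 2, 100)     # known, correct labels (placeholder)

# Split data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Train a model on the training data (the classifier here is an arbitrary
# stand-in), then evaluate its performance on the held-out test data.
model = LogisticRegression().fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```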

8 Machine Learning – Decision Tree  A classification algorithm.  Each internal node tests the value of a feature and branches according to the result of the test.  Each leaf node assigns a classification. Example data (X = Outlook, Humidity, Wind; Y = Play):

Outlook   Humidity  Wind    Play
Sunny     High      Weak    No
Sunny     Normal    Weak    Yes
Overcast  Normal    Weak    Yes
Rain      Normal    Strong  No
Rain      Normal    Weak    Yes
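A hedged sketch of fitting a decision tree to the small weather/play table above; scikit-learn and the one-hot encoding step are assumptions, not part of the slides.

```python
# Sketch: a decision tree learned from the weather/play table above.
# scikit-learn and the one-hot encoding of categorical features are assumptions.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "Outlook":  ["Sunny", "Sunny", "Overcast", "Rain", "Rain"],
    "Humidity": ["High", "Normal", "Normal", "Normal", "Normal"],
    "Wind":     ["Weak", "Weak", "Weak", "Strong", "Weak"],
    "Play":     ["No", "Yes", "Yes", "No", "Yes"],
})

X = pd.get_dummies(data[["Outlook", "Humidity", "Wind"]])   # indicator columns
y = data["Play"]

tree = DecisionTreeClassifier().fit(X, y)

# Each internal node tests one feature; each leaf assigns a class.
print(export_text(tree, feature_names=list(X.columns)))
```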

9 Machine Learning – Random Forest  A machine learning ensemble classifier  Consists of many decision trees  Each tree is constructed using a bootstrap sample of training data.  After a large number of trees are generated, each tree casts a unit vote for the most popular class.
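The ensemble idea can be sketched with scikit-learn's RandomForestClassifier (an assumption; the slides do not specify an implementation, and the data and parameter values below are placeholders).

```python
# Sketch of a random forest: many trees, each fit on a bootstrap sample of
# the training data, voting on the class. Library and parameters are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(200, 340)        # placeholder: e.g. 340 Seq+Evol features per site
y = np.random.randint(0, 2, 200)    # placeholder labels: ubiquitinated or not

forest = RandomForestClassifier(
    n_estimators=500,   # number of trees in the ensemble
    bootstrap=True,     # each tree is trained on a bootstrap sample
).fit(X, y)

# Class probabilities reflect the fraction of trees voting for each class.
print(forest.predict_proba(X[:3]))
```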

10 Machine Learning – Support Vector Machines  Viewing input data as two sets of vectors in an n-dimensional space, a support vector machine constructs a separating hyperplane in that space.  The hyperplane maximizes the margin between the two data sets.

11 Machine Learning – Support Vector Machines  If a data set is not linearly separable, map it into a higher-dimensional space using a kernel.  H3 does not separate the classes.  H1 separates them with a small margin.  H2 separates them with the maximum margin.
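A short sketch of the two cases above with scikit-learn's SVC (an assumption): a linear kernel searches for the maximum-margin hyperplane directly, while a non-linear kernel implicitly maps the data into a higher-dimensional space first. The RBF kernel and the random data are illustrative choices.

```python
# Sketch: maximum-margin classification with and without a non-linear kernel.
# scikit-learn, the RBF kernel choice, and the random data are assumptions.
import numpy as np
from sklearn.svm import SVC

X = np.random.rand(200, 340)        # placeholder feature vectors
y = np.random.randint(0, 2, 200)    # placeholder labels

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)           # hyperplane in input space
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)  # implicit higher-dimensional mapping

print(linear_svm.predict(X[:3]), rbf_svm.predict(X[:3]))
```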

12 Data Sets for Prediction  334 protein sequences from yeast  Positive and negative sites represented as windows of 25 amino acid residues centered at a lysine  Remove all positive and negative sites that have more than 40% identity within the data sets. Example site: YEYEYDQTDPVAKDPYNPYYLDFAS
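A minimal sketch of how candidate sites like the example above could be extracted: a 25-residue window centered on each lysine, padded at the protein termini. The helper name and the padding character are hypothetical; the slides only state the 25-residue, lysine-centered window.

```python
# Sketch: extract 25-residue windows centered on each lysine (K).
# The padding character "X" and the helper name are hypothetical.
def lysine_windows(sequence, flank=12, pad="X"):
    """Yield (position, 25-mer) pairs for every lysine in the sequence."""
    padded = pad * flank + sequence + pad * flank
    for i, residue in enumerate(sequence):
        if residue == "K":
            # The lysine at sequence index i sits at the center of the window.
            yield i, padded[i : i + 2 * flank + 1]

# Human ubiquitin sequence from slide 3, used purely as an example input.
seq = ("MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIF"
       "AGKQLEDGRTLSDYNIQKESTLHLVLRLRGG")
for pos, window in lysine_windows(seq):
    print(pos, window)
```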

13 Features – Sequence Information  Relative amino acid frequencies, entropy, net charge, total charge, aromatics, charge-hydrophobicity ratio, protein disorder probability, Vihinen's flexibility, hydrophobic moments, B-factors  64 x 4 = 256 features  Relative amino acid frequencies, window size = 11, for the example site YEYEYDQTDPVAKDPYNPYYLDFAS:

A = 1/11   G = 0/11   M = 0/11   S = 0/11
C = 0/11   H = 0/11   N = 1/11   T = 1/11
D = 2/11   I = 0/11   P = 3/11   V = 1/11
E = 0/11   K = 1/11   Q = 0/11   W = 0/11
F = 0/11   L = 0/11   R = 0/11   Y = 1/11
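A sketch of one of the sequence feature groups, the relative amino acid frequencies in an 11-residue window around the site; the other feature groups listed above would be computed analogously. The helper below is illustrative, but for the example site it reproduces the counts shown above.

```python
# Sketch: relative amino acid frequencies in an 11-residue window around the
# central lysine of a 25-residue site. The helper itself is illustrative.
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def relative_aa_frequencies(window):
    """Return the 20 relative amino acid frequencies for a window."""
    counts = Counter(window)
    return [counts.get(aa, 0) / len(window) for aa in AMINO_ACIDS]

site = "YEYEYDQTDPVAKDPYNPYYLDFAS"     # example site from slide 12
window = site[12 - 5 : 12 + 6]         # 11 residues centered on the lysine: "TDPVAKDPYNP"
for aa, freq in zip(AMINO_ACIDS, relative_aa_frequencies(window)):
    print(f"{aa} = {freq:.3f}")
```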

14 Features – Evolutionary Information  Position Specific Scoring Matrix (PSSM)  21 x 4 = 84 features, window size = 11  256 (Seq) + 84 (Evol) = 340 features. Example site: YEYEYDQTDPVAKDPYNPYYLDFAS
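The PSSM features could in principle be derived by pooling PSSM columns over the same 11-residue window; how the slides arrive at 21 x 4 = 84 features is not spelled out here, so the sketch below only illustrates the general idea. The array shapes, the averaging scheme, and the placeholder PSSM are assumptions.

```python
# Sketch: average PSSM rows over an 11-residue window around the site.
# A real PSSM would come from, e.g., PSI-BLAST; here it is a random placeholder,
# and the pooling scheme is an assumption rather than the slides' exact recipe.
import numpy as np

def pssm_window_features(pssm, site, flank=5):
    """Average the (length x 20) PSSM over a 2*flank+1 window around `site`."""
    lo, hi = max(0, site - flank), min(len(pssm), site + flank + 1)
    return pssm[lo:hi].mean(axis=0)

pssm = np.random.randn(77, 20)                      # placeholder PSSM for a 77-residue protein
print(pssm_window_features(pssm, site=10).shape)    # -> (20,)
```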

15 Features – Structure Information  BLAST each sequence against the PDB database.  Select alignments with greater than 30% identity.  For each mapped site, five shells with radial boundaries of 1.5, 3, 4.5, 6 and 7.5 Å are constructed around the residue's alpha-carbon atom using X, Y, Z coordinates from the PDB.  Amino acid at the center site: 20 features, e.g. K →

A C D E F G H I K L M N P Q R S T V W Y
0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0

 Each shell contains 24 features: 4 for counts of C, N, O, S atoms and 20 for counts of amino acids.  20 + 24 x 5 = 140 features  Structure could be mapped for 60 of 245 positive sites (~24%) and 3239 of 12906 negative sites (~25%); all other sites receive a 1 x 140 zero vector.  256 (Seq) + 84 (Evol) + 140 (Str) = 480 features
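A hedged sketch of the radial-shell construction: count element types within shells of 1.5, 3, 4.5, 6 and 7.5 Å around the site's alpha-carbon. Real coordinates would come from the mapped PDB structure (e.g. via a PDB parser); the coordinates, element labels, and function name below are placeholders, and only the 4 C/N/O/S counts of the 24 per-shell features are shown.

```python
# Sketch: per-shell C/N/O/S counts around the site's alpha-carbon.
# Coordinates and element labels are random placeholders; in practice they
# would be read from the aligned PDB structure.
import numpy as np

SHELLS = [1.5, 3.0, 4.5, 6.0, 7.5]      # radial boundaries in Angstroms
ELEMENTS = ["C", "N", "O", "S"]

def shell_element_counts(center, atom_coords, atom_elements):
    """Return a len(ELEMENTS) x len(SHELLS) matrix of element counts per shell."""
    dists = np.linalg.norm(atom_coords - center, axis=1)
    elements = np.asarray(atom_elements)
    counts = np.zeros((len(ELEMENTS), len(SHELLS)), dtype=int)
    lower = 0.0
    for j, upper in enumerate(SHELLS):
        in_shell = (dists > lower) & (dists <= upper)
        for i, elem in enumerate(ELEMENTS):
            counts[i, j] = np.sum(in_shell & (elements == elem))
        lower = upper
    return counts

center = np.zeros(3)                               # alpha-carbon of the site
coords = np.random.uniform(-8, 8, size=(50, 3))    # placeholder atom coordinates
elems = np.random.choice(ELEMENTS, size=50)        # placeholder element labels
print(shell_element_counts(center, coords, elems))
```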

16 Prediction Results – Random Forest

Features           Accuracy (%)     AUC (%)
Seq + Evol + Str   65.2 +/- 22.8    71.5 +/- 25.3
Seq + Evol         63.9 +/- 23.4    69.8 +/- 24.9
Seq + Str          66.2 +/- 22.3    70.6 +/- 24.3
Evol + Str         56.7 +/- 23.1    59.2 +/- 27.7
Seq                64.6 +/- 22.4    70.1 +/- 24.3

17 Prediction Results – Random Forest

18 Prediction Results – SVM

Features           Accuracy (%)     AUC (%)
Seq + Evol + Str   63.8 +/- 23.2    71.2 +/- 25.3
Seq + Evol         63.7 +/- 23.4    71.0 +/- 25.4
Seq + Str          65.1 +/- 23.5    71.3 +/- 25.0
Evol + Str         56.6 +/- 22.2    59.9 +/- 28.0
Seq                65.8 +/- 23.0    71.2 +/- 25.0

19 Prediction Results – SVM

20 Feature Selection  Rank features using correlation coefficients.

Coefficient   Feature
-0.0790       Net charge
-0.0585       K frequency
 0.0530       E frequency
 0.0513       D frequency
 0.0481       Predicted B-factor
 0.0471       Protein disorder
 0.0448       Vihinen's flexibility
-0.0387       Hydrophobic moment
-0.0383       L frequency
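A sketch of the ranking step: score each feature by its correlation with the class label and sort by absolute value. Pearson correlation and the random placeholder data are assumptions; the slide only says "correlation coefficients".

```python
# Sketch: rank features by the absolute value of their correlation with the label.
# Pearson correlation and the placeholder data/feature names are assumptions.
import numpy as np

def rank_features_by_correlation(X, y, names):
    """Return (coefficient, feature name) pairs sorted by |coefficient|."""
    coeffs = [np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])]
    return sorted(zip(coeffs, names), key=lambda t: abs(t[0]), reverse=True)

X = np.random.rand(500, 4)                  # placeholder feature matrix
y = np.random.randint(0, 2, 500)            # placeholder class labels
names = ["Net charge", "K frequency", "E frequency", "D frequency"]
for coef, name in rank_features_by_correlation(X, y, names):
    print(f"{coef:+.4f}  {name}")
```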

21 Conclusions  Ubiquitination sites are predictable.  The accuracy is modest; likely reasons include long-range interactions, flexibility of structure, noise in the positive sites, and the small data set.  The sequence features are the most important.

22 Acknowledgements  Prof. Predrag Radivojac  Wyatt Clark  Arunima Ram  Nils Schimmelmann  Prof. Sun Kim  Linda Hostetter  School of Informatics

23 Thank you!

