Associating Biomedical Terms: Case Study for Acetylation
Aaron Buechlein
Indiana University School of Informatics
Advisor: Dr. Predrag Radivojac
Overview Background Previous Work Methods Results
Central Dogma
Post-Translational Modifications (PTMs)
Acetylation
Acetylation involves the substitution of an acetyl group (-COCH3) for a hydrogen atom
Typically occurs on N-terminal tails and lysine residues (Lys or K)
Previous Predictors
Several PTM predictors were created prior to this work
Acetylation predictors also existed prior to this work
  NetAcet predicts N-terminal sites only
  AutoMotif Server predicts various PTMs and includes an acetylation component
  PAIL predicts lysine acetylation
Methods
Create Dataset
  Download articles relevant to acetylation and extract sites
  Rank articles in order to elucidate sites quickly
  SwissProt and Human Protein Reference Database (HPRD)
Create Predictors
  Leave-one-protein-out validation
  MATLAB
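Leave-one-protein-out validation can be sketched as follows. This is a minimal illustration in Python rather than the MATLAB used in the study; the function name and the example protein identifiers are hypothetical. All sites from one protein are held out together, so no protein contributes to both training and testing in the same fold.

```python
from collections import defaultdict


def leave_one_protein_out(protein_ids):
    """Yield (held_out_protein, train_indices, test_indices) for each protein.

    protein_ids[i] is the protein that example i (one lysine site) came from.
    """
    by_protein = defaultdict(list)
    for index, protein in enumerate(protein_ids):
        by_protein[protein].append(index)

    for held_out, test_indices in by_protein.items():
        # Train on every site that does not belong to the held-out protein.
        train_indices = [
            i
            for protein, indices in by_protein.items()
            if protein != held_out
            for i in indices
        ]
        yield held_out, train_indices, test_indices


# Hypothetical example: five lysine sites drawn from three proteins.
protein_ids = ["P1", "P1", "P2", "P3", "P3"]
folds = list(leave_one_protein_out(protein_ids))
for held_out, train, test in folds:
    print(held_out, train, test)
```

Grouping by protein rather than by site keeps neighbouring lysines of the same protein, which share sequence context, out of the training set when that protein is being tested.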
Article Retrieval
Searched individual journal sites for articles relevant to acetylation
Saved the resulting HTML pages for each journal
Used these pages as input to a web crawler that downloaded the articles
Because journal sites are constructed differently, each journal required its own regular expression to extract article links
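The per-journal link extraction might look like the sketch below. The journal key, the regular expression, and the URLs are hypothetical placeholders; the expressions actually used in the study are not given in the slides.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern table: one regular expression per journal site.
LINK_PATTERNS = {
    "example-journal": re.compile(r'href="(/content/[^"]+/full)"'),
}


def extract_article_links(html, journal, base_url):
    """Return the unique article URLs found in one saved search-result page."""
    pattern = LINK_PATTERNS[journal]
    links = []
    for match in pattern.finditer(html):
        url = urljoin(base_url, match.group(1))
        if url not in links:  # keep first occurrence, preserve page order
            links.append(url)
    return links


# Hypothetical saved page with one article link (repeated) and one unrelated link.
page = (
    '<a href="/content/1/2/full">Article</a> '
    '<a href="/about">About</a> '
    '<a href="/content/1/2/full">Article again</a>'
)
links = extract_article_links(page, "example-journal", "https://journal.example.org")
print(links)
```

The resulting list would then be handed to the crawler for download.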
Rank Articles
First, locate occurrences of the first phrase, “phrase 1”: A = {a_1, a_2, …, a_|A|}
Next, locate occurrences of the second phrase, “phrase 2”: R = {r_1, r_2, …, r_|R|}
c and d are constants
x is the distance in characters between r and the nearest word a
An example: acetylation
1. Word “acetylat”: A = {a_1, a_2, …, a_m}
2. Regular expression (k | lys | lysine)(space)*(digit)+: R = {r_1, r_2, …, r_n}
An example: acetylation
Score for article S: (formula missing in source)
Papers with S > 100 are rich in sites; papers with S < 30 fall in the “twilight” zone
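The two matching steps of this example, and the distance x from each match r to the nearest word a, can be sketched in Python as below. The score formula itself is not reproduced here because it is not available in the slides. Measuring x between match start offsets and matching case-insensitively are assumptions, and the function name is hypothetical.

```python
import re
from bisect import bisect_left

# Step 1: the word "acetylat" (matches acetylated, acetylation, ...).
WORD = re.compile(r"acetylat", re.IGNORECASE)
# Step 2: k, lys, or lysine, optional spaces, then one or more digits.
SITE = re.compile(r"(?:k|lys|lysine)\s*\d+", re.IGNORECASE)


def nearest_distances(text):
    """For each site match r, return the distance x to the nearest word a.

    Assumption: x is measured between the start offsets of the two matches.
    """
    a_positions = [m.start() for m in WORD.finditer(text)]
    if not a_positions:
        return []
    distances = []
    for match in SITE.finditer(text):
        r = match.start()
        i = bisect_left(a_positions, r)
        # Only the neighbours just before and just after r can be nearest.
        candidates = a_positions[max(i - 1, 0): i + 1]
        distances.append(min(abs(r - a) for a in candidates))
    return distances


print(nearest_distances("Acetylation of Lys9 and K14 was observed."))
```

An article score would then combine these distances using the constants c and d.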
Elucidate Sites
Sites were manually extracted from articles, beginning with the highest-ranked
The original experimental paper for each site was checked for traceable evidence
Sites were extracted from SwissProt
Sites were extracted from HPRD
Predictors
Support Vector Machine
Artificial Neural Network
Decision Tree
Predictor Input
Positives: all lysines found to be acetylated
Negatives: all lysines not found to be acetylated
Features were built from characteristics of the region surrounding each lysine
  Amino acid content, hydrophobicity, charge, disorder, etc.
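One way the positive and negative examples could be assembled is sketched below, using amino acid content as the only feature. The window half-width, the function names, and the example sequence are assumptions for illustration; the slides do not state the window size or the exact feature encoding.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def lysine_windows(sequence, half_width=3):
    """Yield (position, window) for every lysine; positions are 1-based."""
    for i, residue in enumerate(sequence):
        if residue == "K":
            start = max(i - half_width, 0)
            # Windows are truncated at the sequence ends rather than padded.
            yield i + 1, sequence[start:i + half_width + 1]


def composition(window):
    """Fraction of each of the 20 amino acids within the window."""
    return [window.count(aa) / len(window) for aa in AMINO_ACIDS]


def build_examples(sequence, acetylated_positions, half_width=3):
    """Return (position, features, label) with label 1 for acetylated lysines."""
    examples = []
    for position, window in lysine_windows(sequence, half_width):
        label = 1 if position in acetylated_positions else 0
        examples.append((position, composition(window), label))
    return examples


# Hypothetical sequence with lysines at positions 3 and 6; only 3 is acetylated.
examples = build_examples("MAKRTKQTA", {3})
for position, features, label in examples:
    print(position, label)
```

Hydrophobicity, charge, and disorder features would be appended to the same vector in an analogous way.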
Predictor Input
Table columns: Protein, Features, Acetylated
Article and Ranking Results
4888 articles from 10 sites were searched
  Nature provided 2147 articles
  Science Direct provided 1519 articles
The highest-ranking article was obtained from the Journal of Biological Chemistry
  Score: (value missing in source)
  Contained 10 acetylation sites
When histones are excluded, the highest-ranking article was obtained from Nature
  Previously ranked at #5; score: (value missing in source)
  Contained 9 unique acetylation sites
Top 25
Table columns: Rank, Score, Sites, Article Source (score and site values missing in source)
1) Journal of Biological Chemistry
2) Cell / Science Direct
3) Nature
4) Journal of Proteome Research
5) Nature
6) Biochemistry
7) Cell / Science Direct
8) Nature
9) Molecular Cell / Science Direct
10) Journal of Biological Chemistry
11) Biochemistry
12) Journal of Biological Chemistry
13) International Journal of Mass Spectrometry / Science Direct
14) Biochemistry
15) Journal of Biological Chemistry
16) Nucleic Acids Research
17) Biochemistry
18) Molecular Cell / Science Direct
19) Journal of Biological Chemistry
20) Nature
21) Molecular Cell / Science Direct
22) Cell / Science Direct
23) Nucleic Acids Research
24) Gene / Science Direct
25) Journal of the American Society for Mass Spectrometry
Ranking Results
Articles with scores greater than 30 had the potential to provide at least one site
As scores approached 30, articles became less fruitful
Dataset Results
Dataset included 1442 total sites and 1085 non-redundant sites
  HPRD contributed 90 total sites
  Swiss-Prot contributed 825
  Our study contributed 527
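Sensitivity, Specificity, and Precision
Sensitivity (sn) = TP / (TP + FN)
Specificity (sp) = TN / (TN + FP)
Precision (pr) = TP / (TP + FP)
TP, TN, FP, and FN are the counts of true positives, true negatives, false positives, and false negatives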
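These three measures can be computed from binary labels and predictions as in the sketch below, which uses the standard confusion-matrix definitions. The function names and the example labels are illustrative only and are not results from the study.

```python
def confusion_counts(labels, predictions):
    """Return (tp, tn, fp, fn) for binary labels and predictions (1 = acetylated)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    return tp, tn, fp, fn


def sensitivity(tp, tn, fp, fn):
    """Fraction of true sites that were predicted as sites."""
    return tp / (tp + fn)


def specificity(tp, tn, fp, fn):
    """Fraction of non-sites that were predicted as non-sites."""
    return tn / (tn + fp)


def precision(tp, tn, fp, fn):
    """Fraction of predicted sites that are true sites."""
    return tp / (tp + fp)


# Made-up example: two acetylated and three non-acetylated lysines.
labels = [1, 1, 0, 0, 0]
predictions = [1, 0, 0, 0, 1]
counts = confusion_counts(labels, predictions)
print(sensitivity(*counts), specificity(*counts), precision(*counts))
```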
Accuracy and AUC
Accuracy (acc): (definition missing in source)
Area Under Curve (AUC)
  Refers to the area under the Receiver Operating Characteristic (ROC) curve
  The ROC curve is the graphical plot of sensitivity vs. 1 - specificity
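The AUC can be computed without drawing the curve, because it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. The sketch below uses that pairwise formulation; the function name and the example scores are illustrative and not taken from the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up predictor outputs for two positives and two negatives.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(auc)
```

The pairwise loop is quadratic in the number of examples; for large datasets a rank-based computation gives the same value more efficiently.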
SVM Predictor
Polynomial kernel, by degree p. Table columns: sn, sp, pr, acc, AUC (values missing in source)
Gaussian kernel, by σ. Table columns: sn, sp, pr, acc, AUC (values missing in source)
Artificial Neural Network
Table columns: Hidden Neurons, sn, sp, pr, acc, AUC (values missing in source)
Decision Tree
Table columns: Algorithm, sn, sp, pr, acc, AUC (values missing in source)
Algorithm Comparison
Table columns: Algorithm, sn, sp, pr, acc, AUC (values missing in source)
Rows: SVM, Neural Network, Decision Tree
I would like to acknowledge those who have helped me throughout this project: Dr. Predrag Radivojac, Dr. Haixu Tang, and Wyatt Clark.
I welcome your questions and/or comments