Associating Biomedical Terms: Case Study for Acetylation
Aaron Buechlein
Indiana University School of Informatics
Advisor: Dr. Predrag Radivojac


Overview
Background
Previous Work
Methods
Results

Central Dogma

Post-Translational Modifications (PTMs)

Acetylation
Acetylation involves the substitution of an acetyl group (-COCH3) for hydrogen
Typically occurs on N-terminal tails and lysine residues (Lys or K)

Previous Predictors
Several PTM predictors were created prior to this work
Acetylation predictors also existed prior to this work
  NetAcet is a predictor for N-terminal sites only
  AutoMotif Server is a predictor for various PTMs and includes an acetylation portion
  PAIL is a lysine acetylation predictor

Methods
Create Dataset
  Download articles relevant to acetylation and extract sites
  Rank articles in order to elucidate sites quickly
  Swiss-Prot and Human Protein Reference Database (HPRD)
Create Predictors
  Leave-one-protein-out validation
  Matlab
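Leave-one-protein-out validation can be sketched as follows. The slides state that Matlab was used; this Python version is only an illustration, and the sample layout (a list of `(protein_id, features, label)` tuples) and the function name `leave_one_protein_out` are assumptions, not taken from the original work.

```python
from collections import defaultdict


def leave_one_protein_out(samples):
    """Yield (held_out_protein, train_indices, test_indices) splits.

    samples is assumed to be a list of (protein_id, features, label) tuples.
    All lysines of one protein are held out together, so that no protein
    contributes sites to both the training and the test set.
    """
    by_protein = defaultdict(list)
    for idx, (protein_id, _features, _label) in enumerate(samples):
        by_protein[protein_id].append(idx)

    for protein_id, test_idx in by_protein.items():
        held_out = set(test_idx)
        train_idx = [i for i in range(len(samples)) if i not in held_out]
        yield protein_id, train_idx, test_idx
```

Each split trains on every protein except one and tests on the lysines of the held-out protein; predictions from all splits are then pooled to compute the performance measures.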

Article Retrieval
Searched individual journal sites for articles relevant to acetylation
Saved the resulting HTML pages for each journal
These pages were then used as input for a web crawler that downloaded the articles
Because journal sites are constructed differently, each journal required its own regular expression to extract article links
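The per-journal link extraction might look like the sketch below. The actual regular expressions are not given in the slides, so the journal keys and patterns in `LINK_PATTERNS` are hypothetical placeholders, as is the function name.

```python
import re

# Hypothetical patterns: the real per-journal expressions are not in the slides.
LINK_PATTERNS = {
    "journal_a": re.compile(r'href="(/content/[^"]+\.full)"'),
    "journal_b": re.compile(r'href="(https?://[^"]+/article/[^"]+)"'),
}


def extract_article_links(journal, html):
    """Return article links found in a saved search-results page.

    Links are returned in order of first appearance, without duplicates.
    """
    pattern = LINK_PATTERNS[journal]
    seen = set()
    links = []
    for match in pattern.finditer(html):
        url = match.group(1)
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links
```

Keeping one pattern per journal in a lookup table means a new journal site only needs a new table entry; the resulting links would then be handed to the crawler for download.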

Rank Articles
First locate occurrences of first phrase: “phrase 1”
  A = {a_1, a_2, …, a_|A|}
Next locate occurrences of second phrase: “phrase 2”
  R = {r_1, r_2, …, r_|R|}
c and d are constants
x is the distance in characters between r and the nearest word a

An example: acetylation
1. word “acetylat”
   A = {a_1, a_2, …, a_m}
2. regular expression (k | lys | lysine)(space)*(digit)+
   R = {r_1, r_2, …, r_n}

An example: acetylation
Score for article S: (formula not preserved in this transcript)
Papers with S > 100 are rich in sites; papers with S < 30 fall in the “twilight” zone
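The distance x used in the ranking can be sketched as follows. The scoring formula with the constants c and d is not preserved in this transcript, so the sketch deliberately leaves the per-site contribution as a caller-supplied function rather than guessing it. Measuring x between match start offsets, matching case-insensitively, and returning no distances when "acetylat" never occurs are all assumptions.

```python
import bisect
import re

# Phrase 1: the word stem "acetylat"; phrase 2: the slide's lysine-site pattern.
A_PATTERN = re.compile(r"acetylat", re.IGNORECASE)
R_PATTERN = re.compile(r"(?:k|lys|lysine)\s*\d+", re.IGNORECASE)


def nearest_distances(text):
    """For each site mention r, return the character distance x to the nearest a."""
    a_positions = [m.start() for m in A_PATTERN.finditer(text)]
    if not a_positions:
        return []
    distances = []
    for match in R_PATTERN.finditer(text):
        r = match.start()
        i = bisect.bisect_left(a_positions, r)
        # Only the occurrences just before and just after r can be nearest.
        candidates = a_positions[max(i - 1, 0):i + 1]
        distances.append(min(abs(r - a) for a in candidates))
    return distances


def score_article(text, contribution):
    """Sum a caller-supplied contribution(x) over all site mentions."""
    return sum(contribution(x) for x in nearest_distances(text))
```

For instance, `score_article(text, lambda x: 1)` simply counts site mentions; the original work would substitute its own function of x, c, and d.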

Elucidate Sites
Sites were manually extracted from articles, beginning with the highest rank
  The original experimental paper for each site was checked for traceable evidence
Sites were extracted from Swiss-Prot
Sites were extracted from HPRD

Predictors
Support Vector Machine
Artificial Neural Network
Decision Tree

Predictor Input
Positives taken as all lysines found to be acetylated
Negatives taken as all lysines not found to be acetylated
Features created based on characteristics surrounding lysines
  Amino acid content, hydrophobicity, charge, disorder, etc.
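A minimal sketch of how such inputs could be built is shown below. The window half-width, the simple +1/-1 charge model, the feature names, and the function names are assumptions for illustration; the slides do not specify how the features were computed, and hydrophobicity and disorder are omitted here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Simplified charge model (assumption): K, R positive; D, E negative.
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def lysine_windows(sequence, half_width=3):
    """Yield (1-based position, window) for every lysine in the sequence."""
    for i, residue in enumerate(sequence):
        if residue == "K":
            start = max(i - half_width, 0)
            yield i + 1, sequence[start:i + half_width + 1]


def window_features(window):
    """Amino acid content and net charge of the residues around a lysine."""
    n = len(window)
    features = {f"frac_{aa}": window.count(aa) / n for aa in AMINO_ACIDS}
    features["net_charge"] = sum(CHARGE.get(aa, 0) for aa in window)
    return features


def build_examples(sequence, acetylated_positions, half_width=3):
    """Label each lysine 1 if it is a known acetylation site, otherwise 0."""
    examples = []
    for pos, window in lysine_windows(sequence, half_width):
        label = 1 if pos in acetylated_positions else 0
        examples.append((pos, window_features(window), label))
    return examples
```

Windows are truncated at the sequence ends, so lysines near a terminus get shorter windows; padding them instead would be an equally reasonable choice.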

Predictor Input
Figure labels: Protein, Features, Acetylated

Article and Ranking Results
4888 articles from 10 sites were searched
  Nature provided 2147 articles
  Science Direct provided 1519 articles
The highest ranking article was obtained from the Journal of Biological Chemistry
  Score of [value missing]
  Contained 10 acetylation sites
When histones are excluded, the highest ranking article was obtained from Nature
  Previously ranked at #5, score of [value missing]
  Contained 9 unique acetylation sites

Top 25
Columns: Rank, Score, Sites, Article Source (Score and Sites values not preserved)
1) Journal of Biological Chemistry
2) Cell / Science Direct
3) Nature
4) Journal of Proteome Research
5) Nature
6) Biochemistry
7) Cell / Science Direct
8) Nature
9) Molecular Cell / Science Direct
10) Journal of Biological Chemistry
11) Biochemistry
12) Journal of Biological Chemistry
13) International Journal of Mass Spectrometry / Science Direct
14) Biochemistry
15) Journal of Biological Chemistry
16) Nucleic Acids Research
17) Biochemistry
18) Molecular Cell / Science Direct
19) Journal of Biological Chemistry
20) Nature
21) Molecular Cell / Science Direct
22) Cell / Science Direct
23) Nucleic Acids Research
24) Gene / Science Direct
25) Journal of the American Society for Mass Spectrometry

Ranking Results
Articles with scores greater than 30 had potential for providing at least one site
As scores approached 30, articles became less fruitful

Dataset Results
Dataset included 1442 total sites and 1085 non-redundant sites
  HPRD contributed 90 total sites
  Swiss-Prot contributed 825
  Our study contributed 527
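The difference between total and non-redundant sites arises when the same site is reported by more than one source. A minimal sketch of such a merge is given below, assuming a site is identified by a `(protein_id, position)` pair; that key and the function name are illustrative assumptions, not details from the slides.

```python
def merge_sites(*sources):
    """Merge site lists from several sources into a non-redundant collection.

    Each source is a (source_name, sites) pair, where sites is an iterable of
    (protein_id, position) tuples. Returns a dict mapping every distinct site
    to the list of sources that reported it.
    """
    merged = {}
    for name, sites in sources:
        for site in sites:
            merged.setdefault(site, []).append(name)
    return merged
```

The number of keys in the result is the non-redundant count, while the summed lengths of the input lists give the total count.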

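Sensitivity, Specificity, and Precision
TP, FP, TN, FN: counts of true positives, false positives, true negatives, and false negatives
Sensitivity (sn) = TP / (TP + FN)
Specificity (sp) = TN / (TN + FP)
Precision (pr) = TP / (TP + FP)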

Accuracy and AUC
Accuracy (acc)
Area Under Curve (AUC)
  Refers to the area under the Receiver Operating Characteristic (ROC) curve
  The ROC curve is the graphical plot of sensitivity vs. 1 - specificity
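A minimal sketch of these performance measures follows, using the conventional confusion-matrix definitions. The accuracy formula on the original slide is not preserved, so plain accuracy, (TP + TN) / total, is an assumption here; a balanced variant such as (sn + sp) / 2 is also common for imbalanced data. AUC is computed through its rank interpretation: the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. Both classes are assumed to be present.

```python
def classification_metrics(labels, predictions):
    """Return sn, sp, pr, and acc from binary labels and binary predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    return {
        "sn": tp / (tp + fn),
        "sp": tn / (tn + fp),
        "pr": tp / (tp + fp),
        "acc": (tp + tn) / len(pairs),  # plain accuracy (assumption)
    }


def auc(labels, scores):
    """Area under the ROC curve via pairwise positive/negative comparisons."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the number of examples, which is fine for a sketch; a sort-based computation would be preferable for large datasets.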

SVM Predictor
Polynomial kernel, three settings of degree p: sn, sp, pr, acc, AUC
Gaussian kernel, three settings of σ: sn, sp, pr, acc, AUC
(table values not preserved)

Artificial Neural Network
Table columns: Hidden Neurons, sn, sp, pr, acc, AUC (values not preserved)

Decision Tree
Table columns: Algorithm, sn, sp, pr, acc, AUC (values not preserved)
Row: Decision Tree

Algorithm Comparison
Table columns: Algorithm, sn, sp, pr, acc, AUC (values not preserved)
Rows: SVM, Neural Network, Decision Tree

I would like to acknowledge those who have helped me throughout this project: Dr. Predrag Radivojac, Dr. Haixu Tang, and Wyatt Clark.

I welcome your questions and/or comments
