IntEx: A Syntactic Role Driven Protein-Protein Interaction Extractor for Bio-Medical Text
Syed Toufeeq Ahmed, Deepthi Chidambaram, Hasan Davulcu, Chitta Baral



Outline
- Introduction
- Issues and Challenges
- Our Approach (IntEx System)
- Evaluation
- Future Work
- Conclusion
- Demo

Introduction
- Genomic research in the last decade has produced an enormous amount of data, and most of these findings are in the form of free text.
- PubMed/MEDLINE has around 12 million abstracts online.
- An automated tool to extract information from free (biomedical) text would be of great use to researchers (biologists).

Issues that make extraction difficult (Seymore, McCallum et al., 1999)
- The task involves free text, hence there are many ways of stating the same fact.
- The genre of text is not grammatically simple.
- The text includes a lot of technical terminology unfamiliar to existing natural language processing systems.
- Information may need to be combined across several sentences.
- There are many sentences from which nothing should be extracted.

Challenges
Interactions are specified in different ways:
1. HMBA inhibits MEC-1 cell proliferation.
2. GBMs commonly overexpress the oncogenes EGFR and PDGFR, and contain mutations and deletions of tumor suppressor genes PTEN and TP53.
3. Protein kinase B (PKB) has emerged as the focal point for many signal transduction pathways, regulating multiple cellular processes such as glucose metabolism, transcription, apoptosis, cell proliferation, angiogenesis, and cell motility.

Challenges (cont.)
Anaphora resolution
- Pronominals: “It activates HMBA”.
- Sortal anaphora: “Both enzymes are phosphorylated”.
- Event anaphora: “This reaction acts in a mediated environment.”
Multiple interactions in complex sentences
- Most of the tumor-suppressive properties of Pten are dependent on its lipid phosphatase activity, which inhibits the phosphatidylinositol-3'-kinase (PI3K)/Akt signaling pathway through dephosphorylation of phosphatidylinositol-(3,4,5)-triphosphate.

Our Approach (IntEx System)
- Identify syntactic roles, such as Subject, Object, Verb and modifiers of a sentence.
- Using these syntactic roles, transform complex sentences into multiple simple clauses.
- Extract protein-protein interactions from these simple clausal structures.
- Simple pronoun resolution to identify references across multiple sentences.

IntEx System Architecture

IntEx System Components
- Pronoun Resolution
- Tagging: tagging biological entities with the help of biomedical and linguistic gazetteers.
- Complex Sentence Processing: splitting complex sentences into simple clausal structures made up of syntactic roles.
- Interaction Extractor: extracting complete interactions by analyzing the matching contents of syntactic roles and their linguistically significant combinations.

Pronoun Resolution
Before: Ku loads onto dsDNA ends and it can diffuse along the DNA in an energy-independent manner.
After: Ku loads onto dsDNA ends and Ku can diffuse along the DNA in an energy-independent manner.
- Pronouns in abstracts are third person: it, itself, them, themselves.
- Replace each pronoun with the first noun group that matches it in person/number agreement.
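The replacement rule on this slide can be sketched roughly as follows. This is a minimal illustration, not the IntEx implementation: it assumes the candidate noun groups have already been found by an upstream chunker and are passed in order of appearance, and the plural test (`is_plural`) is a crude, hypothetical suffix heuristic.

```python
import re
from typing import List

# Third-person pronouns named on the slide, split by grammatical number.
SINGULAR = {"it", "itself"}
PLURAL = {"them", "themselves"}
PRONOUN_PATTERN = re.compile(r"\b(?:it|itself|them|themselves)\b", re.IGNORECASE)


def is_plural(noun_group: str) -> bool:
    """Crude number check (an assumption): the head word ends in a single 's'."""
    head = noun_group.split()[-1]
    return len(head) > 1 and head.endswith("s") and not head.endswith("ss")


def resolve_pronouns(sentence: str, noun_groups: List[str]) -> str:
    """Replace each pronoun with the first noun group that agrees in number."""
    def replace(match: "re.Match") -> str:
        pronoun = match.group(0)
        want_plural = pronoun.lower() in PLURAL
        for group in noun_groups:
            if is_plural(group) == want_plural:
                return group
        return pronoun  # no agreeing antecedent: leave the pronoun in place

    return PRONOUN_PATTERN.sub(replace, sentence)


sentence = ("Ku loads onto dsDNA ends and it can diffuse along the DNA "
            "in an energy-independent manner.")
resolved = resolve_pronouns(sentence, ["Ku", "dsDNA ends"])
```

Here `resolved` reproduces the rewritten sentence from the slide, because "Ku" is the first singular noun group.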

Tagging
- Dictionary lookup using gene/protein gazetteers from UMLS, LocusLink, etc.
- To tag new gene names, we used regular expressions (alphanumeric names, combinations of lowercase and uppercase characters, etc.).
- Some heuristics, such as using proper nouns and NP chunking, improve recall.
- The ‘interaction word’ list is derived from UMLS and WordNet.
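A gazetteer lookup backed by regular expressions might look like the sketch below. The tiny `GAZETTEER` and `INTERACTION_WORDS` sets are hypothetical stand-ins for the UMLS, LocusLink and WordNet resources, and the two patterns are only one plausible reading of "alphanumeric names" and "mixed-case names".

```python
import re
from typing import List, Tuple

# Hypothetical stand-ins for the real gazetteers and interaction word list.
GAZETTEER = {"HMBA", "PCNA", "EGFR"}
INTERACTION_WORDS = {"inhibit", "inhibits", "binds", "activates"}

# Assumed patterns: a name mixing letters and digits, or an internal capital.
ALPHANUMERIC = re.compile(r"^(?=.*[A-Za-z])(?=.*\d)[A-Za-z0-9-]+$")
MIXED_CASE = re.compile(r"[a-z][A-Z]")


def tag_token(token: str) -> str:
    """Return GENE, INTERACTION or O (outside) for a single token."""
    if token in GAZETTEER:
        return "GENE"                      # dictionary lookup comes first
    if token.lower() in INTERACTION_WORDS:
        return "INTERACTION"
    if ALPHANUMERIC.match(token) or MIXED_CASE.search(token):
        return "GENE"                      # regex fallback for unseen names
    return "O"


def tag(tokens: List[str]) -> List[Tuple[str, str]]:
    return [(token, tag_token(token)) for token in tokens]


tagged = tag("HMBA inhibits MEC-1 cell proliferation".split())
```

The regex fallback also tags "MEC-1", which is a cell line rather than a gene; this kind of over-generation is one reason the slides list protein name tagging as needing improvement.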

Complex Sentence Structures
- Independent clauses with connectives
- Many dependent clauses with one independent clause, with or without connectives
- Multiple agents and goals in a single clause
Examples:
- Gene14 binds to Gene15 in response to Gene16 or methylmethanesulfonate; this interaction does not require Gene17.
- Gene57 is blocked by Gene61, which binds to Gene62.
- Gene96 or Gene97 competes with Gene98 for binding to Gene99 and Gene100 or Gene101 stimulates Gene102 in vitro in the absence of Gene105.

Complex Sentence Processing
Complex sentence: Upon growth factor stimulation of quiescent cells, Gene100 declines late in Gene101 and Gene102 is replaced by Gene103, which is absent in quiescent cells.
Simple clauses:
- Upon growth factor stimulation of quiescent cells, Gene100 declines late in Gene101.
- Gene102 is replaced by Gene103.
- Gene103 is absent in quiescent cells.

Complex Sentence Processing
- Verb-based approach.
- Identify clauses in complex sentences using Link Grammar linkages.
- Build simple clause sentences from them (one for each main verb) in the following clause format: Subject | Verb | Object | Modifying phrase
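The clause format can be captured in a small record type. The sketch below is illustrative only: the `Clause` class and its field names are assumptions, and the example values come from the HMBA sentence used elsewhere in these slides.

```python
from dataclasses import dataclass


@dataclass
class Clause:
    """One simple clause, built around a single main verb."""
    subject: str
    verb: str
    obj: str
    modifier: str = ""

    def to_line(self) -> str:
        # Serialize as: Subject | Verb | Object | Modifying phrase
        return " | ".join([self.subject, self.verb, self.obj, self.modifier])


clause = Clause(
    subject="HMBA",
    verb="inhibit",
    obj="the MEC-1 cell proliferation",
    modifier="by down-regulation of PCNA expression",
)
line = clause.to_line()
```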

Link Grammar Parser (Sleator and Temperley, 1993)
Sentence: “The cat chased a snake”
Link Grammar Representation:
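A linkage for this sentence can be approximated as a list of labeled links between word pairs. The sketch below uses simplified labels (D for determiner-noun, S for subject-verb, O for verb-object); real parser output carries subscripted labels and a left-wall link, and the tuple layout and `roles_from_links` helper are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# (label, left word, right word), simplified from a Link Grammar linkage.
LINKS: List[Tuple[str, str, str]] = [
    ("D", "The", "cat"),
    ("S", "cat", "chased"),
    ("O", "chased", "snake"),
    ("D", "a", "snake"),
]


def roles_from_links(links: List[Tuple[str, str, str]]) -> Dict[str, str]:
    """Read subject, verb and object head words off the S and O links."""
    roles: Dict[str, str] = {}
    for label, left, right in links:
        if label == "S":        # S links a subject noun to its finite verb
            roles["subject"] = left
            roles["verb"] = right
        elif label == "O":      # O links a transitive verb to its object
            roles["object"] = right
    return roles


roles = roles_from_links(LINKS)
```

This is the sense in which syntactic roles fall out of the linkage: the S and O links identify the head words, and the D links attach their determiners.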

Interaction Extractor: Role Type Matching
Various syntactic roles (such as Subject, Object and Modifying phrase) and their linguistically significant combinations make up roles.
Role Type | Description
Elementary | If the role contains a Protein name or an interaction word.
Partial | If the role has a Protein name and an interaction word.
Complete | If the role has at least two Protein names and an interaction word.
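The three role types translate directly into a counting rule. In this sketch, each role is assumed to arrive as a list of token tags, with "GENE" marking protein names and "INTERACTION" marking interaction words; those tag names are hypothetical.

```python
from typing import List, Optional


def classify_role(tags: List[str]) -> Optional[str]:
    """Classify a syntactic role from the tags of the tokens it contains."""
    proteins = tags.count("GENE")
    interactions = tags.count("INTERACTION")
    # Test the most specific type first, since Complete implies Partial
    # and Partial implies Elementary.
    if proteins >= 2 and interactions >= 1:
        return "Complete"
    if proteins >= 1 and interactions >= 1:
        return "Partial"
    if proteins >= 1 or interactions >= 1:
        return "Elementary"
    return None  # the role carries nothing relevant to extraction


subject_type = classify_role(["GENE"])
modifier_type_example = classify_role(["O", "INTERACTION", "O", "GENE", "O"])
```

The order of the checks matters: testing Elementary first would label every Partial and Complete role as Elementary.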

Roles: Examples
“HMBA could inhibit the MEC-1 cell proliferation by down-regulation of PCNA expression.”
- Elementary (Subject): HMBA
- Interaction (Verb): inhibit
- Elementary (Object): the MEC-1 cell proliferation
- Partial (Modifying Phrase): by down-regulation of PCNA expression

Interaction Extractor Algorithm
- Complete (G, I, G) → Interaction: {G, I, G}
- Elementary (G1) and Elementary (G2), where the main verb is an interaction (I) → Interaction: {G1, I, G2}
- Elementary (G1) and Partial (I, G2) → Interaction: {G1, I, G2}

Interaction Extractor Example
“HMBA could inhibit the MEC-1 cell proliferation by down-regulation of PCNA expression.”
- Elementary roles with the main verb: { “HMBA”, “inhibit”, “the MEC-1 cell proliferation” }
- Elementary role with a Partial role: { “HMBA”, “down-regulation”, “PCNA expression” }
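The extraction rules and this example can be tied together in a short sketch. It is one reading of the flowchart, not the IntEx code: the `Role` tuple, the `extract` function and the interaction word set are all hypothetical, and roles are assumed to be classified already.

```python
from typing import List, NamedTuple, Optional, Set, Tuple

Triple = Tuple[str, str, str]


class Role(NamedTuple):
    kind: str                          # "Elementary", "Partial" or "Complete"
    entities: Tuple[str, ...]          # entity mentions found in the role
    interaction: Optional[str] = None  # interaction word inside the role


def extract(subject: Optional[Role], main_verb: str, obj: Optional[Role],
            modifier: Optional[Role], interaction_words: Set[str]) -> List[Triple]:
    triples: List[Triple] = []
    # Rule 1: a Complete role already holds a whole interaction.
    for role in (subject, obj, modifier):
        if role and role.kind == "Complete":
            triples.append((role.entities[0], role.interaction, role.entities[1]))
    if subject and subject.kind == "Elementary" and subject.entities:
        agent = subject.entities[0]
        # Rule 2: Elementary subject and object joined by an interaction verb.
        if (obj and obj.kind == "Elementary" and obj.entities
                and main_verb in interaction_words):
            triples.append((agent, main_verb, obj.entities[0]))
        # Rule 3: Elementary subject combined with a Partial role.
        for role in (obj, modifier):
            if role and role.kind == "Partial":
                triples.append((agent, role.interaction, role.entities[0]))
    return triples


interactions = extract(
    subject=Role("Elementary", ("HMBA",)),
    main_verb="inhibit",
    obj=Role("Elementary", ("the MEC-1 cell proliferation",)),
    modifier=Role("Partial", ("PCNA expression",), "down-regulation"),
    interaction_words={"inhibit", "down-regulation"},
)
```

For the HMBA sentence this yields the two interactions shown on the slide, one from the main verb and one from the modifying phrase.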

A Detailed Overall Example

Evaluation (Recall comparison with BioRAT)
Recall of IntEx and BioRAT on 229 abstracts, compared against the DIP database. DIP (Database of Interacting Proteins) is a database of proteins that interact, curated from both abstracts and full text.
Recall Results
Columns: IntEx (Cases, Percent %), BioRAT (Cases, Percent %)
Rows: Match, No Match, Totals

Evaluation (Precision comparison with BioRAT)
Precision comparison of IntEx and BioRAT on 229 abstracts.
Precision Results
Columns: IntEx (Cases, Percent %), BioRAT (Cases, Percent %)
Rows: Correct, Incorrect, Totals
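Assuming the Totals row in each table is the sum of the two rows above it, the percentages reported in the recall and precision tables presumably follow the usual definitions, written here in terms of the row labels:

```latex
\[
\text{Recall} = \frac{\text{Match}}{\text{Match} + \text{No Match}},
\qquad
\text{Precision} = \frac{\text{Correct}}{\text{Correct} + \text{Incorrect}}
\]
```

Under this reading, recall is measured against interactions recorded in DIP, while precision is measured over the interactions each system extracted.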

Error Analysis

Future Work in Interaction Extraction
- Handling negations in sentences (such as “not interact”, “fails to induce”, “does not inhibit”).
- Extraction of detailed contextual attributes of interactions (such as biochemical context or location) by interpreting modifiers:
  - Location/Position modifiers (in, at, on, into, up, over…)
  - Agent/Accompaniment modifiers (by, with…)
  - Purpose modifiers (for…)
  - Theme/Association modifiers (of…)
- Extraction of relationships between interactions from among multiple sentences within and across abstracts/full-text articles (protein interaction pathways).
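The modifier categories above amount to a lookup keyed on the leading preposition of a modifying phrase. The sketch below only illustrates that proposed idea: it uses just the prepositions listed on the slide, and the `modifier_type` helper is hypothetical rather than part of IntEx.

```python
from typing import Dict, Optional, Set

# Prepositions exactly as listed on the slide; the lists there are open-ended.
MODIFIER_TYPES: Dict[str, Set[str]] = {
    "Location/Position": {"in", "at", "on", "into", "up", "over"},
    "Agent/Accompaniment": {"by", "with"},
    "Purpose": {"for"},
    "Theme/Association": {"of"},
}


def modifier_type(phrase: str) -> Optional[str]:
    """Classify a modifying phrase by its first word, if it is a known preposition."""
    words = phrase.strip().lower().split()
    if not words:
        return None
    for name, prepositions in MODIFIER_TYPES.items():
        if words[0] in prepositions:
            return name
    return None


kind = modifier_type("by down-regulation of PCNA expression")
```

Only the first word is inspected, so the nested "of PCNA expression" in this phrase is not classified separately.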

A bigger future: combining automated extraction with mass collaboration
- ‘Curation’ is expensive.
- Automated extraction: miles to go.
- Vision: automated extraction with mass curation.
- The CBioC system:

Future Work: Visualization

Conclusion
- Verb-based approach to extracting protein-protein interactions.
- Handles complex sentences.
- Easy to scale up and to use in other domains (we are working on applying it to other domains).
- Protein name tagging needs improvement, and we are working on other methods.
- The first release version is almost ready for both Windows and Linux platforms.

References
- Link Grammar:
- LocusLink (now Entrez Gene):
- UMLS:

References (cont.)
- Blaschke, C., M. A. Andrade, et al. (1999). "Automatic extraction of biological information from scientific text: Protein-protein interactions." Proceedings of International Symposium on Molecular Biology:
- Corney, D. P. A., B. F. Buxton, et al. (2004). "BioRAT: extracting biological information from full-length papers." Bioinformatics 20(17):
- Friedman, C., P. Kra, et al. (2001). "GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles." Proceedings of the International Conference on Intelligent Systems for Molecular Biology:
- Rzhetsky, A., I. Iossifov, et al. (2004). "GeneWays: a system for extracting, analyzing, visualizing, and integrating molecular pathway data." J. of Biomedical Informatics 37(1):
- Seymore, K., A. McCallum, et al. (1999). "Learning hidden Markov model structure for information extraction." AAAI 99 Workshop on Machine Learning for Information Extraction.
- Sleator, D. and D. Temperley (1993). "Parsing English with a Link Grammar." Third International Workshop on Parsing Technologies.

Demo

Thank you !