Two Applications of Information Extraction to Biological Science Journal Articles: Enzyme Interactions and Protein Structures. Kevin Humphreys, George Demetriou, & Robert Gaizauskas

Presentation transcript:

1 Two Applications of Information Extraction to Biological Science Journal Articles: Enzyme Interactions and Protein Structures Kevin Humphreys, George Demetriou, & Robert Gaizauskas Department of Computer Science, University of Sheffield (Pacific Symposium on Biocomputing, Vol 5, Pages , 2000)

2 Abstract
This paper considers the application of information extraction (IE) technology to the extraction of information from scientific journal papers in the area of molecular biology. It covers two bioinformatics applications: EMPathIE, concerned with enzyme and metabolic pathways, and PASTA, concerned with protein structure.

3 1. Introduction
The prototypical IE tasks are those defined by the U.S. DARPA Message Understanding Conferences (MUCs), which require filling a complex template from newswire texts on subjects such as joint venture announcements, management succession events, or rocket launchings. This paper describes the use of the technology developed through the MUC evaluations in two bioinformatics applications.

4 2. IE Technology
MUC-7 specified five separate component tasks:
– Named Entity recognition: organizations, persons, locations, dates and monetary amounts.
– Coreference resolution: identifying expressions that refer to the same object, set or activity.
– Template Element filling: filling small-scale templates for specified classes of entity in the texts.
– Template Relation filling: filling a two-slot template that represents a binary relation, with pointers to the related template elements.
– Scenario Template filling: detecting and constructing relations between template elements as participants in a particular type of event, or scenario.

5 3. Two Bioinformatics Applications of IE (1/2)
EMPathIE
– Enzyme and Metabolic Pathways Information Extraction.
– Aims to extract details of enzyme reactions from articles in the journals Biochimica et Biophysica Acta and FEMS Microbiology Letters.
– Journal articles in this domain typically describe details of a single enzyme reaction, often with little indication of related reactions or of the pathways the reaction may be part of.
=> Details from several articles must be combined to identify pathways.

6 3. Two Bioinformatics Applications of IE (2/2)
PASTA
– Protein Active Site Template Acquisition.
– Aims to extract information concerning the roles of amino acids in protein molecules, and to create a database of protein active sites from both scientific journal abstracts and full articles.
– New protein structures are being reported at very high rates, and the number of co-ordinate sets in the Protein Data Bank (PDB), currently about 9000, can be expected to increase ten-fold in the next five years.
– Computational methods would be very useful to biologists in comparison/classification work and to those engaged in modeling studies.

7 3.1 EMPathIE (1/2)
The EMP database contains over 20,000 records of enzyme reactions, collected from journal articles published since
=> Provides training data.
Template definitions:
– Three Template Elements: enzyme, organism and compound.
– A single Template Relation: source, relating enzyme and organism elements.
– A Scenario Template for the specific metabolic pathway task.
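The template definitions above can be pictured as simple record types. The following is a minimal sketch only: the class names, the slot names of `PathwayScenario`, and the placeholder organism are illustrative assumptions, not the actual EMPathIE template specification.

```python
from dataclasses import dataclass, field
from typing import List

ELEMENT_KINDS = {"enzyme", "organism", "compound"}


@dataclass
class TemplateElement:
    """One of the three Template Elements: enzyme, organism or compound."""
    kind: str
    name: str

    def __post_init__(self) -> None:
        if self.kind not in ELEMENT_KINDS:
            raise ValueError(f"unknown element kind: {self.kind}")


@dataclass
class SourceRelation:
    """The single Template Relation: an enzyme and its source organism."""
    enzyme: TemplateElement
    organism: TemplateElement

    def __post_init__(self) -> None:
        # The two slots are pointers to elements of fixed kinds.
        if self.enzyme.kind != "enzyme" or self.organism.kind != "organism":
            raise ValueError("source relates an enzyme to an organism")


@dataclass
class PathwayScenario:
    """Hypothetical Scenario Template; the slot names are assumptions."""
    pathway: str
    sources: List[SourceRelation] = field(default_factory=list)
    compounds: List[TemplateElement] = field(default_factory=list)


# Illustrative values only; the organism is a placeholder.
enzyme = TemplateElement("enzyme", "isocitrate lyase")
organism = TemplateElement("organism", "example organism")
scenario = PathwayScenario(
    pathway="glyoxylate cycle",
    sources=[SourceRelation(enzyme, organism)],
    compounds=[TemplateElement("compound", "isocitrate")],
)
```

Keeping the relation as a two-slot record that points at existing elements mirrors the MUC-style split between Template Elements, Template Relations and Scenario Templates.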

8 3.1 EMPathIE (2/2)
A manually produced sample Scenario Template, taken from an article on ‘isocitrate lyase activity’ in FEMS Microbiology Letters. (Glyoxylate cycle)

9 3.2 PASTA (1/3)
The entities to be extracted:
– proteins
– amino acid residues
– species
– types of structural characteristics: secondary structure, quaternary structure
– active sites
– other (probably less important) regions
– chains
– interactions: hydrogen bonds, disulphide bonds, etc.

10 3.2 PASTA (2/3)

11 3.2 PASTA (3/3)

12 4. EMPathIE and PASTA (1/2)
Both IE systems are derived from LaSIE, a general-purpose IE system under development at Sheffield since
The processing modules:

13 4. EMPathIE and PASTA (2/2)
Both systems have a pipeline architecture consisting of four principal stages:
– Text preprocessing: SGML/structure analysis, tokenisation
– Lexical and terminological processing: terminology lexicons, morphological analysis, terminology grammars
– Parsing and semantic interpretation: sentence boundary detection, part-of-speech tagging, phrase grammars, semantic interpretation
– Discourse interpretation: coreference resolution, domain modeling
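The four-stage pipeline can be sketched as a chain of functions that each enrich a shared document record. This is a toy illustration under stated assumptions: the function names and the dictionary keys are hypothetical, and each stage body is a trivial stand-in for the real LaSIE-derived module.

```python
from typing import Any, Callable, Dict, List

Document = Dict[str, Any]
Stage = Callable[[Document], Document]


def text_preprocessing(doc: Document) -> Document:
    # Stand-in for SGML/structure analysis and tokenisation.
    doc["tokens"] = doc["text"].split()
    return doc


def lexical_processing(doc: Document) -> Document:
    # Stand-in for lexicon lookup and morphological analysis.
    doc["terms"] = [t for t in doc["tokens"] if t.lower().endswith("ase")]
    return doc


def parsing_and_semantics(doc: Document) -> Document:
    # Stand-in for tagging, phrase grammars and semantic interpretation.
    doc["semantics"] = [("enzyme", term) for term in doc["terms"]]
    return doc


def discourse_interpretation(doc: Document) -> Document:
    # Stand-in for coreference resolution: drop duplicate mentions.
    doc["instances"] = sorted(set(doc["semantics"]))
    return doc


PIPELINE: List[Stage] = [
    text_preprocessing,
    lexical_processing,
    parsing_and_semantics,
    discourse_interpretation,
]


def run_pipeline(text: str) -> Document:
    """Pass one document through every stage in order."""
    doc: Document = {"text": text}
    for stage in PIPELINE:
        doc = stage(doc)
    return doc


result = run_pipeline("Isocitrate lyase cleaves isocitrate")
```

Because every stage takes and returns the same record, a stage can be replaced or reordered without touching the others, which is the main attraction of a pipeline design.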

14 Text Preprocessing
Both the SGML and sectioniser modules may specify that certain text regions are to be excluded from subsequent processing, which avoids detailed processing of apparently irrelevant text. Tokenisation of the input must identify the tokens within compound names.
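One way to expose the tokens inside compound names is to split on every change between letters, digits and punctuation. The regular expression below is an assumption chosen for illustration, not the tokeniser used in EMPathIE or PASTA.

```python
import re

# Runs of letters, runs of digits, or any single other non-space character.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]")


def tokenise(text: str) -> list:
    """Split text so that the parts of compound names become separate tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenise("mannitol-1-phosphate 5-dehydrogenase")
# tokens == ['mannitol', '-', '1', '-', 'phosphate', '5', '-', 'dehydrogenase']
```

Splitting this finely lets later stages look up `mannitol` and `phosphate` individually, at the cost of leaving the job of reassembling the full name to the terminology grammar.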

15 Lexical and Terminological Preprocessing (1/3)
The main information sources used for terminology identification:
– Case-insensitive terminology lexicons, listing component terms of various categories
– Morphological cues: standard biochemical suffixes
– Hand-constructed grammar rules for each terminology class

16 Lexical and Terminological Preprocessing (2/3)
The enzyme name mannitol-1-phosphate 5-dehydrogenase would be recognized first by classifying mannitol as a potential compound modifier and phosphate as a compound, both through matches in the terminology lexicon. Morphological analysis would then suggest dehydrogenase as a potential enzyme head, because of its suffix -ase. Finally, grammar rules would combine the enzyme head with a known compound and modifier, which together can play the role of enzyme modifier.
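The three steps in this example (lexicon match, suffix cue, grammar rule) can be sketched as follows. The two-entry lexicon, the tag names and the single rule are assumptions made for illustration; the real systems use much larger lexicons and hand-built grammars per terminology class.

```python
import re

# Toy lexicon; matching is case-insensitive via lower-casing.
LEXICON = {"mannitol": "compound_modifier", "phosphate": "compound"}
MODIFIER_TAGS = {"compound_modifier", "compound", "number", "hyphen"}


def tokenise(text: str) -> list:
    return re.findall(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]", text)


def classify(token: str) -> str:
    """Tag one token using the lexicon first, then morphological cues."""
    lower = token.lower()
    if lower in LEXICON:
        return LEXICON[lower]
    if len(lower) > 3 and lower.endswith("ase"):
        return "enzyme_head"  # suffix cue: -ase
    if token.isdigit():
        return "number"
    if token == "-":
        return "hyphen"
    return "other"


def find_enzymes(text: str) -> list:
    """Grammar rule: modifier tokens followed by an enzyme head form an enzyme name."""
    tokens = tokenise(text)
    tags = [classify(t) for t in tokens]
    enzymes = []
    for i, tag in enumerate(tags):
        if tag != "enzyme_head":
            continue
        start = i
        while start > 0 and tags[start - 1] in MODIFIER_TAGS:
            start -= 1
        # Require at least one lexicon-known compound or modifier in the span.
        if any(t in ("compound", "compound_modifier") for t in tags[start:i]):
            enzymes.append(tokens[start:i + 1])
    return enzymes


found = find_enzymes("The enzyme mannitol-1-phosphate 5-dehydrogenase was purified")
```

The suffix test alone would also fire on ordinary words such as "phase", which is why this sketch only accepts a head that is preceded by a term already known to the lexicon.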

17 Lexical and Terminological Preprocessing (3/3)
The biochemical terminology lexicons, assembled from various publicly available resources (e.g. SWISS-PROT), have been structured to distinguish the various term components, which are then assembled by grammar rules. The lexicons currently contain approximately 25,000 component terms in 52 categories.

18 Parsing and Semantic Interpretation
The syntactic processing modules treat any terms recognized in the previous stage as non-decomposable units with the syntactic role of proper noun. The POS tagger only attempts to assign tags to tokens which are not part of proposed terms. The phrasal grammar includes compositional semantic rules, which are used to construct a semantic representation of the ‘best’, possibly partial, parse of each sentence.

19 Discourse Interpretation (1/2)
The discourse interpreter adds the semantic representation of each sentence to a predefined domain model, made up of an ontology, or concept hierarchy, plus inheritable properties and inference rules associated with concepts. The domain model is gradually populated with instances of concepts from the text to become a discourse model. A coreference mechanism attempts to merge each newly introduced instance with an existing one, subject to various syntactic and semantic constraints.
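The merging behaviour described here can be approximated with a small concept hierarchy and a compatibility check. Everything in this sketch is an assumption: the toy ontology, the property-conflict test used as the only semantic constraint, and the first-match merge policy are simplifications, not the LaSIE discourse interpreter.

```python
# Hypothetical concept hierarchy: child concept -> parent concept.
ONTOLOGY = {
    "enzyme": "protein",
    "protein": "object",
    "compound": "object",
    "organism": "object",
}


def ancestors(concept: str) -> list:
    """Return the concept followed by all of its ancestors."""
    chain = [concept]
    while concept in ONTOLOGY:
        concept = ONTOLOGY[concept]
        chain.append(concept)
    return chain


def compatible(a: str, b: str) -> bool:
    # Two concepts are compatible if one subsumes the other.
    return a in ancestors(b) or b in ancestors(a)


def conflicts(old: dict, new: dict) -> bool:
    # A shared property with different values blocks the merge.
    return any(key in old and old[key] != value for key, value in new.items())


class DiscourseModel:
    def __init__(self) -> None:
        self.instances = []

    def add(self, concept: str, **props) -> dict:
        """Merge the new instance into an existing one if constraints allow."""
        for inst in self.instances:
            if compatible(inst["concept"], concept) and not conflicts(inst["props"], props):
                # Keep the more specific of the two concepts.
                if len(ancestors(concept)) > len(ancestors(inst["concept"])):
                    inst["concept"] = concept
                inst["props"].update(props)
                return inst
        inst = {"concept": concept, "props": dict(props)}
        self.instances.append(inst)
        return inst


model = DiscourseModel()
first = model.add("enzyme", name="isocitrate lyase")
mention = model.add("protein")  # e.g. a later mention "the protein"
other = model.add("enzyme", name="malate synthase")
```

Here the unnamed `protein` mention is merged into the existing enzyme instance, while the second enzyme stays separate because its name conflicts with the first.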

20 Discourse Interpretation (2/2)
The template writer module reads off the required information from the final discourse model and formats it as defined in the template specification. An initial domain model for the EMPathIE metabolic pathway task has been constructed manually, directly from the template definition. Subsequent refinement will involve extending the concept subhierarchies and adding coreference constraints on the hypothesised instances, based on available training data.

21 5. Results & Evaluation (1/2)
A complete prototype EMPathIE system exists which can produce filled templates.
The terminology recognition portion has been informally reviewed by molecular biologists. => Remarkably good.
The PASTA system has been implemented as far as the terminology recognition stage. Preliminary template design has been carried out, and work on building a domain model is starting.
A corpus of 52 abstracts of journal articles has been manually annotated with terminology classes. => Allows an automatic evaluation of the PASTA terminology system using the MUC scoring software.
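An automatic evaluation of this kind compares system annotations with the manual ones and reports precision, recall and F-measure. The sketch below assumes exact matching of (start, end, class) triples; the MUC scoring software applies more detailed matching rules, and the example annotations are invented for illustration.

```python
def score(gold, predicted):
    """Return (precision, recall, F1) for exact-match annotations."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented annotations: (start offset, end offset, terminology class).
gold = [(0, 8, "protein"), (12, 20, "residue"), (25, 30, "species"), (40, 45, "protein")]
predicted = [(0, 8, "protein"), (12, 20, "residue"), (25, 30, "protein")]

precision, recall, f1 = score(gold, predicted)
# Two of three predictions are correct (precision 2/3) and
# two of four gold annotations are found (recall 1/2).
```

The third prediction has the right span but the wrong class, so under exact matching it counts against both precision and recall.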

22 5. Results & Evaluation(2/2) Initial Named Entity results for the PASTA system

23 6. Conclusion
In these two projects, much of the low-level work of moving IE systems into the molecular biology domain has been carried out:
– Generalizing the software to cope with longer, multi-sectioned articles with embedded SGML.
– Generalizing tokenisation routines to cope with scientific nomenclature.
– Generalizing terminology recognition procedures to deal with a broad range of molecular biological terminology.
Good progress has also been made in designing template elements, template relations, and scenario templates.