Supporting Annotation Layers for Natural Language Processing

Presentation transcript:

Preslav Nakov, Ariel Schwartz, Brian Wolf, Marti Hearst
CS & SIMS, UC Berkeley

Project url: http://biotext.berkeley.edu/lql
Project support: NSF-DBI-0317510 & Genentech

Project overview

We demonstrate a system for flexible querying of text that has been annotated with the results of NLP processing. The system supports self-overlapping and parallel layers, integration of syntactic and ontological hierarchies, and tight integration with SQL. We present the Layered Query Language (LQL) and its use on examples taken from the NLP literature.

Annotation Layers Example

[Figure: the sentence "Overexpression of Bcl-2 results in insufficient white blood cell death and activation of p53." annotated on the word, part-of-speech, shallow-parse, ontology, and gene/protein layers. Full parse, sentence and section layers are not shown.]

Part of speech: Overexpression/NN of/IN Bcl-2/NN results/VBZ in/IN insufficient/JJ white/JJ blood/NN cell/NN death/NN and/CC activation/NN of/IN p53/NN
Shallow parse: NP PP NP VP PP NP NP PP NP

Example: Chemical–Disease Interactions

"Adherence to statin prevents one coronary heart disease event for every 429 patients."

Goal: extract the relation that statin (potentially) prevents coronary heart disease.

The MeSH C subtree contains diseases. MeSH supplementary concepts represent chemicals.

LQL query to find potentially useful sentences:

    FROM
      [layer='sentence' { NO ORDER, ALLOW GAPS }
        [layer='shallow_parse' && tag_name='NP'
          [layer='chemicals'] AS chemical $
        ]
        [layer='shallow_parse' && tag_name='NP'
          [layer='MeSH' && tree_number BELOW "C"] AS disease $
        ]
      ] AS sent
    SELECT chemical.content, disease.content, sent.content
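Read as a test on character intervals, the chemical-disease query asks for two non-overlapping NP annotations inside one sentence, one whose right edge coincides with a chemical annotation and one whose right edge coincides with a MeSH disease annotation. The following is a minimal Python sketch of that reading, not the project's implementation. The `Ann` class, the helper names, the toy sentence, and the prefix test standing in for `BELOW "C"` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ann:
    layer: str
    start: int      # first character offset (inclusive)
    end: int        # last character offset (inclusive)
    tag: str = ""   # tag name or, for the MeSH layer, a tree number


def span(text, phrase):
    """Inclusive character offsets of the first occurrence of phrase."""
    start = text.index(phrase)
    return start, start + len(phrase) - 1


def ends_with(outer, inner):
    """inner lies inside outer and shares its right edge (the `$` anchor)."""
    return outer.start <= inner.start and inner.end == outer.end


def overlaps(a, b):
    return a.start <= b.end and b.start <= a.end


def chemical_disease_pairs(sentence, anns):
    inside = [a for a in anns
              if sentence.start <= a.start and a.end <= sentence.end]
    nps = [a for a in inside if a.layer == "shallow_parse" and a.tag == "NP"]
    chemicals = [a for a in inside if a.layer == "chemicals"]
    # Assumption: `tree_number BELOW "C"` is approximated by a prefix test.
    diseases = [a for a in inside
                if a.layer == "MeSH" and a.tag.startswith("C")]
    pairs = []
    # Looping over ordered NP pairs covers both orders (NO ORDER).
    for np1 in nps:
        for np2 in nps:
            if np1 is np2 or overlaps(np1, np2):
                continue
            for chem in chemicals:
                if not ends_with(np1, chem):
                    continue
                for dis in diseases:
                    if ends_with(np2, dis):
                        pairs.append((chem, dis))
    return pairs


# Toy sentence and hand-made annotations (hypothetical, not from the corpus).
text = "Statin prevents coronary heart disease."
sentence = Ann("sentence", 0, len(text) - 1)
anns = [
    Ann("shallow_parse", *span(text, "Statin"), "NP"),
    Ann("shallow_parse", *span(text, "coronary heart disease"), "NP"),
    Ann("chemicals", *span(text, "Statin")),
    # Illustrative tree number, not taken from the poster.
    Ann("MeSH", *span(text, "coronary heart disease"), "C14.280.647.250"),
]
pairs = chemical_disease_pairs(sentence, anns)
found = [(text[c.start:c.end + 1], text[d.start:d.end + 1]) for c, d in pairs]
print(found)
```

The nested loops are only meant to make the constraints visible. The poster's system evaluates such queries against an indexed relational store rather than in memory.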
The chemical-disease query above extracts sentences containing two NPs in any order without overlaps (NO ORDER) and separated by any number of intervening elements (ALLOW GAPS). It requires one of the NPs to end with a chemical ($) and the other to end with a MeSH term from the C subtree (BELOW).

Example: Protein-Protein Interactions

Goal: find all sentences that consist of a noun phrase containing a gene, followed by a morphological variant of the verb "activate", "inhibit", or "bind", followed by another NP containing a gene.

The LQL query:

    SELECT p1.content, verb.content, p2.content, COUNT(*) AS cnt
    FROM (
      BEGIN_LQL
        [layer='sentence' { ALLOW GAPS }
          [layer='shallow_parse' && tag_name='NP'
            [layer='gene'] $
          ] AS p1
          [layer='pos' && tag_name="verb" &&
            (content ~ "activate%" || content ~ "inhibit%" || content ~ "bind%")
          ] AS verb
          [layer='shallow_parse' && tag_name='NP'
            [layer='gene'] $
          ] AS p2
        ]
        SELECT p1.content, verb.content, p2.content
      END_LQL
    )
    GROUP BY p1.content, verb.content, p2.content
    ORDER BY cnt DESC

Corpus:
1.4 million MEDLINE abstracts
10 million sentences annotated
320 million multi-layered annotations
70 GB database size

Key Contributions

Multiple overlapping layers (cannot be expressed in a single XML file).
Self-overlapping, parallel layers, allowing multiple syntactic parses of the same text.
Integration of multiple intersecting hierarchies (e.g. MeSH, UMLS, WordNet).
Specialized query language.
Flexible results format.
Focused on scaling annotation-based queries to very large corpora (millions of documents) with many layers of annotations.

Related Work

Tree systems. Overview: see (Bird et al., 2005). Examples: TGrep2, TIGERSearch, LPath, CorpusSearch, GSearch, Linguist's Search Engine, Netgraph, TIQL, VIQTORIA, etc.
Emu system: sequential levels of annotations. Hierarchical relations may exist between different levels, but must be explicitly defined. (Cassidy & Harrington, 2001)
NiteQL (the query language of MATE): highly expressive, allows querying of intersecting hierarchies; stored in XML. (McKelvie et al., 2001)
TIQL: queries manipulate intervals of text, indicated by XML tags; supports set operations. (Nenadic et al., 2002)
Annotation graphs: directed acyclic graph; nodes can have time stamps, constrained via paths to labeled parents and children. (Bird and Liberman, 2001)

Framework

Annotations are stored independently of text in an RDBMS.
Declarative query language for annotation retrieval.
Indexing structure designed for efficient query processing.
Layered Query Language for easy retrieval.
Object-oriented API for annotations: insertion, deletion and modification.

Indexing Architectures

Based on benchmarking, we use Architecture 5.

PMID  SECTION   LAYER ID      START CHAR POS  END CHAR POS  TAG TYPE  SEQUENCE POS  SENTENCE ID  WORD ID  FIRST WORD POS  LAST WORD POS
3345  b (body)  0 (word)      34              39            59571     1             2            59571    1               1
3345  b         0             41              48            55608     2             2            55608    2               2
3345  b         0             50              54            89985     3             2            89985    3               3
3345  b         1 (POS)       34              39            27 (NN)   1             2            59571    1               1
3345  b         1             41              48            53 (VB)   2             2            55608    2               2
3345  b         1             50              54            27        3             2            89985    3               3
3345  b         3 (s.parse)   34              39            31 (NP)   1             2                     1               1

Column groups marked in the figure: basic architecture; added, architecture 2; added, architecture 3; added, architecture 4; added, architecture 5.
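To make the storage idea concrete, here is a small sketch that loads the sample rows from the Indexing Architectures table into an in-memory SQLite database and answers a layer-crossing question with a self-join. It is an illustration under assumptions, not the project's schema: the table name `annotation`, the lowercase column names, the index definition, and the choice of SQLite are all hypothetical. The row values, layer ids (0 = word, 1 = POS, 3 = shallow parse) and tag codes (27 = NN, 53 = VB, 31 = NP) are read from the poster's table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE annotation (
        pmid INTEGER, section TEXT, layer_id INTEGER,
        start_char_pos INTEGER, end_char_pos INTEGER, tag_type INTEGER,
        sequence_pos INTEGER, sentence_id INTEGER, word_id INTEGER,
        first_word_pos INTEGER, last_word_pos INTEGER
    )
""")

# Sample rows as read from the poster's table; the NP row has no word id.
rows = [
    (3345, "b", 0, 34, 39, 59571, 1, 2, 59571, 1, 1),
    (3345, "b", 0, 41, 48, 55608, 2, 2, 55608, 2, 2),
    (3345, "b", 0, 50, 54, 89985, 3, 2, 89985, 3, 3),
    (3345, "b", 1, 34, 39, 27, 1, 2, 59571, 1, 1),
    (3345, "b", 1, 41, 48, 53, 2, 2, 55608, 2, 2),
    (3345, "b", 1, 50, 54, 27, 3, 2, 89985, 3, 3),
    (3345, "b", 3, 34, 39, 31, 1, 2, None, 1, 1),
]
conn.executemany(
    "INSERT INTO annotation VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", rows
)

# Hypothetical index; the poster does not specify the actual index layout.
conn.execute("""
    CREATE INDEX idx_sentence_layer
    ON annotation (pmid, section, sentence_id, layer_id, first_word_pos)
""")

# Nouns (POS tag 27) whose word position falls inside an NP chunk (tag 31).
nouns_in_np = conn.execute("""
    SELECT np.first_word_pos, np.last_word_pos, pos.word_id
    FROM annotation AS np
    JOIN annotation AS pos
      ON pos.pmid = np.pmid
     AND pos.section = np.section
     AND pos.sentence_id = np.sentence_id
     AND pos.first_word_pos BETWEEN np.first_word_pos AND np.last_word_pos
    WHERE np.layer_id = 3 AND np.tag_type = 31
      AND pos.layer_id = 1 AND pos.tag_type = 27
    ORDER BY pos.first_word_pos
""").fetchall()
print(nouns_in_np)
```

Storing word positions next to character offsets lets containment between layers be tested with integer comparisons on `first_word_pos` and `last_word_pos`, which is the kind of condition a B-tree index can serve.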
Layers of Annotations

Each annotation represents an interval spanning a sequence of characters, with absolute start and end positions.
Each layer corresponds to a conceptually different kind of annotation.
Layers can be:
Sequential.
Overlapping (e.g., two multiple-word concepts sharing a word).
Hierarchical: spanning, when the intervals are nested as in a parse tree, or ontological, when the token itself is derived from a hierarchical ontology.

Sample Output

Columns: PROTEIN 1, INTERACTION VERB, PROTEIN 2, FREQUENCY

Ca2 activates protein kinase 312 Cln3 activate 234 TAP binds transcription factor 192 TNF protein tyrosine kinase 133 serine/threonine kinase binding RhoA GTPase 132 Phospholamban inhibits ATPase 114 PRL activated 108 Interleukin 2 84 Prolactin AMPA 78 Nerve growth factor LPS inhibited MHC class II 75 Heat shock protein Binding p59 72 EPO STAT5 63 EGF PP2A 60 cis Sp1 50

Summary

A mechanism to effectively store and query layers of textual annotations.
Evaluated various structures for data storage and arrived at an efficient and simple one.
Implemented a concise and powerful annotation query language (LQL).
Built a web interface.
Planning to release the software to the research community.
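The three layer relations listed under Layers of Annotations (sequential, overlapping, hierarchical nesting) can be told apart from start and end offsets alone. The sketch below shows one possible way to classify a pair of intervals. The `Interval` type, the `relation` function, and the phrase spans chosen from the poster's example sentence are assumptions for illustration, not part of the described system.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    start: int  # first character offset (inclusive)
    end: int    # last character offset (inclusive)


def span(text, phrase):
    """Interval of the first occurrence of phrase in text."""
    start = text.index(phrase)
    return Interval(start, start + len(phrase) - 1)


def relation(a, b):
    """Classify two intervals as sequential, nested, or overlapping."""
    if a.end < b.start or b.end < a.start:
        return "sequential"      # no shared characters
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    if a_in_b or b_in_a:
        return "nested"          # identical intervals also count as nested
    return "overlapping"         # shared characters without containment


text = ("Overexpression of Bcl-2 results in insufficient white blood "
        "cell death and activation of p53.")

# Hypothetical spans: two words, two multi-word concepts, one NP chunk.
white = span(text, "white")
blood = span(text, "blood")
white_blood_cell = span(text, "white blood cell")
cell_death = span(text, "cell death")
noun_phrase = span(text, "insufficient white blood cell death")

print(relation(white, blood))
print(relation(white_blood_cell, cell_death))
print(relation(white_blood_cell, noun_phrase))
```

Here "white blood cell" and "cell death" share the word "cell", which is the kind of overlap between multi-word concepts that a single XML tree cannot encode.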