Supporting Annotation Layers for Natural Language Processing Preslav Nakov, Ariel Schwartz, Brian Wolf, Marti Hearst Computer Science Division and SIMS.

Presentation transcript:

Supporting Annotation Layers for Natural Language Processing Preslav Nakov, Ariel Schwartz, Brian Wolf, Marti Hearst Computer Science Division and SIMS University of California, Berkeley Supported by NSF DBI and a gift from Genentech

Project overview
A system for flexible querying of text that has been annotated with the results of NLP processing. Supports self-overlapping and parallel layers, integration of syntactic and ontological hierarchies, and tight integration with SQL. Designed to scale to very large corpora. Demo of LQL (Layered Query Language) on examples taken from the NLP literature.

Key Contributions
- Multiple overlapping layers (cannot be expressed in a single XML file)
- Self-overlapping, parallel layers, allowing multiple syntactic parses of the same text
- Integration of multiple intersecting hierarchies (e.g. MeSH, UMLS, WordNet)
- Specialized query language
- Flexible results format
- Focused on scaling annotation-based queries to very large corpora (millions of documents) with many layers of annotations:
  - 1.4 million MEDLINE abstracts
  - 10 million sentences annotated
  - 320 million multi-layered annotations
  - 70 GB database size

Layers of Annotations
- Each annotation represents an interval spanning a sequence of characters, with absolute start and end positions
- Each layer corresponds to a conceptually different kind of annotation
- Layers can be:
  - Sequential
  - Overlapping (e.g., two multiple-word concepts sharing a word)
  - Hierarchical: spanning, when the intervals are nested as in a parse tree, or ontologically, when the token itself is derived from a hierarchical ontology
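The interval model described on this slide can be illustrated with a small Python sketch. It is not the system's implementation: the `Annotation` class, the helper names, and the sample spans are all hypothetical, chosen only to show how nesting (hierarchical layers) and crossing (overlapping layers) fall out of absolute character offsets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical annotation record: an interval over absolute character offsets."""
    layer: str
    start: int  # start offset, inclusive
    end: int    # end offset, exclusive
    tag: str = ""


def contains(outer: Annotation, inner: Annotation) -> bool:
    """True if inner lies entirely within outer (nesting, as in a parse tree)."""
    return outer.start <= inner.start and inner.end <= outer.end


def overlaps(a: Annotation, b: Annotation) -> bool:
    """True if the two intervals share at least one character."""
    return a.start < b.end and b.start < a.end


def crosses(a: Annotation, b: Annotation) -> bool:
    """True if the intervals overlap but neither contains the other."""
    return overlaps(a, b) and not contains(a, b) and not contains(b, a)


text = "coronary heart disease event"

# Illustrative spans only: one noun phrase and two multi-word concepts sharing "disease".
noun_phrase = Annotation("shallow_parse", 0, 28, "NP")
concept_a = Annotation("concept", 9, 22)   # "heart disease"
concept_b = Annotation("concept", 15, 28)  # "disease event"

print(text[concept_a.start:concept_a.end], "|", text[concept_b.start:concept_b.end])
print("nested in NP:", contains(noun_phrase, concept_a), contains(noun_phrase, concept_b))
print("crossing:", crosses(concept_a, concept_b))
```

Two crossing spans such as `concept_a` and `concept_b` cannot both be encoded as elements of one well-nested XML tree, which is the motivation the slides give for storing layers separately.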

Annotation Layers Example

System Architecture (Main table)
Columns: ANNOTATION_ID, PMID, SECTION, LAYER_ID, SENTENCE, FIRST_WORD_POS, LAST_WORD_POS, TAG_TYPE, WORD_ID, START_CHAR_POS, END_CHAR_POS
[Sample rows not recoverable from the slide]

System Architecture (Indexes)
(Forward) +doc_id+section+layer_id+sentence+first_word_pos+last_word_pos+tag_type
(Inverted) +layer_id+tag_type+doc_id+section+sentence+first_word_pos+last_word_pos
(Inverted) +word_id+layer_id+tag_type+doc_id+section+sentence+first_word_pos
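A minimal sketch of the main table and its three composite indexes, using Python's built-in `sqlite3` as a stand-in database. The slides do not give column types or index names, so the types, the `idx_*` names, and the sample row are assumptions; the sketch also assumes that `doc_id` in the index definitions refers to the `PMID` column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE annotation (
    annotation_id  INTEGER PRIMARY KEY,
    pmid           INTEGER NOT NULL,   -- assumed to be the doc_id of the index slides
    section        TEXT    NOT NULL,
    layer_id       INTEGER NOT NULL,
    sentence       INTEGER NOT NULL,
    first_word_pos INTEGER NOT NULL,
    last_word_pos  INTEGER NOT NULL,
    tag_type       TEXT,
    word_id        INTEGER,
    start_char_pos INTEGER,
    end_char_pos   INTEGER
);

-- Forward index: look up everything annotated at a given place in a document.
CREATE INDEX idx_forward ON annotation
    (pmid, section, layer_id, sentence, first_word_pos, last_word_pos, tag_type);

-- Inverted index: find where a given layer/tag occurs across the corpus.
CREATE INDEX idx_inv_layer ON annotation
    (layer_id, tag_type, pmid, section, sentence, first_word_pos, last_word_pos);

-- Inverted index: find where a given word occurs across the corpus.
CREATE INDEX idx_inv_word ON annotation
    (word_id, layer_id, tag_type, pmid, section, sentence, first_word_pos);
""")

# One made-up row, only to show the table in use.
conn.execute(
    "INSERT INTO annotation (pmid, section, layer_id, sentence, first_word_pos,"
    " last_word_pos, tag_type) VALUES (?, ?, ?, ?, ?, ?, ?)",
    (1, "abstract", 3, 0, 0, 2, "NP"),
)

index_names = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
)
np_rows = conn.execute(
    "SELECT pmid, sentence, first_word_pos, last_word_pos FROM annotation"
    " WHERE layer_id = ? AND tag_type = ?",
    (3, "NP"),
).fetchall()
print(index_names)
print(np_rows)
```

The column order in each index follows the slides: the forward index starts from the document, while the two inverted indexes start from the layer or word being searched for.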

Example query I: Protein-Protein Interactions
Goal: Find all sentences that consist of a noun phrase containing a gene followed by a morphological variant of the verb “activate”, “inhibit”, or “bind”, followed by another NP containing a gene.

Example query I - LQL

    SELECT p1_text, verb_content, p2_text, COUNT(*) AS cnt
    FROM (
      BEGIN_LQL
        [layer='sentence' { ALLOW GAPS }
          [layer='shallow_parse' && tag_name='NP'
            [layer='gene'] $
          ] AS p1
          [layer='pos' && tag_name="verb" &&
            (content ~ "activate%" || content ~ "inhibit%" || content ~ "bind%")
          ] AS verb
          [layer='shallow_parse' && tag_name='NP'
            [layer='gene'] $
          ] AS p2
        ]
        SELECT p1.text AS p1_text, verb.content AS verb_content, p2.text AS p2_text
      END_LQL
    ) lql
    GROUP BY p1_text, verb_content, p2_text
    ORDER BY count(*) DESC
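To make the pattern concrete, here is a rough Python emulation of what the query asks for, run over a hand-made toy sentence. It is not how LQL is evaluated: the `Ann` record, the function names, and the sample annotations are hypothetical, and the `$` marker in the query is not modeled. Prefix matching stands in for the `content ~ "activate%"` conditions, and a `Counter` stands in for the outer `GROUP BY ... COUNT(*)`.

```python
from collections import Counter
from typing import NamedTuple


class Ann(NamedTuple):
    """Hypothetical annotation over word positions within one sentence."""
    layer: str
    tag: str
    first: int  # first word position, inclusive
    last: int   # last word position, inclusive
    text: str


VERB_PREFIXES = ("activate", "inhibit", "bind")


def gene_nps(anns):
    """Noun phrases that contain at least one gene annotation."""
    genes = [a for a in anns if a.layer == "gene"]
    return [
        np for np in anns
        if np.layer == "shallow_parse" and np.tag == "NP"
        and any(np.first <= g.first and g.last <= np.last for g in genes)
    ]


def interactions(anns):
    """(NP, verb, NP) triples in left-to-right order, with gaps allowed in between."""
    nps = gene_nps(anns)
    verbs = [
        a for a in anns
        if a.layer == "pos" and a.tag == "verb"
        and a.text.lower().startswith(VERB_PREFIXES)
    ]
    return [
        (p1.text, v.text, p2.text)
        for p1 in nps for v in verbs for p2 in nps
        if p1.last < v.first and v.last < p2.first
    ]


# Toy sentence: "TNF strongly activates the protein kinase"
toy = [
    Ann("shallow_parse", "NP", 0, 0, "TNF"),
    Ann("gene", "", 0, 0, "TNF"),
    Ann("pos", "verb", 2, 2, "activates"),
    Ann("shallow_parse", "NP", 3, 5, "the protein kinase"),
    Ann("gene", "", 4, 5, "protein kinase"),
]

counts = Counter(interactions(toy))
for (p1, verb, p2), cnt in counts.most_common():
    print(p1, verb, p2, cnt)
```

The word "strongly" between the first NP and the verb is skipped, which is the behavior `ALLOW GAPS` requests in the query.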

Example query I – Sample output

    PROTEIN 1                 INTERACTION VERB   PROTEIN 2                 FREQUENCY
    Ca2                       activates          protein kinase            312
    Cln3                      activate           protein kinase            234
    TAP                       binds              transcription factor      192
    TNF                       activates          protein tyrosine kinase   133
    serine/threonine kinase   binding            RhoA GTPase               132
    Phospholamban             inhibits           ATPase                    114
    PRL                       activated          transcription factor      108
    Interleukin 2             activates          transcription factor      84
    Prolactin                 activates          transcription factor      84
    AMPA                      activated          protein kinase            78
    Nerve growth factor       activates          protein kinase            78
    LPS                       inhibited          MHC class II              75
    Heat shock protein        Binding            p59                       72
    EPO                       activated          STAT5                     63
    EGF                       activated          PP2A                      60
    cis                       binds              Sp1                       50

Example query II: Chemical–Disease Interactions
“Adherence to statin prevents one coronary heart disease event for every 429 patients.”
Goal: extract the relation that statin (potentially) prevents coronary heart disease.
- MeSH C subtree contains diseases
- MeSH supplementary concepts represent chemicals

Example query II - LQL

    [layer='sentence' { NO ORDER, ALLOW GAPS }
      [layer='shallow_parse' && tag_name='NP'
        [layer='chemicals'] AS chemical $
      ]
      [layer='shallow_parse' && tag_name='NP'
        [layer='mesh' && tree_number ~ 'C%'] AS disease $
      ]
    ] AS sent
    SELECT sent.pmid, chemical.text, disease.text, sent.text
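The effect of `NO ORDER` and of the `tree_number ~ 'C%'` filter can be sketched in Python over toy annotations. Everything here is illustrative: the `Ann` record, the chunking of the sample sentence, and the tree number value are assumptions; the sketch assumes the two NPs must be distinct; and the `$` marker is not modeled.

```python
from typing import NamedTuple


class Ann(NamedTuple):
    """Hypothetical annotation over word positions within one sentence."""
    layer: str
    first: int  # first word position, inclusive
    last: int   # last word position, inclusive
    text: str
    tag: str = ""
    tree_number: str = ""


def inside(inner: Ann, outer: Ann) -> bool:
    return outer.first <= inner.first and inner.last <= outer.last


def chemical_disease_pairs(anns):
    """Pairs of (chemical, disease) found in two distinct NPs, in either order."""
    nps = [a for a in anns if a.layer == "shallow_parse" and a.tag == "NP"]
    chemicals = [a for a in anns if a.layer == "chemicals"]
    # Prefix test standing in for tree_number ~ 'C%' (the MeSH disease subtree).
    diseases = [a for a in anns if a.layer == "mesh" and a.tree_number.startswith("C")]
    pairs = []
    for chem in chemicals:
        chem_nps = [np for np in nps if inside(chem, np)]
        for dis in diseases:
            dis_nps = [np for np in nps if inside(dis, np)]
            if any(x != y for x in chem_nps for y in dis_nps):
                pairs.append((chem.text, dis.text))
    return pairs


# Toy annotations for "Adherence to statin prevents one coronary heart disease event ..."
toy = [
    Ann("shallow_parse", 2, 2, "statin", tag="NP"),
    Ann("chemicals", 2, 2, "statin"),
    Ann("shallow_parse", 4, 8, "one coronary heart disease event", tag="NP"),
    # Illustrative tree number under the C subtree.
    Ann("mesh", 5, 7, "coronary heart disease", tree_number="C14.280"),
]

print(chemical_disease_pairs(toy))
```

Because no position comparison is made between the two NPs, a sentence in which the disease NP precedes the chemical NP would produce the same pair, which is the point of `NO ORDER`.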