Supporting Annotation Layers for Natural Language Processing

Preslav Nakov, Ariel Schwartz, Brian Wolf, Marti Hearst
CS & SIMS, UC Berkeley

Project overview

We demonstrate a system for flexible querying of text that has been annotated with the results of NLP processing. The system supports self-overlapping and parallel layers, integration of syntactic and ontological hierarchies, and tight integration with SQL. We present the Layered Query Language (LQL) and its use on examples taken from the NLP literature.

Project url:
Project support: NSF-DBI & Genentech

Example: Chemical–Disease Interactions

“Adherence to statin prevents one coronary heart disease event for every 429 patients.”

Goal: extract the relation that statin (potentially) prevents coronary heart disease.

- The MeSH C subtree contains diseases.
- MeSH supplementary concepts represent chemicals.

LQL query to find potentially useful sentences:

    FROM
      [layer='sentence' { NO ORDER, ALLOW GAPS }
        [layer='shallow_parse' && tag_name='NP'
          [layer='chemicals'] AS chemical $
        ]
        [layer='shallow_parse' && tag_name='NP'
          [layer='MeSH' && tree_number BELOW "C"] AS disease $
        ]
      ] AS sent
    SELECT chemical.content, disease.content, sent.content

Annotation Layers Example

Sentence: "Overexpression of Bcl-2 results in insufficient white blood cell death and activation of p53."

Word             Part of Speech
Overexpression   NN
of               IN
Bcl-2            NN
results          VBZ
in               IN
insufficient     JJ
white            JJ
blood            NN
cell             NN
death            NN
and              CC
activation       NN
of               IN
p53              NN

Shallow parse (chunk tags, in sentence order): NP PP NP VP PP NP NP PP NP
Ontology (MeSH) IDs shown: D019254 D044465 D001769 D002477 D003643 D001773 D016923 D007962 D016158
Gene/protein IDs shown: 24224 596 281020 12043 397276 42722

Full parse, sentence and section layers are not shown.
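To make the Annotation Layers Example concrete, here is a minimal Python sketch of how the word and part-of-speech layers of that sentence could be held as character intervals. The `Annotation` class, the helper names, and the whitespace tokenization are assumptions made for illustration; they are not the authors' API, and the offsets are computed from the sentence string rather than taken from the poster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One annotation: a character interval on a named layer (hypothetical type)."""
    layer: str
    start: int  # first character offset, inclusive
    end: int    # last character offset, inclusive
    tag: str


SENTENCE = ("Overexpression of Bcl-2 results in insufficient white blood "
            "cell death and activation of p53.")
# POS tags as listed in the poster's example, one per word.
POS_TAGS = "NN IN NN VBZ IN JJ JJ NN NN NN CC NN IN NN".split()


def word_and_pos_layers(sentence, pos_tags):
    """Build parallel 'word' and 'pos' layers over the same character spans."""
    tokens = sentence.rstrip(".").split()  # assumption: whitespace tokenization
    if len(tokens) != len(pos_tags):
        raise ValueError("need exactly one POS tag per word")
    annotations = []
    cursor = 0
    for token, pos in zip(tokens, pos_tags):
        start = sentence.index(token, cursor)  # search forward only
        end = start + len(token) - 1
        annotations.append(Annotation("word", start, end, token))
        annotations.append(Annotation("pos", start, end, pos))
        cursor = end + 1
    return annotations


def covering(annotations, start, end):
    """Return annotations from any layer whose interval contains [start, end]."""
    return [a for a in annotations if a.start <= start and end <= a.end]


layers = word_and_pos_layers(SENTENCE, POS_TAGS)
for annotation in covering(layers, 18, 22):  # the span of "Bcl-2"
    print(annotation)
```

Because each layer only stores offsets into the text, further layers (shallow parse, MeSH, gene/protein) could be added the same way without touching the text or the existing layers.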
The chemical–disease LQL query extracts sentences containing two NPs in any order without overlaps (NO ORDER) and separated by any number of intervening elements (ALLOW GAPS). It requires one of the NPs to end with a chemical ($) and the other to end with a MeSH term from the C subtree (BELOW).

Example: Protein-Protein Interactions

Goal: Find all sentences that consist of a noun phrase containing a gene, followed by a morphological variant of the verb “activate”, “inhibit”, or “bind”, followed by another NP containing a gene.

The LQL Query

    SELECT p1.content, verb.content, p2.content, COUNT(*) AS cnt
    FROM (
      BEGIN_LQL
        [layer='sentence' { ALLOW GAPS }
          [layer='shallow_parse' && tag_name='NP'
            [layer='gene'] $
          ] AS p1
          [layer='pos' && tag_name="verb" &&
            (content ~ "activate%" || content ~ "inhibit%" || content ~ "bind%")
          ] AS verb
          [layer='shallow_parse' && tag_name='NP'
            [layer='gene'] $
          ] AS p2
        ]
        SELECT p1.content, verb.content, p2.content
      END_LQL
    )
    GROUP BY p1.content, verb.content, p2.content
    ORDER BY cnt DESC

Corpus:
- 1.4 million MEDLINE abstracts
- 10 million sentences annotated
- 320 million multi-layered annotations
- 70 GB database size

Related Work

- Tree systems. Overview: see (Bird et al., 2005). Examples: TGrep2, TIGERSearch, LPath, CorpusSearch, GSearch, Linguist’s Search Engine, Netgraph, TIQL, VIQTORIA, etc.
- Emu system: sequential levels of annotations. Hierarchical relations may exist between different levels, but must be explicitly defined. (Cassidy & Harrington, 2001)
- NiteQL (the query language of MATE): highly expressive, allows querying of intersecting hierarchies; stored in XML. (McKelvie et al., 2001)
- TIQL: queries manipulate intervals of text, indicated by XML tags; supports set operations. (Nenadic et al., 2002)
- Annotation graphs: directed acyclic graph; nodes can have time stamps, constrained via paths to labeled parents and children. (Bird and Liberman, 2001)

Key Contributions

- Multiple overlapping layers (cannot be expressed in a single XML file)
- Self-overlapping, parallel layers, allowing multiple syntactic parses of the same text
- Integration of multiple intersecting hierarchies (e.g. MeSH, UMLS, WordNet)
- Specialized query language
- Flexible results format
- Focused on scaling annotation-based queries to very large corpora (millions of documents) with many layers of annotations

Framework

- Annotations are stored independently of text in an RDBMS
- Declarative query language for annotation retrieval
- Indexing structure designed for efficient query processing
- Layered Query Language for easy retrieval
- Object Oriented API for annotations: insertion, deletion and modification

Indexing Architectures

Based on benchmarking, we use Architecture 5.

PMID  SECTION   LAYER ID      START CHAR POS  END CHAR POS  TAG TYPE  SEQUENCE POS  SENTENCE ID  WORD ID  FIRST WORD POS  LAST WORD POS
3345  b (body)  0 (word)      34              39            59571     1             2            59571    1               1
3345  b         0             41              48            55608     2             2            55608    2               2
3345  b         0             50              54            89985     3             2            89985    3               3
3345  b         1 (POS)       34              39            27 (NN)   1             2            59571    1               1
3345  b         1             41              48            53 (VB)   2             2            55608    2               2
3345  b         1             50              54            27        3             2            89985    3               3
3345  b         3 (s.parse)   34              39            31 (NP)   1             2                     1               1

Legend (column shading in the original poster): basic architecture; added in Architecture 2; added in Architecture 3; added in Architecture 4; added in Architecture 5.
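The basic index layout under Indexing Architectures can be tried out with a small relational sketch. The following Python example uses SQLite as a stand-in for the RDBMS and loads the seven example rows from the poster's table, restricted to the basic columns. The table name `annotation`, the snake_case column names, the index, and both queries are assumptions made for illustration; this is not the authors' schema, not Architecture 5, and not how LQL is actually translated to SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE annotation (
        pmid            INTEGER,  -- document id
        section         TEXT,     -- 'b' = body
        layer_id        INTEGER,  -- 0 = word, 1 = POS, 3 = shallow parse
        start_char_pos  INTEGER,  -- inclusive
        end_char_pos    INTEGER,  -- inclusive
        tag_type        INTEGER   -- word id, POS id, or chunk id
    )
    """
)
# Example rows from the poster (basic columns only).
rows = [
    (3345, "b", 0, 34, 39, 59571),
    (3345, "b", 0, 41, 48, 55608),
    (3345, "b", 0, 50, 54, 89985),
    (3345, "b", 1, 34, 39, 27),  # NN
    (3345, "b", 1, 41, 48, 53),  # VB
    (3345, "b", 1, 50, 54, 27),
    (3345, "b", 3, 34, 39, 31),  # NP
]
conn.executemany("INSERT INTO annotation VALUES (?, ?, ?, ?, ?, ?)", rows)
# Hypothetical index choice, picked only to support the interval joins below.
conn.execute(
    "CREATE INDEX idx_span ON annotation "
    "(pmid, section, layer_id, start_char_pos, end_char_pos)"
)

# Words whose character interval lies inside an NP chunk (interval containment).
words_in_np = conn.execute(
    """
    SELECT w.tag_type, w.start_char_pos, w.end_char_pos
    FROM annotation AS np
    JOIN annotation AS w
      ON w.pmid = np.pmid
     AND w.section = np.section
     AND w.start_char_pos >= np.start_char_pos
     AND w.end_char_pos <= np.end_char_pos
    WHERE np.layer_id = 3 AND np.tag_type = 31 AND w.layer_id = 0
    ORDER BY w.start_char_pos
    """
).fetchall()

# POS tag of every word, matched by identical character span.
pos_by_word = conn.execute(
    """
    SELECT w.tag_type, p.tag_type
    FROM annotation AS w
    JOIN annotation AS p
      ON p.pmid = w.pmid
     AND p.section = w.section
     AND p.start_char_pos = w.start_char_pos
     AND p.end_char_pos = w.end_char_pos
    WHERE w.layer_id = 0 AND p.layer_id = 1
    ORDER BY w.start_char_pos
    """
).fetchall()

print(words_in_np)
print(pos_by_word)
```

With only character offsets, every cross-layer condition becomes a range join like the first query. The extra position columns in the poster's table (sequence, sentence, and word positions) presumably exist to make such conditions cheaper to evaluate, which would fit the poster's statement that the choice of architecture was based on benchmarking.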
Layers of Annotations

- Each annotation represents an interval spanning a sequence of characters, with absolute start and end positions.
- Each layer corresponds to a conceptually different kind of annotation.
- Layers can be:
  - Sequential
  - Overlapping (e.g., two multiple-word concepts sharing a word)
  - Hierarchical: spanning, when the intervals are nested as in a parse tree, or ontologically, when the token itself is derived from a hierarchical ontology

Sample Output (protein-protein interaction query)

Columns: PROTEIN 1, INTERACTION VERB, PROTEIN 2, FREQUENCY
(one line per frequency value; cell boundaries are not recoverable for every row)

Ca2 activates protein kinase                          312
Cln3 activate                                         234
TAP binds transcription factor                        192
TNF protein tyrosine kinase                           133
serine/threonine kinase binding RhoA GTPase           132
Phospholamban inhibits ATPase                         114
PRL activated                                         108
Interleukin 2                                          84
Prolactin AMPA                                         78
Nerve growth factor LPS inhibited MHC class II         75
Heat shock protein Binding p59                         72
EPO STAT5                                              63
EGF PP2A                                               60
cis Sp1                                                50

Summary

- A mechanism to effectively store and query layers of textual annotations.
- Evaluated various structures for data storage and have arrived at an efficient and simple one.
- Implemented a concise and powerful annotation query language (LQL).
- Built a web interface.
- Planning to release the software to the research community.
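The interval relationships listed under Layers of Annotations (sequential, overlapping, nested) can be checked with simple offset comparisons. The sketch below is an illustration only: the `Span` type, the `relation` function, and the phrase used to derive offsets are hypothetical, and the ontological kind of hierarchy (a token derived from a hierarchical ontology) is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A character interval with inclusive offsets (hypothetical helper type)."""
    start: int
    end: int


def relation(a, b):
    """Classify two intervals as 'sequential', 'nested', or 'overlapping'."""
    if a.end < b.start or b.end < a.start:
        return "sequential"  # disjoint: one ends before the other starts
    a_contains_b = a.start <= b.start and b.end <= a.end
    b_contains_a = b.start <= a.start and a.end <= b.end
    if a_contains_b or b_contains_a:
        return "nested"  # spanning hierarchy, as in a parse tree
    return "overlapping"  # partial overlap, e.g. concepts sharing a word


PHRASE = "white blood cell death"


def span_of(text):
    """Locate `text` inside PHRASE and return its inclusive character span."""
    start = PHRASE.index(text)
    return Span(start, start + len(text) - 1)


print(relation(span_of("white"), span_of("death")))
print(relation(span_of("white blood cell"), span_of("blood cell death")))
print(relation(span_of("white blood cell death"), span_of("blood")))
```

Two identical spans are reported as nested here, which is one reasonable convention for parallel layers that annotate the same text.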

