Supporting Annotation Layers for Natural Language Processing
Preslav Nakov, Ariel Schwartz, Brian Wolf, Marti Hearst
Computer Science Division and SIMS, University of California, Berkeley
Supported by NSF DBI and a gift from Genentech
Project overview
- A system for flexible querying of text that has been annotated with the results of NLP processing.
- Supports self-overlapping and parallel layers, integration of syntactic and ontological hierarchies, and tight integration with SQL.
- Designed to scale to very large corpora.
- Demo of LQL (Layered Query Language) on examples taken from the NLP literature.
Key Contributions
- Multiple overlapping layers (cannot be expressed in a single XML file)
- Self-overlapping, parallel layers, allowing multiple syntactic parses of the same text
- Integration of multiple intersecting hierarchies (e.g. MeSH, UMLS, WordNet)
- Specialized query language
- Flexible results format
- Focused on scaling annotation-based queries to very large corpora (millions of documents) with many layers of annotations:
  - 1.4 million MEDLINE abstracts
  - 10 million sentences annotated
  - 320 million multi-layered annotations
  - 70 GB database size
Layers of Annotations
- Each annotation represents an interval spanning a sequence of characters, with absolute start and end positions
- Each layer corresponds to a conceptually different kind of annotation
- Layers can be:
  - Sequential
  - Overlapping (e.g., two multiple-word concepts sharing a word)
  - Hierarchical, either spanning (when the intervals are nested, as in a parse tree) or ontological (when the token itself is derived from a hierarchical ontology)
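The interval view of annotations described on this slide can be sketched in a few lines of Python. This is only an illustration, not the system's actual data model: the `Annotation` class, the `overlaps` and `contains` helpers, and the choice of an exclusive end offset are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One annotation: a character interval on a named layer (hypothetical model)."""
    layer: str      # e.g. "concept", "shallow_parse"
    start: int      # absolute start offset, inclusive
    end: int        # absolute end offset, exclusive (an assumption of this sketch)
    tag: str = ""


def overlaps(a: Annotation, b: Annotation) -> bool:
    """True when the two intervals share at least one character."""
    return a.start < b.end and b.start < a.end


def contains(outer: Annotation, inner: Annotation) -> bool:
    """True when inner is nested inside outer, as in a parse tree."""
    return outer.start <= inner.start and inner.end <= outer.end


text = "heat shock protein kinase"

# Two multi-word concepts on the same layer that share the word "protein".
c1 = Annotation("concept", 0, 18, "heat shock protein")
c2 = Annotation("concept", 11, 25, "protein kinase")
# A noun phrase on a different layer that spans both concepts.
np_chunk = Annotation("shallow_parse", 0, 25, "NP")

print(overlaps(c1, c2))        # self-overlapping layer
print(contains(np_chunk, c1))  # hierarchical (spanning) relation
```

Because `c1` and `c2` overlap without either containing the other, they cannot both be encoded as properly nested elements of one XML tree, which is the motivation for storing layers as independent intervals.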
Annotation Layers Example
System Architecture (Main table)
Columns: ANNOTATION_ID, PMID, SECTION, LAYER_ID, SENTENCE, FIRST_WORD_POS, LAST_WORD_POS, TAG_TYPE, WORD_ID, START_CHAR_POS, END_CHAR_POS
[Table: example rows]
System Architecture (Indexes)
(Forward) +doc_id+section+layer_id+sentence+first_word_pos+last_word_pos+tag_type
(Inverted) +layer_id+tag_type+doc_id+section+sentence+first_word_pos+last_word_pos
(Inverted) +word_id+layer_id+tag_type+doc_id+section+sentence+first_word_pos
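As a rough illustration of how the three composite indexes listed above relate to the main table, the sketch below builds them in an in-memory SQLite database. The slides do not say which database engine is used, so SQLite, the table name `annotation`, the index names, the column types, and the sample rows are all assumptions made for this example. The key is named `doc_id` as on this slide (the main-table slide calls it PMID).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE annotation (              -- hypothetical table name
    annotation_id  INTEGER PRIMARY KEY,
    doc_id         INTEGER,            -- PMID on the main-table slide
    section        TEXT,
    layer_id       INTEGER,
    sentence       INTEGER,
    first_word_pos INTEGER,
    last_word_pos  INTEGER,
    tag_type       INTEGER,
    word_id        INTEGER,
    start_char_pos INTEGER,
    end_char_pos   INTEGER
);
-- Forward index: walk all annotations of a document in text order.
CREATE INDEX idx_forward ON annotation
    (doc_id, section, layer_id, sentence, first_word_pos, last_word_pos, tag_type);
-- Inverted index: find every occurrence of a given layer and tag.
CREATE INDEX idx_inv_layer ON annotation
    (layer_id, tag_type, doc_id, section, sentence, first_word_pos, last_word_pos);
-- Inverted index: find every occurrence of a given word.
CREATE INDEX idx_inv_word ON annotation
    (word_id, layer_id, tag_type, doc_id, section, sentence, first_word_pos);
""")

# Made-up rows, for illustration only.
rows = [
    (1, 100, "abstract", 3, 0, 0, 0, 2, 11, 0, 3),
    (2, 100, "abstract", 3, 0, 2, 4, 2, None, 14, 37),
    (3, 100, "abstract", 5, 0, 1, 1, 7, 12, 4, 13),
    (4, 101, "title", 3, 0, 5, 5, 2, 13, 30, 34),
]
conn.executemany("INSERT INTO annotation VALUES (?,?,?,?,?,?,?,?,?,?,?)", rows)

# A lookup by (layer_id, tag_type) has the shape the second index is built for.
hits = conn.execute(
    "SELECT doc_id, sentence, first_word_pos FROM annotation "
    "WHERE layer_id = ? AND tag_type = ? "
    "ORDER BY doc_id, sentence, first_word_pos",
    (3, 2),
).fetchall()

index_names = sorted(
    r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND name LIKE 'idx%'"
    )
)
print(hits)
print(index_names)
```

The column order in each index mirrors the slide: the forward index starts from the document, while the two inverted indexes start from the layer/tag or the word, so a query can enter the data from whichever end its most selective constraint sits on.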
Example query I
Protein-Protein Interactions
Goal: Find all sentences that consist of a noun phrase containing a gene followed by a morphological variant of the verb “activate”, “inhibit”, or “bind”, followed by another NP containing a gene.
Example query I - LQL
SELECT p1_text, verb_content, p2_text, COUNT(*) AS cnt
FROM (
  BEGIN_LQL
    [layer='sentence' { ALLOW GAPS }
      [layer='shallow_parse' && tag_name='NP'
        [layer='gene'] $
      ] AS p1
      [layer='pos' && tag_name="verb" &&
        (content ~ "activate%" || content ~ "inhibit%" || content ~ "bind%")
      ] AS verb
      [layer='shallow_parse' && tag_name='NP'
        [layer='gene'] $
      ] AS p2
    ]
    SELECT p1.text AS p1_text, verb.content AS verb_content, p2.text AS p2_text
  END_LQL
) lql
GROUP BY p1_text, verb_content, p2_text
ORDER BY count(*) DESC
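To make the shape of this query concrete, here is a small Python sketch that evaluates the same pattern over hand-written annotations for one toy sentence. It is not how LQL is implemented. The `Ann` tuple, the helper names, and the toy sentence are invented for illustration, and two readings of the query are assumed: that `$` requires the gene to end where the enclosing NP ends, and that `content ~ "bind%"` is a case-insensitive prefix match.

```python
from collections import Counter
from typing import NamedTuple


class Ann(NamedTuple):
    """Hypothetical annotation record with word positions inside one sentence."""
    layer: str
    tag: str
    first: int   # first word position
    last: int    # last word position, inclusive
    text: str


VERB_PREFIXES = ("activate", "inhibit", "bind")


def nps_ending_in_gene(anns):
    """NPs whose final word is also the final word of a gene annotation."""
    genes = [a for a in anns if a.layer == "gene"]
    return [
        np for np in anns
        if np.layer == "shallow_parse" and np.tag == "NP"
        and any(np.first <= g.first and g.last == np.last for g in genes)
    ]


def match_sentence(anns):
    """Return (p1, verb, p2) triples in left-to-right order; gaps are allowed."""
    nps = nps_ending_in_gene(anns)
    verbs = [
        a for a in anns
        if a.layer == "pos" and a.tag == "verb"
        and a.text.lower().startswith(VERB_PREFIXES)
    ]
    return [
        (p1.text, v.text, p2.text)
        for p1 in nps for v in verbs for p2 in nps
        if p1.last < v.first and v.last < p2.first
    ]


# Toy sentence (made up): "TNF strongly activates protein tyrosine kinase"
toy = [
    Ann("shallow_parse", "NP", 0, 0, "TNF"),
    Ann("gene", "", 0, 0, "TNF"),
    Ann("pos", "verb", 2, 2, "activates"),
    Ann("shallow_parse", "NP", 3, 5, "protein tyrosine kinase"),
    Ann("gene", "", 3, 5, "protein tyrosine kinase"),
]

# Counter plays the role of GROUP BY ... COUNT(*) over all matching sentences.
counts = Counter(match_sentence(toy))
print(counts.most_common())
```

The word "strongly" sits between the first NP and the verb, which is the kind of intervening material that `ALLOW GAPS` permits.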
Example query I – Sample output
PROTEIN 1                  INTERACTION VERB   PROTEIN 2                 FREQUENCY
Ca2                        activates          protein kinase            312
Cln3                       activate           protein kinase            234
TAP                        binds              transcription factor      192
TNF                        activates          protein tyrosine kinase   133
serine/threonine kinase    binding            RhoA GTPase               132
Phospholamban              inhibits           ATPase                    114
PRL                        activated          transcription factor      108
Interleukin 2              activates          transcription factor      84
Prolactin                  activates          transcription factor      84
AMPA                       activated          protein kinase            78
Nerve growth factor        activates          protein kinase            78
LPS                        inhibited          MHC class II              75
Heat shock protein         Binding            p59                       72
EPO                        activated          STAT5                     63
EGF                        activated          PP2A                      60
cis                        binds              Sp1                       50
Example query II
Chemical–Disease Interactions
“Adherence to statin prevents one coronary heart disease event for every 429 patients.”
Goal: extract the relation that statin (potentially) prevents coronary heart disease.
- The MeSH C subtree contains diseases.
- MeSH supplementary concepts represent chemicals.
Example query II - LQL
[layer='sentence' { NO ORDER, ALLOW GAPS }
  [layer='shallow_parse' && tag_name='NP'
    [layer='chemicals'] AS chemical $
  ]
  [layer='shallow_parse' && tag_name='NP'
    [layer='mesh' && tree_number ~ 'C%'] AS disease $
  ]
] AS sent
SELECT sent.pmid, chemical.text, disease.text, sent.text
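The two ideas specific to this query, the `tree_number ~ 'C%'` test and the `NO ORDER` option, can be sketched as follows. This is a simplified illustration rather than the LQL engine: the function names and dictionary layout are hypothetical, the tree numbers are placeholders that were not looked up in MeSH, and the NP and `$` constraints are left out to keep the example short.

```python
from itertools import product


def like_prefix(value: str, pattern: str) -> bool:
    """Evaluate a pattern of the form 'PREFIX%' as a plain prefix test."""
    if not pattern.endswith("%") or "%" in pattern[:-1]:
        raise ValueError("this sketch only handles a single trailing %")
    return value.startswith(pattern[:-1])


def chemical_disease_pairs(chemicals, mesh_terms):
    """Pair every chemical with every MeSH term in the disease (C) subtree.

    Word positions are never compared, which is how NO ORDER is modelled
    here: the chemical may appear before or after the disease.
    """
    diseases = [t for t in mesh_terms if like_prefix(t["tree_number"], "C%")]
    return [(c["text"], d["text"]) for c, d in product(chemicals, diseases)]


# Annotations for the example sentence on the previous slide.
# Tree numbers are placeholders, not verified MeSH values.
chemicals = [{"text": "statin"}]
mesh_terms = [
    {"text": "coronary heart disease", "tree_number": "C14.000"},
    {"text": "patients", "tree_number": "M01.000"},
]

pairs = chemical_disease_pairs(chemicals, mesh_terms)
print(pairs)
```

Filtering on a tree-number prefix is what lets a single constraint select a whole ontology subtree: any term whose number starts with `C` is accepted, however deep it sits in the hierarchy.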