Scaling Up BioNLP: Application of a Text Annotation Architecture to Noun Compound Bracketing
  Preslav Nakov, Ariel Schwartz, Brian Wolf, Marti Hearst
  Computer Science Division and SIMS
  University of California, Berkeley
  Supported by NSF DBI and a gift from Genentech
Plan
  Overview
  Noun compound (NC) bracketing
  Problems with Web Counts
  Layers of annotation
  Applying LQL to NC bracketing
  Evaluation
Overview
  Motivation: NLP requires re-use of results for additional processing
    pipeline for end applications: data mining, IR, etc.
  Proposed solution:
    layers of annotations over text
    Layered Query Language (LQL)
  Illustration: application to noun compound bracketing
Noun Compound Bracketing
  (a) [ [ liver cell ] antibody ] (left bracketing)
  (b) [ liver [ cell line ] ] (right bracketing)
  In (a), the antibody targets the liver cell.
  In (b), the cell line is derived from the liver.
Dependency vs. Adjacency
  [Figure: dependency model and adjacency model; labels: right, left]
Related Work
  Marcus (1980), Pustejovsky et al. (1993), Resnik (1993)
    adjacency model: Pr(w1|w2) vs. Pr(w2|w3)
  Lauer (1995)
    dependency model: Pr(w1|w2) vs. Pr(w1|w3)
  Keller & Lapata (2004): use the Web
    unigrams and bigrams
  Nakov & Hearst (2005): will be presented at CoNLL!
    use the Web
    n-grams
    paraphrases
    surface features
  Pr(w1|w2): Pr that w1 precedes w2 (or #, MI, χ²)
  [Figure: dependency model, adjacency model]
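The difference between the two models on this slide can be made concrete with a minimal Python sketch. The function name `bracket`, the tie-breaking rule (ties go to left) and the counts in `TOY_COUNTS` are all assumptions made for illustration; they are not taken from any of the cited papers. Any association score (Pr, #, MI, χ²) could be plugged in as `assoc`.

```python
from typing import Callable

# Made-up association scores, for illustration only.
TOY_COUNTS = {
    ("liver", "cell"): 50,
    ("cell", "antibody"): 10,
    ("liver", "antibody"): 5,
    ("cell", "line"): 200,
    ("liver", "line"): 60,
}


def toy_assoc(a: str, b: str) -> float:
    """Look up a toy association score for the ordered pair (a, b)."""
    return TOY_COUNTS.get((a, b), 0)


def bracket(w1: str, w2: str, w3: str,
            assoc: Callable[[str, str], float],
            model: str = "dependency") -> str:
    """Return 'left' or 'right' for the three-word compound w1 w2 w3."""
    left_score = assoc(w1, w2)
    if model == "adjacency":
        # adjacency model: (w1, w2) vs. (w2, w3)
        right_score = assoc(w2, w3)
    elif model == "dependency":
        # dependency model: (w1, w2) vs. (w1, w3)
        right_score = assoc(w1, w3)
    else:
        raise ValueError(f"unknown model: {model}")
    # Assumption: ties are resolved in favour of left bracketing.
    return "left" if left_score >= right_score else "right"


print(bracket("liver", "cell", "antibody", toy_assoc, "adjacency"))
print(bracket("liver", "cell", "line", toy_assoc, "dependency"))
```

The two models only differ in which pair competes with (w1, w2), so they can disagree whenever (w2, w3) and (w1, w3) have very different scores.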
Nakov & Hearst (2005)
  Web page hits: proxy for n-gram frequencies
  Sample surface features
    amino-acid sequence: left
    brain stem’s cell: left
    brain’s stem cell: right
  Majority vote to combine the different models
  Accuracy: 89.34% on Lauer’s set (baseline: 66.70%; previous best result: 80.70%), state of the art
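A majority vote over several models might look roughly like the sketch below. The function name `majority_vote`, the use of `None` for a model that abstains, and the choice to return `None` on a tie are assumptions for illustration; the slide does not say how abstentions or ties are handled.

```python
from collections import Counter
from typing import Iterable, Optional


def majority_vote(votes: Iterable[Optional[str]]) -> Optional[str]:
    """Combine per-model predictions ('left', 'right', or None for no answer)."""
    counts = Counter(v for v in votes if v is not None)
    if counts["left"] > counts["right"]:
        return "left"
    if counts["right"] > counts["left"]:
        return "right"
    # Assumption: a tie (or no votes at all) yields no decision.
    return None


# Hypothetical outputs of four models for one compound.
print(majority_vote(["left", "left", None, "right"]))
```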
Web Counts: Problems
  Page hits are inaccurate
    maybe not that bad (Keller & Lapata, 2003)
  The Web lacks linguistic annotation
    Pr(health|care) = #("health care") / #(care)
      health: noun
      care: both verb and noun
      can be adjacent by chance
      can come from different sentences
  Cannot find:
    stem cells VERB PREPOSITION brain
    protein synthesis’ inhibition
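The effect of missing part-of-speech information on Pr(health|care) can be illustrated with a small sketch. The tagged token list, the Penn-style tags and the function name `cond_prob` are assumptions made up for this example; no real Web counts are involved.

```python
def cond_prob(tokens, first, second, noun_only=False):
    """Estimate Pr(first | second) = #(first second) / #(second).

    tokens is a list of (word, pos) pairs; with noun_only=True, only
    occurrences tagged as nouns (tags starting with 'NN') are counted.
    """
    def matches(tok, word):
        w, pos = tok
        return w == word and (not noun_only or pos.startswith("NN"))

    bigrams = sum(1 for a, b in zip(tokens, tokens[1:])
                  if matches(a, first) and matches(b, second))
    unigrams = sum(1 for t in tokens if matches(t, second))
    return bigrams / unigrams if unigrams else 0.0


# Toy corpus: "care" occurs once as a noun and once as a verb.
TOKENS = [
    ("health", "NN"), ("care", "NN"), ("costs", "NNS"), ("rise", "VBP"), (".", "."),
    ("they", "PRP"), ("care", "VBP"), ("about", "IN"), ("health", "NN"), (".", "."),
]

print(cond_prob(TOKENS, "health", "care"))                  # counts every "care"
print(cond_prob(TOKENS, "health", "care", noun_only=True))  # counts noun "care" only
```

Without tags, the verb use of "care" inflates the denominator, which is the kind of distortion the slide attributes to unannotated Web text.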
Solution: MEDLINE + LQL
  MEDLINE: ~13M abstracts
  We annotated:
    ~1.4M abstracts
    ~10M sentences
    ~320M annotations
  Layered Query Language: demo at ACL!
The System
  Built on top of an RDBMS
  Supports layers of annotations over text
    hierarchical, overlapping
    cannot be represented by a single-file XML
  Specialized query language
    LQL (Layered Query Language)
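One way to picture layered, overlapping annotations in a relational store is the sketch below, which uses Python's built-in `sqlite3` module. The table name `annotation`, its columns and the sample rows are hypothetical; the slides do not describe the actual schema of the system. Each annotation is a row with a layer, a tag and a token span, so annotations from different layers can overlap freely.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per annotation, addressed by token offsets.
conn.execute("""
    CREATE TABLE annotation (
        doc_id    INTEGER,
        layer     TEXT,
        tag_name  TEXT,
        first_tok INTEGER,
        last_tok  INTEGER,
        content   TEXT
    )
""")
rows = [
    (1, "shallow_parse", "NP",   0, 2, "liver cell line"),
    (1, "pos",           "noun", 0, 0, "liver"),
    (1, "pos",           "noun", 1, 1, "cell"),
    (1, "pos",           "noun", 2, 2, "line"),
    (1, "mesh",          "term", 1, 2, "cell line"),  # overlaps the NP and two POS tags
]
conn.executemany("INSERT INTO annotation VALUES (?, ?, ?, ?, ?, ?)", rows)

# POS annotations whose span lies inside an NP annotation of the same document.
inside_np = conn.execute("""
    SELECT inner_a.content
    FROM annotation AS outer_a
    JOIN annotation AS inner_a
      ON inner_a.doc_id = outer_a.doc_id
     AND inner_a.first_tok >= outer_a.first_tok
     AND inner_a.last_tok  <= outer_a.last_tok
    WHERE outer_a.layer = 'shallow_parse' AND outer_a.tag_name = 'NP'
      AND inner_a.layer = 'pos'
    ORDER BY inner_a.first_tok
""").fetchall()
words_in_np = [r[0] for r in inside_np]
print(words_in_np)
```

Containment queries like this self-join are verbose in plain SQL, which gives some intuition for why a specialized query language on top of the database is attractive.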
Annotated Example
  [Figure: annotated example; visible layer labels: LocusLink, MeSH]
Noun Compound Extraction (1)
  FROM
    [layer='shallow_parse' && tag_name='NP'
      ^ [layer='pos' && tag_name="noun"]
        [layer='pos' && tag_name="noun"]
        [layer='pos' && tag_name="noun"] $
    ] AS compound
  SELECT compound.text
  ^ : layers’ beginnings should match
  $ : layers’ endings should match
  By default: nothing can go in between
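The matching behaviour of this query can be approximated in Python as a rough sketch. The data layout (a list of `(word, pos)` tokens plus NP spans given as half-open token ranges) and the function name `three_noun_nps` are assumptions for illustration, not part of LQL.

```python
def three_noun_nps(tokens, np_spans):
    """Return the text of every NP that consists of exactly three nouns.

    tokens:   list of (word, pos) pairs for one sentence
    np_spans: list of (start, end) token ranges, end exclusive
    """
    matches = []
    for start, end in np_spans:
        chunk = tokens[start:end]
        # Both boundaries must coincide with the NP, and nothing else may intervene.
        if len(chunk) == 3 and all(pos == "noun" for _, pos in chunk):
            matches.append(" ".join(word for word, _ in chunk))
    return matches


TOKENS = [
    ("liver", "noun"), ("cell", "noun"), ("line", "noun"), ("was", "verb"),
    ("the", "det"), ("cell", "noun"), ("line", "noun"),
]
NP_SPANS = [(0, 3), (4, 7)]
print(three_noun_nps(TOKENS, NP_SPANS))
```

The second NP is rejected because it starts with a determiner, so the strict pattern misses compounds inside longer noun phrases.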
Noun Compound Extraction (2)
  SELECT LOWER(comp.text) AS lc, COUNT(*) AS freq
  FROM (
    BEGIN_LQL
      FROM
        [layer='shallow_parse' && tag_name='NP'
          ^ [layer='pos' && tag_name="noun"]
            [layer='pos' && tag_name="noun"]
            [layer='pos' && tag_name="noun"] $
        ] AS compound
      SELECT compound.text
    END_LQL
  ) AS comp
  GROUP BY lc
  ORDER BY freq DESC
  BUT! Does not allow adjectives, determiners, etc.
Noun Compound Extraction (3)
  SELECT LOWER(comp.text) AS lc, COUNT(*) AS freq
  FROM (
    BEGIN_LQL
      FROM
        [layer='shallow_parse' && tag_name='NP'
          ^ ( { ALLOW GAPS }
              ![layer='pos' && tag_name="noun"]
              ( [layer='pos' && tag_name="noun"]
                [layer='pos' && tag_name="noun"]
                [layer='pos' && tag_name="noun"] ) $
            ) $
        ] AS compound
      SELECT compound.text
    END_LQL
  ) AS comp
  GROUP BY lc
  ORDER BY freq DESC
  ! : layer negation
  ( ) : artificial range
  For details:
    paper
    online demo
    demo at ACL
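One plausible reading of this relaxed query is: the NP must end in three nouns, and whatever precedes them inside the NP must not end in a noun. The sketch below implements that reading together with the lowercasing and frequency count of the outer SQL. The interpretation itself, the data layout and the function name `trailing_three_nouns` are assumptions; the paper is the authority on the exact LQL semantics.

```python
from collections import Counter


def trailing_three_nouns(tokens, np_spans):
    """Count lowercased three-noun sequences that end an NP.

    The token right before the three nouns, if it is still inside the NP,
    must not be a noun (so four-noun compounds are skipped).
    """
    counts = Counter()
    for start, end in np_spans:  # end is exclusive
        if end - start < 3:
            continue
        tail = tokens[end - 3:end]
        if not all(pos == "noun" for _, pos in tail):
            continue
        if end - 3 > start and tokens[end - 4][1] == "noun":
            continue
        counts[" ".join(word for word, _ in tail).lower()] += 1
    return counts.most_common()  # sorted by descending frequency


TOKENS = [
    ("The", "det"), ("human", "adj"), ("Immunodeficiency", "noun"),
    ("virus", "noun"), ("infection", "noun"), ("in", "prep"),
    ("bone", "noun"), ("marrow", "noun"), ("stem", "noun"), ("cell", "noun"),
]
NP_SPANS = [(0, 5), (6, 10)]
print(trailing_three_nouns(TOKENS, NP_SPANS))
```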
Finding Bigram Counts
  SELECT COUNT(*) AS freq
  FROM (
    BEGIN_LQL
      FROM
        [layer='shallow_parse' && tag_name='NP'
          [layer='pos' && tag_name="noun" && content="immunodeficiency"] AS word1
          [layer='pos' && tag_name="noun" && (content="virus" || content="viruses")]
        ]
      SELECT word1.content
    END_LQL
  ) AS word
  ORDER BY freq DESC
  Inflections: UMLS Specialist lexicon
  just count
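The bigram count with inflected forms can be sketched in Python as follows. The tiny `INFLECTIONS` table is a hypothetical stand-in for the UMLS Specialist lexicon, and the input format (a list of NPs, each a list of `(word, pos)` pairs) and the function name `bigram_count` are likewise assumptions for illustration.

```python
# Hypothetical stand-in for inflection lookup in the UMLS Specialist lexicon.
INFLECTIONS = {"virus": {"virus", "viruses"}}


def bigram_count(noun_phrases, w1, w2):
    """Count adjacent noun pairs (w1, w2) inside NPs, allowing inflected forms."""
    forms1 = INFLECTIONS.get(w1, {w1})
    forms2 = INFLECTIONS.get(w2, {w2})
    total = 0
    for np_tokens in noun_phrases:
        for (a, pos_a), (b, pos_b) in zip(np_tokens, np_tokens[1:]):
            if (pos_a == "noun" and pos_b == "noun"
                    and a.lower() in forms1 and b.lower() in forms2):
                total += 1
    return total


NPS = [
    [("human", "adj"), ("immunodeficiency", "noun"), ("virus", "noun")],
    [("immunodeficiency", "noun"), ("viruses", "noun")],
    [("immunodeficiency", "noun"), ("syndrome", "noun")],
]
print(bigram_count(NPS, "immunodeficiency", "virus"))
```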
Paraphrases
  Types of noun compounds (Warren, 1978):
    Paraphrasable
      Prepositional
        immunodeficiency virus in humans: right
      Verbal
        virus causing human immunodeficiency: left
        immunodeficiency virus found in humans: right
      Copula
        immunodeficiency virus that is human: right
    Other
Prepositional Paraphrases
  SELECT LOWER(prp.content) lp, COUNT(*) AS freq
  FROM (
    BEGIN_LQL
      FROM
        [layer='sentence'
          [layer='pos' && tag_name="noun" && content = "immunodeficiency"]
          [layer='pos' && tag_name="noun" && content IN ("virus","viruses")]
          [layer='pos' && tag_name='IN'] AS prep
          ?[layer='pos' && tag_name='DT' && content IN ("the","a","an")]
          [layer='pos' && tag_name="noun" && content IN ("human", "humans")]
        ]
      SELECT prep.content
    END_LQL
  ) AS prp
  GROUP BY lp
  ORDER BY freq DESC
  ? : optional layer
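The pattern in this query (noun, noun, preposition, optional determiner, noun) can be mimicked over POS-tagged sentences as a rough Python sketch. The function name `prep_paraphrases`, the token format and the sample sentences are assumptions for illustration; only the tag names `IN` and `DT` and the word lists come from the query.

```python
from collections import Counter


def prep_paraphrases(sentences, w1, w2_forms, w3_forms,
                     determiners=("the", "a", "an")):
    """Count prepositions in 'w1 w2 PREP (DT)? w3' patterns.

    sentences: list of sentences, each a list of (word, pos) pairs.
    """
    counts = Counter()
    for toks in sentences:
        for i in range(len(toks) - 3):
            (a, pos_a), (b, pos_b), (prep, pos_p) = toks[i:i + 3]
            if not (pos_a == "noun" and a.lower() == w1
                    and pos_b == "noun" and b.lower() in w2_forms
                    and pos_p == "IN"):
                continue
            j = i + 3
            # Optional determiner between the preposition and the final noun.
            if toks[j][1] == "DT" and toks[j][0].lower() in determiners:
                j += 1
            if j < len(toks) and toks[j][1] == "noun" and toks[j][0].lower() in w3_forms:
                counts[prep.lower()] += 1
    return counts


SENTENCES = [
    [("immunodeficiency", "noun"), ("virus", "noun"), ("in", "IN"), ("humans", "noun")],
    [("immunodeficiency", "noun"), ("viruses", "noun"), ("of", "IN"),
     ("the", "DT"), ("human", "noun")],
    [("immunodeficiency", "noun"), ("virus", "noun"), ("in", "IN"), ("mice", "noun")],
]
result = prep_paraphrases(SENTENCES, "immunodeficiency",
                          {"virus", "viruses"}, {"human", "humans"})
print(result.most_common())
```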
Evaluation
  Obtained 418,678 noun compounds (NCs)
  Annotated the top 232 NCs
    agreement: 88%
    kappa: 0.606
  Baseline (left): 83.19%
  n-grams: Pr, #, χ²
  prepositional paraphrases
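For readers unfamiliar with the kappa statistic, the sketch below computes Cohen's kappa, (p_o - p_e) / (1 - p_e), for two annotators. It assumes the reported kappa is of this two-annotator kind, which the slide does not state, and the label lists are invented toy data, not the 232 annotated compounds.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (p_o - p_e) / (1 - p_e)


ANNOTATOR_A = ["left", "left", "left", "right"]
ANNOTATOR_B = ["left", "left", "right", "right"]
print(cohen_kappa(ANNOTATOR_A, ANNOTATOR_B))
```

When one class dominates, as with an 83.19% left baseline, chance agreement is high, so a raw agreement of 88% can correspond to a noticeably lower kappa.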
Results
  [Figure: results chart; legend: correct, N/A, wrong]
Discussion
  Semantics of bone marrow cells
    top verbal paraphrases
      cells derived from (the) bone marrow (22 instances)
      cells isolated from (the) bone marrow (14 instances)
    top prepositional paraphrases
      cells in (the) bone marrow (456 instances)
      cells from (the) bone marrow (108 instances)
  Finding hard examples for NC bracketing
    w1 w2 w3 such that both w1 w2 and w2 w3 are MeSH terms
    Web cannot do it!
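The search for hard examples amounts to a simple filter, sketched below in Python. The function name `hard_examples` and the toy term set are assumptions; the terms are illustrative strings, not an extract of the real MeSH vocabulary.

```python
def hard_examples(compounds, mesh_terms):
    """Return three-word compounds w1 w2 w3 where both 'w1 w2' and 'w2 w3' are terms."""
    terms = {t.lower() for t in mesh_terms}
    hard = []
    for compound in compounds:
        words = compound.lower().split()
        if len(words) != 3:
            continue
        w1, w2, w3 = words
        if f"{w1} {w2}" in terms and f"{w2} {w3}" in terms:
            hard.append(compound)
    return hard


# Toy term list for illustration only.
TOY_TERMS = {"bone marrow", "cell line", "stem cell", "cell transplantation"}
COMPOUNDS = ["bone marrow cells", "liver cell line", "stem cell transplantation"]
print(hard_examples(COMPOUNDS, TOY_TERMS))
```

Such compounds are hard because both bracketings contain a known term, so term lookup alone cannot decide between them.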
The End Thank you! Layered Query Language: demo at ACL!