Published by Arthur Paul. Modified over 9 years ago.
1
Scaling Up BioNLP: Application of a Text Annotation Architecture to Noun Compound Bracketing
Preslav Nakov, Ariel Schwartz, Brian Wolf, Marti Hearst
Computer Science Division and SIMS, University of California, Berkeley
http://biotext.berkeley.edu
Supported by NSF DBI-0317510 and a gift from Genentech
2
Plan
- Overview
- Noun compound (NC) bracketing
- Problems with Web Counts
- Layers of annotation
- Applying LQL to NC bracketing
- Evaluation
3
Overview
- Motivation: NLP requires re-use of results for additional processing: a pipeline feeding end applications such as data mining, IR, etc.
- Proposed solution:
    - Layers of annotations over text
    - Layered Query Language (LQL)
- Illustration: application to noun compound bracketing
4
Plan
- Overview
- Noun compound (NC) bracketing
- Problems with Web Counts
- Layers of annotation
- Applying LQL to NC bracketing
- Evaluation
5
Noun Compound Bracketing
(a) [ [ liver cell ] antibody ]  (left bracketing)
(b) [ liver [ cell line ] ]  (right bracketing)
In (a), the antibody targets the liver cell. In (b), the cell line is derived from the liver.
6
Dependency vs. Adjacency
[Figure: diagrams of the dependency model and the adjacency model; labels: right, left]
7
Related Work
- Marcus (1980), Pustejovsky et al. (1993), Resnik (1993): adjacency model, Pr(w1|w2) vs. Pr(w2|w3)
- Lauer (1995): dependency model, Pr(w1|w2) vs. Pr(w1|w3)
- Keller & Lapata (2004): use the Web
    - unigrams and bigrams
- Nakov & Hearst (2005): will be presented at CoNLL! Use the Web
    - n-grams
    - paraphrases
    - surface features
Slide callouts: Pr(w1|w2) is the Pr that w1 precedes w2; instead of Pr one can use #, MI, or χ².
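The difference between the two models on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `bracket`, the `assoc` callback (any association score such as Pr, #, MI, or χ²), the tie-breaking rule (ties go left), and the toy counts in the usage example are all assumptions.

```python
from typing import Callable


def bracket(w1: str, w2: str, w3: str,
            assoc: Callable[[str, str], float],
            model: str = "dependency") -> str:
    """Return 'left' or 'right' for the three-word compound w1 w2 w3.

    Both models score the left bracketing with assoc(w1, w2).
    The adjacency model compares it with assoc(w2, w3);
    the dependency model compares it with assoc(w1, w3).
    """
    left_score = assoc(w1, w2)
    if model == "adjacency":
        right_score = assoc(w2, w3)
    elif model == "dependency":
        right_score = assoc(w1, w3)
    else:
        raise ValueError(f"unknown model: {model!r}")
    # Assumption: ties are resolved in favour of left bracketing.
    return "left" if left_score >= right_score else "right"


# Toy, made-up counts used only to exercise the function.
toy_counts = {
    ("liver", "cell"): 50,
    ("cell", "antibody"): 10,
    ("liver", "antibody"): 5,
    ("cell", "line"): 80,
    ("liver", "line"): 60,
}


def toy_assoc(a: str, b: str) -> float:
    return toy_counts.get((a, b), 0)


print(bracket("liver", "cell", "antibody", toy_assoc, "adjacency"))
print(bracket("liver", "cell", "line", toy_assoc, "dependency"))
```

Passing the association score in as a callback keeps the model choice separate from the statistic, which mirrors the slide's point that Pr can be swapped for #, MI, or χ².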
8
Nakov & Hearst (2005)
- Web page hits: proxy for n-gram frequencies
- Sample surface features:
    - amino-acid sequence: left
    - brain stem’s cell: left
    - brain’s stem cell: right
- Majority vote to combine the different models
- Accuracy: 89.34% on Lauer’s set (baseline: 66.70%; previous best result: 80.70%): state of the art
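A majority vote over several models can be sketched as follows. The slide only says that the models are combined by majority vote; the handling of abstaining models (`None`) and the default used for ties are assumptions made for this illustration.

```python
from collections import Counter
from typing import Iterable, Optional


def majority_vote(votes: Iterable[Optional[str]], default: str = "left") -> str:
    """Combine per-model predictions ('left', 'right', or None for no opinion)."""
    counted = Counter(v for v in votes if v is not None)
    left, right = counted["left"], counted["right"]
    if left == right:
        # Assumption: fall back to a default when the vote is tied or empty.
        return default
    return "left" if left > right else "right"


print(majority_vote(["left", "right", "left", None]))
```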
9
Plan
- Overview
- Noun compound (NC) bracketing
- Problems with Web Counts
- Layers of annotation
- Applying LQL to NC bracketing
- Evaluation
10
Web Counts: Problems
- Page hits are inaccurate
    - maybe not that bad (Keller & Lapata, 2003)
- The Web lacks linguistic annotation
    - Pr(health|care) = #("health care") / #(care)
        - health: noun
        - care: both verb and noun
        - can be adjacent by chance
        - can come from different sentences
    - Cannot find:
        - stem cells VERB PREPOSITION brain
        - protein synthesis’ inhibition
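The "can come from different sentences" problem is easy to reproduce on a toy corpus. The sketch below counts a bigram either across the raw token stream (as an unannotated index effectively does) or within sentence boundaries. The function name, the `respect_boundaries` flag, and the example sentences are made up for illustration.

```python
def count_bigram(sentences, w1, w2, respect_boundaries=True):
    """Count occurrences of the bigram (w1, w2) in tokenized sentences.

    With respect_boundaries=False the sentences are concatenated first,
    so a bigram may straddle a sentence boundary.
    """
    if respect_boundaries:
        streams = sentences
    else:
        streams = [[tok for sent in sentences for tok in sent]]
    total = 0
    for tokens in streams:
        lowered = [tok.lower() for tok in tokens]
        total += sum(1 for a, b in zip(lowered, lowered[1:]) if (a, b) == (w1, w2))
    return total


toy_sentences = [
    ["Exercise", "improves", "health"],
    ["Care", "is", "needed"],
]
print(count_bigram(toy_sentences, "health", "care", respect_boundaries=False))
print(count_bigram(toy_sentences, "health", "care", respect_boundaries=True))
```

Here "health" and "care" are adjacent only because one sentence ends where the next begins, and "care" cannot be restricted to its noun reading without part-of-speech annotation.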
11
Plan
- Overview
- Noun compound (NC) bracketing
- Problems with Web Counts
- Layers of annotation
- Applying LQL to NC bracketing
- Evaluation
12
Solution: MEDLINE+LQL
- MEDLINE: ~13M abstracts
- We annotated:
    - ~1.4M abstracts
    - ~10M sentences
    - ~320M annotations
- Layered Query Language: demo at ACL! http://biotext.berkeley.edu/lql/
13
The System
- Built on top of an RDBMS
- Supports layers of annotations over text
    - hierarchical, overlapping
    - cannot be represented by a single-file XML
- Specialized query language: LQL (Layered Query Language)
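To see why a relational store suits overlapping layers, consider the following sketch using Python's built-in `sqlite3`. It is not the system's actual schema: the table name `annotation`, its columns, the token-offset convention, and the two `demo_a`/`demo_b` layers are all hypothetical. Each annotation is a row with a layer name and a span, so containment and partial overlap become simple range comparisons.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical annotation table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE annotation ("
        " doc_id INTEGER, layer TEXT, tag_name TEXT,"
        " start_pos INTEGER, end_pos INTEGER, content TEXT)"
    )
    rows = [
        # Spans are token offsets [start_pos, end_pos) into one toy sentence.
        (1, "pos", "noun", 0, 1, "human"),
        (1, "pos", "noun", 1, 2, "immunodeficiency"),
        (1, "pos", "noun", 2, 3, "virus"),
        (1, "shallow_parse", "NP", 0, 3, "human immunodeficiency virus"),
        # Two made-up layers whose spans overlap without nesting.
        (1, "demo_a", "span", 0, 2, "human immunodeficiency"),
        (1, "demo_b", "span", 1, 3, "immunodeficiency virus"),
    ]
    conn.executemany("INSERT INTO annotation VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn


def contained_in(conn, outer_layer, outer_tag, inner_layer):
    """Contents of inner-layer annotations lying inside an outer annotation."""
    sql = (
        "SELECT a.content FROM annotation AS o"
        " JOIN annotation AS a"
        "   ON a.doc_id = o.doc_id"
        "  AND a.start_pos >= o.start_pos AND a.end_pos <= o.end_pos"
        " WHERE o.layer = ? AND o.tag_name = ? AND a.layer = ?"
        " ORDER BY a.start_pos"
    )
    return [row[0] for row in conn.execute(sql, (outer_layer, outer_tag, inner_layer))]


def partial_overlaps(conn, layer_x, layer_y):
    """Pairs of annotations that overlap although neither contains the other."""
    sql = (
        "SELECT x.content, y.content FROM annotation AS x"
        " JOIN annotation AS y"
        "   ON x.doc_id = y.doc_id"
        "  AND x.start_pos < y.start_pos"
        "  AND y.start_pos < x.end_pos"
        "  AND x.end_pos < y.end_pos"
        " WHERE x.layer = ? AND y.layer = ?"
    )
    return conn.execute(sql, (layer_x, layer_y)).fetchall()


conn = build_demo_db()
print(contained_in(conn, "shallow_parse", "NP", "pos"))
print(partial_overlaps(conn, "demo_a", "demo_b"))
```

The two `demo` spans cross each other, which a single XML tree cannot express because its elements must nest properly.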
14
Annotated Example
[Figure: annotated example text; visible layer labels: LocusLink, MeSH]
15
Plan
- Overview
- Noun compound (NC) bracketing
- Problems with Web Counts
- Layers of annotation
- Applying LQL to NC bracketing
- Evaluation
16
Noun Compound Extraction (1)

    FROM
      [layer='shallow_parse' && tag_name='NP'
        ^ [layer='pos' && tag_name="noun"]
          [layer='pos' && tag_name="noun"]
          [layer='pos' && tag_name="noun"] $
      ] AS compound
    SELECT compound.text

Slide callouts:
- ^ : layers' beginnings should match
- $ : layers' endings should match
- By default: nothing can go in between
17
Noun Compound Extraction (2)

    SELECT LOWER(comp.text) AS lc, COUNT(*) AS freq
    FROM (
      BEGIN_LQL
        FROM
          [layer='shallow_parse' && tag_name='NP'
            ^ [layer='pos' && tag_name="noun"]
              [layer='pos' && tag_name="noun"]
              [layer='pos' && tag_name="noun"] $
          ] AS compound
        SELECT compound.text
      END_LQL
    ) AS comp
    GROUP BY lc
    ORDER BY freq DESC

BUT! This does not allow adjectives, determiners, etc.
18
Noun Compound Extraction (3)

    SELECT LOWER(comp.text) AS lc, COUNT(*) AS freq
    FROM (
      BEGIN_LQL
        FROM
          [layer='shallow_parse' && tag_name='NP'
            ^ ( { ALLOW GAPS }
                ![layer='pos' && tag_name="noun"]
                ( [layer='pos' && tag_name="noun"]
                  [layer='pos' && tag_name="noun"]
                  [layer='pos' && tag_name="noun"] ) $
              ) $
          ] AS compound
        SELECT compound.text
      END_LQL
    ) AS comp
    GROUP BY lc
    ORDER BY freq DESC

Slide callouts:
- ![...] : layer negation
- ( ... ) : artificial range
- For details: paper, online demo, demo at ACL
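For readers without access to LQL, the extraction queries can be approximated in plain Python over pre-chunked, POS-tagged noun phrases. This is a sketch under stated assumptions, not a translation of LQL semantics: the input format (each NP as a list of `(token, tag)` pairs with the tag `"noun"`), the function name, and the `allow_modifiers` flag are hypothetical. The relaxed mode reflects one reading of query (3): the NP must end in exactly three nouns, optionally preceded by non-noun material such as determiners or adjectives. Returning only the three-noun tail, rather than the full NP text, is a choice made here.

```python
from collections import Counter


def three_noun_compounds(noun_phrases, allow_modifiers=False):
    """Count lowercased three-noun compounds found in tagged noun phrases."""
    counts = Counter()
    for np_tokens in noun_phrases:
        tags = [tag for _, tag in np_tokens]
        # The NP must end in three consecutive nouns.
        if len(tags) < 3 or tags[-3:] != ["noun"] * 3:
            continue
        prefix = tags[:-3]
        if prefix and not allow_modifiers:
            continue  # strict mode: the NP is exactly noun noun noun
        if prefix and prefix[-1] == "noun":
            continue  # a fourth trailing noun means a longer compound
        text = " ".join(tok.lower() for tok, _ in np_tokens[-3:])
        counts[text] += 1
    return counts


toy_nps = [
    [("Bone", "noun"), ("marrow", "noun"), ("cells", "noun")],
    [("the", "DT"), ("bone", "noun"), ("marrow", "noun"), ("cells", "noun")],
    [("human", "noun"), ("bone", "noun"), ("marrow", "noun"), ("cells", "noun")],
    [("stem", "noun"), ("cells", "noun")],
]
print(three_noun_compounds(toy_nps).most_common())
print(three_noun_compounds(toy_nps, allow_modifiers=True).most_common())
```

`Counter.most_common()` plays the role of `GROUP BY lc ORDER BY freq DESC` in the SQL wrapper.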
19
Finding Bigram Counts

    SELECT COUNT(*) AS freq
    FROM (
      BEGIN_LQL
        FROM
          [layer='shallow_parse' && tag_name='NP'
            [layer='pos' && tag_name="noun" && content="immunodeficiency"] AS word1
            [layer='pos' && tag_name="noun" && (content="virus"||content="viruses")]
          ]
        SELECT word1.content
      END_LQL
    ) AS word
    ORDER BY freq DESC

Slide callouts:
- Inflections: UMLS Specialist lexicon
- Outer query: just count
20
Paraphrases
Types of noun compounds (Warren, 1978):
- Paraphrasable
    - Prepositional
        - immunodeficiency virus in humans: right
    - Verbal
        - virus causing human immunodeficiency: left
        - immunodeficiency virus found in humans: right
    - Copula
        - immunodeficiency virus that is human: right
- Other
21
Prepositional Paraphrases

    SELECT LOWER(prp.content) lp, COUNT(*) AS freq
    FROM (
      BEGIN_LQL
        FROM
          [layer='sentence'
            [layer='pos' && tag_name="noun" && content = "immunodeficiency"]
            [layer='pos' && tag_name="noun" && content IN ("virus","viruses")]
            [layer='pos' && tag_name='IN'] AS prep
            ?[layer='pos' && tag_name='DT' && content IN ("the","a","an")]
            [layer='pos' && tag_name="noun" && content IN ("human", "humans")]
          ]
        SELECT prep.content
      END_LQL
    ) AS prp
    GROUP BY lp
    ORDER BY freq DESC

Slide callout:
- ?[...] : optional layer
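The same pattern (w2 w3, a preposition, an optional determiner, then w1) can be matched over tagged sentences in plain Python. The sketch below is an approximation for illustration only: the input format (lists of `(token, tag)` pairs using the tags `noun`, `IN`, and `DT` as in the query), the function name, and the toy sentences are assumptions.

```python
from collections import Counter


def count_prepositions(sentences, modifier, heads, objects,
                       determiners=("the", "a", "an")):
    """Count prepositions in 'modifier head PREP (DT)? object' sequences."""
    counts = Counter()
    for sent in sentences:
        toks = [(tok.lower(), tag) for tok, tag in sent]
        for i in range(len(toks) - 3):
            if toks[i] != (modifier, "noun"):
                continue
            if toks[i + 1][1] != "noun" or toks[i + 1][0] not in heads:
                continue
            prep, prep_tag = toks[i + 2]
            if prep_tag != "IN":
                continue
            j = i + 3
            # Optional determiner, mirroring the ?[...] optional layer.
            if toks[j][1] == "DT" and toks[j][0] in determiners:
                j += 1
            if j < len(toks) and toks[j][1] == "noun" and toks[j][0] in objects:
                counts[prep] += 1
    return counts


toy_tagged = [
    [("Immunodeficiency", "noun"), ("viruses", "noun"), ("in", "IN"), ("humans", "noun")],
    [("immunodeficiency", "noun"), ("virus", "noun"), ("of", "IN"),
     ("the", "DT"), ("human", "noun")],
    [("immunodeficiency", "noun"), ("virus", "noun"), ("in", "IN"), ("mice", "noun")],
]
result = count_prepositions(
    toy_tagged, "immunodeficiency", {"virus", "viruses"}, {"human", "humans"}
)
print(result.most_common())
```

Each match is evidence for right bracketing of "human immunodeficiency virus", since the paraphrase keeps "immunodeficiency virus" together as a unit.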
22
Plan
- Overview
- Noun compound (NC) bracketing
- Problems with Web Counts
- Layers of annotation
- Applying LQL to NC bracketing
- Evaluation
23
- Obtained 418,678 noun compounds (NCs)
- Annotated the top 232 NCs
    - agreement: 88%
    - kappa: .606
- Baseline (left): 83.19%
- n-grams: Pr, #, χ²
- prepositional paraphrases
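The gap between 88% raw agreement and a kappa of .606 is easier to interpret with the formula in hand. The sketch below assumes the reported statistic is Cohen's kappa for two annotators, kappa = (Po - Pe) / (1 - Pe), which the slide does not state explicitly. The helper names and the toy label lists are made up.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both annotators pick the same label independently.
    expected = sum(freq_a[lab] * freq_b[lab]
                   for lab in set(labels_a) | set(labels_b)) / (n * n)
    return (observed - expected) / (1 - expected)


def implied_chance_agreement(observed, kappa):
    """Solve kappa = (Po - Pe) / (1 - Pe) for Pe."""
    return (observed - kappa) / (1 - kappa)


toy_a = ["left"] * 8 + ["right"] * 2
toy_b = ["left"] * 7 + ["right"] * 3
print(round(cohen_kappa(toy_a, toy_b), 3))
print(round(implied_chance_agreement(0.88, 0.606), 3))
```

Under that assumption, the slide's figures imply a chance agreement of roughly 0.70, which fits a label distribution heavily skewed toward left bracketing (the left baseline is 83.19%).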
24
Results
[Figure: results chart; legend: correct, N/A, wrong]
25
Discussion
- Semantics of bone marrow cells
    - top verbal paraphrases
        - cells derived from (the) bone marrow (22 instances)
        - cells isolated from (the) bone marrow (14 instances)
    - top prepositional paraphrases
        - cells in (the) bone marrow (456 instances)
        - cells from (the) bone marrow (108 instances)
- Finding hard examples for NC bracketing
    - w1 w2 w3 such that both w1 w2 and w2 w3 are MeSH terms
    - Web cannot do it!
26
The End
Thank you!
Layered Query Language: demo at ACL! http://biotext.berkeley.edu/lql/