1
Scaling Up BioNLP: Application of a Text Annotation Architecture to Noun Compound Bracketing
Preslav Nakov, Ariel Schwartz, Brian Wolf, Marti Hearst
Computer Science Division and SIMS, University of California, Berkeley
http://biotext.berkeley.edu
Supported by NSF DBI-0317510 and a gift from Genentech
2
Plan
  Overview
  Noun compound (NC) bracketing
  Problems with Web Counts
  Layers of annotation
  Applying LQL to NC bracketing
  Evaluation
3
Overview
Motivation: need to re-use the results of NLP processing
  for additional processing
  for end applications: data mining, etc.
Proposed solution: layers of annotations over text
Illustration: application to noun compound bracketing
5
Noun Compound Bracketing
(a) [ [ liver cell ] antibody ] (left bracketing)
(b) [ liver [ cell line ] ] (right bracketing)
In (a), the antibody targets the liver cell.
In (b), the cell line is derived from the liver.
6
Related Work
Pustejovsky et al. (1993)
  adjacency model: Pr(w1|w2) vs. Pr(w2|w3)
Lauer (1995)
  dependency model: Pr(w1|w2) vs. Pr(w1|w3)
Keller & Lapata (2004)
  use the Web
  unigrams and bigrams
Nakov & Hearst (2005): will be presented at CoNLL!
  use the Web, Chi-squared
  n-grams
  paraphrases
  surface features
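The adjacency comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the count tables and the helper names `cond_prob` and `bracket_adjacency` are hypothetical, the numbers are made up, and Pr(w1|w2) is estimated with the same count ratio the talk uses for Pr(health|care).

```python
# Toy counts (invented for illustration, not taken from the talk).
UNIGRAMS = {"cell": 100, "line": 50}
BIGRAMS = {("liver", "cell"): 10, ("cell", "line"): 40}


def cond_prob(bigrams, unigrams, left, right):
    """Estimate Pr(left | right) as #(left right) / #(right)."""
    denom = unigrams.get(right, 0)
    if denom == 0:
        return 0.0
    return bigrams.get((left, right), 0) / denom


def bracket_adjacency(w1, w2, w3, bigrams, unigrams):
    """Adjacency model: compare Pr(w1|w2) with Pr(w2|w3).

    Returns "left", "right", or None when the two estimates tie.
    """
    p12 = cond_prob(bigrams, unigrams, w1, w2)
    p23 = cond_prob(bigrams, unigrams, w2, w3)
    if p12 > p23:
        return "left"
    if p23 > p12:
        return "right"
    return None


decision = bracket_adjacency("liver", "cell", "line", BIGRAMS, UNIGRAMS)
```

With these toy counts, Pr(liver|cell) = 10/100 and Pr(cell|line) = 40/50, so the sketch prefers the right bracketing [ liver [ cell line ] ].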
7
Nakov & Hearst (2005)
Web page hits: proxy for n-gram frequencies
Sample surface features
  amino-acid sequence: left
  brain stem's cell: left
  brain's stem cell: right
Majority vote to combine different models
Accuracy: 89.34%
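A majority vote over several models might look like the sketch below. The function name `majority_vote` and the convention that a model returns `None` when it abstains are assumptions made for illustration; the slide does not specify how ties or abstentions were handled.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model bracketing decisions by simple majority.

    Each prediction is "left", "right", or None (model abstains).
    Returns None when the vote is tied or no model voted.
    """
    votes = Counter(p for p in predictions if p in ("left", "right"))
    if votes["left"] > votes["right"]:
        return "left"
    if votes["right"] > votes["left"]:
        return "right"
    return None


combined = majority_vote(["left", "right", "left", None])
```

Returning `None` on a tie leaves room for a fallback such as a default bracketing, which is one possible design choice among several.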
9
Web Counts: Problems
The Web lacks linguistic annotation
Pr(health|care) = #("health care") / #(care)
  "health": returns nouns
  "care": returns both verbs and nouns
  can be adjacent by chance
  can come from different sentences
Cannot find:
  stem cells VERB PREPOSITION brain
  protein synthesis' inhibition
Page hits are inaccurate
11
Solution: MEDLINE+LQL
MEDLINE: ~13 million abstracts
We annotated:
  1.4 million abstracts
  ~10 million sentences
  ~320 million annotations
Layered Query Language: demo at ACL!
http://biotext.berkeley.edu/lql/
12
The System
Built on top of an RDBMS
Supports layers of annotations over text
  hierarchical, overlapping
  cannot be represented by a single-file XML
Specialized query language
  LQL (Layered Query Language)
13
Annotated Example
15
Noun Compound Extraction (1)
FROM
  [layer='shallow_parse' && tag_type='NP'
    ^ [layer='pos' && tag_type="noun"]
      [layer='pos' && tag_type="noun"]
      [layer='pos' && tag_type="noun"] $
  ] AS compound
SELECT compound.content
Callouts:
  ^ : layers' beginnings should match
  $ : layers' endings should match
16
Noun Compound Extraction (2)
SELECT LOWER(compound.content) AS lc, COUNT(*) AS freq
FROM
BEGIN_LQL
  FROM
    [layer='shallow_parse' && tag_type='NP'
      ^ [layer='pos' && tag_type="noun"]
        [layer='pos' && tag_type="noun"]
        [layer='pos' && tag_type="noun"] $
    ] AS compound
  SELECT compound.content
END_LQL
GROUP BY lc
ORDER BY freq DESC
17
Noun Compound Extraction (3)
SELECT LOWER(compound.content) AS lc, COUNT(*) AS freq
FROM
BEGIN_LQL
  FROM
    [layer='shallow_parse' && tag_type='NP'
      ^ ( { ALLOW GAPS }
          ![layer='pos' && tag_type="noun"]
          ( [layer='pos' && tag_type="noun"]
            [layer='pos' && tag_type="noun"]
            [layer='pos' && tag_type="noun"] ) $
        ) $
    ] AS compound
  SELECT compound.content
END_LQL
GROUP BY lc
ORDER BY freq DESC
Callouts:
  ! : layer negation
  ( ... ) : artificial range
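For readers without access to LQL, the intent of the three extraction queries can be approximated in plain Python over already-chunked, POS-tagged noun phrases. This is an analogue, not the LQL implementation: the input format (each NP as a list of `(word, tag)` pairs), the tag names other than "noun", and all function names are assumptions made for illustration.

```python
from collections import Counter

NOUN = "noun"


def exact_three_noun_np(np_tokens):
    """Queries (1)/(2): the NP consists of exactly three nouns."""
    if len(np_tokens) == 3 and all(tag == NOUN for _, tag in np_tokens):
        return " ".join(word for word, _ in np_tokens)
    return None


def final_three_nouns(np_tokens):
    """Query (3), approximately: the NP ends in three nouns that are
    immediately preceded by a non-noun token."""
    if len(np_tokens) < 4:
        return None
    before, tail = np_tokens[-4], np_tokens[-3:]
    if before[1] != NOUN and all(tag == NOUN for _, tag in tail):
        return " ".join(word for word, _ in tail)
    return None


def count_compounds(noun_phrases, extractor):
    """Lowercase and count extracted compounds (the GROUP BY step)."""
    counts = Counter()
    for np_tokens in noun_phrases:
        compound = extractor(np_tokens)
        if compound is not None:
            counts[compound.lower()] += 1
    return counts


# Invented sample NPs, used only to exercise the functions.
SAMPLE_NPS = [
    [("Bone", "noun"), ("marrow", "noun"), ("cells", "noun")],
    [("the", "det"), ("bone", "noun"), ("marrow", "noun"), ("cells", "noun")],
    [("normal", "adj"), ("liver", "noun"), ("cell", "noun"), ("line", "noun")],
    [("a", "det"), ("virus", "noun")],
]

exact_counts = count_compounds(SAMPLE_NPS, exact_three_noun_np)
final_counts = count_compounds(SAMPLE_NPS, final_three_nouns)
```

Calling `exact_counts.most_common()` or `final_counts.most_common()` gives the frequency-sorted list that corresponds to ORDER BY freq DESC.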
18
Finding Bigram Counts
SELECT COUNT(*) AS freq
FROM
BEGIN_LQL
  FROM
    [layer='shallow_parse' && tag_type='NP'
      [layer='pos' && tag_type="noun" && content="immunodeficiency"] AS word1
      [layer='pos' && tag_type="noun" && (content="virus" || content="viruses")]
    ]
  SELECT word1.content
END_LQL
19
Paraphrases
Types of paraphrases (Warren, 1978):
Prepositional
  immunodeficiency virus in humans: right
Verbal
  virus causing human immunodeficiency: left
  immunodeficiency virus found in humans: right
Copula
  immunodeficiency virus that is human: right
20
Prepositional Paraphrases
SELECT LOWER(prep.content) lp, COUNT(*) AS freq
FROM
BEGIN_LQL
  FROM
    [layer='sentence'
      [layer='pos' && tag_type="noun" && content = "immunodeficiency"]
      [layer='pos' && tag_type="noun" && content IN ("virus","viruses")]
      [layer='pos' && tag_type='IN'] AS prep
      ?[layer='pos' && tag_type='DT' && content IN ("the","a","an")]
      [layer='pos' && tag_type="noun" && content IN ("human", "humans")]
    ]
  SELECT prep.content
END_LQL
GROUP BY lp
ORDER BY freq DESC
Callout:
  ? : optional layer
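The same pattern (w2 w3, a preposition, an optional determiner, then w1) can be approximated in Python over POS-tagged sentences. This is a sketch, not a reproduction of LQL semantics: the sentence format (lists of `(word, tag)` pairs), the function name `prepositional_paraphrases`, and the sample sentences are all assumptions made for illustration, and the tokens are required to be adjacent.

```python
from collections import Counter

DETERMINERS = {"the", "a", "an"}


def prepositional_paraphrases(sentences, w1_forms, w2_forms, w3_forms):
    """Count prepositions seen in 'w2 w3 PREP [DT] w1' sequences."""
    counts = Counter()
    for sentence in sentences:
        words = [(word.lower(), tag) for word, tag in sentence]
        for i in range(len(words) - 3):
            (a, tag_a), (b, tag_b), (prep, tag_p) = words[i:i + 3]
            if not (tag_a == "noun" and a in w2_forms
                    and tag_b == "noun" and b in w3_forms
                    and tag_p == "IN"):
                continue
            j = i + 3
            # Skip one optional determiner (the '?' layer in the query).
            if words[j][1] == "DT" and words[j][0] in DETERMINERS:
                j += 1
            if j < len(words) and words[j][1] == "noun" and words[j][0] in w1_forms:
                counts[prep] += 1
    return counts


# Invented sample sentences, used only to exercise the function.
SAMPLE_SENTENCES = [
    [("immunodeficiency", "noun"), ("virus", "noun"),
     ("in", "IN"), ("humans", "noun")],
    [("Immunodeficiency", "noun"), ("viruses", "noun"),
     ("in", "IN"), ("the", "DT"), ("human", "noun")],
    [("immunodeficiency", "noun"), ("virus", "noun"),
     ("of", "IN"), ("a", "DT"), ("monkey", "noun")],
]

prep_counts = prepositional_paraphrases(
    SAMPLE_SENTENCES,
    w1_forms={"human", "humans"},
    w2_forms={"immunodeficiency"},
    w3_forms={"virus", "viruses"},
)
```

Passing sets of inflected forms mirrors the `content IN (...)` conditions in the query.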
22
Evaluation
obtained 418,678 noun compounds (NCs)
annotated the top 232 NCs (after cleaning)
  agreement: 88%
  kappa: .606
baseline (left): 83.19%
n-grams: Pr, #, χ²
prepositional paraphrases
for inflections, we used UMLS
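The slide reports both raw agreement and kappa. Assuming the statistic is Cohen's kappa for two annotators (the slide does not say which variant was used), it can be computed as in the sketch below. The function name and the four-item label lists are invented for illustration and do not reproduce the reported figures.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected) / (1 - expected)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items labeled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


annotator_1 = ["left", "left", "left", "right"]
annotator_2 = ["left", "left", "right", "right"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on 3 of 4 items (0.75) and chance agreement is 0.5, which gives kappa = 0.5. A strongly skewed label distribution, such as a corpus where most compounds are left-bracketed, raises chance agreement and can therefore yield a modest kappa despite high raw agreement.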
23
Results
[Table with columns: correct, N/A, wrong]
24
Discussion
Semantics of bone marrow cells
  top verbal paraphrases
    cells derived from bone marrow (22 instances)
    cells isolated from bone marrow (14 instances)
  top prepositional paraphrases
    cells in bone marrow (456 instances)
    cells from bone marrow (108 instances)
Finding hard examples for NC bracketing
  w1 w2 w3 such that both w1 w2 and w2 w3 are MeSH terms
25
The End Thank you!