Published by Clyde Bishop. Modified over 9 years ago.
1 Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing
  CoNLL-2005
  Preslav Nakov, EECS, Computer Science Division, University of California, Berkeley
  Marti Hearst, SIMS, University of California, Berkeley
2 Outline
  Introduction
  Related Work
  Models and Features
3 Introduction
  Noun compound bracketing -> noun compound interpretation
  liver cell antibody: [[liver cell] antibody]
  liver cell line: [liver [cell line]]
  The two compounds have the same POS sequence but different syntactic trees
4 This Paper
  A highly accurate unsupervised method for making bracketing decisions for noun compounds (NCs)
  Current approach: bigram estimates are used to compute adjacency and dependency scores
  Improvements
    the χ² measure
    a new set of surface features for querying Web search engines
  Evaluated on 2 domains: encyclopedia and bioscience
5 Related Work
  NC syntax and semantics
    Still active -> Journal of Computer Speech and Language, Special Issue on Multiword Expressions
  Adjacency model
  Probabilistic dependency model, Lauer (1995)
    Data sparseness (uses categories instead)
    244 NCs from an encyclopedia
    Inter-annotator agreement: 81.5%
    Baseline 66.8% -> 77.5%
    Adding POS -> state-of-the-art result of 80.7%
6 2003~2005
  Keller and Lapata (2003)
    Use Web search engines to obtain frequencies for unseen bigrams
    (2004) Apply this to six NLP tasks, including disambiguation of NCs
    Simpler version (frequency only): 78.68%
  Girju et al. (2005)
    Supervised (decision tree), 5 WordNet semantic features: 83.1%
7 Models and Features
  Adjacency and dependency models
  w1 w2 w3 -> [w1 [w2 w3]]: two possible reasons for right bracketing
    1. w2 w3 is a compound (modified by w1): home health care
       The adjacency model checks reason 1.
    2. w1 and w2 independently modify w3: adult male rat
       The dependency model (better) checks reason 2.
  Left bracketing -> only 1 choice: [law enforcement] agent
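The two decision rules on this slide can be sketched in a few lines. This is a minimal illustration, assuming the usual formulation in which the adjacency model compares the association of (w1, w2) with that of (w2, w3), while the dependency model compares (w1, w2) with (w1, w3). The function names and the counts in TOY_COUNTS are hypothetical and invented purely for illustration; they are not from the paper.

```python
from typing import Callable

# An association function maps a word pair to a score (e.g. a bigram count).
Assoc = Callable[[str, str], float]


def adjacency_bracketing(w1: str, w2: str, w3: str, assoc: Assoc) -> str:
    """Adjacency model: is (w1 w2) or (w2 w3) the tighter unit?"""
    left, right = assoc(w1, w2), assoc(w2, w3)
    if left == right:
        return "undecided"
    return "left" if left > right else "right"


def dependency_bracketing(w1: str, w2: str, w3: str, assoc: Assoc) -> str:
    """Dependency model: does w1 modify w2 (left) or w3 (right)?"""
    left, right = assoc(w1, w2), assoc(w1, w3)
    if left == right:
        return "undecided"
    return "left" if left > right else "right"


# Hypothetical counts, invented for illustration only.
TOY_COUNTS = {
    ("law", "enforcement"): 900,
    ("enforcement", "agent"): 120,
    ("law", "agent"): 15,
    ("adult", "male"): 300,
    ("male", "rat"): 250,
    ("adult", "rat"): 400,
}


def toy_assoc(a: str, b: str) -> float:
    """Look up a toy count; unseen pairs get 0."""
    return TOY_COUNTS.get((a, b), 0)


if __name__ == "__main__":
    for nc in [("law", "enforcement", "agent"), ("adult", "male", "rat")]:
        print(nc,
              "adjacency:", adjacency_bracketing(*nc, toy_assoc),
              "dependency:", dependency_bracketing(*nc, toy_assoc))
```

With these toy counts the two models agree on "law enforcement agent" but disagree on "adult male rat", where only the dependency model looks at the (w1, w3) pair.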
8 Computing Probabilities
  Alternative Calculations
9 χ² measure
  A = #(w_i, w_j)
  B = #(w_i) - A
  C = #(w_j) - A
  D = N - A - B - C
  N = 8T: Google's 8B pages x 1000 words/page
  χ² is better than MI (Yang and Pedersen, 1997)
10 蛋包飯 (omelette rice)
  Counts
    蛋    2067593
    蛋包  2217
    包    10207448
    包飯  3398
    飯    1672224
  χ²: 包飯 750.34 > 蛋包 67.32
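The χ² score defined on slide 9 can be sketched as below, assuming the standard 2x2 contingency-table form χ² = N(AD - BC)² / ((A+B)(C+D)(A+C)(B+D)), with A taken to be the bigram count. The function name is hypothetical, and N defaults to the 8 trillion estimate from slide 9. The absolute scores depend on N, so they need not match the figures shown on slide 10; only the comparison between the two candidate bigrams is illustrated here.

```python
def chi_squared(count_ij: float, count_i: float, count_j: float,
                n: float = 8e12) -> float:
    """Chi-squared association score for a word pair.

    count_ij: frequency of the bigram "w_i w_j"   (A on the slide)
    count_i:  frequency of w_i                    (A + B)
    count_j:  frequency of w_j                    (A + C)
    n:        estimated total number of bigrams   (N)
    """
    a = count_ij
    b = count_i - a
    c = count_j - a
    d = n - a - b - c
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


if __name__ == "__main__":
    # Counts copied from slide 10.
    score_dan_bao = chi_squared(2217, 2067593, 10207448)   # 蛋包
    score_bao_fan = chi_squared(3398, 10207448, 1672224)   # 包飯
    print("蛋包:", score_dan_bao)
    print("包飯:", score_bao_fan)
    print("stronger pair:", "包飯" if score_bao_fan > score_dan_bao else "蛋包")
```

Under this estimate of N the pair 包飯 still scores higher than 蛋包, which is the same ordering the slide reports.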
11 Web-Derived Surface Features (1/2)
  Authors sometimes (consciously or not) disambiguate the words they write by using surface-level markers to suggest the correct meaning.
  Dash (hyphen)
    Left bracketing: cell cycle analysis -> cell-cycle
    Right bracketing (less reliable): donor T-cell
    fiber optics-system, t-cell-depletion
  Possessive marker
    brain's stem cells, brain stem's cells, brain's stem-cells
  Internal capitalization
    Plasmodium vivax Malaria, brain Stem cells
    This feature is disabled for Roman digits and single-letter words: vitamin D deficiency
12 Web-Derived Surface Features (2/2)
  Embedded slashes
    leukemia/lymphoma cell
  Brackets
    growth factor (beta) or (growth factor) beta
    (brain) stem cells
  A comma, a dot or a colon
    "health care, provider" or "lung cancer: patients" (weak indicator)
  mouse-brain stem cells (weak indicator)
  Unfortunately, Web search engines ignore punctuation characters: hyphens, brackets, apostrophes, etc.
    The features are collected indirectly, by post-processing the resulting summaries (up to 1000 results)
  Some of the above features are clearly more reliable than others, but we do not try to weight them
  Verifying features
    Counts returned by the search engine: page hits as a proxy for n-gram frequencies
    Counts from the 1000 summaries
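The post-processing step described above can be sketched as a regular-expression pass over the returned summaries. This is a simplified illustration covering only the hyphen and possessive markers. The function name, the pattern set, and the assignment of each pattern to left or right bracketing are assumptions based on the examples on these two slides, not the authors' implementation.

```python
import re
from collections import Counter


def surface_votes(snippets, w1, w2, w3):
    """Count hyphen and possessive evidence for left vs. right bracketing."""
    a, b, c = (re.escape(w) for w in (w1, w2, w3))
    apos = "['\u2019]"  # straight or curly apostrophe
    patterns = {
        "left": [
            rf"\b{a}-{b}\s+{c}\b",           # cell-cycle analysis
            rf"\b{a}\s+{b}{apos}s\s+{c}\b",  # brain stem's cells
        ],
        "right": [
            rf"\b{a}\s+{b}-{c}\b",           # donor T-cell
            rf"\b{a}{apos}s\s+{b}\s+{c}\b",  # brain's stem cells
        ],
    }
    votes = Counter()
    for side, regexes in patterns.items():
        for regex in regexes:
            for snippet in snippets:
                # Case is ignored here, so capitalization evidence is not captured.
                votes[side] += len(re.findall(regex, snippet, flags=re.IGNORECASE))
    return votes


if __name__ == "__main__":
    # Hypothetical snippets standing in for search-engine summaries.
    summaries = [
        "Cell-cycle analysis of yeast mutants",
        "a cell cycle-analysis tool",
        "flow cytometry and cell-cycle analysis",
    ]
    print(surface_votes(summaries, "cell", "cycle", "analysis"))
```

Each match counts as one vote with equal weight, in line with the slide's remark that the features are not weighted.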
13 Other Web-Derived Features
  Abbreviations
    tumor necrosis factor (NF)
    tumor necrosis (TN) factor
  Concatenation
    health care reform -> healthcare, carereform
  Wildcard (*)
    "health care * reform"
    "health * care reform"
  Reorder
    reform health care, care reform health
    myosin heavy chain, heavy chain myosin
  Internal inflection variability
    tyrosine kinase activation, tyrosine kinases activation
  Switching
    For "adult male rat", we would also expect "male adult rat".
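Several of these features amount to generating query strings from the three words. The sketch below builds the concatenation, wildcard, reorder and switching queries. The function name is hypothetical, and the assignment of each pattern to left or right bracketing is an assumption inferred from the slide's examples (a pattern that keeps w1 w2 together is treated as left evidence, one that keeps w2 w3 together as right evidence). Abbreviations and inflection variability are left out because they need extra resources.

```python
def evidence_queries(w1: str, w2: str, w3: str) -> dict:
    """Build query strings whose hit counts would vote for a bracketing."""
    return {
        "left": [
            f"{w1}{w2}",              # concatenation: healthcare
            f'"{w1} {w2} * {w3}"',    # wildcard: "health care * reform"
            f'"{w3} {w1} {w2}"',      # reorder: "reform health care"
        ],
        "right": [
            f"{w2}{w3}",              # concatenation: carereform
            f'"{w1} * {w2} {w3}"',    # wildcard: "health * care reform"
            f'"{w2} {w3} {w1}"',      # reorder: "care reform health"
            f'"{w2} {w1} {w3}"',      # switching: "male adult rat"
        ],
    }


if __name__ == "__main__":
    for side, queries in evidence_queries("health", "care", "reform").items():
        print(side)
        for query in queries:
            print("  ", query)
```

Quoted strings stand for exact-phrase queries; how a given search engine treats the `*` wildcard is outside the scope of this sketch.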
14 New Findings
15 Paraphrases
  Warren (1978) proposes paraphrases as a test
    stem cells in the brain
    cells from the brain stem
  Copula paraphrase
    office building that/which is a skyscraper
  pain associated with arthritis migraine
  Search engines lack linguistic annotations, so a small set of hand-chosen paraphrases is used:
    associated with, caused by, contained in, derived from, focusing on, found in, involved in, located at/in, made of, performed by, preventing, related to and used by/in/for
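The hand-chosen paraphrase list can be turned into exact-phrase queries as sketched below. The templates are an assumption inferred from the two examples on this slide: "cells from the brain stem" keeps w1 w2 together (taken here as left evidence), while "stem cells in the brain" keeps w2 w3 together (taken as right evidence). The function name is hypothetical, and determiners and inflection are ignored for simplicity.

```python
# Paraphrase patterns from the slide, with "at/in" and "by/in/for" expanded.
PARAPHRASES = [
    "associated with", "caused by", "contained in", "derived from",
    "focusing on", "found in", "involved in", "located at", "located in",
    "made of", "performed by", "preventing", "related to",
    "used by", "used in", "used for",
]


def paraphrase_queries(w1: str, w2: str, w3: str) -> dict:
    """Generate paraphrase queries for a three-word noun compound."""
    return {
        # w3 <pattern> w1 w2, e.g. "cells derived from brain stem"
        "left": [f'"{w3} {p} {w1} {w2}"' for p in PARAPHRASES],
        # w2 w3 <pattern> w1, e.g. "stem cells found in brain"
        "right": [f'"{w2} {w3} {p} {w1}"' for p in PARAPHRASES],
    }


if __name__ == "__main__":
    queries = paraphrase_queries("brain", "stem", "cells")
    print(queries["left"][:3])
    print(queries["right"][:3])
```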
16 Evaluations
  Lauer's Dataset (1995)
    244 unambiguous 3-noun NCs
  Biomedical Dataset (Nakov et al., 2005, SIG BioLink)
    Open NLP tools: sentence-split, tokenized, POS-tagged and shallow-parsed
    A set of 1.4 million MEDLINE abstracts (citations between 1994 and 2003)
    500 NCs: 361 left, 69 right, 70 ambiguous
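As a quick arithmetic check on the biomedical counts, the snippet below computes the accuracy of a trivial "always bracket left" baseline. It assumes the 70 ambiguous compounds are excluded from scoring, which the slide does not state explicitly; the variable names are illustrative.

```python
# Counts from the slide.
left, right, ambiguous = 361, 69, 70

total = left + right + ambiguous   # all annotated NCs
unambiguous = left + right         # assumed evaluation set

# Accuracy of always predicting left bracketing on the unambiguous NCs.
left_baseline = left / unambiguous

if __name__ == "__main__":
    print(f"total NCs: {total}")
    print(f"unambiguous NCs: {unambiguous}")
    print(f"always-left baseline: {left_baseline:.2%}")
```

The skew toward left bracketing means any method evaluated on this set has to be compared against a fairly high baseline.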
17 Experiments
  MSN Search statistics were used for the n-grams and the paraphrases (unless the pattern contained a "*")
    MSN always returned exact numbers
  Google was used for the surface features
  Google and Yahoo rounded their page hits, which generally leads to lower accuracy (Yahoo was better than Google for these estimates)
18 Tools Mentioned
  UMLS Specialist lexicon
    Provides the different spellings of biomedical terms
    http://www.nlm.nih.gov/pubs/factsheets/umlslex.html
  Carroll's morphological tools
    http://www.cogs.susx.ac.uk/lab/nlp/carroll/morph.html
19 UMLS Lexicon
  {base=AAA
  entry=E0000049
      cat=noun
      variants=metareg
      variants=uncount
      acronym_of=abdominal aortic aneurysmectomy|E0429482
      acronym_of=acne-associated arthritis|E0429483
      acronym_of=acquired aplastic anemia|E0429484
      acronym_of=acute anxiety attack|E0429485
      acronym_of=androgenic anabolic agent|E0429486
      acronym_of=aneurysm of ascending aorta
      acronym_of=aromatic amino acid|E0356310
      acronym_of=acute apical abscess|E0356309
      abbreviation_of=abdominal aortic aneurysm|E0006446
  }
  {base=AAMD
  spelling_variant=A.A.M.D.
  entry=E0000050
      cat=noun
      variants=group
      variants=uncount
      acronym_of=American Association on Mental Deficiency|E0000277
  }
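Records like the ones on this slide can be read with a small parser to collect spelling variants. The sketch below assumes one `key=value` field per line and records delimited by `{` and `}`; it is a hypothetical helper written for this illustration, not an official UMLS tool, and it handles only the fields shown here.

```python
from collections import defaultdict


def parse_lexicon(text: str) -> list:
    """Parse brace-delimited key=value records into dicts of value lists."""
    records = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("{"):
            current = defaultdict(list)   # start a new record
            line = line[1:].strip()
        closes = line.endswith("}")
        if closes:
            line = line[:-1].strip()
        if line and current is not None and "=" in line:
            key, value = line.split("=", 1)
            current[key].append(value)    # keys such as acronym_of may repeat
        if closes and current is not None:
            records.append(dict(current))
            current = None
    return records


def spellings(record: dict) -> list:
    """Return the base form followed by any listed spelling variants."""
    return record.get("base", []) + record.get("spelling_variant", [])


SAMPLE = """\
{base=AAMD
spelling_variant=A.A.M.D.
entry=E0000050
    cat=noun
    acronym_of=American Association on Mental Deficiency|E0000277
}
"""

if __name__ == "__main__":
    for rec in parse_lexicon(SAMPLE):
        print(spellings(rec))
```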
23 Conclusions and Future Work
  Improved upon the state-of-the-art approaches to NC bracketing
  Future work includes:
    testing on NCs of more than 3 words
    recognizing the ambiguous cases
    including determiners and modifiers
    applying the method to other NLP problems
    refining parser output
      Parsers typically assume right bracketing