1
A Study of Using Search Engine Page Hits as a Proxy for n-gram Frequencies
Preslav Nakov and Marti Hearst
Computer Science Division and SIMS, University of California, Berkeley
Supported by NSF DBI-0317510 and a gift from Genentech
2
Overview
- Web as a corpus: n-gram frequencies
- Concern: instability of the n-gram estimates
- Study the impact of the variability of the n-gram estimates for a particular task:
  - across time
  - for different search engines
  - (not) using a language filter
  - (not) using inflections
3
Introduction (Banko & Brill, 2001)
- "Scaling to Very Very Large Corpora for Natural Language Disambiguation", ACL 2001
- Simple task: choose from a set of commonly confused words for a given context, e.g. {principle, principal}
- Data comes for free: assumes correct usage in the raw training text.
- Log-linear improvement, even up to a billion words => getting more data is better than fine-tuning algorithms.
- Today the obvious source of very large data is the Web.
4
Web as a Corpus
- Machine Translation (Grefenstette 98; Resnik 99; Cao & Li 02; Way & Gough 03)
- Question Answering (Dumais et al. 02; Soricut & Brill 04)
- Word Sense Disambiguation (Mihalcea & Moldovan 99; Rigau et al. 02; Santamaría et al. 03; Zahariev 04)
- Extraction of Semantic Relations (Chklovski & Pantel 04; Szpektor et al. 04; Shinzato & Torisawa 04)
- Anaphora Resolution (Modjeska et al. 03)
- Prepositional Phrase Attachment (Volk 01; Calvo & Gelbukh 03; Nakov & Hearst 05)
- Language Modeling (Zhu & Rosenfeld 01; Keller & Lapata 03)
5
Page Hits as a Proxy for n-gram Frequencies
- Plausibility: (Keller & Lapata 03) demonstrate a high correlation between:
  - page hits and corpus bigram frequencies
  - page hits and human plausibility judgments
- Web as a baseline (Lapata & Keller 05): machine translation candidate selection, spelling correction, adjective ordering, article generation, noun compound bracketing, noun compound interpretation, countability detection and prepositional phrase attachment
- More than a baseline: state-of-the-art results for noun compound bracketing (Nakov & Hearst 05)
6
Web Count Problems (1)
- Page hits are not really n-gram frequencies
  - this may be OK (Keller & Lapata, 2003)
- The Web lacks linguistic annotation; we cannot handle queries like:
  - stem cells VERB PREPOSITION brain
  - protein synthesis' inhibition
- Pr(health|care) = #("health care") / #(care)
  - health is a noun, but care is both a verb and a noun
  - the two words can be adjacent by chance
  - the hits can come from different sentences
7
Web Count Problems (2)
- Instability of the n-gram counts:
  - dynamics over time
  - query inconsistencies
  - indexes spread across multiple machines
  - multiple (inconsistent) index copies
  - search engine "dancing"; tool at: http://www.seochat.com/googledance
- Problem: Web experiments are not reproducible.
8
Web Count Problems (3)
- Rounding of page hits
- Exact estimates:
  - MSN: always
  - Google and Yahoo: for small numbers only
- Possible reasons for rounding:
  - not necessary for typical users
  - expensive to compute: distributed index, constant changes
- Under high loads, search engines probably sample from their indexes.
9
The Task
- Problem: what is the impact of n-gram variability (inconsistencies, rounding, etc.)?
- Approach: NOT absolute n-gram variability, BUT experiments with respect to a real task:
  - noun compound bracketing
  - allows for the use of n-grams of different lengths
10
Our Particular Task: Noun Compound Bracketing
11
(a) [ [ liver cell ] antibody ]  (left bracketing)
(b) [ liver [ cell line ] ]  (right bracketing)
In (a), the antibody targets the liver cell. In (b), the cell line is derived from the liver.
12
Related Work
- Marcus (1980), Pustejovsky et al. (1993), Resnik (1993): adjacency model, Pr(w1|w2) vs. Pr(w2|w3)
- Lauer (1995): dependency model, Pr(w1|w2) vs. Pr(w1|w3), where Pr(w1|w2) is the probability that w1 precedes w2
- Keller & Lapata (2004): use the Web; unigrams and bigrams
- Girju et al. (2005): supervised model; bracketing in context; requires WordNet senses to be given
13
Adjacency & Dependency (1)
- right bracketing: [w1 [w2 w3]]
  - w2 w3 is a compound (modified by w1): home health care
  - or w1 and w2 independently modify w3: adult male rat
- left bracketing: [[w1 w2] w3]
  - only one modificational choice is possible: law enforcement officer
14
Adjacency & Dependency (2)
- right bracketing: [w1 [w2 w3]]
  - w2 w3 is a compound (modified by w1)
  - or w1 and w2 independently modify w3
- adjacency model: is w2 w3 a compound? (vs. w1 w2 being a compound)
- dependency model: does w1 modify w3? (vs. w1 modifying w2)
15
Frequencies
- Adjacency model: compare #(w1, w2) to #(w2, w3)
- Dependency model: compare #(w1, w2) to #(w1, w3)
- Here #(w1, w2) is the frequency of the bigram w1 w2; a larger #(w1, w2) votes for left bracketing, a larger competing count votes for right.
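A minimal sketch of this comparison (not the authors' code; `count` is an assumed bigram-frequency lookup, e.g. backed by page hits):

```python
def bracket_by_frequency(count, w1, w2, w3, model="dependency"):
    """Return 'left' or 'right' for the noun compound 'w1 w2 w3'.

    count(a, b) is a hypothetical lookup returning the bigram
    frequency #(a, b), e.g. a page-hit count for the exact phrase.
    """
    left_score = count(w1, w2)        # evidence for [[w1 w2] w3]
    if model == "adjacency":
        right_score = count(w2, w3)   # is w2 w3 a compound?
    else:
        right_score = count(w1, w3)   # does w1 modify w3?
    return "left" if left_score > right_score else "right"
```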
16
Probabilities
- Adjacency model: compare Pr(w1→w2 | w2) to Pr(w2→w3 | w3)
- Dependency model: compare Pr(w1→w2 | w2) to Pr(w1→w3 | w3)
- Here Pr(w1→w2 | w2) denotes the probability that w1 modifies w2.
17
Probabilities: Dependency
- Dependency model:
  - Pr(left) = Pr(w1→w2 | w2) × Pr(w2→w3 | w3)
  - Pr(right) = Pr(w1→w3 | w3) × Pr(w2→w3 | w3)
- The common factor Pr(w2→w3 | w3) cancels, so we compare Pr(w1→w2 | w2) to Pr(w1→w3 | w3).
- BUT! There is no such cancellation in Lauer's model.
18
Probabilities: Estimation
- Using page hits as a proxy for n-gram counts:
- Pr(w1→w2 | w2) = #(w1, w2) / #(w2)
  - #(w2): word frequency; query for "w2"
  - #(w1, w2): bigram frequency; query for "w1 w2"
  - counts are smoothed by adding 0.5
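A sketch of this estimation step, assuming a hypothetical `hits(q)` function that returns the page-hit count for an exact-phrase query q:

```python
def pr_modifies(hits, w1, w2):
    """Estimate Pr(w1 -> w2 | w2), the probability that w1 modifies w2.

    hits(q) is an assumed page-hit lookup for the exact-phrase query q;
    both counts are smoothed by adding 0.5, as on the slide, so zero
    hits never yield a zero or undefined estimate.
    """
    bigram = hits(f'"{w1} {w2}"') + 0.5   # #(w1, w2)
    unigram = hits(f'"{w2}"') + 0.5       # #(w2)
    return bigram / unigram
```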
19
Probabilities: Why?
- Why should we use (a) Pr(w1→w2 | w2) rather than (b) Pr(w2→w1 | w1)?
- Keller & Lapata (2004) calculate:
  - AltaVista queries: (a) 70.49%, (b) 68.85%
  - British National Corpus: (a) 63.11%, (b) 65.57%
20
Probabilities: Why? (2)
- Why should we use (a) Pr(w1→w2 | w2) rather than (b) Pr(w2→w1 | w1)?
- Maybe to introduce a bracketing prior, just as Lauer (1995) did.
- But otherwise there is no reason to prefer either one.
- Do we need probabilities at all? (an association score is OK)
- Do we need a directed model? (a symmetric one is OK)
21
Association Models: χ² (Chi Squared)
- A = #(wi, wj)
- B = #(wi) − #(wi, wj)
- C = #(wj) − #(wi, wj)
- D = N − (A + B + C)
- N = 8 trillion (= A + B + C + D): 8 billion Web pages × 1,000 words per page
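The slide defines the contingency cells but not the score itself; the sketch below uses the standard closed form of χ² for a 2×2 table, with `hits` again an assumed page-hit lookup:

```python
N = 8_000_000_000 * 1_000   # 8 billion pages x 1,000 words = 8 trillion

def chi_squared(hits, w1, w2, n=N):
    """Chi-squared association score for the pair (w1, w2).

    hits(q) is an assumed page-hit lookup. Builds the 2x2 contingency
    table from the slide's cells and applies the standard closed form:
    chi2 = N (AD - BC)^2 / ((A+B)(C+D)(A+C)(B+D)).
    """
    a = hits(f'"{w1} {w2}"')   # A = #(w1, w2)
    b = hits(f'"{w1}"') - a    # B = #(w1) - #(w1, w2)
    c = hits(f'"{w2}"') - a    # C = #(w2) - #(w1, w2)
    d = n - (a + b + c)        # D = N - (A + B + C)
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
```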
22
Web-derived Surface Features: Possessive Marker
- Attached to the first word: brain's stem cell → right
- Attached to the second word: brain stem's cell → left
- We can query directly for possessives: search engines drop the apostrophe, but the s is kept.
- Still, we cannot query for "brain stems' cell".
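A small illustration of how such queries could be built (a hypothetical helper, not from the paper):

```python
def possessive_queries(w1, w2, w3):
    """Build the two exact-phrase possessive queries for 'w1 w2 w3'.

    Per the slide: the marker on the first word ("brain's stem cell")
    supports right bracketing; on the second ("brain stem's cell"), left.
    """
    return {
        "right": f"\"{w1}'s {w2} {w3}\"",
        "left": f"\"{w1} {w2}'s {w3}\"",
    }

# possessive_queries("brain", "stem", "cell")
# -> {'right': '"brain\'s stem cell"', 'left': '"brain stem\'s cell"'}
```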
23
Other Web-derived Features: Abbreviation
- After the second word: tumor necrosis (TN) factor → left
- After the third word: tumor necrosis factor (NF) → right
- We query for, e.g., "tumor necrosis tn factor"
- Problems:
  - Roman numerals: IV, vii
  - US state abbreviations: CA
  - short words: me
24
Other Web-derived Features: Concatenation
- Consider health care reform:
  - healthcare: 79,500,000
  - carereform: 269
  - healthreform: 812
- Adjacency model: healthcare vs. carereform
- Dependency model: healthcare vs. healthreform
- Triples (adjacency): "healthcare reform" vs. "health carereform"
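A sketch of the concatenation comparison under the same assumed `hits` lookup:

```python
def bracket_by_concatenation(hits, w1, w2, w3, model="dependency"):
    """Vote on the bracketing by comparing concatenated-form hit counts.

    hits(q) is an assumed page-hit lookup. For "health care reform" the
    slide reports healthcare (79,500,000) >> healthreform (812), so the
    dependency comparison votes left.
    """
    left = hits(w1 + w2)          # e.g. "healthcare"
    if model == "adjacency":
        right = hits(w2 + w3)     # e.g. "carereform"
    else:
        right = hits(w1 + w3)     # e.g. "healthreform"
    return "left" if left > right else "right"
```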
25
Other Web-derived Features: Reorder
- Reorders for "health care reform":
  - "care reform health" → right
  - "reform health care" → left
26
Other Web-derived Features: Internal Inflection Variability
- First word varies: bone mineral density vs. bones mineral density → right
- Second word varies: bone mineral density vs. bone minerals density → left
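A sketch of the corresponding queries; `pluralize` is a naive stand-in (the authors used Carroll's morphological tools), and the right/left mapping follows the slide's label order as reconstructed above:

```python
def inflection_queries(w1, w2, w3, pluralize=lambda w: w + "s"):
    """Exact-phrase queries probing internal inflection variability.

    pluralize is a naive placeholder for a real morphological tool.
    Variability of the first word ("bones mineral density") is taken as
    evidence for right bracketing; of the second word
    ("bone minerals density"), for left bracketing.
    """
    return {
        "right": f'"{pluralize(w1)} {w2} {w3}"',
        "left": f'"{w1} {pluralize(w2)} {w3}"',
    }
```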
27
Experiments
28
Lauer (95) Dataset
- 244 noun compounds (NCs) from Grolier's encyclopedia
- inter-annotator agreement: 81.5%
- exact-phrase queries (min. frequency 5)
- inflections: Carroll's morphological tools
29
Experiments
4 dimensions:
- time
- search engine
- language filter
- inflected forms
30
Comparison over time (P): Google
Precision (in %) for any language, no inflections; average recall is shown in parentheses.
Varying time intervals were used, in case index changes happen periodically.
31
Comparison over time (P): MSN
Precision (in %) for any language, no inflections; average recall is shown in parentheses.
Statistically significant.
32
Experiments
4 dimensions:
- time
- search engine
- language filter
- inflected forms
33
Comparison by search engine (P), for 6/6/2005
Precision (in %) for any language, no inflections; average recall is shown in parentheses.
Statistically significant.
34
Comparison by search engine (R), for 6/6/2005
Recall (in %) for any language, no inflections.
Not much variability in recall (but Google has the biggest index).
35
Experiments
4 dimensions:
- time
- search engine
- language filter
- inflected forms
36
Comparison by language (P): any language vs. English
Precision (in %), no inflections, for 6/6/2005.
Minor, inconsistent impact on precision.
37
Comparison by language (R): any language vs. English
Recall (in %), no inflections, for 6/6/2005.
Minor, consistent drop in recall.
38
Experiments
4 dimensions:
- time
- search engine
- language filter
- inflected forms
39
Comparison by search engine (P): inflections
Precision (in %), any language, for 6/6/2005.
Minor, inconsistent impact on precision.
40
Comparison by search engine (R): inflections
Recall (in %), any language, for 6/6/2005.
Minor, consistent improvement in recall.
41
Conclusions and Future Work
- Good news: n-gram variability does not have a statistically significant impact on performance (for our task).
- Future work: other NLP tasks; other languages.
42
The End Thank you!