1 Fuchun Peng Microsoft Bing 7/23/2010
2 A query is often treated as a bag of words. But when people formulate queries, they use "concepts" as building blocks. Q: simmons college sports psychology. A1: "simmons college", "sports psychology" (a sports psychology course). A2: "college sports", a wrong grouping. Can we automatically segment the query to recover the concepts?
3 Summary of Segmentation approaches Use for Improving Search Relevance ◦ Query rewriting ◦ Ranking features Conclusions
4 Supervised learning (Bergsma et al., EMNLP-CoNLL 2007) ◦ Binary decision (Y/N) at each possible segmentation point between adjacent words w1 w2 w3 w4 w5 ◦ Features: POS tags, web counts, function words (the, and, …) Problems: ◦ Limited-range context ◦ Features specifically designed for noun phrases
Manual Data Preparation ◦ Linguistically driven: [san jose international airport] ◦ Relevance driven: [san jose] [international airport] 5
6 Mutual information between adjacent words: MI(w1, w2) = P(w1 w2) / (P(w1) P(w2)). Compute MI at each candidate point (1,2), (2,3), (3,4), (4,5) in w1 w2 w3 w4 w5; insert a segment boundary where MI falls below a threshold, e.g. w1 w2 | w3 w4 w5. Iterative update. Problems: ◦ Only captures short-range correlation (between adjacent words) ◦ What about "my heart will go on"?
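The MI approach above can be sketched in a few lines. This is a minimal illustration, not the referenced system: the probability tables `unigram_p` and `bigram_p` and the threshold value are assumptions for demonstration.

```python
def mi(w1, w2, unigram_p, bigram_p):
    """Association score MI(w1, w2) = P(w1 w2) / (P(w1) * P(w2))."""
    return bigram_p.get((w1, w2), 1e-12) / (unigram_p[w1] * unigram_p[w2])

def segment_by_mi(words, unigram_p, bigram_p, threshold=1.0):
    """Insert a boundary between adjacent words whose MI falls below the threshold."""
    segments, current = [], [words[0]]
    for w1, w2 in zip(words, words[1:]):
        if mi(w1, w2, unigram_p, bigram_p) < threshold:
            segments.append(current)   # low association: start a new segment
            current = [w2]
        else:
            current.append(w2)         # high association: extend the segment
    segments.append(current)
    return segments
```

With toy probabilities where "new york" is a strong bigram, `segment_by_mi(["new", "york", "pizza"], ...)` yields `[["new", "york"], ["pizza"]]`; note the boundary decision sees only the two adjacent words, which is exactly the short-range limitation the slide points out.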
8 Assume the query is generated by independent sampling from a probability distribution over concepts (a unigram model of concepts). For simmons college sports psychology: segmentation A1 gives P = P(simmons college) × P(sports psychology); segmentation A2 gives P = P(simmons) × P(college sports) × P(psychology); here P(A1) > P(A2). Enumerate all possible segmentations; rank by the probability of being generated by the unigram model. How to estimate the parameters P(w) of the unigram model?
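The enumerate-and-rank step above can be sketched directly. This is an illustrative implementation with a hypothetical concept probability table `concept_p` and an assumed small floor probability for unseen concepts; the real system's estimates come from the web corpus described next.

```python
from itertools import combinations

def all_segmentations(words):
    """Enumerate every way to split the word list into contiguous segments."""
    n = len(words)
    for k in range(n):                      # k = number of internal boundaries
        for cuts in combinations(range(1, n), k):
            bounds = [0, *cuts, n]
            yield [tuple(words[i:j]) for i, j in zip(bounds, bounds[1:])]

def best_segmentation(words, concept_p):
    """Rank segmentations by the product of their concept probabilities."""
    def score(seg):
        p = 1.0
        for segment in seg:
            p *= concept_p.get(segment, 1e-9)  # unseen concepts get a small floor
        return p
    return max(all_segmentations(words), key=score)
```

With toy probabilities favoring "simmons college" and "sports psychology" as concepts, the four-word query from the slide comes out segmented as A1. Exhaustive enumeration is 2^(n-1) segmentations, which is fine for short queries.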
9 We have ngram (n = 1..5) counts in a web corpus ◦ 464M documents; N = 33B tokens ◦ Approximate counts for longer ngrams are often computable, e.g. #(harry potter and the goblet of fire) is in [5783, 6399] ◦ Lower bound: #(ABC) = #(AB) + #(BC) − #(AB ∨ BC) ≥ #(AB) + #(BC) − #(B), applied recursively; solved by dynamic programming
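One way the bounding recurrence above can be turned into a dynamic program is sketched below. This is an assumed reconstruction, not the authors' code: it pairs the stated lower bound with the trivial upper bound #(ABC) ≤ min(#(AB), #(BC)), memoizing both over all spans, given exact counts for ngrams up to `max_n`.

```python
from functools import lru_cache

def ngram_bounds(words, count, max_n=5):
    """Bound the count of an ngram longer than max_n from shorter-ngram counts.

    Uses #(ABC) >= #(AB) + #(BC) - #(B) for the lower bound and
    #(ABC) <= min(#(AB), #(BC)) for the upper bound, where the two
    overlapping sub-spans share the middle B; applied recursively.
    """
    words = tuple(words)

    @lru_cache(maxsize=None)
    def ub(i, j):                       # upper bound on count of words[i:j]
        if j - i <= max_n:
            return count[words[i:j]]    # exact count available
        return min(ub(i, j - 1), ub(i + 1, j))

    @lru_cache(maxsize=None)
    def lb(i, j):                       # lower bound on count of words[i:j]
        if j - i <= max_n:
            return count[words[i:j]]
        return max(0, lb(i, j - 1) + lb(i + 1, j) - ub(i + 1, j - 1))

    return lb(0, len(words)), ub(0, len(words))
```

For example, with exact counts only up to bigrams, #(a b) = 5, #(b c) = 4, #(b) = 6 gives the interval [3, 4] for the trigram "a b c".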
10 Maximum likelihood estimate: P_MLE(t) = #(t) / N. Problem: ◦ #(potter and the goblet of) = 6765, so P_MLE(potter and the goblet of) > P_MLE(harry potter and the goblet of fire). Wrong! ◦ What we need is not the probability of seeing t in text, but the probability of seeing t as a self-contained concept in text
11 Query-relevant web corpus. Choose the parameters that maximize the posterior probability given the query-relevant corpus (equivalently, minimize the total description length). Notation: ◦ t: a query substring ◦ C(t): longest matching count of t ◦ D = {(t, C(t))}: query-relevant corpus ◦ s(t): a segmentation of t ◦ θ: unigram model parameters (ngram probabilities). θ = argmax P(D|θ) P(θ) = argmax [log P(D|θ) + log P(θ)] (posterior probability = description length of corpus + description length of parameters), where log P(D|θ) = Σ_t C(t) log P(t|θ) and P(t|θ) = Σ_{s(t)} P(s(t)|θ). (Table: ngrams harry, harry potter, harry potter and, …, harry potter and the goblet of fire, each with its longest matching count vs. raw frequency.)
13 Three human-segmented data sets ◦ Training, validation, and test sets, 500 queries each ◦ Each query segmented by three editors: A, B, C
14 Evaluation metrics: ◦ Boundary classification accuracy: fraction of the candidate split points (Y/N decisions between adjacent words) classified correctly ◦ Whole query accuracy: percentage of queries with perfect boundary classification ◦ Segment accuracy: percentage of segments recovered exactly, e.g. truth [abc] [de] [fg] vs. prediction [abc] [de fg]: only [abc] is recovered
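The metrics above are straightforward to compute; here is a small sketch (function names are my own, and the segment metric is shown as precision/recall over exactly recovered segments).

```python
def boundaries(segmentation):
    """Word offsets where internal segment boundaries fall."""
    ends, pos = set(), 0
    for seg in segmentation[:-1]:
        pos += len(seg)
        ends.add(pos)
    return ends

def boundary_accuracy(truth, pred):
    """Fraction of the n-1 candidate split points classified correctly."""
    n = sum(len(s) for s in truth)
    t, p = boundaries(truth), boundaries(pred)
    correct = sum(1 for i in range(1, n) if (i in t) == (i in p))
    return correct / (n - 1)

def segment_precision_recall(truth, pred):
    """A segment counts as recovered only when both of its boundaries match."""
    t, p = set(map(tuple, truth)), set(map(tuple, pred))
    hit = len(t & p)
    return hit / len(p), hit / len(t)
```

On the slide's example (truth [abc] [de] [fg], prediction [abc] [de fg]): 5 of 6 split points are correct, segment precision is 1/2 and recall is 1/3, and whole query accuracy is 0 since one split point is wrong.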
17 Summary of Segmentation approaches Use for Improving Search Relevance ◦ Query rewriting ◦ Ranking features Conclusions
Phrase Proximity Boosting Phrase Level Query Expansion 18
Classifying a segment into one of three categories ◦ Strong concept: no word reordering, no word insertion/deletion Treat the whole segment as a single unit in matching and ranking ◦ Weak concept: allow word reordering or deletion/insertion Boost documents matching the weak concepts ◦ Not a concept Do nothing 19
Concept based BM25 ◦ Weighted by the confidence of concepts Concept based min coverage ◦ Weighted by the confidence of concepts 20
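One plausible form of the concept-based BM25 feature above is sketched below: standard BM25 computed over whole concepts (exact phrase matches) rather than words, with each concept's contribution scaled by the segmenter's confidence. The function signature and parameter names are illustrative assumptions, not the production feature.

```python
import math

def concept_bm25(concepts, doc_tf, doc_len, avg_len, n_docs, df, k1=1.2, b=0.75):
    """BM25 over concepts instead of words, weighted by segmentation confidence.

    concepts: list of (concept, confidence) pairs from the segmenter
    doc_tf:   occurrences of each concept (exact phrase match) in the document
    df:       number of documents containing each concept
    """
    score = 0.0
    for concept, confidence in concepts:
        tf = doc_tf.get(concept, 0)
        if tf == 0:
            continue
        idf = math.log((n_docs - df[concept] + 0.5) / (df[concept] + 0.5) + 1)
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
        score += confidence * idf * norm   # confidence scales the concept's term
    return score
```

A low-confidence segment thus contributes proportionally less, so a bad segmentation degrades gracefully toward the bag-of-words baseline rather than hard-failing.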
Phrase level replacement ◦ [San Francisco] -> [sf] ◦ [red eye flight] -> [late night flight] 21
Significant relevance boosting ◦ Affects 40% of query traffic ◦ Significant DCG gain (1.5% on affected queries) ◦ Significant online CTR gain (0.5% overall) 22
23 Summary of Segmentation approaches Use for Improving Search Relevance ◦ Query rewriting ◦ Ranking features Conclusions
Data preparation is important for query segmentation Phrases are important for improving relevance 24
Bergsma et al., EMNLP-CoNLL 2007 ◦ Risvik et al., WWW 2003 ◦ Hagen et al., SIGIR 2010 ◦ Tan & Peng, WWW 2008
27 Solution 1: offline, segment the web corpus, then collect counts for ngrams that are segments: … | Harry Potter and the Goblet of Fire | is | the | fourth | novel | in | the | Harry Potter series | written by | J.K. Rowling | … so harry potter and the goblet of fire += 1 while potter and the goblet of += 0. Technical difficulties; see C. G. de Marcken, Unsupervised Language Acquisition, 1996; Fuchun Peng, Self-supervised Chinese Word Segmentation, IDA 2001
28 Solution 2: online computation: only consider the parts of the web corpus overlapping with the query (longest matches). Q = harry potter and the goblet of fire; corpus: … Harry Potter and the Goblet of Fire is the fourth novel in the Harry Potter series written by J.K. Rowling … Counts: harry potter and the goblet of fire += 1, the += 2, harry potter += 1
30 Solution 2: online computation: only consider the parts of the web corpus overlapping with the query (longest matches). Q = potter and the goblet; the same corpus sentence gives potter and the goblet += 1, the += 2, potter += 1. The longest matching counts can be computed directly from raw ngram frequencies in O(|Q|²) time.
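One way to reach the stated O(|Q|²) is inclusion-exclusion over the two possible one-word extensions of each query substring: an occurrence of t = Q[i:j] counts toward C(t) only if it is not preceded by Q[i-1] and not followed by Q[j]. This is my assumed reconstruction of the mechanism, not the slides' code; `count` is a hypothetical raw-frequency lookup.

```python
def longest_matching_counts(query_words, count):
    """Longest-matching count C(t) for every substring t of the query.

    An occurrence of t counts only if it cannot be extended (within the
    query) on either side; inclusion-exclusion over the two extensions
    gives C(t) in O(1) per substring, O(|Q|^2) overall.
    """
    q = tuple(query_words)
    n = len(q)

    def c(i, j):                        # raw frequency, 0 outside the query
        if i < 0 or j > n:
            return 0
        return count.get(q[i:j], 0)

    result = {}
    for i in range(n):
        for j in range(i + 1, n + 1):
            result[q[i:j]] = (c(i, j) - c(i - 1, j)
                              - c(i, j + 1) + c(i - 1, j + 1))
    return result
```

For Q = "the goblet" with raw counts #(the) = 10, #(goblet) = 4, #(the goblet) = 3, the occurrences of "the" inside the longer match are subtracted: C(the) = 7, C(goblet) = 1, C(the goblet) = 3.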