
1 Fuchun Peng Microsoft Bing 7/23/2010

2  Query is often treated as a bag of words  But when people are formulating queries, they use “concepts” as building blocks
Q: simmons college sports psychology
A1: “simmons college”, “sports psychology” (simmons college’s sports psychology)
A2: “college sports” (a sports psychology course)
Can we automatically segment the query to recover the concepts?

3
 Summary of Segmentation approaches
 Use for Improving Search Relevance
◦ Query rewriting
◦ Ranking features
 Conclusions

4  Supervised learning (Bergsma et al., EMNLP-CoNLL 07)
◦ Binary decision at each possible segmentation point
◦ Features: POS, web counts, the, and, …
[diagram: w1 w2 w3 w4 w5, with boundary decisions N N Y Y between adjacent words]
Problem:
– Limited-range context
– Features specifically designed for noun phrases

5  Manual Data Preparation
◦ Linguistic driven: [San jose international airport]
◦ Relevance driven: [San jose] [international airport]

6
[diagram: w1 w2 w3 w4 w5, with MI(1,2), MI(2,3), MI(3,4), MI(4,5) compared against a threshold]
MI(w1, w2) = P(w1 w2) / (P(w1) P(w2))
If the MI of an adjacent pair falls below the threshold, insert a segment boundary: w1 w2 | w3 w4 w5
Iterative update
Problem:
– only captures short-range correlation (between adjacent words)
– What about “my heart will go on”?
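The MI criterion on this slide can be sketched as follows; the word/pair counts, corpus size, and threshold below are made-up toy values for illustration, not the talk’s real web statistics.

```python
# Sketch of MI-based segmentation: compute MI for each adjacent word pair
# and insert a boundary wherever MI falls below a threshold.
# All counts and the threshold here are illustrative toy values.

def mi(w1, w2, word_counts, pair_counts, total):
    # MI(w1, w2) = P(w1 w2) / (P(w1) P(w2))
    p12 = pair_counts.get((w1, w2), 0) / total
    p1 = word_counts[w1] / total
    p2 = word_counts[w2] / total
    return p12 / (p1 * p2)

def segment_by_mi(query, word_counts, pair_counts, total, threshold=1.0):
    words = query.split()
    segments, current = [], [words[0]]
    for w1, w2 in zip(words, words[1:]):
        if mi(w1, w2, word_counts, pair_counts, total) < threshold:
            segments.append(current)  # low MI -> boundary between w1 and w2
            current = [w2]
        else:
            current.append(w2)
    segments.append(current)
    return segments

# Toy counts in which "simmons college" and "sports psychology" co-occur
# far more often than chance, while "college sports" never co-occurs.
word_counts = {"simmons": 50, "college": 400, "sports": 300, "psychology": 100}
pair_counts = {("simmons", "college"): 40, ("sports", "psychology"): 60}
total = 1_000_000
print(segment_by_mi("simmons college sports psychology",
                    word_counts, pair_counts, total))
```

Because the decision only looks at adjacent pairs, a long concept like “my heart will go on” cannot be recovered this way, which is exactly the slide’s objection.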

7

8  Assume the query is generated by independent sampling from a probability distribution of concepts (unigram model):
[simmons college] [sports psychology]: P = P(simmons college) × P(sports psychology)
[simmons] [college sports] [psychology]: P = P(simmons) × P(college sports) × P(psychology)
P([simmons college] [sports psychology]) > P([simmons] [college sports] [psychology])
Enumerate all possible segmentations; rank by probability of being generated by the unigram model.
How to estimate parameters P(w) for the unigram model?
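The enumerate-and-rank step can be sketched as below; the concept probabilities in `P` are made-up stand-ins for a learned unigram model (unseen concepts get probability 0 here, where a real system would smooth).

```python
from math import prod

# Illustrative concept probabilities -- NOT the talk's learned model.
P = {"simmons college": 1e-6, "sports psychology": 2e-6,
     "simmons": 1e-5, "college": 3e-5, "sports": 4e-5,
     "psychology": 2e-5, "college sports": 1.5e-6}

def segmentations(words):
    """Enumerate every way to split a word list into contiguous segments."""
    if not words:
        yield []
        return
    for i in range(1, len(words) + 1):
        head = " ".join(words[:i])
        for rest in segmentations(words[i:]):
            yield [head] + rest

def rank(query):
    """Score each segmentation as a product of independent concept
    probabilities and sort best-first."""
    cands = []
    for seg in segmentations(query.split()):
        p = prod(P.get(s, 0.0) for s in seg)  # independence assumption
        cands.append((p, seg))
    return sorted(cands, reverse=True)

best_p, best_seg = rank("simmons college sports psychology")[0]
print(best_seg)
```

A query of n words has 2^(n-1) segmentations, so exhaustive enumeration is only reasonable for short queries; dynamic programming gives the same argmax efficiently.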

9  We have ngram (n = 1..5) counts in a web corpus
◦ 464M documents; L = 33B tokens
◦ Approximate counts for longer ngrams are often computable, e.g. #(harry potter and the goblet of fire) is in [5783, 6399]
 #(ABC) = #(AB) + #(BC) − #(AB OR BC) ≥ #(AB) + #(BC) − #(B)
 Solved by DP
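A minimal sketch of this bound-by-DP idea, under stated assumptions: a toy sentence stands in for the 464M-document corpus, and exact counts are pretended to exist only up to n = 3 (the talk has n up to 5). The lower bound applies #(ABC) ≥ #(AB) + #(BC) − #(B) recursively over the best overlapping split; the upper bound is the smallest contained short-ngram count.

```python
from functools import lru_cache

# Toy stand-in for the web corpus.
corpus = ("harry potter and the goblet of fire is the fourth novel "
          "in the harry potter series").split()
MAX_N = 3  # pretend exact counts exist only for n-grams up to length 3

def raw_count(ngram):
    n = len(ngram)
    return sum(tuple(corpus[i:i + n]) == ngram
               for i in range(len(corpus) - n + 1))

@lru_cache(maxsize=None)
def lower_bound(ngram):
    """#(ABC) >= #(AB) + #(BC) - #(B): bound a long n-gram from two
    overlapping shorter pieces, taking the best split (DP via memoization)."""
    if len(ngram) <= MAX_N:
        return raw_count(ngram)
    best = 0
    for k in range(1, len(ngram)):          # suffix BC = ngram[k:]
        for l in range(k + 1, len(ngram)):  # prefix AB = ngram[:l]
            if l - k <= MAX_N:              # overlap B = ngram[k:l] is countable
                best = max(best,
                           lower_bound(ngram[:l]) + lower_bound(ngram[k:])
                           - raw_count(ngram[k:l]))
    return best

def upper_bound(ngram):
    # An n-gram occurs no more often than any short n-gram it contains.
    return min(raw_count(ngram[i:i + MAX_N])
               for i in range(len(ngram) - MAX_N + 1))

q = tuple("harry potter and the goblet of fire".split())
print([lower_bound(q), upper_bound(q)])
```

On the real web counts, the same machinery yields intervals like the slide’s [5783, 6399] for the 7-gram.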

10  Maximum Likelihood Estimate: P_MLE(t) = #(t) / N
 Problem:
◦ #(potter and the goblet of) = 6765
◦ P(potter and the goblet of) > P(harry potter and the goblet of fire)? Wrong!
◦ What we want is not the probability of seeing t in text, but the probability of seeing t as a self-contained concept in text

11  Estimate parameters from a query-relevant web corpus
Choose parameters to maximize the posterior probability given the query-relevant corpus (equivalently, minimize the total description length):
t: a query substring
C(t): longest matching count of t
D = {(t, C(t))}: query-relevant corpus
s(t): a segmentation of t
θ: unigram model parameters (ngram probabilities)
θ = argmax P(D|θ) P(θ) = argmax [log P(D|θ) + log P(θ)]   (posterior prob.; the two terms are the DL of the corpus and the DL of the parameters)
log P(D|θ) = Σ_t C(t) log P(t|θ)
P(t|θ) = Σ_{s(t)} P(s(t)|θ)
[table: ngram | longest matching count | raw frequency, with rows harry, harry potter, harry potter and, harry potter and the, harry potter and the goblet, harry potter and the goblet of, harry potter and the goblet of fire, …, fire; counts not shown]
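A standard way to solve this argmax over latent segmentations is EM. The sketch below is a reconstruction on a toy corpus, not the talk’s implementation: it drops the prior P(θ), enumerates segmentations exhaustively, and uses made-up (t, C(t)) pairs.

```python
from math import prod

def segmentations(words):
    """All ways to split a word list into contiguous segments."""
    if not words:
        yield []
        return
    for i in range(1, len(words) + 1):
        for rest in segmentations(words[i:]):
            yield [" ".join(words[:i])] + rest

def em(D, iters=20):
    """D: {text t: longest-matching count C(t)}. Fit unigram concept
    probabilities theta to maximize sum_t C(t) log P(t|theta), where
    P(t|theta) sums over all segmentations s(t)."""
    vocab = {c for t in D for s in segmentations(t.split()) for c in s}
    theta = {c: 1.0 / len(vocab) for c in vocab}
    for _ in range(iters):
        expected = dict.fromkeys(vocab, 0.0)
        for t, C in D.items():
            segs = list(segmentations(t.split()))
            probs = [prod(theta[c] for c in s) for s in segs]
            Z = sum(probs)
            for s, p in zip(segs, probs):
                for c in s:
                    expected[c] += C * p / Z  # E-step: expected concept counts
        total = sum(expected.values())
        theta = {c: n / total for c, n in expected.items()}  # M-step
    return theta

# Made-up query-relevant corpus (t, C(t)) pairs.
D = {"harry potter": 10, "harry potter series": 4, "the harry potter": 3}
theta = em(D)
print(max(theta, key=theta.get))
```

On this toy data the mass concentrates on “harry potter” as a self-contained concept, which is the behavior the slide is after.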

12

13  Three human-segmented datasets
◦ Training, validation, and test sets, 500 queries each
 Segmented by three editors: A, B, C

14  Evaluation metrics:
◦ Boundary classification accuracy (a binary decision at each gap: w1 w2 w3 w4 w5 → N N Y Y)
◦ Whole query accuracy: the percentage of queries with perfect boundary classification
◦ Segment accuracy: the percentage of segments correctly recovered
 Truth: [abc] [de] [fg]
 Prediction: [abc] [de fg] → segment precision 1/2
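The three metrics can be computed as below; representing segments as (start, end) word spans makes “the same segment” position-aware. The truth/prediction pair is the slide’s [abc] [de] [fg] vs. [abc] [de fg] example.

```python
def spans(segs):
    """Segment spans as (start, end) word positions."""
    out, i = set(), 0
    for seg in segs:
        out.add((i, i + len(seg)))
        i += len(seg)
    return out

def evaluate(truths, preds):
    """truths/preds: parallel lists of segmentations (lists of word-lists)."""
    bnd_ok = bnd_all = whole = hit = n_true = n_pred = 0
    for t, p in zip(truths, preds):
        n = sum(len(s) for s in t)           # words in the query
        ts, ps = spans(t), spans(p)
        tb = {e for _, e in ts if e < n}     # boundary after word position e
        pb = {e for _, e in ps if e < n}
        # one binary decision per gap between adjacent words
        bnd_ok += sum((g in tb) == (g in pb) for g in range(1, n))
        bnd_all += n - 1
        whole += (tb == pb)                  # all boundaries correct
        hit += len(ts & ps)                  # exactly recovered segments
        n_true += len(ts)
        n_pred += len(ps)
    return {"boundary_acc": bnd_ok / bnd_all,
            "query_acc": whole / len(truths),
            "seg_precision": hit / n_pred,
            "seg_recall": hit / n_true}

truth = [[["a", "b", "c"], ["d", "e"], ["f", "g"]]]
pred = [[["a", "b", "c"], ["d", "e", "f", "g"]]]
print(evaluate(truth, pred))
```

For the slide’s example this gives boundary accuracy 5/6, whole query accuracy 0, segment precision 1/2, and segment recall 1/3.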

15

16

17
 Summary of Segmentation approaches
 Use for Improving Search Relevance
◦ Query rewriting
◦ Ranking features
 Conclusions

18
 Phrase Proximity Boosting
 Phrase Level Query Expansion

19  Classifying a segment into one of three categories
◦ Strong concept: no word reordering, no word insertion/deletion → treat the whole segment as a single unit in matching and ranking
◦ Weak concept: allow word reordering or insertion/deletion → boost documents matching the weak concepts
◦ Not a concept → do nothing

20
 Concept-based BM25
◦ Weighted by the confidence of concepts
 Concept-based min coverage
◦ Weighted by the confidence of concepts
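A hedged sketch of what concept-based BM25 might look like: query “terms” are segments rather than single words, and each segment’s BM25 contribution is scaled by the segmenter’s confidence in it. The matching scheme, parameter values, confidences, and documents are all illustrative assumptions, not the talk’s ranking feature.

```python
from math import log

def bm25_concept_score(concepts, doc, docs, k1=1.2, b=0.75):
    """Score a tokenized document against (phrase, confidence) query
    concepts. Each concept contributes a BM25 term weighted by its
    confidence; phrase matching is a naive token-sequence scan."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N

    def phrase_count(phrase, toks):
        p = phrase.split()
        return sum(toks[i:i + len(p)] == p
                   for i in range(len(toks) - len(p) + 1))

    score = 0.0
    for phrase, conf in concepts:
        df = sum(phrase_count(phrase, d) > 0 for d in docs)
        idf = log((N - df + 0.5) / (df + 0.5) + 1)  # Lucene-style idf
        tf = phrase_count(phrase, doc)
        score += (conf * idf * tf * (k1 + 1)
                  / (tf + k1 * (1 - b + b * len(doc) / avgdl)))
    return score

docs = ["simmons college sports psychology program".split(),
        "college sports news and scores".split(),
        "intro to psychology".split()]
concepts = [("simmons college", 0.9), ("sports psychology", 0.8)]
scores = [bm25_concept_score(concepts, d, docs) for d in docs]
print(scores.index(max(scores)))
```

Note how the second document matches every query *word* but neither *concept*, so it scores zero here, which is the point of phrase-level matching.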

21  Phrase-level replacement
◦ [San Francisco] → [sf]
◦ [red eye flight] → [late night flight]

22  Significant relevance boosting
◦ Affects 40% of query traffic
◦ Significant DCG gain (1.5% for affected queries)
◦ Significant online CTR gain (0.5% overall)

23
 Summary of Segmentation approaches
 Use for Improving Search Relevance
◦ Query rewriting
◦ Ranking features
 Conclusions

24
 Data preparation is important for query segmentation
 Phrases are important for improving relevance

 Bergsma et al., EMNLP-CoNLL 2007
 Risvik et al., WWW 2003
 Hagen et al., SIGIR 2010
 Tan & Peng, WWW

26

27  Solution 1: Offline, segment the web corpus, then collect counts for ngrams that occur as segments
... | Harry Potter and the Goblet of Fire | is | the | fourth | novel | in | the | Harry Potter series | written by | J.K. Rowling | ...
harry potter and the goblet of fire += 1
potter and the goblet of += 0
Technical difficulties: C. G. de Marcken, Unsupervised Language Acquisition, 1996; Fuchun Peng, Self-supervised Chinese Word Segmentation, IDA 2001

28  Solution 2: Online computation: only consider the parts of the web corpus overlapping with the query (longest matches)
Q = harry potter and the goblet of fire
... Harry Potter and the Goblet of Fire is the fourth novel in the Harry Potter series written by J.K. Rowling ...
harry potter and the goblet of fire += 1
the += 2
harry potter += 1

29

30  Solution 2: Online computation: only consider the parts of the web corpus overlapping with the query (longest matches)
Q = potter and the goblet
... Harry Potter and the Goblet of Fire is the fourth novel in the Harry Potter series written by J.K. Rowling ...
potter and the goblet += 1
the += 2
potter += 1
Directly compute longest matching counts using raw ngram frequency: O(|Q|^2)
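One way to reconstruct the O(|Q|^2) computation: a corpus occurrence of the substring q_i..q_j contributes to its longest-matching count only if the match cannot be extended with q_(i-1) on the left or q_(j+1) on the right, which is one inclusion-exclusion over raw ngram frequencies per query substring. This is a reconstruction from the slide, and the corpus below is a lowercased, punctuation-stripped toy version of its example sentence.

```python
def raw_count(tokens, corpus):
    """Raw frequency of a token sequence in the corpus."""
    n = len(tokens)
    return sum(corpus[i:i + n] == tokens
               for i in range(len(corpus) - n + 1))

def longest_matching_counts(query, corpus):
    """For every substring q[i:j] of the query, count corpus occurrences
    that are NOT part of a longer match with the query:
    C(i,j) = #(i,j) - #(i-1,j) - #(i,j+1) + #(i-1,j+1).
    One inclusion-exclusion per substring, O(|Q|^2) substrings overall."""
    q = query.split()
    n = len(q)

    def cnt(i, j):
        return raw_count(q[i:j], corpus) if 0 <= i < j <= n else 0

    C = {}
    for i in range(n):
        for j in range(i + 1, n + 1):
            C[" ".join(q[i:j])] = (cnt(i, j) - cnt(i - 1, j)
                                   - cnt(i, j + 1) + cnt(i - 1, j + 1))
    return C

corpus = ("harry potter and the goblet of fire is the fourth novel "
          "in the harry potter series written by j k rowling").split()
C = longest_matching_counts("potter and the goblet", corpus)
print(C["potter and the goblet"], C["the"], C["potter"])
```

On this toy corpus the updates match the slide: potter and the goblet += 1, the += 2, potter += 1 (the third “the” and the second “potter” are absorbed into longer matches).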