Effective Query Formulation with Multiple Information Sources

Presentation transcript:

Effective Query Formulation with Multiple Information Sources. Michael Bendersky (University of Massachusetts), Donald Metzler (Information Sciences Institute, USC), W. Bruce Croft (University of Massachusetts). WSDM 2012 Best Paper Runner-Up. Presented by Tom, March 14th, 2012. Good morning, everybody. Today my presentation covers the paper "Effective Query Formulation with Multiple Information Sources", the WSDM 2012 best paper runner-up. The first author, Michael Bendersky, and the third author, W. Bruce Croft, are from UMass, and the second author, Donald Metzler, is from USC.

Michael Bendersky, Donald Metzler, and their supervisor W. Bruce Croft. Before we look into the details of this paper, let me introduce the authors' social graph. Bruce Croft is an expert in the IR field, and both Michael and Donald are his students. Michael Bendersky is a PhD student; Donald graduated in 2007, first joined Yahoo! Research, and is now with USC.

A Markov Random Field Model for Term Dependencies (SIGIR 2005) → Learning Concept Importance Using a Weighted Dependence Model (WSDM 2010) → Parameterized Concept Weighting in Verbose Queries (SIGIR 2011, Honorable Mention Award) → Effective Query Formulation with Multiple Information Sources (WSDM 2012, Best Paper Runner-Up). When I read this WSDM 2012 paper, I found several preceding works by the same authors, so I traced back the history of this line of papers: they also published on similar topics at SIGIR 2011, WSDM 2010, and SIGIR 2005. Today I will try to summarize this body of work.

Outline: Query Formulation Process; Concept-Based Ranking (Concept Matching; Concept Weighting: Sequential Dependence [SIGIR 2005], Weighted Sequential Dependence [WSDM 2010], Parameterized Query Expansion [SIGIR 2011], Multiple Source Formulation [WSDM 2012]); Experiments; Discussion. This is the outline for today's presentation. First, I will introduce the idea of the query formulation process. Then, I will introduce concept-based ranking, specifically concept matching and concept weighting, presenting four methods for concept weighting. Then, I will show some experimental evaluations and conclude the presentation with a discussion.

Outline. Let's start with the query formulation process.

Query Formulation Process. When we type a keyword query into a search engine, the engine performs three steps to obtain results: first, query refinement; second, structured query formulation; and finally, ranking with scores.

Query Formulation Process: Query Refinement. Query refinement alters the query on the morphological level: tokenization, spelling correction, and stemming. Tokenization segments a character sequence into tokens; for example, we can segment 香港中文大学何善衡大楼 into 香港中文大学 (CUHK) || 何善衡 (Ho Sin-Hang) || 大楼 (Building). Spelling correction fixes misspellings, e.g., "Hong Kng" → "Hong Kong". Stemming reduces words to their root forms.
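As a toy illustration (not the paper's method), here is a minimal Python sketch of the three refinement steps, with a hypothetical correction table and a crude suffix-stripping rule standing in for a real speller and stemmer:

```python
SPELL_FIXES = {"kng": "kong"}                  # hypothetical correction table
VOCAB = {"hong", "kong", "tv", "show"}

def tokenize(query):
    """Whitespace tokenization; real engines also segment CJK text."""
    return query.lower().split()

def correct(tokens):
    """Replace known misspellings, e.g. 'kng' -> 'kong'."""
    return [SPELL_FIXES.get(t, t) for t in tokens]

def stem(tokens):
    """Crude plural stripping standing in for a real stemmer (e.g. Porter)."""
    return [t[:-1] if t.endswith("s") and t[:-1] in VOCAB else t for t in tokens]

print(stem(correct(tokenize("Hong Kng TV Shows"))))
# ['hong', 'kong', 'tv', 'show']
```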

Query Formulation Process: Structured Query Formulation. Structured query formulation consists of several steps. The first is concept identification: what are the atomic matching units in the query? The second is concept weighting: how important are the different concepts for conveying the query intent? The third is query expansion: what additional concepts should be associated with the query?

Structured Query Formulation: an example. Consider the query "ER TV Show" (ER is an American medical drama television series). Concept identification and weighting produce weighted concepts: er (0.297), tv (0.168), show (0.192), er tv (0.051), tv show (0.012). Query expansion adds related terms: season (0.085), episode (0.065), dr (0.051), drama (0.043), series (0.036). In other words, the engine first identifies atomic concepts (here, unigrams and bigrams), then calculates a weight for each concept, and finally expands the query using related concepts.
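To make the identification step concrete, here is a small Python sketch that extracts unigram and bigram concepts from the example query; the weights are simply the figures from the slide, hard-coded for illustration rather than computed:

```python
def identify_concepts(query):
    """Atomic concepts: unigrams plus adjacent-word bigrams."""
    terms = query.lower().split()
    bigrams = [f"{a} {b}" for a, b in zip(terms, terms[1:])]
    return terms + bigrams

print(identify_concepts("ER TV Show"))
# ['er', 'tv', 'show', 'er tv', 'tv show']

# Slide's example weights, hard-coded for illustration only:
weights = {"er": 0.297, "tv": 0.168, "show": 0.192,
           "er tv": 0.051, "tv show": 0.012}
```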

Query Formulation Process In this paper, we focus on “Structured Query Formulation”.

Outline. Now I will introduce concept-based ranking and, specifically, concept matching.

Concept-Based Ranking. This is the formula for concept-based ranking: sc(Q, D) = Σ_{κ ∈ Q} λ(κ) · f(κ, D). Here sc(·) is a function that calculates the score between a query Q and a document D by adding up contributions from each concept κ. Each contribution multiplies two scores: the concept weighting score λ(κ), which captures how important the concept is for conveying the user's intent, and the concept matching score f(κ, D), which measures how well the document matches the concept.
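A minimal sketch of this ranking function in Python, with stand-in weighting and matching functions (the real λ and f are defined in the following slides):

```python
def sc(concepts, doc, weight, match):
    """sc(Q, D): concept weight times concept match, summed over concepts."""
    return sum(weight(k) * match(k, doc) for k in concepts)

# Toy usage with stand-in weighting and matching functions.
doc = "er is an american medical drama tv show".split()
print(sc(["er", "tv", "show"], doc,
         weight=lambda k: {"er": 0.297, "tv": 0.168, "show": 0.192}[k],
         match=lambda k, d: d.count(k)))   # 0.657
```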

Concept Matching. Concept matching assigns a score to the matches of concept κ in document D. It is usually a monotonic function: the value increases with the number of times κ matches D. A standard approach is a language model score with Dirichlet smoothing: f(κ, D) = log[(tf_{κ,D} + µ · cf_κ / |C|) / (|D| + µ)], where tf_{κ,D} is the frequency of κ in D, cf_κ is its frequency in the collection C, and µ is a smoothing parameter.
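Assuming the standard Dirichlet-smoothed form and a typical smoothing value (µ = 2500 is a common default, not a figure from the slides), the matching function might look like this:

```python
import math

def lm_match(tf_kD, doc_len, cf_k, coll_len, mu=2500.0):
    """Dirichlet-smoothed LM matching score:
    f(k, D) = log((tf_{k,D} + mu * cf_k / |C|) / (|D| + mu)).
    Monotonic: it increases with the number of matches tf_{k,D}."""
    return math.log((tf_kD + mu * cf_k / coll_len) / (doc_len + mu))

# Example: a concept occurring 3 times in a 500-term document.
print(lm_match(tf_kD=3, doc_len=500, cf_k=1200, coll_len=10_000_000))
```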

Outline. Now I will introduce four different concept weighting methods.

Markov Random Field. The first method is based on Markov random fields: undirected graphical models that define a joint probability distribution over a set of random variables. Nodes represent random variables, and edges represent dependence. When we apply MRFs to IR, we consider a document random variable D and query term random variables Q.

Sequential Dependence Model. The first concept weighting model is the sequential dependence model, which places edges between adjacent query terms: adjacent query terms are assumed to be semantically dependent. (Figure: Markov random field model for three query terms under the sequential dependence assumption.)

Sequential Dependence Model. This is the formula for the sequential dependence model: sc(Q, D) = λ_T Σ_{q ∈ Q} f_T(q, D) + λ_O Σ_i f_O(q_i, q_{i+1}, D) + λ_U Σ_i f_U(q_i, q_{i+1}, D). The model has three parts. The first is the query term concept: an individual query word. The second is the phrase concept: adjacent query word pairs matched as exact phrases in the document. The third is the proximity concept: adjacent query word pairs where both words occur, in any order, within a window of fixed length in the document. λ_T, λ_O, and λ_U are the weights for each concept type, so all matches of the same type are treated as equally important. Empirically, the weights are set to 0.8, 0.1, and 0.1, respectively.
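A sketch of sequential dependence scoring under these fixed weights; f_T, f_O, and f_U are stand-ins for the three matching functions (e.g., the smoothed language model score above, computed over terms, exact phrases, and unordered windows, respectively):

```python
# Fixed type weights from the slide: terms, ordered phrases, unordered windows.
LAMBDA_T, LAMBDA_O, LAMBDA_U = 0.8, 0.1, 0.1

def sdm_score(terms, doc, f_T, f_O, f_U):
    """Sum the three concept-type scores over terms and adjacent pairs."""
    bigrams = list(zip(terms, terms[1:]))
    return (LAMBDA_T * sum(f_T(t, doc) for t in terms)
            + LAMBDA_O * sum(f_O(a, b, doc) for a, b in bigrams)
            + LAMBDA_U * sum(f_U(a, b, doc) for a, b in bigrams))
```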

Weighted Sequential Dependence. SD treats matches of the same type equally; however, it is desirable to weight different terms and bigrams differently a priori, based on query-level evidence. Thus, the weighted sequential dependence model assumes that the concept weight parameter λ for each concept takes on a parameterized form.

Weighted Sequential Dependence. This is the parameterized form of the concept weight: λ(q_i) = Σ_j w_j^u · g_j^u(q_i) for unigrams, and λ(q_i, q_{i+1}) = Σ_j w_j^b · g_j^b(q_i, q_{i+1}) for bigrams. Here the g^u are features defined over unigrams, the g^b are features defined over bigrams, and the w are free parameters that must be estimated.
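A minimal sketch of this parameterized weight, a plain linear combination of feature values; the feature values here are placeholders for the importance features discussed next:

```python
def concept_weight(features, w):
    """lambda(k) = sum_j w_j * g_j(k): a linear combination of
    importance features g_j(k) with learned weights w_j."""
    return sum(wj * gj for wj, gj in zip(w, features))

print(concept_weight([1.2, 0.4, 3.0], [0.5, -0.1, 0.02]))  # ~0.62
```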

Weighted Sequential Dependence: Concept Importance Features. To calculate the concept importance features, we consider two kinds of sources. The first is endogenous: collection-dependent features such as term and document frequencies. The second is exogenous: collection-independent features estimated from external data sources such as query logs and Wikipedia. The slide's table presents the details of these features.

Weighted Sequential Dependence: Parameter Estimation. To estimate the parameters, we employ coordinate-level ascent. This method iteratively optimizes a multivariate objective function by performing a series of one-dimensional line searches, repeatedly cycling through each parameter. The process continues until the gain in the target metric falls below a certain threshold. For details, see Metzler and Croft 2007.
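A hedged sketch of coordinate ascent over a fixed grid; `metric` is a black-box evaluation such as mean average precision of the ranking induced by the parameters, and the grid and tolerance are illustrative choices, not the paper's settings:

```python
def coordinate_ascent(params, metric, grid=None, tol=1e-4):
    """Iteratively optimize one parameter at a time via grid line search."""
    grid = grid or [x / 10 for x in range(11)]       # candidate values 0.0 .. 1.0
    best = metric(params)
    improved = True
    while improved:                                  # repeat cycles through parameters
        improved = False
        for i in range(len(params)):                 # one-dimensional line search
            for v in grid:
                trial = list(params)
                trial[i] = v
                m = metric(trial)
                if m > best + tol:                   # stop once gains fall below tol
                    params, best, improved = trial, m, True
    return params, best

# Toy usage: recover weights near the SD defaults (0.8, 0.1, 0.1).
print(coordinate_ascent([0.3, 0.3, 0.3],
                        lambda p: -sum((a - b) ** 2
                                       for a, b in zip(p, (0.8, 0.1, 0.1)))))
```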

Parameterized Query Expansion. WSD learns weights only for the explicit query concepts (concepts that appear in the query), not for latent concepts that are associated with the query through pseudo-relevance feedback. PQE uses four types of concepts: query term concepts, phrase concepts, proximity concepts, and expansion concepts. The expansion concepts are the top-k terms associated with the query through pseudo-relevance feedback, obtained using latent concept expansion (Metzler and Croft 2007).

Parameterized Query Expansion: Latent Concept Expansion. First, use the explicit concepts to retrieve a set of pseudo-relevant documents R. Then estimate the weight of each term in R as a candidate expansion concept: the weight combines document relevance with the weight of the term in the pseudo-relevant set, while dampening the scores of common terms.
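An illustrative approximation of this scoring in Python (the exact formula in Metzler and Croft 2007 differs; in particular, the IDF-like dampener here merely stands in for their treatment of common terms):

```python
import heapq
import math
from collections import defaultdict

def lce_expansion(pseudo_rel_docs, doc_score, p_term_doc, p_term_coll, k=10):
    """Rank candidate expansion terms from the pseudo-relevant set R."""
    scores = defaultdict(float)
    for doc in pseudo_rel_docs:
        rel = doc_score(doc)                              # document relevance, sc(Q, D)
        for term in set(doc):
            scores[term] += rel * p_term_doc(term, doc)   # weight of term in R
    for term in scores:
        scores[term] *= -math.log(p_term_coll(term))      # dampen common terms
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
```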

Parameterized Query Expansion. A two-stage optimization is used for estimating the parameters: a1–a5 are estimated in the first stage, and a6–a7 in the second stage.

Multiple Source Formulation. LCE and PQE use a single source for expansion, which may lead to topic drift. For example, for the query "ER TV Show", expansion terms such as "Folge" and "selbst" are not English words, and terms such as "bisexual" and "film" are not on the same topic as the query.

Multiple Source Formulation: Expansion Term Selection. Rank the documents in each source σ using the ranking function over the explicit concepts. The M terms with the highest LCE value in each source σ are added to the candidate expansion set. Then assign a weight to each term in the candidate set, using a weighted combination of its per-source expansion scores.
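A sketch of this pooling-and-weighting procedure, assuming `run_lce` produces ranked (term, score) pairs per source and `source_weights` are the learned combination parameters:

```python
def msf_expansion(sources, run_lce, source_weights, M=10):
    """Pool top-M LCE terms from each source and weight them jointly.

    sources: {source name: pseudo-relevant documents from that source}
    run_lce: returns ranked (term, score) pairs, e.g. lce_expansion above
    source_weights: learned per-source combination weights
    """
    per_source, pooled = {}, set()
    for name, docs in sources.items():
        ranked = run_lce(docs)
        per_source[name] = dict(ranked)
        pooled.update(term for term, _ in ranked[:M])     # top-M per source
    # Each pooled term's weight is a weighted combination of its
    # expansion scores across all sources (0 where it did not appear).
    return {term: sum(source_weights[name] * per_source[name].get(term, 0.0)
                      for name in sources)
            for term in pooled}
```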

Multiple Source Formulation: Ranking Function. The final ranking function combines the explicit concepts with the expansion concepts, each weighted using multiple sources: sc(Q, D) = Σ_{κ ∈ Q} λ(κ) · f(κ, D) + Σ_{e ∈ E} λ(e) · f(e, D), where E is the set of expansion concepts.


Outline. Now I will present some experimental results.

Experiments. Newswire and Web TREC collections: ROBUST04 (500K documents), GOV2 (25M documents), and ClueWeb-B (50M documents). Queries are the <title> and <desc> portions of TREC topics. Evaluation uses 3-fold cross-validation.

Experiments. Comparison with the query weighting methods on TREC collections; significance tests against each baseline are reported.

Experiments. Comparison with the query expansion methods on TREC collections; some of the results are statistically indistinguishable from the other methods.

Experiments. Other experiments in the WSDM 2012 paper: varying the number of expansion terms, robustness of the proposed methods, and result diversification performance.

Discussion. The problems solved in these papers are fundamentally important, and the papers are written in a good style: they move from a general formulation to a specific algorithm, cite related work extensively throughout, and repeatedly motivate the proposed approach. The experiments are on standard data sets and are quite thorough.

Thanks! Q & A