Effective Query Formulation with Multiple Information Sources
Michael Bendersky1, Donald Metzler2, W. Bruce Croft1  1University of Massachusetts  2Information Sciences Institute, USC
WSDM 2012 Best Paper Runner-Up. Presented by Tom, March 14th, 2012.
Good morning, everybody. Today my presentation starts with the paper "Effective Query Formulation with Multiple Information Sources", the WSDM 2012 Best Paper Runner-Up. The first author, Michael Bendersky, and the third author, Bruce Croft, are from UMass; the second author is from USC.
Michael Bendersky   Donald Metzler (graduated in 2007; Yahoo! Research, now USC)
Supervisor: W. Bruce Croft
Before we look into the details of this paper, let me introduce the social graph to you. Bruce Croft is an expert in the IR field, and both Michael and Donald are his students. Michael Bendersky is a PhD student; Donald graduated in 2007, first joined Yahoo! Research, and is now with USC.
A Markov Random Field Model for Term Dependencies, SIGIR, 2005
Inheritance: Learning Concept Importance Using a Weighted Dependence Model, WSDM, 2010
Inheritance: Parameterized Concept Weighting in Verbose Queries, SIGIR, 2011, Honorable Mention Award
Inheritance: Effective Query Formulation with Multiple Information Sources, WSDM, 2012, Best Paper Runner-Up
When I read this WSDM 2012 paper, I found several precedent works by the authors, so I traced back the history of all the papers: they also published on similar topics at SIGIR 2011, WSDM 2010, and SIGIR 2005. Today I will try to summarize their work.
Outline: Query Formulation Process; Concept-Based Ranking (Concept Matching; Concept Weighting: Sequential Dependence [SIGIR 2005], Weighted Sequential Dependence [WSDM 2010], Parameterized Query Expansion [SIGIR 2011], Multiple Source Formulation [WSDM 2012]); Experiments; Discussion
This is the outline for today's presentation. First, I will introduce the idea of the query formulation process. Then I will introduce concept-based ranking; specifically, I will present concept matching and concept weighting, with four methods for the latter. Then I will show some experimental evaluations, and I will conclude the presentation with a discussion.
Outline: Query Formulation Process; Concept-Based Ranking (Concept Matching; Concept Weighting: Sequential Dependence [SIGIR 2005], Weighted Sequential Dependence [WSDM 2010], Parameterized Query Expansion [SIGIR 2011], Multiple Source Formulation [WSDM 2012]); Experiments; Discussion
Let's start with the query formulation process.
Query Formulation Process
When we type a keyword query into a search engine, the search engine performs three steps to obtain results: first, query refinement; second, structured query formulation; and finally, ranking with scores.
Query Formulation Process
Query Refinement: alters the query on the morphological level
Tokenization: 香港中文大学 (CUHK) || 何善衡 (Ho Sin-Hang) || 大楼 (Building)
Spelling corrections, e.g., Hong Kng -> Hong Kong
Stemming
Query refinement processes alter the query on the morphological level, for example tokenization, spelling correction, and stemming. Tokenization means segmenting characters into tokens; e.g., we can segment 香港中文大学何善衡大楼 (CUHK Ho Sin-Hang Building) into 香港中文大学, 何善衡, 大楼. Spelling correction fixes misspellings.
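As a concrete illustration, here is a minimal Python sketch of these three refinement steps; the correction table and the one-letter stemmer are illustrative stand-ins, not the components a real search engine would use.

```python
# Hypothetical correction table, for illustration only.
SPELLING = {"kng": "kong"}

def refine(query: str) -> list[str]:
    tokens = query.lower().split()                               # tokenization
    tokens = [SPELLING.get(t, t) for t in tokens]                # spelling correction
    tokens = [t[:-1] if t.endswith("s") else t for t in tokens]  # crude stemming
    return tokens

print(refine("Hong Kng buildings"))  # ['hong', 'kong', 'building']
```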
Query Formulation Process
Structured Query Formulation
Concept Identification: what are the atomic matching units in the query?
Concept Weighting: how important are the different concepts for conveying the query intent?
Query Expansion: what additional concepts should be associated with the query?
Structured query formulation consists of several steps. The first step is concept identification, which answers: what are the atomic matching units in the query? The second step is concept weighting, which answers: how important are the different concepts for conveying the query intent? The third step is query expansion, which asks: what additional concepts should be associated with the query?
Structured Query Formulation
ER TV Show (ER is an American medical drama television series)
Concept Identification and Concept Weighting: 0.297 er, 0.168 tv, 0.192 show, 0.051 "er tv", 0.012 "tv show"
Query Expansion Terms: 0.085 season, 0.065 episode, 0.051 dr, 0.043 drama, 0.036 series
I will present an example of structured query formulation. Given the query "ER TV Show", a search engine may formulate a structured query as follows. First, we identify atomic concepts; here we use unigrams and bigrams. Second, we calculate a weight for each concept. Third, we expand the query with related concepts.
Query Formulation Process
In this paper, we focus on “Structured Query Formulation”.
Outline: Query Formulation Process; Concept-Based Ranking (Concept Matching; Concept Weighting: Sequential Dependence [SIGIR 2005], Weighted Sequential Dependence [WSDM 2010], Parameterized Query Expansion [SIGIR 2011], Multiple Source Formulation [WSDM 2012]); Experiments; Discussion
Now I will introduce concept-based ranking, starting with concept matching.
Concept-Based Ranking
This slide shows the formula for concept-based ranking, with its components labeled: concept weighting, query, concept matching, document, and concepts (a reconstruction is sketched below). Here sc() is a function that calculates a score between a query Q and a document D by adding up the contributions of each concept. For each concept, we multiply two scores: the concept weighting score, which measures how important the concept is for conveying the user's intent, and the concept matching score, which measures how well a document matches the concept.
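The equation itself is an image on the original slide, so the following is a reconstruction from the description above, with notation assumed from the prose:

```latex
% \mathcal{K}(Q): set of concepts identified in Q,
% \lambda(\kappa): concept weighting score,
% f(\kappa, D): concept matching score.
sc(Q, D) = \sum_{\kappa \in \mathcal{K}(Q)} \lambda(\kappa)\, f(\kappa, D)
```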
Concept Matching
Assign a score to the matches of concept k in document D
Monotonic function: the value increases with the number of times concept k matches document D
Language model
Concept matching assigns a score to the matches of concept k in document D. Usually it is a monotonic function, so the value increases with the number of times the concept matches the document. A standard approach is a language model, in which tf is the term frequency, C is the collection, D is a document, and µ is a smoothing parameter.
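The slide's exact formula is not recoverable from the text, but a standard Dirichlet-smoothed language model matching function has the following form, using the quantities the note names:

```latex
% tf_{k,D}: frequency of concept k in D,
% P(k \mid C): collection probability of k,
% \mu: Dirichlet smoothing parameter.
f(k, D) = \log \frac{tf_{k,D} + \mu\, P(k \mid C)}{|D| + \mu}
```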
Outline: Query Formulation Process; Concept-Based Ranking (Concept Matching; Concept Weighting: Sequential Dependence [SIGIR 2005], Weighted Sequential Dependence [WSDM 2010], Parameterized Query Expansion [SIGIR 2011], Multiple Source Formulation [WSDM 2012]); Experiments; Discussion
Now I will introduce four different concept weighting methods.
Markov Random Field
Undirected graphical models that define a joint probability distribution over a set of random variables
Nodes represent random variables, and edges represent dependence semantics
Information Retrieval: document random variable D, query term random variables Q
The first method is based on Markov Random Fields, which are undirected graphical models that define a joint probability distribution over a set of random variables. In the graph, nodes represent random variables and edges represent dependencies. When we apply MRFs in IR, we consider a document random variable D and query term random variables Q.
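For reference, the joint distribution defined by the MRF has the standard form below (following Metzler and Croft, 2005):

```latex
% G: graph over D and the query terms, C(G): its set of cliques,
% \psi: non-negative potential function with parameters \Lambda,
% Z_\Lambda: normalizing constant.
P_\Lambda(Q, D) = \frac{1}{Z_\Lambda} \prod_{c \in C(G)} \psi(c\,; \Lambda)
```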
Sequential Dependence Model
The sequential dependence model places edges between adjacent query terms
The first concept weighting model is the Sequential Dependence Model, which places edges between adjacent query terms; that is, adjacent query terms are assumed to be semantically dependent. The slide shows the Markov random field model for three query terms under the sequential dependence assumption.
Sequential Dependence Model
Query Term Concept: individual query words
Phrase Concept: adjacent query word pairs matched as exact phrases in the document
Proximity Concept: adjacent query word pairs whose individual words both occur, in any order, within a fixed-length window in the document
This is the formula for the Sequential Dependence Model (a reconstruction is sketched below). The model has three parts: query term concepts, phrase concepts, and proximity concepts. Here λT, λO, and λU are the weights for the three concept types, so all matches of the same type are treated as equally important. Empirically, the weights are set to 0.8, 0.1, and 0.1 respectively.
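Reconstructed from the description above, the SDM ranking function (Metzler and Croft, 2005) takes this form:

```latex
% f_T: single-term matches, f_O: exact-phrase matches,
% f_U: unordered-window matches of adjacent query word pairs.
sc(Q, D) = \lambda_T \sum_{q_i \in Q} f_T(q_i, D)
         + \lambda_O \sum_{i=1}^{|Q|-1} f_O(q_i, q_{i+1}, D)
         + \lambda_U \sum_{i=1}^{|Q|-1} f_U(q_i, q_{i+1}, D)
```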
Weighted Sequential Dependence
SD treats matches of the same type equally
It is desirable to weight different terms and bigrams differently a priori, based on query-level evidence
Assume the concept weight parameter λ takes on a parameterized form
SD treats all matches of the same type equally; however, it is desirable to weight different terms and bigrams differently a priori, based on query-level evidence. Thus, the Weighted Sequential Dependence model assumes that the concept weight parameter λ for each concept takes on a parameterized form.
Weighted Sequential Dependence
Features defined over unigrams; features defined over bigrams
This is the parameterized form of the concept weight. Here the g^u are features defined over unigrams, the g^b are features defined over bigrams, and the w's are free parameters that must be estimated. A sketch of this form is given below.
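A sketch of the parameterized weights, with notation assumed from the prose:

```latex
% g^u_j: unigram features, g^b_j: bigram features,
% w^u_j, w^b_j: free parameters to be estimated.
\lambda(q_i) = \sum_j w^u_j\, g^u_j(q_i)
\qquad
\lambda(q_i, q_{i+1}) = \sum_j w^b_j\, g^b_j(q_i, q_{i+1})
```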
Weighted Sequential Dependence
Concept Importance Features
Endogenous: collection dependent
Exogenous: collection independent, estimated from external data sources
To calculate the concept importance features, we consider two kinds of sources. The first is endogenous features, which are collection dependent. The second is exogenous features, which are collection independent and estimated from external data sources. The table on the slide presents the concept importance features in detail.
Weighted Sequential Dependence
Parameter Estimation: coordinate ascent
Iteratively optimizes a multivariate objective function by performing a series of one-dimensional line searches
Repeatedly cycles through each parameter
The process is performed iteratively until the gain in the target metric falls below a certain threshold [Metzler and Croft 2007]
To estimate the parameters, we employ coordinate ascent. This method iteratively optimizes a multivariate objective function by performing a series of one-dimensional line searches, repeatedly cycling through each parameter until the gain in the target metric falls below a certain threshold. For details, please refer to Metzler and Croft 2007.
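A minimal Python sketch of this procedure, assuming score_fn maps a parameter vector to the target retrieval metric (e.g., MAP); the grid-based line search is a simplification of what the paper may actually use:

```python
def coordinate_ascent(score_fn, params, step=0.05, tol=1e-4, max_cycles=50):
    """Iteratively improve params, one coordinate at a time."""
    best = score_fn(params)
    for _ in range(max_cycles):
        start = best
        for i in range(len(params)):
            # One-dimensional line search: try a grid of values for
            # params[i] while holding all other coordinates fixed.
            original = params[i]
            for candidate in [original + k * step for k in range(-10, 11)]:
                params[i] = candidate
                s = score_fn(params)
                if s > best:
                    best, original = s, candidate
            params[i] = original  # keep the best value found
        if best - start < tol:  # gain below threshold: stop
            break
    return params, best

# Example: maximize a toy "metric" over two parameters.
params, metric = coordinate_ascent(
    lambda p: -(p[0] - 1) ** 2 - (p[1] + 2) ** 2, [0.0, 0.0])
```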
Parameterized Query Expansion
WSD learns weights only for the explicit query concepts (concepts that appear in the query), not for latent concepts that are associated with the query through pseudo-relevance feedback
PQE uses four types of concepts: query term concepts, phrase concepts, proximity concepts, and expansion concepts
Expansion concepts are the top-K terms associated with the query through pseudo-relevance feedback, using Latent Concept Expansion (Metzler and Croft 2007)
Parameterized Query Expansion
Latent Concept Expansion
Use the explicit concepts to retrieve a set of documents R (the pseudo-relevant documents)
Estimate the weight of each term in R as a candidate expansion concept, combining document relevance with the weight of the term in the pseudo-relevant set, and dampening the scores of common terms
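A sketch of the expansion-term weight assembled from the slide's labels (the paper's exact form may differ):

```latex
% P(D \mid Q): document relevance,
% P(e \mid D): weight of term e in the pseudo-relevant set R,
% the log factor dampens terms common in the collection C.
Score(e) \;\propto\; \sum_{D \in R} P(D \mid Q)\, P(e \mid D)\, \log \frac{1}{P(e \mid C)}
```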
Parameterized Query Expansion
Two-stage optimization for estimating the parameters: a1–a5 are estimated in the first stage, and a6–a7 in the second stage.
Multiple Source Formulation
LCE and PQE use a single source for expansion, which may lead to topic drift. For example, "Folge" and "selbst" are not English words, and "bisexual" and "film" are not on the same topic as "ER TV Show".
Multiple Source Formulation
Expansion Term Ranking
Rank documents in each source σ using the ranking function over the explicit concepts
The M terms with the highest LCE value in each source σ are added to the candidate expansion set
Assign a weight to each term in the set, using a weighted combination of its expansion scores across sources (see the sketch below)
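A minimal Python sketch of this selection step, under assumed data structures: sources maps a source name to its pseudo-relevant documents (whitespace-tokenized strings here, for simplicity), and lce_score and the per-source weights stand in for the paper's actual components.

```python
from collections import defaultdict

def select_expansion_terms(sources, lce_score, weights, m=10):
    """Pool the top-M LCE terms per source and combine their scores."""
    candidates = set()
    per_source = {}
    for name, docs in sources.items():
        # Score every term seen in this source's pseudo-relevant documents
        # (docs are plain strings in this sketch) ...
        terms = {t for d in docs for t in d.split()}
        ranked = sorted(terms, key=lambda t: lce_score(t, docs), reverse=True)
        # ... and keep the M highest-scoring terms for this source.
        per_source[name] = {t: lce_score(t, docs) for t in ranked[:m]}
        candidates |= set(ranked[:m])
    combined = defaultdict(float)
    for t in candidates:
        # Weighted combination of expansion scores across sources.
        for name, scores in per_source.items():
            combined[t] += weights[name] * scores.get(t, 0.0)
    return dict(combined)
```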
Multiple Source Formulation
The final ranking function combines the explicit concepts with the expansion concepts drawn from the multiple sources.
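Put together, a sketch of the final ranking function assembled from the slide's labels (notation assumed, following the concept-based ranking formula earlier):

```latex
% \mathcal{K}(Q): explicit concepts,
% \mathcal{E}(Q): expansion concepts pooled from the multiple sources.
sc(Q, D) = \sum_{\kappa \in \mathcal{K}(Q)} \lambda(\kappa)\, f(\kappa, D)
         + \sum_{e \in \mathcal{E}(Q)} \lambda(e)\, f(e, D)
```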
Outline: Query Formulation Process; Concept-Based Ranking (Concept Matching; Concept Weighting: Sequential Dependence [SIGIR 2005], Weighted Sequential Dependence [WSDM 2010], Parameterized Query Expansion [SIGIR 2011], Multiple Source Formulation [WSDM 2012]); Experiments; Discussion
Now I will present some experimental results.
Experiments: Newswire & Web TREC collections
ROBUST04 (500K documents), GOV2 (25M documents), ClueWeb-B (50M documents)
<title> & <desc> portions of TREC topics; 3-fold cross-validation
Experiments
Comparison with the query weighting methods on the TREC collections; a significance test over each baseline is presented.
Experiments
Comparison with the query expansion methods on the TREC collections; some results are statistically indistinguishable from the other methods.
Experiments
Other experiments in the WSDM 2012 paper:
Varying the number of expansion terms
Robustness of the proposed methods
Result diversification performance
Discussion
The problems solved in these papers are fundamentally important
Written in a good style: general formulation -> specific algorithm
Cites related work throughout the paper, drawing extensively on the literature
Motivates the proposed approach throughout
Experiments on standard data sets, and quite thorough
Thanks! Q & A