Modeling term relevancies in information retrieval using Graph Laplacian Kernels Shuguang Wang Joint work with Saeed Amizadeh and Milos Hauskrecht.

Presentation transcript:

Modeling term relevancies in information retrieval using Graph Laplacian Kernels Shuguang Wang Joint work with Saeed Amizadeh and Milos Hauskrecht

A Problem in Document Retrieval There is a ‘gap’ between search queries and documents. Query: car. The search engines (Google.com, Bing.com, Yahoo.com, …) return documents matching the literal term ‘car’. Good enough?

A Problem in Document Retrieval What about the documents about automobiles, BMW, Benz, …? There are various expressions for the same entity. One solution is to expand the original user query with some ‘relevant’ terms.

Traditional Query Expansion Methods
– Human and/or computer generated thesauri: Zhou et al. (SIGIR 2007) proposed to expand queries with MeSH concepts.
– Human relevance feedback: implicit feedback such as tracking eye movements (Buscher et al., SIGIR 2009), or user click information (Yin et al., ECIR 2009).
– Automatic query expansion: Pseudo-Relevance Feedback, first proposed in (Xu and Croft, SIGIR 1996), uses the top ‘n’ documents from the initial search as implicit feedback and selects ‘relevant’ terms from those ‘n’ documents; Bordino et al. (SIGIR 2010) analyze the query flow graph.
The first two families require human input and are expensive and time consuming.

A Different View What we really need here is a way to estimate term–term relevance. The problem of finding expansion terms for user queries → the problem of finding ‘relevant’ terms given a similarity metric. How to derive a term–term similarity metric?

Term–Term Similarity Hypothesis: the metric d should be smooth, i.e., d(t1) ≈ d(t2) if t1 and t2 are similar/relevant. Why not graph Laplacian kernels? – We get the smoothness property easily. – We can also define distance metrics with them.

Define Affinity Graph Nodes are terms. Edges are co-occurrences. The weight of an edge is the number of documents in which the two terms co-occur.
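As a concrete illustration, a minimal sketch (not the authors' code) of building this affinity graph from a toy corpus; the documents and variable names are made up for the example:

```python
from collections import defaultdict
from itertools import combinations

import numpy as np

docs = [
    ["car", "engine", "bmw"],
    ["car", "automobile", "engine"],
    ["bmw", "benz", "automobile"],
]

weights = defaultdict(int)
for doc in docs:
    # Count each unordered term pair once per document.
    for t1, t2 in combinations(sorted(set(doc)), 2):
        weights[(t1, t2)] += 1

terms = sorted({t for doc in docs for t in doc})
index = {t: i for i, t in enumerate(terms)}

# Symmetric weighted adjacency matrix W of the affinity graph.
W = np.zeros((len(terms), len(terms)))
for (t1, t2), w in weights.items():
    W[index[t1], index[t2]] = W[index[t2], index[t1]] = w
```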

Graph Laplacian Kernels General Form
Recall: the graph Laplacian is L = D − W, with eigendecomposition L = Σ_i λ_i u_i u_i^T.
Definition: K = Σ_i g(λ_i) u_i u_i^T, for a spectral transform g.
– Resistance: g(λ) = 1/λ for λ > 0 (the pseudo-inverse of L)
– Diffusion: g(λ) = exp(−σ²λ/2)
– P-step Random Walk: g(λ) = (a − λ)^p, a ≥ 2
– …
How to choose g(λ)? How to choose the hyperparameters?
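The slide's own formulas were images and did not survive extraction; the forms above are the standard transforms from the graph-kernel literature (Smola and Kondor, 2003), so treat them as a reconstruction. A small sketch computing these kernels, reusing W from the previous snippet:

```python
import numpy as np

def laplacian(W):
    """Unnormalized graph Laplacian L = D - W."""
    return np.diag(W.sum(axis=1)) - W

def laplacian_kernel(W, g):
    """K = sum_i g(lambda_i) u_i u_i^T for a spectral transform g."""
    lam, U = np.linalg.eigh(laplacian(W))
    return (U * g(lam)) @ U.T

def g_resistance(lam, eps=1e-10):
    """g(lambda) = 1/lambda on the nonzero spectrum (pseudo-inverse of L)."""
    out = np.zeros_like(lam)
    out[lam > eps] = 1.0 / lam[lam > eps]
    return out

def g_diffusion(lam, sigma=1.0):
    """g(lambda) = exp(-sigma^2 * lambda / 2)."""
    return np.exp(-sigma**2 * lam / 2.0)

def g_p_step(lam, a=2.0, p=3):
    """g(lambda) = (a - lambda)^p; a >= 2 assumes the normalized
    Laplacian, whose eigenvalues lie in [0, 2]."""
    return (a - lam) ** p
```

For example, `K = laplacian_kernel(W, g_diffusion)` yields a term-by-term kernel matrix.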

Non-parametric kernel Learn the transformation g(λ) directly from training data. – If we know some terms are similar, we want to maximize their similarities. – At the same time, we want the resulting metric to stay smooth.

An Optimization Problem {λ_i}: the set of eigenvalues of the original graph Laplacian; t_in and t_jn are a pair of similar terms in training document n. The learned values μ_i = g(λ_i) maximize the kernel similarity for the known similar term pairs (t_in, t_jn), while penalizing large eigenvalues more heavily (which enforces smoothness).
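The objective itself did not survive extraction, so the following is only a rough sketch of this kind of spectral-transform learning under the two stated desiderata; the penalty weight gamma, the step size, and the projected-gradient solver are all made up for illustration, and the paper's exact objective may differ.

```python
import numpy as np

def learn_transform(lam, U, similar_pairs, gamma=0.1, lr=0.01, steps=500):
    mu = np.ones_like(lam)                     # initial transform values mu_i
    for _ in range(steps):
        # Since K_ij = sum_k mu_k U[i,k] U[j,k], the gradient of a pair's
        # similarity w.r.t. mu is the elementwise product of rows i and j.
        grad = -gamma * lam                    # penalty grows with lambda_i
        for i, j in similar_pairs:
            grad = grad + U[i] * U[j]
        mu = np.maximum(mu + lr * grad, 0.0)   # ascent step, project mu >= 0
    return mu

# Usage with the earlier pieces, e.g. similar_pairs = [(0, 2), (1, 3)]:
#   lam, U = np.linalg.eigh(laplacian(W))
#   mu = learn_transform(lam, U, [(0, 2), (1, 3)])
#   K = (U * mu) @ U.T
```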

Kernel to Distances Given the kernel K, we can define the distance between any pair of nodes i and j in the graph. Recall K = Σ_i μ_i u_i u_i^T. We define: d(i,j) = K_ii + K_jj − 2K_ij, the Euclidean distance in the kernel space. The distance metric derived from a graph Laplacian kernel is therefore the Euclidean distance between the terms in that space.
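A direct transcription of the slide's formula; strictly speaking, K_ii + K_jj − 2K_ij is the squared Euclidean distance in the kernel's feature space:

```python
import numpy as np

def kernel_distances(K):
    """d(i,j) = K_ii + K_jj - 2*K_ij, computed for all pairs at once."""
    diag = np.diag(K)
    return diag[:, None] + diag[None, :] - 2.0 * K
```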

Using term–term similarity in IR Deal with similarity between sets and terms. – In query expansion tasks, S is the set of query terms and t is a candidate expansion term. Transform the pairwise distances d into a set-to-term similarity. – Naïve methods: d_max = max(d(S,t)), d_avg = avg(d(S,t)), d_min = min(d(S,t)), where d(S,t) denotes the set of distances {d(s,t) : s in S}.
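The three naive set-to-term scores, given the distance matrix D from kernel_distances above, query-term indices S, and a candidate term index t (names are illustrative):

```python
import numpy as np

def set_to_term(D, S, t):
    d = D[np.asarray(list(S)), t]   # pairwise distances d(s, t) for s in S
    return {"d_max": d.max(), "d_avg": d.mean(), "d_min": d.min()}
```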

Set-to-term Similarity: Query Collapsing

Query Collapsing We would have to compute the eigendecomposition again for each query. – That is too expensive for an online task. Approximation is possible. – We want to approximate the projection of the ‘new’ point S in the kernel space. – We need to append one element to each original eigenvector.

Nyström Approximation For all nodes in the graph, the Laplacian eigenvectors satisfy L u_i = λ_i u_i, i.e., Σ_j L_vj u_i(j) = λ_i u_i(v) for every node v. If the new point s were in the graph, it would satisfy the above as well, so its eigenvector entries u_i(s) can be solved for from this relation instead of recomputing the full eigendecomposition.
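A sketch of query collapsing via the Nyström extension, using the common affinity-based form u_i(s) ≈ (1/λ_i) Σ_j w_s[j] u_i(j); whether the paper extends eigenvectors of W or of L is not visible in the transcript, so treat this variant as an assumption:

```python
import numpy as np

def nystrom_extend(w_s, lam, U, eps=1e-10):
    """Approximate the new point's coordinate on every eigenvector."""
    coords = w_s @ U                    # sum_j w_s[j] * u_i(j), for each i
    nz = np.abs(lam) > eps
    coords[nz] = coords[nz] / lam[nz]
    coords[~nz] = 0.0                   # drop (near-)zero eigenvalues
    return coords

# w_s holds the collapsed query node's co-occurrence weights with every
# existing term, i.e. one new row of the affinity matrix W.
```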

Evaluation Two tasks:
– Term prediction (scientific publications): given the terms in the abstract, predict the terms likely to appear in the full body. Compare with TFIDF, PRF, PLSI.
– Query expansion: compare with Lemur/Indri + PRF and Terrier + PRF.
Kernels: Diffusion (optimized by line search), Resistance, Non-parametric (optimized by line search).
Set-to-term: Average, Query collapse.

Term prediction 6000 articles about 10 cancers downloaded from PubMed: 80% for training and 20% for testing. Given the terms in the abstract, rank all candidate terms using the distance metrics: the smaller the distance between a candidate term and the query terms, the higher the term is ranked. Use AUC to evaluate (Joachims, ICML 2005).
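An illustrative scoring loop in the spirit of this evaluation: candidates are ranked by negative distance to the abstract terms and scored with AUC. The data here is synthetic and sklearn is used purely for convenience.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
scores = -rng.random(100)          # -distance: closer terms score higher
labels = np.zeros(100, dtype=int)
labels[:30] = 1                    # 1 = term actually occurs in the full body
print("AUC:", roc_auc_score(labels, scores))
```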

Results

Query Expansion Four TREC datasets: Genomics 03 & 04, Ad hoc TREC 7 & 8. We built graphs using a different set of terms for each dataset:
– genes/proteins for the Genomics 03 data
– the 5000 terms with the highest TFIDF scores for the Genomics 04 data
– a 25% subsample of all (~100k) unique terms for TREC 7 & 8.
Use Mean Average Precision (MAP) to evaluate performance. Only the resistance kernel is used here.

Results