Fast Query Execution for Retrieval Models Based on Path Constrained Random Walks
Ni Lao, William W. Cohen
Carnegie Mellon University
Outline
Background
–Retrieval models based on Path Constrained Random Walks (ECML PKDD 2010)
Efficient PCRW
–Comparing sampling and truncation strategies
Relational Retrieval Problems
Data of many retrieval/recommendation tasks can be represented as labeled directed graphs
–Typed nodes: documents, terms, metadata
–Labeled edges: authorOf, datePublished
Can support a family of typed proximity queries
–Ad hoc retrieval: term nodes → documents
–Gene recommendation: user, year → gene
–Reference (citation) recommendation: topic → paper
–Expert finding: topic → user
–Collaborator recommendation: scientist → scientist
How to measure the proximity between query and target nodes?
Biology Literature Data
Data of this study
–Yeast: 0.2M nodes, 5.5M links
–Fly: 0.8M nodes, 3.5M links
–E.g. the fly graph (shown as a figure)
Tasks
–Gene recommendation: author, year → gene
–Reference recommendation: title words, year → paper
–Expert finding: title words, genes → author
Random Walk with Restart (RWR) as a Proximity Measure
RWR is a commonly used similarity measure on labeled graphs
–Topic-sensitive PageRank (Haveliwala, 2002)
–Personalized PageRank (Jeh & Widom, 2003)
–ObjectRank (Balmin et al., 2004)
–Personal information management (Minkov & Cohen, 2007)
RWR can be improved by supervised learning of edge weights
–quadratic programming (Tsoi et al., 2003)
–simulated annealing (Nie et al., 2005)
–back-propagation (Diligenti et al., 2005; Minkov & Cohen, 2007)
–limited-memory Newton method (Agarwal et al., 2006)
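For reference, a minimal sketch of the RWR recurrence shared by these methods (the notation here is assumed, not taken from the slide): with restart probability \alpha, seed distribution p_0 over the query nodes, and transition matrix M built from the (possibly learned) edge weights,

    p_{t+1} = (1 - \alpha) M^T p_t + \alpha p_0

The fixed point of this recurrence gives the RWR proximity from the query to every node; the supervised variants above learn the edge weights that enter M.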
The Limitation of RWR
One parameter per edge label is limited because the context in which an edge label appears is ignored
E.g. (preferences observed from real data; each corresponds to a relation path in the original table):
–Don't read about genes which I have already read
–Read about my favorite authors
–Read about the genes that I am working on
–Don't need to read papers from my own lab
Path Constrained Random Walk – A New Proximity Measure
Our previous work (Lao & Cohen, ECML 2010)
–learns a weighted combination of simple “path experts”, each of which corresponds to a particular labeled path through the graph
Reference recommendation – an example
–In the TREC-CHEM Prior Art Search Task, researchers found that it is more effective to first find patents about the topic, then aggregate their citations
–Our proposed model can discover this kind of retrieval scheme and assign proper weights to combine them (a table of learned weights per path is shown on the original slide)
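As an illustration (the relation names are assumed for this sketch, not taken from the slide), the two-step scheme above corresponds to a relation path such as

    P = (HasWord^{-1}, Cites)

i.e., walk from the query terms to on-topic documents, then follow their outgoing citations; the learned weight for this path expert determines how much it contributes to the final ranking.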
Definitions
An entity-relation graph G = (T, E, R) is
–a set of entity types T = {T}
–a set of entities E = {e}; each entity is typed, with e.T ∈ T
–a set of relations R = {R}
A relation path P = (R_1, …, R_n) is a sequence of relations
–E.g. (an example path was shown graphically)
Path Constrained Random Walk
–Given a query q = (E_q, T_q)
–Recursively define a distribution h_{q,P} over entities for each path P
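The slide states the recursion only informally; the following reconstruction follows Lao & Cohen (ECML PKDD 2010), writing R_n(e') for the set of entities reachable from e' via relation R_n (notation assumed):

    h_{q, P_0}(e) = 1/|E_q|  if e \in E_q,  0 otherwise        (P_0 the empty path)
    h_{q, P}(e)  = \sum_{e'} h_{q, P'}(e') \cdot I(e \in R_n(e')) / |R_n(e')|

where P = (R_1, …, R_n) and P' = (R_1, …, R_{n-1}): a walker that reached e' via the path prefix spreads its probability uniformly over the R_n-neighbors of e'.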
Supervised Retrieval Model Based on PCRW
A retrieval model can rank target entities by linearly combining the distributions of different paths
–or in matrix form s = Aθ
This model can be optimized by maximizing the probability of the observed relevance
–Given a set of training data D = {(q^(m), A^(m), y^(m))}, where y_e^(m) = 1/0 indicates whether entity e is relevant to query q^(m)
This PCRW based model can significantly improve retrieval quality over the RWR based models (Lao & Cohen, ECML 2010)
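A hedged reconstruction of the scoring function and training objective (the logistic link \sigma and the L2 regularizer are assumptions; the slide only states maximum likelihood):

    s(e; \theta) = \sum_P \theta_P \, h_{q,P}(e),    or in matrix form    s = A\theta
    O(\theta) = \sum_m \sum_e [ y_e^{(m)} \log \sigma(A_e^{(m)} \theta) + (1 - y_e^{(m)}) \log(1 - \sigma(A_e^{(m)} \theta)) ] - \lambda ||\theta||^2

where column P of A^{(m)} holds the path distribution h_{q^(m),P}(·) evaluated at every candidate entity.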
Outline
Background
–Retrieval Models with PCRW (ECML PKDD 2010)
Efficient PCRW
–Comparing sampling and truncation strategies
The Need for Efficient PCRW
Random walk based models can be expensive to execute
–especially for dense graphs or long random walk paths
Popular speedup strategies:
–Sampling (fingerprinting) strategies (Fogaras, 2004)
–Truncation (pruning) strategies (Chakrabarti, 2007)
–Building two-level representations of graphs offline (Raghavan et al., 2003; He et al., 2007; Dalvi et al., 2008)
–Low-rank matrix approximation of the graph (Tong et al.)
–Precomputing Personalized PageRank Vectors (PPVs) for a small fraction of nodes (Chakrabarti)
In this study, we compare different sampling and truncation strategies applied to PCRW
Four Strategies for Efficient Random Walks
Fingerprinting (sampling)
–Simulate a large number of independent random walkers
Fixed Truncation
–At each step, drop entities whose probability falls below a fixed threshold
Beam Truncation
–At each step, keep only the W most probable entities (both truncation strategies are sketched below)
Weighted Particle Filtering
–A combination of exact inference and sampling
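A minimal Python sketch of the two truncation strategies applied to one step's entity distribution (the threshold eps, the beam width W, and the renormalization are illustrative assumptions, not values or details taken from the paper):

    def fixed_truncation(dist, eps=1e-3):
        # Drop entities whose probability falls below a fixed threshold, then renormalize.
        kept = {e: p for e, p in dist.items() if p >= eps}
        total = sum(kept.values())
        return {e: p / total for e, p in kept.items()} if total > 0 else {}

    def beam_truncation(dist, beam_width=100):
        # Keep only the W most probable entities, then renormalize.
        top = sorted(dist.items(), key=lambda item: item[1], reverse=True)[:beam_width]
        total = sum(p for _, p in top)
        return {e: p / total for e, p in top} if total > 0 else {}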
Weighted Particle Filtering
Start from exact inference; switch to sampling when the branching is heavy
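A sketch of one relation-following step under weighted particle filtering, matching the description above: expand a node exactly while the per-neighbor share of its probability stays above a minimum weight, and fall back to sampling a few weighted particles when the branching would split the mass more thinly than that (the eps_min switching rule and the particle count are assumptions of this sketch, not the paper's exact algorithm):

    import random
    from collections import defaultdict

    def particle_filtering_step(dist, neighbors, eps_min=1e-3, rng=random):
        # dist:      {entity: probability} reached so far along the path prefix
        # neighbors: {entity: [entities reachable via the next relation]}
        next_dist = defaultdict(float)
        for e, p in dist.items():
            out = neighbors.get(e, [])
            if not out:
                continue
            share = p / len(out)
            if share >= eps_min:
                # Light branching: propagate the exact distribution.
                for e2 in out:
                    next_dist[e2] += share
            else:
                # Heavy branching: send a few weighted particles instead of
                # enumerating every neighbor.
                n_particles = max(1, int(p / eps_min))
                for e2 in rng.choices(out, k=n_particles):
                    next_dist[e2] += p / n_particles
        return dict(next_dist)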
Experiment Setup
Data sources
–Yeast and fly databases
Automatically labeled tasks generated from publications
–Gene recommendation: author, year → gene
–Reference recommendation: title words, year → paper
–Expert finding: title words, genes → author
Data split
–2000 training, 2000 tuning, 2000 test queries
Time-variant graph (for training)
–each edge is tagged with a time stamp (year)
–when doing a random walk, only consider edges with time stamps earlier than the query
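A small sketch of the time-variant restriction described above (the adjacency representation is an assumption of this sketch): when answering a query issued in a given year, a walker only follows edges whose time stamp is earlier than that year.

    def allowed_neighbors(graph, entity, relation, query_year):
        # graph[entity][relation] is assumed to hold a list of (target, year) pairs.
        return [target for target, year in graph[entity].get(relation, [])
                if year < query_year]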
Example Features
A model trained for the reference recommendation task on the yeast data (feature numbers refer to relation paths in the original figure):
1) Papers co-cited with the on-topic papers
2) Aggregated citations of the on-topic papers
6) Resembles an ad-hoc retrieval system
7,8) Papers cited during the past two years
9) Well-cited papers
10,11) (Important) early papers about specific query terms (genes)
12,13) General papers published during the past two years
14) Old papers
Results on the Yeast Data
(results shown as figures; per-task annotations: T0 = 0.17s, L = 3; T0 = 1.6s, L = 4; T0 = 2.7s, L = 3)
Results on the Fly Data
(results shown as figures; per-task annotations: T0 = 0.15s, L = 3; T0 = 1.8s, L = 4; T0 = 0.9s, L = 3)
Observations
Sampling strategies are more efficient than truncation strategies
–at each step, the truncation strategies need to generate the exact distribution before truncating it
Particle filtering produces better MAP than fingerprinting
–by reducing the variance of the estimates
Retrieval quality is improved in some cases
–by producing better weights for the model
Summary
The End
Thanks! This work is supported by NSF grant IIS and NIH grant R01GM081293.
Some Evidence
Compares retrieval quality for two fixed models under various particle filtering settings at query time
–The models are trained with exact random walks and with particle filtering (ε_min = 0.001) for the reference recommendation task on the yeast data
Some More Evidence
Particle filtering keeps the distributions sparse and reduces the weights of dense features
–(Left) shows, for each relation path, the average number of nodes with non-zero probability
–(Right) shows, for each relation path, the ratio of its weight in the particle filtering model to its weight in the exact PCRW model; ‘+’ and ‘−’ indicate positive and negative weights