Learning Relationships Defined by Linear Combinations of Constrained Random Walks
William W. Cohen, Machine Learning Department and Language Technologies Institute, School of Computer Science, Carnegie Mellon University
Joint work with: Ni Lao (Language Technologies Institute) and Tom Mitchell (Machine Learning Department)

Motivation: The simple and the complex
In computer science there is a tension between:
–The elegant, simple, and general
–The messy, complex, and problem-specific
Graphs are:
–Simple: easy to analyze and store
–General: they appear in many contexts, and are often a natural representation of important aspects of information
–Well-understood: for instance, standard techniques like PPR/RWR exist for estimating the similarity of two nodes in a graph

Motivation: The simple and the complex
The real world is complex… and learning is a way to incorporate that complexity into our models without sacrificing elegance and generality.

Motivation: The simple and the complex
Graphs are:
–Simple: easy to analyze and store
–General
–Well-understood: for instance, standard techniques like PPR/RWR exist for estimating the similarity of two nodes in a graph
In this talk:
–Learning similarity-like relationships in graphs, based on RWR/PPR
–Several applications

Similarity Queries on Graphs
1) Given type t* and node x in G, find y such that T(y)=t* and y~x.
2) Given type t* and node set X, find y such that T(y)=t* and y~X.
Nearest-neighbor classification:
–G contains feature nodes and instance nodes
–A link (x,f) means feature f is true for instance x
–x* is a query instance; y~x* means y is likely of the same class as x*
Information retrieval:
–G contains word nodes and document nodes
–A link (w,d) means word w is in document d
–X is a set of keywords; y~X means y is likely to be relevant to X
Database retrieval:
–G encodes a database
–… ?

BANKS: Browsing and Keyword Search [Aditya et al., VLDB 2002]
The database is modeled as a graph:
–Nodes = tuples
–Edges = references between tuples; edges are directed and indicate foreign keys, inclusion dependencies, …
[Figure: example graph with a paper node "MultiQuery Optimization" connected by "writes"/"author" edges to author nodes "S. Sudarshan" and "Prasan Roy"]

Query: {"sudarshan", "roy"}
Answer: a subtree from the graph
[Figure: answer subtree, with paper "MultiQuery Optimization" connected by "writes"/"author" edges to "S. Sudarshan" and "Prasan Roy"]

Query: "sudarshan", "roy"
Answer: a subtree from the graph, interpreted as a conjunction:
y: paper(y) & y~"sudarshan"  AND  w: paper(w) & w~"roy"

Similarity Queries on Graphs
1) Given type t* and node x in G, find y such that T(y)=t* and y~x.
2) Given type t* and node set X, find y such that T(y)=t* and y~X.
Nearest-neighbor classification, information retrieval, and database retrieval: core tasks in CS.
Evaluation: specific families of tasks for scientific publications:
–Citation recommendation: given the title, year, …, of paper p, what papers should be cited by p?
–Expert-finding: given keywords, genes, …, suggest a possible author
–"Entity recommendation": given title, author, year, …, predict entities mentioned in a paper (e.g., gene-protein entities); can improve NER
–Literature recommendation: given a researcher and a year, suggest papers to read that year
–Inference in a DB of automatically-extracted facts

Outline
–Motivation for Learning Similarity in Graphs
–A Baseline Similarity Metric
–Some Literature-related Tasks
–The Path Ranking Algorithm (Learning Method): Motivation, Details
–Results: BioLiterature tasks
–Results: KB Inference tasks

Defining Similarity on Graphs: PPR/RWR [Personalized PageRank, 1999]
Given type t* and node x, find y such that T(y)=t* and y~x.
Similarity is defined by a "damped" version of PageRank. Similarity between nodes x and y:
–"Random surfer model": from a node z, with probability α, teleport back to x ("reset"); else pick y' uniformly from { y' : z → y' } and repeat from node y'.
–Similarity x~y = Pr( surfer is at y | reset is always to x )
Intuitively, x~y is the sum of the weights of all paths from x to y, where the weight of a path decreases exponentially with its length.
Can easily extend to a "query" set X = {x_1, …, x_k}.
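A minimal sketch of this computation by straightforward power iteration, assuming a toy adjacency-list graph (illustrative only, not the speakers' implementation):

```python
import numpy as np

def personalized_pagerank(adj, x, alpha=0.15, iters=50):
    """PPR/RWR by power iteration: similarity of every node to seed x.

    adj: dict node -> list of out-neighbors (a hypothetical toy format).
    alpha: probability of teleporting ("resetting") back to the seed x.
    """
    nodes = list(adj)
    idx = {n: i for i, n in enumerate(nodes)}
    n = len(nodes)
    # Column-stochastic transition matrix: uniform over out-links.
    M = np.zeros((n, n))
    for z, outs in adj.items():
        for y in outs:
            M[idx[y], idx[z]] = 1.0 / len(outs)
    r = np.zeros(n)
    r[idx[x]] = 1.0                      # reset distribution: always back to x
    p = r.copy()
    for _ in range(iters):
        p = alpha * r + (1 - alpha) * (M @ p)
    return {node: p[idx[node]] for node in nodes}

# x~y scores on a tiny 3-node graph
print(personalized_pagerank({"x": ["a", "b"], "a": ["b"], "b": ["x"]}, "x"))
```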

Some BioLiterature Retrieval Tasks
Data used in this study:
–Yeast: 0.2M nodes, 5.5M links
–Fly: 0.8M nodes, 3.5M links
[Figure: the fly graph]

Learning Proximity Measures for BioLiterature Retrieval Tasks
Tasks:
–Gene recommendation: author, year → gene
–Reference recommendation: words, year → paper
–Expert-finding: words, genes → author
–Literature recommendation: author, [papers read in past] → papers
Baseline method:
–Typed RWR proximity methods
Baseline learning method:
–Parameterize Prob(walk edge | edge label = L) and tune the parameters for each label L (somehow…)
[Figure: graph with per-label edge probabilities, e.g. P(write)=b, P(cite)=a, P(NE)=c, P(bindTo)=d, P(express)=d]

Path-based vs. edge-label-based learning
Learning one parameter per edge label is limited, because the context in which an edge label appears is ignored. E.g. (observed from real data; task: find papers to read), learned paths with comments like:
–"Don't read about genes I've already read about"
–"Do read papers from my favorite authors"
Instead, we will learn path-specific parameters. Paths will be interpreted as constrained random walks that give a similarity-like weight to every reachable node:
–Step 0: D_0 = {a}; start at author a
–Step 1: D_1: uniform over all papers p read by a
–Step 2: D_2: authors a' of papers in D_1, weighted by the number of papers in D_1 published by a'
–Step 3: D_3: papers p' published by a', weighted by …

A Limitation of RWR Learning Methods
Learning one parameter per edge label is limited, because the context in which an edge label appears is ignored. E.g. (observed from real data; task: find papers to read), learned paths with comments like:
–"Don't read about genes I've already read about"; "Do read papers from my favorite authors"
–"Do read about the genes I'm working on"; "Don't read papers from my own lab"
Instead, we will learn path-specific parameters.

Path-Constrained Random Walks as the Basis of a Proximity Measure
Our work (Lao & Cohen, ECML 2010):
–Learn a weighted combination of simple "path experts", each of which corresponds to a particular labeled path through the graph.
Citation recommendation, an example:
–In the TREC-CHEM Prior Art Search Task, researchers found that it is more effective to first find patents about the topic, then aggregate their citations.
–Our proposed model can discover this kind of retrieval scheme and assign proper weights to combine such paths.
[Figure: example weighted paths]

Definitions
A graph G = (T, R, X, E) consists of:
–a set of entity types T = {T} and a set of relations R = {R}
–a set of entities (nodes) X = {x}, where each node x has a type from T
–a set of edges e = (x, y), where each edge has a relation label from R
A path P = (R_1, …, R_n) is a sequence of relations.
Path-constrained random walk:
–Given a query set S of "source" nodes
–The distribution D_0 at time 0 is uniform over s in S
–The distribution D_t at time t>0 is formed by: pick x from D_{t-1}, then pick y uniformly from all things related to x by an edge labeled R_t
–Notation: f_P(s,t) = Prob(s → t; P)
–In our examples the type of t is determined by R_n
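The D_0, D_1, …, D_n recurrence is easy to implement directly. A hedged sketch, assuming a toy edge-list format keyed by (node, relation); it returns f_P(S, t) for every reachable target t:

```python
from collections import defaultdict

def pcrw(graph, sources, path):
    """Path-constrained random walk: D_0 is uniform over the sources; at
    step t the walker follows only edges labeled path[t-1], uniformly.

    graph: dict (node, relation) -> list of neighbors (assumed toy format).
    """
    dist = {s: 1.0 / len(sources) for s in sources}
    for rel in path:
        nxt = defaultdict(float)
        for node, prob in dist.items():
            nbrs = graph.get((node, rel), [])
            if not nbrs:
                continue              # the walk dies where no R_t-edge exists
            for y in nbrs:
                nxt[y] += prob / len(nbrs)
        dist = dict(nxt)
    return dist

# toy: author a -> papers a read -> those papers' authors
g = {("a", "read"): ["p1", "p2"],
     ("p1", "writtenBy"): ["a2"], ("p2", "writtenBy"): ["a2", "a3"]}
print(pcrw(g, ["a"], ["read", "writtenBy"]))   # {'a2': 0.75, 'a3': 0.25}
```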

Path Ranking Algorithm (PRA) [Lao & Cohen, ECML 2010]
A PRA model scores a source-target node pair (s, t) by a linear function of their path features:
score(s, t) = Σ_{P} θ_P f_P(s, t)
where P ranges over the set of all relation paths with length ≤ L (in some cases restricted to paths with support in the data; see [Lao and Cohen, EMNLP 2011]).
For a relation R and a set of node pairs {(s_i, t_i)}, we construct a training dataset D = {(x_i, y_i)}, where x_i is a vector of all the path features for (s_i, t_i), and y_i indicates whether R(s_i, t_i) is true or not. θ is estimated using L1- and L2-regularized logistic regression.
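To make the scoring and fitting step concrete, here is a toy sketch using scikit-learn's logistic regression; the feature values and regularization settings are illustrative assumptions, not the paper's data or hyperparameters:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Row i holds the PCRW probabilities f_P(s_i, t_i) for each path P with
# length <= L; y_i says whether R(s_i, t_i) holds. (Made-up numbers.)
X = np.array([[0.7, 0.0, 0.2],    # features for pair (s1, t1)
              [0.1, 0.3, 0.0],    # (s2, t2)
              [0.0, 0.0, 0.1]])   # (s3, t3)
y = np.array([1, 1, 0])

# L2-regularized logistic regression over the path features (an L1 penalty
# with a liblinear/saga solver would cover the L1-regularized variant).
model = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
theta = model.coef_[0]            # one learned weight per "path expert"
scores = X @ theta                # score(s, t) = sum_P theta_P * f_P(s, t)
print(theta, scores)
```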

Supervised PCRW Retrieval Model
A retrieval model ranks target entities by linearly combining the distributions of different paths. This model can be optimized by maximizing the probability of the observed relevance:
–Given a set of training data D = {(q^(m), A^(m), y^(m))}, with y_e^(m) = 1/0 indicating whether entity e is relevant

Parameter Estimation (Details)
Given a set of training data D = {(q^(m), A^(m), y^(m))}, m = 1…M, with y^(m)(e) = 1/0, we can define a regularized objective function, using the average log-likelihood o_m(θ) as the per-query objective:
–P(m): the index set of relevant entities
–N(m): the index set of irrelevant entities (how to choose them is discussed below)
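The formula itself did not survive on this slide; a plausible reconstruction from the ingredients named above (average log-likelihood over P(m) and N(m), with L1/L2 regularization, in the style of Lao & Cohen, ECML 2010) is:

```latex
O(\theta) = \sum_{m=1}^{M} o_m(\theta) - \lambda_1 \lVert \theta \rVert_1 - \lambda_2 \lVert \theta \rVert_2^2,
\qquad
o_m(\theta) = \frac{1}{|P(m)|} \sum_{i \in P(m)} \ln p_i^{(m)}
            + \frac{1}{|N(m)|} \sum_{i \in N(m)} \ln\bigl(1 - p_i^{(m)}\bigr),
```

where p_i^(m) = σ(θᵀ x_i^(m)) is the predicted relevance of entity i for query m.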

Parameter Estimation (Details), cont.
Selecting the negative entity set N(m):
–There are few positive entities vs. thousands (or millions) of negative entities
–First sort all the negative entities with an initial model (uniform weight 1.0)
–Then take the negative entities at the k(k+1)/2-th positions, for k = 1, 2, …
The gradient is used with orthant-wise L-BFGS (Andrew & Gao, 2007) to estimate θ:
–Efficient, and can deal with L1 regularization
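The k(k+1)/2 rule is simple to implement. A sketch, assuming the negatives are already ranked by the initial uniform-weight model:

```python
def select_negatives(ranked_negatives):
    """Keep the negatives at (1-based) positions k(k+1)/2 = 1, 3, 6, 10, ...
    so the sample is biased toward hard, highly ranked negatives while
    still reaching into the tail of the ranking."""
    chosen, k = [], 1
    while k * (k + 1) // 2 <= len(ranked_negatives):
        chosen.append(ranked_negatives[k * (k + 1) // 2 - 1])
        k += 1
    return chosen

print(select_negatives(list("abcdefghijklmno")))  # ['a', 'c', 'f', 'j', 'o']
```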

L2 Regularization
–Improves retrieval quality (on the citation recommendation task)

L1 Regularization
–Does not improve retrieval quality…

L1 Regularization
–… but can help reduce the number of features

Extension 1: Query-Independent Paths
PageRank (and other query-independent rankings):
–assign an importance score (query-independent) to each web page
–which is later combined with a relevance score (query-dependent)
We generalize PageRank to heterogeneous graphs:
–We add to each query a special entity e_0 of a special type T_0
–T_0 is related to all other entity types, and each type is related to all instances of that type
–This defines a set of PageRank-like, query-independent relation paths
–Compute f(* → t; P) offline for efficiency
[Figure: example query-independent paths, e.g. all papers → well-cited papers; all authors → productive authors]
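A sketch of the graph augmentation this describes, reusing the toy edge-list format from the walk sketch above; the relation names are illustrative assumptions:

```python
def add_query_independent_entity(graph, type_instances):
    """Add a special entity e0 of special type T0, related to every entity
    type; each type is in turn related to all of its instances.

    type_instances: dict type -> list of instances (toy format).
    """
    graph[("e0", "hasType")] = list(type_instances)      # e0 -> every type
    for t, insts in type_instances.items():
        graph[(t, "hasInstance")] = list(insts)          # type -> instances
    return graph

# A PageRank-like, query-independent path such as
# ("hasType", "hasInstance", "citedBy") can then be evaluated once, offline,
# with the same pcrw() walker as above, started from the singleton set {"e0"}.
```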

Extension 2: Entity-specific rankings
There are entity-specific characteristics which cannot be captured by a general model:
–Some items are interesting to users because of features not captured in the data
–To model this, assume the identity of the entity matters
–Introduce new features f(s → t; P_{s,t}) to account for jumping from s to t, and new features f(* → t; P_{*,t})
–At each gradient step, add a few new features of this sort with the highest gradient, and count on regularization to avoid overfitting

Extension 3: Speeding up random walks
Prior work on speeding up personalized PageRank/RWR:
–Pre-computing components (e.g. Jeh & Widom, 2003)
–Sampling-based approaches (e.g. Fogaras et al., 2005)
–Pre-clustering data (e.g. Tong et al., 2006)
–Pruning approaches (e.g. Andersen et al., 2006)
We use a hybrid sampling/pruning approach ("weighted particle filtering" + "low-variance sampling") [Lao and Cohen, KDD 2010]:
–The same approximation is used at training and test time
–Speedups up to x with little loss (sometimes some gain!) in performance
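Of the two named ingredients, "low-variance sampling" is a standard resampling primitive; below is a generic sketch of it (systematic resampling of a weighted particle set), not the authors' exact implementation:

```python
import random

def low_variance_sample(particles, weights, n):
    """Systematic ("low variance") resampling: draw one random offset, then
    sweep n evenly spaced pointers through the cumulative weights, so the
    resampled set tracks the weight distribution with minimal variance."""
    step = sum(weights) / n
    u = random.uniform(0.0, step)
    out, cum, i = [], weights[0], 0
    for _ in range(n):
        while u > cum:                 # advance to the particle covering u
            i += 1
            cum += weights[i]
        out.append(particles[i])
        u += step
    return out

print(low_variance_sample(["x", "y", "z"], [0.7, 0.2, 0.1], 10))
```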

Ext. 2 (cont.): Popular Entities
For a task with query type T_0 and target type T_q:
–Introduce a bias θ_e for each entity e in I_E(T_q)
–Introduce a bias θ_{e',e} for each entity pair (e', e), where e is in I_E(T_q) and e' is in I_E(T_0)
–[Scoring formula and its matrix form omitted]
Efficiency consideration:
–Only add the top J parameters (measured by |∂O(θ)/∂θ_e|) to the model at each L-BFGS iteration

Experiment Setup for BioLiterature
Data sources for bio-informatics:
–PubMed: an on-line archive of over 18 million biological abstracts
–PubMed Central (PMC): full-text copies of over 1 million of these papers
–Saccharomyces Genome Database (SGD): a database for yeast
–Flymine: a database for fruit flies
Tasks:
–Gene recommendation: author, year → gene
–Venue recommendation: genes, title words → journal
–Reference recommendation: title words, year → paper
–Expert-finding: title words, genes → author
Data split:
–2000 training, 2000 tuning, 2000 test queries
Time-variant graph:
–each edge is tagged with a time stamp (year)
–during a random walk, only edges earlier than the query are considered

BioLiterature: Some Results
Compare the MAP of PRA to:
–the RWR model
–query-independent paths (qip)
–popular-entity biases (pop)
[Results table omitted.] Except where marked †, all improvements are statistically significant at p < 0.05 by a paired t-test.

Example Path Features and their Weights
A PRA+qip+pop model trained for the citation recommendation task on the yeast data. [Figure of ranked paths omitted; its annotations:]
1) papers co-cited with on-topic papers
6) approx. standard IR retrieval
7, 8) papers cited during the past two years
9) well-cited papers
10, 11) key early papers about specific genes
12, 13) papers published during the past two years
14) old papers

Outline
–Motivation for Learning Similarity in Graphs
–A Baseline Similarity Metric
–Some Literature-related Tasks
–The Path Ranking Algorithm (Learning Method): Motivation, Details
–Results: BioLiterature tasks
–Results: KB Inference tasks [Lao, Mitchell, Cohen, EMNLP 2011]

Large-Scale Knowledge Bases
Large-scale collections of automatically extracted knowledge:
–KnowItAll (Univ. of Washington): 0.5B facts extracted from 0.1B web pages
–DBpedia (Univ. of Leipzig): 3.5M entities, 0.7B facts extracted from Wikipedia
–YAGO (Max Planck Institute): 2M entities, 20M facts extracted from Wikipedia and WordNet
–FreeBase: 20M entities, 0.3B links, integrated from different data sources and human judgments
–NELL (Never-Ending Language Learning, CMU): 0.85M facts extracted from 0.5B web pages

Inference in Noisy Knowledge Bases
Challenges:
–Robustness: extracted knowledge is incomplete and noisy
–Scalability: the knowledge base is large

The NELL Case Study
Never-Ending Language Learning: "a never-ending learning system that operates 24 hours per day, for years, to continuously improve its ability to read (extract structured facts from) the web" (Carlson et al., 2010).
–Closed-domain, semi-supervised extraction
–Combines multiple strategies: morphological patterns, textual context, HTML patterns, logical inference
[Figure: example beliefs]

A Link Prediction Task
We consider 48 relations for which the NELL database has more than 100 instances, and create two link prediction tasks for each relation:
–AthletePlaysInLeague(HinesWard, ?)
–AthletePlaysInLeague(?, NFL)
The nodes y actually known to satisfy R(x, ?) are treated as labeled positive examples, and all other nodes are treated as negative examples.

Current NELL method (baseline)
FOIL (Quinlan and Cameron-Jones, 1993) is a learning algorithm similar to decision trees, but for relational domains. NELL implements two assumptions for efficient learning:
–The predicates are functional, e.g. an athlete plays in at most one league
–Only find clauses that correspond to bounded-length paths of binary relations: relational pathfinding (Richards & Mooney, 1992)

Current NELL method (baseline), cont.
First-order logic is not great for handling uncertainty:
–FOIL can only combine rules with disjunctions, and therefore cannot leverage low-accuracy rules
–E.g. rules for teamPlaysSports: high accuracy but low recall

Experiments: Cross-Validation on KB Data (for parameter setting, etc.)
RWR: Random Walk with Restart (PPR). [Results table omitted.]
† Paired t-tests give p-values of 7×10^-3, 9×10^-4, 9×10^-8, and 4×10^-4.

Example Paths
[Figure omitted: example learned paths, annotated "synonyms of the query team".]

Evaluation by Mechanical Turk
There are many test queries per predicate:
–All entities of a predicate's domain/range, e.g. WorksFor(person, organization)
–On average 7,000 test queries for each functional predicate, and 13,000 for each non-functional predicate
Sampled evaluation:
–We only evaluate the top-ranked result for each query
–We sort the queries for each predicate according to the scores of their top-ranked results, then evaluate precision at the top 10, 100, and 1000 queries
Each belief is voted on by 5 workers:
–Workers are given assertions like "Hines Ward plays for the team Steelers", as well as Google search links for each entity

Evaluation by Mechanical Turk: Results
On 8 functional predicates where N-FOIL can successfully learn:
–PRA is comparable to N-FOIL for … but has significantly better …
On 8 randomly sampled non-functional (one-to-many) predicates:
–Slightly lower accuracy than on the functional predicates
[Table omitted: #rules learned by N-FOIL vs. #paths used by PRA per task, e.g. 2.1 (+37) rules for functional predicates. PRA: Path Ranking Algorithm.]

Outline
–Motivation for Learning Similarity in Graphs
–A Baseline Similarity Metric
–Some Literature-related Tasks
–The Path Ranking Algorithm (Learning Method): Motivation, Details
–Results: BioLiterature tasks
–Results: KB Inference tasks
–Conclusions

Summary/Conclusion
Learning is the way to make a clean, elegant formulation of a task work in the messy, complicated real world.
Learning how to navigate graphs is a significant, core task that models:
–Recommendation, expert-finding, …
–Information retrieval
–Inference in KBs
–…
It includes significant, core learning problems:
–Regularization/search of a huge feature space
–Discovery: long paths, lexicalized paths, …
–Incorporating knowledge of graph structure
–…

Thanks to:
–The dedicated and persistent …
–NSF grant IIS-…
–NIH grant R01GM…
–Gifts from Google
–The MLG organizers!