1
Statistical Relational Learning for Link Prediction
Alexandrin Popescul and Lyle H. Ungar
Presented by Ron Bjarnason
11 November 2003
2
Link Prediction
Link prediction is an important problem arising in many domains
–Web pages
–Computers
–Scientific publications
–Organizations
–People
Predicting the presence of links or connections in a domain is both important and hard to do well
3
Characteristics of Link Prediction Domains
Their nature is inherently multi-relational
–This makes the standard “flat” file domain representation inadequate
Data is often noisy or partially observed
–e.g. articles may be cited for any number of reasons, and those reasons are not fully observed
4
Typical Learning Approaches
Assume a one-table “flat” domain representation
Process of feature creation is decoupled from feature selection (and is often performed manually)
Relevant features may not be readily observed by human eyes
5
The “Full Join” Approach
Perform a full join on the entire database and statistically analyze the entries
This is both impractical and incorrect:
–Size is prohibitive
–Notion of an object is lost (each object is stored across multiple rows)
–Entries will be atomic attribute values, rather than results from a complex search
–Negates the option to introduce intelligent search heuristics
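The loss of object identity under a full join can be seen on a toy example. The sketch below is only an illustration: the relation names follow the deck, while the column names and rows are made up.

```python
import sqlite3

# Hypothetical toy data: one document with two authors and three citations.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Author (doc TEXT, person TEXT);
CREATE TABLE Citation (from_doc TEXT, to_doc TEXT);
INSERT INTO Author VALUES ('d1', 'ann'), ('d1', 'bob');
INSERT INTO Citation VALUES ('d1', 'd2'), ('d1', 'd3'), ('d1', 'd4');
""")

# Joining the two relations spreads the single document d1 over
# (number of authors) x (number of citations) rows.
rows = conn.execute("""
    SELECT a.doc, a.person, c.to_doc
    FROM Author a JOIN Citation c ON a.doc = c.from_doc
""").fetchall()

for row in rows:
    print(row)
```

One document becomes six rows, and the row count multiplies again with every further relation joined in, which is the size and object-identity problem the slide describes.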
6
The Relational Method
Integrates standard statistical modeling (logistic regression) with a process for systematically generating features from relational data
Feature generation is formulated as search in the space of relational database queries
Space bias can be controlled by specifying valid query types
–Aggregations or statistical operations
–Groupings
–Richer join conditions
–Arg-max based queries
Allows for discovery of complex, interesting relationships
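To make "features as database queries" concrete, here is a minimal sketch in which each candidate feature for a document pair is one aggregated SQL query. The schema, the rows, and the two example features (`cocitation_count`, `shared_author_count`) are assumptions chosen for illustration, not the features reported in the paper.

```python
import sqlite3

# Hypothetical toy schema; relation names follow the deck, columns are assumed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Citation (from_doc TEXT, to_doc TEXT);
CREATE TABLE Author (doc TEXT, person TEXT);
INSERT INTO Citation VALUES ('d1','d3'), ('d2','d3'), ('d4','d3'), ('d1','d2');
INSERT INTO Author VALUES ('d1','ann'), ('d2','ann'), ('d3','bob'), ('d4','cy');
""")

def cocitation_count(d1, d2):
    """Number of documents that cite both d1 and d2."""
    query = """
        SELECT COUNT(*) FROM Citation a
        JOIN Citation b ON a.from_doc = b.from_doc
        WHERE a.to_doc = ? AND b.to_doc = ?
    """
    return conn.execute(query, (d1, d2)).fetchone()[0]

def shared_author_count(d1, d2):
    """Number of authors common to d1 and d2."""
    query = """
        SELECT COUNT(*) FROM Author a
        JOIN Author b ON a.person = b.person
        WHERE a.doc = ? AND b.doc = ?
    """
    return conn.execute(query, (d1, d2)).fetchone()[0]

def feature_vector(d1, d2):
    """Scalar features for the candidate link d1 -> d2."""
    return [cocitation_count(d1, d2), shared_author_count(d1, d2)]

print(feature_vector("d1", "d2"))
```

Each function returns a single number per document pair, which is the form a logistic regression model can consume directly.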
7
Link Prediction in the Citeseer Domain
Can be used as a citation recommendation service
–User would provide an abstract, author names, possibly a partial reference list
Citeseer provides a rich set of relational data
–Texts of titles
–Abstracts and documents
–Citation information
–Author names and affiliations
–Conference or journal names
8
Methodology
Couple the two main processes
–Generation of feature candidates from relational data
–Selection among those candidates using statistical model selection criteria
9
Relational Feature Generation
The search formulation is based on the concept of refinement graphs
Start with the most general clauses and progress by refining them into more specialized clauses
10
Relational Feature Generation – Refinement Graphs
Directed acyclic graphs specifying the search space
Constrained by specifying legal clauses
–Negation and recursion disallowed
Structured by partial ordering of clauses
A search node is expanded (refined) to produce its most general specializations
ILP systems using refinement graph search usually apply two refinement operators
–Add a predicate to a clause
–A single variable substitution
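The two refinement operators can be sketched on a toy clause representation. Everything here is an assumption made for illustration: a clause body is a tuple of literals, each literal is a binary predicate over variables named `V0`, `V1`, ..., and `PREDICATES` is a hypothetical vocabulary. It is not the authors' implementation.

```python
from itertools import combinations

# Hypothetical vocabulary of binary relations.
PREDICATES = ("Citation", "Author")

def variables(clause):
    """Variables of a clause in order of first appearance."""
    seen = []
    for _, args in clause:
        for v in args:
            if v not in seen:
                seen.append(v)
    return seen

def fresh_vars(clause, k=2):
    """k variable names not yet used in the clause."""
    used = [int(v[1:]) for v in variables(clause)]
    start = max(used, default=-1) + 1
    return tuple(f"V{start + i}" for i in range(k))

def add_predicate(clause):
    """Operator 1: append one literal over fresh variables."""
    new_args = fresh_vars(clause)
    return [clause + ((p, new_args),) for p in PREDICATES]

def substitute_variable(clause):
    """Operator 2: unify one pair of distinct variables."""
    refined = []
    for keep, drop in combinations(variables(clause), 2):
        refined.append(tuple(
            (p, tuple(keep if v == drop else v for v in args))
            for p, args in clause
        ))
    return refined

def refine(clause):
    """Children of a node in the refinement graph."""
    return add_predicate(clause) + substitute_variable(clause)

root = (("Citation", ("V0", "V1")),)
for child in refine(root):
    print(child)
```

Repeatedly applying `refine` walks from general clauses to more specialized ones; a real system would also prune illegal clauses and avoid revisiting equivalent ones.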
11
Relational Feature Generation – Aggregates
Query results are aggregated to produce scalar numeric values to be used in statistical learning
Any statistical aggregate can be valid, but some are expected to be more useful than others
–Count
–Average
–Max
–Min
–Mode
–Empty
Aggregations are considered for inclusion at each node, but not factored into further search
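As a rough sketch of the aggregates listed above, the function below turns the numeric result set of one query into scalar features. Using 0.0 as the value of average, max, min and mode on an empty result is an assumption made here so that every feature stays numeric; the slides do not say how the empty case is handled.

```python
from collections import Counter

def aggregate_features(values):
    """Map a query's numeric result set to scalar features."""
    values = list(values)
    if not values:
        # Assumed convention: empty results get 0.0 and the "empty" flag is set.
        return {"count": 0, "average": 0.0, "max": 0.0,
                "min": 0.0, "mode": 0.0, "empty": 1}
    return {
        "count": len(values),
        "average": sum(values) / len(values),
        "max": max(values),
        "min": min(values),
        # Most frequent value; ties go to the value seen first.
        "mode": Counter(values).most_common(1)[0][0],
        "empty": 0,
    }

print(aggregate_features([3, 1, 3]))
print(aggregate_features([]))
```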
12
Relational Feature Selection
Logistic regression is used for binary classification problems
Regression coefficients are learned to maximize the likelihood function
Stepwise model selection and the Bayesian Information Criterion (BIC) are used to avoid overfitting
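For reference, the quantities named on this slide can be written out as follows. The notation is an assumption: $x_{ij}$ is the value of feature $j$ for example $i$, $y_i \in \{0, 1\}$ marks whether the link exists, $k$ is the number of fitted parameters and $n$ the number of training examples. The BIC is given in one common sign convention, where lower is better; the paper may state it differently.

```latex
\begin{align}
  p_i &= P(y_i = 1 \mid x_i)
       = \frac{1}{1 + \exp\!\bigl(-(\beta_0 + \sum_{j=1}^{k-1} \beta_j x_{ij})\bigr)} \\
  \ell(\beta) &= \sum_{i=1}^{n} \bigl[\, y_i \log p_i + (1 - y_i) \log (1 - p_i) \,\bigr] \\
  \mathrm{BIC} &= -2\,\ell(\hat{\beta}) + k \log n
\end{align}
```

Under this convention, a stepwise procedure keeps a candidate feature only if the gain in log-likelihood outweighs the $\log n$ penalty for the extra parameter.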
13
Tasks and Data – IID Violation
The relational structure violates the assumption of independence
This can be remedied by choosing the right features
When the right features are used, the observations are independent given the features
14
Two Prediction Tasks
1. The identity of all objects is known. Some link structure is known. Predict unobserved links.
2. New objects arrive. Predict their links.
–What do we know about the objects?
–Some of their links
–Some of their attributes
This paper presents results for task 1
15
The Citeseer Environment
271,343 documents
1,092,200 citations
Five data sets defined
–Four data sets consist of links among documents containing a certain query phrase (e.g. “artificial intelligence”)
–Fifth data set includes all documents
16
Learning Methodology
Populate three relations: Citation, Author and PublishedIn
Sample 2,500 citations each of
–Positive training examples (from available links)
–Negative training examples (absence of a link)
–Positive test examples
–Negative test examples
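The sampling step can be sketched as below. This is a minimal illustration, not the authors' procedure: the function name `sample_examples`, drawing negatives uniformly from ordered document pairs with no link, and keeping training and test examples disjoint are all assumptions.

```python
import random

def sample_examples(docs, citations, n, seed=0):
    """Draw n positive and n negative pairs for both training and test.

    docs: list of unique document ids
    citations: iterable of (citing, cited) pairs
    """
    rng = random.Random(seed)
    linked = set(citations)

    # Positives come from observed links; sample 2n and split in half.
    positives = rng.sample(sorted(linked), 2 * n)

    # Negatives are ordered pairs of distinct documents with no link.
    negatives = set()
    while len(negatives) < 2 * n:
        a, b = rng.sample(docs, 2)
        if (a, b) not in linked:
            negatives.add((a, b))
    negatives = sorted(negatives)
    rng.shuffle(negatives)

    return {
        "train_pos": positives[:n], "test_pos": positives[n:],
        "train_neg": negatives[:n], "test_neg": negatives[n:],
    }

# Toy data; the deck uses n = 2,500 on the Citeseer collection.
toy_docs = ["d0", "d1", "d2", "d3", "d4", "d5"]
toy_citations = [("d0", "d1"), ("d0", "d2"), ("d1", "d2"), ("d2", "d3"),
                 ("d3", "d4"), ("d4", "d5"), ("d5", "d0"), ("d1", "d4")]
split = sample_examples(toy_docs, toy_citations, n=2)
print(split)
```

Equal numbers of positives and negatives correspond to the balanced priors mentioned with the results.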
17
Learning Methodology
Remove the test-set citations from the database (but no other relevant information)
Remove the training-set citations from the database (so answers are not contained in background information)
Perform learning
–Using citations only
–Using all relevant information (citation, authors and venue)
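The removal step amounts to a set difference between all known citations and the sampled positives. A minimal sketch, with the hypothetical helper name `build_background` and made-up pairs:

```python
def build_background(citations, train_pos, test_pos):
    """Citation links left in the database after the sampled links are removed."""
    held_out = set(train_pos) | set(test_pos)
    return set(citations) - held_out

# Hypothetical toy data.
all_citations = {("d0", "d1"), ("d1", "d2"), ("d2", "d3"), ("d3", "d4")}
background = build_background(
    all_citations,
    train_pos=[("d0", "d1")],
    test_pos=[("d3", "d4")],
)
print(sorted(background))
```

Features are then computed against `background` only, so a feature query can never read off the very link it is asked to predict.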
18
Results: Training and Test Set Accuracies (%) – Balanced Priors
(BK = background knowledge used)

Dataset                     BK: Citation       BK: All
                            Train    Test      Train    Test
“artificial intelligence”   90.24    89.68     92.60    92.14
“data mining”               87.40    87.20     89.70    89.18
“information retrieval”     85.98    85.34     88.88    88.82
“machine learning”          89.40    89.14     91.42    91.14
Entire collection           92.80    92.28     93.66    93.22
19
The End