1 Learning the Structure of Markov Logic Networks Stanley Kok & Pedro Domingos Dept. of Computer Science and Eng. University of Washington

2 Overview  Motivation  Background  Structure Learning Algorithm  Experiments  Future Work & Conclusion

3 Motivation  Statistical Relational Learning (SRL) combines the benefits of: Statistical Learning: uses probability to handle uncertainty in a robust and principled way Relational Learning: models domains with multiple relations

4 Motivation  Many SRL approaches combine a logical language and Bayesian networks e.g. Probabilistic Relational Models [Friedman et al., 1999]  The need to avoid cycles in Bayesian networks causes many difficulties [Taskar et al., 2002]  Researchers therefore started using Markov networks instead

5-6 Motivation  Relational Markov Networks [Taskar et al., 2002] conjunctive database queries + Markov networks Require space exponential in the size of the cliques  Markov Logic Networks [Richardson & Domingos, 2004] First-order logic + Markov networks Compactly represent large cliques Did not learn structure (used external ILP system)  This paper develops a fast algorithm that learns MLN structure Most powerful SRL learner to date

7 Overview  Motivation  Background  Structure Learning Algorithm  Experiments  Future Work & Conclusion

8 Markov Logic Networks  First-order KB: a set of hard constraints If a world violates even one formula, it has zero probability  MLNs soften these constraints It is OK to violate formulas; the fewer formulas a world violates, the more probable it is Each formula is given a weight that reflects how strong a constraint it is

9 MLN Definition  A Markov Logic Network (MLN) is a set of pairs (F, w), where F is a formula in first-order logic and w is a real number  Together with a finite set of constants, it defines a Markov network with One node for each grounding of each predicate in the MLN One feature for each grounding of each formula F in the MLN, with the corresponding weight w

10 Ground Markov Network  Formula (weight 2.7): AdvisedBy(S,P) ⇒ Student(S) ∧ Professor(P)  Constants: STAN, PEDRO  Ground predicate nodes: Student(STAN), Student(PEDRO), Professor(STAN), Professor(PEDRO), AdvisedBy(STAN,PEDRO), AdvisedBy(PEDRO,STAN), AdvisedBy(STAN,STAN), AdvisedBy(PEDRO,PEDRO)
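To make the grounding step concrete, here is a minimal Python sketch (mine, not the authors' code; the predicate names, constants, formula, and weight follow the slide, everything else is illustrative) that enumerates the ground predicates and the ground features produced by the single formula over the constants STAN and PEDRO:

from itertools import product

constants = ["STAN", "PEDRO"]
weight = 2.7

def implication(advised_by, student, professor):
    # Truth value of AdvisedBy(S,P) => Student(S) ^ Professor(P)
    return (not advised_by) or (student and professor)

# One node per grounding of each predicate.
nodes = (
    [("Student", c) for c in constants]
    + [("Professor", c) for c in constants]
    + [("AdvisedBy", s, p) for s, p in product(constants, repeat=2)]
)

# One feature per grounding of the formula, carrying the formula's weight.
# Each feature records the ground atoms it depends on and its truth function.
features = [
    {
        "atoms": [("AdvisedBy", s, p), ("Student", s), ("Professor", p)],
        "truth": implication,
        "weight": weight,
    }
    for s, p in product(constants, repeat=2)
]

print(len(nodes), "ground predicates,", len(features), "ground features")

Each ground feature couples the ground atoms that one grounding of the formula mentions, and all four features share the weight 2.7; these are exactly the nodes and cliques of the ground Markov network on the slide.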

11-15 MLN Model  P(X = x) = (1/Z) exp( Σ_i w_i n_i(x) )  x: vector of value assignments to the ground predicates  Z: partition function; sums over all possible value assignments to the ground predicates  w_i: weight of the i-th formula  n_i(x): number of true groundings of the i-th formula
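To see the model in code, the brute-force sketch below (an assumed toy implementation, only workable for tiny domains because Z enumerates all 2^8 assignments here) computes P(X = x) exactly for the two-constant example above:

import math
from itertools import product

constants = ["STAN", "PEDRO"]
w = 2.7  # weight of the single formula

# All ground atoms, in a fixed order.
atoms = (
    [("Student", c) for c in constants]
    + [("Professor", c) for c in constants]
    + [("AdvisedBy", s, p) for s, p in product(constants, repeat=2)]
)

def n_true_groundings(world):
    # n_i(x): true groundings of AdvisedBy(S,P) => Student(S) ^ Professor(P)
    return sum(
        (not world[("AdvisedBy", s, p)])
        or (world[("Student", s)] and world[("Professor", p)])
        for s, p in product(constants, repeat=2)
    )

def all_worlds():
    # Every value assignment to the 8 ground predicates (2^8 worlds).
    for bits in product([False, True], repeat=len(atoms)):
        yield dict(zip(atoms, bits))

# Partition function Z: sum over all assignments of exp(sum_i w_i n_i(x)).
Z = sum(math.exp(w * n_true_groundings(x)) for x in all_worlds())

def prob(world):
    # P(X = x) = exp(sum_i w_i n_i(x)) / Z
    return math.exp(w * n_true_groundings(world)) / Z

print(prob({a: False for a in atoms}))  # probability of the all-false world

The exponential cost of Z is exactly why exact likelihood computations are avoided in the weight-learning slides that follow.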

16-18 MLN Weight Learning  Likelihood is a concave function of the weights  Quasi-Newton methods find the optimal weights, e.g. L-BFGS [Liu & Nocedal, 1989]  SLOW: evaluating the likelihood and its gradient requires inference over the ground network, a #P-complete problem

19-20 MLN Weight Learning  R&D instead optimized pseudo-likelihood [Besag, 1975]: the product, over all ground predicates X_l, of P(X_l = x_l | Markov blanket of X_l)  This avoids computing the partition function Z
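A hedged sketch of pseudo-likelihood weight learning (this is not R&D's implementation; count_vector is an assumed helper that returns the per-formula true-grounding counts n_i(x), and SciPy's generic L-BFGS-B routine stands in for the quasi-Newton optimizer): each ground atom contributes the log-probability of its observed value against the value obtained by flipping it, so the partition function never has to be computed.

import numpy as np
from scipy.optimize import minimize

def pseudo_log_likelihood(weights, world, atoms, count_vector):
    # Sum over ground atoms of log P(X_l = x_l | Markov blanket of X_l).
    # count_vector(world) returns the vector of true-grounding counts n_i(x).
    pll = 0.0
    s_obs = float(np.dot(weights, count_vector(world)))
    for atom in atoms:
        flipped = dict(world)
        flipped[atom] = not flipped[atom]
        s_flip = float(np.dot(weights, count_vector(flipped)))
        # log P(observed value | Markov blanket) = s_obs - log(e^s_obs + e^s_flip)
        pll += s_obs - np.logaddexp(s_obs, s_flip)
    return pll

def learn_weights(world, atoms, count_vector, num_formulas):
    # Maximize the pseudo-log-likelihood with L-BFGS (by minimizing its negation).
    objective = lambda w: -pseudo_log_likelihood(w, world, atoms, count_vector)
    return minimize(objective, x0=np.zeros(num_formulas), method="L-BFGS-B").x

# Toy usage: one formula F(a) => G(a) over constants {A, B}, one violated grounding.
atoms = [("F", "A"), ("G", "A"), ("F", "B"), ("G", "B")]
world = {("F", "A"): True, ("G", "A"): True, ("F", "B"): True, ("G", "B"): False}
count_vector = lambda x: np.array(
    [float(sum((not x[("F", c)]) or x[("G", c)] for c in "AB"))]
)
print(learn_weights(world, atoms, count_vector, num_formulas=1))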

21 MLN Structure Learning  R&D “learned” MLN structure in two disjoint steps: Learn first-order clauses with an off-the-shelf ILP system (CLAUDIEN [De Raedt & Dehaspe, 1997]) Learn clause weights by optimizing pseudo-likelihood  Unlikely to give the best results, because CLAUDIEN finds clauses that hold with some accuracy/frequency in the data; it does not find the clauses that maximize the data’s (pseudo-)likelihood

22 Overview  Motivation  Background  Structure Learning Algorithm  Experiments  Future Work & Conclusion

23 MLN Structure Learning  This paper develops an algorithm that: Learns first-order clauses by directly optimizing pseudo-likelihood Is fast enough to be practical Performs better than R&D, pure ILP, purely KB-based, and purely probabilistic approaches

24 Structure Learning Algorithm  High-level algorithm REPEAT MLN ← MLN ∪ FindBestClauses(MLN) UNTIL FindBestClauses(MLN) returns NULL  FindBestClauses(MLN) Create candidate clauses FOR EACH candidate clause c Compute increase in evaluation measure of adding c to MLN RETURN k clauses with greatest increase
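In code, the high-level loop might look like the following sketch (my paraphrase of the pseudocode above; create_candidates and score_gain are placeholders for the clause-construction operators and the evaluation measure described next):

def find_best_clauses(mln, data, create_candidates, score_gain, k=1):
    # Score every candidate and keep the k that most improve the evaluation measure.
    scored = []
    for clause in create_candidates(mln):
        gain = score_gain(mln, clause, data)  # increase in evaluation measure
        if gain > 0:
            scored.append((gain, clause))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [clause for _, clause in scored[:k]]

def learn_structure(mln, data, create_candidates, score_gain, k=1):
    # REPEAT  MLN <- MLN ∪ FindBestClauses(MLN)  UNTIL nothing improving is returned.
    mln = list(mln)
    while True:
        best = find_best_clauses(mln, data, create_candidates, score_gain, k)
        if not best:
            return mln
        mln.extend(best)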

25 Structure Learning  Evaluation measure  Clause construction operators  Search strategies  Speedup techniques

26 Evaluation Measure  R&D used pseudo-log-likelihood  This gives undue weight to predicates with large # of groundings

27-30 Evaluation Measure  Weighted pseudo-log-likelihood (WPLL): WPLL(x) = Σ_r c_r Σ_{k=1..g_r} log P_w(X_{r,k} = x_{r,k} | MB_x(X_{r,k}))  c_r: weight given to predicate r  the inner sum runs over the g_r groundings of predicate r  each term is a CLL: the conditional log-likelihood of a ground predicate given its Markov blanket  Gaussian weight prior  Structure prior
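A small sketch of the WPLL itself (assumptions: the cll helper computes a ground atom's conditional log-likelihood given its Markov blanket, and c_r is set to 1 over the number of groundings of r, which is one natural way to stop heavily-grounded predicates from dominating; the Gaussian weight prior and structure prior are omitted):

def wpll(atoms_by_predicate, cll):
    # Weighted pseudo-log-likelihood.
    # atoms_by_predicate: dict mapping predicate name r -> list of its ground atoms
    # cll(atom): conditional log-likelihood of the atom's observed value given its
    #            Markov blanket (as in ordinary pseudo-likelihood)
    total = 0.0
    for predicate, atoms in atoms_by_predicate.items():
        c_r = 1.0 / len(atoms)  # assumed choice of per-predicate weight
        total += c_r * sum(cll(atom) for atom in atoms)
    return total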

31 Clause Construction Operators  Add a literal (negative/positive)  Remove a literal  Flip signs of literals  Limit # of distinct variables to restrict search space
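As an illustration, the operators could be written roughly as below (a hypothetical clause representation is assumed: a clause is a set of (sign, predicate, args) literals, with variables as lowercase strings; the distinct-variable cap implements the last bullet):

def variables_in(clause):
    # Distinct variables in a clause; variables are lowercase strings, constants are not.
    return {a for _, _, args in clause for a in args if a[0].islower()}

def add_literal(clause, literal, max_vars=4):
    # Add a positive or negative literal, respecting the distinct-variable limit.
    new_clause = clause | {literal}
    return new_clause if len(variables_in(new_clause)) <= max_vars else None

def remove_literal(clause, literal):
    # Drop one literal (keep the clause non-empty).
    return clause - {literal} if len(clause) > 1 else None

def flip_sign(clause, literal):
    # Replace a literal by its negation.
    sign, predicate, args = literal
    return (clause - {literal}) | {(not sign, predicate, args)}

# Example: the clause  Student(x) v ¬AdvisedBy(x, y)
clause = {(True, "Student", ("x",)), (False, "AdvisedBy", ("x", "y"))}
print(add_literal(clause, (True, "Professor", ("y",))))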

32 Beam Search  Same as that used in ILP & rule induction  Repeatedly find the single best clause
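A minimal beam-search sketch (assumed helpers: refine applies the clause-construction operators to a clause, score returns its WPLL gain; the beam width and stopping rule are simplified relative to the real system):

def beam_search(initial_clause, refine, score, beam_width=5, max_steps=10):
    # Repeatedly find the single best clause reachable from the initial one,
    # keeping only the beam_width best refinements at each step.
    beam = [initial_clause]
    best_clause, best_score = initial_clause, score(initial_clause)
    for _ in range(max_steps):
        candidates = [c for clause in beam for c in refine(clause)]
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        beam = candidates[:beam_width]
        top_score = score(beam[0])
        if top_score > best_score:
            best_clause, best_score = beam[0], top_score
        else:
            break  # no improvement this step: stop
    return best_clause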

33 Shortest-First Search (SFS) 1. Start from empty or hand-coded MLN 2. FOR L ← 1 TO MAX_LENGTH 3. Apply each literal addition & deletion to each clause to create clauses of length L 4. Repeatedly add K best clauses of length L to the MLN until no clause of length L improves WPLL  Similar to Della Pietra et al. (1997), McCallum (2003)
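The same idea organized by clause length, as a rough sketch (clauses_of_length and wpll_gain are assumed helpers; the pruning and re-scoring details from the paper are omitted):

def shortest_first_search(initial_mln, clauses_of_length, wpll_gain, max_length=4, k=5):
    # Grow the MLN with the shortest useful clauses first.
    # clauses_of_length(mln, L): candidates of length L, built by applying single
    #   literal additions/deletions to the current clauses
    # wpll_gain(mln, clause): improvement in WPLL from adding the clause to mln
    mln = list(initial_mln)
    for length in range(1, max_length + 1):
        while True:
            candidates = clauses_of_length(mln, length)
            scored = sorted(candidates, key=lambda c: wpll_gain(mln, c), reverse=True)
            batch = [c for c in scored[:k] if wpll_gain(mln, c) > 0]
            if not batch:
                break  # no clause of length L improves the WPLL
            mln.extend(batch)
    return mln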

34-37 Speedup Techniques  FindBestClauses(MLN) Create candidate clauses (SLOW: many candidates) FOR EACH candidate clause c Compute increase in WPLL (using L-BFGS) of adding c to MLN (SLOW: many CLLs, each involving a #P-complete counting problem; L-BFGS is not that fast) RETURN k clauses with greatest increase

38-43 Speedup Techniques  Clause Sampling  Predicate Sampling  Avoid Redundancy  Loose Convergence Thresholds  Ignore Unrelated Clauses  Weight Thresholding
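For flavor, here is a hedged sketch of the first two techniques on this list, uniform subsampling of a clause's groundings and of each predicate's groundings when estimating counts and the WPLL (the actual sample sizes and estimators in the paper differ from this simplification):

import random

def sampled_true_grounding_count(groundings, is_true, sample_size=1000, rng=random):
    # Estimate n_i(x) for one clause from a uniform sample of its groundings.
    if len(groundings) <= sample_size:
        return sum(is_true(g) for g in groundings)
    sample = rng.sample(groundings, sample_size)
    return len(groundings) * sum(is_true(g) for g in sample) / sample_size

def sampled_wpll(atoms_by_predicate, cll, sample_size=500, rng=random):
    # Estimate the WPLL from a subsample of each predicate's groundings.
    total = 0.0
    for predicate, atoms in atoms_by_predicate.items():
        sample = atoms if len(atoms) <= sample_size else rng.sample(atoms, sample_size)
        total += sum(cll(a) for a in sample) / len(sample)  # mean CLL, since c_r = 1/g_r
    return total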


44 Overview  Motivation  Background  Structure Learning Algorithm  Experiments  Future Work & Conclusion

45 Experiments  UW-CSE domain 22 predicates, e.g., AdvisedBy(X,Y), Student(X), etc. 10 types, e.g., Person, Course, Quarter, etc. # ground predicates ≈ 4 million # true ground predicates ≈ 3000 Handcrafted KB with 94 formulas  Each student has at most one advisor  If a student is an author of a paper, so is her advisor  Cora domain Computer science research papers Collective deduplication of author, venue, title

46-49 Systems  MLN(SLB): structure learning with beam search  MLN(SLS): structure learning with SFS  KB: hand-coded KB  CL: CLAUDIEN  FO: FOIL  AL: Aleph  MLN(KB), MLN(CL), MLN(FO), MLN(AL): MLN weight learning applied to the clauses from KB, CL, FO, and AL  NB: Naïve Bayes  BN: Bayesian networks

50 Methodology  UW-CSE domain DB divided into 5 areas: AI, Graphics, Languages, Systems, Theory Leave-one-out testing by area  Measured: average CLL of the ground predicates; average area under the precision-recall curve of the ground predicates (AUC)
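A sketch of the two metrics (using scikit-learn for the precision-recall curve; y_prob is assumed to hold the model's predicted probabilities for each ground predicate of the held-out area and y_true their actual values):

import numpy as np
from sklearn.metrics import precision_recall_curve, auc

def average_cll(y_true, y_prob, eps=1e-6):
    # Average conditional log-likelihood of the true values of the ground predicates.
    p = np.clip(y_prob, eps, 1 - eps)
    return float(np.mean(np.where(y_true == 1, np.log(p), np.log(1 - p))))

def pr_auc(y_true, y_prob):
    # Area under the precision-recall curve for the ground predicates.
    precision, recall, _ = precision_recall_curve(y_true, y_prob)
    return float(auc(recall, precision))

# Made-up predictions for five ground predicates of the held-out area.
y_true = np.array([1, 0, 1, 0, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.4, 0.1])
print(average_cll(y_true, y_prob), pr_auc(y_true, y_prob))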

51-54 UW-CSE Results  [Bar charts of CLL and AUC on UW-CSE for MLN(SLS), MLN(SLB), MLN(CL), MLN(FO), MLN(AL), MLN(KB), CL, FO, AL, and KB]

55 UW-CSE Results  [Bar charts of CLL and AUC on UW-CSE for MLN(SLS), MLN(SLB), NB, and BN]

56 Timing  MLN(SLS) on UW-CSE Cluster of 15 dual-CPU 2.8 GHz Pentium 4 machines Without speedups: did not finish in 24 hrs With speedups: 5.3 hrs

57 Lesion Study  Disable one speedup technique at a time; SFS on UW-CSE (one fold)  [Bar chart of run time in hours for: all speedups, no clause sampling, no predicate sampling, don’t avoid redundancy, no loose convergence threshold, no weight thresholding]

58 Overview  Motivation  Background  Structure Learning Algorithm  Experiments  Future Work & Conclusion

59 Future Work  Speed up counting of # true groundings of clause  Probabilistically bound the loss in accuracy due to subsampling  Probabilistic predicate discovery

60 Conclusion  Markov logic networks: a powerful combination of first-order logic and probability  Richardson & Domingos (2004) did not learn MLN structure  We develop an algorithm that automatically learns both first-order clauses and their weights  We develop speedup techniques to make our algorithm fast enough to be practical  We show experimentally that our algorithm outperforms: Richardson & Domingos; pure ILP; purely KB approaches; purely probabilistic approaches  (For software,