Boosting Markov Logic Networks

Presentation transcript:

Boosting Markov Logic Networks Tushar Khot Joint work with Sriraam Natarajan, Kristian Kersting and Jude Shavlik

Sneak Peek
We present a method to learn structure and parameters for MLNs simultaneously: use functional gradients to learn many weakly predictive models, use regression trees/clauses to fit the functional gradients, and obtain faster and more accurate results than state-of-the-art structure-learning methods.
Example weighted clause: 1.0 publication(A,P), publication(B,P) → advisedBy(A,B)
[Figure: regression tree for ψm, splitting on n[p(X)] > 0 and n[q(X,Y)] > 0, with leaf weights W1, W2, W3.]
In today's talk I will present our approach to learning structure and parameters for Markov Logic Networks simultaneously. I will explain how we use functional gradient boosting (FGB) to learn multiple weak models, and how we use relational regression trees to fit the functional gradients. Lastly, I will present results showing that we are faster and more accurate than state-of-the-art methods.

Outline Background, Functional Gradient Boosting, Representations (Regression Trees, Regression Clauses), Experiments, Conclusions. This is the general outline of my talk. I will present some background on FGB and MLNs before explaining how we apply FGB to MLNs. I will then talk about the two representations that we used, followed by the experiments and conclusions.

Traditional Machine Learning Task: predicting whether a burglary occurred at the home. [Figure: features Burglary, Earthquake, Alarm, JohnCalls, MaryCalls, and a sample data table.] Traditional machine learning uses a set of features; each example is assumed to be i.i.d. and can be represented as a fixed-length feature vector. In this example, we have 5 features: whether a burglary occurred, whether there was an earthquake in the city, whether the house alarm is ringing, and whether your neighbor Mary or John called. A sample dataset corresponding to these features is shown on the right.

Parameter Learning / Structure Learning [Figure: Bayesian network with Burglary and Earthquake as parents of Alarm, and Alarm as parent of JohnCalls and MaryCalls; CPTs: P(E) = 0.1, P(B) = 0.1, P(A | B,E) over the four parent configurations (0.9, 0.5, 0.4, 0.1), P(M | A) = 0.7, P(M | ¬A) = 0.2, P(J | A) = 0.9, P(J | ¬A) = 0.1.] Structure learning for a Bayesian network corresponds to learning the parents of each feature. In this example, Alarm has Burglary and Earthquake as parents, and JohnCalls and MaryCalls have the same parent, Alarm. The parameter-learning task corresponds to learning the parameters at each node, i.e., its conditional probability distribution (CPD). For example, the Alarm node has 4 parameters, one for the probability of Alarm being true under each of the 4 configurations of its parents, such as the probability of the Alarm ringing when a burglary has not occurred but an earthquake has.
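To make the parameter/structure distinction concrete, here is a minimal CPT-lookup sketch in Python. The mapping of the slide's four Alarm values to specific parent configurations is my assumption, since the extracted table lost the negation signs:

    # Structure: Alarm's parents are Burglary and Earthquake.
    # Parameters: one P(Alarm=true | parents) entry per parent configuration.
    # NOTE: the assignment of 0.9/0.5/0.4/0.1 to configurations is an assumption.
    p_alarm = {
        (True,  True):  0.9,   # Burglary,    Earthquake
        (True,  False): 0.5,   # Burglary,    no Earthquake
        (False, True):  0.4,   # no Burglary, Earthquake
        (False, False): 0.1,   # no Burglary, no Earthquake
    }

    def prob_alarm(burglary: bool, earthquake: bool) -> float:
        """Look up P(Alarm = true | Burglary, Earthquake) in the CPT."""
        return p_alarm[(burglary, earthquake)]

    print(prob_alarm(burglary=False, earthquake=True))  # 0.4 under the assumed mapping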

Real-World Datasets [Figure: patients linked to previous blood tests, previous prescriptions, and previous mammograms.] But data in the real world may not be so simple. Consider an EHR dataset with a list of patients. Each patient has multiple test results and prescriptions; patients have different numbers of tests performed, and not every test is performed on every patient. Hence it is non-trivial to convert this dataset into an i.i.d. dataset with fixed-length feature vectors. Key challenge: a different amount of data for each patient.

Inductive Logic Programming ILP directly learns first-order rules from structured data by searching over the space of possible rules. Key limitation: the rules are evaluated as true or false, i.e., they are deterministic. Inductive Logic Programming handles the previous problem by using first-order logic to represent structured data, and learns first-order rules with a greedy search. For example, a rule might say that if a patient has a mass in two consecutive scans, then we should do a biopsy. But the issue with ILP is that the rules are deterministic, i.e., they evaluate to either true or false.

Logic + Probability = Statistical Relational Learning Models. Logic + probabilities → Statistical Relational Learning (SRL); probabilities + relations → SRL. One way to resolve this issue is by adding probabilities to first-order logic, i.e., we can think of each learnt rule as being true with some probability. This gives us a statistical relational learning model. Alternatively, we can imagine adding first-order logic, or relations, to probabilistic models.

Markov Logic Networks Weighted logic: the structure is a set of first-order formulas and the parameters are their weights. The probability of a world state x is P(X = x) = (1/Z) exp( Σ_i w_i n_i(x) ), where w_i is the weight of formula i and n_i(x) is the number of true groundings of formula i in the world state. [Figure: ground Markov network over Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B), Smokes(A), Smokes(B), with one world state shown (red = false, green = true).] MLNs are a popular SRL model: an MLN contains weighted first-order logic rules. The rules correspond to the structure and the weights are the parameters. The probability of a world state in an MLN is calculated using the formula above, where Z is the normalization term. To compute n_i, we generally ground the MLN into a Markov network. Consider only the second rule with two constants A and B; the ground Markov network is shown here. In the world state shown, the rule has 3 true groundings and 1 false grounding, so n_i = 3. (Richardson & Domingos, MLJ 2006)
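A minimal sketch of the world-probability computation by brute-force enumeration. The rule used below is the canonical friends-and-smokers rule, chosen because it matches the ground atoms in the figure; the exact rule and weight on the slide are not recoverable from the transcript, so treat them as illustrative:

    import math
    from itertools import product

    # Assumed rule:  w: Smokes(X) ^ Friends(X,Y) -> Smokes(Y)
    constants = ["A", "B"]
    w = 1.1  # illustrative weight

    def n_true_groundings(world):
        """n_i(x): number of groundings (X,Y) for which the rule is satisfied."""
        count = 0
        for x, y in product(constants, repeat=2):
            body = world[("Smokes", x)] and world[("Friends", x, y)]
            count += (not body) or world[("Smokes", y)]
        return count

    atoms = [("Smokes", c) for c in constants] + \
            [("Friends", x, y) for x, y in product(constants, repeat=2)]

    def unnormalized(world):
        return math.exp(w * n_true_groundings(world))

    # Partition function Z: sum over all world states of the ground network.
    worlds = [dict(zip(atoms, vals)) for vals in product([False, True], repeat=len(atoms))]
    Z = sum(unnormalized(wld) for wld in worlds)

    print(unnormalized(worlds[0]) / Z)   # P(X = x) = exp(w * n(x)) / Z for one world

Brute-force enumeration is only feasible for toy domains; real MLN systems avoid constructing and summing over all worlds.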

Learning MLNs – Prior Approaches Weight learning: requires hand-written MLN rules; uses gradient descent; needs to ground the Markov network, and hence can be very slow. Structure learning: a harder problem; needs to search the space of possible clauses; each new clause requires a weight-learning step.

Motivation for Boosting MLNs The true model may have a complex structure that is hard to capture using a handful of highly accurate rules. Our approach: use many weakly predictive rules, and learn structure and parameters simultaneously. It is hard to obtain a complex model from an expert and then learn its weights, or to capture the true model with a few long, accurate rules.

Problem Statement Given: training data as first-order logic facts and ground target predicates, e.g., student(Alice), professor(Bob), publication(Alice, Paper157), advisedBy(Alice, Bob). Learn: weighted rules for the target predicates, e.g., 1.2 publication(A,P), publication(B,P) → advisedBy(A,B), ...

Outline Background Functional Gradient Boosting Representations Regression Trees Regression Clauses Experiments Conclusions

Functional Gradient Boosting Model = weighted combination of a large number of simple functions ψm. [Figure: iterate — the current model's predictions and the data yield gradients, a regression function is induced to fit them, and it is added to the model; the final model is the initial model plus all induced functions.] Take an initial model; it could be expert advice or just a prior. Use its predictions to compute gradients, or residues. Learn a regression function to fit the residues and update the model. The sum of all regression functions gives the final model. (J.H. Friedman, Greedy function approximation: A gradient boosting machine.)
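A minimal sketch of this generic boosting loop for a binary target with a sigmoid link. The scikit-learn regression tree below is only a stand-in for the weak learner (the paper fits relational regression trees or clauses instead), and all names here are illustrative:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def sigmoid(f):
        return 1.0 / (1.0 + np.exp(-f))

    def boost(X, y, n_rounds=20):
        """Functional gradient boosting for P(y = 1 | x) = sigmoid(F(x))."""
        models = []
        F = np.zeros(len(y))           # initial model F_0 = 0 (a uniform prior)
        for _ in range(n_rounds):
            gradient = y - sigmoid(F)  # pointwise functional gradient of the log-likelihood
            psi = DecisionTreeRegressor(max_depth=2).fit(X, gradient)
            models.append(psi)
            F += psi.predict(X)        # F_m = F_{m-1} + psi_m
        return models

    def predict_proba(models, X):
        F = sum(m.predict(X) for m in models)
        return sigmoid(F)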

Function Definition for Boosting MLNs Probability of an example given its Markov blanket. We define the function ψ as a weighted sum over clauses, where nt_j corresponds to the non-trivial groundings of clause C_j; using non-trivial groundings allows us to avoid unnecessary computation. In this work we derived the functional gradient for MLNs. Consider the probability of an example given its Markov blanket; the Markov blanket of an example corresponds to all the neighbors of the example in the ground Markov network. For example, the green nodes in the figure form the Markov blanket of the purple node. We define the function corresponding to the functional gradient as shown here, using non-trivial groundings instead of the full number of groundings; this avoids unnecessary computation and allows efficient learning of the MLN structure. (Shavlik & Natarajan, IJCAI 2009)
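A reconstruction of the two definitions this slide refers to (the sigmoid form is stated on the next slide; the notation below is mine, not copied from the paper):

    \psi(x_i; MB(x_i)) = \sum_{j} w_j \, nt_j(x_i; MB(x_i))

    P(x_i = 1 \mid MB(x_i)) = \frac{\exp\big(\psi(x_i; MB(x_i))\big)}{1 + \exp\big(\psi(x_i; MB(x_i))\big)}

where nt_j counts the non-trivial groundings of clause C_j that involve x_i.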

Functional Gradients in MLN Probability of example x_i; gradient at example x_i. Given the definition of the function, the probability of an example is a sigmoid over the function. As in previous functional gradient boosting methods, our gradient corresponds to the difference between the observed label and the computed probability. A positive example with current probability 0.1 has a gradient of 0.9, whereas a negative example would have a gradient of -0.1. So we want to learn a regression function that minimizes the squared error to these gradients.
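A one-line sketch of the gradient computation, matching the worked numbers in the notes (Python, illustrative):

    def functional_gradient(label_is_true: bool, prob_true: float) -> float:
        """Observed label minus predicted probability: Delta(x_i) = I(y_i = 1) - P(x_i = 1 | MB)."""
        return (1.0 if label_is_true else 0.0) - prob_true

    print(functional_gradient(True, 0.1))    # positive example, P = 0.1 -> gradient  0.9
    print(functional_gradient(False, 0.1))   # negative example, P = 0.1 -> gradient -0.1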

Outline Background, Functional Gradient Boosting, Representations (Regression Trees, Regression Clauses), Experiments, Conclusions. We use two representations for the regression function: trees and clauses.

Learning Trees for Target(X) / Learning Clauses [Figure: regression tree for target(X) — the root splits on n[p(X)] > 0 vs. n[p(X)] = 0; the true branch splits on n[q(X,Y)] > 0 vs. n[q(X,Y)] = 0, giving leaf weights W1 and W2; the false branch gets leaf weight W3.] For clauses: same squared-error objective as for trees, but force the weights on the false branches (W3, W2) to be 0, hence no existential variables are needed. To explain the tree: each node splits on the number of groundings of a condition, e.g. evaluating q(X,Y), where Y is an existential variable, makes the split an existence check. The leaf weights have a closed-form solution given the residues (see paper), so candidate splits are easy to evaluate, and we use a greedy search like a standard decision-tree learner. The MLN corresponding to a tree can introduce existential variables on false branches, which slows inference; learning one clause at a time and forcing the false-branch weights to 0 avoids this, since a 0-weight clause has no impact, and as shown here no existential variables remain.
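A simplified sketch of scoring one candidate split. Here each leaf's weight is the mean residual in its partition, the plain squared-error solution; the paper's closed form also accounts for grounding counts, so this is an approximation with illustrative names:

    def split_score(examples, gradients, test):
        """Squared error when each side of the split gets its mean residual as leaf weight."""
        left  = [g for ex, g in zip(examples, gradients) if test(ex)]      # n[clause] > 0
        right = [g for ex, g in zip(examples, gradients) if not test(ex)]  # n[clause] = 0
        def sse(group):
            if not group:
                return 0.0
            leaf_weight = sum(group) / len(group)   # closed-form weight under squared error
            return sum((g - leaf_weight) ** 2 for g in group)
        return sse(left) + sse(right)

    def best_split(examples, gradients, candidate_tests):
        """Greedy choice among candidate literal tests, as in decision-tree learning."""
        return min(candidate_tests, key=lambda t: split_score(examples, gradients, t))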

Jointly Learning Multiple Target Predicates [Figure: boosting alternates between target predicates — the data and the predictions of the current models F_i yield gradients, from which a new regression function is induced for one predicate at a time.] Approximate MLNs as a set of conditional models. This extends our prior work on RDNs (ILP 2010, MLJ 2011) to MLNs. A similar approach by Lowd & Davis (ICDM 2010) for propositional Markov networks represents each conditional potential of the Markov network with a single tree. What if we have more than one target predicate? We use the previous models to compute the predictions, but still learn one tree at a time. This extends our work on learning RDNs, and the work by Lowd & Davis from later that year on Markov networks, to the relational setting.

Boosting MLNs (algorithm)
For each gradient step m = 1 to M:
  For each query predicate P:
    Generate a training set using the previous model F_{m-1}:
      for each example x, compute the gradient for x and add <x, gradient(x)> to the training set.
    Learn a regression function T_{m,P} (learn Horn clauses with P(X) as head).
    Add T_{m,P} to the model, giving F_m.
  Set F_m as the current model.
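A sketch of this loop in Python. The two hooks compute_probability and fit_regression_tree are caller-supplied placeholders for the relational learner; their names and the data layout are mine, not the paper's:

    def boost_mlns(query_predicates, examples, compute_probability, fit_regression_tree, M=20):
        """Gradient-boosted MLN learning: one regression model per predicate per gradient step.

        examples maps each predicate to a list of (ground_atom, label) pairs;
        compute_probability(model, atom) returns P(atom = 1 | MB) under the current model;
        fit_regression_tree(trainset, head) learns a regression tree / Horn clauses.
        """
        model = {p: [] for p in query_predicates}            # F_0: empty model for every predicate
        for m in range(M):
            for p in query_predicates:
                # Generate the training set using the previous model F_{m-1}.
                trainset = []
                for atom, label in examples[p]:
                    prob = compute_probability(model, atom)
                    trainset.append((atom, (1.0 if label else 0.0) - prob))  # functional gradient
                # Learn T_{m,p} with p as head and add it to the model.
                model[p].append(fit_regression_tree(trainset, head=p))
        return model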

Agenda Background Functional Gradient Boosting Representations Regression Trees Regression Clauses Experiments Conclusions

Experiments
Approaches: MLN-BT (Boosted Trees), MLN-BC (Boosted Clauses), Alch-D (Discriminative Weight Learning, Singla '05), LHL (Learning via Hypergraph Lifting, Kok '09), BUSL (Bottom-up Structure Learning, Mihalkova '07), Motif (Structural Motifs, Kok '10).
Datasets: UW-CSE, IMDB, Cora, WebKB.

Results – UW-CSE Predict the advisedBy relation, given student, professor, courseTA, courseProf, etc. relations. 5-fold cross validation. Exact inference, since advisedBy is the only target predicate.

    advisedBy   AUC-PR        CLL            Time
    MLN-BT      0.94 ± 0.06   -0.52 ± 0.45   18.4 sec
    MLN-BC      0.95 ± 0.05   -0.30 ± 0.06   33.3 sec
    Alch-D      0.31 ± 0.10   -3.90 ± 0.41   7.1 hrs
    Motif       0.43 ± 0.03   -3.23 ± 0.78   1.8 hrs
    LHL         0.42 ± 0.10   -2.94 ± 0.31   37.2 sec

AUC-PR is the area under the precision-recall curve and CLL is the conditional log-likelihood; note also the large differences in learning time.

Results – Cora Task: entity resolution. Predict: SameBib, SameVenue, SameTitle, SameAuthor. Given: HasWordAuthor, HasWordTitle, HasWordVenue. A joint model is learned over all predicates. [Figure: test-set results per predicate.] Our approach is much better on SameBib and not very different on SameAuthor.

Future Work Maximize the log-likelihood instead of the pseudo log-likelihood. Learn in the presence of missing data. Improve the human-readability of the learned MLNs.

Conclusion Presented a method to learn structure and parameters for MLNs simultaneously. FGB makes it possible to learn many effective short rules. Used two representations of the gradients. Efficiently learns an order of magnitude more rules. Superior test-set performance vs. state-of-the-art MLN structure-learning techniques.

Thanks Supported by DARPA, the Fraunhofer ATTRACT fellowship STREAM, and the European Commission.

Non-trivial Groundings Consider p(X), q(X,Y) → target(X), i.e., ¬p(X) ∨ ¬q(X,Y) ∨ target(X). Trivial true groundings for target(c): when p(c) is false, or when q(c,Y) is false. So the non-trivial groundings for x_i = target(c) are the true groundings of p(c) ∧ q(c,Y). Hence the non-trivial groundings are the true groundings of the body of the clause.
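A small sketch of this count for the example clause over a toy database of facts (the fact set and helper names are illustrative):

    # Toy database of true ground facts.
    facts = {("p", ("c",)), ("q", ("c", "y1")), ("q", ("c", "y2")), ("q", ("d", "y1"))}
    constants = ["c", "d", "y1", "y2"]

    def nontrivial_groundings(x):
        """Non-trivial groundings of p(X), q(X,Y) -> target(X) for target(x):
        the true groundings of the clause body p(x) AND q(x, Y)."""
        if ("p", (x,)) not in facts:
            return 0                       # every grounding is trivially satisfied via not p(x)
        return sum(("q", (x, y)) in facts for y in constants)

    print(nontrivial_groundings("c"))      # 2: q(c,y1) and q(c,y2)
    print(nontrivial_groundings("d"))      # 0: p(d) is false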

Functional Gradient Boosting Gradient descent can be written as a sum of gradients over the parameters θ; functional gradient boosting instead takes the gradient with respect to the function value at each example, so the final model is the initial model plus the sum of the induced gradient-fitting functions. (J.H. Friedman, Greedy function approximation: A gradient boosting machine.)
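The standard parameter-space vs. function-space contrast this slide refers to, in Friedman-style notation (my reconstruction, not the slide's exact symbols):

    \theta_m = \theta_0 + \delta_1 + \cdots + \delta_m,
    \qquad
    \delta_m = \eta_m \left.\frac{\partial}{\partial \theta}\sum_i \log P(y_i \mid x_i; \theta)\right|_{\theta = \theta_{m-1}}

    F_m(x) = F_0(x) + \Delta_1(x) + \cdots + \Delta_m(x),
    \qquad
    \Delta_m(x_i) = \left.\frac{\partial}{\partial F(x_i)}\sum_j \log P(y_j \mid x_j; F)\right|_{F = F_{m-1}}

Each Δ_m is then approximated by a regression function ψ_m fit to the pointwise gradients.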

Function Definition for Boosting MLNs We maximize the pseudo log-likelihood, i.e., the sum over examples of the log probability of each example given its Markov blanket, with the function defined over non-trivial groundings.
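The objective referred to above, in its standard pseudo log-likelihood form (notation mine):

    PLL(w) = \sum_i \log P\big(x_i \mid MB(x_i)\big),
    \qquad
    P(x_i \mid MB(x_i)) \propto \exp\Big(\sum_j w_j \, nt_j(x_i; MB(x_i))\Big)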