
Discriminative Learning for Markov Logic Networks


1 Discriminative Learning for Markov Logic Networks
Tuyen N. Huynh. Adviser: Prof. Raymond J. Mooney. PhD Proposal, October 9th, 2009. Some slides are taken from [Domingos, 2007] and [Mooney, 2008].

2 Motivation
Most machine learning methods assume independent and identically distributed (i.i.d.) examples represented as feature vectors.
Much real-world data is not i.i.d. and cannot be effectively represented as feature vectors:
Biochemical data
Social network data
Multi-relational data

3 Biochemical data Predicting mutagenicity [Srinivasan et al., 1995]

4 Web-KB dataset [Slattery & Craven, 1998]

5 Characteristics of these structured data
Contain multiple objects/entities and relationships among them.
There are many uncertainties in the data:
Uncertainty about the attributes of an object
Uncertainty about the type of an object
Uncertainty about relationships between objects

6 Statistical Relational Learning (SRL)
SRL attempts to integrate methods from first-order logic and probabilistic graphical models to handle such noisy structured/relational data. Some proposed SRL models:
Stochastic Logic Programs (SLPs) [Muggleton, 1996]
Probabilistic Relational Models (PRMs) [Friedman et al., 1999]
Bayesian Logic Programs (BLPs) [Kersting & De Raedt, 2001]
Relational Markov Networks (RMNs) [Taskar et al., 2002]
Markov Logic Networks (MLNs) [Richardson & Domingos, 2006]

8 Discriminative learning
Generative learning: learn a joint model over all variables.
Discriminative learning: learn a conditional model of the output variables given the input variables, i.e., directly learn a model for predicting the outputs, which generally gives better predictive performance on the outputs.
Most problems in structured/relational data are discriminative: make predictions based on some evidence (observable data), so discriminative learning is more suitable.

9 Discriminative Learning for
Markov Logic Networks

10 Outline
Motivation
Background
Discriminative learning for MLNs with non-recursive clauses [Huynh & Mooney, 2008]
Max-margin weight learning for MLNs [Huynh & Mooney, 2009]
Future work
Conclusion

11 First-Order Logic
Constants: Anna, Bob
Variables: x, y
Function: fatherOf(x)
Predicate: a Boolean-valued relation, e.g. Smoke(x), Friends(x,y)
Literal: a predicate (atom) or its negation
Grounding: replace all variables with constants, e.g. Friends(Anna, Bob)
World (model, interpretation): an assignment of truth values to all ground literals

12 First-Order Clauses
Clause: a disjunction of literals, e.g. ¬Smoke(x) v Cancer(x)
A clause can be rewritten as a set of implication rules:
Smoke(x) => Cancer(x)
¬Cancer(x) => ¬Smoke(x)

13 Markov Networks [Pearl, 1988]
Undirected graphical models. [Figure: example network over the variables Smoking, Cancer, Asthma, Cough]
Potential function: a function defined over a clique (a complete sub-graph), e.g. Ф(Smoking, Cancer). [Table: example potential values for Ф(Smoking, Cancer)]

14 Markov Networks [Pearl, 1988]
Undirected graphical models. Log-linear model: P(x) = (1/Z) exp(Σ_i w_i f_i(x)), where w_i is the weight of feature i and f_i(x) is feature i.

15 Markov Logic Networks [Richardson & Domingos, 2006]
A set of weighted first-order clauses; a larger weight indicates a stronger belief that the clause should hold. The clauses are called the structure of the MLN. MLNs are templates for constructing Markov networks for a given set of constants. MLN example: Friends & Smokers (written out below).
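For concreteness, the Friends & Smokers MLN used in the following slides can be written as two weighted clauses; the weights shown here are only illustrative values of the kind used in the [Domingos, 2007] slides, not weights learned in this proposal:

\[
\begin{aligned}
1.5 &\quad \forall x\;\big(\mathit{Smokes}(x) \Rightarrow \mathit{Cancer}(x)\big)\\
1.1 &\quad \forall x, y\;\big(\mathit{Friends}(x,y) \Rightarrow (\mathit{Smokes}(x) \Leftrightarrow \mathit{Smokes}(y))\big)
\end{aligned}
\]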

16 Example: Friends & Smokers
Two constants: Anna (A) and Bob (B)

17 Example: Friends & Smokers
Two constants: Anna (A) and Bob (B). Ground atoms (nodes of the ground network): Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B), Smokes(A), Smokes(B), Cancer(A), Cancer(B).


20 Probability of a possible world
P(X = x) = (1/Z) exp(Σ_i w_i n_i(x)), where w_i is the weight of formula i and n_i(x) is the number of true groundings of formula i in x. A possible world becomes exponentially less likely as the total weight of the ground clauses it violates increases.
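To make the formula concrete, here is a small hedged sketch (not the Alchemy implementation; the worlds, weights, and true-grounding counts below are invented for illustration) of how a world's probability follows from clause weights and counts:

```python
# Minimal sketch: probability of a possible world x from clause weights w_i
# and true-grounding counts n_i(x).  All numbers are hypothetical.
import math

weights = [1.5, 1.1]                       # one weight per first-order clause
counts_per_world = {                       # n_i(x) for each possible world x
    "world_1": [2, 4],
    "world_2": [1, 4],
    "world_3": [2, 3],
}

def unnormalized(counts):
    """exp(sum_i w_i * n_i(x))"""
    return math.exp(sum(w * n for w, n in zip(weights, counts)))

Z = sum(unnormalized(c) for c in counts_per_world.values())     # partition function
prob = {x: unnormalized(c) / Z for x, c in counts_per_world.items()}
print(prob)
```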

21 Inference in MLNs
MAP/MPE inference: find the most likely state of all unknown ground literals given the evidence.
MaxWalkSAT algorithm [Kautz et al., 1997] (a sketch follows)
Cutting Plane Inference algorithm [Riedel, 2008]
Computing the marginal conditional probability of a set of ground literals, P(Y=y|x):
MC-SAT algorithm [Poon & Domingos, 2006]
Lifted first-order belief propagation [Singla & Domingos, 2008]
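Below is a rough Python sketch of MaxWalkSAT over weighted ground clauses, to make the search concrete. It is a textbook-style rendering under my own representation (a ground clause is a `(weight, [(atom_index, is_positive), ...])` pair), not the Alchemy implementation, and the toy clauses at the bottom are invented:

```python
# Hedged sketch of MaxWalkSAT for weighted ground clauses.
import random

def clause_satisfied(clause, assignment):
    _, literals = clause
    return any(assignment[a] == pos for a, pos in literals)

def cost(clauses, assignment):
    """Total weight of unsatisfied clauses (lower is better)."""
    return sum(w for w, lits in clauses
               if not any(assignment[a] == pos for a, pos in lits))

def max_walk_sat(clauses, n_atoms, max_flips=10000, p_random=0.5, target=0.0):
    assignment = [random.choice([True, False]) for _ in range(n_atoms)]
    best, best_cost = assignment[:], cost(clauses, assignment)
    for _ in range(max_flips):
        unsat = [c for c in clauses if not clause_satisfied(c, assignment)]
        if not unsat or best_cost <= target:
            break
        _, literals = random.choice(unsat)        # pick a random unsatisfied clause
        atoms = [a for a, _ in literals]
        if random.random() < p_random:
            flip = random.choice(atoms)           # random-walk step
        else:                                     # greedy step: best atom to flip
            def cost_if_flipped(a):
                assignment[a] = not assignment[a]
                c = cost(clauses, assignment)
                assignment[a] = not assignment[a]
                return c
            flip = min(atoms, key=cost_if_flipped)
        assignment[flip] = not assignment[flip]
        c = cost(clauses, assignment)
        if c < best_cost:
            best, best_cost = assignment[:], c
    return best, best_cost

# Toy usage: atoms 0..2, two weighted ground clauses.
clauses = [(1.5, [(0, False), (1, True)]),        # ¬x0 v x1
           (1.1, [(1, False), (2, True)])]        # ¬x1 v x2
print(max_walk_sat(clauses, n_atoms=3))
```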

22 Existing structure learning methods for MLNs
Top-down approach (MSL [Kok & Domingos, 2005], [Biba et al., 2008]): start from unit clauses and search for new clauses.
Bottom-up approach (BUSL [Mihalkova & Mooney, 2007], LHL [Kok & Domingos, 2009]): use the data to generate candidate clauses.

23 Existing weight learning methods in MLNs
Generative: maximize the (pseudo) log-likelihood [Richardson & Domingos, 2006].
Discriminative: maximize the conditional log-likelihood (CLL).
[Singla & Domingos, 2005]: structured perceptron [Collins, 2002].
[Lowd & Domingos, 2007]: first- and second-order methods to optimize the CLL; found that the Preconditioned Scaled Conjugate Gradient (PSCG) performs best.

24 Outline
Motivation
Background
Discriminative learning for MLNs with non-recursive clauses
Max-margin weight learning for MLNs
Future work
Conclusion

25 Drug design for Alzheimer’s disease
Comparing different analogues of the Tacrine drug for Alzheimer's disease on four biochemical properties:
Maximization of inhibition of amine re-uptake
Minimization of toxicity
Maximization of acetyl cholinesterase inhibition
Maximization of the reversal of scopolamine-induced memory impairment
[Figure: Tacrine drug and template for the proposed drugs]

26 Inductive Logic Programming
Use first-order logic to represent background knowledge and examples. Automated learning of logic rules from examples and background knowledge.

27 Inductive Logic Programming systems
GOLEM [Muggleton and Feng, 1992] FOIL [Quinlan, 1993] PROGOL [Muggleton, 1995] CHILLIN [Zelle and Mooney, 1996] ALEPH [Srinivasan, 2001]

28 Inductive Logic Programming example [King et al., 1995]

29 Results with existing learning methods for MLNs
Average accuracy:
Data set            MLN1*        MLN2**       ALEPH
Alzheimer amine     50.1 ± 0.5   51.3 ± 2.5   81.6 ± 5.1
Alzheimer toxic     54.7 ± 7.4   51.7 ± 5.3   81.7 ± 4.2
Alzheimer acetyl    48.2 ± 2.9   55.9 ± 8.7   79.6 ± 2.2
Alzheimer memory    50.0 ± 0.0   49.8 ± 1.6   76.0 ± 4.9
*MLN1: MSL + PSCG   **MLN2: BUSL + PSCG
What happened: the existing learning methods for MLNs fail to capture the relations between the background predicates and the target predicate, motivating new discriminative learning methods for MLNs.

30 Proposed approach
[Diagram] A clause learner generates candidate clauses, from which good clauses are then selected.
Step 1 (generating candidate clauses): discriminative structure learning
Step 2 (selecting good clauses): discriminative weight learning

31 Discriminative structure learning
Use a variant of ALEPH, called ALEPH++, to produce a larger set of candidate clauses:
Score the clauses by the m-estimate [Dzeroski, 1991], a Bayesian estimate of the accuracy of a clause (a common form is given below).
Keep all clauses having an m-estimate greater than a pre-defined threshold (0.6), instead of keeping only the final theory produced by ALEPH.
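For reference, a common form of the m-estimate (my reconstruction; the exact variant used in the proposal may differ) for a clause covering $p$ positive and $n$ negative examples is

\[ \text{m-estimate} = \frac{p + m \cdot p_{+}}{p + n + m}, \]

where $p_{+}$ is the prior probability of the positive class and $m$ is a smoothing parameter.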

32 ALEPH++
Facts: r_subst_1(A1,H), r_subst_1(B1,H), r_subst_1(D1,H), x_subst(B1,7,CL), x_subst(HH1,6,CL), x_subst(D1,6,OCH3), polar(CL,POLAR3), polar(OCH3,POLAR2), great_polar(POLAR3,POLAR2), size(CL,SIZE1), size(OCH3,SIZE2), great_size(SIZE2,SIZE1), alk_groups(A1,0), alk_groups(B1,0), alk_groups(D1,0), alk_groups(HH1,1), flex(CL,FLEX0), flex(OCH3,FLEX1), less_toxic(A1,D1), less_toxic(B1,D1), less_toxic(HH1,A1)
ALEPH++ produces candidate clauses such as:
x_subst(d1,6,m1) ^ alk_groups(d1,1) => less_toxic(d1,d2)
alk_groups(d1,0) ^ r_subst_1(d2,H) => less_toxic(d1,d2)
x_subst(d1,6,m1) ^ polar(m1,POLAR3) ^ alk_groups(d1,1) => less_toxic(d1,d2)
…
They are all non-recursive clauses.

33 Discriminative weight learning
Maximize the CLL with L1-regularization (the objective is sketched below):
Use exact inference instead of approximate inference
Use L1-regularization instead of L2-regularization
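A sketch of the objective, reconstructed from the description on these slides ($b$ is the regularization parameter mentioned on slide 35):

\[ \max_{\mathbf{w}} \; \sum_{j} \log P(y_j \mid x_j; \mathbf{w}) \;-\; b \,\lVert \mathbf{w} \rVert_1 \]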

34 Exact inference
Since the candidate clauses are non-recursive, the query predicate appears only once in each clause, i.e., the probability of a query atom being true or false depends only on the evidence (a concrete form is sketched below).
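Concretely (my reconstruction, not copied from the proposal): because each ground query atom $y_j$ shares no ground clause with any other query atom, the conditional distribution factorizes over query atoms, and

\[ P(y_j = 1 \mid x) = \frac{\exp\!\big(\sum_i w_i\, n_i(x, y_j{=}1)\big)}{\exp\!\big(\sum_i w_i\, n_i(x, y_j{=}0)\big) + \exp\!\big(\sum_i w_i\, n_i(x, y_j{=}1)\big)}, \]

where $n_i(x, y_j{=}v)$ counts the true groundings of clause $i$ involving $y_j$ when it is set to $v$.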

35 L1-regularization
Put a Laplacian prior with zero mean on each weight w_i (shown below).
L1-regularization ignores irrelevant features by setting their weights to zero [Ng, 2004].
A larger value of b, the regularization parameter, corresponds to a smaller variance of the prior distribution.
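The standard form of the zero-mean Laplacian prior (a reminder, not taken from the slides) is

\[ p(w_i) = \frac{b}{2} \exp\!\big(-b\,\lvert w_i \rvert\big), \]

so the log-prior contributes $-b \sum_i \lvert w_i \rvert$ to the objective, which is exactly the L1 penalty.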

36 CLL with L1-regularization
This is a convex but non-smooth optimization problem. Use the Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) software [Andrew & Gao, 2007] to solve the optimization problem.

37 L1 weight learner
Facts: r_subst_1(A1,H), r_subst_1(B1,H), r_subst_1(D1,H), x_subst(B1,7,CL), x_subst(HH1,6,CL), x_subst(D1,6,OCH3)
Candidate clauses:
alk_groups(d1,0) ^ r_subst_1(d2,H) => less_toxic(d1,d2)
x_subst(d1,6,m1) ^ polar(m1,POLAR3) ^ alk_groups(d1,1) => less_toxic(d1,d2)
x_subst(d1,6,m1) ^ alk_groups(d1,1) => less_toxic(d1,d2)
…
The L1 weight learner produces weighted clauses:
alk_groups(d1,0) ^ r_subst_1(d2,H) => less_toxic(d1,d2)
x_subst(d1,6,m1) ^ polar(m1,POLAR3) ^ alk_groups(d1,1) => less_toxic(d1,d2)
…
0  x_subst(v8719,6,v8774) ^ alk_groups(v8719,1) => less_toxic(v8719,v8720)

38 Experiments

39 Datasets
Dataset             # Examples
Alzheimer amine     686
Alzheimer toxic     886
Alzheimer acetyl    1326
Alzheimer memory    642
% Pos. examples: 50%. # Predicates: 30.

40 Methodology
10-fold cross-validation. Metric: average predictive accuracy over the 10 folds.

41 Q1: Does the proposed approach perform better than existing learning methods for MLNs and traditional ILP methods? [Chart: average accuracy]

42 Q2: The effect of L1-regularization
[Chart: number of clauses]

43 Q2: The effect of L1-regularization (cont.)
[Chart: average accuracy]

44 Q3: The benefit of collective inference
Adding a transitive clause with infinite weight to the learned MLNs: less_toxic(a,b) ^ less_toxic(b,c) => less_toxic(a,c). [Chart: average accuracy]

45 Q4: The performance of our approach against other “advanced ILP” methods
[Chart: average accuracy]

46 Outline
Motivation
Background
Discriminative learning for MLNs with non-recursive clauses
Max-margin weight learning for MLNs
Future work
Conclusion

47 Motivation
All of the existing training methods for MLNs learn a model that produces good predictive probabilities. In many applications, the actual goal is to optimize an application-specific performance measure such as F1 score (the harmonic mean of precision and recall). Max-margin training methods, especially Structural Support Vector Machines (SVMs), provide a framework for optimizing such measures, which motivates training MLNs under the max-margin framework.

48 Generic Structural SVMs [Tsochantaridis et al., 2004]
Learn a discriminant function f: X x Y → R.
Predict for a given input x by maximizing the discriminant function over all outputs y.
Maximize the separation margin.
Can be formulated as a quadratic optimization problem (a reconstruction is sketched below).
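A reconstruction of the standard formulation (not copied from the slides): with $f(x,y) = \mathbf{w}^\top \Phi(x,y)$, prediction is

\[ \hat{y} = \arg\max_{y \in \mathcal{Y}} \mathbf{w}^\top \Phi(x, y), \]

and the margin-rescaled (n-slack) quadratic program is

\[ \min_{\mathbf{w},\, \xi \ge 0} \; \frac{1}{2}\lVert \mathbf{w} \rVert^2 + \frac{C}{n}\sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad \mathbf{w}^\top \big[\Phi(x_i, y_i) - \Phi(x_i, y)\big] \ge \Delta(y_i, y) - \xi_i \;\;\; \forall i,\; \forall y \in \mathcal{Y}. \]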

49 Generic Structural SVMs (cont.)
[Joachims et al., 2009] proposed the 1-slack formulation of the Structural SVM (sketched below), which makes the original cutting-plane algorithm [Tsochantaridis et al., 2004] faster and more scalable.
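A sketch of the 1-slack (margin-rescaling) formulation, reconstructed from [Joachims et al., 2009]:

\[ \min_{\mathbf{w},\, \xi \ge 0} \; \frac{1}{2}\lVert \mathbf{w} \rVert^2 + C\xi \quad \text{s.t.} \quad \forall (\bar{y}_1, \ldots, \bar{y}_n) \in \mathcal{Y}^n: \;\; \frac{1}{n}\, \mathbf{w}^\top \sum_{i=1}^{n} \big[\Phi(x_i, y_i) - \Phi(x_i, \bar{y}_i)\big] \ge \frac{1}{n} \sum_{i=1}^{n} \Delta(y_i, \bar{y}_i) - \xi. \]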

50 Cutting plane algorithm for solving the structural SVMs
The structural SVM problem has exponentially many constraints, but most are dominated by a small set of "important" constraints. The cutting plane algorithm repeatedly finds the next most violated constraint until no new violated constraint can be found. (*Slide credit: Yisong Yue)

51 Cutting plane algorithm for solving the 1-slack SVMs
The structural SVM problem has exponentially many constraints, but most are dominated by a small set of "important" constraints. The cutting plane algorithm repeatedly finds the next most violated constraint until no new violated constraint can be found; a sketch of the 1-slack variant of this loop follows. (*Slide credit: Yisong Yue)
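A rough Python sketch of the 1-slack cutting-plane loop, under stated assumptions: `psi` computes the joint feature vector Φ(x,y) as a numpy array, `loss` computes Δ(y, ȳ), `separation_oracle` returns the most violated output for the current weights, and `solve_qp` solves the restricted QP over the collected constraints (the proposal uses Mosek; any QP solver would do). All of these names are placeholders, not functions from the proposal or from a specific library:

```python
# Hedged sketch of 1-slack cutting-plane training (in the style of Joachims et al., 2009).
import numpy as np

def cutting_plane_1slack(examples, psi, loss, separation_oracle, solve_qp,
                         C=1.0, eps=1e-3, max_iters=100):
    n = len(examples)                            # examples are (x, y) pairs
    dim = len(psi(*examples[0]))
    w, slack = np.zeros(dim), 0.0
    constraints = []                             # list of (dpsi_mean, mean_loss)
    for _ in range(max_iters):
        # Build the single most violated 1-slack constraint for the current w.
        dpsi, mean_loss = np.zeros(dim), 0.0
        for x, y in examples:
            y_hat = separation_oracle(w, x, y)   # argmax_y' [loss + w . psi(x, y')]
            dpsi += (psi(x, y) - psi(x, y_hat)) / n
            mean_loss += loss(y, y_hat) / n
        if mean_loss - w.dot(dpsi) <= slack + eps:
            break                                # no sufficiently violated constraint left
        constraints.append((dpsi, mean_loss))
        # Re-solve the restricted QP:
        #   min 1/2 ||w||^2 + C*slack  s.t.  w . dpsi_k >= loss_k - slack  for all k
        w, slack = solve_qp(constraints, C)
    return w
```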


54 Applying the generic structural SVMs to a new problem
Representation: Φ(x,y)
Loss function: Δ(y,y')
Algorithms to compute the prediction and the most violated constraint: a separation oracle [Tsochantaridis et al., 2004] or loss-augmented inference [Taskar et al., 2005]

55 Max-margin Markov Logic Networks
Maximize the ratio of P(y|x) for the correct output y to P(ŷ|x) for the best competing output ŷ; this is equivalent to maximizing the separation margin and can be formulated as a 1-slack Structural SVM with joint feature Φ(x,y) (sketched below).
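A reconstruction of the quantities involved, using the notation from slide 20: since

\[ \log \frac{P(y \mid x)}{P(\hat{y} \mid x)} = \mathbf{w}^\top \big[\mathbf{n}(x, y) - \mathbf{n}(x, \hat{y})\big], \]

maximizing the ratio corresponds to a margin in the log domain, and the natural joint feature is $\Phi(x, y) = \mathbf{n}(x, y)$, the vector of true-grounding counts of the MLN clauses.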

56 Problems to be solved
MPE inference and loss-augmented MPE inference (both sketched below).
Problem: exact MPE inference in MLNs is intractable.
Solution: approximate inference via relaxation methods [Finley et al., 2008].
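A sketch of the two inference problems (my reconstruction, with $\mathbf{n}(x,y)$ as above and $y_{\text{true}}$ the gold output):

\[ \text{MPE:}\quad \hat{y} = \arg\max_{y} \; \mathbf{w}^\top \mathbf{n}(x, y), \qquad \text{loss-augmented MPE:}\quad \bar{y} = \arg\max_{y} \; \big[ \mathbf{w}^\top \mathbf{n}(x, y) + \Delta(y_{\text{true}}, y) \big]. \]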

57 Relaxation MPE inference for MLNs
Much work exists on approximating Weighted MAX-SAT via Linear Programming (LP) relaxation [Goemans and Williamson, 1994], [Asano and Williamson, 2002], [Asano, 2006]:
Convert the problem into an Integer Linear Programming (ILP) problem
Relax the integer constraints to linear constraints
Round the LP solution by some randomized procedure
These methods assume the weights are finite and positive.

58 Relaxation MPE inference for MLNs (cont.)
Translate the MPE inference in a ground MLN into an ILP problem:
Convert all the ground clauses into clausal form
Assign a binary variable y_i to each unknown ground atom and a binary variable z_j to each non-deterministic ground clause
Translate each ground clause into linear constraints over the y_i's and z_j's (a generic sketch follows)
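A generic reconstruction of the resulting ILP (my sketch of the standard Weighted MAX-SAT encoding, with $P_j$ and $N_j$ the atoms appearing positively and negatively in ground clause $j$, and $w_j$ its weight):

\[ \max_{\mathbf{y},\, \mathbf{z}} \; \sum_j w_j z_j \quad \text{s.t.} \quad \sum_{i \in P_j} y_i + \sum_{i \in N_j} (1 - y_i) \ge z_j \;\; \forall j, \qquad y_i \in \{0, 1\},\; z_j \in \{0, 1\}, \]

with deterministic (hard) clauses encoded by requiring their left-hand side to be at least 1.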

59 Relaxation MPE inference for MLNs (cont.)
Example ground MLN (the slide shows its translated ILP problem alongside):
3    InField(B1,Fauthor,P01)
0.5  InField(B1,Fauthor,P01) v InField(B1,Fvenue,P01)
-1   InField(B1,Ftitle,P01) v InField(B1,Fvenue,P01)
!InField(B1,Fauthor,P01) v !InField(a1,Ftitle,P01).
!InField(B1,Fauthor,P01) v !InField(a1,Fvenue,P01).
!InField(B1,Ftitle,P01) v !InField(a1,Fvenue,P01).

60 Relaxation MPE inference for MLNs (cont.)
LP-relaxation: relax the integer constraints {0,1} to linear constraints [0,1].
Adapt the ROUNDUP procedure [Boros and Hammer, 2002] to round the solution of the LP problem: pick a non-integral component and round it in each step (a sketch of this style of rounding follows).
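A minimal sketch of this style of iterative rounding (my illustration, not necessarily the exact ROUNDUP procedure of Boros and Hammer): each fractional variable is fixed, one at a time, to whichever of 0 or 1 gives the better objective with the other variables held at their current values. `objective` is a placeholder for the (loss-augmented) MPE objective evaluated on a full assignment:

```python
# Hedged sketch: greedy per-variable rounding of a fractional LP solution.
from typing import Callable, List

def round_lp_solution(y_frac: List[float],
                      objective: Callable[[List[float]], float],
                      tol: float = 1e-6) -> List[int]:
    y = list(y_frac)
    for i, v in enumerate(y):
        if v < tol or v > 1.0 - tol:
            y[i] = round(v)                         # already (numerically) integral
            continue
        y[i] = 0.0
        obj_zero = objective(y)
        y[i] = 1.0
        obj_one = objective(y)
        y[i] = 1.0 if obj_one >= obj_zero else 0.0  # keep the better rounding
    return [int(v) for v in y]

# Toy usage with a linear stand-in for the real MPE objective.
weights = [2.0, -1.0, 0.5]
print(round_lp_solution([0.7, 0.3, 0.5],
                        lambda y: sum(w * v for w, v in zip(weights, y))))
```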

61 Loss-augmented LP-relaxation MPE inference
Represent the loss function as a linear function of the y_i's (an example is given below). Add the loss term to the objective of the LP-relaxation; the problem is still an LP and can be solved by the previous algorithm.
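For instance (my illustration), the Hamming loss against the gold assignment $y^*$ is linear in the candidate variables $y_i$:

\[ \Delta(y^*, y) = \sum_i \big[ y_i^* + (1 - 2 y_i^*)\, y_i \big], \]

since each term equals $\lvert y_i - y_i^* \rvert$ for binary values.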

62 Experiments

63 Collective multi-label webpage classification
WebKB dataset [Slattery and Craven, 1998], as used in [Lowd and Domingos, 2007]: 4,165 web pages and 10,935 web links from 4 departments. Each page is labeled with a subset of 7 categories: Course, Department, Faculty, Person, Professor, Research Project, Student.
MLN [Lowd and Domingos, 2007]:
Has(+word,page) => PageClass(+class,page)
¬Has(+word,page) => PageClass(+class,page)
PageClass(+c1,p1) ^ Linked(p1,p2) => PageClass(+c2,p2)

64 Collective multi-label webpage classification (cont.)
Largest ground MLN for one department: 8,876 query atoms and 174,594 ground clauses.

65 Citation segmentation
Citeseer dataset [Lawrence et al., 1999], as used in [Poon and Domingos, 2007]: 1,563 citations, divided into 4 research topics. Each citation is segmented into 3 fields: Author, Title, Venue. Used the simplest MLN in [Poon and Domingos, 2007]. Largest ground MLN for one topic: 37,692 query atoms and 131,573 ground clauses.

66 Experimental setup
4-fold cross-validation. Metric: F1 score.
Compare against the Preconditioned Scaled Conjugate Gradient (PSCG) algorithm.
Train with 5 different values of C: 1, 10, 100, 1000, and test with the one that performs best on the training set.
Use Mosek to solve the QP and LP problems.

67 F1 scores on WebKB

68 F1 scores on WebKB (cont.)

69 F1 scores on Citeseer

70 Sensitivity to the tuning parameter

71 Outline
Motivation
Background
Discriminative learning for MLNs with non-recursive clauses
Max-margin weight learning for MLNs
Future work
Conclusion

72 More efficient MPE inference
Goal: avoid having to ground the whole MLN.
Current solutions:
Lazy inference [Singla & Domingos, 2006] exploits the sparsity of the domain
CPI [Riedel, 2008] exploits redundancies in the ground network
Lifted inference [Singla & Domingos, 2008; Kersting et al., 2009] exploits symmetries in the ground network
Challenge: combining the advantages of these algorithms to obtain a more efficient algorithm.

73 More efficient weight learning
Proposed approach: online max-margin weight learning.
Subgradient method [Ratliff & Zinkevich, 2007]: convert the problem into an unconstrained optimization and use the LP-relaxation MPE inference algorithm to compute the subgradients.
Passive-aggressive algorithms [Crammer et al., 2006] and the k-best MIRA algorithm [Crammer et al., 2005]: the problem to solve is finding the k-best MPE assignments.

74 Discriminative structure revision
Goal: revise bad clauses in the model.
Problems to solve:
Detect bad clauses: clauses whose weights are near zero
Diagnose bad clauses: use techniques from RTAMAR [Mihalkova et al., 2007]
Revise bad clauses: use top-down beam search or stochastic local search [Paes et al., 2007], or a bottom-up approach [Duboc et al., 2008]
Score candidate clauses in an efficient way
Online structure learning and revision

75 Joint learning in NLP
Jointly recognizing entities and relations in sentences [Roth & Yih, 2002]. Construct an MLN with:
Clauses that express the correlation between lexical and syntactic information and entity types
Clauses that express the correlation between lexical and syntactic information and relation types
Clauses that express the relationships among entities, among relations, and between entities and relations
Challenge: learning weights and performing inference with this complicated MLN.

76 Joint learning in computer vision
Problem: recognize both the objects and the scene of an image.
Proposed approach: use MLNs to combine the outputs of scene and object classifiers; learn an MLN that can detect both the objects and the scene.
[Figure from [Li & Fei-Fei, 2007]: an image labeled Sky, Tree, Athlete, Horse, Grass; Class: Polo]

77 Conclusion
We have presented two different discriminative learning methods for MLNs:
A discriminative structure and weight learning method for MLNs with non-recursive clauses
Max-margin weight learners
We propose to:
Develop more effective inference and weight learning methods
Revise the clauses to improve predictive accuracy
Apply the system to joint learning problems in NLP and computer vision

78 Questions? Thank you!

