Discriminative Learning for Markov Logic Networks

Discriminative Learning for Markov Logic Networks
Tuyen N. Huynh
Adviser: Prof. Raymond J. Mooney
PhD Proposal, October 9, 2009
Some slides are taken from [Domingos, 2007] and [Mooney, 2008]

Motivation
Most machine learning methods assume independent and identically distributed (i.i.d.) examples represented as feature vectors.
Most real-world data are not i.i.d. and cannot be effectively represented as feature vectors:
- Biochemical data
- Social network data
- Multi-relational data
- ...

Biochemical data: predicting mutagenicity [Srinivasan et al., 1995]

Web-KB dataset [Slattery & Craven, 1998]

Characteristics of these structured data
- They contain multiple objects/entities and relationships among them
- There is a lot of uncertainty in the data:
  - Uncertainty about the attributes of an object
  - Uncertainty about the type of an object
  - Uncertainty about relationships between objects

Statistical Relational Learning (SRL)
SRL attempts to integrate methods from first-order logic and probabilistic graphical models to handle such noisy structured/relational data.
Some proposed SRL models:
- Stochastic Logic Programs (SLPs) [Muggleton, 1996]
- Probabilistic Relational Models (PRMs) [Friedman et al., 1999]
- Bayesian Logic Programs (BLPs) [Kersting & De Raedt, 2001]
- Relational Markov Networks (RMNs) [Taskar et al., 2002]
- Markov Logic Networks (MLNs) [Richardson & Domingos, 2006]

Discriminative learning
- Generative learning: learn a joint model over all variables
- Discriminative learning: learn a conditional model of the output variables given the input variables, i.e. directly learn a model for predicting the outputs; this generally gives better predictive performance on the outputs
- Most problems in structured/relational data are discriminative: make predictions based on some evidence (observable data)
=> Discriminative learning is more suitable

Discriminative Learning for Markov Logic Networks

Outline
- Motivation
- Background
- Discriminative learning for MLNs with non-recursive clauses [Huynh & Mooney, 2008]
- Max-margin weight learning for MLNs [Huynh & Mooney, 2009]
- Future work
- Conclusion

First-Order Logic
- Constants: Anna, Bob
- Variables: x, y
- Functions: fatherOf(x)
- Predicates: boolean-valued functions, e.g. Smoke(x), Friends(x,y)
- Literals: a predicate or its negation
- Grounding: replace all variables by constants, e.g. Friends(Anna, Bob)
- World (model, interpretation): an assignment of truth values to all ground literals

First-Order Clauses
Clause: a disjunction of literals. A clause can be rewritten as a set of implication rules:
¬Smoke(x) v Cancer(x)
  ≡ Smoke(x) => Cancer(x)
  ≡ ¬Cancer(x) => ¬Smoke(x)

Markov Networks [Pearl, 1988]
Undirected graphical models.
[Figure: an example network over the variables Smoking, Cancer, Asthma, Cough]
Potential function: a function defined over a clique (a complete sub-graph), e.g. a potential Φ(Smoking, Cancer) assigning a value such as 4.5 or 2.7 to each joint truth assignment.

Markov Networks [Pearl, 1988]
Undirected graphical models.
Log-linear model:
  P(x) = \frac{1}{Z} \exp\Big( \sum_i w_i f_i(x) \Big)
where w_i is the weight of feature i and f_i(x) is feature i.

Markov Logic Networks [Richardson & Domingos, 2006]
- A set of weighted first-order clauses; a larger weight indicates a stronger belief that the clause should hold
- The clauses are called the structure of the MLN
- MLNs are templates for constructing Markov networks for a given set of constants
- MLN example: Friends & Smokers
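For concreteness, the clauses this example typically uses (the transcript omits the slide's formulas; the particular weights below follow the standard Richardson & Domingos illustration and are an assumption here):

  1.5   \forall x \;\; Smokes(x) \Rightarrow Cancer(x)
  1.1   \forall x, y \;\; Friends(x, y) \Rightarrow \big( Smokes(x) \Leftrightarrow Smokes(y) \big)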

Example: Friends & Smokers Two constants: Anna (A) and Bob (B)

Example: Friends & Smokers
Two constants: Anna (A) and Bob (B)
[Figure: the ground Markov network over the atoms Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B), Smokes(A), Smokes(B), Cancer(A), Cancer(B)]

Probability of a possible world
  P(X = x) = \frac{1}{Z} \exp\Big( \sum_i w_i n_i(x) \Big)
where w_i is the weight of formula i and n_i(x) is the number of true groundings of formula i in x.
A possible world becomes exponentially less likely as the total weight of all the grounded clauses it violates increases.
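To make the formula concrete, here is a minimal brute-force sketch in Python (the weights and the tiny clause set are illustrative assumptions, and real MLN systems never enumerate all worlds like this):

```python
# Brute-force P(X=x) for a tiny ground MLN, illustrating w_i and n_i(x).
from itertools import product
import math

ATOMS = ["Smokes(A)", "Smokes(B)", "Cancer(A)", "Cancer(B)", "Friends(A,B)"]

# Ground formulas as (weight, function mapping a world dict to True/False).
GROUND_FORMULAS = [
    (1.5, lambda wd: (not wd["Smokes(A)"]) or wd["Cancer(A)"]),   # Smokes(A) => Cancer(A)
    (1.5, lambda wd: (not wd["Smokes(B)"]) or wd["Cancer(B)"]),   # Smokes(B) => Cancer(B)
    (1.1, lambda wd: (not wd["Friends(A,B)"]) or (wd["Smokes(A)"] == wd["Smokes(B)"])),
]

def total_weight(world):
    """Sum of w_i over the ground formulas that are true in this world."""
    return sum(w for w, f in GROUND_FORMULAS if f(world))

def probability(world):
    """P(X=x) = exp(sum_i w_i n_i(x)) / Z, with Z summed over all 2^n worlds."""
    worlds = [dict(zip(ATOMS, vals)) for vals in product([False, True], repeat=len(ATOMS))]
    Z = sum(math.exp(total_weight(w)) for w in worlds)
    return math.exp(total_weight(world)) / Z

example = {"Smokes(A)": True, "Smokes(B)": False, "Cancer(A)": True,
           "Cancer(B)": False, "Friends(A,B)": True}
print(probability(example))
```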

Inference in MLNs
- MAP/MPE inference: find the most likely state of all unknown ground literals given the evidence
  - MaxWalkSAT algorithm [Kautz et al., 1997]
  - Cutting Plane Inference algorithm [Riedel, 2008]
- Computing the marginal conditional probability of a set of ground literals, P(Y=y|x):
  - MC-SAT algorithm [Poon & Domingos, 2006]
  - Lifted first-order belief propagation [Singla & Domingos, 2008]

Existing structure learning methods for MLNs
- Top-down approach: MSL [Kok & Domingos, 2005], [Biba et al., 2008]
  - Start from unit clauses and search for new clauses
- Bottom-up approach: BUSL [Mihalkova & Mooney, 2007], LHL [Kok & Domingos, 2009]
  - Use data to generate candidate clauses

Existing weight learning methods for MLNs
- Generative: maximize the (pseudo) log-likelihood [Richardson & Domingos, 2006]
- Discriminative: maximize the conditional log-likelihood (CLL)
  - [Singla & Domingos, 2005]: Structured Perceptron [Collins, 2002]
  - [Lowd & Domingos, 2007]: first- and second-order methods to optimize the CLL; found that the Preconditioned Scaled Conjugate Gradient (PSCG) method performs best

Outline
- Motivation
- Background
- Discriminative learning for MLNs with non-recursive clauses
- Max-margin weight learning for MLNs
- Future work
- Conclusion

Drug design for Alzheimer's disease
Comparing different analogues of the Tacrine drug for Alzheimer's disease on four biochemical properties:
- Maximization of inhibition of amine re-uptake
- Minimization of toxicity
- Maximization of acetyl cholinesterase inhibition
- Maximization of the reversal of scopolamine-induced memory impairment
[Figure: the Tacrine drug and the template for the proposed drugs]

Inductive Logic Programming
- Uses first-order logic to represent background knowledge and examples
- Automated learning of logic rules from examples and background knowledge

Inductive Logic Programming systems
- GOLEM [Muggleton and Feng, 1992]
- FOIL [Quinlan, 1993]
- PROGOL [Muggleton, 1995]
- CHILLIN [Zelle and Mooney, 1996]
- ALEPH [Srinivasan, 2001]

Inductive Logic Programming example [King et al., 1995]

Results with existing learning methods for MLNs
Average accuracy:
Data set            MLN1*         MLN2**        ALEPH
Alzheimer amine     50.1 ± 0.5    51.3 ± 2.5    81.6 ± 5.1
Alzheimer toxic     54.7 ± 7.4    51.7 ± 5.3    81.7 ± 4.2
Alzheimer acetyl    48.2 ± 2.9    55.9 ± 8.7    79.6 ± 2.2
Alzheimer memory    50 ± 0.0      49.8 ± 1.6    76.0 ± 4.9
*MLN1: MSL + PSCG    **MLN2: BUSL + PSCG
What happened: the existing learning methods for MLNs fail to capture the relations between the background predicates and the target predicate
=> New discriminative learning methods for MLNs

Proposed approach
[Diagram: a two-step clause-learner pipeline]
- Step 1 (Generating candidate clauses): discriminative structure learning
- Step 2 (Selecting good clauses): discriminative weight learning

Discriminative structure learning
Use a variant of ALEPH, called ALEPH++, to produce a larger set of candidate clauses:
- Score the clauses by the m-estimate [Dzeroski, 1991], a Bayesian estimate of the accuracy of a clause
- Keep all the clauses having an m-estimate greater than a pre-defined threshold (0.6), instead of the final theory produced by ALEPH
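For reference, a minimal sketch of the m-estimate score (the function name and the value of the m parameter are illustrative assumptions; the talk only states the 0.6 threshold):

```python
def m_estimate(pos_covered, neg_covered, prior_pos, m=2.0):
    """Bayesian estimate of a clause's accuracy [Dzeroski, 1991].

    pos_covered / neg_covered: positive / negative examples covered by the clause
    prior_pos: prior probability of the positive class (e.g. fraction of positives)
    m: equivalent sample size of the prior (assumed value here)
    """
    return (pos_covered + m * prior_pos) / (pos_covered + neg_covered + m)

# A clause covering 40 positives and 10 negatives, with a 50% positive prior:
print(m_estimate(40, 10, 0.5))   # ~0.788, above the 0.6 threshold used in the talk
```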

ALEPH++
Facts:
r_subst_1(A1,H)   r_subst_1(B1,H)   r_subst_1(D1,H)
x_subst(B1,7,CL)   x_subst(HH1,6,CL)   x_subst(D1,6,OCH3)
polar(CL,POLAR3)   polar(OCH3,POLAR2)   great_polar(POLAR3,POLAR2)
size(CL,SIZE1)   size(OCH3,SIZE2)   great_size(SIZE2,SIZE1)
alk_groups(A1,0)   alk_groups(B1,0)   alk_groups(D1,0)   alk_groups(HH1,1)
flex(CL,FLEX0)   flex(OCH3,FLEX1)
less_toxic(A1,D1)   less_toxic(B1,D1)   less_toxic(HH1,A1)
Candidate clauses produced by ALEPH++:
x_subst(d1,6,m1) ^ alk_groups(d1,1) => less_toxic(d1,d2)
alk_groups(d1,0) ^ r_subst_1(d2,H) => less_toxic(d1,d2)
x_subst(d1,6,m1) ^ polar(m1,POLAR3) ^ alk_groups(d1,1) => less_toxic(d1,d2)
...
They are all non-recursive clauses.

Discriminative weight learning
Maximize the CLL with L1-regularization:
- Use exact inference instead of approximate inference
- Use L1-regularization instead of L2-regularization

Exact inference
Since the candidate clauses are non-recursive, the query predicate appears only once in each clause, i.e. the probability of a query atom being true or false depends only on the evidence.
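Concretely, with only non-recursive clauses each ground query atom y_j is independent of the other query atoms given the evidence x, so its conditional probability has a closed logistic form (a standard consequence, written out here since the slide's own formula is not in the transcript):

  P(y_j = 1 \mid \mathbf{x}) \;=\;
  \frac{\exp\big( \sum_i w_i \, n_i(\mathbf{x}, y_j{=}1) \big)}
       {\exp\big( \sum_i w_i \, n_i(\mathbf{x}, y_j{=}0) \big) + \exp\big( \sum_i w_i \, n_i(\mathbf{x}, y_j{=}1) \big)}

where n_i(x, y_j=v) counts the true groundings of clause i involving y_j when y_j is set to v. This is why exact inference is feasible in this setting.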

L1-regularization
- Put a Laplacian prior with zero mean on each weight w_i
- L1 ignores irrelevant features by setting their weights to zero [Ng, 2004]
- A larger value of b, the regularization parameter, corresponds to a smaller variance of the prior distribution

CLL with L1-regularization
- This is a convex but non-smooth optimization problem
- Use the Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) software [Andrew & Gao, 2007] to solve the optimization problem
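Written out (reconstructed from the description above; the slide's exact notation is not in the transcript), the objective is

  \max_{\mathbf{w}} \;\; \log P(\mathbf{y} \mid \mathbf{x}; \mathbf{w}) \;-\; b \sum_i |w_i|

where b is the regularization parameter of the Laplacian prior; the non-smooth |w_i| terms are what OWL-QN is designed to handle.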

L1 weight learner
Facts:
r_subst_1(A1,H)   r_subst_1(B1,H)   r_subst_1(D1,H)
x_subst(B1,7,CL)   x_subst(HH1,6,CL)   x_subst(D1,6,OCH3)
...
Candidate clauses:
alk_groups(d1,0) ^ r_subst_1(d2,H) => less_toxic(d1,d2)
x_subst(d1,6,m1) ^ polar(m1,POLAR3) ^ alk_groups(d1,1) => less_toxic(d1,d2)
x_subst(d1,6,m1) ^ alk_groups(d1,1) => less_toxic(d1,d2)
...
Weighted clauses produced by the L1 weight learner:
0.34487   alk_groups(d1,0) ^ r_subst_1(d2,H) => less_toxic(d1,d2)
2.70323   x_subst(d1,6,m1) ^ polar(m1,POLAR3) ^ alk_groups(d1,1) => less_toxic(d1,d2)
...
0         x_subst(v8719,6,v8774) ^ alk_groups(v8719,1) => less_toxic(v8719,v8720)

Experiments

Datasets
Data set            #Examples   %Pos. examples   #Predicates
Alzheimer amine     686         50%              30
Alzheimer toxic     886
Alzheimer acetyl    1326
Alzheimer memory    642

Methodology
- 10-fold cross-validation
- Metric: average predictive accuracy over 10 folds

Q1: Does the proposed approach perform better than existing learning methods for MLNs and traditional ILP methods?
[Chart: average accuracy]

Q2: The effect of L1-regularization
[Chart: number of clauses]

Q2: The effect of L1-regularization (cont.)
[Chart: average accuracy]

Q3: The benefit of collective inference
Adding a transitive clause with infinite weight to the learned MLNs:
less_toxic(a,b) ^ less_toxic(b,c) => less_toxic(a,c)
[Chart: average accuracy]

Q4: The performance of our approach against other "advanced ILP" methods
[Chart: average accuracy]

Outline
- Motivation
- Background
- Discriminative learning for MLNs with non-recursive clauses
- Max-margin weight learning for MLNs
- Future work
- Conclusion

Motivation
- All of the existing training methods for MLNs learn a model that produces good predictive probabilities
- In many applications, the actual goal is to optimize some application-specific performance measure, such as the F1 score (the harmonic mean of precision and recall)
- Max-margin training methods, especially Structural Support Vector Machines (SVMs), provide a framework to optimize such application-specific measures
=> Train MLNs under the max-margin framework

Generic Structural SVMs [Tsochantaridis et al., 2004]
- Learn a discriminant function f: X x Y → R
- Predict for a given input x the highest-scoring output
- Maximize the separation margin
- Can be formulated as a quadratic optimization problem
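The formulas left out of the transcript are the standard ones from Tsochantaridis et al. (2004); reconstructed here in the margin-rescaling form rather than copied from the slides:

  f(\mathbf{x}, \mathbf{y}; \mathbf{w}) = \mathbf{w}^\top \Phi(\mathbf{x}, \mathbf{y})

  \hat{\mathbf{y}} = \operatorname*{argmax}_{\mathbf{y} \in \mathcal{Y}} \; \mathbf{w}^\top \Phi(\mathbf{x}, \mathbf{y})

  \min_{\mathbf{w}, \, \boldsymbol{\xi} \ge 0} \;\; \tfrac{1}{2} \|\mathbf{w}\|^2 + \frac{C}{n} \sum_{i=1}^{n} \xi_i
  \quad \text{s.t.} \quad
  \mathbf{w}^\top \big[ \Phi(\mathbf{x}_i, \mathbf{y}_i) - \Phi(\mathbf{x}_i, \mathbf{y}) \big] \ge \Delta(\mathbf{y}_i, \mathbf{y}) - \xi_i
  \quad \forall i, \; \forall \mathbf{y} \ne \mathbf{y}_i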

Generic Structural SVMs (cont.)
- [Joachims et al., 2009] proposed the 1-slack formulation of the Structural SVM
- It makes the original cutting-plane algorithm [Tsochantaridis et al., 2004] faster and more scalable

Cutting plane algorithm for solving the structural SVMs
Structural SVM problem:
- Exponentially many constraints
- Most are dominated by a small set of "important" constraints
Cutting plane algorithm:
- Repeatedly finds the next most violated constraint...
- ...until no new violated constraint can be found
*Slide credit: Yisong Yue

Cutting plane algorithm for solving the 1-slack SVMs
Structural SVM problem:
- Exponentially many constraints
- Most are dominated by a small set of "important" constraints
Cutting plane algorithm:
- Repeatedly finds the next most violated constraint...
- ...until no new violated constraint can be found
*Slide credit: Yisong Yue
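To make the 1-slack loop concrete, here is a self-contained sketch on a toy multiclass problem; everything prefixed `toy_`, the data, and the use of SciPy's SLSQP solver for the small QP are illustrative assumptions, not part of the system described in the talk:

```python
# Minimal 1-slack cutting-plane sketch (Joachims et al., 2009) on a toy problem.
import numpy as np
from scipy.optimize import minimize

CLASSES = [0, 1, 2]

def toy_phi(x, y):
    """Joint feature map Phi(x, y): place the 2-dim input x in the block of class y."""
    phi = np.zeros(2 * len(CLASSES))
    phi[2 * y:2 * y + 2] = x
    return phi

def toy_loss(y, y_hat):
    """0/1 loss between the true and predicted labels."""
    return float(y != y_hat)

def most_violated(w, x, y):
    """Separation oracle: argmax over y' of loss(y,y') + w.Phi(x,y') - w.Phi(x,y)."""
    scores = [toy_loss(y, yp) + w @ (toy_phi(x, yp) - toy_phi(x, y)) for yp in CLASSES]
    return CLASSES[int(np.argmax(scores))]

def cutting_plane(X, Y, C=1.0, eps=1e-3, max_iter=50):
    d = 2 * len(CLASSES)
    w, xi = np.zeros(d), 0.0
    constraints = []                      # accumulated (mean_loss, mean_psi) cuts
    for _ in range(max_iter):
        # 1) Build the currently most violated 1-slack constraint.
        losses, psis = [], []
        for x, y in zip(X, Y):
            y_hat = most_violated(w, x, y)
            losses.append(toy_loss(y, y_hat))
            psis.append(toy_phi(x, y) - toy_phi(x, y_hat))
        mean_loss, mean_psi = float(np.mean(losses)), np.mean(psis, axis=0)
        # 2) Stop if it is violated by less than eps.
        if mean_loss - w @ mean_psi <= xi + eps:
            break
        constraints.append((mean_loss, mean_psi))
        # 3) Re-solve the QP over all cuts found so far: min 0.5||w||^2 + C*xi.
        def obj(v):
            return 0.5 * v[:d] @ v[:d] + C * v[d]
        cons = [{'type': 'ineq',
                 'fun': lambda v, l=l, p=p: v[:d] @ p - l + v[d]}
                for l, p in constraints]
        cons.append({'type': 'ineq', 'fun': lambda v: v[d]})   # xi >= 0
        res = minimize(obj, np.zeros(d + 1), constraints=cons, method='SLSQP')
        w, xi = res.x[:d], res.x[d]
    return w

# Tiny usage example: three separable training points.
X = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
Y = [0, 1, 2]
w = cutting_plane(X, Y)
print([int(np.argmax([w @ toy_phi(x, c) for c in CLASSES])) for x in X])
```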

Applying the generic structural SVMs to a new problem
- Representation: Φ(x,y)
- Loss function: Δ(y,y')
- Algorithms to compute:
  - the prediction
  - the most violated constraint: the separation oracle [Tsochantaridis et al., 2004], or loss-augmented inference [Taskar et al., 2005]

Max-margin Markov Logic Networks
- Maximize the ratio of the probability of the correct labeling to that of the closest competing labeling
- Equivalent to maximizing the separation margin
- Can be formulated as a 1-slack Structural SVM
- Joint feature: Φ(x,y)
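In the standard notation (a reconstruction consistent with the bullets above; the slide's own formulas are not in the transcript), the joint feature is the vector of true-grounding counts, Φ(x,y) = n(x,y), so the log of the ratio being maximized is

  \log \frac{P(\mathbf{y} \mid \mathbf{x})}{P(\hat{\mathbf{y}} \mid \mathbf{x})}
  \;=\; \mathbf{w}^\top \big[ \mathbf{n}(\mathbf{x}, \mathbf{y}) - \mathbf{n}(\mathbf{x}, \hat{\mathbf{y}}) \big]

because the partition function Z cancels; this is exactly a structural-SVM margin with Φ(x,y) = n(x,y).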

Problems that need to be solved
- MPE inference
- Loss-augmented MPE inference
- Problem: exact MPE inference in MLNs is intractable
- Solution: approximate inference via relaxation methods [Finley et al., 2008]

Relaxation MPE inference for MLNs
There is much work on approximating Weighted MAX-SAT via Linear Programming (LP) relaxation [Goemans and Williamson, 1994], [Asano and Williamson, 2002], [Asano, 2006]:
- Convert the problem into an Integer Linear Programming (ILP) problem
- Relax the integer constraints to linear constraints
- Round the LP solution by some randomized procedure
- Assume the weights are finite and positive

Relaxation MPE inference for MLNs (cont.)
Translate the MPE inference in a ground MLN into an ILP problem:
- Convert all the ground clauses into clausal form
- Assign a binary variable y_i to each unknown ground atom and a binary variable z_j to each non-deterministic ground clause
- Translate each ground clause into linear constraints of the y_i's and z_j's

Relaxation MPE inference for MLNs (cont.)
Ground MLN (the translated ILP problem appeared alongside it on the slide):
3     InField(B1,Fauthor,P01)
0.5   InField(B1,Fauthor,P01) v InField(B1,Fvenue,P01)
-1    InField(B1,Ftitle,P01) v InField(B1,Fvenue,P01)
!InField(B1,Fauthor,P01) v !InField(B1,Ftitle,P01).
!InField(B1,Fauthor,P01) v !InField(B1,Fvenue,P01).
!InField(B1,Ftitle,P01) v !InField(B1,Fvenue,P01).
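The right-hand column of this slide (the translated ILP problem) did not survive in the transcript. The following is a reconstruction under the standard weighted MAX-SAT encoding described on the previous slide, so the variable names and the treatment of the negative weight are assumptions, not the slide's own content:

  Let y_a, y_t, y_v ∈ {0,1} stand for InField(B1,Fauthor,P01), InField(B1,Ftitle,P01), InField(B1,Fvenue,P01),
  and z_1, z_2 be auxiliary clause variables.

  maximize    3*y_a + 0.5*z_1 + 1*z_2
  subject to  z_1 <= y_a + y_v                    (soft clause with weight 0.5)
              z_2 <= 1 - y_t,   z_2 <= 1 - y_v    (the weight -1 clause, re-expressed as weight +1 on its negated, conjunctive form)
              (1 - y_a) + (1 - y_t) >= 1          (hard clauses)
              (1 - y_a) + (1 - y_v) >= 1
              (1 - y_t) + (1 - y_v) >= 1
              y_a, y_t, y_v, z_1, z_2 ∈ {0,1}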

Relaxation MPE inference for MLNs (cont.)
- LP relaxation: relax the integer constraints {0,1} to linear constraints [0,1]
- Adapt the ROUNDUP procedure [Boros and Hammer, 2002] to round the solution of the LP problem
  - Pick a non-integral component and round it in each step
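A minimal sketch of that rounding idea (the greedy choice below, rounding each fractional variable to whichever value gives the higher objective with the others fixed, is my reading of the one-line description above, not the exact ROUNDUP procedure):

```python
def round_lp_solution(y_frac, objective, tol=1e-6):
    """Round a fractional LP solution to {0,1}, one component at a time.

    y_frac:    list of values in [0,1] from the relaxed LP
    objective: function mapping a full (possibly fractional) assignment to a score,
               e.g. the weighted sum of satisfied ground clauses
    """
    y = list(y_frac)
    for i, v in enumerate(y):
        if v < tol or v > 1 - tol:         # already (essentially) integral
            y[i] = round(v)
            continue
        y0, y1 = list(y), list(y)
        y0[i], y1[i] = 0.0, 1.0
        # keep the rounding that does not decrease the objective
        y[i] = 1.0 if objective(y1) >= objective(y0) else 0.0
    return [int(v) for v in y]

# Toy usage: the objective rewards y[0] and y[1] agreeing, plus a small bias on y[0].
obj = lambda y: 0.6 * y[0] + 1.0 * (1 - abs(y[0] - y[1]))
print(round_lp_solution([0.5, 0.5], obj))   # -> [1, 1]
```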

Loss-augmented LP-relaxation MPE inference
- Represent the loss function as a linear function of the y_i's
- Add the loss term to the objective of the LP relaxation
  => the problem is still an LP problem
  => it can be solved by the previous algorithm
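For example, the Hamming loss between the true labeling y and a candidate labeling y' is already linear in the y'_i:

  \Delta(\mathbf{y}, \mathbf{y}') \;=\; \sum_{i \,:\, y_i = 1} (1 - y'_i) \;+\; \sum_{i \,:\, y_i = 0} y'_i

so it can be added directly to the LP objective.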

Experiments

Collective multi-label webpage classification
- WebKB dataset [Slattery and Craven, 1998], [Lowd and Domingos, 2007]
- 4,165 web pages and 10,935 web links of 4 departments
- Each page is labeled with a subset of 7 categories: Course, Department, Faculty, Person, Professor, Research Project, Student
- MLN [Lowd and Domingos, 2007]:
  Has(+word,page) => PageClass(+class,page)
  ¬Has(+word,page) => PageClass(+class,page)
  PageClass(+c1,p1) ^ Linked(p1,p2) => PageClass(+c2,p2)

Collective multi-label webpage classification (cont.)
Largest ground MLN for one department:
- 8,876 query atoms
- 174,594 ground clauses

Citation segmentation
- Citeseer dataset [Lawrence et al., 1999], [Poon and Domingos, 2007]
- 1,563 citations, divided into 4 research topics
- Each citation is segmented into 3 fields: Author, Title, Venue
- Used the simplest MLN in [Poon and Domingos, 2007]
- Largest ground MLN for one topic: 37,692 query atoms, 131,573 ground clauses

Experimental setup
- 4-fold cross-validation
- Metric: F1 score
- Compare against the Preconditioned Scaled Conjugate Gradient (PSCG) algorithm
- Train with 5 different values of C (1, 10, 100, 1000, 10000) and test with the one that performs best on the training data
- Use Mosek to solve the QP and LP problems

F1 scores on WebKB

F1 scores on WebKB (cont.)

F1 scores on Citeseer

Sensitivity to the tuning parameter

Outline
- Motivation
- Background
- Discriminative learning for MLNs with non-recursive clauses
- Max-margin weight learning for MLNs
- Future work
- Conclusion

More efficient MPE inference
- Goal: avoid having to ground the whole MLN
- Current solutions:
  - Lazy inference [Singla & Domingos, 2006] exploits the sparsity of the domain
  - CPI [Riedel, 2008] exploits redundancies in the ground network
  - Lifted inference [Singla & Domingos, 2008; Kersting et al., 2009] exploits symmetries in the ground network
- Challenge: combining the advantages of these algorithms to obtain a more efficient algorithm

More efficient weight learning
Proposed approach: online max-margin weight learning
- Subgradient method [Ratliff & Zinkevich, 2007]
  - Convert the problem into an unconstrained optimization
  - Use the LP-relaxation MPE inference algorithm to compute the subgradients
- Passive-aggressive algorithms [Crammer et al., 2006]
  - k-best MIRA algorithm [Crammer et al., 2005]
  - Problem that needs to be solved: find the k-best MPE states
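For reference, the subgradient update would take the standard form (a reconstruction, not a formula from the slides): for the structured hinge objective \frac{\lambda}{2}\|\mathbf{w}\|^2 + \max_{\mathbf{y}'} \big[ \Delta(\mathbf{y}, \mathbf{y}') + \mathbf{w}^\top \Phi(\mathbf{x}, \mathbf{y}') \big] - \mathbf{w}^\top \Phi(\mathbf{x}, \mathbf{y}), a subgradient is

  \mathbf{g} = \lambda \mathbf{w} + \Phi(\mathbf{x}, \hat{\mathbf{y}}^*) - \Phi(\mathbf{x}, \mathbf{y}), \qquad \mathbf{w} \leftarrow \mathbf{w} - \eta_t \, \mathbf{g}

where \hat{\mathbf{y}}^* is the loss-augmented MPE state, which is exactly what the LP-relaxation inference above computes.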

Discriminative structure revision
Goal: revise bad clauses in the model
Problems that need to be solved:
- Detect bad clauses: clauses whose weights are near zero
- Diagnose bad clauses: use techniques in RTAMAR [Mihalkova et al., 2007]
- Revise bad clauses: use top-down beam search or stochastic local search [Paes et al., 2007], or a bottom-up approach [Duboc et al., 2008]
- Score candidate clauses in an efficient way
- Online structure learning and revision

Joint learning in NLP
Jointly recognizing entities and relations in sentences [Roth & Yih, 2002]
Construct an MLN with:
- Clauses that express the correlation between lexical and syntactic information and entity types
- Clauses that express the correlation between lexical and syntactic information and relation types
- Clauses that express the relationships among entities, among relations, and between entities and relations
Challenge: learning weights and doing inference with this complicated MLN

Joint learning in computer vision
Problem: recognize both the objects and the scene of an image
Proposed approach:
- Use MLNs to combine the outputs of scene and object classifiers
- Learn an MLN that can detect both the objects and the scene
[Figure: an example image with object labels Sky, Tree, Athlete, Horse, Grass and scene class Polo [Li & Fei-Fei 07]]

Conclusion
We have presented two different discriminative learning methods for MLNs:
- A discriminative structure and weight learning method for MLNs with non-recursive clauses
- Max-margin weight learners
We propose to:
- Develop more effective inference and weight learning methods
- Revise the clauses to improve the predictive accuracy
- Apply the system to joint learning problems in NLP and computer vision

Questions? Thank you!