Efficient Weight Learning for Markov Logic Networks
Daniel Lowd, University of Washington
(Joint work with Pedro Domingos)

Outline
- Background
- Algorithms
  - Gradient descent
  - Newton’s method
  - Conjugate gradient
- Experiments
  - Cora – entity resolution
  - WebKB – collective classification
- Conclusion

Markov Logic Networks
- Statistical relational learning: combining probability with first-order logic
- Markov logic network (MLN) = weighted set of first-order formulas
- Applications: link prediction [Richardson & Domingos, 2006], entity resolution [Singla & Domingos, 2006], information extraction [Poon & Domingos, 2007], and more…

Example: WebKB
Collective classification of university web pages:
- Has(page, “homework”) ⇒ Class(page, Course)
- ¬Has(page, “sabbatical”) ⇒ Class(page, Student)
- Class(page1, Student) ∧ LinksTo(page1, page2) ⇒ Class(page2, Professor)

Example: WebKB
Collective classification of university web pages (a “+” variable creates a separate formula, and weight, for each constant):
- Has(page, +word) ⇒ Class(page, +class)
- ¬Has(page, +word) ⇒ Class(page, +class)
- Class(page1, +class1) ∧ LinksTo(page1, page2) ⇒ Class(page2, +class2)

Overview
Discriminative weight learning in MLNs is a convex optimization problem.
- Problem: It can be prohibitively slow.
- Solution: Second-order optimization methods.
- Problem: Line searches and function evaluations are intractable.
- Solution: This talk!

Sneak preview

Outline
- Background
- Algorithms
  - Gradient descent
  - Newton’s method
  - Conjugate gradient
- Experiments
  - Cora – entity resolution
  - WebKB – collective classification
- Conclusion

Gradient descent
Move in the direction of steepest descent, scaled by the learning rate η:
w_{t+1} = w_t + η g_t

Gradient descent in MLNs
The gradient of the conditional log-likelihood is:
∂ log P(Y=y | X=x) / ∂w_i = n_i − E[n_i]
Problem: Computing the expected counts E[n_i] is hard.
Solution: Voted perceptron [Collins, 2002; Singla & Domingos, 2005]
- Approximate the counts using the MAP state
- The MAP state is itself approximated using MaxWalkSAT
- So far the only algorithm used for MLN discriminative learning
Solution: Contrastive divergence [Hinton, 2002]
- Approximate the counts from a few MCMC samples
- MC-SAT gives less correlated samples [Poon & Domingos, 2006]
- Never before applied to Markov logic
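Below is a minimal numpy sketch of one update in this style (the function name, toy counts, and the 0.01 learning rate are illustrative, not from the paper): the expected counts E[n_i] are replaced by counts from approximate inference, i.e. a single MAP state for voted perceptron or a few MC-SAT samples for contrastive divergence.

```python
import numpy as np

def gradient_step(weights, true_counts, sampled_counts, learning_rate):
    """One ascent step on the conditional log-likelihood.

    true_counts    : n_i, clause counts in the training data, shape (n_clauses,)
    sampled_counts : rows of approximate counts from inference, e.g. the MAP
                     state found by MaxWalkSAT (voted perceptron) or a few
                     MC-SAT samples (contrastive divergence), shape (k, n_clauses)
    """
    expected = sampled_counts.mean(axis=0)   # approximates E[n_i]
    gradient = true_counts - expected        # d log P(y|x) / d w_i
    return weights + learning_rate * gradient

# Toy usage with made-up counts for 3 clauses and 5 samples:
w = np.zeros(3)
n_data = np.array([10.0, 4.0, 7.0])
n_samples = np.random.poisson(lam=[9, 5, 6], size=(5, 3)).astype(float)
w = gradient_step(w, n_data, n_samples, learning_rate=0.01)
```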

Per-weight learning rates
Some clauses have vastly more groundings than others:
- Smokes(X) ⇒ Cancer(X)
- Friends(A,B) ∧ Friends(B,C) ⇒ Friends(A,C)
- Need a different learning rate in each dimension
- Impractical to tune the rate for each weight by hand
- Learning rate in each dimension: η / (# of true clause groundings)
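As a sketch of this idea (hypothetical names, not the paper's code), the scalar rate becomes a vector of per-weight rates by dividing by each clause's number of true groundings:

```python
import numpy as np

def per_weight_rates(global_rate, true_grounding_counts):
    """One learning rate per weight: the global rate divided by the number
    of true groundings of the corresponding clause, so clauses with millions
    of groundings take proportionally smaller steps. The floor of 1 avoids
    dividing by zero for clauses with no true groundings."""
    return global_rate / np.maximum(true_grounding_counts, 1.0)

# Example: three clauses with very different numbers of true groundings.
rates = per_weight_rates(1.0, np.array([2.0, 1_000_000.0, 350.0]))
# The update then becomes: w = w + rates * gradient   (element-wise)
```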

Ill-conditioning
- Skewed error surface ⇒ slow convergence
- Condition number: λ_max / λ_min of the Hessian

The Hessian matrix
- Hessian matrix: all second derivatives
- In an MLN, the Hessian is the negative covariance matrix of the clause counts
  - Diagonal entries are clause variances
  - Off-diagonal entries show correlations between clauses
- Shows the local curvature of the error function
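Since the Hessian equals the negative covariance of the clause counts, it can be estimated from count vectors sampled from the model; the sketch below (illustrative names, synthetic counts) does exactly that and pulls out the diagonal used by the methods that follow.

```python
import numpy as np

def hessian_from_counts(count_samples):
    """Estimate the Hessian of the conditional log-likelihood from
    clause-count vectors sampled from the model (e.g. via MC-SAT):
    the negative covariance matrix of the counts. Its diagonal holds
    the (negated) per-clause variances; off-diagonal entries reflect
    correlations between clauses."""
    return -np.cov(count_samples, rowvar=False)

counts = np.random.poisson(lam=[9.0, 5.0, 6.0], size=(200, 3)).astype(float)
H = hessian_from_counts(counts)
clause_variances = -np.diag(H)   # reused by diagonal Newton and preconditioning
```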

Newton’s method
- Weight update: w_{t+1} = w_t + H^{-1} g
- Converges in one step if the error surface is quadratic
- Requires inverting the Hessian matrix

Diagonalized Newton’s method
- Weight update: w_{t+1} = w_t + D^{-1} g, where D is the diagonal of the Hessian
- Converges in one step if the error surface is quadratic AND the features are uncorrelated
- (May still need to determine a step length…)
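A sketch of the diagonalized Newton update, assuming the per-clause count variances have already been estimated (e.g. as in the Hessian sketch above); the step_scale argument is a stand-in for whatever step-length control is applied when the full step overshoots.

```python
import numpy as np

def diagonal_newton_step(weights, gradient, clause_variances, step_scale=1.0):
    """Diagonalized Newton update: divide each gradient component by the
    magnitude of the corresponding diagonal Hessian entry, which in an MLN
    is the variance of that clause's counts (floored for numerical
    stability), then take an ascent step."""
    return weights + step_scale * gradient / np.maximum(clause_variances, 1e-8)
```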

Conjugate gradient
- Include the previous direction in the new search direction
- Avoids “undoing” any earlier work
- On a quadratic surface, finds the n optimal weights in n steps
- Depends heavily on line searches: the optimum along each search direction is found by function evaluations
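One way to realize "include the previous direction" is the Polak-Ribière update sketched below; this is a common textbook choice, not necessarily the exact variant used in the paper.

```python
import numpy as np

def conjugate_direction(gradient, prev_gradient=None, prev_direction=None):
    """New search direction that mixes the previous direction into the
    current gradient so earlier progress is not undone (Polak-Ribiere
    formula). The first iteration falls back to the plain gradient."""
    if prev_direction is None:
        return gradient
    beta = gradient @ (gradient - prev_gradient) / (prev_gradient @ prev_gradient)
    return gradient + max(beta, 0.0) * prev_direction
```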

Scaled conjugate gradient [Møller, 1993]
- Include the previous direction in the new search direction
- Avoids “undoing” any earlier work
- On a quadratic surface, finds the n optimal weights in n steps
- Uses the Hessian matrix in place of a line search
- Still cannot store the entire Hessian matrix in memory (only the quadratic form d^T H d is needed)

Step sizes and trust regions
Choosing the step length:
- Compute the optimal quadratic step length: α = g^T d / (d^T H d)
- Limit the step size to a “trust region”
- Key idea: within the trust region, the quadratic approximation is good
Updating the trust region:
- Check the quality of the approximation (predicted vs. actual change in the function value)
- If good, grow the trust region; if bad, shrink it
Modifications for MLNs:
- Fast computation of the quadratic form d^T H d (since the Hessian is the negative covariance of the clause counts, this can be estimated from samples without forming H)
- Use a lower bound on the change in the function value
[Møller, 1993; Nocedal & Wright, 2007]
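The sketch below illustrates these two pieces: a quadratic-model step length clipped to the trust region, and a radius update driven by how well the model predicted the actual change. The thresholds and growth/shrink factors are generic defaults, not the paper's settings.

```python
import numpy as np

def trust_region_step(gradient, direction, dHd, trust_radius):
    """Quadratic-model step along `direction`, clipped to the trust region.
    `dHd` is the curvature magnitude |d^T H d| along the direction; in an
    MLN it can be estimated from clause-count samples instead of an
    explicit Hessian."""
    alpha = (gradient @ direction) / max(dHd, 1e-8)
    step = alpha * direction
    norm = np.linalg.norm(step)
    if norm > trust_radius:                  # stay inside the trust region
        step *= trust_radius / norm
    return step

def update_trust_radius(trust_radius, actual_change, predicted_change):
    """Grow the region when the quadratic model predicted the real change
    well, shrink it when it did not (the 0.75 / 0.25 thresholds and the
    factors of 2 and 4 are illustrative defaults)."""
    ratio = actual_change / predicted_change if predicted_change else 0.0
    if ratio > 0.75:
        return 2.0 * trust_radius
    if ratio < 0.25:
        return trust_radius / 4.0
    return trust_radius
```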

Preconditioning
- The initial direction of SCG is the gradient, which is very bad for ill-conditioned problems
- Well-known fix: preconditioning [Sha & Pereira, 2003]
  - Multiply by a matrix that lowers the condition number
  - Ideally, an approximation of the inverse Hessian
- Standard preconditioner: D^{-1}, the inverse of the Hessian’s diagonal
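A sketch of applying the D^{-1} preconditioner, assuming the clause-count variances (the Hessian diagonal, up to sign) are available as in the earlier sketches:

```python
import numpy as np

def preconditioned_gradient(gradient, clause_variances):
    """Apply the standard D^{-1} preconditioner: divide each gradient
    component by the magnitude of the corresponding diagonal Hessian entry
    (the clause-count variance), lowering the effective condition number
    before the SCG directions are built."""
    return gradient / np.maximum(clause_variances, 1e-8)
```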

Outline
- Background
- Algorithms
  - Gradient descent
  - Newton’s method
  - Conjugate gradient
- Experiments
  - Cora – entity resolution
  - WebKB – collective classification
- Conclusion

Experiments: Algorithms
- Voted perceptron (VP, VP-PW)
- Contrastive divergence (CD, CD-PW)
- Diagonal Newton (DN)
- Scaled conjugate gradient (SCG, PSCG)
Baseline: VP. New algorithms: VP-PW, CD, CD-PW, DN, SCG, PSCG (“-PW” marks per-weight learning rates; PSCG is preconditioned SCG).

Experiments: Datasets
Cora [Singla & Domingos, 2006]
- Task: deduplicate 1295 citations to 132 papers
- Weights: 6141
- Ground clauses: > 3 million
- Condition number: > 600,000
WebKB [Craven & Slattery, 2001]
- Task: predict the categories of 4165 web pages
- Weights: 10,891
- Ground clauses: > 300,000
- Condition number: ~7000

Experiments: Method
- Gaussian prior on each weight
- Learning rates tuned on held-out data
- Each algorithm trained for 10 hours
- Evaluated on test data:
  - AUC: area under the precision-recall curve
  - CLL: average conditional log-likelihood of all query predicates
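For concreteness, here is one way these two metrics could be computed from predicted marginal probabilities of the query atoms, using scikit-learn's precision-recall utilities; this is an illustrative sketch, not the evaluation code used in the experiments.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

def evaluate(probs, labels, eps=1e-6):
    """AUC of the precision-recall curve and the average conditional
    log-likelihood of the query ground atoms, given predicted marginal
    probabilities `probs` and 0/1 ground truth `labels`."""
    precision, recall, _ = precision_recall_curve(labels, probs)
    pr_auc = auc(recall, precision)
    p = np.clip(probs, eps, 1.0 - eps)            # avoid log(0)
    cll = np.mean(labels * np.log(p) + (1 - labels) * np.log(1.0 - p))
    return pr_auc, cll

# Toy usage:
pr_auc, cll = evaluate(np.array([0.9, 0.2, 0.7, 0.1]), np.array([1, 0, 1, 0]))
```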

Results: Cora AUC

Results: Cora CLL

Results: WebKB AUC

Results: WebKB CLL

Conclusion
- Ill-conditioning is a real problem in statistical relational learning
- PSCG and DN are an effective solution
  - Efficiently converge to good models
  - No learning rate to tune
  - Orders of magnitude faster than VP
- Details remaining
  - Detecting convergence
  - Preventing overfitting
  - Approximate inference
Try it out in Alchemy.