Scalable Statistical Relational Learning for NLP William Y. Wang William W. Cohen Machine Learning Dept and Language Technologies Inst. joint work with: Kathryn Rivard Mazaitis

Outline
Motivation
Background
 – Logic
 – Probability
 – Combining logic and probabilities: MLNs
ProPPR
 – Key ideas
 – Learning method
 – Results for parameter learning
 – Structure learning for ProPPR for KB completion
 – Joint IE and KB completion
 – Comparison to neural KBC models
Beyond ProPPR
 – …

Motivation

KR & Reasoning
Queries → (Inference Methods, Inference Rules) → Answers
Challenges for KR:
 – Robustness: noise, incompleteness, ambiguity ("Sunnybrook"), statistical information ("foundInRoom(bathtub, bathroom)"), …
 – Complex queries: "which Canadian hockey teams have won the Stanley Cup?"
 – Learning: how to acquire and maintain knowledge and inference rules, and how to use them
"Expressive, probabilistic, efficient: pick any two" is the current state of the art.
What if the DB/KB or inference rules are imperfect?

Large ML-based software system Machine Learning (for complex tasks) Relational, joint learning and inference

Background H/T: "Probabilistic Logic Programming", De Raedt and Kersting

Background: Logic Programs
A program with one definite clause (a Horn clause):
  grandparent(X,Y) :- parent(X,Z),parent(Z,Y)
Here grandparent(X,Y) is the head, parent(X,Z),parent(Z,Y) is the body, and ":-" is the "neck".
Logical variables: X,Y,Z
Constant symbols: bob, alice, …
We'll consider two types of clauses:
 – Horn clauses A:-B1,…,Bk with no constants: an intensional definition, i.e. rules
 – Unit clauses A:- with no variables (facts): parent(alice,bob):- or just parent(alice,bob): an extensional definition, i.e. a database
H/T: "Probabilistic Logic Programming", De Raedt and Kersting

Background: Logic Programs
A program with one definite clause:
  grandparent(X,Y) :- parent(X,Z),parent(Z,Y)
Logical variables: X,Y,Z
Constant symbols: bob, alice, …
Predicates: grandparent/2, parent/2
Alphabet: the set of possible predicates and constants
Atomic formulae: parent(X,Y), parent(alice,bob)
Ground atomic formulae: parent(alice,bob), …
H/T: "Probabilistic Logic Programming", De Raedt and Kersting

Background: Logic Programs
The set of all ground atomic formulae (consistent with a fixed alphabet) is the Herbrand base of a program:
  {parent(alice,alice), parent(alice,bob), …, parent(zeke,zeke), grandparent(alice,alice), …}
An interpretation of a program is a subset of the Herbrand base.
An interpretation M is a model of a program if:
 – for any clause A:-B1,…,Bk in the program and any mapping Theta from the variables in A,B1,…,Bk to constants: if Theta(B1) is in M and … and Theta(Bk) is in M, then Theta(A) is in M.
A program defines a unique least Herbrand model.
H/T: "Probabilistic Logic Programming", De Raedt and Kersting

Background: Logic Programs
A program defines a unique least Herbrand model.
Example program:
  grandparent(X,Y):-parent(X,Z),parent(Z,Y).
  parent(alice,bob).
  parent(bob,chip).
  parent(bob,dana).
The least Herbrand model also includes grandparent(alice,dana) and grandparent(alice,chip).
Finding the least Herbrand model: theorem proving…
Usually we care about answering queries: what are the values of W such that grandparent(alice,W)?
H/T: "Probabilistic Logic Programming", De Raedt and Kersting
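To make this concrete, here is a minimal sketch in Python (not from the slides; all names are illustrative) that computes the least Herbrand model of the example program by forward chaining to a fixpoint, then answers the query grandparent(alice,W):

  # Minimal sketch (not from the slides): least Herbrand model by forward chaining.
  facts = {("parent", "alice", "bob"),
           ("parent", "bob", "chip"),
           ("parent", "bob", "dana")}

  def grandparent_rule(model):
      # grandparent(X,Y) :- parent(X,Z), parent(Z,Y)
      return {("grandparent", x, y)
              for (p1, x, z1) in model
              for (p2, z2, y) in model
              if p1 == "parent" and p2 == "parent" and z1 == z2}

  model = set(facts)
  while True:                      # apply the rule until nothing new is derived
      new = grandparent_rule(model) - model
      if not new:
          break
      model |= new

  # Query grandparent(alice, W):
  print(sorted(w for (pred, x, w) in model if pred == "grandparent" and x == "alice"))
  # -> ['chip', 'dana']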

Motivation
Queries → (Inference Methods, Inference Rules) → Answers
Challenges for KR:
 – Robustness: noise, incompleteness, ambiguity ("Sunnybrook"), statistical information ("foundInRoom(bathtub, bathroom)"), …
 – Complex queries: "which Canadian hockey teams have won the Stanley Cup?"
 – Learning: how to acquire and maintain knowledge and inference rules, and how to use them
The complex query, written as a rule:
  query(T) :- play(T,hockey), hometown(T,C), country(C,canada)
and the answer is the set {T : query(T)}.
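As an illustration of how such a conjunctive rule is answered against a DB of facts, here is a minimal sketch; the team and city facts below are made up for the example and are not from the slides:

  # Minimal sketch (not from the slides): answering the rule above by a join over toy facts.
  play     = {("canadiens", "hockey"), ("oilers", "hockey"), ("steelers", "football")}
  hometown = {("canadiens", "montreal"), ("oilers", "edmonton"), ("steelers", "pittsburgh")}
  country  = {("montreal", "canada"), ("edmonton", "canada"), ("pittsburgh", "usa")}

  # query(T) :- play(T, hockey), hometown(T, C), country(C, canada)
  answers = {t
             for (t, sport) in play if sport == "hockey"
             for (t2, c) in hometown if t2 == t
             for (c2, nation) in country if c2 == c and nation == "canada"}
  print(answers)   # -> {'canadiens', 'oilers'}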

Background
Random variables: burglary, earthquake, …, usually denoted by upper-case letters: B,E,A,J,M.
Joint distribution: Pr(B,E,A,J,M)
  B E A J M | prob
  T T T T T | …
  F T T T T | …
  …
H/T: "Probabilistic Logic Programming", De Raedt and Kersting

Background: Bayes networks
Random variables: B,E,A,J,M
Joint distribution: Pr(B,E,A,J,M)
Directed graphical models give one way of defining a compact model of the joint distribution, via conditional probability tables such as:
  A J | Prob(J|A)
  F F | 0.95
  F T | 0.05
  T F | 0.25
  T T | 0.75

  A M | Prob(M|A)
  F F | 0.80
  …
Queries: Pr(A=t|J=t,M=f) ?
H/T: "Probabilistic Logic Programming", De Raedt and Kersting

Background
Random variables: B,E,A,J,M
Joint distribution: Pr(B,E,A,J,M)
Directed graphical models give one way of defining a compact model of the joint distribution:
  A J | Prob(J|A)
  F F | 0.95
  F T | 0.05
  T F | 0.25
  T T | 0.75
Queries: Pr(A=t|J=t,M=f) ?
H/T: "Probabilistic Logic Programming", De Raedt and Kersting
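A minimal sketch of answering Pr(A=t|J=t,M=f) by enumeration for the burglary/earthquake/alarm network (B,E → A; A → J; A → M). Prob(J|A) comes from the table above and Prob(M=f|A=f)=0.80 from the previous slide; every other number below is an illustrative assumption, not from the slides:

  # Minimal sketch (not from the slides): inference by enumeration.
  import itertools

  P_B = {True: 0.001, False: 0.999}                    # assumed
  P_E = {True: 0.002, False: 0.998}                    # assumed
  P_A = {(True, True): 0.95, (True, False): 0.94,      # P(A=t | B,E), assumed
         (False, True): 0.29, (False, False): 0.001}
  P_J = {True: 0.75, False: 0.05}                      # P(J=t | A), from the slide table
  P_M = {True: 0.70, False: 0.20}                      # P(M=t | A); 0.20 = 1 - 0.80, 0.70 assumed

  def prob(b, e, a, j, m):
      # probability of one complete assignment, factored along the network
      pa = P_A[(b, e)] if a else 1.0 - P_A[(b, e)]
      pj = P_J[a] if j else 1.0 - P_J[a]
      pm = P_M[a] if m else 1.0 - P_M[a]
      return P_B[b] * P_E[e] * pa * pj * pm

  tf = [True, False]
  num = sum(prob(b, e, True, True, False) for b, e in itertools.product(tf, repeat=2))
  den = sum(prob(b, e, a, True, False) for b, e, a in itertools.product(tf, repeat=3))
  print(num / den)   # Pr(A=t | J=t, M=f)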

Background: Markov networks
Random variables: B,E,A,J,M
Joint distribution: Pr(B,E,A,J,M)
Undirected graphical models give another way of defining a compact model of the joint distribution, via potential functions.
ϕ(A=a,J=j) is a scalar measuring the "compatibility" of A=a with J=j:
  A J | ϕ(a,j)
  F F | 20
  F T | 1
  T F | 0.1
  T T | 0.4

Background
ϕ(A=a,J=j) is a scalar measuring the "compatibility" of A=a with J=j; it is a clique potential.
  A J | ϕ(a,j)
  F F | 20
  F T | 1
  T F | 0.1
  T T | 0.4
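A minimal sketch (not from the slides) of how the clique potential above induces a probability distribution over (A,J) once it is normalized by the partition function Z:

  # Minimal sketch: turning "compatibility" scores into probabilities.
  phi = {(False, False): 20.0, (False, True): 1.0,
         (True,  False): 0.1,  (True,  True): 0.4}

  Z = sum(phi.values())                         # partition function, here 21.5
  P = {aj: v / Z for aj, v in phi.items()}      # P(A=a, J=j) = phi(a,j) / Z
  print(P[(False, False)])                      # ~0.93: the most "compatible" setting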

Motivation
Queries → (Inference Methods, Inference Rules) → Answers
Challenges for KR:
 – Robustness: noise, incompleteness, ambiguity ("Sunnybrook"), statistical information ("foundInRoom(bathtub, bathroom)"), …
 – Complex queries: "which Canadian hockey teams have won the Stanley Cup?"
 – Learning: how to acquire and maintain knowledge and inference rules, and how to use them
So far: all of this in a space of "flat" propositions corresponding to single random variables.

Background H/T: "Probabilistic Logic Programming", De Raedt and Kersting ???

Outline
Motivation
Background
 – Logic
 – Probability
 – Combining logic and probabilities: MLNs
ProPPR
 – Key ideas
 – Learning method
 – Results for parameter learning
 – Structure learning for ProPPR for KB completion
 – Joint IE and KB completion
 – Comparison to neural KBC models
Beyond ProPPR
 – …

Markov Networks: [Review]
Undirected graphical models, e.g. over the variables Smoking, Cancer, Asthma, Cough.
Potential functions are defined over cliques, e.g.:
  Smoking Cancer | Ф(S,C)
  False   False  | 4.5
  False   True   | 4.5
  True    False  | 2.7
  True    True   | 4.5
P(x) = (1/Z) ∏_c Φ_c(x_c), with Z = Σ_x ∏_c Φ_c(x_c), where x is the vector of all variable values and x_c is the short vector of values for clique c.
H/T: Pedro Domingos

Markov Logic: Intuition
A logical KB is a set of hard constraints on the set of possible worlds.
Let's make them soft constraints: when a world violates a formula, it becomes less probable, not impossible.
Give each formula a weight (higher weight → stronger constraint).
H/T: Pedro Domingos

Markov Logic: Definition
A Markov Logic Network (MLN) is a set of pairs (F, w) where
 – F is a formula in first-order logic
 – w is a real number
Together with a set of constants, it defines a Markov network with
 – one node for each grounding of each predicate in the MLN
 – one feature for each grounding of each formula F in the MLN, with the corresponding weight w
H/T: Pedro Domingos
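A minimal sketch (not from the slides) of the first half of this construction: enumerating one node per grounding of each predicate, using the two-constant Friends & Smokers signature of the following slides:

  # Minimal sketch: ground atoms = nodes of the ground Markov network.
  from itertools import product

  constants = ["Anna", "Bob"]
  predicates = {"Smokes": 1, "Cancer": 1, "Friends": 2}

  ground_atoms = [(pred, args)
                  for pred, arity in predicates.items()
                  for args in product(constants, repeat=arity)]
  print(len(ground_atoms))   # 2 + 2 + 4 = 8 nodes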

Example: Friends & Smokers H/T: Pedro Domingos

Example: Friends & Smokers Two constants: Anna (A) and Bob (B) H/T: Pedro Domingos

Example: Friends & Smokers Two constants: Anna (A) and Bob (B). Ground network nodes so far: Cancer(A), Smokes(A), Smokes(B), Cancer(B). H/T: Pedro Domingos

Example: Friends & Smokers Two constants: Anna (A) and Bob (B). Ground network nodes: Cancer(A), Smokes(A), Friends(A,A), Friends(B,A), Smokes(B), Friends(A,B), Cancer(B), Friends(B,B). H/T: Pedro Domingos

Markov Logic Networks
An MLN is a template for ground Markov nets.
Probability of a world x:
  P(x) = (1/Z) exp( Σ_i w_i n_i(x) )
where w_i is the weight of formula i and n_i(x) is the number of true groundings of formula i in x.
H/T: Pedro Domingos
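A minimal sketch of this formula on the two-constant example. The two formulas and weights below, Smokes(x) ⇒ Cancer(x) with w=1.5 and Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y)) with w=1.1, are the usual illustrative choices for this example, assumed here rather than taken from these slides:

  # Minimal sketch (assumed formulas/weights): P(x) = (1/Z) exp(sum_i w_i * n_i(x)).
  from itertools import product
  from math import exp

  C = ["A", "B"]
  atoms = [("Smokes", (c,)) for c in C] + [("Cancer", (c,)) for c in C] + \
          [("Friends", (x, y)) for x in C for y in C]

  def n1(x):  # true groundings of Smokes(c) => Cancer(c)
      return sum(1 for c in C if (not x[("Smokes", (c,))]) or x[("Cancer", (c,))])

  def n2(x):  # true groundings of Friends(a,b) => (Smokes(a) <=> Smokes(b))
      return sum(1 for a in C for b in C
                 if (not x[("Friends", (a, b))])
                 or (x[("Smokes", (a,))] == x[("Smokes", (b,))]))

  def score(x):  # unnormalized probability of a world x
      return exp(1.5 * n1(x) + 1.1 * n2(x))

  worlds = [dict(zip(atoms, vals))
            for vals in product([True, False], repeat=len(atoms))]   # 2^8 worlds
  Z = sum(score(x) for x in worlds)                                  # partition function

  # e.g., the marginal probability that Anna has cancer:
  print(sum(score(x) for x in worlds if x[("Cancer", ("A",))]) / Z)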

MLNs generalize many statistical models
Special cases:
 – Markov networks
 – Markov random fields
 – Bayesian networks
 – Log-linear models
 – Exponential models
 – Max. entropy models
 – Gibbs distributions
 – Boltzmann machines
 – Logistic regression
 – Hidden Markov models
 – Conditional random fields
These special cases are obtained by making all predicates zero-arity.
Markov logic allows objects to be interdependent (non-i.i.d.).

MLNs generalize logic programs
 – Subsets of the Herbrand base → the domain of the joint distribution
 – An interpretation → an element of the joint
 – Consistency with all clauses A:-B1,…,Bk (being a "model of the program") → compatibility with the program as determined by the clique potentials
MLNs reach pure logic in the limit when the potentials are infinite.

MLNs are expensive
 – Inference is done by explicitly building a ground MLN
   – The Herbrand base is huge for reasonable programs: it grows faster than the size of the DB of facts
   – You'd like to be able to use a huge DB; NELL is O(10M)
 – Inference on an arbitrary MLN is expensive: #P-complete
   – It's not obvious how to restrict the template so that the MLNs will be tractable

What's the alternative?
There are many probabilistic LPs:
 – Compile to other 0th-order formats (Bayesian LPs, ProbLog, …)
 – Impose a distribution over proofs, not interpretations (Probabilistic Constraint LPs, Stochastic LPs, …): this requires generating all proofs to answer queries, which is also a large space
 – Sample from the space of proofs (PRISM, BLOG)
Limited relational extensions to 0th-order models (PRMs, RDTs, MEBNs, …)
Probabilistic programming languages (Church, …)
 – Imperative languages for defining complex probabilistic models (related LP work: PRISM)

Outline
Motivation
Background
 – Logic
 – Probability
 – Combining logic and probabilities: MLNs
ProPPR
 – Key ideas
 – Learning method
 – Results for parameter learning
 – Structure learning for ProPPR for KB completion
 – Joint IE and KB completion
 – Comparison to neural KBC models
Beyond ProPPR
 – …