Scalable Statistical Relational Learning for NLP


Scalable Statistical Relational Learning for NLP
William Wang (CMU → UCSB), William Cohen (CMU)

Outline
- Motivation/Background: logic; probability
- Combining logic and probabilities: inference and semantics (MLNs); probabilistic DBs and the independent-tuple mechanism
- Recent research: ProPPR – a scalable probabilistic logic; structure learning; applications: knowledge-base completion, joint learning
- Cutting-edge research ...

Motivation - 1
Surprisingly many tasks in NLP can be mostly solved with data, learning, and not much else: e.g., document classification, document retrieval.
Some can't: e.g., a semantic parse of sentences like "What professors from UCSD have founded startups that were sold to a big tech company based in the Bay Area?"
We seem to need logic: { X : founded(X,Y), startupCompany(Y), acquiredBy(Y,Z), company(Z), big(Z), headquarters(Z,W), city(W), bayArea(W) }

Motivation
Surprisingly many tasks in NLP can be mostly solved with data, learning, and not much else: e.g., document classification, document retrieval.
Some can't: e.g., a semantic parse of sentences like "What professors from UCSD have founded startups that were sold to a big tech company based in the Bay Area?"
We seem to need logic as well as uncertainty: { X : founded(X,Y), startupCompany(Y), acquiredBy(Y,Z), company(Z), big(Z), headquarters(Z,W), city(W), bayArea(W) }
Logic and uncertainty have long histories and mostly don't play well together.

Motivation – 2
The results of NLP are often expressible in logic.
The results of NLP are often uncertain.
Logic and uncertainty have long histories and mostly don't play well together.

KR & Reasoning: what if the DB/KB or inference rules are imperfect?
(Diagram: a DB/KB plus inference methods and inference rules map queries to answers.)
Challenges for KR:
- Robustness: noise, incompleteness, ambiguity ("Sunnybrook"), statistical information ("foundInRoom(bathtub, bathroom)"), ...
- Complex queries: "which Canadian hockey teams have won the Stanley Cup?"
- Learning: how to acquire and maintain knowledge and inference rules, as well as how to use them.
Current state of the art: "Expressive, probabilistic, efficient: pick any two."

Three Areas of Data Science (Venn diagram): Representation, Scalability, and Machine Learning. The pairwise overlaps are labeled Probabilistic logics / Representation learning, Abstract Machines / Binarization, and Scalable Learning; Scalable Statistical Relational Learning sits where all three meet.

Outline
- Motivation/Background: logic; probability
- Combining logic and probabilities: inference and semantics (MLNs); probabilistic DBs and the independent-tuple mechanism
- Recent research: ProPPR – a scalable probabilistic logic; structure learning; applications: knowledge-base completion, joint learning
- Cutting-edge research ...

Background: Logic Programs
A program with one definite clause (a Horn clause): grandparent(X,Y) :- parent(X,Z),parent(Z,Y)
In A :- B1,...,Bk, A is the head, ":-" is the "neck", and B1,...,Bk is the body.
Logical variables: X,Y,Z. Constant symbols: bob, alice, ...
We'll consider two types of clauses:
- Horn clauses A:-B1,...,Bk with no constants (the intensional definition: rules)
- Unit clauses A:- with no variables (facts), e.g. parent(alice,bob):- or just parent(alice,bob) (the extensional definition: the database)
H/T: "Probabilistic Logic Programming," De Raedt and Kersting

Background: Logic Programs
A program with one definite clause: grandparent(X,Y) :- parent(X,Z),parent(Z,Y)
Logical variables: X,Y,Z. Constant symbols: bob, alice, ... Predicates: grandparent, parent.
Alphabet: the set of possible predicates and constants.
Atomic formulae: parent(X,Y), parent(alice,bob). Ground atomic formulae: parent(alice,bob), ...
H/T: "Probabilistic Logic Programming," De Raedt and Kersting

Background: Logic Programs
The set of all ground atomic formulae (consistent with a fixed alphabet) is the Herbrand base of a program: {parent(alice,alice), parent(alice,bob), ..., parent(zeke,zeke), grandparent(alice,alice), ...}
An interpretation of a program is a subset of the Herbrand base.
An interpretation M is a model of a program if, for any A:-B1,...,Bk in the program and any mapping Theta from the variables in A,B1,...,Bk to constants: if Theta(B1) is in M and ... and Theta(Bk) is in M, then Theta(A) is in M (i.e., M is deductively closed).
A program defines a unique least Herbrand model.
H/T: "Probabilistic Logic Programming," De Raedt and Kersting

Background: Logic Programs
A program defines a unique least Herbrand model.
Example program: grandparent(X,Y):-parent(X,Z),parent(Z,Y). parent(alice,bob). parent(bob,chip). parent(bob,dana).
The least Herbrand model also includes grandparent(alice,dana) and grandparent(alice,chip).
Finding the least Herbrand model: theorem proving...
Usually we care about answering queries: what are the values of W such that grandparent(alice,W)?
H/T: "Probabilistic Logic Programming," De Raedt and Kersting
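To make "deductive closure" and query answering concrete, here is a minimal sketch (mine, not from the slides) of naive forward chaining in Python for the grandparent program above; it computes the least Herbrand model by iterating the rule to a fixpoint, then reads off the answers to grandparent(alice,W).

# Naive forward chaining for a function-free Horn program:
# iterate the grandparent rule until no new facts are derived.
from itertools import product

facts = {("parent", "alice", "bob"),
         ("parent", "bob", "chip"),
         ("parent", "bob", "dana")}

def apply_rule(parent_facts):
    # grandparent(X,Y) :- parent(X,Z), parent(Z,Y)
    derived = set()
    for (_, x, z1), (_, z2, y) in product(parent_facts, repeat=2):
        if z1 == z2:
            derived.add(("grandparent", x, y))
    return derived

model = set(facts)
while True:
    parents = {f for f in model if f[0] == "parent"}
    new = apply_rule(parents) - model
    if not new:        # fixpoint: model is now the least Herbrand model
        break
    model |= new

print(sorted(f for f in model if f[:2] == ("grandparent", "alice")))
# -> [('grandparent', 'alice', 'chip'), ('grandparent', 'alice', 'dana')]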

Motivation
(Same KR diagram: a DB/KB plus inference methods and inference rules map queries {T : query(T)}? to answers.)
Challenges for KR:
- Robustness: noise, incompleteness, ambiguity ("Sunnybrook"), statistical information ("foundInRoom(bathtub, bathroom)"), ...
- Complex queries: "which Canadian hockey teams have won the Stanley Cup?" For example: query(T) :- play(T,hockey), hometown(T,C), country(C,canada)
- Learning: how to acquire and maintain knowledge and inference rules, as well as how to use them.

Background: Probabilistic Inference
Random variables: burglary, earthquake, ... Usually denoted with upper-case letters: B,E,A,J,M.
Joint distribution: Pr(B,E,A,J,M), a table with one probability for each joint assignment (the slide shows only a fragment, e.g. entries 0.00001 and 0.03723).
H/T: "Probabilistic Logic Programming," De Raedt and Kersting

Background: Bayes networks
Random variables: B,E,A,J,M. Joint distribution: Pr(B,E,A,J,M).
Directed graphical models give one way of defining a compact model of the joint distribution.
Queries: Pr(A=t | J=t, M=f)?
(Example conditional probability tables, shown only partially on the slide: Prob(J|A) with entries 0.95, 0.05, 0.25, 0.75, and Prob(M|A) with entries 0.80, ...)
H/T: "Probabilistic Logic Programming," De Raedt and Kersting

Background
Random variables: B,E,A,J,M. Joint distribution: Pr(B,E,A,J,M).
Directed graphical models give one way of defining a compact model of the joint distribution.
Queries: Pr(A=t | J=t, M=f)?
H/T: "Probabilistic Logic Programming," De Raedt and Kersting
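As an aside (not from the slides), here is a small Python sketch of answering such a query by enumeration on the classic burglary/earthquake alarm network; the CPT values below are illustrative placeholders, since the slide's tables are only partially legible.

# Inference by enumeration: Pr(A=t | J=t, M=f) in the alarm network.
# All CPT numbers are placeholders, not the slide's values.
from itertools import product

P_B_true = 0.001                      # Pr(B=t)
P_E_true = 0.002                      # Pr(E=t)
P_A_true = {(True, True): 0.95, (True, False): 0.94,
            (False, True): 0.29, (False, False): 0.001}   # Pr(A=t | B, E)
P_J_true = {True: 0.90, False: 0.05}  # Pr(J=t | A)
P_M_true = {True: 0.70, False: 0.01}  # Pr(M=t | A)

def pr(p_true, value):                # Pr(var = value) from Pr(var = True)
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):             # Pr(B=b, E=e, A=a, J=j, M=m)
    return (pr(P_B_true, b) * pr(P_E_true, e) * pr(P_A_true[(b, e)], a) *
            pr(P_J_true[a], j) * pr(P_M_true[a], m))

def query_A(j, m):                    # Pr(A=t | J=j, M=m), summing out B and E
    num = sum(joint(b, e, True, j, m) for b, e in product([True, False], repeat=2))
    den = sum(joint(b, e, a, j, m) for b, e, a in product([True, False], repeat=3))
    return num / den

print(query_A(True, False))           # Pr(A=t | J=t, M=f)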

Background: Markov networks
Random variables: B,E,A,J,M. Joint distribution: Pr(B,E,A,J,M).
Undirected graphical models give another way of defining a compact model of the joint distribution, via potential functions.
ϕ(A=a,J=j) is a scalar measuring the "compatibility" of A=a and J=j.
(Example potential table ϕ(a,j), shown only partially on the slide: entries 20, 1, 0.1, 0.4.)

Background
The joint distribution is a normalized product of clique potentials: Pr(B,E,A,J,M) = (1/Z) · ϕ1(...) · ϕ2(...) · ...
ϕ(A=a,J=j) is a scalar measuring the "compatibility" of A=a and J=j (example entries: 20, 1, 0.1, 0.4).

Another example
An undirected graphical model over Smoking, Cancer, Asthma, Cough [h/t Pedro Domingos].
x = the vector of all variables; xc = the short vector of variables in clique c.
(Example potential Ф(S,C), shown only partially on the slide: entries 4.5 for False and 2.7 for True.)
H/T: Pedro Domingos
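To spell out "compact model via potential functions", here is a toy Python sketch (not from the slides) that scores worlds of the Smoking/Cancer/Asthma/Cough example as a normalized product of pairwise potentials; both the clique structure and all potential values are assumptions made for illustration.

# A world's probability is proportional to the product of its clique potentials.
from itertools import product

# One table per pairwise clique; all numbers are illustrative placeholders.
phi_smoke_cancer = {(False, False): 4.5, (False, True): 4.5,
                    (True, False): 2.7, (True, True): 4.5}
phi_cancer_cough = {(False, False): 2.0, (False, True): 1.0,
                    (True, False): 1.0, (True, True): 2.0}
phi_smoke_asthma = {(False, False): 1.5, (False, True): 1.0,
                    (True, False): 1.0, (True, True): 1.5}

def score(s, c, a, co):               # unnormalized probability of one world
    return (phi_smoke_cancer[(s, c)] *
            phi_cancer_cough[(c, co)] *
            phi_smoke_asthma[(s, a)])

Z = sum(score(*world) for world in product([False, True], repeat=4))
def prob(s, c, a, co):
    return score(s, c, a, co) / Z

print(prob(True, True, False, True))  # Pr(Smoking, Cancer, no Asthma, Cough)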

Motivation
(Same KR diagram as before, but the KB now lives in a space of "flat" propositions with corresponding random variables; inference methods and inference rules map queries to answers.)
Challenges for KR:
- Robustness: noise, incompleteness, ambiguity ("Sunnybrook"), statistical information ("foundInRoom(bathtub, bathroom)"), ...
- Complex queries: "which Canadian hockey teams have won the Stanley Cup?"
- Learning: how to acquire and maintain knowledge and inference rules, as well as how to use them.

Outline
- Motivation/Background: logic; probability
- Combining logic and probabilities: inference and semantics (MLNs); probabilistic DBs and the independent-tuple mechanism
- Recent research: ProPPR – a scalable probabilistic logic; structure learning; applications: knowledge-base completion, joint learning
- Cutting-edge research

Three Areas of Data Science (Venn diagram, repeated): Representation, Scalability, and Machine Learning, with Probabilistic logics / Representation learning, Abstract Machines / Binarization, and Scalable Learning on the pairwise overlaps; this part of the talk (MLNs) sits at the center.

Another example (repeated)
An undirected graphical model over Smoking, Cancer, Asthma, Cough [h/t Pedro Domingos].
x = the vector of all variables.
(Example potential Ф(S,C), shown only partially: entries 4.5 for False and 2.7 for True.)

Another example
An undirected graphical model over Smoking, Cancer, Asthma, Cough [h/t Pedro Domingos].
x = the vector of all variables.
(Example potential Ф(S,C), shown only partially: entries 1.0 for False and 0.1 for True.)
This is a soft constraint that smoking ⇒ cancer.

Markov Logic: Intuition [Domingos et al]
A logical KB is a set of hard constraints on the set of possible worlds: they are constrained to be deductively closed.
Let's make closure a soft constraint: when a world is not deductively closed, it becomes less probable.
Give each rule a weight, which is a reward for satisfying it (higher weight ⇒ stronger constraint).

Markov Logic: Definition
A Markov Logic Network (MLN) is a set of pairs (F, w) where F is a formula in first-order logic and w is a real number.
Together with a set of constants, it defines a Markov network with:
- one node for each grounding of each predicate in the MLN (each element of the Herbrand base), and
- one feature for each grounding of each formula F in the MLN, with the corresponding weight w.
H/T: Pedro Domingos

Example: Friends & Smokers H/T: Pedro Domingos
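The weighted formulas on this and the next slides are images in the original deck and did not survive the transcript. For reference, the canonical version of this example (from Domingos's slides) uses two weighted first-order formulas, roughly:

1.5 \quad \forall x\;\; \mathrm{Smokes}(x) \Rightarrow \mathrm{Cancer}(x)
1.1 \quad \forall x, y\;\; \mathrm{Friends}(x,y) \Rightarrow (\mathrm{Smokes}(x) \Leftrightarrow \mathrm{Smokes}(y))

i.e., smoking tends to cause cancer, and friends tend to have similar smoking habits; the exact weights are not essential.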

Example: Friends & Smokers Two constants: Anna (A) and Bob (B) H/T: Pedro Domingos

Example: Friends & Smokers Two constants: Anna (A) and Bob (B) Smokes(A) Smokes(B) Cancer(A) Cancer(B) H/T: Pedro Domingos

Example: Friends & Smokers Two constants: Anna (A) and Bob (B) Friends(A,B) Friends(A,A) Smokes(A) Smokes(B) Friends(B,B) Cancer(A) Cancer(B) Friends(B,A) H/T: Pedro Domingos
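To make "one node per grounding of each predicate, one feature per grounding of each formula" concrete, here is a small Python sketch (not from the slides) that enumerates the ground atoms and ground formulas of the two Friends & Smokers rules for the constants Anna (A) and Bob (B); the formula strings and weights are the assumed canonical ones from the note above.

# Grounding the Friends & Smokers MLN over constants {A, B}:
# the ground atoms become the nodes, the ground formulas become weighted features.
from itertools import product

constants = ["A", "B"]
formulas = [
    (1.5, "Smokes({x}) => Cancer({x})", ["x"]),
    (1.1, "Friends({x},{y}) => (Smokes({x}) <=> Smokes({y}))", ["x", "y"]),
]

atoms  = [f"Smokes({c})" for c in constants]
atoms += [f"Cancer({c})" for c in constants]
atoms += [f"Friends({c},{d})" for c, d in product(constants, repeat=2)]
print(len(atoms), "ground atoms (nodes):", atoms)        # 8, as in the slide's graph

features = []
for weight, template, variables in formulas:
    for binding in product(constants, repeat=len(variables)):
        features.append((weight, template.format(**dict(zip(variables, binding)))))
print(len(features), "ground formulas (features)")       # 2 + 4 = 6
for weight, ground_formula in features:
    print(weight, ground_formula)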

Markov Logic Networks
An MLN is a template for ground Markov nets. Probability of a world x:
P(X = x) = \frac{1}{Z} \exp\Big( \sum_i w_i \, n_i(x) \Big)
where w_i is the weight of formula i and n_i(x) is the number of true groundings of formula i in x. Recall that for an ordinary Markov net, P(X = x) = \frac{1}{Z} \prod_k \phi_k(x_k).
H/T: Pedro Domingos

MLNs generalize many statistical models
Special cases (obtained by making all predicates zero-arity): Markov networks, Bayesian networks, log-linear models, exponential models, max. entropy models, Gibbs distributions, Boltzmann machines, logistic regression, hidden Markov models, conditional random fields.
Markov logic additionally allows objects to be interdependent (non-i.i.d.).
H/T: Pedro Domingos
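As a sanity check on one of these special cases (my derivation, not on the slide): take zero-arity predicates Y, X_1, ..., X_n and give each formula X_i \wedge Y weight w_i. Conditioning on the X_i, the only weighted groundings whose truth depends on Y are these conjunctions, so

P(Y = 1 \mid x) \;=\; \frac{\exp\big(\sum_i w_i x_i\big)}{1 + \exp\big(\sum_i w_i x_i\big)},

which is exactly logistic regression (adding the formula Y by itself with weight w_0 supplies the bias term).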

MLNs generalize logic programs
- Subsets of the Herbrand base ~ domain of the joint distribution
- Interpretation ~ element of the joint
- Consistency with all clauses A:-B1,...,Bk (i.e., "model of the program") ~ compatibility with the program as determined by the clique potentials
- Reaches logic in the limit when the potentials are infinite (sort of)
H/T: Pedro Domingos

MLNs are expensive
- Inference is done by explicitly building a ground MLN, and the Herbrand base is huge for reasonable programs: it grows faster than the size of the DB of facts, and you'd like to be able to use a huge DB (NELL is O(10M)).
- After that, inference on an arbitrary MLN is expensive: #P-complete.
- It's not obvious how to restrict the template so the MLNs will be tractable.
- Possible solution: PSL (Getoor et al), which uses a hinge loss, leading to a convex optimization task.
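For a sense of scale (a back-of-the-envelope estimate, not a figure from the slides): with |C| constants, a single binary predicate already has |C|^2 possible ground atoms, so at NELL scale, |C| \approx 10^7, that is on the order of 10^{14} nodes for one predicate, far too many to materialize, let alone to run #P-complete inference over.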

What are the alternatives? There are many probabilistic LPs:
- Compile to other 0th-order formats (Bayesian LPs – replace the undirected model with a directed one)
- Impose a distribution over proofs, not interpretations (Probabilistic Constraint LPs, Stochastic LPs, ...): requires generating all proofs to answer queries, also a large space
- Limited relational extensions to 0th-order models (PRMs, RDTs, ...)
- Probabilistic programming languages (Church, ...): imperative languages for defining complex probabilistic models (related LP work: PRISM)
- Probabilistic Deductive Databases

Recap: Logic Programs
A program with one definite clause (a Horn clause): grandparent(X,Y) :- parent(X,Z),parent(Z,Y)
In A :- B1,...,Bk, A is the head, ":-" is the "neck", and B1,...,Bk is the body.
Logical variables: X,Y,Z. Constant symbols: bob, alice, ...
We'll consider two types of clauses:
- Horn clauses A:-B1,...,Bk with no constants (the intensional definition: rules)
- Unit clauses A:- with no variables (facts), e.g. parent(alice,bob):- or just parent(alice,bob) (the extensional definition: the database)
H/T: "Probabilistic Logic Programming," De Raedt and Kersting

A PrDDB
Actually all constants are only in the database.
Confidences/numbers are associated with DB facts, not rules.

A PrDDB
Old trick (David Poole?): if you want to weight a rule you can introduce a rule-specific fact. Instead of writing
r3. status(X,tired) :- child(W,X), infant(W) {r3}.
add a weighted fact weighted(r3), 0.88 to the DB and write
r3. status(X,tired) :- child(W,X), infant(W), weighted(r3).
So learning rule weights is a special case of learning weights for selected DB facts (and vice versa).

Simplest Semantics for a PrDDB
Pick a hard database I from some distribution D over databases, Pr(I | D). The tuple-independence model says: just toss a biased coin for each "soft" fact.
Compute the ordinary deductive closure (the least model) of I.
Define Pr(fact f) = Pr(closure(I) contains fact f).
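A minimal sketch of this semantics in Python (mine, not from the slides): sample a hard database by independent coin flips, close it under the rules, and estimate Pr(f) as the fraction of samples whose closure contains f. The facts, weights, and the single "tired" rule are illustrative stand-ins for the running example.

# Monte Carlo reading of the tuple-independence semantics.
import random

soft_facts = {                         # DB fact -> probability theta
    ("child", "liam", "eve"): 0.9,
    ("infant", "liam"): 0.8,
    ("child", "dave", "eve"): 0.6,
    ("infant", "dave"): 0.3,
}

def closure(db):
    # r3. status(X,tired) :- child(W,X), infant(W)
    closed = set(db)
    for fact in db:
        if fact[0] == "child":
            _, w, x = fact
            if ("infant", w) in db:
                closed.add(("status", x, "tired"))
    return closed

def prob(query, n_samples=100_000, seed=0):
    rng, hits = random.Random(seed), 0
    for _ in range(n_samples):
        sampled_db = {f for f, theta in soft_facts.items() if rng.random() < theta}
        if query in closure(sampled_db):
            hits += 1
    return hits / n_samples

print(prob(("status", "eve", "tired")))
# exact value here: 1 - (1 - 0.9*0.8)*(1 - 0.6*0.3) = 0.7704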

Simplest Semantics for a PrDDB
Under tuple independence, Pr(I \mid D) = \prod_{f' \in I} \theta_{f'} \prod_{f' \in DB \setminus I} (1 - \theta_{f'}), where \theta_{f'} is the weight associated with fact f'.

Implementing the independent tuple model An explanation of a fact f is some minimal subset of the DB facts which allows you to conclude f using the theory. You can generate all possible explanations Ex(f) of fact f using a theorem prover Ex(status(eve,tired)) = { { child(liam,eve),infant(liam) } , { child(dave,eve),infant(dave) } }

Implementing the independent tuple model An explanation of a fact f is some minimal subset of the DB facts which allows you to conclude f using the theory. You can generate all possible explanations Ex(f) of fact f using a theorem prover Ex (status(bob,tired)) = { { child(liam,bob),infant(liam) } }

Implementing the independent tuple model
An explanation of a fact f is some minimal subset of the DB facts which allows you to conclude f using the theory. You can generate all possible explanations using a theorem prover.
The tuple-independence score for a fact, Pr(f), depends only on the explanations!
Key step: Pr(f) = Pr(at least one explanation E in Ex(f) has all of its facts included in I).

Implementing the independent tuple model
If there's just one explanation E, we're home free: Pr(f) = \prod_{f' \in E} \theta_{f'}.
If there are many explanations, we can try computing Pr(f) by adding up this quantity for each explanation E ... except, of course, that this double-counts interpretations that are supersets of two or more explanations.

Implementing the independent tuple model
If there are many explanations, we can compute Pr(f) as the probability that at least one explanation is fully contained in I, correcting for the overlaps (inclusion-exclusion).
This is not easy: basically the counting gets hard (#P-hard) when explanations overlap. This makes sense: we're looking at overlapping conjunctions of independent events.
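For just two explanations the correction is ordinary inclusion-exclusion (written out here for concreteness; it is not spelled out on the slide):

\Pr(f) \;=\; \prod_{f' \in E_1} \theta_{f'} \;+\; \prod_{f' \in E_2} \theta_{f'} \;-\; \prod_{f' \in E_1 \cup E_2} \theta_{f'}

and with k overlapping explanations the alternating sum has 2^k - 1 terms, which is where the #P-hardness bites.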

Implementing the independent tuple model An explanation of a fact f is some minimal subset of the DB facts which allows you to conclude f using the theory. You can generate all possible explanations using a theorem prover Ex (status(bob,tired)) = { { child(liam,bob),infant(liam) } }

Implementing the independent tuple model Ex (status(bob,tired)) = { { child(dave,eve), husband(eve,bob), infant(dave) }, { child(liam,bob), infant(liam) }, { child(liam,eve), husband(eve,bob), infant(liam) } }
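A small Python sketch (with assumed fact probabilities, not values from the slides) that turns the three explanations above into Pr(status(bob,tired)) by brute-force inclusion-exclusion:

# Pr(f) = Pr(at least one explanation has all of its facts in the sampled DB),
# computed by inclusion-exclusion over the (overlapping) explanations.
from itertools import combinations
from math import prod

theta = {                             # assumed per-fact probabilities
    "child(dave,eve)": 0.6, "husband(eve,bob)": 0.9, "infant(dave)": 0.3,
    "child(liam,bob)": 0.7, "infant(liam)": 0.8,
    "child(liam,eve)": 0.9,
}
explanations = [
    {"child(dave,eve)", "husband(eve,bob)", "infant(dave)"},
    {"child(liam,bob)", "infant(liam)"},
    {"child(liam,eve)", "husband(eve,bob)", "infant(liam)"},
]

def prob_at_least_one(expls):
    total = 0.0
    for k in range(1, len(expls) + 1):
        for subset in combinations(expls, k):
            union = set().union(*subset)   # facts needed by all chosen explanations
            total += (-1) ** (k + 1) * prod(theta[f] for f in union)
    return total

print(prob_at_least_one(explanations))     # Pr(status(bob,tired))
# With 3 explanations this sum already has 2**3 - 1 = 7 terms; it blows up as explanations overlap.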

A torture test for the independent tuple model [de Raedt et al; ProbLog2]
Each edge of a grid is a DB fact e(cell1,cell2); prove pathBetween(x,y).
Proofs reuse the same DB tuples, and keeping track of all the proofs and of tuple reuse is expensive.
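For intuition about the blow-up (an aside, not on the slide): even counting only monotone paths between opposite corners of an n x n grid gives \binom{2n}{n} distinct explanations, e.g. \binom{20}{10} = 184,756 for a 10 x 10 grid, and these explanations overlap heavily, so explicit inclusion-exclusion is hopeless; this is the kind of counting ProbLog2 attacks by compiling to weighted Boolean formulas instead (see the Fierens et al. reference in the Key References slide).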

Beyond the tuple-independence model?
There are smart ways to speed up the weighted-proof counting you need to do, but it's still hard, and the input can be huge.
There's a lot of work on extending the independent tuple model, e.g.:
- introducing multinomial random variables to choose between related facts like age(dave,infant), age(dave,toddler), ..., age(dave,adult)
- using MLNs to characterize the dependencies between facts in the DB
There's not much work on cheaper models...

What are the alternatives? There are many probabilistic LPs:
- Compile to other 0th-order formats (Bayesian LPs – replace the undirected model with a directed one)
- Impose a distribution over proofs, not interpretations (Probabilistic Constraint LPs, Stochastic LPs, ...): requires generating all proofs to answer queries, also a large space ... but at least scoring in that space is efficient

Outline
- Motivation/Background: logic; probability
- Combining logic and probabilities: inference and semantics (MLNs); probabilistic DBs and the independent-tuple mechanism
- Recent research: ProPPR – a scalable probabilistic logic; structure learning; applications: knowledge-base completion, joint learning
- Cutting-edge research ...

Key References for Part 1
Probabilistic logics that are converted to 0th-order models:
- Suciu et al., Probabilistic Databases, Morgan & Claypool, 2011
- Fierens, ..., De Raedt, Inference and Learning in Probabilistic Logic Programs using Weighted Boolean Formulas, to appear (the ProbLog2 paper)
- Sen, ..., Getoor, PrDB: Managing and Exploiting Rich Correlations in Probabilistic DBs, The VLDB Journal 18, 2009
Stochastic Logic Programs:
- Cussens, Parameter Estimation in SLPs, MLJ 44(3), 2001
- Kimmig, ..., Getoor, Lifted Graphical Models: A Survey, MLJ 99(1), 2015
MLNs:
- Richardson & Domingos, Markov Logic Networks, MLJ 62(1-2), 2006; also a book in the Morgan & Claypool Synthesis series
PSL:
- Brocheler, ..., Getoor, Probabilistic Similarity Logic, UAI 2010
Independent tuple model and extensions:
- Poole, The Independent Choice Logic for Modelling Multiple Agents under Uncertainty, AIJ 94(1), 1997