Markov Logic Networks: A Unified Approach To Language Processing Pedro Domingos Dept. of Computer Science & Eng. University of Washington Joint work with.

Slides:

Advertisements

Similar presentations

Online Max-Margin Weight Learning with Markov Logic Networks Tuyen N. Huynh and Raymond J. Mooney Machine Learning Group Department of Computer Science.

Advertisements

Joint Inference in Information Extraction Hoifung Poon Dept. Computer Science & Eng. University of Washington (Joint work with Pedro Domingos)

Bayesian Belief Propagation

Markov Networks Alan Ritter.

Discriminative Training of Markov Logic Networks

University of Texas at Austin Machine Learning Group Department of Computer Sciences University of Texas at Austin Discriminative Structure and Parameter.

Online Max-Margin Weight Learning for Markov Logic Networks Tuyen N. Huynh and Raymond J. Mooney Machine Learning Group Department of Computer Science.

Coreference Based Event-Argument Relation Extraction on Biomedical Text Katsumasa Yoshikawa 1), Sebastian Riedel 2), Tsutomu Hirao 3), Masayuki Asahara.

CPSC 322, Lecture 30Slide 1 Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 30 March, 25, 2015 Slide source: from Pedro Domingos UW.

Markov Logic Networks: Exploring their Application to Social Network Analysis Parag Singla Dept. of Computer Science and Engineering Indian Institute of.

Relational Representations Daniel Lowd University of Oregon April 20, 2015.

Supervised Learning Recap

Undirected Probabilistic Graphical Models (Markov Nets) (Slides from Sam Roweis Lecture)

Markov Logic Networks Instructor: Pedro Domingos.

Undirected Probabilistic Graphical Models (Markov Nets) (Slides from Sam Roweis)

EE462 MLCV Lecture Introduction of Graphical Models Markov Random Fields Segmentation Tae-Kyun Kim 1.

Markov Logic: Combining Logic and Probability Parag Singla Dept. of Computer Science & Engineering Indian Institute of Technology Delhi.

1 Unsupervised Semantic Parsing Hoifung Poon and Pedro Domingos EMNLP 2009 Best Paper Award Speaker: Hao Xiong.

Review Markov Logic Networks Mathew Richardson Pedro Domingos Xinran(Sean) Luo, u

Adbuctive Markov Logic for Plan Recognition Parag Singla & Raymond J. Mooney Dept. of Computer Science University of Texas, Austin.

Markov Networks.

Efficient Weight Learning for Markov Logic Networks Daniel Lowd University of Washington (Joint work with Pedro Domingos)

Unifying Logical and Statistical AI Pedro Domingos Dept. of Computer Science & Eng. University of Washington Joint work with Jesse Davis, Stanley Kok,

Markov Logic: A Unifying Framework for Statistical Relational Learning Pedro Domingos Matthew Richardson

Speaker:Benedict Fehringer Seminar:Probabilistic Models for Information Extraction by Dr. Martin Theobald and Maximilian Dylla Based on Richards, M., and.

School of Computing Science Simon Fraser University Vancouver, Canada.

Learning Markov Network Structure with Decision Trees Daniel Lowd University of Oregon Jesse Davis Katholieke Universiteit Leuven Joint work with:

GS 540 week 6. HMM basics Given a sequence, and state parameters: – Each possible path through the states has a certain probability of emitting the sequence.

Logistics Course reviews Project report deadline: March 16 Poster session guidelines: – 2.5 minutes per poster (3 hrs / 55 minus overhead) – presentations.

CSE 574 – Artificial Intelligence II Statistical Relational Learning Instructor: Pedro Domingos.

Statistical Relational Learning Pedro Domingos Dept. of Computer Science & Eng. University of Washington.

Recursive Random Fields Daniel Lowd University of Washington June 29th, 2006 (Joint work with Pedro Domingos)

CSE 574: Artificial Intelligence II Statistical Relational Learning Instructor: Pedro Domingos.

Unifying Logical and Statistical AI Pedro Domingos Dept. of Computer Science & Eng. University of Washington Joint work with Stanley Kok, Daniel Lowd,

Relational Models. CSE 515 in One Slide We will learn to: Put probability distributions on everything Learn them from data Do inference with them.

Markov Logic: A Simple and Powerful Unification Of Logic and Probability Pedro Domingos Dept. of Computer Science & Eng. University of Washington Joint.

Learning, Logic, and Probability: A Unified View Pedro Domingos Dept. Computer Science & Eng. University of Washington (Joint work with Stanley Kok, Matt.

1 Learning the Structure of Markov Logic Networks Stanley Kok & Pedro Domingos Dept. of Computer Science and Eng. University of Washington.

Statistical Relational Learning Pedro Domingos Dept. Computer Science & Eng. University of Washington.

CS Bayesian Learning1 Bayesian Learning. CS Bayesian Learning2 States, causes, hypotheses. Observations, effect, data. We need to reconcile.

Pedro Domingos Dept. of Computer Science & Eng.

Extracting Places and Activities from GPS Traces Using Hierarchical Conditional Random Fields Yong-Joong Kim Dept. of Computer Science Yonsei.

Markov Logic Parag Singla Dept. of Computer Science University of Texas, Austin.

Markov Logic: A Unifying Language for Information and Knowledge Management Pedro Domingos Dept. of Computer Science & Eng. University of Washington Joint.

Machine Learning For the Web: A Unified View Pedro Domingos Dept. of Computer Science & Eng. University of Washington Includes joint work with Stanley.

Markov Logic And other SRL Approaches

Transfer in Reinforcement Learning via Markov Logic Networks Lisa Torrey, Jude Shavlik, Sriraam Natarajan, Pavan Kuppili, Trevor Walker University of Wisconsin-Madison,

Markov Logic and Deep Networks Pedro Domingos Dept. of Computer Science & Eng. University of Washington.

Inference Complexity As Learning Bias Daniel Lowd Dept. of Computer and Information Science University of Oregon Joint work with Pedro Domingos.

Markov Logic Networks Pedro Domingos Dept. Computer Science & Eng. University of Washington (Joint work with Matt Richardson)

Modeling Speech Acts and Joint Intentions in Modal Markov Logic Henry Kautz University of Washington.

1 Markov Logic Stanley Kok Dept. of Computer Science & Eng. University of Washington Joint work with Pedro Domingos, Daniel Lowd, Hoifung Poon, Matt Richardson,

CPSC 322, Lecture 31Slide 1 Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 33 Nov, 25, 2015 Slide source: from Pedro Domingos UW & Markov.

Lecture 2: Statistical learning primer for biologists

DeepDive Model Dongfang Xu Ph.D student, School of Information, University of Arizona Dec 13, 2015.

John Lafferty Andrew McCallum Fernando Pereira

CPSC 322, Lecture 30Slide 1 Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 30 Nov, 23, 2015 Slide source: from Pedro Domingos UW.

Markov Logic Pedro Domingos Dept. of Computer Science & Eng. University of Washington.

Happy Mittal (Joint work with Prasoon Goyal, Parag Singla and Vibhav Gogate) IIT Delhi New Rules for Domain Independent Lifted.

Discriminative Training and Machine Learning Approaches Machine Learning Lab, Dept. of CSIE, NCKU Chih-Pin Liao.

Markov Logic: A Representation Language for Natural Language Semantics Pedro Domingos Dept. Computer Science & Eng. University of Washington (Based on.

1 Relational Factor Graphs Lin Liao Joint work with Dieter Fox.

SUPERVISED AND UNSUPERVISED LEARNING Presentation by Ege Saygıner CENG 784.

Scalable Statistical Relational Learning for NLP William Y. Wang William W. Cohen Machine Learning Dept and Language Technologies Inst. joint work with:

An Introduction to Markov Logic Networks in Knowledge Bases

Learning Deep Generative Models by Ruslan Salakhutdinov

Markov Logic Networks for NLP CSCI-GA.2591

Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 30

Markov Networks.

Markov Networks.

Presentation transcript:

Markov Logic Networks: A Unified Approach To Language Processing Pedro Domingos Dept. of Computer Science & Eng. University of Washington Joint work with Stanley Kok, Daniel Lowd, Hoifung Poon, Matt Richardson, Parag Singla, Marc Sumner, and Jue Wang

Overview Motivation Background Markov logic Inference Learning Applications Coreference resolution Discussion

Pipeline vs. Joint Architectures Most language processing systems have a pipeline architecture Simple, but errors accumulate We need joint inference across all stages Potentially much more accurate, but also much more complex

What We Need A common representation for all the stages A modeling language that enables this Efficient inference and learning algorithms Automatic compilation of model spec Makes language processing “plug and play”

Markov Logic Syntax: Weighted first-order formulas Semantics: Templates for Markov nets Inference: Lifted belief propagation Learning: Weights: Convex optimization Formulas: Inductive logic programming Applications: Coreference resolution, information extraction, semantic role labeling, ontology induction, etc.

Overview Motivation Background Markov logic Inference Learning Applications Coreference resolution Discussion

Markov Networks Undirected graphical models Cancer CoughAsthma Smoking Potential functions defined over cliques SmokingCancer Ф(S,C) False 4.5 FalseTrue 4.5 TrueFalse 2.7 True 4.5

Markov Networks Undirected graphical models Log-linear model: Weight of Feature iFeature i Cancer CoughAsthma Smoking

First-Order Logic Symbols: Constants, variables, functions, predicates E.g.: Anna, x, MotherOf(x), Friends(x, y) Logical connectives: Conjunction, disjunction, negation, implication, quantification, etc. Grounding: Replace all variables by constants E.g.: Friends (Anna, Bob) World: Assignment of truth values to all ground atoms

Example: Heads and Appositions

Overview Motivation Background Markov logic Inference Learning Applications Coreference resolution Discussion

Markov Logic A logical KB is a set of hard constraints on the set of possible worlds Let’s make them soft constraints: When a world violates a formula, It becomes less probable, not impossible Give each formula a weight (Higher weight  Stronger constraint)

Definition A Markov Logic Network (MLN) is a set of pairs (F, w) where F is a formula in first-order logic w is a real number Together with a set of constants, it defines a Markov network with One node for each grounding of each predicate in the MLN One feature for each grounding of each formula F in the MLN, with the corresponding weight w

Example: Heads and Appositions

Head(A,“Bush”) MentionOf(A,Bush) Head(A,“President”) MentionOf(B,Bush) Apposition(A,B) Head(B,“Bush”) Head(B,“President”) Two mention constants: A and B Apposition(B,A)

Markov Logic Networks MLN is template for ground Markov nets Probability of a world x : Typed variables and constants greatly reduce size of ground Markov net Functions, existential quantifiers, etc. Infinite and continuous domains Weight of formula iNo. of true groundings of formula i in x

Relation to Statistical Models Special cases: Markov networks Markov random fields Bayesian networks Log-linear models Exponential models Max. entropy models Gibbs distributions Boltzmann machines Logistic regression Hidden Markov models Conditional random fields Obtained by making all predicates zero-arity Markov logic allows objects to be interdependent (non-i.i.d.)

Relation to First-Order Logic Infinite weights  First-order logic Satisfiable KB, positive weights  Satisfying assignments = Modes of distribution Markov logic allows contradictions between formulas

Overview Motivation Background Markov logic Inference Learning Applications Coreference resolution Discussion

Belief Propagation Goal: Compute probabilities or MAP state Belief propagation: Subsumes Viterbi, etc. Bipartite network Variables = Ground atoms Features = Ground formulas Repeat until convergence: Nodes send messages to their features Features send messages to their variables Messages = Approximate marginals

Belief Propagation Atoms (x) Formulas (f) MentionOf(A,Bush) MentionOf(A,Bush)  Apposition(A, B)  MentionOf(B,Bush)

Belief Propagation Atoms (x) Formulas (f)

Belief Propagation Atoms (x) Formulas (f)

But This Is Too Slow One message for each atom/formula pair Can easily have billions of formulas Too many messages! Group atoms/formulas which pass same message (as in resolution) One message for each pair of clusters Greatly reduces the size of the network

Belief Propagation Atoms (x) Formulas (f)

Lifted Belief Propagation Atoms (x) Formulas (f)

Lifted Belief Propagation   Atoms (x) Formulas (f)

Lifted Belief Propagation Form lifted network Supernode: Set of ground atoms that all send and receive same messages throughout BP Superfeature: Set of ground clauses that all send and receive same messages throughout BP Run belief propagation on lifted network Same results as ground BP Time and memory savings can be huge

Forming the Lifted Network 1. Form initial supernodes One per predicate and truth value (true, false, unknown) 2. Form superfeatures by doing joins of their supernodes 3. Form supernodes by projecting superfeatures down to their predicates Supernode = Groundings of a predicate with same number of projections from each superfeature 4. Repeat until convergence

Overview Motivation Background Markov logic Inference Learning Applications Coreference resolution Discussion

Learning Data is a relational database Learning parameters (weights) Supervised Unsupervised Learning structure (formulas)

Supervised Learning Maximizes conditional log-likelihood: Y : Query variables X : Evidence variables x, y : Observed values in training data

Supervised Learning Gradient: Use inference to compute E[N i ] Preconditioned scaled conjugate gradient (PSCG) [Lowd & Domingos, 2007] Number of true groundings of F i in training data Expected number of true groundings of F i

Unsupervised Learning Maximizes marginal cond. log-likelihood: Y : Query variables X : Evidence variables x, y : Observed values in the training data Z : Hidden variables

Unsupervised Learning Gradient: Use inference to compute both E[N i ]s Also works for semi-supervised learning

Structure Learning Generalizes feature induction in Markov nets Any inductive logic programming approach can be used, but... Goal is to induce any clauses, not just Horn Evaluation function should be likelihood Requires learning weights for each candidate Turns out not to be bottleneck Bottleneck is counting clause groundings Solution: Subsampling

Overview Motivation Background Markov logic Inference Learning Applications Coreference resolution Discussion

Applications NLP Information extraction Coreference resolution Citation matching Semantic role labeling Ontology induction Etc. Others Social network analysis Robot mapping Computational biology Probabilistic Cyc CALO Etc.

Coreference Resolution Identifies noun phrases (mentions) that refer to the same entity Can be viewed as clustering the mentions (each entity is a cluster) Key component in NLP applications

State of the Art Supervised learning Classification (e.g., are two mentions coreferent?) Requires expensive labeling Unsupervised learning Still lags supervised approaches by a large margin E.g., Haghighi & Klein [2007] Most sophisticated to date Lags supervised methods by as much as 7% F1 Generative model  Nontrivial to extend with arbitrary dependencies

This Talk First unsupervised coreference resolution system that rivals supervised approaches

MLNs for Coreference Resolution Goal: Infer the truth values of MentionOf(m, e) for every mention m and entity e Base MLN Joint inference Appositions Predicate nominals Full MLN  Base  Joint Inference Rule-based model

Base MLN: Formulas Non-pronouns: Head mixture model E.g., mentions of first entity are often headed by “Bush” Pronouns: Preference in type, number, gender E.g., “it” often refers to an organization Entity properties E.g., the first entity may be a person Mentions for the same entity must agree in type, number, and gender 9 predicates 17 formulas No. weights  O(No. entities)

Base MLN: Exponential Priors Prior on total number of entities: weight   1 ( per entity) Prior on distance between each pronoun and its closest antecedent: weight   1 ( per pronominal mention)

Joint Inference Appositions E.g., “ Mr. Bush, the President of the U.S.A., … ” Predicate nominals E.g., “ Mr. Bush is the President of the U.S.A. ” Joint inference: Mentions that are appositions or predicate nominals usually refer to the same entity

Rule-Based Model Cluster non-pronouns with same heads Place each pronoun in the entity with The closest antecedent No known conflicts in type, number, gender Can be encoded in MLN with just four formulas No learning Suffices to outperform Haghighi & Klein [2007]

Unsupervised Learning Maximizes marginal cond. log-likelihood: Y : Query variables X : Evidence variables x, y : Observed values in the training data Z : Hidden variables

Unsupervised Learning for Coreference Resolution MLNs Y : Heads, known properties X : Pronoun, apposition, predicate nominal Z : Coreference assignment ( MentionOf ), unknown properties

Evaluation Datasets Metrics Systems Results Analysis

Datasets MUC-6 ACE-2004 training corpus ACE Phase II (ACE-2)

Metrics Precision, recall, F1 (MUC, B 3, Pairwise) Mean absolute error in number of entities

Systems: Recent Approaches Unsupervised: Haghighi & Klein [2007] Supervised: McCallum & Wellner [2005] Ng [2005] Denis & Baldridge [2007]

Systems: MLNs Rule-based model (RULE) Base MLN MLN-1: trained on each document itself MLN-30: trained on 30 test documents together Better head determination (-H) Joint inference with appositions (-A) Joint inference with predicate nominals (-N)

Results: MUC-6 HK-60 HK-381 F1

Results: MUC-6 HK-60 HK-381RULE F1

Results: MUC-6 HK-60 HK-381RULE MLN-1 F1

Results: MUC-6 HK-60 HK-381RULE MLN-1 MLN-30 F1

Results: MUC-6 HK-60 HK-381RULE MLN-1 MLN-30 MLN-H F1

Results: MUC-6 HK-60 HK-381RULE MLN-1 MLN-30 MLN-H MLN-HA F1

Results: MUC-6 HK-60 HK-381RULE MLN-1 MLN-30 MLN-H MLN-HA FULL F1

Results: MUC-6 HK-60 HK-381RULE MLN-1 MLN-30 MLN-H MLN-HA FULL RULE-HAN F1

Results: MUC-6 HK-60 HK-381RULE MLN-1 MLN-30 MLN-H MLN-HA FULL RULE-HAN M&W F1

Results: ACE-2004 F1

Results: ACE-2 F1

Comparison with Previous Approaches Cluster-based Simpler modeling for salience  Requires less training data Identify heads using head rules E.g., “the President of the USA” Leverage joint inference E.g., “Mr. Bush, the President …”

Error Analysis Features beyond the head E.g., the Finance Committee, the Defense Committee Speech pronouns, quotes, … E.g., I, we, you; “I am not Bush”, McCain said … Identify appositions and predicate nominals E.g., “Mike Sullivan, VOA News” Context and world knowledge E.g., “the White House”

Overview Motivation Background Markov logic Inference Learning Applications Coreference resolution Discussion

Conclusion Pipeline architectures accumulate errors Joint inference is complex for human and machine Markov logic provides language and algorithms Weighted first-order formulas → Markov network Inference: Lifted belief propagation Learning: Convex optimization and ILP Several successes to date First unsupervised coreference resolution system that rivals supervised ones Next steps: Combine more stages of the pipeline Open-source software: Alchemy alchemy.cs.washington.edu