Markov Logic: A Representation Language for Natural Language Semantics Pedro Domingos Dept. Computer Science & Eng. University of Washington (Based on joint work with Stanley Kok, Matt Richardson and Parag Singla)
Overview. Motivation, Background, Representation, Inference, Learning, Applications, Discussion.
Motivation. Natural language is characterized by complex relational structure and high uncertainty (ambiguity, imperfect knowledge). First-order logic handles relational structure; probability handles uncertainty. Let’s combine the two.
Markov Logic [Richardson & Domingos, 2006] Syntax: First-order logic + Weights Semantics: Templates for Markov nets Inference: Weighted satisfiability + MCMC Learning: Voted perceptron + ILP
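Markov Networks. Undirected graphical models (figure: an example graph over nodes A, B, C, D). Potential functions defined over cliques: $P(x) = \frac{1}{Z} \prod_c \Phi_c(x_c)$, with $Z = \sum_x \prod_c \Phi_c(x_c)$. Equivalent log-linear form: $P(x) = \frac{1}{Z} \exp\left(\sum_i w_i f_i(x)\right)$, where $w_i$ is the weight of feature $i$ and $f_i(x)$ is feature $i$.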
First-Order Logic. Constants, variables, functions, predicates. E.g.: Anna, X, mother_of(X), friends(X, Y). Grounding: replace all variables by constants. E.g.: friends(Anna, Bob). World (model, interpretation): assignment of truth values to all ground predicates.
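Grounding and worlds can be made concrete with a few lines of Python. This is only a sketch: the encoding of predicates as a name-to-arity dictionary and the helper name `ground_atoms` are assumptions made for illustration, and functions such as mother_of are left out.

```python
from itertools import product

def ground_atoms(predicates, constants):
    """Enumerate every grounding of each predicate over the constants.

    `predicates` maps a predicate name to its arity (illustrative encoding).
    """
    atoms = []
    for name, arity in predicates.items():
        for args in product(constants, repeat=arity):
            atoms.append(f"{name}({','.join(args)})")
    return atoms

constants = ["Anna", "Bob"]
predicates = {"smokes": 1, "friends": 2}  # hypothetical predicates for illustration

atoms = ground_atoms(predicates, constants)

# A world assigns a truth value to every ground atom,
# so n ground atoms give 2**n possible worlds.
worlds = [dict(zip(atoms, values))
          for values in product([False, True], repeat=len(atoms))]
```

With two constants, one unary and one binary predicate, this yields 2 + 4 ground atoms, which already shows how quickly the number of worlds grows.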
Markov Logic Networks. A logical KB is a set of hard constraints on the set of possible worlds. Let’s make them soft constraints: when a world violates a formula, it becomes less probable, not impossible. Give each formula a weight (higher weight => stronger constraint).
Definition. A Markov Logic Network (MLN) is a set of pairs (F, w), where F is a formula in first-order logic and w is a real number. Together with a set of constants, it defines a Markov network with one node for each grounding of each predicate in the MLN, and one feature for each grounding of each formula F in the MLN, with the corresponding weight w.
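For reference, the distribution induced by this definition can be written in the usual log-linear form. This is a sketch in notation introduced here, not taken from the text above: $n_i(x)$ denotes the number of true groundings of formula $F_i$ in world $x$, and $w_i$ is that formula's weight.

```latex
P(X = x) = \frac{1}{Z} \exp\Big( \sum_i w_i \, n_i(x) \Big),
\qquad
Z = \sum_{x'} \exp\Big( \sum_i w_i \, n_i(x') \Big)
```

Because all groundings of a formula share one weight, the sum over ground features collapses to one term per first-order formula.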
Example: Friends & Smokers. Suppose we have two constants: Anna (A) and Bob (B). (Figure: the ground Markov network, built up in steps.) Nodes: Smokes(A), Smokes(B), Cancer(A), Cancer(B), Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B).
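A brute-force sketch of this ground network follows. The two formulas and their weights (1.5 and 1.1) are assumptions chosen for illustration, since the text above lists only the nodes; all function names are hypothetical. Enumerating all worlds is only feasible for tiny domains such as this one.

```python
import math
from itertools import product

CONSTANTS = ["A", "B"]  # Anna and Bob

# One node per grounding of each predicate.
ATOMS = ([("Smokes", (c,)) for c in CONSTANTS]
         + [("Cancer", (c,)) for c in CONSTANTS]
         + [("Friends", (c, d)) for c in CONSTANTS for d in CONSTANTS])

def ground_features():
    """One weighted feature per grounding of each (assumed) formula:
         1.5  Smokes(x) => Cancer(x)
         1.1  Friends(x,y) => (Smokes(x) <=> Smokes(y))
    """
    feats = []
    for x in CONSTANTS:
        feats.append((1.5, lambda w, x=x:
                      (not w[("Smokes", (x,))]) or w[("Cancer", (x,))]))
    for x, y in product(CONSTANTS, repeat=2):
        feats.append((1.1, lambda w, x=x, y=y:
                      (not w[("Friends", (x, y))])
                      or (w[("Smokes", (x,))] == w[("Smokes", (y,))])))
    return feats

FEATURES = ground_features()

def score(world):
    """Unnormalized log-probability: total weight of satisfied ground formulas."""
    return sum(wt for wt, f in FEATURES if f(world))

def all_worlds():
    for values in product([False, True], repeat=len(ATOMS)):
        yield dict(zip(ATOMS, values))

def prob(query, evidence=lambda w: True):
    """P(query | evidence) by enumerating every world."""
    num = den = 0.0
    for w in all_worlds():
        if evidence(w):
            p = math.exp(score(w))
            den += p
            if query(w):
                num += p
    return num / den

p_cancer_given_smokes = prob(lambda w: w[("Cancer", ("A",))],
                             lambda w: w[("Smokes", ("A",))])
```

Under these assumed weights, conditioning on Smokes(A) raises the probability of Cancer(A) above its unconditional value but does not force it to 1, which is the soft-constraint behavior described earlier.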
More on MLNs. An MLN is a template for ground Markov nets. Typed variables and constants greatly reduce the size of the ground Markov net. Functions, existential quantifiers, etc. can also be handled. An MLN without variables is a Markov network (so MLNs subsume graphical models).
Relation to First-Order Logic. Infinite weights => first-order logic. Satisfiable KB, positive weights => satisfying assignments = modes of distribution. MLNs allow contradictions between formulas.
MPE/MAP Inference. Find the most likely truth values of non-evidence ground atoms given evidence. Apply a weighted satisfiability solver (maximizes the sum of weights of satisfied clauses). MaxWalkSat algorithm [Kautz et al., 1997]: start with a random truth assignment; with probability p, flip the atom that maximizes the weight sum, else flip a random atom in an unsatisfied clause; repeat n times; restart m times.
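The MaxWalkSat loop described above can be sketched as follows. This is a minimal illustration, not the reference implementation: the clause encoding (a weight plus a list of (atom index, required truth value) literals), the function names, and the choice to pick the greedy flip within a randomly chosen unsatisfied clause are all assumptions, and weights are assumed positive.

```python
import random

def satisfied(clause, assign):
    """A clause is a disjunction of (atom_index, required_value) literals."""
    return any(assign[a] == sign for a, sign in clause)

def total_weight(clauses, assign):
    return sum(w for w, c in clauses if satisfied(c, assign))

def max_walk_sat(clauses, n_atoms, p=0.5, n_flips=1000, n_restarts=10, seed=0):
    rng = random.Random(seed)
    best, best_w = None, float("-inf")
    for _ in range(n_restarts):
        # Start from a random truth assignment.
        assign = [rng.random() < 0.5 for _ in range(n_atoms)]
        for step in range(n_flips + 1):
            w_now = total_weight(clauses, assign)
            if w_now > best_w:
                best, best_w = assign[:], w_now
            unsat = [c for _, c in clauses if not satisfied(c, assign)]
            if not unsat or step == n_flips:
                break
            clause = rng.choice(unsat)
            if rng.random() < p:
                # Greedy move: flip the atom of this clause that yields
                # the largest total satisfied weight.
                def weight_after(a):
                    assign[a] = not assign[a]
                    w = total_weight(clauses, assign)
                    assign[a] = not assign[a]
                    return w
                atom = max((a for a, _ in clause), key=weight_after)
            else:
                # Random move: flip a random atom of the unsatisfied clause.
                atom = rng.choice(clause)[0]
            assign[atom] = not assign[atom]
    return best, best_w

# Tiny example: a0, a0 => a1, a1 => a2, and a weak preference for not a2.
example_clauses = [
    (2.0, [(0, True)]),
    (1.0, [(0, False), (1, True)]),
    (1.0, [(1, False), (2, True)]),
    (0.5, [(2, False)]),
]
best_assign, best_weight = max_walk_sat(example_clauses, n_atoms=3)
```

In the example no assignment satisfies every clause, so the search settles for violating the cheapest one, which is exactly what distinguishes weighted satisfiability from plain SAT.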
Conditional Inference. P(Formula | MLN, C) = ? MCMC: sample worlds and check whether the formula holds. P(Formula1 | Formula2, MLN, C) = ? If Formula2 is a conjunction of ground atoms: first construct the minimal subset of the network necessary to answer the query (a generalization of KBMC), then apply MCMC (or another inference method).
Ground Network Construction. Initialize the Markov net to contain all query predicates. For each node in the network: add the node’s Markov blanket to the network and remove any evidence nodes. Repeat until done.
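The construction above amounts to a graph search from the query atoms that stops at evidence. A minimal sketch, assuming the Markov blankets are already available as an adjacency dictionary (`neighbors` and `construct_network` are hypothetical names; a real system would derive blankets lazily from the ground formulas):

```python
from collections import deque

def construct_network(query_atoms, evidence_atoms, neighbors):
    """Return (unknown nodes needed for the query, evidence nodes on the boundary).

    `neighbors` maps each ground atom to the atoms it shares a ground
    formula with, i.e. its Markov blanket.
    """
    network = set()
    boundary = set()
    seen = set(query_atoms)
    frontier = deque(query_atoms)
    while frontier:
        atom = frontier.popleft()
        network.add(atom)
        for nb in neighbors.get(atom, ()):
            if nb in seen:
                continue
            seen.add(nb)
            if nb in evidence_atoms:
                # Evidence nodes are not expanded: their known values
                # cut off everything beyond them.
                boundary.add(nb)
            else:
                frontier.append(nb)
    return network, boundary

# Chain Q - X - E - Y with E observed: Y is irrelevant to the query.
example_neighbors = {"Q": {"X"}, "X": {"Q", "E"}, "E": {"X", "Y"}, "Y": {"E"}}
net, ev = construct_network(["Q"], {"E"}, example_neighbors)
```

The evidence values still enter inference as fixed inputs to the features that touch the boundary; only the nodes behind them are dropped.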
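Probabilistic Inference. Recall $P(x) = \frac{1}{Z} \exp\left(\sum_i w_i f_i(x)\right)$. Exact inference is #P-complete. Conditioning on the Markov blanket is easy: for a node $x_j$, $P(x_j \mid MB(x_j)) = \frac{\exp\left(\sum_i w_i f_i(x)\right)}{\exp\left(\sum_i w_i f_i(x_{[x_j=0]})\right) + \exp\left(\sum_i w_i f_i(x_{[x_j=1]})\right)}$, where $x_{[x_j=v]}$ is $x$ with node $j$ set to $v$. Gibbs sampling exploits this.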
Markov Chain Monte Carlo. Gibbs sampler: 1. Start with an initial assignment to nodes. 2. One node at a time, sample the node given the others. 3. Repeat. 4. Use the samples to compute P(X). Apply to the ground network. Initialization: MaxWalkSat. Can use multiple chains.
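The four steps of the Gibbs sampler can be sketched over weighted ground clauses. This is an illustrative single-chain version with several assumptions: the clause encoding and function names are hypothetical, initialization is random rather than by MaxWalkSat, and burn-in and sweep counts are arbitrary.

```python
import math
import random

def satisfied(clause, assign):
    """A clause is a disjunction of (atom_index, required_value) literals."""
    return any(assign[a] == sign for a, sign in clause)

def gibbs_marginals(clauses, n_atoms, evidence=None,
                    n_sweeps=5000, burn_in=500, seed=0):
    """Estimate P(atom = True | evidence) for every non-evidence atom."""
    rng = random.Random(seed)
    evidence = evidence or {}
    # Step 1: initial assignment (evidence atoms stay fixed throughout).
    assign = [evidence[a] if a in evidence else rng.random() < 0.5
              for a in range(n_atoms)]
    free = [a for a in range(n_atoms) if a not in evidence]
    # Only clauses mentioning an atom influence its conditional distribution.
    touching = {a: [(w, c) for w, c in clauses if any(b == a for b, _ in c)]
                for a in free}
    counts = {a: 0 for a in free}
    for sweep in range(burn_in + n_sweeps):
        # Steps 2-3: resample each node given the current values of the others.
        for a in free:
            assign[a] = True
            s_true = sum(w for w, c in touching[a] if satisfied(c, assign))
            assign[a] = False
            s_false = sum(w for w, c in touching[a] if satisfied(c, assign))
            p_true = 1.0 / (1.0 + math.exp(s_false - s_true))
            assign[a] = rng.random() < p_true
        # Step 4: accumulate samples after burn-in.
        if sweep >= burn_in:
            for a in free:
                counts[a] += assign[a]
    return {a: counts[a] / n_sweeps for a in free}

# Atom 0 = Smokes(A) (observed true), atom 1 = Cancer(A);
# one clause with an assumed weight of 1.5: !Smokes(A) v Cancer(A).
example = [(1.5, [(0, False), (1, True)])]
marginals = gibbs_marginals(example, n_atoms=2, evidence={0: True})
```

For this one-clause example the exact answer is the logistic function of the weight, so the sampled estimate can be checked against it.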
Learning. Data is a relational database. Closed world assumption (if not: EM). Learning parameters (weights): generatively, by pseudo-likelihood; discriminatively, by voted perceptron + MaxWalkSat. Learning structure: a generalization of feature induction in Markov nets; learn and/or modify clauses; inductive logic programming with pseudo-likelihood as the objective function.
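Generative Weight Learning. Maximize likelihood (or posterior). Use gradient ascent: $\frac{\partial}{\partial w_i} \log P_w(x) = n_i(x) - E_w[n_i(x)]$, where $n_i(x)$ is the feature count according to the data and $E_w[n_i(x)]$ is the feature count according to the model. Requires inference at each step (slow!).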
Pseudo-Likelihood [Besag, 1975]. $PL(x) = \prod_j P(x_j \mid MB(x_j))$: the likelihood of each variable given its Markov blanket in the data. Does not require inference at each step. Widely used.
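Optimization. Gradient of the log pseudo-likelihood: $\frac{\partial}{\partial w_i} \log PL_w = \sum_{x} \left[ \mathrm{nsat}_i(x) - p(x{=}0)\,\mathrm{nsat}_i(x{=}0) - p(x{=}1)\,\mathrm{nsat}_i(x{=}1) \right]$, where the sum ranges over ground predicates x, p(x=v) is the probability of x=v given its Markov blanket, and nsat_i(x=v) is the number of satisfied groundings of clause i in the training data when x takes value v. Most terms are not affected by changes in weights. After initial setup, each iteration takes O(# ground predicates × # first-order clauses). Parameter tying over groundings of the same clause. Maximize using L-BFGS [Liu & Nocedal, 1989].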
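The pseudo-log-likelihood and its gradient can be computed directly from satisfied-grounding counts. The sketch below is deliberately naive and makes several assumptions: ground clauses are tagged with the index of their first-order clause (to express parameter tying), all names are hypothetical, and every count is recomputed from scratch rather than cached as the complexity bound above would require.

```python
import math

def satisfied(clause, assign):
    """A clause is a disjunction of (atom_index, required_value) literals."""
    return any(assign[a] == sign for a, sign in clause)

def nsat(ground_clauses, n_formulas, assign):
    """Satisfied groundings per first-order clause i."""
    counts = [0] * n_formulas
    for i, clause in ground_clauses:
        counts[i] += satisfied(clause, assign)
    return counts

def pseudo_log_likelihood(weights, ground_clauses, data):
    """Return (log PL, gradient w.r.t. the tied clause weights)."""
    k = len(weights)
    logpl = 0.0
    grad = [0.0] * k
    for x in range(len(data)):
        world = list(data)
        counts = {}
        for v in (False, True):
            world[x] = v  # x takes value v, everything else as in the data
            counts[v] = nsat(ground_clauses, k, world)
        score = {v: sum(w * n for w, n in zip(weights, counts[v]))
                 for v in (False, True)}
        log_norm = math.log(math.exp(score[False]) + math.exp(score[True]))
        p = {v: math.exp(score[v] - log_norm) for v in (False, True)}
        logpl += score[data[x]] - log_norm
        for i in range(k):
            grad[i] += (counts[data[x]][i]
                        - p[False] * counts[False][i]
                        - p[True] * counts[True][i])
    return logpl, grad

# Two first-order clauses (indices 0 and 1) with three groundings in total.
example_clauses = [
    (0, [(0, False), (1, True)]),
    (0, [(2, False), (3, True)]),
    (1, [(0, False), (2, True)]),
]
example_data = [True, False, True, True]
example_weights = [0.7, -0.3]
ll, gradient = pseudo_log_likelihood(example_weights, example_clauses, example_data)
```

The returned pair is what a quasi-Newton optimizer such as L-BFGS needs at each step; plugging it into one is left out here to keep the sketch dependency-free.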
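Discriminative Weight Learning. Gradient of the conditional log-likelihood, with evidence x and query y: $\frac{\partial}{\partial w_i} \log P_w(y \mid x) = n_i(x, y) - E_w[n_i(x, y)]$. The first term is the # of true groundings of formula i in the DB; the second is the expected # of true groundings, which is slow to compute. Approximate the expected count by the MAP count.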
Voted Perceptron [Collins, 2002]. Used for discriminative training of HMMs. Expected count in gradient approximated by count in MAP state. MAP state found using the Viterbi algorithm. Weights averaged over all iterations.
initialize w_i = 0
for t = 1 to T do
    find the MAP configuration using Viterbi
    w_i ← w_i + η * (training count - MAP count)
end for
Voted Perceptron for MLNs [Singla & Domingos, 2004]. An HMM is a special case of an MLN. Expected count in gradient approximated by count in MAP state. MAP state found using MaxWalkSat. Weights averaged over all iterations.
initialize w_i = 0
for t = 1 to T do
    find the MAP configuration using MaxWalkSat
    w_i ← w_i + η * (training count - MAP count)
end for
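A runnable sketch of the loop above follows. To stay short it finds the MAP state by exhaustive enumeration over the query atoms, which is a stand-in for MaxWalkSat and only works for tiny examples; the clause encoding, function names, learning rate, and iteration count are all assumptions.

```python
from itertools import product

def satisfied(clause, assign):
    """A clause is a disjunction of (atom_index, required_value) literals."""
    return any(assign[a] == sign for a, sign in clause)

def counts(ground_clauses, n_formulas, assign):
    """True groundings per first-order formula i."""
    n = [0] * n_formulas
    for i, clause in ground_clauses:
        n[i] += satisfied(clause, assign)
    return n

def map_state(weights, ground_clauses, data, query_atoms):
    """Brute-force MAP over the query atoms; evidence keeps its data values."""
    best, best_score = None, float("-inf")
    for values in product([False, True], repeat=len(query_atoms)):
        world = list(data)
        for a, v in zip(query_atoms, values):
            world[a] = v
        n = counts(ground_clauses, len(weights), world)
        s = sum(w * c for w, c in zip(weights, n))
        if s > best_score:
            best, best_score = world, s
    return best

def voted_perceptron(ground_clauses, n_formulas, data, query_atoms, T=20, eta=0.1):
    w = [0.0] * n_formulas
    w_sum = [0.0] * n_formulas
    train = counts(ground_clauses, n_formulas, data)
    for _ in range(T):
        mp = counts(ground_clauses, n_formulas,
                    map_state(w, ground_clauses, data, query_atoms))
        # w_i <- w_i + eta * (training count - MAP count)
        w = [wi + eta * (t - m) for wi, t, m in zip(w, train, mp)]
        w_sum = [s + wi for s, wi in zip(w_sum, w)]
    # Return the weights averaged over all iterations.
    return [s / T for s in w_sum]

# Atoms: 0 = Smokes(A), 1 = Cancer(A), 2 = Smokes(B), 3 = Cancer(B).
# Formula 0 (assumed for illustration): Smokes(x) => Cancer(x).
example_clauses = [(0, [(0, False), (1, True)]), (0, [(2, False), (3, True)])]
example_data = [True, True, True, True]
example_query = [1, 3]
learned = voted_perceptron(example_clauses, 1, example_data, example_query)
```

Once the MAP state under the current weights reproduces the training data, the counts agree and the update becomes zero, so the weights stop changing.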
Applications to Date. Entity resolution (Cora, BibServ). Information extraction for biology (won LLL-2005 competition). Probabilistic Cyc. Link prediction. Topic propagation in scientific communities. Etc.
Entity Resolution. Most logical systems make the unique names assumption. What if we don’t? Equality predicate: Same(A,B), or A = B. Equality axioms: reflexivity, symmetry, transitivity. For every unary predicate P: x1 = x2 => (P(x1) <=> P(x2)). For every binary predicate R: x1 = x2 ^ y1 = y2 => (R(x1,y1) <=> R(x2,y2)). Etc. But in Markov logic these are soft and learnable. Can also introduce the reverse direction: R(x1,y1) ^ R(x2,y2) ^ x1 = x2 => y1 = y2. Surprisingly, this is all that’s needed.
Example: Citation Matching
Markov Logic Formulation: Predicates. Are two bibliography records the same? SameBib(b1,b2). Are two field values the same? SameAuthor(a1,a2), SameTitle(t1,t2), SameVenue(v1,v2). How similar are two field strings? Predicates for ranges of cosine TF-IDF score: TitleTFIDF.0(t1,t2) is true iff TF-IDF(t1,t2) = 0; TitleTFIDF.2(t1,t2) is true iff 0 < TF-IDF(t1,t2) < 0.2; etc.
Markov Logic Formulation: Formulas. Unit clauses (defaults): !SameBib(b1,b2). Two fields are the same => corresponding bib. records are the same: Author(b1,a1) ^ Author(b2,a2) ^ SameAuthor(a1,a2) => SameBib(b1,b2). Two bib. records are the same => corresponding fields are the same: Author(b1,a1) ^ Author(b2,a2) ^ SameBib(b1,b2) => SameAuthor(a1,a2). High similarity score => two fields are the same: TitleTFIDF.8(t1,t2) => SameTitle(t1,t2). Transitive closure (not incorporated in experiments): SameBib(b1,b2) ^ SameBib(b2,b3) => SameBib(b1,b3). In total: 25 predicates, 46 first-order clauses.
What Does This Buy You? Objects are matched collectively. Multiple types are matched simultaneously. Constraints are soft, and their strengths can be learned from data. Easy to add further knowledge. Constraints can be refined from data. The standard approach is still embedded.
Example: Subset of a Bibliography Database
Record  Title                              Author           Venue
B1      Object Identification using CRFs   Linda Stewart    PKDD 04
B2      Object Identification using CRFs   Linda Stewart    8th PKDD
B3      Learning Boolean Formulas          Bill Johnson     PKDD 04
B4      Learning of Boolean Formulas       William Johnson  8th PKDD
Standard Approach [Fellegi & Sunter, 1969]. (Figure: each record-match node is linked to its own field-similarity nodes, which are evidence nodes.) Record-match node b1=b2? has Title: Sim(Object Identification using CRFs, Object Identification using CRFs); Author: Sim(Linda Stewart, Linda Stewart); Venue: Sim(PKDD 04, 8th PKDD). Record-match node b3=b4? has Title: Sim(Learning Boolean Formulas, Learning of Boolean Expressions); Author: Sim(Bill Johnson, William Johnson); Venue: Sim(PKDD 04, 8th PKDD).
What’s Missing? (Figure: the standard-approach network, with a separate Sim(PKDD 04, 8th PKDD) venue node for each of b1=b2? and b3=b4?.) If from b1=b2 you infer that “PKDD 04” is the same as “8th PKDD”, how can you use that to help figure out whether b3=b4?
Merging the Evidence Nodes. (Figure: the same network, but b1=b2? and b3=b4? now share a single venue evidence node Sim(PKDD 04, 8th PKDD); the title and author evidence nodes are unchanged.) Still does not solve the problem. Why?
Introducing Field-Match Nodes. (Figure: full representation in the collective model. Field-match nodes b1.T=b2.T?, b1.A=b2.A?, b1.V=b2.V?, b3.T=b4.T?, b3.A=b4.A?, b3.V=b4.V? are added alongside the record-match nodes b1=b2? and b3=b4? and the field-similarity evidence nodes for title, author, and venue.)
Flow of Information. (Figure sequence: the collective-model network with record-match nodes b1=b2? and b3=b4?, field-match nodes for title, author, and venue, and field-similarity evidence nodes, shown over several steps.)
Experiments. Databases: Cora [McCallum et al., IRJ, 2000]: 1295 records, 132 papers; BibServ.org [Richardson & Domingos, ISWC-03]: 21,805 records, unknown number of papers. Goal: de-duplicate bib. records, authors, and venues. Pre-processing: form canopies [McCallum et al., KDD-00]. Compared with naïve Bayes (standard method), etc. Measured area under the precision-recall curve (AUC). Our approach wins across the board.
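For readers unfamiliar with the metric, the area under a precision-recall curve can be computed from ranked match scores roughly as follows. This is a generic sketch, not the evaluation code used in the experiments: the trapezoidal interpolation, the starting point (recall 0, precision 1), and the function name are assumptions, and tied scores are not treated specially.

```python
def pr_auc(scores, labels):
    """Area under the precision-recall curve via trapezoids.

    `scores` are predicted match probabilities; `labels` are 1 for a true
    match and 0 otherwise.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    total_pos = sum(labels)
    tp = 0
    points = [(0.0, 1.0)]  # (recall, precision), conventional starting point
    for k, (_, y) in enumerate(ranked, start=1):
        tp += y
        points.append((tp / total_pos, tp / k))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2
    return area

perfect = pr_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
inverted = pr_auc([0.9, 0.8, 0.2, 0.1], [0, 0, 1, 1])
```

A ranking that places every true match first reaches the maximum area, while an inverted ranking scores far lower, which is why the metric suits de-duplication tasks where true matches are rare.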
Results: Matching Venues on Cora
Relation to Other Approaches
Representation  Logical language     Probabilistic language
Markov logic    First-order logic    Markov nets
RMNs            Conjunctive queries  Markov nets
PRMs            Frame systems        Bayes nets
KBMC            Horn clauses         Bayes nets
SLPs            Horn clauses         Bayes nets
Going Further. First-order logic is not enough. We can “Markovize” other representations in the same way. Lots to do.
Summary. NLP involves relational structure and uncertainty. Markov logic combines first-order logic and probabilistic graphical models. Syntax: first-order logic + weights. Semantics: templates for Markov networks. Inference: MaxWalkSat + KBMC + MCMC. Learning: voted perceptron + PL + ILP. Applications to date: entity resolution, IE, etc. Software: Alchemy.