Markov Logic Networks
Hao Wu, Mariyam Khalid
Motivation
How would we model this scenario?
– Logical Approach
– Statistical Approach
First Order Logic
Four types of symbols:
– Constants: concrete objects in the domain (e.g., people: Anna, Bob)
– Variables: range over the objects in the domain
– Functions: mappings from tuples of objects to objects (e.g., GrandpaOf)
– Predicates: relations among objects in the domain (e.g., Friends) or attributes of objects (e.g., Fired)
Logical connectives and quantifiers: ¬, ∧, ∨, ⇒, ⇔, ∀, ∃
First Order Logic
Advantages:
– Compact representation of a wide variety of knowledge
– A wide range of domain knowledge can be incorporated flexibly and modularly
Disadvantages:
– No way to handle uncertainty
– No handling of imperfect or contradictory knowledge
Markov Networks
Set of variables: X = (X_1, …, X_n)
The distribution is given by
    P(X = x) = (1/Z) ∏_k φ_k(x_{k})
with Z as the normalization factor (partition function) and φ_k as the potential function of clique k.
Markov Networks
Representation as a log-linear model:
    P(X = x) = (1/Z) exp(∑_i w_i f_i(x))
In our case there are only binary features:
– One feature per possible state of each clique
The weight is equal to the log of the potential: w_i = log φ_i
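To make the log-linear form concrete, here is a minimal Python sketch that enumerates all worlds of a tiny network and normalizes exp(∑_i w_i f_i(x)). The two features and their weights are illustrative assumptions, not part of the scenario above.

import itertools
import math

# Toy log-linear Markov network over three binary variables (illustrative only).
variables = ["Friends_AB", "Plays_A", "Plays_B"]

def f1(x):
    # Binary feature: Friends(A,B) => (Plays(A) <=> Plays(B))
    return 1.0 if (not x["Friends_AB"]) or (x["Plays_A"] == x["Plays_B"]) else 0.0

def f2(x):
    # Binary feature: Plays(A)
    return 1.0 if x["Plays_A"] else 0.0

features = [(f1, 3.0), (f2, 0.5)]   # (feature, weight) pairs

def unnormalized(x):
    return math.exp(sum(w * f(x) for f, w in features))

worlds = [dict(zip(variables, vals))
          for vals in itertools.product([False, True], repeat=len(variables))]
Z = sum(unnormalized(x) for x in worlds)   # normalization factor (partition function)

for x in worlds:
    print(x, round(unnormalized(x) / Z, 4))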
Markov Networks
[Figure: example network with nodes "Has playing friend", "Plays", "Fired"]
Whether an employee A can convince another employee B to play depends on the liability of B: for a high liability of B there is a higher probability (ω = 4) than for a low liability (ω = 2).
Markov Networks
Advantages:
– Efficient handling of uncertainty
– Tolerant of imperfect and contradictory knowledge
Disadvantages:
– Very complex networks are needed to represent a wide variety of knowledge
– Difficult to incorporate a wide range of domain knowledge
Motivation
Ideally we want a framework that combines the advantages of both.
Markov Logic Networks
1. Description of the problem
2. Translation into first-order logic
3. Construction of an MLN "template"
4. Derivation of a concrete MLN for a given set of constants
5. Compute whatever you want
Markov Logic Networks
– Each formula corresponds to one clique
– Each formula has a weight that reflects its importance
– A world that violates a formula is less probable, but not impossible
Markov Logic Networks
Constants: Alice (A) and Bob (B)
Grounding yields the nodes Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B), Plays(A), Plays(B).
Markov Logic Network
Adding Fired(A) and Fired(B) gives the full ground network. The clique potentials correspond to the weighted formulas:

Friends(x,y) ⇒ (Plays(x) ⇔ Plays(y)), ω = 3:
Friends(x,y)  Plays(x)  Plays(y)  ω
True          True      True      3
True          True      False     0
True          False     True      0
True          False     False     3
False         True      True      3
False         True      False     3
False         False     True      3
False         False     False     3

Plays(x) ⇔ Fired(x), ω = 2:
Plays(x)  Fired(x)  ω
True      True      2
True      False     0
False     True      0
False     False     2
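The sketch below, an illustration assuming the Friends/Plays/Fired reading of the tables above, computes P(world) ∝ exp(∑_i w_i n_i(world)) by brute-force enumeration over the ground atoms for constants A and B.

import itertools
import math

constants = ["A", "B"]

atoms = ([("Friends", x, y) for x in constants for y in constants] +
         [("Plays", x) for x in constants] +
         [("Fired", x) for x in constants])

def weighted_counts(world):
    # w=3: Friends(x,y) => (Plays(x) <=> Plays(y))
    n1 = sum(1 for x in constants for y in constants
             if (not world[("Friends", x, y)]) or
                (world[("Plays", x)] == world[("Plays", y)]))
    # w=2: Plays(x) <=> Fired(x)
    n2 = sum(1 for x in constants
             if world[("Plays", x)] == world[("Fired", x)])
    return 3.0 * n1 + 2.0 * n2

worlds = [dict(zip(atoms, vals))
          for vals in itertools.product([False, True], repeat=len(atoms))]
Z = sum(math.exp(weighted_counts(w)) for w in worlds)

def prob(world):
    # A world violating some ground formulas is less probable, but never impossible.
    return math.exp(weighted_counts(world)) / Z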
MAP/MPE Inference
Given evidence, find the most probable state.
Let x be the evidence and y the remaining (query) variables:
    argmax_y P(y | x) = argmax_y (1/Z_x) exp(∑_i w_i n_i(x, y)) = argmax_y ∑_i w_i n_i(x, y)
This is a weighted MaxSAT problem.
WalkSAT
for i ← 1 to max-tries do
    solution ← random truth assignment
    for j ← 1 to max-flips do
        if all clauses satisfied then
            return solution
        c ← random unsatisfied clause
        with probability p
            flip a random variable in c
        else
            flip the variable in c that maximizes the number of satisfied clauses
return failure
MaxWalkSAT
for i ← 1 to max-tries do
    solution ← random truth assignment
    for j ← 1 to max-flips do
        if ∑ weights(sat. clauses) > threshold then
            return solution
        c ← random unsatisfied clause
        with probability p
            flip a random variable in c
        else
            flip the variable in c that maximizes ∑ weights(sat. clauses)
return failure, best solution found
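For illustration, a compact Python version of the MaxWalkSAT loop over weighted clauses; the clause encoding (signed integer literals) and the parameter defaults are assumptions made for this sketch.

import random

def max_walksat(clauses, n_vars, max_tries=10, max_flips=1000, p=0.5, threshold=None):
    # clauses: list of (weight, literals); literal +i means variable i true, -i means false.
    def sat_weight(a):
        return sum(w for w, lits in clauses
                   if any((lit > 0) == a[abs(lit)] for lit in lits))

    if threshold is None:
        threshold = sum(w for w, _ in clauses)   # demand that all clauses be satisfied
    best, best_score = None, float("-inf")
    for _ in range(max_tries):
        a = {v: random.random() < 0.5 for v in range(1, n_vars + 1)}
        for _ in range(max_flips):
            score = sat_weight(a)
            if score > best_score:
                best, best_score = dict(a), score
            if score >= threshold:
                return a
            unsat = [lits for w, lits in clauses
                     if not any((lit > 0) == a[abs(lit)] for lit in lits)]
            if not unsat:
                break
            lits = random.choice(unsat)           # random unsatisfied clause
            if random.random() < p:
                v = abs(random.choice(lits))      # flip a random variable in it
            else:                                 # flip the variable with the best gain
                def gain(var):
                    a[var] = not a[var]
                    g = sat_weight(a)
                    a[var] = not a[var]
                    return g
                v = max((abs(lit) for lit in lits), key=gain)
            a[v] = not a[v]
    return best   # failure: return the best assignment found

For example, max_walksat([(3, [1, -2]), (2, [2])], n_vars=2) returns an assignment with both variables true, which satisfies the full weight of 5.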
LazySAT
MaxWalkSAT may need a lot of memory.
Most networks are sparse: exploit sparseness by grounding clauses lazily.
for i ← 1 to max-tries do
    active_atoms ← atoms in clauses unsatisfied by DB
    active_clauses ← clauses activated by active_atoms
    soln ← random truth assignment to active_atoms
    for j ← 1 to max-flips do
        if ∑ weights(sat. clauses) ≥ threshold then
            return soln
        c ← random unsatisfied clause
        with probability p
            v_f ← a randomly chosen variable from c
        else
            for each variable v in c do
                compute DeltaGain(v), using weighted_KB if v ∉ active_atoms
            v_f ← v with highest DeltaGain(v)
        if v_f ∉ active_atoms then
            activate v_f and add clauses activated by v_f
        soln ← soln with v_f flipped
return failure, best soln found
Inference
What is P(F1 | F2, M_{L,C})?
    P(F1 | F2, M_{L,C}) = P(F1 ∧ F2 | M_{L,C}) / P(F2 | M_{L,C})
                        = ∑_{x ∈ X_{F1} ∩ X_{F2}} P(X = x | M_{L,C}) / ∑_{x ∈ X_{F2}} P(X = x | M_{L,C})
However, directly computing this quantity is intractable in most cases.
Inference
First we need to construct the minimal network required to answer the query, given the evidence:
network ← Ø
queue ← query nodes
repeat
    node ← dequeue(queue)
    add node to network
    if node not in evidence then
        add neighbors(node) to queue
until queue = Ø
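The same construction in Python, as an illustrative sketch; `neighbors` is an assumed mapping from each ground atom to the atoms it shares a ground formula with.

from collections import deque

def minimal_network(query_nodes, evidence, neighbors):
    # Grow the network outwards from the query nodes; evidence nodes are added
    # but not expanded further (their neighbors are not needed for the query).
    network = set()
    queue = deque(query_nodes)
    while queue:
        node = queue.popleft()
        if node in network:
            continue          # skip already-processed nodes (keeps the loop finite)
        network.add(node)
        if node not in evidence:
            queue.extend(n for n in neighbors.get(node, ()) if n not in network)
    return network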
Example
Query: Fired(A). Evidence: Friends(A,B), Friends(B,A), Plays(B).
Ground atoms: Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B), Plays(A), Plays(B), Fired(A), Fired(B).
[Figure sequence: the minimal network is grown step by step from the query node Fired(A), expanding through the neighbors of non-evidence nodes.]
Inference
In principle, P(F1 | F2, M_{L,C}) can be approximated using MCMC (Markov chain Monte Carlo).
Gibbs sampling:
state ← random truth assignment
for i ← 1 to num-samples do
    for each variable x
        sample x according to P(x | neighbors(x))
        state ← state with new value of x
P(F) ← fraction of states in which F is true
But deterministic dependencies can break MCMC.
MC-SAT:
– Combines MCMC and WalkSAT
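A sketch of the Gibbs sampler above in Python; `cond_prob(x, state)` stands for P(x = True | its Markov blanket) and is assumed to be supplied by the grounded network.

import random

def gibbs_estimate(formula_holds, atoms, cond_prob, num_samples=5000, burn_in=500):
    # Estimate P(F) as the fraction of sampled states in which formula F holds.
    state = {a: random.random() < 0.5 for a in atoms}   # random truth assignment
    hits = kept = 0
    for i in range(num_samples):
        for x in atoms:
            state[x] = random.random() < cond_prob(x, state)
        if i >= burn_in:
            kept += 1
            hits += 1 if formula_holds(state) else 0
    return hits / kept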
Learning
– Learning weights
– Learning structure (formulas)
Learning Weights
Assumption: closed-world assumption (anything not observed is false)
– Otherwise, use EM
Learning Weights: Generative
Maximizing the likelihood directly requires inference, so use the pseudo-likelihood instead:
    P*(X = x) = ∏_l P(X_l = x_l | MB_x(X_l))
where MB_x(X_l) is the state of the Markov blanket of X_l in the data.
It is efficient, but handles long-range dependencies poorly.
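A sketch of evaluating the log pseudo-likelihood of one database in Python; `weighted_count(world)`, returning ∑_i w_i n_i(world), is an assumed helper, e.g. the brute-force one from the earlier example.

import math

def log_pseudo_likelihood(world, atoms, weighted_count):
    # log PL(x) = sum over ground atoms l of log P(X_l = x_l | its Markov blanket).
    lpl = 0.0
    for a in atoms:
        flipped = dict(world)
        flipped[a] = not world[a]
        s_keep, s_flip = weighted_count(world), weighted_count(flipped)
        # P(X_a = x_a | MB) = exp(s_keep) / (exp(s_keep) + exp(s_flip));
        # formulas not mentioning X_a cancel, so using the full counts is equivalent.
        m = max(s_keep, s_flip)
        lpl += s_keep - (m + math.log(math.exp(s_keep - m) + math.exp(s_flip - m)))
    return lpl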
Voted Perceptron (discriminative weight learning)
Original voted perceptron (for sequence models):
w_i ← 0
for t ← 1 to T do
    y_MAP ← Viterbi(x)
    w_i ← w_i + η [count_i(y_Data) − count_i(y_MAP)]
return ∑_t w_i,t / T

For MLNs, replace Viterbi with MaxWalkSAT:
w_i ← 0
for t ← 1 to T do
    y_MAP ← MaxWalkSAT(x)
    w_i ← w_i + η [count_i(y_Data) − count_i(y_MAP)]
return ∑_t w_i,t / T
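A generic voted-perceptron sketch in Python; `map_infer` (e.g. a MaxWalkSAT call) and `feature_counts` are assumed callbacks, and the training interface is illustrative rather than any particular implementation.

def voted_perceptron(examples, n_weights, map_infer, feature_counts, T=100, eta=0.1):
    # examples: list of (x, y_data); map_infer(x, w) -> MAP assignment under weights w;
    # feature_counts(x, y) -> list of n_i(x, y), one count per formula/feature.
    w = [0.0] * n_weights
    w_sum = [0.0] * n_weights
    for _ in range(T):
        for x, y_data in examples:
            y_map = map_infer(x, w)
            c_data, c_map = feature_counts(x, y_data), feature_counts(x, y_map)
            for i in range(n_weights):
                w[i] += eta * (c_data[i] - c_map[i])
        for i in range(n_weights):
            w_sum[i] += w[i]
    return [s / T for s in w_sum]   # averaged ("voted") weights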
Learning Structure
Structure (the formulas) can also be learned:
– Start with a hand-coded KB
– Add/remove literals, flip signs
– Evaluate candidates using pseudo-likelihood + a structure prior
– Search
Alchemy
Open-source software package developed at the University of Washington
Alchemy: Example
Entity resolution (citations):
Token(+t,i,c) => InField(i,+f,c)
InField(i,+f,c) <=> InField(i+1,+f,c)
f != f' => (!InField(i,+f,c) v !InField(i,+f',c))
Token(+t,i,c) ^ InField(i,+f,c) ^ Token(+t,i',c') ^ InField(i',+f,c') => SameField(+f,c,c')
SameField(+f,c,c') <=> SameCit(c,c')
SameField(f,c,c') ^ SameField(f,c',c") => SameField(f,c,c")
SameCit(c,c') ^ SameCit(c',c") => SameCit(c,c")
Applications
– Information extraction
– Entity resolution
– Web mining
– Natural language processing
– Social network analysis
– And more
Application: Jointly Disambiguating and Clustering Concepts and Entities
– Disambiguation
– Clustering
– Jointly disambiguating and clustering
Application: Jointly Disambiguating and Clustering Concepts and Entities
Features
– Local features:
    Prior probability (p3, f7)
    Relatedness (p4, f8, f11)
    Local context similarity (p5, f9)
    String edit distance (p6, f10)
– Global features:
    Shared lemma (p7, f12)
    Head match (p8, f6)
    Acronyms (p8, f6)
    Cross-document n-gram feature (p9, f13)
Application: Jointly Disambiguating and Clustering Concepts and Entities
System
Application: Jointly Disambiguating and Clustering Concepts and Entities
Evaluation
Application: Semantic Role Labeling (SRL)
Riedel & Meza-Ruiz (2008), semantic F-score 74.59%
Three stages:
– 1. Predicate identification
– 2. Argument identification
– 3. Argument classification
Application: SRL
They used 5 hidden predicates:
– Predicate identification:
    isPredicate(p)  [p is a position]
    Sense(p,e)  [e is a sense]
– Argument identification:
    isArgument(a)  [a is a word]
    hasRole(p,a)
– Argument classification:
    Role(p,a,r)  [r is a role]
Application: SRL
Local formulae relate observed predicates, such as the word and lemma of a token and its neighbours (e.g., Lemma(a,l)), to the hidden predicates (e.g., hasRole(p,a)).
Application: SRL
Global formulae act as structural constraints:
– Ensure consistency between all stages
– Some are also used as soft constraints
Examples:
hasRole(p,a) => isArg(a)
Role(p,a,r) => hasRole(p,a)
Role(p,a,r1) ∧ r1 != r2 => ¬Role(p,a,r2)
sense(p,e) => isPredicate(p)
Application: SRL
They compare five models:

Model       WSJ      Brown    Train Time  Test Time
Full        75.72%   65.38%   25h         24m
Up          76.96%   63.86%   11h         14m
Down        73.48%   59.34%   22h         23m
Isolated    60.49%   48.12%   11h         14m
Structural  74.93%   64.23%   22h         33m