Logic for Artificial Intelligence

Logic for Artificial Intelligence: Probabilistic Logics. Yi Zhou

Content Propositional probabilistic logic Bayesian network Markov logic network Conclusion

Propositional Probabilistic Logic. Representation: propositional formulas with probability, e.g., Pr(x∧y) = 0.8. Semantics: $\Pr(\varphi) = \sum_{w \models \varphi} \Pr(w)$. Reasoning: axiom system; not much can be reasoned.
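The semantics above can be sketched in a few lines of Python. This is a minimal illustration, assuming a hypothetical distribution over the four worlds of two variables x and y; the weights are made up and chosen only so that Pr(x∧y) = 0.8 as in the slide's example.

```python
from itertools import product

def probability(formula, world_probs):
    """Pr(phi): sum of Pr(w) over all worlds w that satisfy phi."""
    return sum(p for world, p in world_probs.items() if formula(dict(world)))

# Hypothetical distribution over worlds (assumed values, not from the slides).
variables = ("x", "y")
weights = [0.8, 0.1, 0.05, 0.05]  # order: (T,T), (T,F), (F,T), (F,F)
worlds = {}
for values, p in zip(product([True, False], repeat=2), weights):
    worlds[tuple(zip(variables, values))] = p

pr_x_and_y = probability(lambda w: w["x"] and w["y"], worlds)
pr_x_or_y = probability(lambda w: w["x"] or w["y"], worlds)
```

Any formula over x and y can be passed as a Python predicate on a world; the cost is exponential in the number of variables, which is one reason this representation does not scale.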

Content Propositional probabilistic logic Bayesian network Markov logic network Conclusion

Conditional Probability

Bayes' Theorem

Example network: S → C, with tables P(S) and P(C | S)
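For reference, the two identities named by the slide titles above, written in standard notation. The events A and B are generic placeholders, not symbols taken from the slides.

```latex
P(A \mid B) = \frac{P(A, B)}{P(B)}, \qquad
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```

With the two-node example, taking A = S and B = C gives P(S | C) = P(C | S) P(S) / P(C).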

Sample of General Product Rule (network over nodes X1, ..., X6): p(x1, x2, x3, x4, x5, x6) = p(x6 | x5) p(x5 | x3, x2) p(x4 | x2, x1) p(x3 | x1) p(x2 | x1) p(x1)

Inference tasks. Simple queries: Compute posterior marginal P(Xi | E=e), e.g., P(NoGas | Gauge=empty, Lights=on, Starts=false). Conjunctive queries: P(Xi, Xj | E=e) = P(Xi | E=e) P(Xj | Xi, E=e). Optimal decisions: Decision networks include utility information; probabilistic inference is required to find P(outcome | action, evidence). Value of information: Which evidence should we seek next? Sensitivity analysis: Which probability values are most critical? Explanation: Why do I need a new starter motor?

Approaches to inference. Exact inference: enumeration; belief propagation in polytrees; variable elimination; clustering / join tree algorithms. Approximate inference: stochastic simulation / sampling methods; Markov chain Monte Carlo methods; genetic algorithms; neural networks; simulated annealing; mean field theory.

Direct inference with BNs Instead of computing the joint, suppose we just want the probability for one variable Exact methods of computation: Enumeration Variable elimination Join trees: get the probabilities associated with every query variable

Inference by enumeration. Add all of the terms (atomic event probabilities) from the full joint distribution. If E are the evidence (observed) variables and Y are the other (unobserved) variables, then: $P(X \mid e) = \alpha P(X, e) = \alpha \sum_{y} P(X, e, y)$. Each P(X, e, y) term can be computed using the chain rule. Computationally expensive!

Example: Enumeration (network over nodes a, b, c, d, e). $P(x_i) = \sum_{\pi_i} P(x_i \mid \pi_i) P(\pi_i)$. Suppose we want P(D=true), and only the value of E is given as true: $P(d \mid e) = \alpha \sum_{a,b,c} P(a, b, c, d, e) = \alpha \sum_{a,b,c} P(a) P(b \mid a) P(c \mid a) P(d \mid b, c) P(e \mid c)$. With simple iteration to compute this expression, there's going to be a lot of repetition (e.g., P(e|c) has to be recomputed every time we iterate over C=true).
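A minimal sketch of this enumeration in Python, assuming the network structure implied by the factorization (A → B, A → C, (B, C) → D, C → E). All conditional probability values below are hypothetical, since the slide does not give any.

```python
from itertools import product

# Hypothetical CPTs (assumed numbers, for illustration only).
P_A = 0.3                                    # P(a)
P_B = {True: 0.8, False: 0.2}                # P(b | A)
P_C = {True: 0.6, False: 0.1}                # P(c | A)
P_D = {(True, True): 0.9, (True, False): 0.7,
       (False, True): 0.5, (False, False): 0.1}  # P(d | B, C)
P_E = {True: 0.75, False: 0.25}              # P(e | C)

def bern(p_true, value):
    """Probability of a Boolean value given P(value = True)."""
    return p_true if value else 1.0 - p_true

def joint(a, b, c, d, e):
    """Chain-rule product P(a) P(b|a) P(c|a) P(d|b,c) P(e|c)."""
    return (bern(P_A, a) * bern(P_B[a], b) * bern(P_C[a], c)
            * bern(P_D[(b, c)], d) * bern(P_E[c], e))

def query_d_given_e(e=True):
    """P(D | E=e): sum out A, B, C, then normalize with alpha."""
    unnormalized = {}
    for d in (True, False):
        unnormalized[d] = sum(joint(a, b, c, d, e)
                              for a, b, c in product((True, False), repeat=3))
    alpha = 1.0 / sum(unnormalized.values())
    return {d: alpha * v for d, v in unnormalized.items()}

posterior = query_d_given_e(True)
```

The inner sum recomputes factors such as P(e|c) for every combination of a and b, which is exactly the repetition that variable elimination avoids.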

Exercise: Enumeration
Priors: p(smart) = .8, p(study) = .6, p(fair) = .9

p(prep | smart, study):
          smart   ¬smart
study      .9      .7
¬study     .5      .1

p(pass | smart, prep, fair): columns smart/¬smart and prep/¬prep, rows fair/¬fair; listed entries: fair .9 .7 .2, ¬fair .1

Query: What is the probability that a student studied, given that they pass the exam?

Building BN Structures. Three routes from a problem domain to a Bayesian network: (1) expert knowledge, through a probability elicitor; (2) training data, through a learning algorithm; (3) expert knowledge and training data together, through a learning algorithm.

Learning Probabilities from Data. Exploit conjugate distributions: prior and posterior distributions are in the same family, given a pre-defined functional form of the likelihood. For probability distributions of a variable defined between 0 and 1, and associated with a discrete sample space for the likelihood: Beta distribution for 2 likelihood states (e.g., heads or tails on a coin toss); multivariate Dirichlet distribution for 3+ states in likelihood space.
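As a small illustration of conjugacy, the Beta case reduces to adding observed counts to the prior's parameters. The sketch below assumes a uniform Beta(1, 1) prior and made-up coin-toss counts; the function names are hypothetical.

```python
def beta_update(alpha, beta, heads, tails):
    """Posterior Beta parameters after observing heads and tails."""
    return alpha + heads, beta + tails

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

prior = (1.0, 1.0)                                  # assumed uniform prior
posterior = beta_update(*prior, heads=7, tails=3)   # assumed observations
estimate = beta_mean(*posterior)                    # posterior mean of P(heads)
```

The Dirichlet case works the same way, with one count added per outcome.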

Learning BN Structure from Data Entropy Methods Earliest method Formulated for trees and polytrees Conditional Independence (CI) Define conditional independencies for each node (Markov boundaries) Infer dependencies within Markov boundary Score Metrics Most implemented method Define a quality metric to maximize Use greedy search to determine the next best arc to add Stop when metric does not increase by adding an arc Simulated Annealing & Genetic Algorithms Advancements over greedy search for score metrics

Features for Adding Knowledge to Learning Structure Define Total Order of Nodes Define Partial Order of Nodes by Pairs Define “Cause & Effect” Relations

Content Propositional probabilistic logic Bayesian network Markov logic network Conclusion

Markov Logic Syntax: Weighted first-order formulas Semantics: Templates for Markov nets Inference: WalkSAT, MCMC, KBMC Learning: Voted perceptron, pseudo-likelihood, inductive logic programming.

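Markov Networks. Undirected graphical models (example nodes: Smoking, Cancer, Asthma, Cough). Potential functions defined over cliques: $P(x) = \frac{1}{Z} \prod_c \Phi_c(x_c)$, with $Z = \sum_x \prod_c \Phi_c(x_c)$. Example potential Φ(S, C) over Smoking and Cancer, with table entries False: 4.5, True: 2.7.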

Markov Networks. Undirected graphical models (example nodes: Smoking, Cancer, Asthma, Cough). Log-linear model: $P(x) = \frac{1}{Z} \exp\left(\sum_i w_i f_i(x)\right)$, where $w_i$ is the weight of feature $i$ and $f_i(x)$ is feature $i$.

First-Order Logic. Constants, variables, functions, predicates, e.g.: Anna, x, MotherOf(x), Friends(x,y). Grounding: Replace all variables by constants, e.g.: Friends(Anna, Bob). World (model, interpretation): Assignment of truth values to all ground predicates.

Definition. A Markov Logic Network (MLN) is a set of pairs (F, w), where F is a formula in first-order logic and w is a real number. Together with a set of constants, it defines a Markov network with: one node for each grounding of each predicate in the MLN; one feature for each grounding of each formula F in the MLN, with the corresponding weight w.

Example: Friends & Smokers. Two constants: Anna (A) and Bob (B). Ground atoms: Smokes(A), Smokes(B), Cancer(A), Cancer(B), Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B).

Markov Logic Networks. MLN is template for ground Markov nets. Probability of a world x: $P(x) = \frac{1}{Z} \exp\left(\sum_i w_i n_i(x)\right)$, where $w_i$ is the weight of formula $i$ and $n_i(x)$ is the number of true groundings of formula $i$ in $x$. Typed variables and constants greatly reduce size of ground Markov net. Extensions: functions, existential quantifiers, etc.; infinite and continuous domains.
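The probability of a world can be computed by brute force for the two-constant Friends & Smokers domain. The sketch below is illustrative only: the two formulas and their weights (Smokes(x) ⇒ Cancer(x) with weight 1.5, and Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y)) with weight 1.1) are assumptions, since the transcript does not preserve the formulas used on the slides.

```python
from itertools import product
from math import exp

PEOPLE = ["A", "B"]
W_SMOKE_CANCER = 1.5  # assumed weight: Smokes(x) => Cancer(x)
W_FRIENDS = 1.1       # assumed weight: Friends(x,y) => (Smokes(x) <=> Smokes(y))

# One ground atom per grounding of each predicate (8 atoms in total).
ATOMS = ([("Smokes", p) for p in PEOPLE] + [("Cancer", p) for p in PEOPLE]
         + [("Friends", p, q) for p in PEOPLE for q in PEOPLE])

def weighted_count(world):
    """sum_i w_i * n_i(x), with n_i the number of true groundings."""
    n1 = sum((not world[("Smokes", p)]) or world[("Cancer", p)]
             for p in PEOPLE)
    n2 = sum((not world[("Friends", p, q)])
             or (world[("Smokes", p)] == world[("Smokes", q)])
             for p in PEOPLE for q in PEOPLE)
    return W_SMOKE_CANCER * n1 + W_FRIENDS * n2

def all_worlds():
    """Enumerate all 2^8 truth assignments to the ground atoms."""
    for values in product([False, True], repeat=len(ATOMS)):
        yield dict(zip(ATOMS, values))

Z = sum(exp(weighted_count(w)) for w in all_worlds())

def probability(world):
    """P(x) = exp(sum_i w_i n_i(x)) / Z."""
    return exp(weighted_count(world)) / Z
```

A world that violates one grounding of the first formula is less likely than an otherwise identical world by a factor of exp(1.5), rather than being impossible as it would be in pure first-order logic.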

MAP/MPE Inference. Problem: Find most likely state of world given evidence: $\arg\max_y P(y \mid x) = \arg\max_y \sum_i w_i n_i(x, y)$, where $y$ is the query and $x$ is the evidence. This is just the weighted MaxSAT problem. Use weighted SAT solver (e.g., MaxWalkSAT [Kautz et al., 1997]). Potentially faster than logical inference (!)

The WalkSAT Algorithm
for i ← 1 to max-tries do
    solution = random truth assignment
    for j ← 1 to max-flips do
        if all clauses satisfied then
            return solution
        c ← random unsatisfied clause
        with probability p
            flip a random variable in c
        else
            flip variable in c that maximizes number of satisfied clauses
return failure
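The pseudocode above translates fairly directly into Python. This is a sketch, not a tuned solver: the clause encoding (lists of signed integers, as in DIMACS files), the default parameter values, and the fixed random seed are all assumptions made for the example.

```python
import random

def walksat(clauses, n_vars, max_tries=10, max_flips=1000, p=0.5, seed=0):
    """clauses: list of clauses, each a list of nonzero ints; literal k
    means variable k is true, -k means variable k is false."""
    rng = random.Random(seed)

    def satisfied(clause, assign):
        return any(assign[abs(lit)] == (lit > 0) for lit in clause)

    def num_satisfied(assign):
        return sum(satisfied(c, assign) for c in clauses)

    for _ in range(max_tries):
        assign = {v: rng.random() < 0.5 for v in range(1, n_vars + 1)}
        for _ in range(max_flips):
            unsat = [c for c in clauses if not satisfied(c, assign)]
            if not unsat:
                return assign
            clause = rng.choice(unsat)
            if rng.random() < p:
                # Random-walk move: flip a random variable in the clause.
                var = abs(rng.choice(clause))
            else:
                # Greedy move: flip the variable that satisfies most clauses.
                def score(v):
                    assign[v] = not assign[v]
                    result = num_satisfied(assign)
                    assign[v] = not assign[v]
                    return result
                var = max((abs(lit) for lit in clause), key=score)
            assign[var] = not assign[var]
    return None  # failure

# (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
example_clauses = [[1, 2], [-1, 3], [-2, -3]]
model = walksat(example_clauses, n_vars=3)
```

The function returns a satisfying assignment as a dict from variable index to truth value, or None if every try runs out of flips.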

The MaxWalkSAT Algorithm
for i ← 1 to max-tries do
    solution = random truth assignment
    for j ← 1 to max-flips do
        if ∑ weights(sat. clauses) > threshold then
            return solution
        c ← random unsatisfied clause
        with probability p
            flip a random variable in c
        else
            flip variable in c that maximizes ∑ weights(sat. clauses)
return failure, best solution found
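A weighted variant can be sketched the same way. The version below is a minimal illustration under assumptions: clauses are (weight, literals) pairs with positive weights, literals are signed integers, and the best assignment seen so far is tracked so that it can be returned on failure. The example clauses are made up.

```python
import random

def maxwalksat(weighted_clauses, n_vars, threshold,
               max_tries=10, max_flips=1000, p=0.5, seed=0):
    """Returns (assignment, satisfied weight); stops early once the
    satisfied weight exceeds threshold, else returns the best found."""
    rng = random.Random(seed)

    def satisfied(clause, assign):
        return any(assign[abs(lit)] == (lit > 0) for lit in clause)

    def sat_weight(assign):
        return sum(w for w, c in weighted_clauses if satisfied(c, assign))

    best, best_weight = None, float("-inf")
    for _ in range(max_tries):
        assign = {v: rng.random() < 0.5 for v in range(1, n_vars + 1)}
        for _ in range(max_flips):
            weight = sat_weight(assign)
            if weight > best_weight:
                best, best_weight = dict(assign), weight
            if weight > threshold:
                return dict(assign), weight
            unsat = [c for _, c in weighted_clauses
                     if not satisfied(c, assign)]
            if not unsat:
                break  # everything satisfied; no flip can improve further
            clause = rng.choice(unsat)
            if rng.random() < p:
                var = abs(rng.choice(clause))
            else:
                def gain(v):
                    assign[v] = not assign[v]
                    result = sat_weight(assign)
                    assign[v] = not assign[v]
                    return result
                var = max((abs(lit) for lit in clause), key=gain)
            assign[var] = not assign[var]
    return best, best_weight

# Jointly unsatisfiable toy problem: x1 (5.0), x1 => x2 (3.0), not x2 (1.0).
toy = [(5.0, [1]), (3.0, [-1, 2]), (1.0, [-2])]
map_state, map_weight = maxwalksat(toy, n_vars=2, threshold=7.5)
```

In the toy problem no assignment satisfies all three clauses, so the solver settles for violating the cheapest one.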

But … Memory Explosion. Problem: If there are n constants and the highest clause arity is c, the ground network requires $O(n^c)$ memory. Solution: Exploit sparseness; ground clauses lazily → LazySAT algorithm [Singla & Domingos, 2006]

Computing Probabilities P(Formula|MLN,C) = ? MCMC: Sample worlds, check formula holds P(Formula1|Formula2,MLN,C) = ? If Formula2 = Conjunction of ground atoms First construct min subset of network necessary to answer query (generalization of KBMC) Then apply MCMC (or other) Can also do lifted inference [Braz et al, 2005]

Ground Network Construction
queue ← query nodes
repeat
    node ← front(queue)
    remove node from queue
    add node to network
    if node not in evidence then
        add neighbors(node) to queue
until queue = Ø
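The construction above is a breadth-first search that stops expanding at evidence nodes. A minimal Python sketch follows, assuming the ground network is given as an adjacency dict; the node names in the example are hypothetical. The explicit visited check is an addition that the pseudocode leaves implicit.

```python
from collections import deque

def construct_network(query_nodes, evidence, neighbors):
    """Return the set of nodes needed to answer the query."""
    queue = deque(query_nodes)
    network = set()
    while queue:
        node = queue.popleft()
        if node in network:
            continue  # already expanded
        network.add(node)
        if node not in evidence:
            # Evidence nodes block further expansion.
            queue.extend(n for n in neighbors[node] if n not in network)
    return network

# Hypothetical chain Q - A - E - B, with E observed.
graph = {"Q": ["A"], "A": ["Q", "E"], "E": ["A", "B"], "B": ["E"]}
needed = construct_network(["Q"], evidence={"E"}, neighbors=graph)
```

Node B is left out because the observed node E separates it from the query.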

MCMC: Gibbs Sampling
state ← random truth assignment
for i ← 1 to num-samples do
    for each variable x
        sample x according to P(x | neighbors(x))
        state ← state with new value of x
P(F) ← fraction of states in which F is true
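A minimal Gibbs sampler for a two-atom log-linear model, as a sketch of the loop above. The model (a single feature Smokes ⇒ Cancer with weight 1.5), the burn-in period, and the fixed seed are assumptions made for the example; the pseudocode does not specify them.

```python
import math
import random

WEIGHT = 1.5  # assumed weight of the single feature Smokes => Cancer

def score(state):
    """Unnormalized log-probability: weight times the feature value."""
    return WEIGHT * ((not state["Smokes"]) or state["Cancer"])

def gibbs(query, num_samples=20000, burn_in=1000, seed=0):
    """Estimate P(query) as the fraction of sampled states where it holds."""
    rng = random.Random(seed)
    state = {v: rng.random() < 0.5 for v in ("Smokes", "Cancer")}
    hits = 0
    for i in range(burn_in + num_samples):
        for var in state:
            # P(var = True | rest) from the two unnormalized scores.
            state[var] = True
            s_true = score(state)
            state[var] = False
            s_false = score(state)
            p_true = 1.0 / (1.0 + math.exp(s_false - s_true))
            state[var] = rng.random() < p_true
        if i >= burn_in and query(state):
            hits += 1
    return hits / num_samples

estimate = gibbs(lambda s: s["Cancer"])
```

For this tiny model the exact answer is available in closed form, 2e^1.5 / (3e^1.5 + 1), which is about 0.62, so the estimate can be checked directly.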

But … Insufficient for Logic Problem: Deterministic dependencies break MCMC Near-deterministic ones make it very slow Solution: Combine MCMC and WalkSAT → MC-SAT algorithm [Poon & Domingos, 2006]

Learning Data is a relational database Closed world assumption (if not: EM) Learning parameters (weights) Generatively Discriminatively Learning structure (formulas)

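Generative Weight Learning. Maximize likelihood. Use gradient ascent or L-BFGS. No local maxima. Gradient: $\frac{\partial}{\partial w_i} \log P_w(x) = n_i(x) - E_w[n_i(x)]$, where $n_i(x)$ is the number of true groundings of clause $i$ in the data and $E_w[n_i(x)]$ is the expected number of true groundings according to the model. Requires inference at each step (slow!)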

Pseudo-Likelihood Likelihood of each variable given its neighbors in the data [Besag, 1975] Does not require inference at each step Consistent estimator Widely used in vision, spatial statistics, etc. But PL parameters may not work well for long inference chains

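Discriminative Weight Learning. Maximize conditional likelihood of query (y) given evidence (x): $\frac{\partial}{\partial w_i} \log P_w(y \mid x) = n_i(x, y) - E_w[n_i(x, y)]$, where $n_i(x, y)$ is the number of true groundings of clause $i$ in the data and $E_w[n_i(x, y)]$ is the expected number of true groundings according to the model. Approximate expected counts by counts in MAP state of y given x.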

Voted Perceptron. Originally proposed for training HMMs discriminatively [Collins, 2002]. Assumes network is linear chain.
wi ← 0
for t ← 1 to T do
    yMAP ← Viterbi(x)
    wi ← wi + η [counti(yData) – counti(yMAP)]
return ∑t wi / T

Voted Perceptron for MLNs. HMMs are special case of MLNs. Replace Viterbi by MaxWalkSAT. Network can now be arbitrary graph.
wi ← 0
for t ← 1 to T do
    yMAP ← MaxWalkSAT(x)
    wi ← wi + η [counti(yData) – counti(yMAP)]
return ∑t wi / T
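The update loop can be sketched on a toy domain. Everything specific here is an assumption: the two clauses (Smokes(p) ⇒ Cancer(p) and Cancer(p) ⇒ Smokes(p)), the training data, and the use of brute-force enumeration for the MAP state, which stands in for MaxWalkSAT only because the domain has four possible query states.

```python
from itertools import product

SMOKES = {"A": True, "B": False}        # evidence x (assumed toy data)
CANCER_DATA = {"A": True, "B": False}   # observed query atoms y (assumed)
PEOPLE = list(SMOKES)

def counts(cancer):
    """True-grounding counts for the two assumed clauses."""
    n1 = sum((not SMOKES[p]) or cancer[p] for p in PEOPLE)  # Smokes => Cancer
    n2 = sum((not cancer[p]) or SMOKES[p] for p in PEOPLE)  # Cancer => Smokes
    return [n1, n2]

def map_state(weights):
    """Brute-force MAP over y; a stand-in for MaxWalkSAT."""
    candidates = (dict(zip(PEOPLE, values))
                  for values in product([False, True], repeat=len(PEOPLE)))
    return max(candidates,
               key=lambda y: sum(w * n for w, n in zip(weights, counts(y))))

def voted_perceptron(num_iters=5, eta=1.0):
    """Average of the weight vectors after each of the T updates."""
    weights = [0.0, 0.0]
    totals = [0.0, 0.0]
    data_counts = counts(CANCER_DATA)
    for _ in range(num_iters):
        map_counts = counts(map_state(weights))
        weights = [w + eta * (d - m)
                   for w, d, m in zip(weights, data_counts, map_counts)]
        totals = [t + w for t, w in zip(totals, weights)]
    return [t / num_iters for t in totals]

learned = voted_perceptron()
```

On this data the first update raises the weight of Smokes ⇒ Cancer, after which the MAP state matches the data and the weights stop changing. Ties between equally scored states are broken by enumeration order, which a real solver would not guarantee.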

Structure Learning Generalizes feature induction in Markov nets Any inductive logic programming approach can be used, but . . . Goal is to induce any clauses, not just Horn Evaluation function should be likelihood Requires learning weights for each candidate Turns out not to be bottleneck Bottleneck is counting clause groundings Solution: Subsampling

Structure Learning Initial state: Unit clauses or hand-coded KB Operators: Add/remove literal, flip sign Evaluation function: Pseudo-likelihood + Structure prior Search: Beam [Kok & Domingos, 2005] Shortest-first [Kok & Domingos, 2005] Bottom-up [Mihalkova & Mooney, 2007]

Content Propositional probabilistic logic Bayesian network Markov logic network Conclusion

Applications Information retrieval Link prediction Machine learning Semantic parsing

Probabilistic logics Propositional probabilistic logic = propositional logic + probability First-order probabilistic logic = propositional probabilistic logic + first-order quantifier Bayesian network = propositional probabilistic logic + conditional dependence represented by DAG Markov logic network = first-order probabilistic logic + conditional dependence represented by undirected graph Representation, reasoning, learning

Classical logics vs Probabilistic logics

Thank you!