Probabilistic Inference in PRISM
Taisuke Sato, Tokyo Institute of Technology

Problem: model-specific learning algorithms

Statistical machine learning is a labor-intensive process, a trial-and-error cycle of {modeling → learning → evaluation}*. Each model (Model 1, Model 2, ..., Model n) traditionally needs its own learning routine (EM 1, EM 2, ..., EM n, or VB, MCMC, ...), and deriving and implementing these model-specific learning algorithms and model-specific probabilistic inference procedures is painful.

Our solution

Develop a high-level modeling language that offers universal learning and inference methods (EM, VB, MCMC, ...) applicable to every model (Model 1, Model 2, ..., Model n). The user concentrates on modeling, and the rest (learning and inference) is taken care of by the system.

PRISM

PRISM is a logic-based high-level modeling language. Probabilistic models (Bayesian networks, HMMs, PCFGs, new models, ...) are written as PRISM programs, and the PRISM system supplies the learning methods (EM/MAP, VT, VB, VBVT, MCMC) for all of them. Its generic inference/learning methods subsume standard algorithms such as FB (forward-backward) for HMMs and BP (belief propagation) for Bayesian networks.
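
To make the HMM claim concrete, here is a minimal sketch (not from the slides) of how a discrete HMM can be written with the same values/3 and msw/2 primitives used in the blood-type program later on; the switch names init, tr and out, the two states s0 and s1, the output alphabet {a,b} and all parameter values are assumptions chosen for illustration.

values(init,    [s0,s1], [0.9,0.1]).   % initial-state distribution (assumed)
values(tr(s0),  [s0,s1], [0.7,0.3]).   % transitions out of s0 (assumed)
values(tr(s1),  [s0,s1], [0.2,0.8]).   % transitions out of s1 (assumed)
values(out(s0), [a,b],   [0.5,0.5]).   % emissions in s0 (assumed)
values(out(s1), [a,b],   [0.1,0.9]).   % emissions in s1 (assumed)

hmm(Os) :- msw(init,S), hmm(S,Os).     % choose an initial state, then run the chain

hmm(_,[]).                             % end of the observation sequence
hmm(S,[O|Os]) :-
    msw(out(S),O),                     % emit one symbol from the current state
    msw(tr(S),Next),                   % choose the next state
    hmm(Next,Os).

With this program loaded, a query such as prob(hmm([a,b,b]),P) computes the sequence probability through the propositionalized sum-product computation described below, which is the sense in which the forward(-backward) algorithm is subsumed.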

Basic ideas

Semantics: a program = Turing machine + probabilistic choice + Dirichlet prior; its denotation = a probability measure over possible worlds
Propositionalized probability computation (PPC): programs are written at the predicate-logic level, while probabilities are computed at the propositional-logic level
Dynamic programming for PPC: proof search generates a directed graph (an explanation graph), and probabilities are computed from bottom to top in the graph
Discriminative use: generatively define a model by a PRISM program and discriminatively use it for better prediction performance

ABO blood type program

% probabilistic primitives: msw(gene,a) is true with prob. 0.5, etc.
values(gene,[a,b,o],[0.5,0.2,0.3]).

btype(X):- gtype(Gf,Gm), pg_table(X,[Gf,Gm]).

pg_table(X,GT):-
    ( (X=a;X=b), (GT=[X,o];GT=[o,X];GT=[X,X])
    ; X=o,  GT=[o,o]
    ; X=ab, (GT=[a,b];GT=[b,a]) ).

% simulate gene inheritance from father (Gf, left) and mother (Gm, right)
gtype(Gf,Gm):- msw(gene,Gf), msw(gene,Gm).
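
Because the model is generative, the same program can be run forward to draw random blood types (forward sampling, listed in the Summary at the end). A minimal usage sketch, assuming PRISM's sample/1 built-in; the displayed answer is illustrative and varies from run to run:

| ?- sample(btype(X)).
X = a        % this particular draw; under the declared parameters it occurs with prob. 0.55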

Propositionalized probability computation

Explanation graph for btype(a), explaining how btype(a) is proved by the probabilistic choices made by msw atoms:

  btype(a)
    gtype(a,a) v gtype(a,o) v gtype(o,a)
  gtype(a,a)
    msw(gene,a) & msw(gene,a)
  gtype(a,o)
    msw(gene,a) & msw(gene,o)
  gtype(o,a)
    msw(gene,o) & msw(gene,a)

Probabilities are computed by sum-product computation in a bottom-up manner, using the probabilities assigned to the msw atoms. The explanation graph is acyclic, so dynamic programming (DP) is possible. PPC+DP subsumes forward-backward, belief propagation and inside-outside computation.
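
For concreteness (worked out here, not on the original slide), with the parameters 0.5, 0.2 and 0.3 declared for a, b and o, the bottom-up sum-product computation on this graph is

  P(gtype(a,a)) = 0.5 * 0.5 = 0.25
  P(gtype(a,o)) = 0.5 * 0.3 = 0.15
  P(gtype(o,a)) = 0.3 * 0.5 = 0.15
  P(btype(a))   = 0.25 + 0.15 + 0.15 = 0.55

which matches the value returned by prob(btype(a),P) in Sample session 1 below.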

Learning

A program defines a joint distribution P(x,y|θ), where x is hidden and y is observed, e.g.
  P(msw(gene,a),..., btype(a),... | θ_a, θ_b, θ_o)   where θ_a + θ_b + θ_o = 1

Learning θ from observed data y:
  by maximizing P(y|θ)                                   →  MLE/MAP
  by maximizing P(x*,y|θ), where x* = argmax_x P(x,y|θ)  →  VT

From a Bayesian point of view, a program defines the marginal likelihood ∫ P(x,y|θ,α) dθ. We wish to compute
  the predictive distribution  ∫ P(x|y,θ,α) dθ
  the marginal likelihood      P(y|α) = Σ_x ∫ P(x,y|θ,α) dθ
Both need approximation:
  Variational Bayes (VB)  →  VB, VB-VT
  MCMC                    →  Metropolis-Hastings
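
As a concrete instance of P(y|θ) (worked out here, not on the original slide), the blood-type program assigns each observation btype(a) the likelihood

  P(btype(a)|θ) = θ_a*θ_a + 2*θ_a*θ_o

obtained by summing its three explanations in the graph above; learn(D) in Sample session 2 adjusts θ = (θ_a, θ_b, θ_o) to maximize the product of such likelihoods over the observed data D.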

Sample session 1 - Expl. graph and prob. computation

| ?- prism(blood).
loading::blood.psm.out

| ?- show_sw.
Switch gene: unfixed_p: a (p: 0.5) b (p: 0.2) o (p: 0.3)

| ?- probf(btype(a)).          % built-in predicate
btype(a)
  gtype(a,a) v gtype(a,o) v gtype(o,a)
gtype(a,a)
  msw(gene,a) & msw(gene,a)
gtype(a,o)
  msw(gene,a) & msw(gene,o)
gtype(o,a)
  msw(gene,o) & msw(gene,a)

| ?- prob(btype(a),P).
P = 0.55

Sample session 2 - MLE and Viterbi inference

| ?- D=[btype(a),btype(a),btype(ab),btype(o)], learn(D).
Exporting switch information to the EM routine... done
#em-iters: 0(4) (Converged: )
Statistics on learning:
  Graph size: 18
  Number of switches: 1
  Number of switch instances: 3
  Number of iterations: 4
  Final log likelihood:

| ?- prob(btype(a),P).
P =

| ?- viterbif(btype(a)).
btype(a) <= gtype(a,a)
gtype(a,a) <= msw(gene,a) & msw(gene,a)

Sample session 3 - Bayes inference by MCMC

| ?- D=[btype(a),btype(a),btype(ab),btype(o)],
     marg_mcmc_full(D,[burn_in(1000),end(10000),skip(5)],[VFE,ELM]),
     marg_exact(D,LogM).
VFE =
ELM =
LogM =

| ?- D=[btype(a),btype(a),btype(ab),btype(o)],
     predict_mcmc_full(D,[btype(a)],[[_,E,_]]),
     print_graph(E,[lr('<=')]).
btype(a) <= gtype(a,a)
gtype(a,a) <= msw(gene,a) & msw(gene,a)

Summary

PRISM = Probabilistic Prolog for statistical machine learning
  Forward sampling
  Exact probability computation
  Parameter learning: MLE/MAP, VT
  Bayesian inference: VB, VBVT, MCMC
  Viterbi inference
  Model score: BIC, Cheeseman-Stutz, VFE
  Smoothing
Current version 2.1