
1 Machine Learning 陈昱 Institute of Computer Science and Technology, Peking University, Information Security Engineering Research Center

2 Course Information  Lecturer: 陈昱, chen_yu@pku.edu.cn, Tel: 82529680  Teaching assistant: 程再兴, Tel: 62763742, wataloo@hotmail.com  Course web page: http://www.icst.pku.edu.cn/course/jiqixuexi/jqxx2011.mht 2

3 Ch9 Graphical Models  Bayesian Network  Conditional Independence  Markov Random Field  Inference in Graphical Model  Learning Bayesian Network  Expectation Maximization (EM) Algorithm 3

4 Acknowledgement  Most material in these slides comes from C. Bishop’s slides for Ch8 of his PRML book, downloadable from the PRML Web page 4

5 Introduction: An Illustrative Example  I’m at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn’t call. Sometimes the alarm is set off by minor earthquakes. Is there a burglar?  Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls.  Network topology reflects "causal" knowledge: A burglar can set the alarm off An earthquake can set the alarm off The alarm can cause Mary to call The alarm can cause John to call

6 Example (contd)
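As a sketch of the structure this example describes, the causal links listed on the previous slide imply the following factorization of the joint distribution (variables abbreviated to their initials), with one conditional probability table per node:

\[ p(B, E, A, J, M) = p(B)\, p(E)\, p(A \mid B, E)\, p(J \mid A)\, p(M \mid A) \]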

7 Introduction (contd)  Benefits of a diagrammatic representation of a probability distribution: Visualizing the structure of the model, and therefore providing insight into its properties Complex computations, required to perform inference and learning in the model, can be expressed in terms of graphical manipulations  Types of graphical models: Bayesian network (directed graphical model) Markov random field (undirected graphical model)  Factor graph representation for inference 7 Probabilistic Graphical Model Example

8 Ch9 Graphical Models  Bayesian Network  Conditional Independence  Markov Random Field  Inference in Graphical Model  Learning Bayesian Network  Expectation Maximization (EM) Algorithm 8

9 Bayesian Networks Directed Acyclic Graph (DAG)

10 Bayesian Networks General Factorization
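The general factorization referred to here is Bishop's standard result for a directed acyclic graph with K nodes, where pa_k denotes the set of parents of node x_k:

\[ p(\mathbf{x}) = \prod_{k=1}^{K} p(x_k \mid \mathrm{pa}_k) \]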

11 Bayesian Curve Fitting (1) Polynomial

12 Bayesian Curve Fitting (2) Plate

13 Bayesian Curve Fitting (3) Input variables and explicit hyperparameters

14 Bayesian Curve Fitting—Learning Condition on data

15 Bayesian Curve Fitting—Prediction Predictive distribution: replacing the prior over w with its posterior distribution, it follows that

16 Generative Models Causal process for generating images Ancestral sampling: sample the parents of x_k before sampling x_k itself.
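A minimal sketch of ancestral sampling in Python, under the assumption that the network is given as a topologically ordered list of nodes with per-node samplers; the two-node network and its probabilities below are purely hypothetical:

```python
import random

def ancestral_sample(nodes):
    """Draw one joint sample from a Bayesian network.

    `nodes` is a list of (name, parents, sampler) triples in topological
    order, where `sampler` maps a dict of parent values to a sampled value.
    Each x_k is therefore sampled only after all of its parents pa_k.
    """
    sample = {}
    for name, parents, sampler in nodes:
        sample[name] = sampler({p: sample[p] for p in parents})
    return sample

# Hypothetical two-node example: Rain -> WetGrass (probabilities are made up).
nodes = [
    ("Rain", [], lambda pa: random.random() < 0.2),
    ("WetGrass", ["Rain"], lambda pa: random.random() < (0.9 if pa["Rain"] else 0.1)),
]
print(ancestral_sample(nodes))
```

Because the nodes are visited in topological order, every conditional p(x_k | pa_k) is sampled only after its parents, which is exactly the ordering the slide describes.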

17 Discrete Variable Consider a parent-child pair, which is a basic building block in a probabilistic graphical model  Recall that the probability distribution p(x|μ) for a single variable x having K possible states is given by
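A restatement of that distribution in 1-of-K (one-hot) coding, with x = (x_1, …, x_K), x_k ∈ {0, 1} and Σ_k x_k = 1:

\[ p(\mathbf{x} \mid \boldsymbol{\mu}) = \prod_{k=1}^{K} \mu_k^{x_k}, \qquad \sum_{k=1}^{K} \mu_k = 1 \]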

18 Discrete Variables (1) General joint distribution: K^2 − 1 parameters Independent joint distribution: 2(K − 1) parameters

19 Discrete Variables (2) General joint distribution over M variables: K^M − 1 parameters M-node Markov chain: K − 1 + (M − 1)K(K − 1) parameters
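A quick worked check of these counts for binary variables (K = 2) and a chain of M = 3 nodes:

\[ K^M - 1 = 2^3 - 1 = 7, \qquad K - 1 + (M-1)K(K-1) = 1 + 2 \cdot 2 \cdot 1 = 5 \]

so the chain-structured factorization already needs fewer parameters than the unrestricted joint, and the gap grows exponentially with M.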

20 Discrete Variables: Bayesian Parameters (1)

21 Discrete Variables: Bayesian Parameters (2) Shared prior

22 Parameterized Conditional Distributions If x_1, …, x_M are discrete, K-state variables, the general conditional distribution p(y | x_1, …, x_M) has O(K^M) parameters. The parameterized form requires only M + 1 parameters

23 Linear-Gaussian Models Directed Graph Vector-valued Gaussian Nodes Each node is Gaussian; its mean is a linear function of its parents.
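Concretely, in Bishop's formulation each node x_i has a Gaussian conditional whose mean is linear in its parents, which makes the joint distribution over all nodes Gaussian:

\[ p(x_i \mid \mathrm{pa}_i) = \mathcal{N}\!\left(x_i \,\Big|\, \sum_{j \in \mathrm{pa}_i} w_{ij} x_j + b_i,\; v_i\right) \]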

24 Ch9 Graphical Models  Bayesian Network  Conditional Independence  Markov Random Field  Inference in Graphical Model  Learning Bayesian Network  Expectation Maximization (EM) Algorithm 24

25 Conditional Independence a is independent of b given c. Equivalently (see the equations below), b is independent of a given c. Notation: a ⊥⊥ b | c
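Spelled out in equations (a standard restatement of the definition on this slide):

\[ p(a \mid b, c) = p(a \mid c) \quad\Longleftrightarrow\quad p(a, b \mid c) = p(a \mid c)\, p(b \mid c) \]

Either form can be taken as the definition of a ⊥⊥ b | c; the right-hand version makes the symmetry between a and b explicit.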

26 Conditional Independence: Example 1

27

28 Conditional Independence: Example 2

29

30 Conditional Independence: Example 3 Note: this is the opposite of Example 1, with c unobserved.

31 Conditional Independence: Example 3 Note: this is the opposite of Example 1, with c observed.

32 “Am I out of fuel?” B = Battery (0 = flat, 1 = fully charged) F = Fuel Tank (0 = empty, 1 = full) G = Fuel Gauge Reading (0 = empty, 1 = full) and hence

33 “Am I out of fuel?”  Note that the probability of an empty tank increases on observing G = 0. 33

34 “Am I out of fuel?”  Note that the probability of an empty tank is reduced by additionally observing B = 0. This is referred to as “explaining away”. 34

35 The Burglary Example  p(B=True|J=True, M=True)=?
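A sketch of answering this query by brute-force enumeration over the hidden variables E and A. The network structure is the one introduced earlier; the numeric CPT values below are illustrative assumptions, not values given in the slides:

```python
from itertools import product

# Hypothetical CPTs for the burglary network (the numeric values below are
# illustrative assumptions, not taken from the slides).
P_B = {True: 0.001, False: 0.999}                      # P(Burglary)
P_E = {True: 0.002, False: 0.998}                      # P(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}     # P(Alarm=True | B, E)
P_J = {True: 0.90, False: 0.05}                        # P(JohnCalls=True | A)
P_M = {True: 0.70, False: 0.01}                        # P(MaryCalls=True | A)

def joint(b, e, a, j, m):
    """p(B,E,A,J,M) = p(B) p(E) p(A|B,E) p(J|A) p(M|A)."""
    pa = P_A[(b, e)] if a else 1.0 - P_A[(b, e)]
    pj = P_J[a] if j else 1.0 - P_J[a]
    pm = P_M[a] if m else 1.0 - P_M[a]
    return P_B[b] * P_E[e] * pa * pj * pm

# Sum out the unobserved variables E and A for each value of the query variable B.
unnorm = {b: sum(joint(b, e, a, True, True)
                 for e, a in product([True, False], repeat=2))
          for b in (True, False)}
posterior = unnorm[True] / (unnorm[True] + unnorm[False])
print(f"p(B=True | J=True, M=True) ~= {posterior:.3f}")
```

With these particular numbers the posterior comes out to roughly 0.28: the two calls make a burglary far more plausible than its tiny prior, but an alarm triggered for other reasons still explains most of the evidence.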

36 D-separation A, B, and C are non-intersecting subsets of nodes in a directed graph. A path from A to B is blocked if it contains a node such that either a) the arrows on the path meet either head-to-tail or tail-to-tail at the node, and the node is in the set C, or b) the arrows meet head-to-head at the node, and neither the node nor any of its descendants is in the set C. If all paths from A to B are blocked, A is said to be d-separated from B by C. If A is d-separated from B by C, the joint distribution over all variables in the graph satisfies A ⊥⊥ B | C.

37 D-separation: Example

38 D-separation: I.I.D. Data

39 Directed Graphs as Distribution Filters

40 The Markov Blanket Factors independent of x_i cancel between numerator and denominator.
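Written out, the conditional of x_i given all the remaining variables is a ratio of the full factorization to its sum over x_i; every factor not involving x_i cancels, leaving only the factors for x_i and for its children:

\[ p(x_i \mid \mathbf{x}_{\{j \neq i\}}) = \frac{\prod_k p(x_k \mid \mathrm{pa}_k)}{\sum_{x_i} \prod_k p(x_k \mid \mathrm{pa}_k)} \;\propto\; p(x_i \mid \mathrm{pa}_i) \prod_{k \in \mathrm{ch}(i)} p(x_k \mid \mathrm{pa}_k) \]

Hence the Markov blanket of x_i in a directed graph consists of its parents, its children, and its children's other parents (the co-parents).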

41 Ch9 Graphical Models  Bayesian Network  Conditional Independence  Markov Random Field  Inference in Graphical Model  Learning Bayesian Network  Expectation Maximization (EM) Algorithm 41

42 Conditional Independence Properties Markov Blanket The conditional distribution of x_i, conditioned on all the other nodes, depends only on the variables in the Markov blanket.

43 Factorization Properties  Wanted: a factorization rule for undirected graphs corresponding to the conditional independence test, i.e. the joint distribution should be represented as a product of functions defined over sets of random variables that are local in the graph.  Q: What is an appropriate definition of locality?  Consider a simple case: let x_i and x_j be two variables such that there is no link between them; it follows that p(x_i, x_j | x_rest) = p(x_i | x_rest) p(x_j | x_rest), where x_rest denotes all variables other than x_i and x_j.

44 Cliques and Maximal Cliques Clique Maximal Clique

45 Joint Distribution where ψ_C(x_C) is the potential over clique C and Z is the normalization coefficient; note: M K-state variables ⇒ K^M terms in Z. Energies and the Boltzmann distribution
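The joint distribution, its normalization constant, and the Boltzmann form of the potentials, restated in Bishop's notation:

\[ p(\mathbf{x}) = \frac{1}{Z} \prod_{C} \psi_C(\mathbf{x}_C), \qquad Z = \sum_{\mathbf{x}} \prod_{C} \psi_C(\mathbf{x}_C), \qquad \psi_C(\mathbf{x}_C) = \exp\{-E(\mathbf{x}_C)\} \]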

46 Illustration: Image De-Noising Original image described by binary pixel values x_i = ±1 Noisy image described by binary pixel values y_i = ±1, generated by flipping the sign of each pixel x_i in the original image with some small probability

47 Formulation of the De-Noising Problem Such a formulation is an example of the Ising model, which has been widely studied in statistical physics.

48 Choosing an Energy Function
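A hedged restatement of the energy function used for this model in PRML, where h, β, η are non-negative coefficients weighting the bias term, the coupling between neighbouring pixels, and the coupling between each pixel and its noisy observation:

\[ E(\mathbf{x}, \mathbf{y}) = h \sum_i x_i \;-\; \beta \sum_{\{i,j\}} x_i x_j \;-\; \eta \sum_i x_i y_i \]

Low energy (high probability) is achieved when neighbouring pixels agree and when each restored pixel agrees with its observation.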

49 De-Noising by ICM Algorithm Noisy Image / Restored Image (ICM)
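A minimal sketch of Iterated Conditional Modes (ICM) for the energy above, assuming a 4-connected pixel grid; the coefficient values are illustrative defaults, not numbers taken from the slides:

```python
import numpy as np

def icm_denoise(y, h=0.0, beta=1.0, eta=2.1, sweeps=10):
    """Iterated Conditional Modes for the Ising-style energy sketched above.

    y : 2-D numpy array of observed pixels in {-1, +1}.
    Returns a restored image x of the same shape, values in {-1, +1}.
    The coefficients h, beta, eta are illustrative defaults only.
    """
    x = y.copy()
    rows, cols = x.shape
    for _ in range(sweeps):
        for i in range(rows):
            for j in range(cols):
                # Sum of the 4-connected neighbours of pixel (i, j).
                nb = sum(x[a, b] for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                         if 0 <= a < rows and 0 <= b < cols)
                # Local energy terms when x[i, j] = s are  h*s - beta*s*nb - eta*s*y[i, j].
                e_plus = h - beta * nb - eta * y[i, j]   # energy if x[i, j] = +1
                e_minus = -e_plus                        # energy if x[i, j] = -1
                x[i, j] = 1 if e_plus < e_minus else -1
    return x
```

Each update sets one pixel to whichever of ±1 gives the lower local energy with the rest of the image held fixed, so the total energy never increases, but the result is only a local optimum (which is why the graph-cut result on the next slide can be better).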

50 De-Noising by Graph-Cut Algorithm Restored Image (Graph cuts) / Restored Image (ICM)

51 Converting Directed to Undirected Graphs (1)

52 Converting Directed to Undirected Graphs (2) Additional links The process of “marrying the parents” is called moralization

53 Conditional Independence in Directed & Undirected Graphs  D map: A graph is said to be a D map of a distribution if every conditional independence statement satisfied by the distribution is reflected in the graph E.g. a fully disconnected graph (no links) is a D map of any distribution  I map: If every conditional independence statement implied by a graph is satisfied by a specific distribution, then the graph is said to be an I map of that distribution. E.g. a fully connected graph is an I map of any distribution  Perfect map: If every conditional independence property of a distribution is reflected in a graph, and vice versa, then the graph is said to be a perfect map for the distribution (both a D map and an I map).

54 (Contd) Conditional Independence P: set of all distributions over a given set of variables D: distributions that can be represented as a perfect map using a directed graph U: distributions that can be represented as a perfect map using an undirected graph

55 (Contd) Conditional Independence The above conditional independence properties can’t be represented using an undirected graph over the same variables The above conditional independence properties can’t be represented using a directed graph over the same variables

56 Ch9 Graphical Models  Bayesian Network  Conditional Independence  Markov Random Field  Inference in Graphical Model  Learning Bayesian Network  Expectation Maximization (EM) Algorithm 56

57 Introduction  Inference: Assume some nodes (random variables) in a graph are clamped to observed values; compute the posterior distribution of certain subsets of the nodes.  Notice that a posterior distribution is a quotient of two (marginal) probabilities in which portions of the variables are clamped to observed values, so the basic building block of an inference problem is the computation of a marginal distribution.  A marginal distribution is expressed in terms of sums of products; since the underlying variables form a graphical model, often satisfying conditional independence properties, computing these sums of products can be regarded as propagation of local messages. 57

58 Introduction (2)  An inference task should exploit the structure of the graphical model so as to make computation more efficient Chain (simple structure) Tree (exactly one path between any pair of nodes) Arbitrary graph → junction tree  Focus on exact inference  Approximate methods Variational methods Sampling Loopy belief propagation 58

59 Example: Graphical Representation of Bayes’ Theorem where
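Written out, the relations being depicted are just Bayes' theorem over the two-node graph x → y:

\[ p(x, y) = p(x)\, p(y \mid x), \qquad p(y) = \sum_{x'} p(y \mid x')\, p(x'), \qquad p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)} \]

Observing y and applying Bayes' theorem corresponds graphically to reversing the arrow: the same joint is re-expressed as p(y) p(x | y).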

60 Inference on a Chain (1)

61 Inference on a Chain (2)

62 Inference on a Chain (3)

63 Inference on a Chain (4)

64 Inference on a Chain: Summary To compute the probability distribution p(x_n) of a single variable: compute all forward messages, then all backward messages, and evaluate their normalized product (see the messages written out below).
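The forward and backward messages for a chain with pairwise potentials ψ_{n-1,n}, and the marginal they yield, in Bishop's notation:

\[ \mu_\alpha(x_n) = \sum_{x_{n-1}} \psi_{n-1,n}(x_{n-1}, x_n)\, \mu_\alpha(x_{n-1}), \qquad \mu_\beta(x_n) = \sum_{x_{n+1}} \psi_{n,n+1}(x_n, x_{n+1})\, \mu_\beta(x_{n+1}) \]
\[ p(x_n) = \frac{1}{Z}\, \mu_\alpha(x_n)\, \mu_\beta(x_n) \]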

65 Marginals of Multiple Variables  Similarly, the joint distribution of two neighboring nodes on the chain can be written as p(x_{n−1}, x_n) = (1/Z) μ_α(x_{n−1}) ψ_{n−1,n}(x_{n−1}, x_n) μ_β(x_n)  We can extend the above formula to the case of k neighboring nodes on the chain, but if these nodes aren’t neighboring, then the formula becomes less elegant.  Note that the time complexity of computing these marginal distributions is linear in the number of nodes.

66 Trees Undirected tree: exactly one path between any pair of nodes. Directed tree: every non-root node has exactly one parent. (Directed) polytree: one path between any pair of nodes. Here a tree is defined as a structure satisfying (1) no loops and (2) exactly one path between any pair of nodes; for graphical models this gives the three kinds of tree above. The message-passing formalism derived for chains can be extended to trees

67 Factor Graphs

68 Factor Graphs from Directed Graphs

69 Factor Graphs from Undirected Graphs

70 Remarks on Conversion  For both directed and undirected graphs, there are multiple ways of converting them into factor graphs  For all three kinds of tree, converting them into a factor graph again results in a tree. In contrast, converting a directed polytree into an undirected graph results in a graph with loops (Figure: a directed polytree; its conversion into an undirected graph; its conversion into a factor graph)

71 Remarks on Conversion (2)  Local cycles in both directed and undirected graphs can be removed on conversion by defining appropriate factor functions, as the following figures illustrate: 71

72 The Sum-Product Algorithm (1) Objective: i. to obtain an efficient, exact inference algorithm for finding marginals; ii. in situations where several marginals are required, to allow computations to be shared efficiently. Key idea: Distributive Law

73 The Sum-Product Algorithm (2) ne(x): set of factor nodes that are neighbors of variable x. X_s: set of all variables in the subtree connected to x via the factor node f_s. F_s(x, X_s): product of all the factors in the group associated with factor node f_s.
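With this notation, the two message types of the sum-product algorithm and the resulting marginal are (restating Bishop's equations; x_1, …, x_M denote the variable neighbours of f_s other than x):

\[ \mu_{f_s \to x}(x) = \sum_{x_1} \cdots \sum_{x_M} f_s(x, x_1, \ldots, x_M) \prod_{m \in \mathrm{ne}(f_s) \setminus \{x\}} \mu_{x_m \to f_s}(x_m) \]
\[ \mu_{x_m \to f_s}(x_m) = \prod_{l \in \mathrm{ne}(x_m) \setminus \{f_s\}} \mu_{f_l \to x_m}(x_m), \qquad p(x) \propto \prod_{s \in \mathrm{ne}(x)} \mu_{f_s \to x}(x) \]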

74 The Sum-Product Algorithm (3)

75 The Sum-Product Algorithm (4)

76 The Sum-Product Algorithm (5)

77 The Sum-Product Algorithm (6)

78 The Sum-Product Algorithm (7) Initialization

79 The Sum-Product Algorithm (8) To compute local marginals: Pick an arbitrary node as root Compute and propagate messages from the leaf nodes to the root, storing received messages at every node. Compute and propagate messages from the root to the leaf nodes, storing received messages at every node. Compute the product of received messages at each node for which the marginal is required, and normalize if necessary.

80 Sum-Product: Example (1)

81 Sum-Product: Example (2)

82 Sum-Product: Example (3)

83 Sum-Product: Example (4)

84 The Max-Sum Algorithm (1) Objective: an efficient algorithm for finding i. the value x_max that maximises p(x); ii. the value of p(x_max). In general, maximum of the marginals ≠ joint maximum.

85 The Max-Sum Algorithm (2) Maximizing over a chain (max-product)

86 The Max-Sum Algorithm (3) Generalizes to tree-structured factor graph maximizing as close to the leaf nodes as possible

87 The Max-Sum Algorithm (4) Max-Product → Max-Sum For numerical reasons, work with ln p(x), using ln max = max ln. Again, use the distributive law: max(a + b, a + c) = a + max(b, c)

88 The Max-Sum Algorithm (5) Initialization (leaf nodes) Recursion

89 The Max-Sum Algorithm (6) Termination (root node) Back-track, for all nodes i with l factor nodes to the root ( l=0 )

90 The Max-Sum Algorithm (7) Example: Markov chain
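For the Markov-chain case, max-sum reduces to the Viterbi algorithm. A minimal sketch assuming log-domain unary potentials per node and a single shared pairwise potential (both assumptions made just to keep the example short):

```python
import numpy as np

def max_sum_chain(log_unary, log_pair):
    """Max-sum (Viterbi) on a chain x_1 - x_2 - ... - x_N.

    log_unary : (N, K) array of log psi_n(x_n).
    log_pair  : (K, K) array of log psi(x_{n-1}, x_n), shared across links
                (an assumption made to keep the sketch short).
    Returns the jointly most probable configuration as a list of states.
    """
    N, K = log_unary.shape
    msg = log_unary[0].copy()              # message carried forward along the chain
    backptr = np.zeros((N, K), dtype=int)  # best previous state for each current state
    for n in range(1, N):
        # For each state of x_n, maximise over the previous state x_{n-1}.
        scores = msg[:, None] + log_pair + log_unary[n][None, :]
        backptr[n] = np.argmax(scores, axis=0)
        msg = np.max(scores, axis=0)
    # Termination at the last node, then back-track to recover the argmax.
    states = [int(np.argmax(msg))]
    for n in range(N - 1, 0, -1):
        states.append(int(backptr[n, states[-1]]))
    return states[::-1]
```

Messages store running maxima instead of sums, and the back-pointers recorded during the forward pass implement the back-tracking step described on the previous slides.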

91 The Junction Tree Algorithm Exact inference on general graphs. Works by turning the initial graph into a junction tree and then running a sum-product-like algorithm. Intractable on graphs with large cliques.

92 Loopy Belief Propagation Sum-Product on general graphs. Initial unit messages passed across all links, after which messages are passed around until convergence (not guaranteed!). Approximate but tractable for large graphs. Sometimes works well, sometimes not at all.

93 Ch9 Graphical Models  Bayesian Network  Conditional Independence  Markov Random Field  Inference in Graphical Model  Learning Bayesian Network  Expectation Maximization (EM) Algorithm 93

94 Learning Bayesian Network The complexity of the learning task depends on whether the structure of the network (nodes and links) is known and whether the training examples provide values of all network variables, or just some. If the structure is known and all variables are observable  The conditional probability tables (CPTs) can be estimated from the training examples by simply counting frequencies, as in the Naïve Bayes classifier. Recall that a CPT entry can be denoted w_ijk = P{Y_i = y_ij | parents(Y_i) = the list u_ik of values}. For example, consider the “Alarm” node in the “Burglary” network: it has two possible values, whereas its parents have four possible combinations of values. 94
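In this fully observed case, the frequency-counting estimate mentioned above is just the maximum-likelihood estimate

\[ \hat{w}_{ijk} = \frac{\#\{d \in D : Y_i = y_{ij} \text{ and } U_i = u_{ik}\}}{\#\{d \in D : U_i = u_{ik}\}} \]

i.e. the fraction of training examples with parent configuration u_ik in which Y_i takes the value y_ij.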

95 Learning Bayesian Network (2)  If the structure is known but some nodes are unobservable, then one approach to learning the related CPTs is analogous to learning the weights of hidden units in a neural network (Russell et al. 1995). Let the hypothesis space be the set of all possible values for the entries in the CPTs; the desired values are then modeled as the ML hypothesis given the training examples D, and the search is performed via a gradient ascent method. It can be shown that 95
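The gradient referred to here, in the w_ijk notation above, is (a hedged restatement of the result from the cited gradient-ascent treatment, where P_w denotes the distribution defined by the current CPT entries):

\[ \frac{\partial \ln P_w(D)}{\partial w_{ijk}} = \sum_{d \in D} \frac{P_w(Y_i = y_{ij},\, U_i = u_{ik} \mid d)}{w_{ijk}} \]

Each term can be computed by running inference in the current network with the example d as evidence.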

96 Learning Bayesian Network (3)  (Gradient ascent method, continued) Therefore the weight (CPT value) update rule becomes  As with other gradient-based algorithms, the above algorithm only guarantees that its solution converges to a local maximum (likelihood) hypothesis.  An alternative to gradient-based methods is the EM (expectation-maximization) algorithm (which also converges to a local maximum) 96
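A hedged restatement of that update rule, with learning rate η; after each step the entries are renormalized so that they remain valid conditional probabilities:

\[ w_{ijk} \leftarrow w_{ijk} + \eta \sum_{d \in D} \frac{P_w(y_{ij},\, u_{ik} \mid d)}{w_{ijk}}, \qquad \text{then renormalize so that } \sum_j w_{ijk} = 1 \text{ and } 0 \le w_{ijk} \le 1 \]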

97 Learning Bayesian Network (4)  If the structure is unknown, there are a couple of approaches: Add/subtract edges and nodes based on some optimization criterion Start from a simple network and add nodes and edges, e.g. the CASCADE-CORRELATION algorithm for dynamically modifying a neural network Start from a complex one and subtract nodes and edges Given a list of candidate structures, score them and see which one best fits the data (roughly, which is most likely) Ref: N. Friedman and Z. Yakhini, “On the sample complexity of learning Bayesian networks”, Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, 1996. 97

98 Learning Bayesian Network (5)  There is also a Bayesian approach to learning a Bayesian network, i.e. seeking the MAP rather than the ML hypothesis. E.g. David Heckerman et al., “Learning Bayesian Networks: The Combination of Knowledge and Statistical Data”, Machine Learning, 20:197–243, 1995. 98

99 Ch9 Graphical Models  Bayesian Network  Conditional Independence  Markov Random Field  Inference in Graphical Model  Learning Bayesian Network  Expectation Maximization (EM) Algorithm 99

100 HW  8.11 (10 pt, due Nov 30, Wednesday)

