1
Pattern Recognition and Machine Learning - Chapter 2: Probability Distributions (Part 2) + Graphs
Affiliation: Kyoto University Name: Kevin Chien, Dr. Oba Shigeyuki, Dr. Ishii Shin Date: Nov 04, 2011
2
Terminologies: for understanding distributions
3
Terminologies. Schur complement: relates a partitioned matrix to its inverse (used when inverting block matrices). Completing the square: converting a quadratic of the form ax^2 + bx + c into a(x + ...)^2 + const, used to match quadratic terms against a standard Gaussian exponent and read off unknown parameters, or simply to solve the quadratic. Robbins-Monro algorithm: iterative root finding for an unobserved regression function M(x) expressed as a mean, i.e. E[N(x)] = M(x).
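As a quick reference (standard identities, not verbatim from the slides):

ax^2 + bx + c = a\left(x + \frac{b}{2a}\right)^2 + c - \frac{b^2}{4a}

For a partitioned matrix M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}, the Schur complement of D is M/D = A - BD^{-1}C, and the top-left block of M^{-1} equals (A - BD^{-1}C)^{-1}.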
4
Terminologies (cont.) Trace Tr(W): the sum of the diagonal elements of W. Conditions on the Robbins-Monro step sizes a_N: a_N -> 0, sum_N a_N = infinity, sum_N a_N^2 < infinity [Stochastic approximation, Wikipedia, 2011]. Degrees of freedom: the dimension of a subspace; here it refers to a hyperparameter.
5
Gaussian distributions and motives
6
Conditional Gaussian Distribution
Partition x into (x_a, x_b). The conditional mean and covariance are derived by completing the square, noting the Schur complement. Linear-Gaussian model: observations are a weighted sum of underlying latent variables. The conditional mean is linear in x_b; the conditional covariance is independent of x_b (see below).
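The resulting expressions (following Bishop, Section 2.3.1, with mean blocks \mu_a, \mu_b and covariance blocks \Sigma_{aa}, \Sigma_{ab}, \Sigma_{bb}):

\mu_{a|b} = \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b)
\Sigma_{a|b} = \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}

The mean is linear in x_b, and the covariance does not depend on x_b, as stated above.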
7
Marginal Gaussian Distribution
The goal is again to identify the mean and covariance by 'completing the square': solve the marginalization integral, noting the Schur complement, and compare components against the standard Gaussian form (see below).
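Completing the square in the marginalization integral gives (following Bishop, Section 2.3.2):

p(x_a) = \mathcal{N}(x_a \mid \mu_a, \Sigma_{aa}), \qquad \mathbb{E}[x_a] = \mu_a, \qquad \operatorname{cov}[x_a] = \Sigma_{aa}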
8
Bayesian relationship with Gaussian distr. (quick view)
Consider a multivariate Gaussian over the joint variable (x, y). By Bayes' rule, p(y|x) = p(x, y) / p(x), so the conditional Gaussian must have an exponent equal to the difference between the exponents of p(x, y) and p(x), which can then be rearranged into standard Gaussian form.
9
Bayesian relationship with Gaussian distr.
Starting from the mean and covariance of the joint Gaussian p(x, y), we obtain the mean and covariance of p(x|y). Here p(x) can be seen as the prior, p(y|x) as the likelihood, and p(x|y) as the posterior (see the summary below).
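The linear-Gaussian results being referred to (following Bishop, Section 2.3.3):

p(x) = \mathcal{N}(x \mid \mu, \Lambda^{-1}) \quad \text{(prior)}
p(y \mid x) = \mathcal{N}(y \mid Ax + b, L^{-1}) \quad \text{(likelihood)}
p(y) = \mathcal{N}(y \mid A\mu + b, L^{-1} + A\Lambda^{-1}A^{\top}) \quad \text{(marginal)}
p(x \mid y) = \mathcal{N}\big(x \mid \Sigma\{A^{\top}L(y - b) + \Lambda\mu\}, \Sigma\big), \qquad \Sigma = (\Lambda + A^{\top}LA)^{-1} \quad \text{(posterior)}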
10
Bayesian relationship with Gaussian distr., sequential est.
Estimate the mean from N observations as an update of the estimate from the first N-1 observations plus the Nth data point. This update has the same form as the Robbins-Monro algorithm, so the maximum-likelihood mean can be obtained sequentially by a Robbins-Monro iteration (a sketch follows below).
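A minimal sketch of the sequential mean update, mu_N = mu_{N-1} + (1/N)(x_N - mu_{N-1}), i.e. the maximum-likelihood mean written in Robbins-Monro form with step size a_N = 1/N (function and variable names are illustrative):

def sequential_mean(observations):
    """Online estimate of the mean: mu_N = mu_{N-1} + (1/N)(x_N - mu_{N-1})."""
    mu = 0.0
    for n, x in enumerate(observations, start=1):
        mu += (x - mu) / n   # Robbins-Monro-style update with step size 1/N
    return mu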
11
Bayesian relationship with Univariate Gaussian distr.
The conjugate prior for the precision (inverse variance) of a univariate Gaussian is the gamma distribution. When both mean and precision are unknown, the conjugate prior is the Gaussian-gamma distribution (see below).
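For reference, the gamma and Gaussian-gamma forms (following Bishop, Section 2.3.6):

\operatorname{Gam}(\lambda \mid a, b) = \frac{1}{\Gamma(a)}\, b^{a}\, \lambda^{a-1} \exp(-b\lambda)
p(\mu, \lambda) = \mathcal{N}\big(\mu \mid \mu_0, (\beta\lambda)^{-1}\big)\, \operatorname{Gam}(\lambda \mid a, b)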
12
Bayesian relationship with Multivariate Gaussian distr.
The conjugate prior for the precision (inverse covariance) matrix of a multivariate Gaussian is the Wishart distribution. When both mean and precision are unknown, the conjugate prior is the Gaussian-Wishart distribution (see below).
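For reference, the Wishart and Gaussian-Wishart forms (following Bishop, Section 2.3.6), where nu is the number of degrees of freedom, D the dimensionality, and B(W, nu) a normalization constant:

\mathcal{W}(\Lambda \mid W, \nu) = B(W, \nu)\, |\Lambda|^{(\nu - D - 1)/2} \exp\!\big(-\tfrac{1}{2}\operatorname{Tr}(W^{-1}\Lambda)\big)
p(\mu, \Lambda \mid \mu_0, \beta, W, \nu) = \mathcal{N}\big(\mu \mid \mu_0, (\beta\Lambda)^{-1}\big)\, \mathcal{W}(\Lambda \mid W, \nu)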
13
Gaussian distribution variations
14
Student's t-distr. Used in analysis of variance to test whether an effect is real and statistically significant, via the t-distribution with n-1 degrees of freedom when the Xi are normal random variables. The t-distribution has a lower peak and longer tails than the Gaussian (it tolerates more outliers and is therefore more robust). It is obtained by summing (integrating over) an infinite number of univariate Gaussians with the same mean but different precisions (see the derivation below).
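The construction being described (following Bishop, Section 2.3.7): integrating out the precision of a Gaussian against a gamma distribution gives

\operatorname{St}(x \mid \mu, \lambda, \nu) = \int_0^{\infty} \mathcal{N}\big(x \mid \mu, (\eta\lambda)^{-1}\big)\, \operatorname{Gam}(\eta \mid \nu/2, \nu/2)\, d\eta
= \frac{\Gamma\!\big(\tfrac{\nu+1}{2}\big)}{\Gamma\!\big(\tfrac{\nu}{2}\big)} \left(\frac{\lambda}{\pi\nu}\right)^{1/2} \left[1 + \frac{\lambda(x-\mu)^2}{\nu}\right]^{-(\nu+1)/2}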
15
Student’s t-distr (cont.)
For a multivariate Gaussian, the corresponding multivariate t-distribution is expressed in terms of the Mahalanobis distance; its mean and covariance follow the forms below.
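The multivariate form and its moments (following Bishop, Section 2.3.7), with Mahalanobis distance \Delta^2 = (x - \mu)^{\top}\Lambda(x - \mu):

\operatorname{St}(x \mid \mu, \Lambda, \nu) = \frac{\Gamma(D/2 + \nu/2)}{\Gamma(\nu/2)} \frac{|\Lambda|^{1/2}}{(\pi\nu)^{D/2}} \left[1 + \frac{\Delta^2}{\nu}\right]^{-D/2 - \nu/2}
\mathbb{E}[x] = \mu \ (\nu > 1), \qquad \operatorname{cov}[x] = \frac{\nu}{\nu - 2}\Lambda^{-1} \ (\nu > 2), \qquad \operatorname{mode}[x] = \mu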
16
Gaussian with periodic variables
To avoid the mean being dependent on the choice of origin, use polar coordinates and solve for theta. The von Mises distribution is the special case of the von Mises-Fisher distribution (defined on the N-dimensional sphere) restricted to the circle; it is the stationary distribution of a drift process on the circle.
17
Gaussian with periodic variables (cont.)
Starting from a Gaussian in Cartesian coordinates and converting to polar coordinates (conditioning on the unit circle), the density becomes the von Mises distribution, with mean direction theta_0 and precision (concentration) parameter m (see below).
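The resulting distribution (following Bishop, Section 2.3.8):

p(\theta \mid \theta_0, m) = \frac{1}{2\pi I_0(m)} \exp\{m \cos(\theta - \theta_0)\}

where theta_0 is the mean direction, m the concentration (precision), and I_0(m) the zeroth-order modified Bessel function of the first kind.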
18
Gaussian with periodic variables: mean and variance
Maximizing the log likelihood gives the mean direction theta_0 and the concentration 'm', the latter by noting the Bessel-function ratio A(m) = I_1(m)/I_0(m) (see below).
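The maximum-likelihood conditions being referred to (following Bishop, Section 2.3.8):

\theta_0^{\mathrm{ML}} = \tan^{-1}\left\{\frac{\sum_n \sin\theta_n}{\sum_n \cos\theta_n}\right\}
A(m_{\mathrm{ML}}) = \frac{1}{N}\sum_{n=1}^{N} \cos(\theta_n - \theta_0^{\mathrm{ML}}), \qquad A(m) = \frac{I_1(m)}{I_0(m)}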
19
Mixture of Gaussians. In Part 1 we already saw that one limitation of the Gaussian is that it is unimodal. Solution: a linear combination (superposition) of Gaussians. The mixing coefficients sum to 1; the posterior over components is known as the 'responsibilities'. The log likelihood is given below.
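The mixture model, constraints, responsibilities, and log likelihood (following Bishop, Section 2.3.9):

p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_k \pi_k = 1, \quad 0 \le \pi_k \le 1
\gamma_k(x) \equiv p(k \mid x) = \frac{\pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_j \pi_j\, \mathcal{N}(x \mid \mu_j, \Sigma_j)}
\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln\left\{\sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\right\}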
20
Exponential family: natural (canonical) form, normalized by g(eta). Examples: 1) Bernoulli, 2) Multinomial; both can be rewritten in the natural form (see the sketch below).
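A sketch of the natural form and the Bernoulli case (following Bishop, Section 2.4); the multinomial case is analogous, with eta_k = ln mu_k and a softmax link:

p(x \mid \eta) = h(x)\, g(\eta)\, \exp\{\eta^{\top} u(x)\}
\text{Bernoulli:}\quad \eta = \ln\frac{\mu}{1-\mu}, \qquad p(x \mid \eta) = \sigma(-\eta)\exp(\eta x), \qquad \sigma(\eta) = \frac{1}{1 + e^{-\eta}}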
21
Exponential family (cont.)
3) The univariate Gaussian can also be written in natural form: solve for the natural parameters eta, which are then obtained from maximum likelihood (see below).
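Under the usual parameterization (following Bishop, Section 2.4), the univariate Gaussian has natural parameters and sufficient statistics

\eta = \begin{pmatrix} \mu/\sigma^2 \\ -1/(2\sigma^2) \end{pmatrix}, \qquad u(x) = \begin{pmatrix} x \\ x^2 \end{pmatrix}

and the maximum-likelihood estimate satisfies -\nabla \ln g(\eta_{\mathrm{ML}}) = \frac{1}{N}\sum_{n=1}^{N} u(x_n).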
22
Parameters of distributions, and interesting methodologies
23
Uninformative priors. Avoid incorrect subjective assumptions by using an uninformative (e.g., uniform) prior. Improper prior: the prior need not integrate to 1 for the posterior to be normalizable, as per Bayes' equation. Two standard cases: 1) a location parameter, for translation invariance; 2) a scale parameter, for scale invariance (see below).
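The two invariance cases correspond to (following Bishop, Section 2.4.2):

p(x \mid \mu) = f(x - \mu) \;\Rightarrow\; p(\mu) = \mathrm{const} \quad \text{(translation invariance)}
p(x \mid \sigma) = \frac{1}{\sigma} f\!\left(\frac{x}{\sigma}\right) \;\Rightarrow\; p(\sigma) \propto \frac{1}{\sigma} \quad \text{(scale invariance)}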
24
Nonparametric methods
Instead of assuming a form for the distribution, use nonparametric methods. 1) Histogram with constant bin width: good for sequential data; problems: discontinuities at bin edges, and the number of bins grows exponentially with dimensionality. 2) Kernel estimators: a sum of Parzen windows. With 'N' observations of which 'K' fall in a region R of volume V, the density estimate follows (see the derivation below).
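The step behind the estimate (a standard derivation, following Bishop, Section 2.5): the probability mass in R is P = \int_R p(x)\,dx \approx p(x)V for small R, and K \approx NP for large N, so

p(x) \approx \frac{K}{NV}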
25
Nonparametric method: Kernel estimators
2) Kernel estimators: fix V and determine K. The kernel function k(u) indicates whether a point falls in the region R; h > 0 is a fixed bandwidth parameter that controls smoothing. This is the Parzen estimator, and any valid kernel k(u) can be chosen (e.g., a Gaussian); a code sketch follows below.
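A minimal sketch of a Parzen estimate with a Gaussian kernel, implementing p(x) = (1/N) \sum_n (2\pi h^2)^{-D/2} \exp(-\|x - x_n\|^2 / 2h^2); function and variable names are illustrative:

import numpy as np

def parzen_density(x, data, h):
    """Kernel density estimate at point x from data (N x D array) with bandwidth h."""
    data = np.asarray(data, dtype=float)
    N, D = data.shape
    sq_dist = np.sum((data - x) ** 2, axis=1)       # squared distance to each sample
    norm = (2.0 * np.pi * h ** 2) ** (D / 2.0)       # Gaussian kernel normalization
    return np.mean(np.exp(-sq_dist / (2.0 * h ** 2)) / norm)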
26
Nonparametric method: Nearest-neighbor
3) Nearest neighbour: this time fix K and let the data determine V. As with the kernel estimator, the training set is stored as the knowledge base. 'K' is the number of neighbours; a larger 'K' gives a smoother, less complex decision boundary with fewer regions. For classifying N points, of which N_k belong to class C_k, Bayes' theorem gives the posterior to maximize (see below).
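The quantities involved (following Bishop, Section 2.5.2): with K_k of the K nearest neighbours of x belonging to class C_k in a sphere of volume V,

p(x \mid C_k) = \frac{K_k}{N_k V}, \qquad p(C_k) = \frac{N_k}{N}, \qquad p(x) = \frac{K}{NV}
p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{p(x)} = \frac{K_k}{K}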
27
Nonparametric method: Nearest-neighbor (cont.)
3) Nearest neighbour: assign a new point to the class C_k favoured by a majority vote of its k nearest neighbours (a minimal code sketch follows). For k=1 and n -> infinity, the asymptotic error is bounded by twice the Bayes error rate [k-nearest neighbor algorithm, Wikipedia, 2011].
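A minimal sketch of k-NN classification by majority vote (names are illustrative, not from the slides):

import numpy as np
from collections import Counter

def knn_classify(x, train_X, train_y, k=3):
    """Assign x to the class favoured by the majority of its k nearest training points."""
    train_X = np.asarray(train_X, dtype=float)
    dists = np.sum((train_X - x) ** 2, axis=1)   # squared Euclidean distances
    nearest = np.argsort(dists)[:k]              # indices of the k nearest neighbours
    votes = Counter(train_y[i] for i in nearest) # count class labels among neighbours
    return votes.most_common(1)[0][0]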
28
Ch.2 Basic Graph Concepts: from David Barber's book
29
Directed and undirected graphs
A graph G has vertices and edges that are directed or undirected. In a directed graph, if A->B but not B->A, then A is an ancestor (parent) and B is a child. Directed Acyclic Graph (DAG): a directed graph with no cycles (no vertex is revisited). Connected undirected graph: there is a path between every pair of vertices. Clique: a fully connected subset of vertices of an undirected graph.
30
Representations of Graphs
Singly connected (tree): only one path between any A and B. Spanning tree of an undirected graph: a singly connected subgraph covering all vertices. Graph representations (numerical): an edge list, or an adjacency matrix A: with N vertices, A is N x N with Aij = 1 if there is an edge from i to j; for an undirected graph A is symmetric (see the sketch below).
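A minimal sketch of building an adjacency matrix from an edge list (names are illustrative; the matrix is symmetrized for an undirected graph):

import numpy as np

def adjacency_matrix(n_vertices, edges, directed=True):
    """A[i, j] = 1 if there is an edge from vertex i to vertex j."""
    A = np.zeros((n_vertices, n_vertices), dtype=int)
    for i, j in edges:
        A[i, j] = 1
        if not directed:
            A[j, i] = 1   # undirected graphs give a symmetric matrix
    return A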
31
Representations of Graphs (cont.)
Directed graph: if the vertices are labelled in ancestral order (parents before children), the adjacency matrix is strictly upper triangular, provided there are no edges from a vertex to itself. An undirected graph with K maximal cliques has an N x K clique matrix, in which each column c_k indicates which vertices form clique k. Example: 2 cliques, vertices {1,2,3} and {2,3,4} (written out below).
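For the example on this slide (cliques {1,2,3} and {2,3,4} over 4 vertices), the 4 x 2 clique matrix would be

Z = \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 1 \\ 0 & 1 \end{pmatrix}

where column k has a 1 in row i if vertex i belongs to clique k.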
32
Incidence Matrix: adjacency matrix A, maximal clique matrix Z, and incidence matrix Z_inc. Property of Z_inc (see the note below). Note: Z_inc's columns denote edges and its rows denote vertices.
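The property hinted at, stated here as a standard result rather than verbatim from the slide: for the vertex-edge incidence matrix Z_inc of a simple graph,

(Z_{\mathrm{inc}} Z_{\mathrm{inc}}^{\top})_{ij} = \begin{cases} \deg(i) & i = j \\ A_{ij} & i \neq j \end{cases}

i.e. the off-diagonal entries recover the adjacency matrix, while the diagonal counts the number of edges at each vertex.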
33
Additional Information
Figures and equations are excerpted from [Pattern Recognition and Machine Learning, C. M. Bishop] and [Bayesian Reasoning and Machine Learning, David Barber]. Slides uploaded to the Google group; use with reference.