Al Parker September 14, 2010 Drawing samples from high dimensional Gaussians using polynomials

Acknowledgements: Colin Fox, Physics, University of Otago; New Zealand Institute of Mathematics, University of Auckland; Center for Biofilm Engineering, Bozeman

The normal or Gaussian distribution

How to sample from a Gaussian N(μ, σ²)? Sample z ~ N(0, 1), then y = (σ²)^{1/2} z + μ ~ N(μ, σ²)

The multivariate Gaussian distribution

How to sample from a Gaussian N(μ, Σ)? Sample z ~ N(0, I), then y = Σ^{1/2} z + μ ~ N(μ, Σ) (e.g. y = W Λ^{1/2} z + μ using the eigen-decomposition Σ = W Λ W^T)
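A minimal numpy sketch of this recipe (illustrative names; the Cholesky factor plays the role of Σ^{1/2}, with the eigen-decomposition route noted in a comment):

```python
import numpy as np

def sample_mvn(mu, Sigma, rng=None):
    """Draw y = Sigma^{1/2} z + mu ~ N(mu, Sigma) from z ~ N(0, I)."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal(len(mu))      # z ~ N(0, I)
    C = np.linalg.cholesky(Sigma)         # Sigma = C C^T, so C serves as Sigma^{1/2}
    # Eigen-decomposition alternative:
    # lam, W = np.linalg.eigh(Sigma); C = W * np.sqrt(lam)   # W Lambda^{1/2}
    return C @ z + mu

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
y = sample_mvn(mu, Sigma)                 # one draw from N(mu, Sigma)
```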

Example: From 64 faces, modeling "face space" with a Gaussian process N(μ, Σ). Pixel intensity at the ith row and jth column is y(s(i,j)); y(s) ∈ R^112 x R^112, μ(s) ∈ R^112 x R^112, Σ(s,s) ∈ R^12544 x R^12544

[Figure: a sample face drawn ~ N(μ, Σ)]

How to estimate μ, Σ for N(μ, Σ)? MLE/BLUE (least squares), MVQUE, or use a Bayesian posterior via MCMC
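For i.i.d. draws, the MLE of μ and Σ is just the empirical mean and covariance; a minimal sketch with a random stand-in data matrix Y (one flattened image per row, not the actual face data):

```python
import numpy as np

# Stand-in data: 64 "images", each flattened to length p (the talk's faces have p = 112*112)
n, p = 64, 16 * 16
Y = np.random.default_rng(0).standard_normal((n, p))

mu_hat = Y.mean(axis=0)                          # MLE of mu
Sigma_hat = np.cov(Y, rowvar=False, bias=True)   # MLE of Sigma (divides by n, not n-1)
# At p = 12544 the p x p covariance already strains memory -- the motivation for what follows.
```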

Another example: Interpolation

One can assume a covariance function that has some parameters θ.
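For example, a squared-exponential covariance function with θ = (marginal variance, length-scale); the specific kernel here is an illustrative assumption, not necessarily the one used in the talk:

```python
import numpy as np

def sq_exp_cov(s, theta):
    """Sigma(theta) for locations s (n x d) under a squared-exponential kernel."""
    sigma2, ell = theta                                     # theta = (variance, length-scale)
    d2 = ((s[:, None, :] - s[None, :, :]) ** 2).sum(-1)     # pairwise squared distances
    return sigma2 * np.exp(-0.5 * d2 / ell ** 2)

s = np.linspace(0.0, 1.0, 50)[:, None]     # 50 locations on a line
Sigma = sq_exp_cov(s, theta=(1.0, 0.2))    # 50 x 50 covariance matrix
```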

I used a Bayesian posterior for θ|data to construct μ|data

Simulating the process: samples from N(μ, Σ | data), y = Σ^{1/2} z + μ ~ N(μ, Σ | data)

Gaussian Processes modeling global ozone. Cressie and Johannesson, Fixed rank kriging for very large spatial datasets, 2006

Gaussian Processes modeling global ozone

The problem: to generate a sample y = Σ^{1/2} z + μ ~ N(μ, Σ), how to calculate the factorization Σ = Σ^{1/2}(Σ^{1/2})^T? Σ^{1/2} = W Λ^{1/2} by eigen-decomposition, 10/3 n³ flops; Σ^{1/2} = C by Cholesky factorization, 1/3 n³ flops. For LARGE Gaussians (n > 10^5, e.g. in image analysis and global data sets), these approaches are not possible: n³ flops is computationally TOO EXPENSIVE, and storing an n x n matrix requires TOO MUCH MEMORY.

Some solutions: work with sparse precision matrix Σ^{-1} models (Rue, 2001); circulant embeddings (Gneiting et al, 2005); iterative methods. Advantages of iterative methods: COST, n² flops per iteration; MEMORY, only vectors of size n x 1 need be stored. Disadvantage: if the method runs for n iterations, then there is no cost savings over a direct method.

Gibbs: an iterative sampler of N(0, A) and N(0, A^{-1}). Let A = Σ or A = Σ^{-1}. 1. Split A into D = diag(A), L = lower(A), L^T = upper(A). 2. Sample z ~ N(0, I). 3. Take conditional samples in each coordinate direction, so that a full sweep of all n coordinates is y^k = -D^{-1} L y^k - D^{-1} L^T y^{k-1} + D^{-1/2} z. y^k converges in distribution geometrically to N(0, A^{-1}); A y^k converges in distribution geometrically to N(0, A).
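A minimal sketch of one such sweep, written coordinate by coordinate (each update uses the newest values of the other coordinates; here A is the precision matrix, so the iterates target N(0, A^{-1})):

```python
import numpy as np

def gibbs_sweep(y, A, rng):
    """One full Gibbs sweep for N(0, A^{-1}): coordinate i is drawn from
    N( -(1/A_ii) * sum_{j != i} A_ij y_j , 1/A_ii )."""
    for i in range(len(y)):
        cond_mean = -(A[i, :] @ y - A[i, i] * y[i]) / A[i, i]
        y[i] = cond_mean + rng.standard_normal() / np.sqrt(A[i, i])
    return y

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5],          # a small SPD precision matrix
              [0.5, 1.0]])
y = np.zeros(2)
for k in range(1000):
    y = gibbs_sweep(y, A, rng)     # y converges in distribution to N(0, A^{-1})
```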

Gibbs: an iterative sampler Gibbs sampling from N(µ,Σ) starting from (0,0)

Gibbs: an iterative sampler Gibbs sampling from N(µ,Σ) starting from (0,0)

There's a link to solving Ax=b. Solving Ax=b is equivalent to minimizing an n-dimensional quadratic f(x) = (1/2) x^T A x - b^T x (when A is spd). A Gaussian is sufficiently specified by the same quadratic (with A = Σ^{-1} and b = Aμ).

Gauss-Seidel Linear Solve of Ax=b. 1. Split A into D = diag(A), L = lower(A), L^T = upper(A). 2. Minimize the quadratic f(x) in each coordinate direction, so that a full sweep of all n coordinates is x^k = -D^{-1} L x^k - D^{-1} L^T x^{k-1} + D^{-1} b. x^k converges geometrically to A^{-1} b.
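The corresponding sweep for the linear solve has exactly the same structure; only the random term D^{-1/2} z is replaced by the deterministic D^{-1} b. A sketch:

```python
import numpy as np

def gauss_seidel_sweep(x, A, b):
    """One Gauss-Seidel sweep for Ax = b: minimize the quadratic in each coordinate."""
    for i in range(len(x)):
        x[i] = (b[i] - (A[i, :] @ x - A[i, i] * x[i])) / A[i, i]
    return x

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -1.0])
x = np.zeros(2)
for k in range(50):
    x = gauss_seidel_sweep(x, A, b)    # x converges geometrically to A^{-1} b
```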

Gauss-Seidel Linear Solve of Ax=b

x^k converges geometrically to A^{-1} b: (x^k - A^{-1} b) = G^k (x^0 - A^{-1} b), where ρ(G) < 1

Theorem: a Gibbs sampler is a Gauss-Seidel linear solver. Proof: a Gibbs sampler is y^k = -D^{-1} L y^k - D^{-1} L^T y^{k-1} + D^{-1/2} z, and a Gauss-Seidel linear solve of Ax=b is x^k = -D^{-1} L x^k - D^{-1} L^T x^{k-1} + D^{-1} b.

Gauss-Seidel is a Stationary Linear Solver. A Gauss-Seidel linear solve of Ax=b is x^k = -D^{-1} L x^k - D^{-1} L^T x^{k-1} + D^{-1} b. Gauss-Seidel can be written as M x^k = N x^{k-1} + b where M = D + L and N = -L^T, so A = M - N, the general form of a stationary linear solver.

Stationary linear solvers of Ax=b: 1. Split A = M - N. 2. Iterate M x^k = N x^{k-1} + b, i.e. x^k = M^{-1} N x^{k-1} + M^{-1} b = G x^{k-1} + M^{-1} b. x^k converges geometrically to A^{-1} b: (x^k - A^{-1} b) = G^k (x^0 - A^{-1} b) when ρ(G) = ρ(M^{-1} N) < 1.
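In matrix form the whole family is a few lines; a sketch with the Jacobi splitting M = D, N = D - A as one concrete choice:

```python
import numpy as np

def stationary_solve(A, b, M, N, n_iter):
    """Iterate M x^k = N x^{k-1} + b, i.e. x^k = G x^{k-1} + M^{-1} b with G = M^{-1} N."""
    x = np.zeros_like(b)
    for _ in range(n_iter):
        x = np.linalg.solve(M, N @ x + b)
    return x

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -1.0])
M = np.diag(np.diag(A))                  # Jacobi: M = D
N = M - A                                # so that A = M - N
x = stationary_solve(A, b, M, N, 200)    # converges since rho(M^{-1} N) < 1 here
```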

Stationary Samplers from Stationary Solvers. Solving Ax=b: 1. Split A = M - N. 2. Iterate M x^k = N x^{k-1} + b. x^k → A^{-1} b if ρ(M^{-1} N) < 1. Sampling from N(0, A) and N(0, A^{-1}): 1. Split A = M - N. 2. Iterate M y^k = N y^{k-1} + c^{k-1}, where c^{k-1} ~ N(0, M^T + N). y^k → N(0, A^{-1}) if ρ(M^{-1} N) < 1; A y^k → N(0, A) if ρ(M^{-1} N) < 1.

How to sample c^{k-1} ~ N(0, M^T + N)? Gauss-Seidel: M = D + L, c^{k-1} ~ N(0, D). SOR (successive over-relaxation): M = (1/w) D + L, c^{k-1} ~ N(0, (2-w)/w D). Richardson: M = I, c^{k-1} ~ N(0, 2I - A). Jacobi: M = D, c^{k-1} ~ N(0, 2D - A).
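Putting the pieces together, a sketch of the stationary sampler with the Gauss-Seidel splitting M = D + L (so M^T + N = D and the noise is cheap to draw); the forward substitution keeps each iteration far below n³ work:

```python
import numpy as np
from scipy.linalg import solve_triangular

def gauss_seidel_sampler(A, n_iter, rng):
    """Iterate M y^k = N y^{k-1} + c^{k-1} with M = D + L, N = -L^T,
    c^{k-1} ~ N(0, M^T + N) = N(0, D).  Then y^k -> N(0, A^{-1}) in distribution."""
    M = np.tril(A)                      # D + L (lower triangle, diagonal included)
    N = M - A                           # = -L^T
    d_sqrt = np.sqrt(np.diag(A))
    y = np.zeros(A.shape[0])
    for _ in range(n_iter):
        c = d_sqrt * rng.standard_normal(A.shape[0])        # c ~ N(0, D)
        y = solve_triangular(M, N @ y + c, lower=True)      # forward substitution
    return y

rng = np.random.default_rng(1)
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
y = gauss_seidel_sampler(A, 2000, rng)   # approximately a draw from N(0, A^{-1})
```

This is algebraically the same as the coordinate-by-coordinate Gibbs sweep above, just written as a matrix splitting.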

Theorem: a stationary linear solver converges iff the corresponding stationary sampler converges, and the convergence is geometric with the same rate. Proof: they have the same iteration operator. For linear solves: x^k = G x^{k-1} + M^{-1} b, so that (x^k - A^{-1} b) = G^k (x^0 - A^{-1} b). For sampling: y^k = G y^{k-1} + M^{-1} c^{k-1}, E(y^k) = G^k E(y^0), Var(y^k) = A^{-1} - G^k A^{-1} (G^k)^T. Proof for Gaussians given by Barone and Frigessi; for arbitrary distributions by Duflo, 1997.

Acceleration schemes for stationary linear solvers can be used to accelerate stationary samplers. Polynomial acceleration of a stationary solver of Ax=b is: 1. Split A = M - N. 2. x^{k+1} = (1 - v_k) x^{k-1} + v_k x^k + v_k u_k M^{-1}(b - A x^k), which replaces (x^k - A^{-1} b) = G^k (x^0 - A^{-1} b) with a k-th order polynomial, (x^k - A^{-1} b) = p(G)(x^0 - A^{-1} b).

Chebyshev acceleration: x^{k+1} = (1 - v_k) x^{k-1} + v_k x^k + v_k u_k M^{-1}(b - A x^k), where v_k, u_k are functions of the 2 extreme eigenvalues of G (not very expensive to get estimates of these eigenvalues). Gauss-Seidel converged like this …

x^{k+1} = (1 - v_k) x^{k-1} + v_k x^k + v_k u_k M^{-1}(b - A x^k), where v_k, u_k are functions of the 2 extreme eigenvalues of G (not very expensive to get estimates of these eigenvalues). … convergence (geometric-like) with Chebyshev acceleration.
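A sketch of the Chebyshev-accelerated solve in the slide's three-term form, with the weight v_k computed from the standard Chebyshev recurrence. The eigenvalue bounds lmin, lmax of M^{-1}A (equivalently, the two extreme eigenvalues of G = I - M^{-1}A) are assumed known, and a symmetrizable splitting such as Jacobi or SSOR is assumed so that those eigenvalues are real:

```python
import numpy as np

def chebyshev_solve(A, b, M, lmin, lmax, n_iter):
    """Chebyshev acceleration of the splitting solver for Ax = b.
    lmin, lmax bound the (real, positive) eigenvalues of M^{-1} A."""
    d = (lmax + lmin) / 2.0                     # centre of the spectrum
    c = (lmax - lmin) / 2.0                     # half-width of the spectrum
    sigma = d / c
    tau_prev, tau_curr = 1.0, sigma             # Chebyshev values T_0(sigma), T_1(sigma)
    x_prev = np.zeros_like(b)
    x_curr = x_prev + np.linalg.solve(M, b - A @ x_prev) / d     # first (degree-1) step
    for k in range(1, n_iter):
        tau_next = 2.0 * sigma * tau_curr - tau_prev             # T_{k+1} = 2 sigma T_k - T_{k-1}
        v = 2.0 * d * tau_curr / (c * tau_next)                  # weight on x^k
        r = b - A @ x_curr
        x_next = (1.0 - v) * x_prev + v * x_curr + (v / d) * np.linalg.solve(M, r)
        x_prev, x_curr = x_curr, x_next
        tau_prev, tau_curr = tau_curr, tau_next
    return x_curr

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -1.0])
M = np.diag(np.diag(A))                               # Jacobi splitting
lam = np.linalg.eigvals(np.linalg.solve(M, A)).real   # exact bounds here; estimates suffice in practice
x = chebyshev_solve(A, b, M, lam.min(), lam.max(), 30)
```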

Polynomial Accelerated Stationary Sampler from N(0, A) and N(0, A^{-1}): 1. Split A = M - N. 2. y^{k+1} = (1 - v_k) y^{k-1} + v_k y^k + v_k u_k M^{-1}(c^k - A y^k), where c^k ~ N(0, (2 - v_k)/v_k ((2 - u_k)/u_k M^T + N)).

Theorem: a polynomial accelerated sampler converges with the same convergence rate as the corresponding linear solver as long as v_k, u_k are independent of the iterates y^k. [Figure: convergence of the Gibbs sampler vs. the Chebyshev accelerated Gibbs sampler]

Chebyshev acceleration is guaranteed to be faster than a Gibbs sampler. Covariance matrix convergence: ||A^{-1} - S^k||_2

Chebyshev accelerated Gibbs sampling in 10^6 dimensions: data = SPHERE + ε, ε ~ N(0, σ² I); sample from π(SPHERE | data).

Conclusions: Gaussian Processes are cool! Common techniques from numerical linear algebra can be used to sample from Gaussians: Cholesky factorization (precise but expensive); any stationary linear solver can be used as a stationary sampler (inexpensive but with geometric convergence); stationary samplers can be accelerated by polynomials (guaranteed!). Polynomial accelerated samplers: Chebyshev, Conjugate Gradients, Lanczos sampler.

Estimation of Σ(θ, r) from the data using a Markov chain

Marginal Posteriors

Conjugate Gradient (CG) acceleration: x^{k+1} = (1 - v_k) x^{k-1} + v_k x^k + v_k u_k M^{-1}(b - A x^k), where v_k, u_k are functions of the residuals b - A x^k. … convergence guaranteed in at most n steps with CG acceleration.

Conjugate Gradient (CG) Acceleration: the theorem does not apply since the parameters v_k, u_k are functions of the residuals b - A y^k. We have devised an approach called a CD sampler to construct samples with covariance Var(y^k) = V_k D_k^{-1} V_k^T → A^{-1}, where V_k is a matrix of unit-length residuals b - A x^k from the standard CG algorithm.

CD sampler (CG accelerated Gibbs). A GOOD THING: the CG algorithm is a great linear solver! If the eigenvalues of A are in c clusters, then a solution to Ax=b is found in c << n steps. A PROBLEM: when the CG residuals get small, the CD sampler is forced to stop after only c << n steps. Thus, covariances with well separated eigenvalues work well. The CD samples y^k ~ N(0, A^{-1}) and A y^k ~ N(0, A) have the correct covariances when A's eigenvectors with the smallest/largest eigenvalues lie in the Krylov space spanned by the residuals.

Lanczos sampler. Fixing the problem of small residuals is easy: hijack the iterative Lanczos eigen-solver to produce samples y^k ~ N(0, A^{-1}) with Var(y^k) = W_k D_k^{-1} W_k^T → A^{-1}, where W_k is a matrix of "Lanczos vectors".

One extremely effective sampler for LARGE Gaussians: use a combination of the ideas presented. Generate samples with the CD or Lanczos sampler while at the same time cheaply estimating the extreme eigenvalues of G. Seed these samples and extreme eigenvalues into a Chebyshev accelerated SOR sampler.