Al Parker October 29, 2009 Using linear solvers to sample large Gaussians.


Acknowledgements: Colin Fox, Physics, University of Otago; New Zealand Institute of Mathematics, University of Auckland; Center for Biofilm Engineering, Bozeman

The normal or Gaussian distribution

How to sample from a Gaussian $N(\mu, \sigma^2)$? Sample $z \sim N(0, 1)$, then set $y = (\sigma^2)^{1/2} z + \mu \sim N(\mu, \sigma^2)$.

The multivariate Gaussian distribution

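How to sample from a Gaussian $N(\mu, \Sigma)$? Sample $z \sim N(0, I)$, then set $y = \Sigma^{1/2} z + \mu \sim N(\mu, \Sigma)$ (e.g. $y = W \Lambda^{1/2} z + \mu$, where $\Sigma = W \Lambda W^T$ is the eigen-decomposition).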
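The recipe above can be sketched in a few lines of NumPy. This is a minimal illustration, not the presenter's code: the function names `sqrt_factor` and `sample_gaussian`, the 2 x 2 covariance, and the sample count are all made up for the example.

```python
import numpy as np

def sqrt_factor(Sigma, method="cholesky"):
    """Return a matrix R with R @ R.T == Sigma (a 'square root' of Sigma)."""
    if method == "cholesky":
        return np.linalg.cholesky(Sigma)      # Sigma = C C^T, C lower triangular
    lam, W = np.linalg.eigh(Sigma)            # Sigma = W diag(lam) W^T
    return W * np.sqrt(lam)                   # W Lambda^{1/2} (scales column j by sqrt(lam_j))

def sample_gaussian(mu, Sigma, num_samples, rng, method="cholesky"):
    """Draw columns y = Sigma^{1/2} z + mu with z ~ N(0, I)."""
    z = rng.standard_normal((len(mu), num_samples))
    return sqrt_factor(Sigma, method) @ z + mu[:, None]

# Hypothetical small example
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
rng = np.random.default_rng(0)
samples = sample_gaussian(mu, Sigma, 50_000, rng)
print("sample mean:", samples.mean(axis=1))
print("sample covariance:\n", np.cov(samples))
```

Either factor works because only $R R^T = \Sigma$ matters; the two factors differ by an orthogonal matrix, which leaves the distribution of $Rz$ unchanged.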

Example: From 64 faces, we can model “face space” with a Gaussian Process $N(\mu, \Sigma)$. Pixel intensity at the $i$th row and $j$th column is $y(s(i,j))$, with $y(s) \in \mathbb{R}^{112 \times 112}$, $\mu(s) \in \mathbb{R}^{112 \times 112}$, $\Sigma(s,s) \in \mathbb{R}^{12544 \times 12544}$.

$y \sim N(\mu, \Sigma)$

How to estimate $\mu$, $\Sigma$ for $N(\mu, \Sigma)$?
– MLE/BLUE (least squares)
– MVQUE
– Use a Bayesian posterior via MCMC

Another example: Interpolation

One can assume a covariance function which has some parameters θ

I used a Bayesian posterior for θ|data to construct μ|data

Simulating the process: samples from $N(\mu, \Sigma \mid \text{data})$: $y = \Sigma^{1/2} z + \mu \sim N(\mu, \Sigma)$

Gaussian Processes modeling global ozone (Cressie and Johannesson, Fixed rank kriging for very large spatial datasets, 2006)


The problem
To generate a sample $y = \Sigma^{1/2} z + \mu \sim N(\mu, \Sigma)$, how to calculate the factorization $\Sigma = \Sigma^{1/2} (\Sigma^{1/2})^T$?
– $\Sigma^{1/2} = W \Lambda^{1/2}$ by eigen-decomposition, $\tfrac{10}{3} n^3$ flops
– $\Sigma^{1/2} = C$ by Cholesky factorization, $\tfrac{1}{3} n^3$ flops
For LARGE Gaussians ($n > 10^5$, e.g. in image analysis and global data sets), these approaches are not possible:
– $n^3$ flops is computationally TOO EXPENSIVE
– storing an $n \times n$ matrix requires TOO MUCH MEMORY
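To put rough numbers on "too expensive" and "too much memory", here is a back-of-the-envelope sketch. It assumes a dense covariance stored in 8-byte doubles, which the slides do not state; the helper name `dense_costs` is made up for the example.

```python
def dense_costs(n, bytes_per_entry=8):
    """Return (memory in GB, eigen-decomposition flops, Cholesky flops)
    for a dense n x n covariance, using the flop counts quoted on the slide."""
    memory_gb = bytes_per_entry * n**2 / 1e9
    eigen_flops = 10 * n**3 / 3
    cholesky_flops = n**3 / 3
    return memory_gb, eigen_flops, cholesky_flops

memory_gb, eigen_flops, cholesky_flops = dense_costs(10**5)
print(f"{memory_gb:.0f} GB, {eigen_flops:.1e} flops, {cholesky_flops:.1e} flops")
```

Under these assumptions, $n = 10^5$ already needs about 80 GB just to hold $\Sigma$, and on the order of $10^{14}$ flops for a Cholesky factorization.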

Some solutions
– Work with sparse precision matrix $\Sigma^{-1}$ models (Rue, 2001)
– Circulant embeddings (Gneiting et al, 2005)
– Iterative methods:
  Advantages:
  – COST: $n^2$ flops per iteration
  – MEMORY: only vectors of size $n \times 1$ need be stored
  Disadvantages:
  – If the method runs for $n$ iterations, then there is no cost savings over a direct method

Gibbs: an iterative sampler of $N(0, A)$ and $N(0, A^{-1})$
Let $A = \Sigma$ or $A = \Sigma^{-1}$.
1. Decompose $A$ by $D = \mathrm{diag}(A)$, $L = \mathrm{lower}(A)$ (the strictly lower triangle)
2. Sample $z \sim N(0, I)$
3. Take conditional samples in each coordinate direction, so that a full sweep of all $n$ coordinates is $y_k = -D^{-1} L y_k - D^{-1} L^T y_{k-1} + D^{-1/2} z$
$y_k$ converges in distribution geometrically to $N(0, A^{-1})$.
$A y_k$ converges in distribution geometrically to $N(0, A)$.
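A minimal sketch of one Gibbs sweep, written two ways: coordinate by coordinate, and in the matrix form obtained by multiplying the sweep equation through by $D$. The function names and the 3 x 3 matrix are assumptions chosen for illustration.

```python
import numpy as np

def gibbs_sweep(A, y, z):
    """One full Gibbs sweep targeting N(0, A^{-1}).

    Coordinate i is redrawn from its conditional given the others:
    y_i = -(1/A_ii) * sum_{j != i} A_ij y_j + z_i / sqrt(A_ii),
    using already-updated values for j < i and old values for j > i.
    """
    y = y.copy()
    for i in range(len(y)):
        off_diag = A[i] @ y - A[i, i] * y[i]
        y[i] = -off_diag / A[i, i] + z[i] / np.sqrt(A[i, i])
    return y

def gibbs_sweep_matrix(A, y, z):
    """The same sweep in matrix form: (D + L) y_k = -L^T y_{k-1} + D^{1/2} z."""
    D = np.diag(np.diag(A))
    L = np.tril(A, k=-1)                      # strictly lower triangle
    return np.linalg.solve(D + L, -L.T @ y + np.sqrt(np.diag(A)) * z)

# Hypothetical example: run a short chain starting from the origin
A = np.array([[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]])
rng = np.random.default_rng(1)
y = np.zeros(3)
for _ in range(100):
    y = gibbs_sweep(A, y, rng.standard_normal(3))   # fresh z for every sweep
print("one approximate draw from N(0, A^{-1}):", y)
```

The two functions return the same vector for the same inputs, which is the observation the later slides build on.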

Gibbs: an iterative sampler
[Figure: Gibbs sampling from $N(0, \Sigma)$ starting from $(0, 0)$]

What’s the link to Ax=b? Solving $Ax = b$ is equivalent to minimizing an $n$-dimensional quadratic (when $A$ is positive definite). A Gaussian is fully specified by the same quadratic (with $A = \Sigma^{-1}$ and $b = A\mu$).
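The quadratic itself did not survive in the transcript. A standard way to write the connection, assuming the usual convention for the quadratic $f$ (the symbols $f$ and $\pi$ are introduced here for illustration), is:

```latex
f(x) = \tfrac{1}{2} x^T A x - b^T x,
\qquad
\nabla f(x) = A x - b = 0 \iff x = A^{-1} b .

\pi(x) \propto \exp\!\left( -\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)
       = \exp\!\left( -\tfrac{1}{2} x^T A x + b^T x - \tfrac{1}{2} \mu^T A \mu \right)
       \propto \exp\!\left( -f(x) \right),
\qquad A = \Sigma^{-1},\ b = A \mu .
```

So the minimizer of $f$ is the mean $\mu = A^{-1} b$ of the Gaussian, and the same pair $(A, b)$ determines both the linear system and the distribution.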

Gauss-Seidel Linear Solve of Ax=b
1. Decompose $A$ by $D = \mathrm{diag}(A)$, $L = \mathrm{lower}(A)$
2. Minimize the quadratic $f(x)$ in each coordinate direction, so that a full sweep of all $n$ coordinates is $x_k = -D^{-1} L x_k - D^{-1} L^T x_{k-1} + D^{-1} b$
$x_k$ converges geometrically to $A^{-1} b$.
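The sweep above can be sketched as follows. The function name `gauss_seidel` and the small test system are assumptions for the example.

```python
import numpy as np

def gauss_seidel(A, b, sweeps, x0=None):
    """Gauss-Seidel: minimize the quadratic one coordinate at a time.

    Each update solves row i of A x = b for x_i, using already-updated
    values for j < i and previous-sweep values for j > i.
    """
    x = np.zeros_like(b, dtype=float) if x0 is None else x0.astype(float).copy()
    for _ in range(sweeps):
        for i in range(len(b)):
            off_diag = A[i] @ x - A[i, i] * x[i]
            x[i] = (b[i] - off_diag) / A[i, i]
    return x

# Hypothetical example system
A = np.array([[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]])
b = np.array([1.0, 0.0, 1.0])
x = gauss_seidel(A, b, sweeps=60)
print("Gauss-Seidel solution:", x)
print("residual norm:", np.linalg.norm(b - A @ x))
```

Structurally this is the Gibbs sweep with the random term $z_i / \sqrt{A_{ii}}$ replaced by the fixed term $b_i / A_{ii}$.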


$x_k$ converges geometrically to $A^{-1} b$: $(x_k - A^{-1} b) = G^k (x_0 - A^{-1} b)$, where $\rho(G) < 1$

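Theorem: A Gibbs sampler is a Gauss-Seidel linear solver
Proof:
A Gibbs sampler is $y_k = -D^{-1} L y_k - D^{-1} L^T y_{k-1} + D^{-1/2} z$
A Gauss-Seidel linear solve of $Ax = b$ is $x_k = -D^{-1} L x_k - D^{-1} L^T x_{k-1} + D^{-1} b$
The two iterations are identical except that the fixed vector $D^{-1} b$ is replaced by the random vector $D^{-1/2} z$.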

Gauss-Seidel is a Stationary Linear Solver
A Gauss-Seidel linear solve of $Ax = b$ is $x_k = -D^{-1} L x_k - D^{-1} L^T x_{k-1} + D^{-1} b$
Gauss-Seidel can be written as $M x_k = N x_{k-1} + b$, where $M = D + L$ and $N = -L^T$, so that $A = M - N$: the general form of a stationary linear solver

Stationary linear solvers of Ax=b
1. Split $A = M - N$
2. Iterate $M x_k = N x_{k-1} + b$, i.e. $x_k = M^{-1} N x_{k-1} + M^{-1} b = G x_{k-1} + M^{-1} b$
$x_k$ converges geometrically to $A^{-1} b$: $(x_k - A^{-1} b) = G^k (x_0 - A^{-1} b)$ when $\rho(G) = \rho(M^{-1} N) < 1$
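A short sketch of the generic splitting iteration, with a numerical check of the error formula $(x_k - A^{-1} b) = G^k (x_0 - A^{-1} b)$. The helper names and the example matrix are assumptions; the two splittings shown are Jacobi ($M = D$) and Gauss-Seidel ($M = D + L$).

```python
import numpy as np

def stationary_solve(M, N, b, x0, iters):
    """Iterate M x_k = N x_{k-1} + b and return the final iterate."""
    x = x0.copy()
    for _ in range(iters):
        x = np.linalg.solve(M, N @ x + b)
    return x

def spectral_radius(M, N):
    """rho(G) for the iteration operator G = M^{-1} N."""
    G = np.linalg.solve(M, N)
    return np.max(np.abs(np.linalg.eigvals(G)))

# Hypothetical example system
A = np.array([[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]])
b = np.array([1.0, 0.0, 1.0])
x_star = np.linalg.solve(A, b)
D = np.diag(np.diag(A))
L = np.tril(A, k=-1)

for name, M in {"Jacobi": D, "Gauss-Seidel": D + L}.items():
    N = M - A
    x = stationary_solve(M, N, b, np.zeros(3), iters=100)
    print(name, "rho(G) =", spectral_radius(M, N),
          "error =", np.linalg.norm(x - x_star))
```

The smaller $\rho(G)$ is, the faster the geometric decay of the error, which is why the choice of splitting matters.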

Stationary Samplers from Stationary Solvers
Solving $Ax = b$:
1. Split $A = M - N$
2. Iterate $M x_k = N x_{k-1} + b$
$x_k \to A^{-1} b$ if $\rho(M^{-1} N) < 1$
Sampling from $N(0, A)$ and $N(0, A^{-1})$:
1. Split $A = M - N$
2. Iterate $M y_k = N y_{k-1} + b_{k-1}$, where $b_{k-1} \sim N(0, M A^{-1} M^T - N A^{-1} N^T) = N(0, M^T + N)$, which is $N(0, M + N)$ when $M$ is symmetric
$y_k \to N(0, A^{-1})$ in distribution if $\rho(M^{-1} N) < 1$
$A y_k \to N(0, A)$ in distribution if $\rho(M^{-1} N) < 1$
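A minimal sketch of a stationary sampler built from a splitting, run as many independent chains in parallel so the sample covariance can be compared with $A^{-1}$. The function name, the chain count, and the example matrix are assumptions; the noise covariance is taken as $M^T + N$, which follows from $M A^{-1} M^T - N A^{-1} N^T$ with $N = M - A$ and symmetric $A$.

```python
import numpy as np

def stationary_sampler(A, M, num_chains, iters, rng):
    """Run independent chains of M y_k = N y_{k-1} + b_{k-1}, b ~ N(0, M^T + N).

    Returns an (n, num_chains) array whose columns are approximately
    N(0, A^{-1}) once the chains have converged.
    """
    n = A.shape[0]
    N = M - A
    noise_root = np.linalg.cholesky(M.T + N)   # requires M^T + N positive definite
    Y = np.zeros((n, num_chains))              # every chain starts at the origin
    for _ in range(iters):
        B = noise_root @ rng.standard_normal((n, num_chains))
        Y = np.linalg.solve(M, N @ Y + B)
    return Y

# Hypothetical example using the Gauss-Seidel splitting M = D + L
A = np.array([[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]])
M = np.diag(np.diag(A)) + np.tril(A, k=-1)
rng = np.random.default_rng(2)
Y = stationary_sampler(A, M, num_chains=40_000, iters=60, rng=rng)
print("sample Cov(y):\n", np.cov(Y))
print("A^{-1}:\n", np.linalg.inv(A))
```

Multiplying the columns by $A$ gives draws with covariance approximately $A$, matching the last line of the slide.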

How to sample $b_{k-1} \sim N(0, M A^{-1} M^T - N A^{-1} N^T)$?
– Gauss-Seidel: $M = D + L$, $b_{k-1} \sim N(0, D)$
– SOR (successive over-relaxation): $M = \frac{1}{w} D + L$, $b_{k-1} \sim N(0, \frac{2-w}{w} D)$
– Richardson: $M = I$, $b_{k-1} \sim N(0, 2I - A)$
– Jacobi: $M = D$, $b_{k-1} \sim N(0, 2D - A)$
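The four noise covariances listed above can be checked numerically against the general formula. This is an illustrative sketch: the helper name `noise_covariance`, the example matrix, and the relaxation parameter $w = 1.5$ are assumptions.

```python
import numpy as np

def noise_covariance(A, M):
    """Var(b) = M A^{-1} M^T - N A^{-1} N^T for the splitting A = M - N."""
    N = M - A
    A_inv = np.linalg.inv(A)
    return M @ A_inv @ M.T - N @ A_inv @ N.T

# Hypothetical symmetric positive definite example
A = np.array([[1.0, 0.3, 0.0], [0.3, 0.8, 0.2], [0.0, 0.2, 0.9]])
D = np.diag(np.diag(A))
L = np.tril(A, k=-1)
I = np.eye(3)
w = 1.5

# name -> (M, covariance claimed on the slide)
splittings = {
    "Gauss-Seidel": (D + L, D),
    "SOR": (D / w + L, (2 - w) / w * D),
    "Richardson": (I, 2 * I - A),
    "Jacobi": (D, 2 * D - A),
}
for name, (M, claimed) in splittings.items():
    gap = np.abs(noise_covariance(A, M) - claimed).max()
    print(f"{name}: max deviation from claimed covariance = {gap:.2e}")
```

For Gauss-Seidel and SOR the noise covariance is diagonal, so sampling $b_{k-1}$ costs only $n$ independent normal draws per sweep.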

Theorem: A stationary linear solver converges iff the stationary sampler converges, and the convergence is geometric
Proof: They have the same iteration operator:
For linear solves: $x_k = G x_{k-1} + M^{-1} b$, so that $(x_k - A^{-1} b) = G^k (x_0 - A^{-1} b)$
For sampling: $y_k = G y_{k-1} + M^{-1} b_{k-1}$, so that $E(y_k) = G^k E(y_0)$ and, for a fixed start $y_0$, $\mathrm{Var}(y_k) = A^{-1} - G^k A^{-1} (G^k)^T$
Proof of convergence for Gaussians by Barone and Frigessi, 1990. For arbitrary distributions by Duflo, 1997
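The variance formula can be checked by propagating the covariance exactly through the iteration, with no sampling involved. This sketch assumes a deterministic start ($\mathrm{Var}(y_0) = 0$) and the Gauss-Seidel splitting; the function name and matrix are illustrative.

```python
import numpy as np

def variance_after(A, M, k):
    """Exact Var(y_k) for y_k = G y_{k-1} + M^{-1} b_{k-1}, starting from Var(y_0) = 0."""
    N = M - A
    G = np.linalg.solve(M, N)
    M_inv = np.linalg.inv(M)
    step_noise = M_inv @ (M.T + N) @ M_inv.T   # Var(M^{-1} b) with Var(b) = M^T + N
    V = np.zeros_like(A)
    for _ in range(k):
        V = G @ V @ G.T + step_noise
    return V

# Hypothetical example
A = np.array([[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]])
M = np.diag(np.diag(A)) + np.tril(A, k=-1)
G = np.linalg.solve(M, M - A)
A_inv = np.linalg.inv(A)

for k in (1, 5, 20):
    Gk = np.linalg.matrix_power(G, k)
    closed_form = A_inv - Gk @ A_inv @ Gk.T
    gap = np.abs(variance_after(A, M, k) - closed_form).max()
    print(f"k={k}: max |recursion - closed form| = {gap:.2e}")
```

Because $G^k \to 0$ at the rate $\rho(G)^k$, the covariance error decays geometrically, at the same rate as the solver's error.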

An algorithm to make journal papers 1. Look up your favorite optimization algorithm 2. Turn it into a sampler

For example: acceleration schemes for stationary linear solvers can work for stationary samplers
Polynomial acceleration of a stationary solver of $Ax = b$ is
1. Split $A = M - N$
2. $x_{k+1} = (1 - v_k) x_{k-1} + v_k x_k + v_k u_k M^{-1} (b - A x_k)$
which replaces $(x_k - A^{-1} b) = G^k (x_0 - A^{-1} b)$ with a $k$th order polynomial $(x_k - A^{-1} b) = p_k(G) (x_0 - A^{-1} b)$

Chebyshev Acceleration
$x_{k+1} = (1 - v_k) x_{k-1} + v_k x_k + v_k u_k M^{-1} (b - A x_k)$
where $v_k$, $u_k$ are functions of the 2 extreme eigenvalues of $G$ (not very expensive to get estimates of these eigenvalues)
[Figures: Gauss-Seidel convergence, compared with the (geometric-like) convergence with Chebyshev acceleration]
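The slides do not give the formulas for $v_k$ and $u_k$. The sketch below uses the standard Chebyshev semi-iteration coefficients, stated in terms of the extreme eigenvalues of $M^{-1} A = I - G$ rather than of $G$ itself; that choice, the Jacobi splitting, the exact eigenvalues, and all names are assumptions for illustration.

```python
import numpy as np

def chebyshev_solve(A, b, M, lam_min, lam_max, iters):
    """Chebyshev-accelerated splitting iteration in the three-term form
    x_{k+1} = (1 - v_k) x_{k-1} + v_k x_k + v_k u M^{-1} (b - A x_k),
    where [lam_min, lam_max] bounds the (real, positive) eigenvalues of M^{-1} A.
    """
    theta = (lam_max + lam_min) / 2.0
    delta = (lam_max - lam_min) / 2.0
    sigma = theta / delta
    u = 1.0 / theta                                       # constant u_k
    x_prev = np.zeros_like(b, dtype=float)
    x = x_prev + u * np.linalg.solve(M, b - A @ x_prev)   # first step: v_0 = 1
    rho = 1.0 / sigma
    for _ in range(iters - 1):
        rho_next = 1.0 / (2.0 * sigma - rho)
        v = 2.0 * sigma * rho_next                        # v_k
        x_next = (1 - v) * x_prev + v * x + v * u * np.linalg.solve(M, b - A @ x)
        x_prev, x, rho = x, x_next, rho_next
    return x

# Hypothetical example: 1D Laplacian with the Jacobi splitting M = D
n = 10
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
M = np.diag(np.diag(A))
d_inv_sqrt = 1.0 / np.sqrt(np.diag(A))
lam = np.linalg.eigvalsh(d_inv_sqrt[:, None] * A * d_inv_sqrt)  # same spectrum as M^{-1} A
x_star = np.linalg.solve(A, b)

x_cheb = chebyshev_solve(A, b, M, lam[0], lam[-1], iters=100)
x_jac = np.zeros(n)
for _ in range(100):                                      # plain Jacobi, same work
    x_jac = x_jac + np.linalg.solve(M, b - A @ x_jac)
print("Chebyshev error:", np.linalg.norm(x_cheb - x_star))
print("Jacobi error:   ", np.linalg.norm(x_jac - x_star))
```

In practice the eigenvalue bounds would be estimated rather than computed exactly; exact values are used here only to keep the sketch short.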

Conjugate Gradient (CG) Acceleration
$x_{k+1} = (1 - v_k) x_{k-1} + v_k x_k + v_k u_k M^{-1} (b - A x_k)$
where $v_k$, $u_k$ are functions of the residuals $b - A x_k$
… convergence is guaranteed in at most $n$ steps with CG acceleration

Polynomial Accelerated Stationary Sampler from $N(0, A)$ and $N(0, A^{-1})$
1. Split $A = M - N$
2. $y_{k+1} = (1 - v_k) y_{k-1} + v_k y_k + v_k u_k M^{-1} (b_k - A y_k)$, where $b_k \sim N\left(0, \frac{2 - v_k}{v_k} \left( \frac{2 - u_k}{u_k} M + N \right)\right)$

Theorem: A polynomial accelerated sampler converges if $v_k$, $u_k$ are independent of the iterates $y_k$, $b_k$.
[Figure panels: Gibbs Sampler; Chebyshev Accelerated Gibbs]

Conjugate Gradient (CG) Acceleration
The theorem does not apply, since the parameters $v_k$, $u_k$ are functions of the residuals $b_k - A y_k$.
We have devised an approach called a CD sampler to construct samples with covariance $\mathrm{Var}(y_k) = V_k D_k^{-1} V_k^T \to A^{-1}$, where $V_k$ is a matrix of unit length residuals $b - A x_k$ from the standard CG algorithm.
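The slides do not spell out the CD sampler. The sketch below is one common way to build a sample alongside CG, and it is an assumption rather than the presenter's exact construction: it accumulates $y_k = y_{k-1} + (z_k / \sqrt{d_k})\, p_k$ using the CG search directions $p_k$ with $d_k = p_k^T A p_k$, instead of the residual-based factorization quoted above. All names and the test matrix are illustrative.

```python
import numpy as np

def cd_sampler(A, b, rng, steps):
    """Run CG on A x = b and build a sample from the conjugate directions.

    Returns (x, y, P, d): the CG iterate, the sample, the search directions
    as columns of P, and the curvatures d_k = p_k^T A p_k.
    Since the p_k are A-conjugate, Var(y) = P diag(1/d) P^T.
    """
    n = len(b)
    x = np.zeros(n)
    y = np.zeros(n)
    r = b.astype(float).copy()          # residual for x_0 = 0
    p = r.copy()
    rs_old = r @ r
    directions, curvatures = [], []
    for _ in range(steps):
        Ap = A @ p
        d = p @ Ap
        alpha = rs_old / d
        x = x + alpha * p
        y = y + rng.standard_normal() / np.sqrt(d) * p   # one normal draw per direction
        directions.append(p.copy())
        curvatures.append(d)
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x, y, np.column_stack(directions), np.array(curvatures)

# Hypothetical small, well-conditioned example
rng = np.random.default_rng(3)
B = rng.standard_normal((5, 5))
A = B @ B.T + 5 * np.eye(5)
b = rng.standard_normal(5)
x, y, P, d = cd_sampler(A, b, rng, steps=5)
implied_cov = P @ np.diag(1.0 / d) @ P.T
print("max |implied Var(y) - A^{-1}| =", np.abs(implied_cov - np.linalg.inv(A)).max())
```

After a full $n$ steps the implied covariance matches $A^{-1}$; stopping after $k < n$ steps gives a rank-$k$ approximation confined to the Krylov space explored so far.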

CD sampler (CG accelerated Gibbs)
A GOOD THING: The CG algorithm is a great linear solver! If the eigenvalues of $A$ are in $c$ clusters, then a solution to $Ax = b$ is found in $c \ll n$ steps.
A PROBLEM: When the CG residuals get small, the CD sampler is forced to stop after only $c \ll n$ steps. Thus, covariances with well separated eigenvalues work well.
The CD samples $y_k \sim N(0, A^{-1})$ and $A y_k \sim N(0, A)$ have the correct covariances if $A$’s eigenvectors in the Krylov space spanned by the residuals have small/large eigenvalues, respectively.
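The clustering claim can be illustrated with a textbook CG implementation on a matrix that has only three distinct eigenvalues. The matrix construction, sizes, tolerance, and names here are assumptions for the example.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iters=None):
    """Standard CG for symmetric positive definite A. Returns (x, steps taken)."""
    n = len(b)
    max_iters = n if max_iters is None else max_iters
    x = np.zeros(n)
    r = b - A @ x
    p = r.copy()
    rs_old = r @ r
    steps = 0
    while steps < max_iters and np.sqrt(rs_old) > tol:
        Ap = A @ p
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
        steps += 1
    return x, steps

# Hypothetical 30 x 30 matrix with c = 3 eigenvalue clusters (each of multiplicity 10)
rng = np.random.default_rng(4)
Q, _ = np.linalg.qr(rng.standard_normal((30, 30)))
eigenvalues = np.repeat([1.0, 5.0, 20.0], 10)
A_clustered = (Q * eigenvalues) @ Q.T
A_clustered = (A_clustered + A_clustered.T) / 2          # remove rounding asymmetry
b_demo = rng.standard_normal(30)

x_cg, steps_used = conjugate_gradient(A_clustered, b_demo)
print("CG steps used:", steps_used, "of n =", len(b_demo))
print("residual norm:", np.linalg.norm(b_demo - A_clustered @ x_cg))
```

The same early termination is what limits the CD sampler: once the residual is negligible, no further search directions are available to add variance.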

Lanczos sampler
Fixing the problem of small residuals is easy: hijack the iterative Lanczos eigen-solver to produce samples $y_k \sim N(0, A^{-1})$ with $\mathrm{Var}(y_k) = W_k D_k^{-1} W_k^T \to A^{-1}$, where $W_k$ is a matrix of “Lanczos vectors”

The real issue is the spread of the eigenvalues of $A$. If the large ones are well separated, then the CD sampler and Lanczos sampler can produce samples extremely quickly from a large Gaussian $N(0, A)$ ($n = 10^6$) with the correct moments.

One extremely effective sampler for LARGE Gaussians Use a combination of the ideas presented: Generate samples with the CD or Lanczos sampler while at the same time cheaply estimating the extreme eigenvalues of G. Seed these samples and extreme eigenvalues into a Chebyshev accelerated SOR sampler

Conclusions
Gaussian Processes are cool!
Common techniques from numerical linear algebra can be used to sample from Gaussians:
– Cholesky factorization (precise but expensive)
– Any stationary linear solver can be used as a stationary sampler (inexpensive but with geometric convergence)
– Polynomial accelerated samplers: Chebyshev, Conjugate Gradients
– Lanczos sampler

Estimation of $\Sigma(\theta, r)$ from the data using a Markov Chain

Marginal Posteriors

Simulating the process: samples from $N(\mu, \Sigma \mid \text{data})$: $x = \Sigma^{1/2} z + \mu \sim N(\mu, \Sigma)$. TOO COARSE A GRID

Why is CG so fast? Gauss-Seidel’s coordinate directions versus CG’s conjugate directions

I used a Bayesian posterior for θ|data to construct μ(s|θ)

Simulating the process: samples from $N(\mu, \Sigma)$: $y = \Sigma^{1/2} z + \mu \sim N(\mu, \Sigma)$
