Gaussian process emulation of multiple outputs Tony O’Hagan, MUCM, Sheffield.


Outline  Gaussian process emulators  Simulators and emulators  GP modelling  Multiple outputs  Covariance functions  Independent emulators  Transformations to independence  Convolution  Outputs as extra dimension(s)  The multi-output (separable) emulator  The dynamic emulator  Which works best?  An example

Simulators and emulators
- A simulator is a model of a real process
  - Typically implemented as a computer code
  - Think of it as a function taking inputs x and giving outputs y: y = f(x)
- An emulator is a statistical representation of that function
  - It expresses knowledge/beliefs about what the output will be at any given input(s)
  - It is built using prior information and a training set of model runs
- The GP emulator expresses f as a Gaussian process
  - Conditional on hyperparameters

GP modelling  Mean function  Regression form h(x) T β  Used to model broad shape of response  Analogous to universal kriging  Covariance function  Stationary  Often use the Gaussian form σ 2 exp{-(x-x ′ ) T D -2 (x-x ′ )}  D is diagonal with correlation lengths on diagonal  Hyperparameters β, σ 2 and D  Uninformative priors

The emulator  Then the emulator is the posterior distribution of f  After integrating out β and σ 2, we have a t process conditional on D  Mean function made up of fitted regression h T β* plus smooth interpolator of residuals  Covariance function conditioned on training data  Reproduces training data exactly  Important to validate  Using a validation sample of additional runs  Check that emulator predicts these runs to within stated accuracy  No more and no less  Bastos and O’Hagan paper on MUCM website

Multiple outputs  Now y is a vector, f is a vector function  Training sample  Single training sample for all outputs  Probably design for one output works for many  Mean function  Modelling essentially as before, h i (x) T β i for output i  Probably more important now  Covariance function  Much more complex because of correlations between outputs  Ignoring these can lead to poor emulation of derived outputs

Covariance function  Let f i (x) be i-th output  Covariance function  c((i,x), (j,x ′) ) = cov[f i (x), f j (x ′ )]  Must be positive definite  Space of possible functions does not seem to be well explored  Two special cases  Independence: c((i,x), (j,x ′) ) = 0 if i ≠ j  No correlation between outputs  Separability: c((i,x), (j,x ′) ) = σ ij c x (x, x ′ )  Covariance matrix Σ between outputs, correlation c x between inputs  Same correlation function c x for all outputs

Independence  Strong assumption, but...  If posterior variances are all small, correlations may not matter  How to achieve this?  Good mean functions and/or  Large training sample  May not be possible in practice, but...  Consider transformation to achieve independence  Only linear transformations considered as far as I’m aware  z(x) = A y(x)  y(x) = B z(x)  c((i,x), (j,x ′) ) is linear mixture of functions for each z

Transformations to independence
- Principal components (see the code sketch below)
  - Fit and subtract mean functions (using the same h) for each y
  - Construct the sample covariance matrix of the residuals
  - Find the principal components A (or some other diagonalising transform)
  - Transform, and fit separate emulators to each z
- Dimension reduction
  - Don’t emulate all the z components
  - Treat the unemulated components as noise
- Linear model of coregionalisation (LMC)
  - Fit B (which need not be square) and the hyperparameters of each z simultaneously
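A minimal sketch of the principal-components route, assuming Y_resid is the n_runs × n_outputs matrix of residuals after the mean functions have been fitted and subtracted; names are illustrative.

```python
import numpy as np

def pca_decorrelate(Y_resid, n_keep=None):
    """Rotate residuals into (approximately) independent components."""
    S = np.cov(Y_resid, rowvar=False)       # sample covariance between outputs
    eigvals, vecs = np.linalg.eigh(S)       # eigendecomposition of S
    order = np.argsort(eigvals)[::-1]       # components sorted by decreasing variance
    A = vecs[:, order][:, :n_keep]          # keep leading components for dimension reduction
    Z = Y_resid @ A                         # transformed outputs
    return Z, A                             # back-transform via Y_resid ~ Z @ A.T
```

Each column of Z then gets its own single-output emulator; discarded trailing components are treated as noise, as described above.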

Convolution  Instead of transforming outputs for each x separately, consider  y(x) = ∫ k(x,x*) z(x*) dx*  Kernel k  Homogeneous case k(x-x*)  General case can model non-stationary y  But much more complex

Outputs as extra dimension(s)
- Outputs often correspond to points in some space
  - Time series outputs
  - Outputs on a spatial or spatio-temporal grid
- Add the coordinates of the output space as extra inputs (see the sketch below)
  - If output i has coordinates t, then write f_i(x) = f*(x, t)
  - Emulate f* as a single-output simulator
- In principle, this places no restriction on the covariance function
- In practice, for a single emulator we use restrictive covariance functions
  - Almost always assume separability, which implies a separable covariance for y
  - Standard forms such as the Gaussian correlation may not be sensible in t space
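A minimal sketch of the bookkeeping for the time-series case: each run's inputs are repeated for every time point and t is appended as an extra input column, so that a single-output emulator can be fitted to f*(x, t). Names are illustrative.

```python
import numpy as np

def stack_outputs_as_inputs(X, Y, t_grid):
    """X: n_runs x p inputs; Y: n_runs x T outputs at time points t_grid (length T)."""
    n, T = Y.shape
    X_star = np.hstack([
        np.repeat(X, T, axis=0),                    # each run's inputs repeated for every t
        np.tile(t_grid.reshape(-1, 1), (n, 1)),     # t appended as an extra input column
    ])
    y_star = Y.reshape(-1)                          # outputs stacked into one long vector
    return X_star, y_star                           # emulate y_star = f*(x, t) as a single-output GP
```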

The multi-output emulator
- Assume separability
- Allow a general Σ
- Use the same regression basis h(x) for all outputs
- Computationally simple
  - The joint distribution of points on the multivariate GP has matrix-normal form
  - β and Σ can be integrated out analytically (see the sketch below)
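A minimal sketch of fitting this separable emulator, conditional on the correlation lengths: with Y given B and Σ having a matrix-normal distribution with mean HB, the generalised-least-squares estimates of the coefficient matrix and of Σ have closed forms. It reuses gaussian_cov() and h() from the earlier sketches and is an illustration, not the MUCM toolkit implementation.

```python
import numpy as np

def fit_separable_emulator(X, Y, corr_lengths):
    """GLS estimates of the regression coefficients and between-output covariance."""
    H = h(X)
    n, q = Y.shape[0], H.shape[1]
    Cx = gaussian_cov(X, X, 1.0, corr_lengths) + 1e-10 * np.eye(n)   # input correlation + jitter
    Cx_inv = np.linalg.inv(Cx)
    B_star = np.linalg.solve(H.T @ Cx_inv @ H, H.T @ Cx_inv @ Y)     # one coefficient column per output
    R = Y - H @ B_star                                               # residual matrix
    Sigma_star = R.T @ Cx_inv @ R / (n - q)                          # estimated between-output covariance
    return B_star, Sigma_star
```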

The dynamic emulator
- Many simulators produce time series output by iterating
  - Output y_t is a function of the state vector s_t at time t
  - There are exogenous forcing inputs u_t and fixed inputs (parameters) p
  - Single time-step simulator f*: s_{t+1} = f*(s_t, u_{t+1}, p)
- Emulate f*
  - The correlation structure in time is then faithfully modelled
  - We need to emulate accurately: not much happens in a single time step, but that fine detail must be captured
- Iterating the emulator is not straightforward! (a plug-in sketch follows)
  - The state vector may be very high-dimensional
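A minimal plug-in sketch of the iteration: the emulated next state is fed back in as the next input. Using only the posterior mean like this side-steps the real difficulty, which is propagating emulator uncertainty through the recursion; emulator_mean_fn and the argument layout are illustrative assumptions.

```python
import numpy as np

def iterate_dynamic_emulator(emulator_mean_fn, s0, forcings, params):
    """Run the single-step emulator forward: s_{t+1} = f*(s_t, u_{t+1}, p)."""
    s = np.asarray(s0)
    trajectory = [s]
    for u_next in forcings:                                     # exogenous forcing at each step
        x = np.concatenate([s, np.atleast_1d(u_next), params])  # (state, forcing, fixed parameters)
        s = np.asarray(emulator_mean_fn(x))                     # plug in the emulator's posterior mean
        trajectory.append(s)
    return np.array(trajectory)
```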

Which to use?  Big open question!  This workshop will hopefully give us lots of food for thought  MUCM toolkit v3 scheduled to cover these issues  All methods impose restrictions on covariance function  In practice if not in theory  Which restrictions can we get away with in practice?  Dimension reduction is often important  Outputs on grids can be very high dimensional  Principal components-type transformations  Outputs as extra input(s)  Dynamic emulation  Dynamics often driven by forcing

Example  Conti and O’Hagan paper  On my website:  Time series output from Sheffield Global Dynamic Vegetation Model (SDGVM)  Dynamic model on monthly timestep  Large state vector, forced by rainfall, temperature, sunlight  10 inputs  All others, including forcing, fixed  120 outputs  Monthly values of NBP for ten years

Multi-output emulator on the left, outputs-as-input emulator on the right. For fixed forcing, both seem to capture the dynamics well. The outputs-as-input emulator performs less well, owing to its more restrictive and unrealistic time-series structure.

Conclusions  Draw your own!