Bayesian tools for analysing and reducing uncertainty
Tony O'Hagan, University of Sheffield
Or …
Uncertainty, Complexity and Predictive Reliability of (environmental/biological) process models
Summary
- Uncertainty
- Complexity
- Predictive Reliability
Uncertainty is everywhere …
- Internal parameters
- Initial conditions
- Forcing inputs
- Model structure
- Observational error
- Code uncertainty
Uncertainty (2)
- All sources of uncertainty must be
  - recognised
  - quantified
- Otherwise we don't know
  - how good model predictions are
  - how to use data
Tasks involving uncertainty
- Whether or not we have data
  - Sensitivity analysis
  - Uncertainty analysis
- Interacting with observational data
  - Calibration
  - Data assimilation
  - Discrepancy estimation
  - Validation
Complexity
- This is already a big task
- It is massively exacerbated by model complexity
  - High dimensionality
  - Long model run times
- But there are powerful statistical tools available
It's a big task
- Quantifying uncertainty is often difficult
  - Unfamiliar task
  - Need for expert statistical skills
    - Statistical modelling
    - Elicitation
- It deserves to be recognised as a task of comparable status to developing the model
- And EMS is all about respecting each other's expertise
Computational complexity
- All the tasks involving uncertainty can be computed by simple (MC)MC methods if the model runs quickly enough (a Monte Carlo sketch follows)
- Otherwise emulation is needed
  - Requires orders of magnitude fewer model runs
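When the model really is cheap to run, uncertainty analysis can be done by brute-force Monte Carlo. The sketch below illustrates this under that assumption; the toy simulator and the input distributions are hypothetical stand-ins for a real process model and elicited input uncertainty.

```python
# Minimal Monte Carlo uncertainty analysis, assuming a fast model.
import numpy as np

rng = np.random.default_rng(0)

def simulator(x1, x2):
    """Toy stand-in for a cheap process model (hypothetical)."""
    return np.sin(x1) + 0.5 * x2**2

# Uncertainty about the inputs, expressed as probability distributions (hypothetical)
x1 = rng.normal(loc=1.0, scale=0.2, size=10_000)
x2 = rng.uniform(low=0.0, high=1.0, size=10_000)

# Propagate input uncertainty through the model by brute force
y = simulator(x1, x2)
print(f"output mean {y.mean():.3f}, output sd {y.std():.3f}")
print("95% interval:", np.percentile(y, [2.5, 97.5]))
```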
Emulation
- A computer model encodes a function that takes inputs and produces outputs
- An emulator is a statistical approximation of that function
- NOT just an approximation
  - Estimates what outputs would be obtained from given inputs
  - With a statistically valid measure of uncertainty
Emulators
- Multiple regression models
  - Do not make valid uncertainty statements
- Neural networks
  - Can make valid uncertainty statements but complex
- Data-based mechanistic models
  - Do not make valid uncertainty statements
- Gaussian processes
GPs
- Gaussian process emulators
  - are nonparametric: make no assumptions other than smoothness
  - estimate the code accurately, with small uncertainty, and run instantly
- So we can do uncertainty-based tasks fast and efficiently
- Conceptually, we use model runs to learn about the function, then derive any desired properties of the model (a minimal emulator sketch follows)
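As an illustration of the idea, here is a minimal one-input GP emulator written from scratch with NumPy: a handful of model runs, a squared-exponential (smooth) covariance and the standard GP conditioning formulae give an estimate of the output at any untried input, together with a variance. The toy simulator, design points and covariance settings are all hypothetical, and a zero prior mean is assumed for simplicity.

```python
import numpy as np

def simulator(x):
    """Toy stand-in for a slow computer model (hypothetical)."""
    return np.sin(3 * x) + x

def sq_exp(a, b, length_scale=0.5, variance=1.0):
    """Squared-exponential covariance between input vectors a and b."""
    return variance * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / length_scale**2)

# A few training runs of the model: the "data" for the emulator
x_run = np.linspace(0.0, 2.0, 5)
y_run = simulator(x_run)

# Standard GP conditioning (zero prior mean): posterior mean and variance at new inputs
x_new = np.linspace(0.0, 2.0, 200)
K = sq_exp(x_run, x_run) + 1e-8 * np.eye(len(x_run))   # jitter for numerical stability
K_s = sq_exp(x_new, x_run)
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_run))
mean = K_s @ alpha                                       # emulator estimate
v = np.linalg.solve(L, K_s.T)
var = sq_exp(x_new, x_new).diagonal() - np.sum(v**2, axis=0)
sd = np.sqrt(np.clip(var, 0.0, None))                    # emulator uncertainty

print("largest emulator sd between design points:", sd.max())
```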
2 code runs
- Consider one input and one output
- Emulator estimate interpolates the data
2 code runs
- Emulator uncertainty grows between data points
3 code runs
- Adding another point changes the estimate and reduces uncertainty
5 code runs
- And so on (illustrated in the sketch below)
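The effect shown on these slides can be reproduced with any GP library. A sketch assuming scikit-learn is available, with a fixed smoothness so that only the number of runs changes: as runs are added, the largest predictive standard deviation between design points falls. The toy model is again hypothetical.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def simulator(x):
    return np.sin(3 * x)        # toy stand-in for the computer model (hypothetical)

x_test = np.linspace(0.0, 2.0, 200).reshape(-1, 1)

for n_runs in (2, 3, 5):
    # Evenly spaced design points; the kernel (smoothness) is held fixed
    x_design = np.linspace(0.0, 2.0, n_runs).reshape(-1, 1)
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), optimizer=None)
    gp.fit(x_design, simulator(x_design).ravel())
    _, sd = gp.predict(x_test, return_std=True)
    print(f"{n_runs} runs: largest emulator sd between points = {sd.max():.3f}")
```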
Smoothness
- It is the basic assumption of a (homogeneously) smooth, continuous function that gives the GP its computational advantages
- The actual degree of smoothness concerns how rapidly the function wiggles
  - A rough function responds strongly to quite small changes in inputs
- We need many more data points to emulate a rough function accurately over a given range (compared in the sketch below)
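To make the point concrete, the sketch below emulates a smooth and a rapidly wiggling function from the same eight-run design; the rough function is emulated much less accurately. Assumes scikit-learn; both test functions and the design are hypothetical.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Two hypothetical functions over [0, 2]: one smooth, one that wiggles rapidly
smooth = lambda x: np.sin(2 * x)
rough = lambda x: np.sin(15 * x)

x_design = np.linspace(0.0, 2.0, 8).reshape(-1, 1)    # same design for both
x_test = np.linspace(0.0, 2.0, 400).reshape(-1, 1)

for name, f in [("smooth", smooth), ("rough", rough)]:
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True)
    gp.fit(x_design, f(x_design).ravel())
    mean, sd = gp.predict(x_test, return_std=True)
    rmse = np.sqrt(np.mean((mean - f(x_test).ravel())**2))
    print(f"{name}: fitted length-scale {gp.kernel_.length_scale:.2f}, emulation RMSE {rmse:.3f}")
```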
Effect of Smoothness
- Smoothness determines how fast the uncertainty increases between data points
Estimating smoothness
- We can estimate the smoothness from the data
- This is obviously a key Gaussian process parameter to estimate
- But tricky
  - Need a robust estimate
  - Validate by predicting left-out data points (sketched below)
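A sketch of this, assuming scikit-learn: the length-scale (smoothness) is estimated by maximum likelihood, and the fit is checked by refitting without each design point in turn and seeing whether the left-out output falls within the emulator's stated uncertainty. The toy model and design are hypothetical, and maximum likelihood is only one possible estimator.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def simulator(x):
    return np.sin(3 * x) + x    # toy stand-in for the computer model (hypothetical)

X = np.linspace(0.0, 2.0, 8).reshape(-1, 1)
y = simulator(X).ravel()

# Maximum-likelihood estimate of the length-scale (the smoothness parameter)
gp = GaussianProcessRegressor(kernel=ConstantKernel(1.0) * RBF(length_scale=1.0))
gp.fit(X, y)
print("fitted kernel:", gp.kernel_)

# Leave-one-out validation: refit without each point and predict it
for i in range(len(X)):
    keep = np.arange(len(X)) != i
    gp_i = GaussianProcessRegressor(kernel=gp.kernel_, optimizer=None)  # hyperparameters fixed
    gp_i.fit(X[keep], y[keep])
    m, s = gp_i.predict(X[i:i + 1], return_std=True)
    print(f"x={X[i, 0]:.2f}: standardised prediction error {(y[i] - m[0]) / s[0]:+.2f}")
```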
Code uncertainty
- Emulation, like MC, is just a computational device
  - But a highly efficient one!
- Like MC, quantities of interest are computed subject to error
  - Statistically quantifiable and validatable
  - Reducible if we can do more model runs
- This is code uncertainty (sketched below)
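One way to see code uncertainty concretely: use the emulator to estimate the mean output over the input distribution, and propagate the emulator's own posterior uncertainty into that estimate. A rough sketch assuming scikit-learn; the toy model, design and input distribution are hypothetical, and sampling emulator posterior draws is only one crude way to quantify the error.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def simulator(x):
    return np.sin(3 * x) + x    # toy stand-in for the computer model (hypothetical)

# Fit the emulator to a handful of model runs
X_design = np.linspace(0.0, 2.0, 6).reshape(-1, 1)
gp = GaussianProcessRegressor(kernel=ConstantKernel(1.0) * RBF(length_scale=0.5))
gp.fit(X_design, simulator(X_design).ravel())

# Uncertainty analysis: mean output over the (uncertain) inputs
rng = np.random.default_rng(1)
X_mc = rng.uniform(0.0, 2.0, size=(500, 1))            # hypothetical input distribution

# Each posterior draw from the emulator is a plausible version of the model;
# the spread of the resulting means is the code uncertainty on E[output]
draws = gp.sample_y(X_mc, n_samples=50, random_state=1)
means = draws.mean(axis=0)
print(f"E[output] = {means.mean():.3f} +/- {means.std():.3f} (code uncertainty)")
```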
And finally … Predictive Reliability
What can we do with observational data?
- Model validation
  - Check observations against predictive distributions based on current knowledge
- Calibration
  - Learn about the values of uncertain model parameters (possibly including model structure)
- Data assimilation
  - For dynamic models, learn about the current value of the state vector
- Model correction
  - Learn about the model discrepancy function
- Do all of these (in one coherent Bayesian system) – a simplified calibration sketch follows
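A minimal calibration sketch in the spirit of this framework, but heavily simplified: the model is cheap (no emulator), the discrepancy is treated as independent Gaussian error with a fixed variance rather than a Gaussian process, and the toy model, observations, priors and variances are all hypothetical.

```python
import numpy as np

def model(x, theta):
    return theta * np.sin(x)            # toy process model; theta is the uncertain parameter

# Field observations (hypothetical)
x_obs = np.array([0.5, 1.0, 1.5, 2.0])
z_obs = np.array([0.95, 1.70, 2.05, 1.90])

obs_var = 0.05**2                        # observation error variance (assumed)
disc_var = 0.10**2                       # crude variance allowance for model discrepancy (assumed)

# Prior for theta and a grid-based posterior
theta_grid = np.linspace(0.0, 4.0, 401)
prior = np.exp(-0.5 * (theta_grid - 2.0)**2 / 0.5**2)   # hypothetical Gaussian prior

def log_lik(theta):
    resid = z_obs - model(x_obs, theta)
    return -0.5 * np.sum(resid**2) / (obs_var + disc_var)

post = prior * np.exp([log_lik(t) for t in theta_grid])
post /= post.sum()                       # normalise grid weights
print("posterior mean of theta:", np.sum(theta_grid * post))
```

Splitting the misfit between observation error and discrepancy, rather than attributing it all to the parameters, is what prevents over-confident calibration; the next slide makes the same point.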
Doing it all
- It's crucial to model uncertainties carefully
  - to avoid using data twice
  - to apportion observation error between parameters, state vector and model discrepancy
  - to get appropriate learning about all of these
- Data assimilation alone is useful only for short-term prediction
This is challenging
- We (Sheffield and Durham) have developed theory and serious case studies
  - Growing practical experience
- But still lots to do, both theoretically and practically
  - Each new model poses new challenges
- Our science is as exciting and challenging as any other
Sorry …
- We are not yet at the stage where implementation is routine
  - Very limited software
  - Most publications in the statistics literature
- But we're working on it
- And we're very willing to interact with modellers/users in any discipline
  - Particularly if you have resources!
Who we are
- Sheffield
  - Tony O'Hagan
  - Marc Kennedy, Stefano Conti, Jeremy Oakley
- Durham
  - Michael Goldstein
  - Peter Craig, Jonathan Rougier, Alan Seheult