Stochastic Population Forecasting and ARIMA time series modelling Lectures QMSS Summer School, 2 July 2009 Nico Keilman Department of Economics, University of Oslo
Stochastic Stochastic (from the Greek "Στόχος" for "aim" or "guess") means random. A stochastic process is one whose behaviour is non-deterministic in that a system's subsequent state is determined both by the process's predictable actions and by a random element. In a stochastic population forecast, uncertainty is made explicit: random variables are part of the forecast model.
Stochastic population forecast Future population / births / deaths /migrations as probability distributions, not one number (perhaps three)
Why Stochastic Population Forecasts (SPF)? Users should be informed about the expected accuracy of the forecast - probability of alternative future paths? - which forecast horizon is reasonable? Traditional deterministic forecast variants (e.g. High, Medium, Low) - do not quantify uncertainty Prob(MediumPop) = 0 !! - give a misleading impression of uncertainty (example later) - leave room for politically motivated choices by the user
Outline Uncertainty of population forecasts Principles of SPF Time series models (selected examples) Alho’s scaled model for error Examples from UPE Using a SPF Focus on national forecasts
How uncertain are population forecasts? Empirical findings – historical forecasts evaluated against actual population numbers (ex post facto)
Main findings for official forecasts in Western countries Uncertainty in forecasts of certain population variables surprisingly large Forecasts for the young and the old age groups are the least reliable Forecast errors increase as forecast interval lengthens Large uncertainty for small countries Large uncertainty for countries that are strongly affected by migration European forecasts have not become more accurate since WW2
Errors in age structure forecasts Europe
United Kingdom - men
United Kingdom - women
Why uncertain? Data quality (LDCs) Social science predictions, no accurate behavioural theory Rely on observed regularities instead Problems when sudden trend shifts occur - stagnation of life expectancy for men in the 1950s - baby boom/baby bust
Traditional population forecasts do not give a correct impression of uncertainty
Example: Old Age Dependency Ratio (OADR) for Norway in 2060 Source: Statistics Norway population forecast of 2005 High Middle Low |H-L|/M millions (%) POP POP OADR
Two major problems Wide margins for some variables, narrow margins for others Narrow margins in the short run, wide margins in the long run - implicitly assumed perfect autocorrelation (and sometimes perfect correlation across components)
Coverage probabilities for H-L margin of total population in official forecasts
Statistics Norway 47% 78%
Statistics Sweden
- Fertility 19% 32%
- Mortality 4% 20%
- Migration 1% 34%
Sources: Stochastic population forecasts from UPE; traditional forecasts from Statistics Norway and Statistics Sweden
Cohort-component method Deterministic population forecast Needed for the country in question: annual assumptions on future –Fertility Total Fertility Rate –Mortality Life expectancy at birth M/F –Migration Net immigration –as well as rates (fertility, mortality) & numbers (migration) by age & sex
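A minimal sketch of one step of a cohort-component projection, assuming a one-sex population in three broad age groups and a time step equal to one age-group width. The function name `project_step` and all numbers are hypothetical illustrations, not values from the lecture, and infant mortality among the newborn is ignored for brevity.

```python
def project_step(pop, survival, fertility, net_migration):
    """Advance an age-structured population by one step.

    pop[i]           : persons in age group i at the start of the step
    survival[i]      : share of group i that survives into group i + 1
                       (last entry: share surviving within the open-ended last group)
    fertility[i]     : births per person in group i during the step
    net_migration[i] : net migrants (a number, not a rate) added to group i
    """
    n = len(pop)
    births = sum(f * p for f, p in zip(fertility, pop))
    new_pop = [0.0] * n
    new_pop[0] = births                          # newborns enter the first group
    for i in range(n - 1):
        new_pop[i + 1] = pop[i] * survival[i]    # survivors move up one group
    new_pop[n - 1] += pop[n - 1] * survival[n - 1]  # open-ended last group
    return [p + m for p, m in zip(new_pop, net_migration)]


# Hypothetical inputs, for illustration only.
projected = project_step(
    pop=[1000.0, 1200.0, 800.0],
    survival=[0.99, 0.95, 0.40],
    fertility=[0.0, 0.9, 0.0],
    net_migration=[10.0, 20.0, 0.0],
)
```

Repeating the step with a new set of assumptions for each period gives the deterministic forecast; the stochastic version replaces the fixed inputs by random draws.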
Stochastic Population Forecast: How? Cohort-component method Random rates for fertility and mortality, random numbers for net-migration Normal distributions in the log scale (rates) or in the original scale (migration numbers) - expected values (“point predictions”) – cf. Medium variant in traditional deterministic forecast - standard deviations - correlations (age, time, sex, components, countries)
SPF: How? (cntnd) Joint distribution of all random input variables (rates, migration numbers) In practice: simplifications, e.g. - independence of components (fertility, mortality, migration) - correlation between male and female mortality (constant across ages, time) One random draw from all probability distributions gives one sample path Repeated draws give thousands of sample paths
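The idea of repeated draws can be sketched as follows. This is a toy illustration under strong assumptions: three age groups, a random-walk error on the log scale for fertility, net migration drawn on the original scale and added to one age group only, deterministic survival, and independence between fertility and migration. All numbers and the name `simulate_path` are hypothetical.

```python
import math
import random


def simulate_path(rng, steps=3):
    """Return total population after each step for one random sample path."""
    pop = [1000.0, 1200.0, 800.0]     # hypothetical starting population
    survival = [0.99, 0.95, 0.40]     # hypothetical, kept deterministic here
    fert_point = 0.9                  # point forecast: births per person in group 1
    log_err = 0.0                     # relative fertility error on the log scale
    totals = []
    for _ in range(steps):
        log_err += rng.gauss(0.0, 0.10)        # assumed standard deviation
        fert = fert_point * math.exp(log_err)  # random rate, normal in the log scale
        migr = rng.gauss(20.0, 15.0)           # random number, normal in the original scale
        births = fert * pop[1]
        pop = [
            births,
            pop[0] * survival[0] + migr,
            pop[1] * survival[1] + pop[2] * survival[2],
        ]
        totals.append(sum(pop))
    return totals


rng = random.Random(2009)
n_paths = 5000
paths = [simulate_path(rng) for _ in range(n_paths)]

# Empirical median and 80% prediction interval for the final total population.
final = sorted(p[-1] for p in paths)
lower, median, upper = final[n_paths // 10], final[n_paths // 2], final[9 * n_paths // 10]
```

Each call to `simulate_path` is one sample path; the sorted collection of end values approximates the predictive distribution.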
SPF: How? (cntnd) Three main approaches: uncertainty parameters based on -historical errors -expert knowledge -statistical model
SPF: Examples Multivariate time series models for all parameters of interest Examples for Norway , see and European countries , see Alho’s scaled model for error, implemented in PEP (Program for Error Propagation) Example for aggregate of 18 European countries , see
Time series example, Norway: log(TFR) = ARIMA(1,1,0)
$Z_t = \underset{(0.10)}{0.67}\,Z_{t-1} + \varepsilon_t$, where $Z_t = \log(\mathrm{TFR}_t) - \log(\mathrm{TFR}_{t-1})$
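A hedged sketch of how prediction intervals could be simulated from this ARIMA(1,1,0) model. Only the coefficient 0.67 comes from the slide; the innovation standard deviation, the starting TFR and the starting value of Z are assumed values chosen for illustration.

```python
import math
import random

PHI = 0.67        # AR coefficient from the slide
SIGMA = 0.03      # innovation standard deviation: assumed, not from the lecture
TFR_START = 1.8   # hypothetical last observed TFR
Z_START = 0.0     # hypothetical last observed change in log(TFR)


def simulate_tfr(rng, horizon):
    """Simulate one future TFR path from the ARIMA(1,1,0) model for log(TFR)."""
    log_tfr, z = math.log(TFR_START), Z_START
    path = []
    for _ in range(horizon):
        z = PHI * z + rng.gauss(0.0, SIGMA)  # AR(1) step for the differenced series
        log_tfr += z                         # undo the differencing
        path.append(math.exp(log_tfr))
    return path


rng = random.Random(1)
paths = [simulate_tfr(rng, 40) for _ in range(2000)]


def interval(year_index, level=0.80):
    """Empirical prediction interval across the simulated paths."""
    values = sorted(p[year_index] for p in paths)
    lo = values[int((1 - level) / 2 * len(values))]
    hi = values[int((1 + level) / 2 * len(values)) - 1]
    return lo, hi


first_year = interval(0)
last_year = interval(39)
```

Because the model is fitted to first differences, the errors accumulate and the interval for the last year is much wider than for the first.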
Prediction intervals, age-specific fertility rates, Norway 2050
Time series models for parameters of Gamma model for age-specific fertility (TFR, MAC, variance in age at childbearing) e0 parameters of Heligman-Pollard model for age-specific mortality immigration numbers emigration numbers (deterministic age patterns for both migration flows) 5000 simulations
Population size, Norway
Time series models, two examples
1. Autoregressive model of order 1 - AR(1)
$Z_t = \phi Z_{t-1} + \varepsilon_t$, $|\phi| < 1$, $\varepsilon_t$ i.i.d. random variables, zero expectation, constant variance – “white noise”
$\mathrm{Var}(Z_t) = \mathrm{Var}(\varepsilon_t)\,(1 - \phi^{2t})/(1 - \phi^{2})$, constant in the long run (large t)
For large t: k-step ahead autocorrelation $\mathrm{Corr}(Z_t, Z_{t+k})$ equals $\phi^{k}$, independent of time
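The variance formula can be checked against the recursion implied by the AR(1) equation. The sketch below assumes that the process starts from a known value (so the variance at time 0 is zero), which is what the closed form implies; the function names and the parameter values are illustrative.

```python
def ar1_variance_closed_form(phi, sigma2, t):
    """Var(Z_t) = sigma2 * (1 - phi^(2t)) / (1 - phi^2) for a known start value."""
    return sigma2 * (1 - phi ** (2 * t)) / (1 - phi ** 2)


def ar1_variance_recursive(phi, sigma2, t):
    """Same quantity from Var(Z_t) = phi^2 * Var(Z_{t-1}) + sigma2, Var(Z_0) = 0."""
    var = 0.0
    for _ in range(t):
        var = phi ** 2 * var + sigma2
    return var


phi, sigma2 = 0.67, 1.0   # illustrative values
long_run = sigma2 / (1 - phi ** 2)
checks = [
    (t, ar1_variance_closed_form(phi, sigma2, t), ar1_variance_recursive(phi, sigma2, t))
    for t in (1, 5, 50)
]
```

For large t the term with the power 2t vanishes, so the variance levels off at the long-run value computed in `long_run`.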
2. Random Walk - RW
$Z_t = Z_{t-1} + \varepsilon_t$
$\mathrm{Var}(Z_t) = t \cdot \mathrm{Var}(\varepsilon_t)$, unbounded for large t
Independent increments (zero autocorrelation)
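A small simulation sketch of the linear growth of the random walk variance, assuming a start value of zero and standard normal increments. The function name, seed and sample sizes are arbitrary choices for illustration.

```python
import random
import statistics


def random_walk_endpoints(rng, t, n_paths, sigma=1.0):
    """Simulate n_paths random walks of length t and return their end values."""
    ends = []
    for _path in range(n_paths):
        z = 0.0
        for _step in range(t):
            z += rng.gauss(0.0, sigma)   # independent increment
        ends.append(z)
    return ends


rng = random.Random(42)
var_5 = statistics.pvariance(random_walk_endpoints(rng, 5, 4000))
var_25 = statistics.pvariance(random_walk_endpoints(rng, 25, 4000))
```

With unit increment variance, `var_5` and `var_25` should be close to 5 and 25, in contrast to the AR(1) case where the variance levels off.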
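Forecasts and 95% prediction intervals for net migration. Data Outliers: 1989 AR(1) & const: $Z_t = c + \phi Z_{t-1} + \varepsilon_t$ Outliers: 1962, 1988 AR(1) & const: $Z_t = c + \phi Z_{t-1} + \varepsilon_t$ (c and φ stand for the estimated constant and AR coefficient of each model)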
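Forecasts and 67%, 80%, and 95% prediction intervals for the TFR. Data Observed TFR-value for the year 2000 is given as “y2000” Model: AR(1) & constant $Z_t = c + \phi Z_{t-1} + \varepsilon_t$, with $Z_t = \log \mathrm{TFR}_t$ (c and φ stand for the estimated constant and AR coefficient)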
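Forecasts and 67%, 80%, and 95% prediction intervals for the TFR. Data Observed TFR-value for the year 2000 is given as “y2000” Model: AR(1) & constant Outliers 1920, 1942 $Z_t = c + \phi Z_{t-1} + \varepsilon_t$, with $Z_t = \log \mathrm{TFR}_t$ (c and φ stand for the estimated constant and AR coefficient)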
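Forecasts and 67%, 80%, and 95% prediction intervals for the TFR. Data Observed TFR-value for the year 2000 is given as “y2000” Model: AR(2) & constant $Z_t = c + \phi_1 Z_{t-1} + \phi_2 Z_{t-2} + \varepsilon_t$, with $Z_t = \log \mathrm{TFR}_t$ (c, φ1 and φ2 stand for the estimated constant and AR coefficients)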
Forecasts and 67%, 80%, and 95% prediction intervals for the TFR. Data Observed TFR-value for the year 2000 is given as “y2000” Model: AR(2)-ARCH(1) Outliers 1919, 1920, 1940, 1941 Z t (=logTFR t ) = Z t-1 + v t + dummies v t = v t-2 + ε t, ε t = (√h t )e t, h t = 7E (ε t 2 )
Time series approach to SPF + conceptually simple - inflexible Alternative: Alho’s scaled model for error Implemented in Program for Error Propagation (PEP)
Scaled model for error Suppose the true age-specific rate in age j during forecast year t > 0 is of the form R(j,t) = F(j,t)exp(X(j,t)), where F(j,t) is the point forecast, and X(j,t) is the relative error
Suppose that the error processes are of the form $X(j,t) = \varepsilon(j,1) + \cdots + \varepsilon(j,t)$ with error increments of the form $\varepsilon(j,t) = S(j,t)\,(\eta_j + \delta(j,t))$
S(j,t) deterministic scales. δ(j,t) are independent over time t. δ(j,t) are independent of $\eta_j$ for all t and j
$\eta_j \sim N(0, \kappa)$, $\delta(j,t) \sim N(0, 1 - \kappa)$, $0 \le \kappa \le 1$
Note that $\mathrm{Var}(\varepsilon(j,t)) = S(j,t)^2$
A positive kappa means that there is systematic error in the time trend of the rate.
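The variance statement follows in one line from the independence assumptions, reading N(0, κ) and N(0, 1 − κ) as normal distributions with variances κ and 1 − κ. The covariance between two increments is included as well, since only the common term η_j is shared between years:

```latex
\begin{aligned}
\mathrm{Var}\big(\varepsilon(j,t)\big)
  &= S(j,t)^2\,\big[\mathrm{Var}(\eta_j) + \mathrm{Var}(\delta(j,t))\big]
   = S(j,t)^2\,\big[\kappa + (1-\kappa)\big]
   = S(j,t)^2, \\
\mathrm{Cov}\big(\varepsilon(j,t),\, \varepsilon(j,t+h)\big)
  &= S(j,t)\,S(j,t+h)\,\mathrm{Var}(\eta_j)
   = \kappa\,S(j,t)\,S(j,t+h), \qquad h > 0 .
\end{aligned}
```

Dividing the covariance by the product of the two standard deviations S(j,t) and S(j,t+h) leaves κ as the correlation between any two increments.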
κ = Corr[ε(j,t), ε(j,t+h)] for all h > 0, thus κ is the (constant) autocorrelation between the error increments. Together, the autocorrelation κ and the scale S(j,t) determine the variance of the relative error X(j,t). Ex. 1. Under a random walk model the error increments are uncorrelated with κ = 0. Ex. 2. The model with constant scales (S(j,t)=S(j)) can be interpreted as a random walk with a random drift. The relative importance of the two components is determined by κ.
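A simulation sketch of the scaled model for a single age j with a constant scale, checking the two statements above: the correlation between error increments is close to κ, and κ and S together fix the variance of the cumulated error. For constant scale S the model gives Var X(j,t) = S²(κt² + (1 − κ)t), which follows from summing t increments that share the same η_j. The values of κ, S and the horizon are hypothetical, and the helper names are not part of PEP.

```python
import math
import random
import statistics


def simulate_increments(rng, kappa, scale, horizon):
    """Error increments eps(t) = S * (eta + delta(t)) for one sample path."""
    eta = rng.gauss(0.0, math.sqrt(kappa))   # drawn once, shared by all years
    return [
        scale * (eta + rng.gauss(0.0, math.sqrt(1.0 - kappa)))
        for _ in range(horizon)
    ]


def correlation(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def cumulated_error_variance(kappa, scale, t):
    """Var X(t) for constant scale: S^2 * (kappa * t^2 + (1 - kappa) * t)."""
    return scale ** 2 * (kappa * t ** 2 + (1.0 - kappa) * t)


KAPPA, SCALE, HORIZON = 0.3, 0.05, 10   # hypothetical values
rng = random.Random(7)
paths = [simulate_increments(rng, KAPPA, SCALE, HORIZON) for _ in range(20000)]

corr_first_last = correlation([p[0] for p in paths], [p[-1] for p in paths])
var_x_simulated = statistics.pvariance([sum(p) for p in paths])
var_x_theory = cumulated_error_variance(KAPPA, SCALE, HORIZON)
```

Setting κ = 0 removes the t² term and gives the random walk case; any positive κ makes the variance of the relative error grow quadratically in the long run.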
Migration Migration (net) is represented in absolute terms Dependence on age is deterministic, given by a fixed distribution g(j,x) over age x The error of net migration in age x, for sex j, during year t > 0, is additive and of the form $Y(j,x,t) = S(j,t)\,g(j,x)\,(\eta_j + \delta(j,t))$
Key properties of the scaled model The choice of the scales S(j,t) is unrestricted. Hence any sequence of non-decreasing error variances can be matched (e.g. heteroscedasticity) Any sequence of cross-correlations over ages can be majorized using the AR(1) models of correlation Any sequence of autocorrelations for the error increments can be majorized.
Scaled model for error Used for UPE project: Uncertain Population of Europe 18 countries: EU15 + Iceland, Norway, Switzerland (EEA+) 2003 – 2050 Probability distributions specified on the basis of - time series analysis (TFR, e0, net-migr.) - empirical forecast errors - expert judgement 3000 simulations for each country, PEP
Population size EEA+ median (black), 80% prediction intervals (red) 77% chance > 400 million in 2050 (UN) 83% chance > 392 million in 2050 (2003)
median (black), 80% prediction intervals (red)
How to use SPF results? User’s Loss function What are the costs associated with underpredictions/ overpredictions of certain sizes?
Loss function, stylized example F = forecast O = observed
$\mathrm{Loss} = c\,(F - O)$ if $F > O$
$\mathrm{Loss} = \lambda\,c\,(O - F)$ if $F < O$
with $c, \lambda > 0$
λ characterizes the degree of asymmetry in the loss function λ > 1: underprediction is more severe than overprediction
Forecast F is a stochastic variable with a predictive distribution Hence Loss is a stochastic variable (s.v.), which has a distribution Compute expected Loss Pick that value of F which minimizes expected Loss The optimal F is the value at which the cumulative distribution function of the predictive distribution equals λ/(λ+1) λ = 1: median value of F λ > 1: optimal F is larger than the median
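The quantile rule can be derived from the first-order condition for the expected loss. The sketch below assumes a continuous predictive distribution and writes G for its cumulative distribution function and O for the uncertain outcome; the symbol G is introduced here for the derivation only.

```latex
\begin{aligned}
\mathrm{E}[\mathrm{Loss}](F)
  &= c \int_{-\infty}^{F} (F - o)\, dG(o)
   + \lambda c \int_{F}^{\infty} (o - F)\, dG(o), \\
\frac{d}{dF}\,\mathrm{E}[\mathrm{Loss}](F)
  &= c\,G(F) - \lambda c\,\big(1 - G(F)\big) = 0
  \quad\Longrightarrow\quad
  G(F^{*}) = \frac{\lambda}{\lambda + 1} .
\end{aligned}
```

For λ = 1 this gives G(F*) = 1/2, the median; for λ > 1 the right-hand side exceeds 1/2, so the optimal forecast lies above the median.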
e62 ~ Normal(20, stdev) λ > 1: underprediction is more severe than overprediction
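A numerical sketch of this example, assuming a standard deviation of 2 years and λ = 3 (both hypothetical; the slide gives only the mean of 20). It computes the optimal forecast as the λ/(λ+1) quantile of the normal distribution and compares it with a brute-force search over simulated expected loss.

```python
import random
from statistics import NormalDist

LAM = 3.0                                   # assumed: underprediction three times as costly
dist = NormalDist(mu=20.0, sigma=2.0)       # sigma is an assumed value


def loss(forecast, observed, c=1.0, lam=LAM):
    """Asymmetric linear loss from the stylized example."""
    if forecast > observed:
        return c * (forecast - observed)        # overprediction
    return lam * c * (observed - forecast)      # underprediction


# Quantile rule: the optimal forecast is the lam / (lam + 1) quantile.
best = dist.inv_cdf(LAM / (LAM + 1.0))

# Brute-force check: minimise simulated expected loss over a grid of forecasts.
rng = random.Random(0)
draws = [rng.gauss(20.0, 2.0) for _ in range(10000)]


def expected_loss(forecast):
    return sum(loss(forecast, o) for o in draws) / len(draws)


grid = [18.0 + 0.1 * i for i in range(51)]      # candidate forecasts 18.0 to 23.0
grid_best = min(grid, key=expected_loss)
```

Under these assumptions the optimal forecast lies about 1.35 years above the median of 20, reflecting the higher cost of underprediction.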
Important Are overpredictions more/less harmful than underpredictions?
Challenges Multi-state forecasts (sub-national, household) Limited data Educate the users
Thank you!
Autocorrelation of error increments The error processes are of the form $X(j,t) = \varepsilon(j,1) + \cdots + \varepsilon(j,t)$ with error increments of the form $\varepsilon(j,t) = S(j,t)\,(\eta_j + \delta(j,t))$
S(j,t) deterministic scales. δ(j,t) are independent over time t. δ(j,t) are independent of $\eta_j$ for all t and j
$\eta_j \sim N(0, \kappa)$, $\delta(j,t) \sim N(0, 1 - \kappa)$, $0 \le \kappa \le 1$
A positive kappa means that there is systematic error in the time trend of the rate.
UPE: age specific fertility rates We assumed that kappa = 0 random walk, non-correlated error increments ε(j,t) = S(j,t)δ(j,t) δ(j,t) i.i.d. ~ N(0, 1) Example Italy Pop aged 0 in 2050: - Expected value = 474,000 - Median= 420,000 - Standard deviation = 261,000 - Coefficient of variation = 0.55
Alternative assumption: kappa = 0.05 Italy Pop aged 0 in 2050: - Expected value = 678,000 - Median= 457,000 - Standard deviation = 794,000 - Coefficient of variation = 1.17 Kappa = 0.1 gives unrealistically wide prediction intervals for Pop aged 0 in 2050
EEA+ 15 EU countries: Austria, Belgium, Denmark, Finland, France, Germany, Greece, Italy, Ireland, Luxembourg, Netherlands, Portugal, Spain, Sweden, United Kingdom Iceland, Norway, Switzerland
Net migration to the countries of the EEA+: upward trend
Net migration to Italy
UPE assumptions for net migration Increase to ca. 3.5 ‰ by 2050 for the whole of the EEA+ Demand for labour (ageing, economic developments) North – South divide
UPE assumptions for mortality By 2030, mortality reductions in EEA+ countries will follow a common pattern Sex gap of life expectancy reduces to 4 years Life expectancy gains to 2050 by 6.5 (NL) -10 (Lux, Pt, E) years for men 5.7 (NL) – 9.6 (EIR) years for women On average 2-3 years higher than Eurostat/UN
UPE life expectancies too high? under Historically, increases in European life expectancies have been under-estimated by - 2 years (15 years ahead) years (25 years ahead) Record life expectancy is higher & increases faster than UPE - ca years per calendar year
UPE assumptions for fertility Mediterranean and German speaking countries low - little catching up - problems with child care facilities, housing - preference for one child Total Fertility Rate = 1.4 c/w Western and Northern Europe Total Fertility Rate = 1.8 c/w Similar to Eurostat, on average 0.2 c/w lower than UN
UPE: probabilistic forecast Similar method as UN, Eurostat (cohort-component) But parameters are drawn from assumed distributions -- simulation Volatility in fertility, mortality, migration Autocorrelations Correlations across ages, sexes, countries
Population size medians (black) and 80% prediction intervals (red) 2050 SCB 10.5 mln SSB 4.8 mln
Age pyramid 2050 medians & 80 % prediction intervals
UPE assumptions Sweden 2050 exp.80%L 80%H SCB value TFR e0M e0F migr