Objective assessment of the skill of cloud forecasts: Towards an NWP-testbed
Robin Hogan, Ewan O’Connor, Andrew Barrett (University of Reading, UK)
Maureen Dunn, Karen Johnson (Brookhaven National Laboratory)

Overview
Cloud schemes in NWP models are basically the same as in climate models, but easier to evaluate using ARM data because:
– NWP models are trying to simulate the actual weather observed
– They are run every day
– In Europe at least, NWP modellers are more interested in comparisons with ARM-like data than climate modellers (not true in the US?)
But can we use these comparisons to improve the physics?
– We can compare different models that have different parameterizations
– But each model uses a different data assimilation system
– A cleaner test is a setup that is identical except for one aspect of the physics
– The SCM-testbed is the crucial addition to the NWP-testbed
How do we set such a system up?
– Start by interfacing the Cloudnet processing with ARM products
– Metrics: test both bias and skill (we can only test the bias of a climate model)
– Diurnal compositing to evaluate boundary-layer physics

Level 1b
Minimum instrument requirements at each site:
– Cloud radar, lidar, microwave radiometer, rain gauge, and model or sonde data
[Figure: example radar and lidar time–height observations]

Level 1c
Instrument Synergy product
– Example of target classification (ice, liquid, rain, aerosol) and data quality fields

Level 2a/2b
Cloud products on the (L2a) observational and (L2b) model grid
– Water content and cloud fraction
[Figure panels: L2a IWC on the radar/lidar grid; L2b cloud fraction on the model grid]

Cloud fraction
[Figure panels: Chilbolton observations, Met Office mesoscale model, ECMWF global model, Météo-France ARPEGE model, KNMI RACMO model, Swedish RCA model]

Cloud fraction in 7 models
Mean and PDF for 2004 (0–7 km) for Chilbolton, Paris and Cabauw – Illingworth et al. (BAMS 2007)
– All models except DWD underestimate mid-level cloud
– Some have separate “radiatively inactive” snow (ECMWF, DWD); the Met Office has combined ice and snow but still underestimates cloud fraction
– Wide range of low cloud amounts in the models
– Not enough overcast boxes, particularly in the Met Office model

ARM–Cloudnet interface
First step: interface ARM products to the Cloudnet processing
Now done at Reading; still needs to be implemented at Brookhaven
– Is this a long-term solution?
– Extra products and verification metrics are still desirable

Skill and bias
If directly evaluating a climate model, we can only evaluate bias
– Zero bias can often be the result of compensating errors
In an NWP- or SCM-testbed we can also measure skill
– This answers the question: was cloud forecast at the right time?
– It checks whether the cloud responds to the correct forcing
– Easiest to do for binary events, e.g. exceedance of a threshold
Metrics of skill should be:
– Equitable (random and constant forecasts score zero)
– Robust for rare events (many scores tend to 0 or 1)
A metric with good properties is the Symmetric Extreme Dependency Score (SEDS) of Hogan et al. (2009)
– It awards a score of 1 to a perfect forecast and 0 to a random one
We have tested 3 models over SGP in 2004
– Applied with a cloud-fraction threshold of 0.1
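A minimal sketch (not the Cloudnet/ARM code) of how the threshold-exceedance contingency table and SEDS described above could be computed; the SEDS formula follows Hogan et al. (2009), but the function signatures and array names are illustrative:

```python
import numpy as np

def contingency_table(obs_cf, model_cf, threshold=0.1):
    """Counts of hits (a), false alarms (b), misses (c) and clear-sky hits (d)
    for the binary event 'cloud fraction exceeds the threshold'."""
    o = np.asarray(obs_cf) > threshold
    f = np.asarray(model_cf) > threshold
    a = np.sum(f & o)     # cloud forecast and observed
    b = np.sum(f & ~o)    # cloud forecast but not observed
    c = np.sum(~f & o)    # cloud observed but not forecast
    d = np.sum(~f & ~o)   # clear sky correctly forecast
    return a, b, c, d

def seds(a, b, c, d):
    """Symmetric Extreme Dependency Score (Hogan et al. 2009):
    1 for a perfect forecast, expected value 0 for a random one.
    Undefined if there are no hits (a == 0)."""
    n = a + b + c + d
    return (np.log((a + b) / n) + np.log((a + c) / n)) / np.log(a / n) - 1.0
```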

Southern Great Plains 2004
[Figure panels: ECMWF, NCEP, UK Met Office (Hadley Centre → Met Office)]

Winter 2004

Summer 2004

Microbase IWC vs. ECMWF (Maureen Dunn)

Different mixing schemes: local mixing scheme (e.g. Météo-France)
[Schematic: profiles of virtual potential temperature (θv) and eddy diffusivity (Km, the strength of the mixing), with longwave cooling at cloud top and dθv/dz < 0 in the unstable region]
Local schemes are known to produce boundary layers that are too shallow, moist and cold, because they do not entrain enough dry, warm air from above (Beljaars and Betts 1992)
Define the Richardson number (standard gradient form): Ri = (g/θv)(∂θv/∂z) / [(∂u/∂z)² + (∂v/∂z)²]
The eddy diffusivity is a function of Ri and is usually zero for Ri > 0.25
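A minimal sketch of a local, Richardson-number-based closure of the kind described above; the mixing length l0 and the linear stability function are illustrative assumptions, not the formulation of any particular model:

```python
import numpy as np

def local_eddy_diffusivity(z, theta_v, u, v, l0=40.0, ri_crit=0.25, g=9.81):
    """Eddy diffusivity Km at layer interfaces from local gradients only
    (an illustrative Ri-based closure); Km is zero wherever Ri > ri_crit."""
    dz = np.diff(z)
    dtheta_dz = np.diff(theta_v) / dz
    shear2 = (np.diff(u) / dz) ** 2 + (np.diff(v) / dz) ** 2
    theta_mid = 0.5 * (theta_v[1:] + theta_v[:-1])
    ri = (g / theta_mid) * dtheta_dz / np.maximum(shear2, 1e-10)
    f_ri = np.clip(1.0 - ri / ri_crit, 0.0, 1.0)   # assumed linear cut-off at Ri = ri_crit
    km = l0**2 * np.sqrt(shear2) * f_ri            # mixing length l0 is purely illustrative
    return ri, km
```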

Different mixing schemes: non-local mixing scheme (e.g. Met Office, ECMWF, RACMO)
Use a “test parcel” to locate the unstable regions of the atmosphere
The eddy diffusivity is positive over this region, with a strength determined by the cloud-top longwave cooling rate (Lock 1998)
The entrainment velocity we, the rate of conversion of free-troposphere air to boundary-layer air, is parameterized explicitly
[Schematic: profiles of virtual potential temperature (θv) and eddy diffusivity (Km, the strength of the mixing)]

Different mixing schemes: prognostic turbulent kinetic energy (TKE) scheme (e.g. SMHI-RCA)
The model carries an explicit variable for TKE
The eddy diffusivity is parameterized as Km ~ l·TKE^1/2, where l is a typical eddy size
TKE is generated where dθv/dz < 0, destroyed where dθv/dz > 0, and transported downwards by the turbulence itself
[Schematic: profile of virtual potential temperature (θv) with cloud-top longwave cooling, and the resulting TKE sources and sinks]

Diurnal cycle composite of clouds – Barrett, Hogan & O’Connor (GRL 2009)
Radar and lidar provide cloud boundaries and cloud properties above the site
Most models have a non-local mixing scheme in unstable conditions and an explicit formulation for entrainment at cloud top: good performance over the diurnal cycle
Météo-France: local mixing scheme, too little entrainment
SMHI: prognostic TKE scheme, no diurnal evolution

Summary and future work
One year’s evaluation over SGP:
– All models underestimate mid- and low-level cloud
– Skill may be robustly quantified using SEDS: less skill in summer
Infrastructure to interface ARM and Cloudnet data has been tested on one year of data with cloud fraction and IWC
– So far the Met Office, NCEP, ECMWF and Météo-France models can be processed
– Next implement the code at BNL, with other ARM products and models
– Then run on many years of ARM data from multiple sites
– Question: have cloud forecasts improved in 10 years?
Next apply to the SCM-testbed
– Comparisons already demonstrate a strong difference in the performance of different boundary-layer parameterizations: non-local mixing with explicit entrainment is clearly best
– We have the tools to quantify objectively improvements in both bias and skill with changed parameterizations in SCMs
– Are other metrics of performance or compositing methods required?
– Could we also forward-model the observations and evaluate in observation space?

Joint PDFs of cloud fraction
– 1 year from Murgtal, DWD COSMO model
[Figure panels: joint PDFs at raw (1-hr) resolution and with 6-hr averaging; quadrants labelled a, b, c, d]
…or use a simple contingency table

Contingency tables
Example: DWD model, Murgtal

                   Observed cloud            Observed clear sky
Model cloud        a (cloud hit) = 7194      b (false alarm) = 4098
Model clear sky    c (miss) = 4502           d (clear-sky hit) = …

For a given set of observed events there are only 2 degrees of freedom in all possible forecasts (e.g. a and b), because 2 quantities are fixed:
– the number of events that occurred, n = a + b + c + d
– the base rate (observed frequency of occurrence), p = (a + c)/n
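A small arithmetic illustration of that degrees-of-freedom point, with invented numbers: once n and the base rate are fixed, choosing the hits and false alarms determines the other two entries.

```python
n, p = 10000, 0.3        # illustrative sample size and base rate
a, b = 2200, 1500        # hits and false alarms can be chosen freely...
c = int(p * n) - a       # ...then misses and clear-sky hits are fixed
d = n - a - b - c
print(a, b, c, d)        # 2200 1500 800 5500
```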

Skill–Bias diagrams
[Diagram: for a reality with n = 16 and p = 1/4, forecasts arranged by bias (under-prediction, no bias, over-prediction) and skill (negative, random, positive); marked points include the best and worst possible forecasts, a random unbiased forecast, and constant forecasts of occurrence and non-occurrence]

5 desirable properties of verification measures
1. “Equitable”: all random forecasts receive an expected score of zero
– Constant forecasts of occurrence or non-occurrence also score zero
– Note that forecasting the right cloud climatology versus height, but with no other skill, should also score zero
2. Difficult to “hedge”
– Some measures reward under- or over-prediction
3. Useful for rare events
– Almost all measures are “degenerate” in that they asymptote to 0 or 1 for vanishingly rare events
4. Dependent on the full joint PDF, not just the 2x2 contingency table
– A difference between cloud fractions of 0.9 and 1 is as important for radiation as a difference between 0 and 0.1
– Difficult to achieve together with the other desirable properties: won’t be studied much today...
5. “Linear”: so that an inverse exponential can be fitted to obtain a half-life
– Some measures (e.g. the Odds Ratio Skill Score) are very non-linear

Hedging
“Issuing a forecast that differs from your true belief in order to improve your score” (e.g. Jolliffe 2008)
Hit rate H = a/(a+c)
– The fraction of events correctly forecast
– Easily hedged by randomly changing some forecasts of non-occurrence to occurrence
[Figure: example forecasts with H = 0.5, H = 0.75 and H = 1]
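A small synthetic illustration of that hedging effect (the data and the 70%/50% probabilities are invented): randomly converting forecasts of non-occurrence to occurrence raises the hit rate without adding any skill.

```python
import numpy as np

def hit_rate(fcst, obs):
    """H = a / (a + c): the fraction of observed events that were forecast."""
    a = np.sum(fcst & obs)
    c = np.sum(~fcst & obs)
    return a / (a + c)

rng = np.random.default_rng(0)
obs = rng.random(100_000) < 0.25                          # events occur 25% of the time
fcst = np.where(rng.random(obs.size) < 0.7, obs, ~obs)    # imperfect but skilful forecast
print("original hit rate:", hit_rate(fcst, obs))

# "Hedge" by randomly converting half of the non-occurrence forecasts to occurrence
hedged = fcst | (rng.random(obs.size) < 0.5)
print("hedged hit rate:  ", hit_rate(hedged, obs))        # higher, despite no added skill
```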

Equitability
Defined by Gandin and Murphy (1992):
Requirement 1: an equitable verification measure awards all random forecasting systems, including those that always forecast the same value, the same expected score
– Inequitable measures rank some random forecasts above skilful ones
Requirement 2: an equitable verification measure S must be expressible as the linear weighted sum of the elements of the contingency table, i.e. S = (S_a·a + S_b·b + S_c·c + S_d·d)/n
– This requirement can safely be discarded: it is incompatible with other desirable properties, e.g. usefulness for rare events
Gandin and Murphy reported that only the Peirce Skill Score, and linear transforms of it, is equitable by their requirements
– PSS = hit rate minus false alarm rate = a/(a+c) − b/(b+d)
– What about all the other measures reported to be equitable?

Some reportedly equitable measures
– HSS = [x − E(x)] / [n − E(x)], where x = a + d
– ETS = [a − E(a)] / [a + b + c − E(a)]
– LOR = ln[ad/bc]
– ORSS = [ad/bc − 1] / [ad/bc + 1]
Here E(a) = (a+b)(a+c)/n is the expected value of a for an unbiased random forecasting system (and similarly for E(x))
Random and constant forecasts all score zero, so these measures are all equitable, right?
Simple attempts to hedge will fail for all of these measures
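These four measures could be computed from a 2x2 table as in the sketch below; the function name and the example counts are purely illustrative:

```python
import numpy as np

def reportedly_equitable_scores(a, b, c, d):
    """HSS, ETS, LOR and ORSS computed from a 2x2 contingency table."""
    n = a + b + c + d
    e_a = (a + b) * (a + c) / n                  # expected hits for an unbiased random forecast
    e_d = (b + d) * (c + d) / n                  # expected clear-sky hits
    x = a + d
    hss = (x - (e_a + e_d)) / (n - (e_a + e_d))  # Heidke Skill Score
    ets = (a - e_a) / (a + b + c - e_a)          # "Equitable" Threat Score (Gilbert Skill Score)
    odds_ratio = (a * d) / (b * c)
    lor = np.log(odds_ratio)                     # Log of Odds Ratio
    orss = (odds_ratio - 1) / (odds_ratio + 1)   # Odds Ratio Skill Score (Yule's Q)
    return hss, ets, lor, orss

print(reportedly_equitable_scores(90, 30, 25, 855))   # made-up counts, for illustration only
```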

Skill versus cloud-fraction threshold
Consider 7 models evaluated over 3 European sites in 2004
– LOR implies that skill increases for larger cloud-fraction thresholds
– HSS implies that skill decreases significantly for larger cloud-fraction thresholds
[Figure panels: LOR and HSS versus cloud-fraction threshold]

Extreme dependency score
Stephenson et al. (2008) explained this behaviour:
– Almost all scores have a meaningless limit as the “base rate” p → 0
– HSS tends to zero and LOR tends to infinity
They proposed the Extreme Dependency Score, EDS = 2 ln[(a+c)/n] / ln(a/n) − 1, where n = a + b + c + d
It can be shown that this score tends to a meaningful limit:
– Rewrite in terms of the hit rate H = a/(a+c) and the base rate p = (a+c)/n: since a/n = Hp, EDS = 2 ln p / ln(Hp) − 1
– Then assume a power-law dependence of H on p as p → 0: H ~ p^δ
– In the limit p → 0 we find EDS → (1 − δ)/(1 + δ)
– This is useful because random forecasts have a hit rate converging to zero at the same rate as the base rate: δ = 1, so EDS = 0
– Perfect forecasts have a hit rate that is constant with base rate: δ = 0, so EDS = 1

Symmetric extreme dependency score
Problems with EDS:
– Easy to hedge (unless calibrated)
– Not equitable
Solved by defining a symmetric version, SEDS = [ln((a+b)/n) + ln((a+c)/n)] / ln(a/n) − 1
– All the benefits of EDS, none of the drawbacks!
Hogan, O’Connor and Illingworth (2009 QJRMS)

Skill versus cloud-fraction threshold
SEDS has much flatter behaviour for all models (except the Met Office, which underestimates high cloud occurrence significantly)
[Figure panels: LOR, HSS and SEDS versus cloud-fraction threshold]

Skill versus height
– Most scores are not reliable near the tropopause, because cloud fraction tends to zero there
[Figure panels: HSS, LOR, EDS, LBSS and SEDS versus height]
The new score reveals:
– Skill tends to decrease slowly towards the tropopause
– Mid-level clouds (4–5 km) are most skilfully predicted, particularly by the Met Office
– Boundary-layer clouds are least skilfully predicted

A surprise?
Is mid-level cloud well forecast???
– The frequency of occurrence of these clouds is commonly too low (e.g. from Cloudnet: Illingworth et al. 2007)
– Specification of cloud phase is cited as a problem
– Higher skill could be because large-scale ascent has its largest amplitude here, so the cloud response to the large-scale dynamics is clearest at mid levels
– Higher skill for the Met Office models (global and mesoscale) because they have arguably the most sophisticated microphysics, with separate liquid and ice water content (Wilson and Ballard 1999)?
Low skill for boundary-layer cloud is not a surprise!
– A well-known problem for forecasting (Martin et al. 2000)
– Occurrence and height are a subtle function of subsidence rate, stability, free-troposphere humidity, surface fluxes, entrainment rate...

Key properties for estimating the half-life
We wish to model the score S versus forecast lead time t as S(t) = S0 · 2^(−t/τ1/2), where τ1/2 is the forecast “half-life”
We need linearity
– Some measures “saturate” at the high-skill end (e.g. Yule’s Q / ORSS)
– This leads to a misleadingly long half-life
...and equitability
– The formula above assumes that the score tends to zero for very long forecasts, which will only occur if the measure is equitable

Which measures are equitable?
The expected values of a–d for a random forecasting system may give a score of zero:
– S[E(a), E(b), E(c), E(d)] = 0
But the expected score may not be zero!
– E[S(a,b,c,d)] = Σ P(a,b,c,d)·S(a,b,c,d)
The width of the random probability distribution decreases for larger sample size n
– A measure is only equitable if positive and negative scores cancel
[Figure: distributions of scores from random forecasts for n = 16 and n = 80; ETS and ORSS are asymmetric]
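A Monte Carlo sketch of that point: estimate the expected score of a random forecasting system by drawing many random forecast/observation pairs. The function names, trial count and the choice of ETS as the example are all illustrative.

```python
import numpy as np

def expected_random_score(score_fn, n, p, n_trials=20_000, seed=0):
    """Monte Carlo estimate of E[S(a,b,c,d)] for unbiased random forecasts of an
    event with base rate p, using n forecast/observation pairs per trial."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_trials):
        obs = rng.random(n) < p
        fcst = rng.random(n) < p        # random forecast with the same frequency
        a = np.sum(fcst & obs)
        b = np.sum(fcst & ~obs)
        c = np.sum(~fcst & obs)
        d = np.sum(~fcst & ~obs)
        with np.errstate(invalid="ignore", divide="ignore"):
            s = score_fn(a, b, c, d)
        if np.isfinite(s):              # skip the occasional undefined case
            scores.append(s)
    return np.mean(scores)

def ets(a, b, c, d):
    n = a + b + c + d
    e_a = (a + b) * (a + c) / n
    return (a - e_a) / (a + b + c - e_a)

# Expected ETS of a random forecast is noticeably non-zero for small samples
for n in (16, 80, 1000):
    print(n, expected_random_score(ets, n, p=0.5))
```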

Asymptotic equitability
Consider first unbiased forecasts of events that occur with probability p = 1/2
– The expected value of the “Equitable Threat Score” for a random forecasting system decreases below 0.01 only when n > 30
– We term this behaviour asymptotic equitability
– Other measures are never equitable, e.g. the Critical Success Index, CSI = a/(a+b+c), also known as the Threat Score

What about rarer events?
The “Equitable Threat Score” is still virtually equitable for n > 30
ORSS, EDS and SEDS approach zero much more slowly with n
– For events that occur 2% of the time (e.g. Finley’s tornado forecasts), we need n > 25,000 before the magnitude of the expected score is less than 0.01
– But these measures are supposed to be useful for rare events!

Possible solutions
1. Ensure n is large enough that E(a) > 10
2. Inequitable scores can be scaled to make them equitable, e.g. S' = [S − E(S)] / [S_perfect − E(S)], where E(S) is the expected score of a random forecast
– This opens the way to a new class of non-linear equitable measures
3. Report confidence intervals and “p-values” (the probability of a score being achieved by chance)

What is the origin of the term “ETS”?
First use of the name “Equitable Threat Score”: Mesinger & Black (1992)
– A modification of the “Threat Score” a/(a+b+c)
– They cited Gandin and Murphy’s equitability requirement that constant forecasts score zero (which ETS satisfies), although it does not satisfy the requirement that non-constant random forecasts have an expected score of zero
– ETS is now one of the most widely used verification measures in meteorology
An example of rediscovery:
– Gilbert (1884) discussed a/(a+b+c) as a possible verification measure in the context of Finley’s (1884) tornado forecasts
– Gilbert noted the deficiencies of this measure and also proposed exactly the same formula as ETS, 108 years earlier!
We suggest that ETS be referred to as the Gilbert Skill Score (GSS)
– Or use the Heidke Skill Score, which is unconditionally equitable and is uniquely related to it: ETS = HSS / (2 − HSS)
Hogan, Ferro, Jolliffe and Stephenson (WAF, in press)

Properties of various measures (Y = yes, ~ = approximately/asymptotically, N = no)

Measure                                               Equitable   Useful for rare events   Linear
Peirce Skill Score (PSS) / Heidke Skill Score (HSS)       Y                 N                 Y
Equitably Transformed SEDS                                Y                 Y                 ~
Symmetric Extreme Dependency Score (SEDS)                 ~                 Y                 ~
Log of Odds Ratio (LOR)                                   ~                 ~                 ~
Odds Ratio Skill Score, ORSS (also known as Yule’s Q)     ~                 ~                 N
Gilbert Skill Score, GSS (formerly ETS)                   ~                 N                 N
Extreme Dependency Score (EDS)                            N                 Y                 ~
Hit rate (H) / False alarm rate (FAR)                     N                 N                 Y
Critical Success Index (CSI)                              N                 N                 N

(Truly equitable: Y in the first column; asymptotically equitable: ~; not equitable: N)

Skill versus lead time
Only possible for the UK Met Office 12-km model and the German DWD 7-km model
– Steady decrease of skill with lead time
– Both models appear to improve between 2004 and 2007
Generally, the UK model is best over the UK and the German model best over Germany
– An exception is Murgtal in 2007 (the Met Office model wins)

Forecast “half-life”
Fit an inverse exponential, S(t) = S0 · 2^(−t/τ1/2), where S0 is the initial score and τ1/2 is the half-life
A noticeably longer half-life is fitted after 36 hours
– The same thing was found for Met Office rainfall forecasts (Roberts 2008)
– The first timescale is due to data assimilation and convective events
– The second is due to more predictable large-scale weather systems
[Table: fitted half-lives for the Met Office and DWD models by site and year, ranging from roughly 2.4 to 4.3 days]
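One way the half-life fit could be performed, writing the “inverse exponential” in the half-life form used above; the lead times and scores below are invented purely to show the fitting step:

```python
import numpy as np
from scipy.optimize import curve_fit

def half_life_model(t, s0, tau_half):
    """Score decaying with forecast lead time t, halving every tau_half days."""
    return s0 * 2.0 ** (-t / tau_half)

# Invented lead times (days) and scores, for illustration only
lead_time = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
score = np.array([0.62, 0.55, 0.48, 0.43, 0.38, 0.34])

(s0, tau_half), _ = curve_fit(half_life_model, lead_time, score, p0=(0.6, 3.0))
print(f"initial score {s0:.2f}, half-life {tau_half:.1f} days")
```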

Different spatial scales? Convection?
– Average temporally before calculating the skill scores
– Both the absolute score and the half-life increase with the number of hours averaged
Why is the half-life less for clouds than for pressure?
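A minimal sketch of the temporal-averaging step, assuming 1-hourly cloud-fraction time series and reusing the contingency_table sketch from earlier:

```python
import numpy as np

def block_average(x, hours):
    """Average a 1-hourly time series into blocks of the given length,
    discarding any incomplete block at the end."""
    x = np.asarray(x, dtype=float)
    n = (len(x) // hours) * hours
    return x[:n].reshape(-1, hours).mean(axis=1)

# e.g. compute the scores from 6-hourly means rather than hourly values:
# a, b, c, d = contingency_table(block_average(obs_cf, 6), block_average(model_cf, 6))
```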

Statistics from the AMF
Murgtal, Germany, 2007
– 140-day comparison with the Met Office 12-km model
Dataset released to the COPS community
– Includes the German DWD model at multiple resolutions and forecast lead times

Alternative approach
How valid is it to estimate 3D cloud fraction from a 2D slice?
– Henderson and Pincus (2009) imply that it is reasonable, although presumably not in convective conditions
Alternative: treat cloud fraction as a probability forecast
– Each time the model forecasts a particular cloud fraction, calculate the fraction of time that cloud was observed instantaneously over the site
– This leads to a reliability diagram: Jakob et al. (2004)
[Diagram: reliability curve with “perfect”, “no skill” and “no resolution” reference lines]
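A sketch of how such a reliability diagram could be constructed, treating the model cloud fraction as a probability forecast of instantaneous cloud over the site; the binning choice and variable names are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

def reliability_curve(model_cf, obs_cloud, bins=10):
    """For each model cloud-fraction bin, return the mean forecast value and the
    observed frequency of instantaneous cloud over the site."""
    model_cf = np.asarray(model_cf)
    obs_cloud = np.asarray(obs_cloud)
    idx = np.minimum((model_cf * bins).astype(int), bins - 1)   # bin index; cf = 1 goes in the last bin
    x, y = [], []
    for k in range(bins):
        in_bin = idx == k
        if np.any(in_bin):
            x.append(model_cf[in_bin].mean())
            y.append(obs_cloud[in_bin].mean())
    return np.array(x), np.array(y)

# model_cf: forecast cloud fraction per time step; obs_cloud: 1 if cloud observed over the site, else 0
# x, y = reliability_curve(model_cf, obs_cloud)
# plt.plot(x, y, "o-"); plt.plot([0, 1], [0, 1], "k--", label="perfect"); plt.legend(); plt.show()
```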