1
Clustering Techniques and their applications at ECMWF
Susanna Corti European Centre for Medium-Range Weather Forecasts
2
Outline
Cluster analysis - Generalities
  Metrics
  Clustering techniques
  Suitable (sub)spaces of states
  K-means method
  Significance
  Number of clusters
  Examples
New cluster product at ECMWF
  Concept
  Visualisation on ECMWF web-site
  Predictability of Euro-Atlantic regimes
3
Cluster analysis - Generalities
“Cluster analysis deals with separating data into groups whose identities are not known in advance. In general, even the “correct number” of groups into which the data should be sorted is not known in advance.” Daniel S. Wilks
Examples of the use of cluster analysis in the weather and climate literature:
Grouping daily weather observations into synoptic types (Kalkstein et al. 1987)
Defining weather regimes from upper-air flow patterns (Mo and Ghil 1998; Molteni et al. 1990)
Grouping members of forecast ensembles (Tracton and Kalnay 1993; Molteni et al. 1996; Legg et al. 2002)
General comment: the authors seem to confuse clustering and regimes. "Clustering" means that not all weather patterns are equiprobable. One can aggregate the situations (e.g. Z700 fields) around "centroids", and these clusters have a minimum internal variance. This is a spatial property, and it is unchanged if the chronology is shuffled. "Regime" means that some situations are persistent, and if transition days are excluded, one stays from one to several weeks inside the same regime. The present paper deals essentially with clustering, although in some places the notion of duration inside a cluster is mentioned. Clustering is a very useful technique, which strongly reduces the number of degrees of freedom and facilitates the analysis. It can be applied to many fields, but requires long datasets. Regime analysis is limited to very few fields where attractors can be identified.
4
Example – Grouping members of Forecast Ensembles
5
Clustering analysis - Metrics
“Central to the idea of the clustering of the data points is the idea of distance. Clusters should be composed of points separated by small distances, relative to the distances between clusters.” Daniel S. Wilks
Weighted Euclidean distance between two vectors xi and xj
Minkowski metric
Other possible choices of distance:
Karl Pearson distance: the weights are the reciprocals of the corresponding variances (i.e. wk = 1/sk,k)
Anomaly correlation
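The distance measures listed above appear as formulas on the original slide; written out in standard notation (as in Wilks, Statistical Methods in the Atmospheric Sciences), they are:

```latex
% Weighted Euclidean distance between state vectors x_i and x_j
d_{ij} = \left[ \sum_{k=1}^{K} w_k \left( x_{i,k} - x_{j,k} \right)^2 \right]^{1/2}

% Minkowski metric (lambda = 2 recovers the Euclidean case, lambda = 1 the city-block distance)
d_{ij} = \left[ \sum_{k=1}^{K} w_k \left| x_{i,k} - x_{j,k} \right|^{\lambda} \right]^{1/\lambda}

% Karl Pearson distance: weighted Euclidean distance with w_k = 1 / s_{k,k},
% i.e. the reciprocal of the variance of the k-th coordinate
```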
6
Cluster analysis - Clustering techniques
The term clustering techniques refers to a set of different methodologies used to identify groups of elements gathered together in a (suitable) (sub)space of states. These methodologies fall into essentially three typologies.
Hierarchical Clustering: a hierarchy of sets of groups, each of which is formed by merging one pair from the collection of previously defined groups. We start with a number of groups equal to the number n of observations/states. The first step is to find the two groups that are closest and combine them into a new group. Once a data vector has been assigned to a group, it is not removed. This process continues until, at the final (n-k)th step, all the n observations have been aggregated into k groups.
Nonhierarchical Clustering: methods that allow reassignment of observations as the analysis proceeds.
Bump-hunting methods: the quest for local maxima in the probability density function of states.
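As an illustration of the hierarchical (agglomerative) procedure described above, here is a minimal sketch using SciPy; the data shape and the choice of Ward linkage are assumptions for illustration, not part of the ECMWF product.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data set: n states described by a few principal-component coordinates.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4))        # 500 states, 4 PCs (illustrative only)

# Agglomerative clustering: start from 500 singleton groups and repeatedly
# merge the two closest groups (Ward linkage minimises the within-group variance).
Z = linkage(X, method="ward")

# Cut the hierarchy at the level where k groups remain.
k = 4
labels = fcluster(Z, t=k, criterion="maxclust")
print(np.bincount(labels)[1:])           # population of each of the k groups
```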
7
Clustering analysis – Suitable (Sub)spaces of states
Clustering techniques are effective only if applied in an L-dimensional phase space with L << N (N = number of elements in the data set in question). If the actual space of states is too large (e.g. 500 maps with 25x45 grid points) it is advisable to compute the clusters in a suitable sub-space, for example via EOF decomposition. The first EOF explains the maximum fraction of the variance of the original data set. The second explains the maximum amount of the remaining variance with a function which is orthogonal to the first, and so on. To be useful, EOF analysis must result in a decomposition of the data in which a large fraction of the variance is explained by the first few EOFs. (Figure: explained variance and accumulated variance as a function of EOF number.)
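A minimal sketch of an EOF decomposition via SVD of the anomaly matrix, assuming a hypothetical set of 500 maps on a 25x45 grid; the clustering would then be performed on the leading principal components.

```python
import numpy as np

rng = np.random.default_rng(0)
fields = rng.standard_normal((500, 25 * 45))      # 500 maps flattened to grid-point vectors

# Anomalies with respect to the time mean.
anom = fields - fields.mean(axis=0)

# SVD: rows of Vt are the EOF patterns, U*S the principal components (PCs).
U, S, Vt = np.linalg.svd(anom, full_matrices=False)
explained = S**2 / np.sum(S**2)                   # fraction of variance per EOF
accumulated = np.cumsum(explained)                # accumulated variance

# Retain enough EOFs to explain, say, 80% of the variance, and cluster in that sub-space.
L = int(np.searchsorted(accumulated, 0.80)) + 1
pcs = U[:, :L] * S[:L]                            # PC coordinates of the 500 states
print(L, accumulated[L - 1])
```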
8
S3- DJFM 2009-2010 NAO- frequency
Examples of bump-hunting: PDF of the atmospheric states in the (EOF1, EOF2) plane and S3 DJFM NAO- frequency (after Corti, Molteni and Palmer 1999, Nature).
9
Cluster analysis - K-means method
The most widely used non-hierarchical clustering approach is the K-means method. K is the number of clusters into which the data will be grouped (this number must be specified in advance). For a given number k of clusters, the optimum partition of the data into k clusters is found by an algorithm that takes an initial cluster assignment (based on the distance from pseudorandom seed points) and iteratively changes it by assigning each element to the cluster with the closest centroid, until a ‘‘stable’’ classification is achieved. (A cluster centroid is defined by the average of the PC coordinates of all states that lie in that cluster.) This process is repeated many times (at least 100, using different seeds), and for each partition the ratio r*k of the variance among cluster centroids (weighted by the population) to the average intra-cluster variance is recorded. The partition that maximises this ratio is the optimal one.
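A minimal sketch of this procedure with NumPy/scikit-learn, under the assumption that the states are already expressed as PC coordinates (array pcs); this is an illustration of the variance-ratio criterion, not the operational ECMWF code.

```python
import numpy as np
from sklearn.cluster import KMeans

def variance_ratio(pcs, labels, centroids):
    """Ratio of the population-weighted variance among cluster centroids
    to the average intra-cluster variance."""
    grand_mean = pcs.mean(axis=0)
    pops = np.bincount(labels)
    among = np.sum(pops * np.sum((centroids - grand_mean) ** 2, axis=1))
    within = np.mean([np.sum((pcs[labels == c] - centroids[c]) ** 2, axis=1).mean()
                      for c in range(len(centroids))])
    return among / within

def best_kmeans_partition(pcs, k, n_seeds=100):
    """Repeat K-means with different seeds and keep the partition maximising r*_k."""
    best = None
    for seed in range(n_seeds):                       # at least 100 different seeds
        km = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(pcs)
        r = variance_ratio(pcs, km.labels_, km.cluster_centers_)
        if best is None or r > best[0]:
            best = (r, km)
    return best                                        # (r*_k, fitted model)
```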
10
Cluster analysis - Significance
The goal is to assess the strength of the clustering compared to that expected from an appropriate reference distribution, such as a multidimensional Gaussian distribution. In assessing whether the null hypothesis of multi-normality can be rejected, it is therefore necessary to perform Monte-Carlo simulations using a large number M of synthetic data sets. Each synthetic data set has precisely the same length as the original data set against which it is compared, and it is generated from a set of n first-order Markov (red-noise) processes, whose mean, variance and first-order auto-correlation are obtained from the observed data set. A cluster analysis is performed for each of the simulated data sets. For each k-partition the ratio rmk of the variance among cluster centroids to the average intra-cluster variance is recorded. Since the synthetic data are assumed to have a unimodal distribution, the proportion Pk of red-noise samples for which rmk < r*k measures the confidence level for the existence of k clusters in the actual data, with 1 - Pk the corresponding significance (p-value) level.
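A sketch of this Monte-Carlo test, assuming the observed PCs are in pcs and reusing the hypothetical best_kmeans_partition helper from the previous sketch; each synthetic data set is built from independent first-order auto-regressive (red-noise) processes fitted to each PC.

```python
import numpy as np

def ar1_surrogate(series, rng):
    """Red-noise surrogate with the same mean, variance and lag-1 autocorrelation."""
    x = series - series.mean()
    a = np.corrcoef(x[:-1], x[1:])[0, 1]              # lag-1 autocorrelation
    noise = rng.standard_normal(len(x)) * x.std() * np.sqrt(1.0 - a**2)
    out = np.empty(len(x))
    out[0] = x[0]
    for t in range(1, len(x)):
        out[t] = a * out[t - 1] + noise[t]
    return out + series.mean()

def cluster_significance(pcs, k, r_star, n_mc=100, seed=0):
    """Proportion of red-noise samples whose variance ratio is below the observed r*_k."""
    rng = np.random.default_rng(seed)
    count = 0
    for _ in range(n_mc):
        synth = np.column_stack([ar1_surrogate(pcs[:, j], rng) for j in range(pcs.shape[1])])
        r_mk, _ = best_kmeans_partition(synth, k, n_seeds=20)   # fewer seeds to keep it cheap
        count += r_mk < r_star
    return count / n_mc                                # confidence in the k-cluster partition
```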
11
Cluster analysis - How many clusters?
The need to specify the number of clusters can be a disadvantage of the K-means method if the best cluster partition of the data set in question is not known in advance. However, some criteria can be used to choose the optimal number of clusters.
Significance: the partition with the highest significance with respect to the predefined multi-normal (red-noise) reference distributions.
Reproducibility: we can use as a measure of reproducibility the mean-squared error between best-matching cluster centroids obtained from N pairs of randomly chosen half-length datasets drawn from the full one. The partition with the highest reproducibility will be chosen.
Consistency: the consistency can be assessed both with respect to variable (for example comparing clusters obtained from dynamically linked variables) and with respect to domain (tests of sensitivity with respect to the lateral or vertical domain).
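For the reproducibility criterion, one possible sketch (an assumed matching metric for illustration, again reusing the hypothetical best_kmeans_partition helper, not the operational code) splits the data set into random halves, clusters each half, matches the centroids and records the mean-squared error between matched centroids:

```python
import numpy as np
from itertools import permutations

def reproducibility(pcs, k, n_pairs=20, seed=0):
    """Mean-squared error between best-matching centroids of random half-length samples
    (smaller values = more reproducible k-cluster partition)."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_pairs):
        idx = rng.permutation(len(pcs))
        half = len(pcs) // 2
        _, km_a = best_kmeans_partition(pcs[idx[:half]], k, n_seeds=20)
        _, km_b = best_kmeans_partition(pcs[idx[half:]], k, n_seeds=20)
        ca, cb = km_a.cluster_centers_, km_b.cluster_centers_
        # Best matching of the two sets of centroids over all permutations (fine for small k).
        best = min(np.mean((ca - cb[list(p)]) ** 2) for p in permutations(range(k)))
        errors.append(best)
    return float(np.mean(errors))
```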
12
Use of K-means to identify regime structures in observed and simulated datasets
Figure: regime structures (NAO+, Blocking, Atlantic Ridge, NAO-) in ERA reanalysis and in simulations at T159 and T1279 resolution. From "Simulating regime structures in weather and climate prediction models", A. Dawson, T. N. Palmer, and S. Corti, GRL 2012.
13
New cluster product at ECMWF
The ECMWF clustering is one of a range of products that summarise the large amount of information in the Ensemble Prediction System (EPS). The clustering gives an overview of the different synoptic flow patterns in the EPS. The members are grouped together based on the similarity between their 500 hPa geopotential fields over the North Atlantic and Europe. The new cluster products were implemented in operations in November; they are archived in MARS and available to forecast users through the operational dissemination of products. A graphical product using the new clustering is available for registered users on the ECMWF web site: newclusters/
14
New cluster product at ECMWF: large scale climatological regimes
To put the daily clustering in the context of the large-scale flow and to allow the investigation of regime changes, the new ECMWF clustering contains a second component: each cluster is attributed to one of a set of four pre-defined climatological regimes.
Positive phase of the North Atlantic Oscillation (NAO+)
Euro-Atlantic blocking
Negative phase of the North Atlantic Oscillation (NAO-)
Atlantic ridge
15
Climatological Regimes in the cold season Euro-Atlantic Region
500 hPa geopotential height – 29 years of ERA-Interim, ONDJFM. Regime frequencies: Positive NAO 32.3%, Negative NAO 21.4%, Euro-Atlantic Blocking 26.1%, Atlantic Ridge 20.2%.
16
New cluster product at ECMWF: 2-stage process
1st step (to be done once per season): identification of the climatological weather regimes over selected regions for every season.
2nd step (to be done for every forecast): identification of forecast scenarios from the real-time EPS forecasts, and association of each forecast scenario with the closest climatological weather regime.
First: for each season and each selected region we define climatological weather regimes which identify the main large-scale patterns of the region. The identification is based on a reanalysis dataset. Second: we identify specific forecast scenarios from the real-time EPS forecasts. This is done for different time windows. NB: the same clustering algorithm, based on a version of the k-means method modified by Straus and Molteni (J. Clim. 2004), is used for steps 1 and 2, but in the first case the clusters represent “climatological regimes”, while in the second they are aggregations of EPS trajectories in the specific time window, which we call “forecast scenarios”. Third: we associate each forecast scenario with the closest climatological regime. So far we have been focusing on the European area during the winter season; from here on I’m going to present results for the past winter.
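A sketch of the attribution step, under the assumption that both the climatological regime centroids and the forecast-scenario centroids are expressed in the same PC space; the closest regime is taken here as the one at minimum Euclidean distance (maximum anomaly correlation would be an equally plausible choice), and all numbers are made up for illustration.

```python
import numpy as np

def attribute_scenarios(scenario_centroids, regime_centroids, regime_names):
    """Associate each forecast scenario with the closest climatological regime."""
    attribution = []
    for c in scenario_centroids:
        d = np.linalg.norm(regime_centroids - c, axis=1)   # distance to each regime centroid
        attribution.append(regime_names[int(np.argmin(d))])
    return attribution

# Illustrative use with made-up 2-PC coordinates:
regimes = np.array([[2.0, 0.0], [-1.0, 1.5], [-2.0, 0.0], [0.5, -1.5]])
names = ["NAO+", "Blocking", "NAO-", "Atlantic ridge"]
scenarios = np.array([[1.5, 0.3], [-1.2, 1.0]])
print(attribute_scenarios(scenarios, regimes, names))      # ['NAO+', 'Blocking']
```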
17
Regimes & Scenarios R1 NAO + R2 Blocking R3 NAO - R4 Atl-Ridge
Let’s try to explain our two-step strategy using a simple cartoon in 2 dimensions. There are 4 cells along the y axis (the fixed regimes), while the x axis represents the lead time in days. We consider four different time windows: days 3-4, 5-7, 8-10 and 11-15. Let’s start the integration now; only 10 members are pictured for simplicity. Then, for each chosen time window, we compute the optimal number of weather scenarios which should provide the best possible grouping of the 51 EPS members. In this case we have 3 in the first window, 2 in the second, 3 in the third and 4 in the fourth (we allow a maximum of 6 scenarios). Each forecast scenario is represented by the EPS member closest to the cluster mean. Furthermore it is important to notice that, within a time window, scenarios are consistent: the clustering is computed for all the lead times in the window, so if for example member 4 belongs to S1 at lead time 5, it will belong to S1 at lead times 6 and 7. This is consistent with the actual operational cluster product. Scenarios computed at different time windows are in principle totally independent: they can change in number, associated members and position. For each time window and for each time step, we associate the scenarios with the closest regime. Depending on the ensemble spread and on the evolving weather systems, different situations can occur: in window 3-4 all the scenarios belong to the same regime, while in window 5-7 they belong to different regimes. One can even have regime transitions within a scenario, like S4, which is associated with R3 at lead time 11, but then makes a transition and for the following lead times falls in R4. (Cartoon x-axis: lead time [days].)
18
Regime transitions within a time window
Day 5 to day 7 – February 2011 – 3 scenarios, 2 possible transitions
19
The new clustering product shows the 500 hPa geopotential height from the specific EPS members (scenarios) that best represent the cluster centroids. The scenarios are plotted as full fields, with the anomalies overlaid in colour shading. The clustering is based on the 500 hPa geopotential forecast fields and is performed over 4 sections of the forecast trajectories: 72-96, 120-168, 192-240 and 264-360 hours. When the ensemble distribution is sufficiently homogeneous, the cluster algorithm cannot partition the ensemble in a meaningful way; in these cases the ensemble member that best represents the ensemble mean is shown. Each scenario, defined at 500 hPa, is associated with one of 4 pre-defined large-scale climatological regimes. The frame colour of each plot represents the association with the climatological regimes: a red frame indicates that the scenario's large-scale features resemble Euro-Atlantic blocking, blue indicates association with the positive NAO pattern, green with the negative NAO and violet with the Atlantic ridge pattern.
20
Transition from blocking to westerly flow:
11 January 2009 – 18 January 2009. Now we will focus on a specific case: a transition from blocking to westerly flow, and how it has been represented in the EPS. Here are the analyses of geopotential at 500 hPa and 2 m temperature for 11 and 18 January 2009. We can see a transition from a typical blocked flow to a zonal one, and the negative temperature anomaly over central Europe during the blocking, which evolves toward milder conditions. This case was pointed out to us by Thomas Shumann. We took the forecasts started on 11 January and looked at the time window of 5-7 days. It has to be pointed out that the operational clustering gave 1 cluster in this time window, which represented a blocked flow.
21
Regime transition within a time window
Lead time [days] – regimes R2 Blocking and R1 NAO+ – scenarios S1 to S5. Here we show how the new cluster product might look on the web. Each row represents a weather scenario, each column a time step in the chosen window. Each map frame is coloured with the regime to which the scenario belongs. Here we show three time steps: day 5 (16 Jan), day 6 (17 Jan) and day 7 (18 Jan). For that time window we found 5 clusters, each one represented by a weather scenario, that is the ensemble member closest to the cluster mean. All 5 scenarios were blocked at lead times 5 and 6, but 2 of them make a transition to the NAO+ regime at lead time 7. This case is illustrated in the cartoon, where the big grey trajectories represent the 5 scenarios, that is the five ensemble members representative of each cluster.
22
Scenarios fall into two different Regimes at day 7
Scenarios S1-S5 at day 7, with the positive NAO and blocking regimes shown as references. Here we can see the 5 scenarios at day 7 in more detail: contours show the geopotential at 500 hPa, anomalies are shaded. The two weather regimes are reported for reference. The number of ensemble members belonging to each scenario is shown in the top right corner of each panel.
23
Verification & spread: large-scale fixed climatological regimes (compared with the verifying analysis)
24
Verification & spread EPS scenarios z500 t+360 Spread
BSS      T+96    T+168   T+240   T+360
NAO+     0.55    0.39    0.26    0.24
NAO-     0.88    0.71    0.47    0.28
BLO      0.98    0.97    0.81    0.61
ATL R    0.74    0.40    0.12

(EPS scenarios, z500, t+360 spread.) This is a similar time series, but for the EPS scenarios computed at day 15. It is interesting to see that the spread is much larger, consistent with the fact that the scenarios for each day present different attributions. The period in February with relatively small spread still shows just one cluster, mainly attributed to the negative NAO regime, as in the verifying analysis. The table shows the BSS computed for this period. The attribution of each weather scenario to the climatological regimes provides an additional measure of the differences between scenarios in terms of the large-scale flow.
25
Verification: Continuous Ranked Probability Score (CRPS)
Scenario distribution; full EPS (50 members); reduced EPS; ensemble mean. The Continuous Ranked Probability Score (CRPS) evaluates the performance of the EPS as a probabilistic forecast; it is negatively oriented, so smaller values indicate better forecast performance. The CRPS computed for a deterministic forecast is the same as the mean absolute error. This figure compares the CRPS of the probabilistic forecast based on the full 51-member EPS distribution with the CRPS using probabilities derived from the number of members in each EPS cluster (referred to as the ‘scenario distribution’). Also shown are the CRPS values for five reduced-size ensembles with a maximum of six members, and for the ensemble mean.
The scenario distribution is constructed by assuming that each scenario perfectly represents all the other members belonging to that cluster. Consider a case with three clusters with populations of 11, 22 and 18 members: the scenario distribution is then computed taking 11 times scenario 1, 22 times scenario 2 and 18 times scenario 3. The additional reduced-size ensembles are constructed by extracting each day a number of random members equal to the number of clusters obtained for that day. The performance of the scenario distribution is significantly better than that of any random reduced-size ensemble. This indicates that the cluster scenarios are better at representing the whole EPS distribution. The result for the ensemble mean, being indistinguishable from the reduced ensemble with six members, indicates that a randomly chosen ensemble with a maximum size of six has a CRPS equivalent to that of the ensemble mean.
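A sketch of how the ‘scenario distribution’ CRPS could be computed for a single scalar quantity, using the standard ensemble form of the CRPS; the member values are made up, the populations 11, 22 and 18 follow the example above, and the real product of course works on full fields rather than one number.

```python
import numpy as np

def crps_ensemble(members, weights, obs):
    """CRPS of a weighted empirical ensemble: E|X - y| - 0.5 * E|X - X'|."""
    members = np.asarray(members, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    term1 = np.sum(w * np.abs(members - obs))
    term2 = 0.5 * np.sum(w[:, None] * w[None, :] * np.abs(members[:, None] - members[None, :]))
    return term1 - term2

# Three scenarios with cluster populations 11, 22 and 18 (as in the example above):
scenario_values = [5.2, 3.1, 4.0]          # hypothetical forecast values of some scalar quantity
populations = [11, 22, 18]                  # each scenario stands in for its whole cluster
observation = 3.6
print(crps_ensemble(scenario_values, populations, observation))
```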
26
Verification: last season
Large-scale fixed climatological regimes and EPS scenarios, z500 at t+168 (7 days), 1 December 2012 to April 2013.
27
Predictability of Atlantic weather regimes from medium to extended forecast ranges
ODRD Meeting March 2012 Laura Ferranti
28
Anomaly correlations of forecasts initiated with the 4 Atlantic regimes:
All forecasts in the extended seasons November-April, up to November-March 2012. Considering all the forecasts in the extended season November to April for the most recent 5 years, I have computed some scores stratifying the forecasts according to their initial conditions. The ACC shows that between day 9 and day 13 the forecasts initiated in a blocking or Atlantic-ridge flow type have less skill than forecasts initiated in NAO-. By day 15, forecasts initiated in blocking have the lowest ACC and forecasts initiated in NAO- appear to be the most skilful.
29
RMSE normalised by the analysis standard deviation; persistence distribution (bins: 1-4, 5-8, 9-12, 13-16, >17 days).
If we look at the RMSE normalised by the standard deviation of the analysis, consistent with the ACC we can see that the forecasts initiated in NAO- have smaller RMSE than those initiated in blocking conditions. The results so far are consistent with the findings of Mio Matsueda in his analysis of TIGGE data. I need to assess the significance of these results.
30
Brier Skill Scores by forecast range and initial regime:

Forecast Range         All regimes      NAO+             Blocking         NAO-             Atl. Ridge
1-7 days               0.88             0.92             0.91             0.85             0.82
5-11 days              0.35             0.25             0.42             0.20             0.39
12-18 days             0.09 (-0.06)     0.21 (0.03)      0.0 (0.13)       0.15 (-0.17)     -0.11 (-0.51)
19-25 days             -0.03            -0.16            -0.11            0.0
26-32 days             -0.04            -0.20
1-30 days (month 1)    0.23 (S3 0.13)   0.32 (S3 0.26)   0.17 (S3 0.05)   (S3 0.25)        0.05 (S )
31-60 days (month 2)   -0.07            -0.05
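For reference, the Brier score and Brier skill score used in the table follow the standard definitions (a reminder of the usual formulation, not an ECMWF-specific one):

```latex
% Brier score of a probabilistic forecast p_t for a binary event o_t (occurrence of a given regime)
\mathrm{BS} = \frac{1}{N} \sum_{t=1}^{N} \left( p_t - o_t \right)^2

% Brier skill score relative to the climatological reference forecast
\mathrm{BSS} = 1 - \frac{\mathrm{BS}}{\mathrm{BS}_{\mathrm{clim}}}
```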
31
Summary
The revised cluster product provides users with a set of weather scenarios that appropriately represent the ensemble distribution.
The classification of each EPS scenario in terms of pre-defined climatological regimes provides an objective measure of the differences between scenarios in terms of large-scale flow patterns. This attribution enables flow-dependent verification and a more systematic analysis of EPS performance in predicting regime transitions.
The accuracy of the product can be quantified, and the use of climatological weather regimes allows flow-dependent skill measures.
This clustering tool can be used to create EPS clusters tailored to the users' needs (e.g. different domain, different variables).
It is important to mention that the probabilistic scores depend largely on the ensemble size: the smaller the ensemble size, the more sensitive the scores. For ensembles larger than 20-25 members the probabilistic scores start to be less responsive to the ensemble size. A large ensemble provides a more detailed and more reliable estimate of the forecast distribution, so it is not surprising that the full EPS provides a better probabilistic forecast than using just the scenarios. However, the clustering products represent a compromise between the advantage of condensing forecast information into a few EPS scenarios and the disadvantage of losing information associated with the full 51 EPS members. The difference in CRPS between the whole EPS distribution and the scenario distribution reminds users of the extent of such a compromise.
32
Summary
There is skill in predicting the Euro-Atlantic weather regimes up to day 18 in the monthly forecast and up to month 1 in the seasonal forecast (not much signal at days 19-25).
Sys4 skill is improved with respect to Sys3 for all the regimes.
Flow-dependent scores indicate that forecasts initiated in blocking conditions have less skill (from day 9 to day 13) than forecasts initiated in NAO- conditions. Further assessment is in progress.