ESSnet Big Data Dissemination Workshop, Sofia

WP6: Early estimates
Boro Nikic

WP6 goals
WP 6 (and 7). The aim of this pilot is to investigate multiple big data, administrative and other existing sources in order to produce early estimates for statistical purposes. For WP 6 (combining sources), the project aims to implement the phases 'data access' and 'data handling' during the first 12 months. The phases 'methodology and techniques' and 'statistical outputs' are carried out in the second SGA period. The exception to this rule is the quick win on turnover estimates.

Partners and working days in the period:
Partner  SI   FI  NL  PL
Days     100  60  10  40

Work done in period of SGA1 (1)
- Several brainstorming meetings were organized at the Statistical Office of the Republic of Slovenia (March 2016)
- Collaboration with the WP7 team
- Detailing the path of the pilot on nowcasting the turnover indicators (April 2016)
- Investigation of sources and methodology for calculating the Consumer Confidence Indices (April 2016)
- Search for additional ideas for "quick wins" (newsfeeds, Google Trends, ...; May 2016)

Work done in period of SGA1 (2)
- Joint meeting of WP6 and WP7 members in Warsaw (June 2016)
- Finalising the proposal for SGA2 (November 2016)
- Results of the nowcasting experiment (Statistics Finland, January 2017)
- Results of the nowcasting experiment (SURS, January 2017)

Proposed pilot for SGA2 (1)
Title of the pilot: Early estimates of economic indicators
Main economic indicators:
- Gross domestic product (GDP)
- Consumer price index (CPI)
- Retail sales
- Balance of payments
- Economic sentiment indicators
- New leading economic indicators

Proposed pilot for SGA2 (2)
Aim of the pilot:
- Investigate multiple Big Data and other existing sources for the purpose of early estimates of at least one of the main economic indicators (partly in SGA1).
- Create and test a methodology for producing early estimates for at least one of the main economic indicators.
- Define and test quality measures that assess the quality of the sources, the statistical production and the statistical results.
Multinational dimension:
- Many of the sources are available in most of the countries, so it is possible to test them and produce results for more than one country.
- Even if a country does not have access to any Big Data source, it is still possible to test methods and processes on administrative and other existing sources.

Big Data sources (2)
Big Data source and availability at SURS:
- Job vacancy ads from job portals: Yes
- Traffic loops
- Social media data (Twitter, Facebook, ...): ?
- Data from supermarket chains
- Mobile phone data
- Transaction data from banks

Traffic loops in Slovenia
- Around 660 traffic loops
- 10 categories of vehicles
- Frequencies of each vehicle type are available in 15-minute intervals
- Data since 2005
- Sample data already at SURS

Existing sources (2)
SURS survey / administrative sources (monthly), with timing of dissemination and of availability of the majority of data:
- Business tendencies: t-5
- Short-term statistics (industry, construction, services, trade): t+30-60, t+20-30
- Foreign trade: t+40
- Building permits: t+20 (2017), t+5
- Demography of enterprises (SBR): t+20-25
- VAT data (FURS): t+45, t+20 (submission deadline)
- Wages
- …

Nowcasting turnover indices
- One of the pilots that was started in WP6: Statistics Finland (basic proposal), Statistical Office of the Republic of Slovenia
- Interesting methodological suggestions for estimating early economic indicators → SURS decided to start with this idea
- Modeling isn't new, but it is very often used in connection with big data sources.
- Modeling is very useful for estimating early economic indicators based on many different data sources.

Nowcasting model (1)
The idea and a short example of the model came from partners at Statistics Finland. The nowcasting model consists of 2 stages:
- Principal component analysis (PCA) is used to extract principal components from enterprise data. For each enterprise included in the model, a time series without any missing values is needed. Then the first few principal components are chosen.
- Linear regression is used: the time series of interest (e.g. GDP) is the dependent variable (Y) and the chosen principal components are the predictors (X1, …, Xn). A seasonal component and other predictors can be added.
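The two stages can be sketched in a few lines of Python. This is only an illustration under assumptions, not the SURS or Statistics Finland implementation: the function name `nowcast`, the standardization of the enterprise series, the use of NumPy and the synthetic data are all choices made for the example. The regression is fitted without the last period, and its coefficients are then applied to the last period's component scores.

```python
import numpy as np

def nowcast(enterprise, target, n_components):
    """Two-stage nowcast: PCA on enterprise series, then OLS on the components.

    enterprise: (T, N) array, one row per period, one column per enterprise,
                no missing values; the last row is the period to nowcast.
    target:     (T-1,) array, the series of interest for the first T-1 periods.
    """
    # Stage 1: PCA via SVD of the standardized data (standardizing is an assumption).
    z = (enterprise - enterprise.mean(axis=0)) / enterprise.std(axis=0)
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]  # principal component scores

    # Stage 2: OLS with intercept, fitted on all periods except the last one.
    x = np.column_stack([np.ones(len(scores)), scores])
    coef, *_ = np.linalg.lstsq(x[:-1], target, rcond=None)
    return float(x[-1] @ coef)  # prediction for the last period

# Synthetic data: 32 quarters and 50 hypothetical enterprises driven by one factor.
rng = np.random.default_rng(0)
T, N = 32, 50
factor = np.cumsum(rng.normal(size=T))
enterprise = np.outer(factor, rng.uniform(0.5, 1.5, N)) + rng.normal(scale=0.1, size=(T, N))
gdp = 100 + 2 * factor  # made-up target series

estimate = nowcast(enterprise, gdp[:-1], n_components=2)
print(round(estimate, 2), round(float(gdp[-1]), 2))
```

A seasonal dummy or a sentiment indicator would enter as extra columns of `x` in stage 2.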

Nowcasting model (2)
Possibilities considered for the model:
- Time series of interest: GDP in constant prices (chain-linked volumes, reference year 2010) from 2008Q1 to 2015Q4.
- 8 different testing spans: the first period is always 2008Q1; the last period is 2014Q1, 2014Q2, …, or 2015Q4.
- 3 different sets of enterprise data:
  D1: turnover in industry
  D2: turnover in retail trade
  D3: turnover in industry and retail trade (i.e. D1 and D2 together)

Nowcasting model (3)
5 different conditions for choosing enterprise data:
Condition  Meaning
s10        choose only raw enterprise data that are available sooner than 10 days (i.e. by the end of the 9th day) after the end of the last period
s20        choose only raw enterprise data that are available sooner than 20 days (i.e. by the end of the 19th day) after the end of the last period
s30        choose only raw enterprise data that are available sooner than 30 days (i.e. by the end of the 29th day) after the end of the last period
s46        choose only raw enterprise data that are available sooner than 46 days (i.e. by the end of the 45th day) after the end of the last period
u          choose edited enterprise data
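As a small illustration of the s10 to s46 conditions, the filter below keeps only records received sooner than a given number of days after the end of the period. The record layout (`id`, `received`) and the dates are hypothetical; the slides do not describe how arrival dates are stored.

```python
from datetime import date

def available_by(records, period_end, cutoff_days):
    """Keep records received sooner than cutoff_days after period_end."""
    return [r for r in records if (r["received"] - period_end).days < cutoff_days]

# Hypothetical arrival dates for the reference period ending 2015-12-31.
period_end = date(2015, 12, 31)
records = [
    {"id": "P001", "received": date(2016, 1, 5)},   # day 5
    {"id": "P002", "received": date(2016, 1, 10)},  # day 10: too late for s10
    {"id": "P973", "received": date(2016, 2, 14)},  # day 45: only s46 admits it
]

s10 = [r["id"] for r in available_by(records, period_end, 10)]
s20 = [r["id"] for r in available_by(records, period_end, 20)]
s46 = [r["id"] for r in available_by(records, period_end, 46)]
print(s10, s20, s46)
```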

Nowcasting model (4)
11 different conditions for choosing principal components:
Condition  Meaning
Last5      take every p. c. whose eigenvalue's share among all eigenvalues is greater than or equal to 5%
po7        take only as many p. c. as leave at least 7 cases (time periods) per independent variable later in the linear regression
po8        take only as many p. c. as leave at least 8 cases (time periods) per independent variable later in the linear regression
po10       take only as many p. c. as leave at least 10 cases (time periods) per independent variable later in the linear regression
po15       take only as many p. c. as leave at least 15 cases (time periods) per independent variable later in the linear regression
po20       take only as many p. c. as leave at least 20 cases (time periods) per independent variable later in the linear regression
70         take enough p. c. to explain 70% (or a bit more) of the variability of the enterprise data
75         take enough p. c. to explain 75% (or a bit more) of the variability of the enterprise data
80         take enough p. c. to explain 80% (or a bit more) of the variability of the enterprise data
85         take enough p. c. to explain 85% (or a bit more) of the variability of the enterprise data
90         take enough p. c. to explain 90% (or a bit more) of the variability of the enterprise data
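The three families of rules can be written as small functions that return the number of principal components to keep. This is a sketch of one possible reading of the table: the function names and the example eigenvalues are made up, and the "cases per independent variable" rule is assumed here to count only the principal components as predictors.

```python
def n_by_share(eigenvalues, min_share=0.05):
    """Rule 'Last5': count components whose eigenvalue share is >= min_share."""
    total = sum(eigenvalues)
    return sum(1 for ev in eigenvalues if ev / total >= min_share)

def n_by_cases(n_periods, cases_per_predictor):
    """Rules 'po7' to 'po20': largest count leaving enough periods per predictor."""
    return n_periods // cases_per_predictor

def n_by_explained(eigenvalues, target):
    """Rules '70' to '90': smallest count whose cumulative share reaches target."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for k, ev in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += ev / total
        if cumulative >= target:
            return k
    return len(eigenvalues)

# Made-up eigenvalues (shares 40%, 25%, 15%, 10%, 6%, 4%) and 32 periods.
eigenvalues = [40, 25, 15, 10, 6, 4]
print(n_by_share(eigenvalues))            # 5 components have a share of at least 5%
print(n_by_cases(32, 8))                  # 4 components give 8 periods per predictor
print(n_by_explained(eigenvalues, 0.70))  # 3 components explain 80%, the first count above 70%
```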

Nowcasting model (5)
- Seasonality can be added as an additional predictor or not.
- A sentiment indicator can be added as an additional predictor or not.
Altogether 5280 models are made.
Comments on data preparation:
- Enterprise data are prepared using SAS.
- The data sets used are a good approximation of the real state. It is impossible for us to recover the true state of a given data set at a given time in the past, but we can estimate it well.
- Since we started using e-questionnaires (2013M04 in industry, 2014M01 in retail trade), data for some enterprises are available only a few days after the end of the reference period, so we are able to get early estimates based on these data.
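The total of 5280 models is consistent with taking every combination of the options listed on the preceding slides: 8 testing spans, 3 data sets, 5 data conditions, 11 component conditions, and seasonality and sentiment each on or off. The slides do not spell out this product, so the enumeration below is a reading of them rather than the original setup, and the variable names are illustrative.

```python
from itertools import product

# Last period of each of the 8 testing spans (the first period is always 2008Q1).
spans = [f"{year}Q{quarter}" for year in (2014, 2015) for quarter in (1, 2, 3, 4)]
datasets = ["D1", "D2", "D3"]
data_conditions = ["s10", "s20", "s30", "s46", "u"]
component_conditions = ["Last5", "po7", "po8", "po10", "po15", "po20",
                        "70", "75", "80", "85", "90"]
seasonality = [False, True]
sentiment = [False, True]

grid = list(product(spans, datasets, data_conditions,
                    component_conditions, seasonality, sentiment))
print(len(grid))  # 8 * 3 * 5 * 11 * 2 * 2 = 5280
```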

SURS experiment results (1)
Data: real turnover of 973 industrial enterprises in the period 2008-2015
Period   P001  P002  …  P973
2008M01  3526  214   …  66519
2008M02  4252  332   …  36012
2008M03  4111  411   …  52447
…
2015M12  5241  412   …  71025

Data: real turnover of 973 industrial enterprises in the period 2008-2015 (quarter = average of the quarter's months)
Period  P001 (Ent1)  P002 (Ent2)  …  P973 (Ent973)
2008Q1  3963         319          …
…
2015Q4  5119         422.67       …  72549
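The quarterly figures are plain averages of the three monthly values, which can be checked against the monthly figures for 2008M01 to 2008M03 shown on the previous slide. The sketch below uses only those three months; the helper names are illustrative.

```python
from statistics import mean

# Monthly turnover for the three enterprises shown on the slide.
monthly = {
    "2008M01": {"P001": 3526, "P002": 214, "P973": 66519},
    "2008M02": {"P001": 4252, "P002": 332, "P973": 36012},
    "2008M03": {"P001": 4111, "P002": 411, "P973": 52447},
}

def quarter_of(period):
    """Map a period label such as '2008M02' to its quarter, e.g. '2008Q1'."""
    year, month = period.split("M")
    return f"{year}Q{(int(month) - 1) // 3 + 1}"

def quarterly_average(monthly):
    """Average the monthly values of each enterprise within each quarter."""
    grouped = {}
    for period, values in monthly.items():
        for enterprise, value in values.items():
            grouped.setdefault(quarter_of(period), {}).setdefault(enterprise, []).append(value)
    return {q: {e: mean(v) for e, v in ents.items()} for q, ents in grouped.items()}

quarterly = quarterly_average(monthly)
print(quarterly["2008Q1"])
```

For P001 and P002 this reproduces the 2008Q1 values 3963 and 319 from the slide.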

Statistic being nowcasted: GDP at constant prices, one value per quarter from 2008Q1 to 2015Q4.

Principal component analysis: 8 chosen principal components explain around % of the variability of enterprise data
Linear regression:
- Y: GDP_CP
- X1, ..., X8: principal components
- 97% of the variability of the real GDP_CP index is explained
Comparison of GDP_CP and estimates of GDP_CP:
- Avg/max absolute errors: 43 / 117.2
- Avg/max absolute relative errors: 0.48% / 1.34%
2015Q4: original value 9401.7, estimate 9458, error 56.2, relative error 0.60%
Procedure:
- Principal component analysis: on data from 2008Q1 to 2015Q4.
- Linear regression model: the last period is removed from the chosen principal components, so that they run from 2008Q1 to 2015Q3.
- The model coefficients are then used to compute the forecast for 2015Q4.
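The error measures quoted on this slide are simple to compute. The sketch below applies them to the 2015Q4 figures; the function names are illustrative. With the estimate as printed (9458), the absolute error comes out at about 56.3 rather than 56.2, presumably because the printed estimate is rounded.

```python
def abs_error(actual, estimate):
    """Absolute error between the published value and the nowcast."""
    return abs(estimate - actual)

def rel_error_pct(actual, estimate):
    """Absolute relative error, in percent of the published value."""
    return 100 * abs(estimate - actual) / actual

def avg_and_max(errors):
    """Average and maximum of a list of errors, as reported on the slide."""
    return sum(errors) / len(errors), max(errors)

actual, estimate = 9401.7, 9458.0  # 2015Q4 values from the slide
print(f"{abs_error(actual, estimate):.1f}")       # 56.3
print(f"{rel_error_pct(actual, estimate):.2f}%")  # 0.60%
```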

Methods (ongoing work SGA1&SGA2)
- Test at least one alternative method for nowcasting economic indicators
- Include data from multiple sources (construction, services, ...)
- Test forecasting based on available data
- Prepare an inventory of nowcasting methods

Early estimates (ongoing work)
- Inventory of current practices in other countries/institutions
- Prepare a list of possible "new leading economic indicators"

IT tools involved in nowcasting of turnover indices
Statistical production: data preparation, modeling, results

IT infrastructure in SGA2
- Sandbox for non-sensitive big data (e.g. traffic loops data)
- Internal IT environments for sensitive data

Thank you for your attention!
boro.nikic@gov.si
