Session D7: Big Data Analysis from Classification to Dimensional reduction The curse of dimensionality in official statistics Emanuele Baldacci, emanuele.baldacci@ec.europa.eu.

Presentation transcript:

Session D7: Big Data Analysis from Classification to Dimensional reduction
The curse of dimensionality in official statistics
Emanuele Baldacci, emanuele.baldacci@ec.europa.eu, Eurostat Director, Directorate B: Methodology, Corporate statistical and IT services
Dario Buono, dario.buono@ec.europa.eu, Eurostat, Unit B.1: Methodology and corporate architecture
Fabrice Gras, fabrice.gras@ec.europa.eu, Eurostat, Unit B.1: Methodology and corporate architecture
Conference of European Statistics Stakeholders, Budapest, 20–21 October 2016

The curse of dimensionality (coined by Richard E. Bellman in 1961)
When the dimensionality increases, the volume of the space increases so fast that the available data become sparse.
To obtain a statistically significant result, the amount of data needed often grows exponentially with the dimensionality.
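The exponential growth mentioned on this slide can be illustrated with simple arithmetic. The sketch below is not taken from the presentation: the function names and the choice of 10 points per axis are assumptions made for illustration. It counts how many points a regular grid needs to keep the same resolution along every axis, and how quickly a centred sub-cube empties out as dimensions are added.

```python
def grid_points_needed(dimensions: int, points_per_axis: int = 10) -> int:
    """Points required to keep `points_per_axis` samples along every axis."""
    return points_per_axis ** dimensions


def central_cube_share(dimensions: int, edge: float = 0.5) -> float:
    """Share of the unit hypercube's volume inside a centred sub-cube with the given edge."""
    return edge ** dimensions


# Illustrative dimensions only; any positive integers would do.
for d in (1, 2, 5, 10):
    print(d, grid_points_needed(d), central_cube_share(d))
```

A fixed number of observations therefore covers a rapidly shrinking share of the space, which is the sparsity the slide refers to.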

Big Data, Huge Dimensions… Sparse Activities
Dimensionality
Big Data and Macroeconomic Nowcasting & Econometrics
Selectivity methods
Mobile phone data
What's next?

Dealing with dimensionality in official statistics
Multiple sources: towards model-based statistics
Type: Huge number of time series
  Problem: Reduction of dimensionality, data snooping
  Aim: Early estimates, nowcasting, classification
  Possible methods: Shrinkage models, factor models, Bayesian models, regression trees, panel modelling
Type: High-frequency time series
  Problem: Extraction/decomposition of signal for high-frequency data, mixed frequency
  Aim: Nowcasting, data filtering and signal extraction of high-frequency time series
  Possible methods: Wavelets, ensemble mode decomposition, outlier detection and extreme events theory, state space modelling, (U)-MIDAS
Type: Huge number of dimensions
  Problem: Curse of dimensionality (sampling, distance functions)
  Aim: Data mining: machine learning, clustering, classification
  Possible methods: Bayesian inference, alternative distances, state space models

Dimensionality challenges
Data access, storage and dissemination
Data analytics
Moving towards more model-based statistics while preserving the robustness and quality of existing official statistics
NSIs will need to pay more and more attention to the "curse of dimensionality" in the future

Data storage: a possible solution is data virtualisation

Data analytics: the way to go
Use of all the informational content included in models.
Model-based statistics: trade-off between the robustness and precision properties of model-based statistics.
Assessment of scenarios based on estimation of density functions.
Presentation of indicators based on clustering of some contextual variables.

The curse of dimensionality & Data Modelling
Data snooping: among an infinite number of candidate models, there will always be an apparent winner
Distance: assessing the relevance of distance measures in high-dimensional space, use of Bayesian inference, embedding dimension of a problem (Takens' theorem)
High-frequency data: at which frequency is the signal most relevant?
Data mining for selecting regressors
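The point about distance relevance can be made concrete with a small simulation. This is a sketch under stated assumptions, not a method from the presentation: points are drawn uniformly in the unit hypercube, and `relative_contrast` is a hypothetical helper that measures how much farther the farthest point is than the nearest one.

```python
import math
import random


def relative_contrast(dimensions: int, n_points: int = 200, seed: int = 0) -> float:
    """(max - min) / min of Euclidean distances from a random query to uniform points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dimensions)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dimensions)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)


# Illustrative dimensions only.
for d in (2, 20, 500):
    print(d, round(relative_contrast(d), 3))
```

As the dimension grows the contrast shrinks, so "nearest" and "farthest" neighbours become hard to tell apart with a Euclidean distance. This is one motivation for the alternative distances mentioned in the presentation.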

Eurostat (Sparse?) activities
Big Data Macroeconomic Nowcasting, 2016
Big Data Econometrics, 2017
Selectivity in Big Data sources, ongoing
"Assessing the Quality of Mobile Phone Data as a Source of Statistics", Q2016 joint paper by Statistics Belgium, Eurostat and Proximus

Big Data Macroeconomic Nowcasting
Literature review on the use of Big Data for macroeconomic nowcasting
Use of a typology based on Doornik and Hendry (2015):
Tall data: many observations, few variables
Fat data: many variables, few observations
Huge data: many variables, many observations

Models race
Dynamic Factor Analysis
Partial Least Squares
Bayesian Regression
LASSO regression
U-MIDAS models
Model averaging
255 models tested using macro-financial and Google Trends data
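Model averaging can take many forms, and the presentation does not say which scheme was used. As one common possibility, the sketch below weights each model by the inverse of its past mean squared error. The model names, error values and forecasts are hypothetical, and a model with zero past error would need special handling.

```python
from typing import Dict, List


def inverse_mse_weights(past_errors: Dict[str, List[float]]) -> Dict[str, float]:
    """Weight each model by the inverse of its mean squared past error."""
    inverse = {}
    for name, errors in past_errors.items():
        mse = sum(e * e for e in errors) / len(errors)
        inverse[name] = 1.0 / mse  # assumes mse > 0
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def combine(forecasts: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of the individual model forecasts."""
    return sum(weights[name] * forecasts[name] for name in forecasts)


# Hypothetical past nowcast errors and current forecasts (illustrative numbers only).
past_errors = {
    "factor": [0.2, -0.1, 0.3],
    "lasso": [0.1, 0.1, -0.1],
    "midas": [0.4, -0.5, 0.3],
}
forecasts = {"factor": 1.0, "lasso": 1.2, "midas": 0.8}

weights = inverse_mse_weights(past_errors)
nowcast = combine(forecasts, weights)
print(weights, nowcast)
```

Recomputing the weights each period from a rolling window of errors would give a simple performance-based rotation between models.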

Statistical Methods: findings
Sparse regression (LASSO) works for fat and huge data
Data reduction techniques (PLS) are helpful for large numbers of variables
(U)-MIDAS or bridge modelling for mixed frequency
Dimensionality reduction improves nowcasting
Forecast combination: a data-driven automated strategy with model rotation based on past forecasting performance works well
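To show why sparse regression suits fat data, here is a minimal LASSO sketch using cyclic coordinate descent with soft-thresholding. It is not the implementation used in the Eurostat study. The data are synthetic, the penalty value is arbitrary, and the objective scaling (1/2n times the squared error plus the penalty times the L1 norm) is an assumption.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_sweeps=200):
    """Minimise (1/2n)*||y - X b||^2 + penalty*||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    columns = [[row[j] for row in X] for j in range(p)]
    mean_sq = [sum(v * v for v in col) / n for col in columns]
    beta = [0.0] * p
    residual = list(y)  # equals y - X b while b is all zeros
    for _ in range(n_sweeps):
        for j, col in enumerate(columns):
            # Covariance of column j with the residual that excludes its own contribution.
            rho = sum(c * r for c, r in zip(col, residual)) / n + mean_sq[j] * beta[j]
            updated = soft_threshold(rho, penalty) / mean_sq[j]
            delta = updated - beta[j]
            if delta != 0.0:
                residual = [r - delta * c for r, c in zip(residual, col)]
                beta[j] = updated
    return beta


# Synthetic "fat" data: 40 observations, 80 candidate regressors, 2 of them relevant.
rng = random.Random(0)
n_obs, n_vars = 40, 80
X = [[rng.gauss(0.0, 1.0) for _ in range(n_vars)] for _ in range(n_obs)]
y = [3.0 * row[0] - 2.0 * row[1] + rng.gauss(0.0, 0.1) for row in X]

beta = lasso_coordinate_descent(X, y, penalty=0.5)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("selected regressors:", selected)
```

Ordinary least squares has no unique solution here because there are more regressors than observations. The L1 penalty sets most coefficients exactly to zero, which makes the problem tractable.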

Follow-up: Big Data Econometrics
Review of methods to move from unstructured to structured time-series data sets for various types of big data sources, including filtering techniques for high-frequency data.
Propose modelling strategies to be tested.
Carry out further empirical tests on possible data timeliness/accuracy gains.
Big data handling tool developed as an R package.
Scientific summary for a Big Data Econometrics strategy.

Open econometric questions
[Diagram: a shrinkage model selects a different set of variables at each time point t0, t1, t2, t3, …, tn; the variables shown are X5, X123, X215, X13, X25 and X3.]
Informational content of the various selected variables over time?
How to choose the time span for estimating the model?

Big Data sources Selectivity: Main Issues
Self-selection and the resulting non-probability character of the data.
Discrepancies between big data populations and the target population.
Identification of statistical units (target population indirectly observed).
How to deal with representativeness and coverage of Big Data for sampling purposes.

Big Data sources Selectivity: Proposed methods (so far…)
Pseudo-design approach: reweighting (calibration, pseudo-empirical likelihood, weighting)
Modelling approach (M-quantile models, model-based calibration, Bayesian approach, machine learning approach)
Record linkage
New study in 2017 to go further
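As a rough illustration of the reweighting idea, the sketch below applies plain post-stratification, one of the simplest calibration-style adjustments. The age groups, population counts and observed values are hypothetical, and the approach only corrects for imbalance on variables that are known for both the big data source and the target population.

```python
from collections import Counter


def poststratification_weights(sample_strata, population_counts):
    """Give each unit the weight N_h / n_h of its stratum h."""
    sample_counts = Counter(sample_strata)
    return [population_counts[s] / sample_counts[s] for s in sample_strata]


def weighted_mean(values, weights):
    """Mean of values using the given unit weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical self-selected source that over-represents young people.
sample_strata = ["young"] * 8 + ["old"] * 2
values = [1.0] * 8 + [0.0] * 2          # e.g. 1.0 = uses the service
population_counts = {"young": 500, "old": 500}

weights = poststratification_weights(sample_strata, population_counts)
naive = sum(values) / len(values)
adjusted = weighted_mean(values, weights)
print(naive, adjusted)
```

The unweighted mean reflects the composition of the source, while the weighted mean reflects the composition of the target population. Any selection effect within a stratum remains uncorrected.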

Example (1): Selected big data sources and basic characteristics that can be derived from the data.
Sources (columns): Mobile, Twitter, Google Trends
Gender: possible, limited, very limited
Age:
Residence: precise
Occupation:
Marital status:
LFS category:
Spatial aggregation: municipality, regional/cities

Example (2): Target population and Twitter population

Mobile Phone data: Clustering Time Series (1)
Assessing the Quality of Mobile Phone Data as a Source of Statistics http://www.ine.es/q2016/docs/q2016Final00163.pdf
Scaling: standardisation
Distance measure: Euclidean
Applied technique: K-means (Euclidean distance after standardisation of the time series)
Objective: find patterns enabling the classification of geographical areas into work, residential and commuting areas
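The pipeline on this slide (standardise each time series, then run K-means with Euclidean distance) can be sketched as follows. This is not the code behind the cited paper: the hourly activity profiles are invented for illustration, and the farthest-point initialisation is an assumption made so that the example is reproducible.

```python
import math
import random


def standardise(series):
    """Rescale a series to zero mean and unit (population) standard deviation."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / len(series))
    return [(v - mean) / std for v in series]


def k_means(points, k, n_iterations=20):
    """Plain K-means with Euclidean distance; returns one cluster label per point."""
    # Deterministic farthest-point initialisation (an assumption, for reproducibility).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(n_iterations):
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical 24-hour activity profiles for three kinds of area.
WORK = [1.0 if 9 <= h <= 17 else 0.0 for h in range(24)]
RESIDENTIAL = [1.0 if h <= 6 or h >= 19 else 0.0 for h in range(24)]
COMMUTING = [1.0 if h in (7, 8, 18) else 0.0 for h in range(24)]

rng = random.Random(0)


def noisy(profile, level):
    """Scale a profile to an activity level and add a little noise."""
    return [level * (v + rng.gauss(0.0, 0.05)) for v in profile]


# Three areas per profile, with very different overall activity levels.
series = [noisy(profile, level)
          for profile in (WORK, RESIDENTIAL, COMMUTING)
          for level in (50, 500, 5000)]
labels = k_means([standardise(s) for s in series], k=3)
print(labels)
```

Standardisation matters here because the areas differ by orders of magnitude in activity level. Without it, Euclidean distance would group areas by volume rather than by the shape of their daily pattern.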

More Eurostat Big Data activities
List of pilot projects (Specific Grant Agreement)
Web scraping: job vacancies; enterprise characteristics
Smart meters: electricity consumption; temporary vacant dwellings
Automatic Identification System (ships)
Flight reservation systems for true origin/destination: Amadeus database
Aircraft tracking system: satellite system to receive flight positions from aircraft
Wikipedia as a source for statistics: cultural and regional statistics

What's next
European Big Data Hackathon, 15-17 March 2017, Brussels
European Statistical Training Courses in 2017

ESTP courses supporting big data (2017)
[Timetable of big data courses and methodology courses by quarter (Q1 to Q4); actual dates will be known in November 2016.]
Big data sources - Web, Social media and text analytics
Introduction to big data and its tools
Hands-on immersion on big data tools
Nowcasting
Advanced big data sources - Mobile phone and other sensors
The use of R in official statistics: model based estimates
Can a statistician become a data scientist?
Time-series econometrics

Thank you for your attention
Questions welcome
References:
Clément Marsilli, Variable Selection in Predictive MIDAS Models, Document de travail 520, Banque de France, https://www.banque-france.fr/uploads/tx_bdfdocumentstravail/DT-520.pdf
Eurostat, Big data and macroeconomic nowcasting, preliminary results presented at the ESS methodological working group (7 April 2016, Luxembourg), http://ec.europa.eu/eurostat/cros/content/item21bigdataandmacroeconomicnowcastingslides_en
M. Verleysen, D. François, G. Simon, V. Wertz, On the effects of dimensionality on data analysis with neural networks, https://perso.uclouvain.be/michel.verleysen/papers/iwann03mv.pdf
Dennis Prangle, Summary Statistics in Approximate Bayesian Computation, https://arxiv.org/pdf/1512.05633.pdf
Big data CROS portal, http://ec.europa.eu/eurostat/cros/content/big-data_en