Italian Examples of the use of big data for producing statistics

Presentation transcript:

Italian Examples of the use of big data for producing statistics Monica Scannapieco THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Eurostat

Istat Big Data Strategy - 1
Istat (the Italian National Institute of Statistics) set up a Technical Commission to guide investments in Big Data adoption in statistical production processes.
- Duration: from February 2013 to February 2015
- Members from different areas: Official Statistics, Academia, Private Sector

Istat Big Data Strategy - 2
The Commission released a Roadmap for Big Data adoption, the result of a mixed approach that combined:
- Top-down phase: analysis of the state of the art of Big Data research and practice
- Bottom-up phase: experimentation

Istat Big Data Strategy - 3
A new Technical Commission has been in place since January 2016, with the main objective of monitoring the implementation of the roadmap.

Roadmap Short Term Actions - 1
Which use
Possible use of Big Data sources in Official Statistics (OS):
- by themselves
- in combination with more traditional data sources, such as sample surveys and administrative registers
Short-term use

Roadmap Short Term Actions - 2
Finalization to production:

| Source type | Domain(s) |
| --- | --- |
| Online Search Data | Labour Force statistics |
| Internet-scraped Data | ICT usage and Price statistics |
| Mobile Phone Data | Mobility and Tourism statistics |
| Scanner Data | Price statistics |

Roadmap Short Term Actions - 3
Laboratory to deal with other source types:

| Source type | Domain(s) |
| --- | --- |
| Social Media | Social statistics (e.g. Consumer Confidence) |
| Images: Traffic Webcams & Orthoimages | Traffic and Agriculture statistics |

Examples of experiences so far
- ICT Usage in Enterprises, based on Internet as a Data Source (IaD)
- Persons and Places, based on Mobile Phone Data

ICT Usage in Enterprises
Purpose: evaluate the possibility of adopting Web scraping and text mining techniques to estimate the usage of ICT by enterprises and public institutions.
Actors involved in the project:
- Istat: Survey on the ICT Usage in Enterprises
- Cineca (consortium of Italian universities, National Research Council and Ministry of Education and Research)
Methodology:
- Scraping of websites for data extraction
- Supervised classification task

The “ICT in enterprises” survey
In Italy, the survey covers a universe of 211,851 enterprises with at least 10 employees, by means of a sample survey involving 19,186 of them (2011).
In the 2013 round of the survey, 8,687 enterprises (45% of sampling respondent units) indicated their website.
Access to the indicated websites, in order to gather information directly from them, opens up different opportunities.

The web questionnaire is used to collect information on the characteristics of the websites owned or used by the enterprises.
Objective: predict the values of questions B8a to B8g using machine learning techniques applied to texts scraped from the websites (text mining).
Particular effort was dedicated to question B8a (“Web sales facilities” or “e-commerce”).

The overall methodology
The 2013 and 2014 rounds of the survey have both been used in the experiment.
Phase 1 - Web scraping: for all respondents declaring that they own a website, the website has been scraped.
Phase 2 - Estimation:
- Texts collected in phase 1 were submitted to classical text mining procedures in order to build a term/document matrix.
- Learners were used to predict the values of target variables (for instance, “e-commerce (yes/no)”) on the basis of relevant terms identified in the websites.
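
As a rough illustration of the term/document matrix mentioned in Phase 2, the following minimal sketch counts term occurrences per scraped text using only the Python standard library. The sample texts, the tokenization rule and the function names are assumptions made for this example; the slides do not describe the actual text mining pipeline at this level of detail.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens (assumed rule)."""
    return re.findall(r"[a-z]+", text.lower())


def term_document_matrix(docs):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[t] for t in vocabulary] for c in counts]
    return vocabulary, matrix


# Hypothetical scraped texts, one per enterprise website.
docs = [
    "Add to cart and pay online. Cart checkout is secure.",
    "Contact our office for a quote.",
]
vocabulary, matrix = term_document_matrix(docs)
```

Each row of `matrix` can then be used as the feature vector of one website when training a learner.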

Phase 1: Web Scraping
So far, three different solutions have been investigated:
- the Apache suite Nutch/Solr (https://nutch.apache.org) for crawling, content extraction, indexing and searching;
- HTTrack (http://www.httrack.com/), a free and open source software tool that “mirrors” a website locally by downloading each page that composes its structure;
- JSOUP (http://jsoup.org), which parses and extracts the structure of an HTML document. It has been integrated in a specific step of the ADaMSoft system (http://adamsoft.sourceforge.net).
Ad hoc JSOUP-based solutions are currently being developed.

Phase 1: web scraping

| Solution | # websites reached | Average number of webpages per site | Time spent | Type of storage | Storage dimensions |
| --- | --- | --- | --- | --- | --- |
| Nutch | 7020 / 8550 = 82.1% | 15.2 | 32.5 hours | Binary files on HDFS | 2.3 GB (data), 5.6 GB (index) |
| HTTrack | 7710 / 8550 = 90.2% | 43.5 | 6.7 days | HTML files on file system | 16.1 GB |
| JSOUP | 7835 / 8550 = 91.6% | 68 | 11 | HTML ADaMSoft compressed binary files | 500 MB |

Phase 2: Estimation
2013 data were used as the “train” dataset, while 2014 data were used as the “test” dataset.
The performance of each learner was evaluated by means of the usual quality indicators:
- accuracy: rate of correctly classified cases over the total;
- sensitivity: rate of correctly classified positive cases over total positive cases;
- specificity: rate of correctly classified negative cases over total negative cases.
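
The three indicators above can be computed directly from observed and predicted binary labels. The sketch below is illustrative only: the function name and the toy label vectors are assumptions, and it presumes that both positive and negative cases occur in the observed data (otherwise a division by zero would occur).

```python
def quality_indicators(observed, predicted):
    """Compute accuracy, sensitivity and specificity for binary labels."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o and p)          # true positives
    tn = sum(1 for o, p in pairs if not o and not p)  # true negatives
    fp = sum(1 for o, p in pairs if not o and p)      # false positives
    fn = sum(1 for o, p in pairs if o and not p)      # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "proportion_observed": (tp + fn) / len(pairs),
        "proportion_predicted": (tp + fp) / len(pairs),
    }


# Hypothetical test-set labels: 1 = e-commerce, 0 = no e-commerce.
observed = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0]
predicted = [1, 0, 0, 0, 1, 1, 0, 0, 0, 0]
indicators = quality_indicators(observed, predicted)
```

The last two entries correspond to the observed and predicted proportions of the positive class, which allow a comparison at the level of the aggregate estimate rather than of single units.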

Quality Indicators
Indicators reported for each learner: Accuracy, Sensitivity, Specificity, Proportion of e-commerce (observed), Proportion of e-commerce (predicted).
Values recovered for each learner, in reading order (several cells are missing from this transcript, so the values cannot be assigned to columns with certainty):
- GLM (Logistic): 0.69, 0.68, 0.19, 0.22
- Random Forest: 0.79, 0.63, 0.83, 0.25
- Neural Network: 0.70, 0.62, 0.72, 0.20
- Boosting: 0.67, 0.66
- Bagging: 0.82, 0.38, 0.92
- Naïve Bayes: 0.75, 0.55, 0.21
- LDA: 0.71, 0.65, 0.28
- RPART (Tree): 0.95, 0.16


Conclusions for the ICT Usage in Enterprises Project
So far, the pilot has explored the possibility of replicating the information collected by the questionnaire using the scraped content of the websites and applying the best predictor (scenario 1: reduction of respondent burden).
A more relevant possibility is to combine survey data and Big Data (scenario 2) in order to improve the quality of the estimates.

Conclusions for the ICT Usage in Enterprises Project
The aim is to adopt a full predictive approach with a combined use of data:
1. all the websites owned by the whole population of enterprises are identified and their content is collected by web scraping (= Big Data);
2. survey data (the “ground truth”) are combined with Big Data in order to establish relations (models) between the values of target variables and the terms collected in the corresponding scraped websites;
3. the models estimated in step 2 are applied to the whole set of texts obtained in step 1 in order to produce estimates of the target variables.
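
The three steps can be sketched end to end on toy data. The example below assumes a Bernoulli Naive Bayes classifier with Laplace smoothing, written from scratch; Naive Bayes is only one of the learners compared in the experiment and is not necessarily the one that would be adopted. All enterprise identifiers, term sets and survey answers are hypothetical, and both classes must be present in the training data.

```python
import math


def train_naive_bayes(docs, labels):
    """Fit a Bernoulli Naive Bayes model; docs are sets of terms, labels are bools."""
    vocabulary = sorted(set().union(*docs))
    model = {"vocabulary": vocabulary, "prior": {}, "term_prob": {}}
    for cls in (True, False):
        cls_docs = [d for d, y in zip(docs, labels) if y == cls]
        model["prior"][cls] = len(cls_docs) / len(docs)
        # Laplace-smoothed probability that a term appears in a document of this class.
        model["term_prob"][cls] = {
            t: (sum(t in d for d in cls_docs) + 1) / (len(cls_docs) + 2)
            for t in vocabulary
        }
    return model


def predict(model, doc):
    """Return True if the positive class has the higher log-probability."""
    scores = {}
    for cls in (True, False):
        score = math.log(model["prior"][cls])
        for t in model["vocabulary"]:
            p = model["term_prob"][cls][t]
            score += math.log(p if t in doc else 1 - p)
        scores[cls] = score
    return scores[True] > scores[False]


# Step 1 (assumed already done): terms scraped from every enterprise website.
survey_sites = {
    "e1": {"cart", "checkout", "shop"},
    "e2": {"cart", "pay", "catalogue"},
    "e3": {"contact", "about"},
    "e4": {"about", "quote", "contact"},
}
other_sites = {
    "e5": {"cart", "shop", "about"},
    "e6": {"contact", "quote"},
    "e7": {"checkout", "pay"},
    "e8": {"about"},
}
all_sites = {**survey_sites, **other_sites}

# Step 2: relate survey answers (e-commerce yes/no) to the scraped terms.
survey_answers = {"e1": True, "e2": True, "e3": False, "e4": False}
ids = sorted(survey_sites)
model = train_naive_bayes(
    [survey_sites[i] for i in ids], [survey_answers[i] for i in ids]
)

# Step 3: apply the model to all scraped websites and aggregate.
predictions = {i: predict(model, terms) for i, terms in all_sites.items()}
estimated_share = sum(predictions.values()) / len(predictions)
```

Here `estimated_share` plays the role of the population-level estimate of the target variable; a production setting would also need to account for enterprises without a website and for model error.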

The Persons and Places Project
Purpose: production of the origin/destination matrix of daily mobility for work and study purposes, at the spatial granularity of municipalities, starting from phone (tracking) data.
Actors involved in the project:
- Istat
- National Research Council
- University of Pisa
Methodology:
- Inference of population mobility profiles from GSM Call Detail Records (CDRs)
- Comparison with data derived from administrative sources

Data
- CDR (Wind, province of Pisa, October 2011)
- Administrative data (P&P, province of Pisa, December 2011)

Methodology
[Workflow diagram. Labels: CDR; Data Extraction; Aggregation; Risk evaluation; Classification; Statistics; Interpretation; Admin Data; Validation]

Aggregation: Individual Call Profiles
The temporal aggregation is by week, where the days of a given week are grouped into weekdays and weekend.
Given, for example, a temporal window of 28 days (4 weeks), the resulting matrix has 8 columns (2 for each week: one for the weekdays and one for the weekend).
A further temporal partitioning is applied to the hours of the day: a day is divided into several timeslots, representing interesting times of the day.
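
A minimal sketch of this aggregation for one user in one area is given below. The timeslot boundaries, the choice of counting calls in each cell and the function name are assumptions made for illustration; the slides do not state what the matrix cells contain or where the timeslots begin and end.

```python
from datetime import date, datetime

# Assumed timeslots as (start hour, end hour); not taken from the slides.
TIMESLOTS = [(0, 8), (8, 19), (19, 24)]


def call_profile(call_times, start, weeks=4):
    """Build a matrix with one row per timeslot and two columns per week."""
    n_cols = 2 * weeks  # weekdays and weekend for each week
    profile = [[0] * n_cols for _ in TIMESLOTS]
    for t in call_times:
        day_offset = (t.date() - start).days
        if not 0 <= day_offset < 7 * weeks:
            continue  # call falls outside the temporal window
        week = day_offset // 7
        is_weekend = t.weekday() >= 5  # Saturday or Sunday
        col = 2 * week + (1 if is_weekend else 0)
        row = next(i for i, (lo, hi) in enumerate(TIMESLOTS) if lo <= t.hour < hi)
        profile[row][col] += 1
    return profile


# Hypothetical calls of one user; 3 October 2011 was a Monday.
calls = [
    datetime(2011, 10, 3, 9, 30),   # week 0, weekday, daytime slot
    datetime(2011, 10, 8, 21, 0),   # week 0, weekend, evening slot
    datetime(2011, 10, 12, 7, 15),  # week 1, weekday, night slot
    datetime(2011, 11, 5, 10, 0),   # outside the 28-day window
]
profile = call_profile(calls, start=date(2011, 10, 3))
```

With a 28-day window the matrix has 8 columns, matching the example on the slide.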

Classification
Profile classification, i.e. the attribution of Individual Call Profiles (ICPs) to the proper class, was performed in two steps:
1. Extraction of representative call profiles, i.e. a relatively small set of synthetic call profiles, each summarizing a homogeneous set of (real) ICPs. This step reduces the set of samples to be manually classified.
2. The labels assigned to the representative profiles are propagated to the full set of ICPs.

Classification
The mean value of the ICPs belonging to each cluster serves as the prototype (representative) of the cluster.
The number of clusters K, set to 100, was chosen through a wide range of experiments, trying to minimize the intra-cluster distance and maximize the inter-cluster distance.
Once extracted, the representatives (RCPs) were labeled by domain experts with the identified profile classes.
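
The use of cluster means as prototypes and of a parameter K suggests a k-means style clustering; this is an assumption, since the slide does not name the algorithm. The sketch below shows a plain k-means on flattened profile vectors, with a naive deterministic initialization (the first k points) chosen only to keep the example reproducible. The toy points and k = 2 are hypothetical; the project used K = 100.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def k_means(points, k, iterations=20):
    """Return k centroids; each is the mean of the points assigned to it."""
    centroids = [list(p) for p in points[:k]]  # naive deterministic start
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [
            mean_vector(members) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids


# Hypothetical flattened call profiles forming two obvious groups.
points = [[0, 0], [10, 10], [0, 1], [10, 11], [1, 0], [11, 10]]
representatives = k_means(points, k=2)
```

Each returned centroid plays the role of a representative call profile (RCP) to be labeled by a domain expert.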

Classification
The second step, i.e. the propagation of the labels manually assigned to the RCPs, followed a standard 1-Nearest-Neighbor (1-NN) classification: each ICP is assigned the label of the closest RCP.
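
The 1-NN propagation can be sketched as follows. Euclidean distance is assumed here (the slide does not state the metric), and the profile vectors and their labels are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def propagate_labels(icps, rcps, rcp_labels):
    """Assign to each ICP the label of its closest RCP (1-NN)."""
    labels = []
    for icp in icps:
        nearest = min(range(len(rcps)), key=lambda i: squared_distance(icp, rcps[i]))
        labels.append(rcp_labels[nearest])
    return labels


# Hypothetical representative profiles with expert-assigned labels.
rcps = [[5, 5, 1, 1], [5, 0, 1, 0], [0, 1, 0, 0]]
rcp_labels = ["Resident", "Commuter", "Visitor"]

# Hypothetical individual call profiles to classify.
icps = [[4, 5, 1, 1], [6, 0, 2, 0], [0, 0, 0, 1]]
icp_labels = propagate_labels(icps, rcps, rcp_labels)
```

Because only the RCPs are labeled by hand, the manual effort depends on K rather than on the number of users.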

[Diagram: the classification algorithm assigns each individual call profile to one of the classes Resident, Dynamic resident, Commuters or Visitors, illustrated with two areas A and B]

A flow from A -> B is defined by dynamic residents in B who work in A (commuters).
[Diagram labels: Commuter, Dynamic resident, A, B]

Comparison of estimates derived from CDRs with administrative data
GSM estimates are rescaled according to the market share of the operator.
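
One simple way to perform such a rescaling is to divide the flows observed among the operator's subscribers by the operator's market share. The sketch below assumes a single uniform market share and subscribers who are representative of the whole population; the slides do not specify the exact procedure, and the flow counts and the share value are hypothetical.

```python
def rescale_flows(observed_flows, market_share):
    """Scale operator-level flow counts up to population-level estimates."""
    if not 0 < market_share <= 1:
        raise ValueError("market_share must be in (0, 1]")
    return {od: count / market_share for od, count in observed_flows.items()}


# Hypothetical flows (origin, destination) observed among one operator's users.
observed_flows = {("Cascina", "Pisa"): 300, ("Pisa", "Pontedera"): 120}
estimated_flows = rescale_flows(observed_flows, market_share=0.25)
```

The rescaled flows are the quantities that can be compared with the origin/destination counts from administrative sources.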

Commuters (inbound flow)

Dynamic residents (outbound flow)


Inbound commuters in Pisa

Outbound commuters in Pisa

Conclusions for the Persons and Places Project
- Semi-automatic methodology for the estimation of population flows
- Good alignment with results from administrative data
- First steps towards the use of mobile phone data for OS

Recommendations from experimentations - 1
ICT Usage in Enterprises:
- Even unstructured data can be harnessed by OS.
- Very promising preliminary results in terms of the quality of the estimates with respect to questionnaire-based estimates
- Dedicated IT infrastructure for (i) scraping and (ii) scaling up

Recommendations from experimentations - 2
Persons and Places:
- Privacy issues in dealing with mobile phone data. First positive solutions from the Italian «Garante per la Privacy»
- Comparison with administrative data suggests that estimates based on mobile phone data are reliable (though further work is still needed to ensure OS quality levels)

References
- Persons and Places: Furletti, B., Gabrielli, L., Garofalo, G., Giannotti, F., Milli, L., Nanni, M., Pedreschi, D., Vivio, R.: Use of mobile phone data to estimate mobility flows. Measuring urban population and inter-city mobility using big data in an integrated approach. SIS, Cagliari, 2014.
- Labour Market Estimation: Bacchini, F., D’Alò, M., Falorsi, S., Fasulo, A., Pappalardo, A.: Does Google index improve the forecast of Italian labour market? SIS, Cagliari, 2014.
- ICT Usage: Barcaroli, G., Scannapieco, M., Nurra, A., Scarnò, M., Salamone, S., Summa, D.: Internet as Data Source in Istat Survey on ICT in Enterprises. Austrian Journal of Statistics, Vol. 44, no. 2, 2015.
- Analysis techniques: Barcaroli, G., De Francisci, S., Scannapieco, M.: Big Data Analysis: Experiences and Best Practices in Official Statistics. Conference of European Statistical Stakeholders, Rome, 2014.
- IT issues: Barcaroli, G., De Francisci, S., Scannapieco, M., Summa, D.: Dealing with Big Data for Official Statistics: IT Issues. MSIS, Dublin, 2014.
- Introductory: Scannapieco, M., Virgillito, A., Zardetto, D.: Placing Big Data in Official Statistics: A Big Challenge? NTTS, Brussels, 2013.