New sources for the SBR: first evaluations on the feasibility

Presentation transcript:

New sources for the SBR: first evaluations on the feasibility of using Big Data in the SBR production process
26th Meeting of the Wiesbaden Group on Business Registers
Session 5: New data sources
Neuchâtel (Switzerland) - 24-27 September 2018
Authors: Bianchi G., Consalvi M., Gentili B., Pancella F., Scalfati F., Summa D.
Speaker: Monica Consalvi, Directorate for Economic Statistics - Istat, Italy

Contents
- The Istat strategy from 2013
- The scope of innovation and objectives of the BIGData project
- Main steps of the first experimentation in the LabInn
- Process model
- First preliminary results
- Lessons learnt and future developments

The Istat strategy from 2013
On-going projects on the use of Big Data:
- Persons and Places (Mobile Phone Data)
- Labour Market Estimation (Google Trends)
- Use of ‘Scanner Data’ for estimating consumer prices
- Web-scraping techniques for estimating the use of Information and Communications Technology by companies, and social media (Twitter, Facebook) for estimating confidence indicators
Production of experimental statistics (https://www.istat.it/en/experimental-statistics/experiments-on-big-data) to obtain a subset of the estimates currently produced by the yearly sample survey “ICT usage in Enterprises”.
Implementation of infrastructures and software for the treatment of Big Data (i.e. Sandbox and Cloudera), with projects involving enterprises, universities and research centres.

The Istat strategy from 2013
Three ESSnets on Big Data; in the third (BIG DATA 2018-2020) Italy participates in 9 WPs.
In particular, WP C – Implementation – Enterprise characteristics:
- Task 2: Methodological Framework/Guidelines
- Task 3: Experimental Statistics, including reference metadata
- Task 4: Starter Kit for NSIs
Mapping activity:
- Theoretical matrix ‘thematic areas–domains/phenomena’
- Empirical matrix ‘thematic areas–keywords’
In the last part of 2017 a preliminary analysis started on the feasibility of acquiring web data and possibly using them for the SBR.
The new project “Big Data usage to support the SBR” was the first to participate in the Laboratory for Innovation (LabInn), inaugurated on 21 March 2018.

The scope of innovation and objectives of the BigData project
The use of BIG DATA in support of the Statistical Business Register, with the aim of integrating the ‘structured’ business data with the ‘unstructured’ data from web pages.
The prototype will contain a set of enterprise-level information linked to the SBR, whose content allows:
- Dissemination of a new experimental statistical output expanding the Statistical Business Register
  - Target 1: Increase the quality of information in the SBR to improve sample designs
  - Target 2: Expand the informative content of the SBR (new items, documentation system)
  - Target 3: New business taxonomies (‘emerging’ populations, to be mapped over time)
- The production of experimental statistics in addition to the traditional business statistics

Main steps of the first experimentation
1 – Acquisition of web addresses
- URLs from the admin sources (available only for 5% of SBR enterprises)
- URLs from thematic directory sites
- URLs from batch queries on search engines (URL retrieval with machine learning techniques)
2 – Identification of the enterprise
- URL syntax validation, check of the recurring errors and domain extraction from the URL
- Detection of identification variables on the website (fiscal code, VAT code, name, address…) through Information Retrieval techniques and pattern matching on strings
- Comparison with the same information available in the SBR through matching techniques and string similarity metrics (Jaro-Winkler, Levenshtein, etc.)
3 – Extraction and analysis of the information
- Web Scraping techniques for web data acquisition
- Text Mining techniques (including Natural Language Processing techniques) for extracting the information to integrate into the Business Register (including metadata and .pdf files)
- Machine Learning techniques, i.e. algorithms that simulate a learning process, for the construction of predictive models
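The identification step (step 2) can be illustrated with a minimal Python sketch. It is not Istat's implementation: the function names, the host-name check and the 11-digit pattern used for Italian VAT numbers are assumptions made for illustration, and only Levenshtein (one of the metrics named above) is shown.

```python
import re
from urllib.parse import urlsplit

# Assumed host-name shape: labels of letters/digits/hyphens joined by dots.
HOST_RE = re.compile(r"[a-z0-9-]+(\.[a-z0-9-]+)+")
# Assumption: a VAT number appears on the page as 11 consecutive digits.
VAT_RE = re.compile(r"(?<!\d)\d{11}(?!\d)")


def extract_domain(url):
    """Return the lower-cased domain of a URL, or None if it looks invalid."""
    url = url.strip()
    if "://" not in url:
        url = "http://" + url  # recurring error: scheme omitted
    try:
        host = urlsplit(url).hostname
    except ValueError:
        return None
    if not host or not HOST_RE.fullmatch(host):
        return None
    return host[4:] if host.startswith("www.") else host


def find_vat_codes(text):
    """Pattern matching on strings: candidate 11-digit VAT numbers."""
    return VAT_RE.findall(text)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def name_similarity(a, b):
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

A scraped name or VAT code would then be compared with the corresponding SBR value, and the link accepted only above a similarity threshold chosen by the analyst.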

WHICH ENTERPRISES HAVE BEEN PART OF THE EXPERIMENTATION?
- SBR reference population: ≈ 4.5 mln enterprises in t=2015
- Legal form = corporation/partnership: ≈ 1.6 mln enterprises active in t=2015
- Sample = 115,751 enterprises: stratified sample with proportionate allocation (36 sample strata), stratified:
  - by economic activity: 9 classes of the high-level SNA/ISIC aggregation A*10/11 of the NACE Rev.2 classification
  - by size: size of the enterprises in terms of employees and turnover
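Proportionate allocation assigns each stratum a share of the sample equal to its share of the population. The sketch below is only an illustration: the stratum labels and sizes are invented, and the largest-remainder rounding rule is an assumption, since the slides do not say how fractional allocations were rounded.

```python
def proportional_allocation(stratum_sizes, sample_size):
    """Split sample_size across strata in proportion to their population.

    stratum_sizes maps a stratum label to its population count.
    Rounding uses the largest-remainder rule (an assumption here), so the
    allocations always add up to sample_size.
    """
    total = sum(stratum_sizes.values())
    quotas = {k: sample_size * n / total for k, n in stratum_sizes.items()}
    allocation = {k: int(q) for k, q in quotas.items()}  # floor of each quota
    leftover = sample_size - sum(allocation.values())
    # Hand out the remaining units to the strata with the largest fractions.
    by_remainder = sorted(quotas, key=lambda k: quotas[k] - allocation[k],
                          reverse=True)
    for k in by_remainder[:leftover]:
        allocation[k] += 1
    return allocation


# Hypothetical strata (economic activity x size class), not Istat figures.
example = proportional_allocation(
    {"industry/small": 500, "industry/large": 300, "services/small": 200},
    sample_size=10,
)
```

With these invented sizes the three strata hold 50%, 30% and 20% of the population, so they receive the same shares of the sample.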

Process model
[Figure: process model diagram with the Web Scraping phase, the Text Mining phase and Machine Learning]

First preliminary results: the Web Mining process
Enterprises in the sample by identified characters and by source of information for the URL (how the URLs of the sample enterprises were obtained).
Columns: characters, total, then shares by URL source (URLs available from admin source, URLs from certified mails, URLs from web portals, URLs from search engine).
- Company name: 100,000 (64 %, 14 %, 5 %, 17 %)
- Tax code/VAT number: 80,440 (60 %, 9 %, 26 %)
- Enterprise street address: 103,000 (70 %, 8 %, 7 %, 15 %)
- E-mail address: 198,000 (10 %, 3 %, 27 %)
- Telephone number: 230,000 (63 %, 22 %)
- Company capital: 21,580 (40 %, 4 %, 35 %, 21 %)
- Presence on social media: 131,830 (65 %, 6 %, 20 %)
- Presence of online job application facilities: 15,000 (52 %, 16 %, 2 %, 30 %)

First preliminary results
Enterprises by presence on social media
Addresses by type of local unit (%):
- Addresses related to administrative headquarters: 70
- Addresses related to other local units: 30
E-mail addresses by type of address (%):
- E-mail address: 92
- Certified e-mail address: 8
Phone numbers by type (%):
- Phone: 58
- Mobile phone: 12
- Fax: 30
Online job facilities (%):
- Enterprises WITH online job application facilities: 15
- Enterprises WITHOUT online job application facilities: 85

Lessons learnt and future developments
- Traditional systems of data collection and data processing are no longer adequate
- Rapidly growing volumes of data are difficult to manage and consume large amounts of computing and storage resources
- Other technical limits to solve:
  - the long run time the crawler needs to get the entire content;
  - the restrictions caused by security barriers inside websites that prevent automatic access
- Statistical problems:
  - the difficulty of certifying the quality of information and the data reliability;
  - how to attribute information with certainty to an SBR enterprise;
  - how to classify with certainty the universe to which a set of data extracted from the web belongs

Lessons learnt and future developments
- enriching statistical production with new information;
- increasing the timeliness of statistical products;
- increasing the relevance of business statistics at a lower cost than enlarging the existing data;
- the web is an independent source of data → a representation closer to reality.

Lessons learnt and future developments
New opportunities involve new challenges and still open questions:
- when and how it would be necessary to repeat the extraction of the same data from the web (cyclic monitoring to capture the differences?);
- when differences can be considered significant as true changes;
- how to get the “date” on which the change takes place;
- respect for privacy;
- the need to acquire new infrastructures (methodological, technological, organizational ones), as well as new skills;
- the main challenge is to move from experimentation to production.
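One way to picture the cyclic-monitoring question is to compare the fields extracted for the same enterprise at two extraction dates. The sketch below is purely illustrative: the field names and the normalisation rule (ignore case and spacing) are assumptions, and it does not settle when a difference counts as a true change.

```python
def normalize(value):
    """Lower-case and collapse whitespace so cosmetic edits are ignored."""
    return " ".join(str(value).lower().split())


def compare_snapshots(previous, current):
    """Return {field: (old, new)} for fields whose normalised value differs.

    previous and current map field names to the values scraped at two
    extraction dates; a field missing from one snapshot is reported as None.
    """
    changes = {}
    for field in sorted(set(previous) | set(current)):
        old, new = previous.get(field), current.get(field)
        old_norm = normalize(old) if old is not None else None
        new_norm = normalize(new) if new is not None else None
        if old_norm != new_norm:
            changes[field] = (old, new)
    return changes


# Hypothetical snapshots of one enterprise website, a few months apart.
first = {"phone": "06 1234567", "address": "Via Roma 1"}
second = {"phone": "06  1234567", "address": "Via Milano 2",
          "email": "info@example.it"}
detected = compare_snapshots(first, second)
```

Here the phone number differs only in spacing and is not reported, while the new address and the newly found e-mail are. Dating the change more precisely than "between the two extractions" remains an open question, as noted above.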

The new integrated system of ASIA registers
[Figure: diagram of the integrated system. Inputs: Administrative Sources (fiscal sources, Chamber of Commerce, Social Security, financial statements and notes, sectoral sources, Public Administration), Statistical Sources, the Business Portal, Big Data. A microdata integration process links them to the registers: ASIA Enterprises, ASIA Local Units, ASIA Enterprise Groups, ASIA Employment, Farm Register, PA, Non-profit, S13, EGR.]

Lessons learnt and future developments
[Figure: INPUT (Big Data, data-driven approach) set against OUTPUT (standard traditional processes of Official Statistics, output-oriented approach)]
The challenge: a register-based approach to the analysis of BIG DATA

Factors for success: the multi-disciplinary context
- High interaction among experts (they work simultaneously, side by side)
- It requires managing connections among separate departments
- Integration of different skills and expertise
- Shared competence base (good training is essential)
Roles:
- Thematic expert: governs the process, raises the problem and indicates its requirements, validates the results. He/she does not know in advance the contents coming from the web
- Methodological expert: analyses the requirements of the problem and finds the methodological solutions
- IT expert: provides the tools and techniques and turns specifications into procedures

Monica Consalvi (moconsal@istat.it)
Barbara Gentili (bagentil@istat.it)
Flavio Pancella (pancella@istat.it)
Gianpiero Bianchi (gianbia@istat.it)
Francesco Scalfati (scalfati@istat.it)
Donato Summa (donato.summa@istat.it)