Italian Examples of the use of big data for producing statistics


1 Italian Examples of the use of big data for producing statistics
Monica Scannapieco THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Eurostat

2 Istat Big Data Strategy - 1
Istat (the Italian National Institute of Statistics) set up a Technical Commission to steer investments in the adoption of Big Data in statistical production processes.
Duration: February 2013 to February 2015.
Members came from different areas: Official Statistics, academia, and the private sector.

3 Istat Big Data Strategy - 2
The Commission released a Roadmap for Big Data adoption, the result of a mixed approach that combined:
Top-down phase: analysis of the state of the art of Big Data research and practice.
Bottom-up phase: experimentation.

4 Istat Big Data Strategy - 3
A new Technical Commission was set up in January 2016, with the main objective of monitoring the implementation of the Roadmap.

5 Roadmap Short Term Actions - 1
Which use: Big Data sources can be used in Official Statistics (OS) either by themselves or in combination with more traditional data sources such as sample surveys and administrative registers.
Short-term use.

6 Roadmap Short Term Actions - 2
Finalization to production:
Source type | Domain(s)
Online search data | Labour Force statistics
Internet-scraped data | ICT usage and Price statistics
Mobile phone data | Mobility and Tourism statistics
Scanner data | Price statistics

7 Roadmap Short Term Actions - 3
Laboratory to deal with other source types:
Source type | Domain(s)
Social media | Social statistics (e.g. Consumer Confidence)
Images: traffic webcams and orthoimages | Traffic and Agriculture statistics

8 Examples of experiences so far
ICT Usage in Enterprises, based on Internet as a Data Source (IaD).
Persons and Places, based on mobile phone data.

9 ICT Usage in Enterprises
Purpose: evaluate the possibility of adopting web scraping and text mining techniques to produce estimates of ICT usage by enterprises and public institutions.
Actors involved in the project: Istat (Survey on ICT Usage in Enterprises); Cineca (consortium of Italian universities, the National Research Council and the Ministry of Education and Research).
Methodology: scraping of websites for data extraction; supervised classification task.

10 The “ICT in enterprises” survey
In Italy, the survey covers a universe of 211,851 enterprises with at least 10 employees, by means of a sample survey involving 19,186 of them (2011).
In the 2013 round of the survey, 8,687 respondent units (45% of the sample) indicated their website.
Accessing the indicated websites to gather information directly from them opens up different opportunities.

11 The web questionnaire
The web questionnaire is used to collect information on the characteristics of the websites owned or used by the enterprises.
Objective: predict the values of questions B8a to B8g using machine learning techniques applied to texts scraped from the websites (text mining).
Particular effort was dedicated to question B8a (“Web sales facilities” or “e-commerce”).

12 The overall methodology
Both the 2013 and 2014 rounds of the survey were used in the experiment.
Phase 1: Web scraping. The websites of all respondents declaring that they own a website were scraped.
Phase 2: Estimation. The texts collected in phase 1 were submitted to classical text mining procedures in order to build a term/document matrix. Learners were then applied to predict the values of the target variables (for instance, “e-commerce (yes/no)”) on the basis of the relevant terms identified in the websites.
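As an illustration of the term/document matrix mentioned in phase 2, here is a minimal sketch using only the Python standard library. The sample texts, the tokenizer and the function names are assumptions made for this example; the slides do not describe the actual text mining procedure in detail.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_document_matrix(documents):
    """Return (vocabulary, matrix): one row per document, one column per term."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for terms that do not occur in a document.
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


# Hypothetical scraped texts, one per enterprise website.
docs = [
    "Add to cart and pay online",
    "Contact us for a quote",
    "Online shop: cart, checkout, pay",
]
vocabulary, matrix = term_document_matrix(docs)
```

Each row of `matrix` can then serve as the feature vector of one enterprise when training a learner on a target such as “e-commerce (yes/no)”.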

13 Phase 1: Web Scraping
So far, three different solutions have been investigated:
the Apache suite Nutch/Solr, for crawling, content extraction, indexing and searching;
HTTrack, a free and open source software tool that “mirrors” a website locally by downloading each page that composes its structure;
JSOUP, which parses an HTML document and extracts its structure. It has been integrated in a specific step of the ADaMSoft system.
Ad hoc JSOUP-based solutions are currently being developed.
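To make the content-extraction step concrete, the sketch below pulls the visible text out of an HTML page. It uses Python's standard `html.parser` module as a stand-in and is not the JSOUP or ADaMSoft code used in the project; the sample page and the class name are invented for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML document as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page content; a real run would read the scraped files instead.
page = (
    "<html><head><title>Shop</title><style>p{color:red}</style></head>"
    "<body><h1>Online store</h1><p>Add to <b>cart</b></p>"
    "<script>var x = 1;</script></body></html>"
)
text = extract_text(page)
```

The extracted string is the kind of input that the text mining of phase 2 would work on.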

14 Phase 1: Web scraping
Solution | # websites reached | Average number of webpages per site | Time spent | Type of storage | Storage dimensions
Nutch | 7020 / 8550 = 82.1% | 15.2 | 32.5 hours | Binary files on HDFS | 2.3 GB (data), 5.6 GB (index)
HTTrack | 7710 / 8550 = 90.2% | 43.5 | 6.7 days | HTML files on file system | 16.1 GB
JSOUP | 7835 / 8550 = 91.6% | 68 | 11 | HTML ADaMSoft compressed binary files | 500 MB

15 Phase 2: Estimation
The 2013 data were used as the “train” dataset, and the 2014 data as the “test” dataset.
The performance of each learner was evaluated by means of the usual quality indicators:
accuracy: rate of correctly classified cases over the total number of cases;
sensitivity: rate of correctly classified positive cases over the total number of positive cases;
specificity: rate of correctly classified negative cases over the total number of negative cases.
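The three indicators follow directly from the counts of true and false positives and negatives. The sketch below computes them for a pair of hypothetical label vectors (1 = e-commerce, 0 = no e-commerce); the data are invented and are not the survey results.

```python
def quality_indicators(y_true, y_pred):
    """Compute accuracy, sensitivity and specificity for binary labels (1/0)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical observed and predicted values for ten enterprises.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]
indicators = quality_indicators(y_true, y_pred)
```

With these inputs there are 2 true positives, 1 false negative, 5 true negatives and 2 false positives, so accuracy is 7/10, sensitivity 2/3 and specificity 5/7.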

16 Quality Indicators
Columns: Learner; Accuracy; Sensitivity; Specificity; Proportion of e-commerce (observed); Proportion of e-commerce (predicted).
Values per learner as they survive in the transcript (some cells are missing):
GLM (Logistic): 0.69, 0.68, 0.19, 0.22
Random Forest: 0.79, 0.63, 0.83, 0.25
Neural Network: 0.70, 0.62, 0.72, 0.20
Boosting: 0.67, 0.66
Bagging: 0.82, 0.38, 0.92
Naïve Bayes: 0.75, 0.55, 0.21
LDA: 0.71, 0.65, 0.28
RPART (Tree): 0.95, 0.16


18 Conclusions for the ICT Usage in Enterprises Project
So far, the pilot has explored the possibility of replicating the information collected by the questionnaire, using the scraped content of the websites and applying the best predictor (scenario 1: reduction of respondent burden).
A more relevant possibility is to combine survey data and Big Data (scenario 2) in order to improve the quality of the estimates.

19 Conclusions for the ICT Usage in Enterprises Project
The aim is to adopt a fully predictive approach with a combined use of data:
(1) all the websites owned by the whole population of enterprises are identified and their content is collected by web scraping (= Big Data);
(2) survey data (the “ground truth”) are combined with Big Data in order to establish relations (models) between the values of the target variables and the terms collected from the corresponding scraped websites;
(3) the models estimated in step 2 are applied to the whole set of texts obtained in step 1 in order to produce estimates of the target variables.

20 The Persons and Places Project
Purpose: production of the origin/destination matrix of daily mobility for work and study, at the spatial granularity of municipalities, starting from phone (tracking) data.
Actors involved in the project: Istat; National Research Council; University of Pisa.
Methodology: inference of population mobility profiles from GSM Call Detail Records (CDRs); comparison with data derived from administrative sources.

21 Data
CDR (Wind, province of Pisa, October 2011)
Administrative data (P&P, province of Pisa, December 2011)

22 Methodology
[Diagram. Stages shown for the CDR data: Extraction, Aggregation, Risk evaluation, Classification, Statistics, Interpretation. Administrative data are used for Validation.]

23 Aggregation: Individual Call Profiles
The temporal aggregation is by week, where the days of each week are grouped into weekdays and weekend.
Given, for example, a temporal window of 28 days (4 weeks), the resulting matrix has 8 columns (2 columns for each week, one for the weekdays and one for the weekend).
A further temporal partitioning is applied to the hours of the day: a day is divided into several timeslots, representing interesting times of the day.
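The aggregation described above can be sketched as follows. The three timeslots, the use of call counts as cell values, and the function name are assumptions made for illustration; the slides do not specify the actual timeslots or what each cell stores.

```python
from datetime import date, datetime

# Hypothetical timeslots as (start_hour, end_hour).
TIMESLOTS = [(0, 8), (8, 19), (19, 24)]


def individual_call_profile(call_times, window_start, weeks=4):
    """Build a matrix with one row per timeslot and two columns per week.

    Column 2*w counts calls on weekdays of week w, column 2*w + 1 counts
    calls at the weekend of week w. Weeks are 7-day blocks starting at
    window_start, which is assumed to be a Monday.
    """
    profile = [[0] * (2 * weeks) for _ in TIMESLOTS]
    for ts in call_times:
        day_offset = (ts.date() - window_start).days
        if not 0 <= day_offset < 7 * weeks:
            continue  # call outside the temporal window
        week = day_offset // 7
        is_weekend = ts.weekday() >= 5  # Saturday = 5, Sunday = 6
        col = 2 * week + (1 if is_weekend else 0)
        row = next(
            i for i, (start, end) in enumerate(TIMESLOTS) if start <= ts.hour < end
        )
        profile[row][col] += 1
    return profile


# Hypothetical calls of one user; 3 October 2011 was a Monday.
calls = [
    datetime(2011, 10, 3, 9, 30),   # Monday morning, week 0
    datetime(2011, 10, 4, 20, 0),   # Tuesday evening, week 0
    datetime(2011, 10, 8, 10, 0),   # Saturday, week 0
    datetime(2011, 10, 12, 7, 15),  # Wednesday early morning, week 1
    datetime(2011, 11, 15, 12, 0),  # outside the 28-day window
]
profile = individual_call_profile(calls, date(2011, 10, 3))
```

For a 28-day window the result has 8 columns, matching the example on the slide.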

24 Classification
Profile classification, i.e. the attribution of Individual Call Profiles (ICPs) to the proper class, was performed in two steps:
Extraction of representative call profiles, i.e. a relatively small set of synthetic call profiles, each summarizing a homogeneous set of (real) ICPs. This step reduces the set of samples to be manually classified.
Propagation of the labels assigned to the representative profiles to the full set of ICPs.

25 Classification
The mean values of the ICPs belonging to each cluster serve as the prototype (representative) of the cluster.
The parameter K was set to 100 after a wide range of experiments aimed at minimizing the intra-cluster distance and maximizing the inter-cluster distance.
Once extracted, the representative call profiles (RCPs) were labeled by domain experts with the identified profile classes.
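The slide does not name the clustering algorithm; a parameter K and cluster means used as prototypes are consistent with k-means, so the sketch below assumes a plain k-means on flattened profile vectors. The toy data, K = 2 and the squared Euclidean distance are illustrative choices, not the project's settings.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=50, seed=0):
    """Plain k-means: returns (centroids, assignment)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its closest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Recompute each centroid as the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignment


# Hypothetical flattened profiles forming two well-separated groups.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
]
centroids, assignment = k_means(points, k=2)
```

The returned centroids play the role of the RCPs: only these K vectors need to be labeled by hand.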

26 Classification
The second step, i.e. the propagation of the labels manually assigned to the RCPs, followed a standard 1-Nearest-Neighbor (1-NN) classification: each ICP is assigned the label of the closest RCP.
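A minimal sketch of the 1-NN propagation step follows. The distance measure is assumed to be Euclidean (the slide only says "closest"), and the vectors and labels are invented for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def propagate_labels(icps, rcps, rcp_labels):
    """Give each ICP the label of its nearest representative profile (1-NN)."""
    labels = []
    for icp in icps:
        nearest = min(range(len(rcps)), key=lambda i: squared_distance(icp, rcps[i]))
        labels.append(rcp_labels[nearest])
    return labels


# Hypothetical representative profiles with expert-assigned labels.
rcps = [[0.5, 0.5], [10.5, 10.5]]
rcp_labels = ["resident", "commuter"]

# Hypothetical individual profiles to classify.
icps = [[0, 1], [11, 10], [1, 1]]
labels = propagate_labels(icps, rcps, rcp_labels)
```

Because only the nearest representative matters, the cost per ICP grows linearly with the number of RCPs (K = 100 in the project).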

27 Individual call profile
[Diagram: a classification algorithm assigns each individual call profile, observed in areas A and B, to one of the classes Resident, Dynamic resident, Commuter or Visitor.]

28 A flow from A -> B is defined by the dynamic residents in B who work in A (commuters)
[Diagram: Commuter and Dynamic resident between areas A and B.]

29 Comparison of estimates derived from CDRs with administrative data
GSM estimates are rescaled according to the market share of the operator.
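The slide does not say how the rescaling is done. One simple reading, assumed here, is to divide the number of users observed for one operator by that operator's market share. The flows and the 25% share below are made-up values.

```python
def rescale_by_market_share(observed_count, market_share):
    """Scale a single-operator count up to the whole population.

    Assumes the operator's users are representative of the population,
    so that the total is estimated as observed_count / market_share.
    """
    if not 0 < market_share <= 1:
        raise ValueError("market_share must be in (0, 1]")
    return observed_count / market_share


# Hypothetical observed commuter counts per origin/destination pair.
observed_flows = {("A", "B"): 120, ("B", "A"): 45}
market_share = 0.25  # hypothetical share of the operator

estimated_flows = {
    od: rescale_by_market_share(count, market_share)
    for od, count in observed_flows.items()
}
```

A uniform share is the simplest assumption; a share that varies by area would need one factor per municipality.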

30 Commuters (inbound flow)

31 Dynamic residents (outbound flow)


33 Inbound commuters in Pisa

34 Inbound commuters in Pisa

35 Outbound commuters in Pisa

36 Conclusions for the Persons and Places Project
Semi-automatic methodology for the estimation of population flows.
Good alignment with the results from administrative data.
First steps towards the use of mobile phone data for OS.

37 Recommendations from experimentations - 1
ICT Usage in Enterprises:
Even unstructured data can be harnessed by OS.
Very promising preliminary results in terms of the quality of the estimates compared with questionnaire-based estimates.
A dedicated IT infrastructure is needed for (i) scraping and (ii) scaling up.

38 Recommendations from experimentations - 2
Persons and Places:
Privacy issues arise in dealing with mobile phone data. First positive solutions came from the Italian «Garante per la Privacy».
Comparison with administrative data suggests that estimates based on mobile phone data are reliable (though work is still needed to ensure OS quality levels).

39 References
Persons and Places: Furletti, B., Gabrielli, L., Garofalo, G., Giannotti, F., Milli, L., Nanni, M., Pedreschi, D., Vivio, R.: Use of mobile phone data to estimate mobility flows. Measuring urban population and inter-city mobility using big data in an integrated approach. SIS, Cagliari, 2014.
Labour Market Estimation: Bacchini, F., D’Alò, M., Falorsi, S., Fasulo, A., Pappalardo, A.: Does Google index improve the forecast of Italian labour market? SIS, Cagliari, 2014.
ICT Usage: Barcaroli, G., Scannapieco, M., Nurra, A., Scarnò, M., Salamone, S., Summa, D.: Internet as Data Source in Istat Survey on ICT in Enterprises. Austrian Journal of Statistics, Vol. 44, No. 2, 2015.
Analysis techniques: Barcaroli, G., De Francisci, S., Scannapieco, M.: Big Data Analysis: Experiences and Best Practices in Official Statistics. Conference of European Statistical Stakeholders, Rome, 2014.
IT issues: Barcaroli, G., De Francisci, S., Scannapieco, M., Summa, D.: Dealing with Big Data for Official Statistics: IT Issues. MSIS, Dublin, 2014.
Introductory: Scannapieco, M., Virgillito, A., Zardetto, D.: Placing Big Data in Official Statistics: A Big Challenge? NTTS, Brussels, 2013.

