1
ESSnet Big Data WP2 Workplan
Monica Scannapieco
Istituto Nazionale di Statistica – Istat
2
Web Scraping Enterprises Characteristics: Objectives
Main objectives:
- to demonstrate whether business registers can be improved by predicting the values of some key variables from scraped data
- to verify whether statistical outputs can be produced using predicted data
3
Web Scraping Enterprises Characteristics: use cases
Initial set of use cases in the proposal:
- whether an enterprise performs e-commerce or not
- whether an enterprise manages job vacancies on its site
- presence in social media
- contact information: location, contacts, etc.
- profiling information: type of activity, links with other enterprises, etc.
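As an illustration of how the first three use cases could be turned into binary indicators, here is a minimal sketch that scans one scraped HTML page. The keyword lists, the social media domains and the function name `extract_indicators` are assumptions made for this example, not part of the workplan; a real study would need country- and language-specific lists and would look at more than one page per site.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical indicator lists, chosen only for illustration.
SOCIAL_DOMAINS = {"facebook.com", "twitter.com", "linkedin.com", "youtube.com"}
ECOMMERCE_TERMS = ("add to cart", "shopping cart", "checkout")
JOB_TERMS = ("job vacancies", "careers", "work with us")


class PageCollector(HTMLParser):
    """Collects link targets and text content from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text_parts.append(data)


def strip_www(host):
    """Drop a leading 'www.' so hosts compare against bare domains."""
    return host[4:] if host.startswith("www.") else host


def extract_indicators(html):
    """Return one boolean per use case for a single page."""
    collector = PageCollector()
    collector.feed(html)
    # Lower-case the text and collapse whitespace before keyword matching.
    text = " ".join(" ".join(collector.text_parts).lower().split())
    hosts = {strip_www(urlparse(link).netloc.lower()) for link in collector.links}
    return {
        "ecommerce": any(term in text for term in ECOMMERCE_TERMS),
        "job_vacancies": any(term in text for term in JOB_TERMS),
        "social_media": bool(hosts & SOCIAL_DOMAINS),
    }


# Made-up page used only to exercise the function.
SAMPLE = """<html><body>
<h1>Acme Srl</h1>
<p>Browse our catalogue and add to cart.</p>
<a href="https://www.facebook.com/acme">Follow us</a>
<a href="/contact">Contact</a>
</body></html>"""

print(extract_indicators(SAMPLE))
```

Keyword matching of this kind is only a baseline; the workplan itself foresees predicting such characteristics with text and data mining techniques.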
4
Work organization - 1
Four tasks:
- Task 1 – Data access
- Task 2 – Data handling
- Task 3 – Testing of Methods and Techniques
- Task 4 – Finalization of Methods and Techniques
Tasks 1, 2 and 3 in SGA-1 (by 31/7/2017)
Task 4 foreseen for SGA-2
5
Work organization - 2
Participants (Effort P/M):
- IT – 92
- BG – 200
- NL – 45
- PL – 100
- SE – 55
- UK – 50
6
Task 1: Data access
1.1 Inventory of enterprises target of the web scraping
    - Dependence on Task 2: use case refinement and «specialization» for each country
1.2 Identification of URLs
    - Ad-hoc software tools to retrieve them when not available
1.3 Legal aspects and privacy issues
    - Jointly with WP1
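The slides do not describe the ad-hoc URL retrieval tools of step 1.2, so the following is only one possible sketch. It assumes a list of candidate URLs is already available for an enterprise (for example from a search engine query, which is not shown) and ranks them by how many tokens of the enterprise name appear in the host name. The function names, the legal-form list and the threshold are all hypothetical.

```python
import re
from urllib.parse import urlparse

# Legal-form tokens ignored when comparing names (illustrative, not exhaustive).
LEGAL_FORMS = {"srl", "spa", "ltd", "bv", "ab", "gmbh", "inc"}


def name_tokens(name):
    """Split an enterprise name into lower-case tokens, dropping legal forms."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return [t for t in tokens if t not in LEGAL_FORMS]


def score_url(enterprise_name, url):
    """Fraction of name tokens that appear in the URL's host name."""
    host = urlparse(url).netloc.lower()
    tokens = name_tokens(enterprise_name)
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in host) / len(tokens)


def best_url(enterprise_name, candidates, threshold=0.5):
    """Return the highest-scoring candidate, or None if none reaches the threshold."""
    scored = [(score_url(enterprise_name, u), u) for u in candidates]
    if not scored:
        return None
    score, url = max(scored)
    return url if score >= threshold else None


# Made-up enterprise and candidate URLs on reserved example domains.
candidates = ["https://www.example.org/rossi", "https://www.rossimobili.example"]
print(best_url("Rossi Mobili Srl", candidates))
```

Matching on the host name alone is a deliberate simplification; a fuller approach could also compare page content with register data such as address or VAT number.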
7
Task 2: Data handling
2.1 Detailed use cases definition
    - Coordination with ESS.VIP “European System of Interoperable Statistical Business Registers”
2.2 Choice of techniques and technologies and set up of the working environment
    - Sandbox?
2.3 Carrying out scraping activities and sharing of results among participants
8
Task 3: Testing of Methods and Techniques
- Testing activity that will be enriched and finalized in SGA-2
- Select some use cases, out of the defined ones, that give a good representation of the overall potential statistical outputs and of the information that can enrich business registers.
- Build a proof of concept of the selected use cases to predict characteristics of the enterprises by applying text and data mining techniques.
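To make the proof-of-concept idea concrete, the sketch below predicts one enterprise characteristic from scraped text. The slides do not name a specific technique, so multinomial naive Bayes with add-one smoothing is used here purely as an assumption, and the training texts and labels are invented toy data rather than project results.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case a text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each known word.
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training set (invented): website text labelled by the first use case.
texts = [
    "add to cart and checkout online",
    "buy now free shipping on your order",
    "our company history and mission",
    "contact our office for information",
]
labels = ["ecommerce", "ecommerce", "other", "other"]

model = NaiveBayes().fit(texts, labels)
print(model.predict("order online and checkout"))
```

In practice the labels would come from a source where the characteristic is already known for a subset of enterprises, so that predictions can be evaluated before they are used to enrich the register.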
9
Deliverables and milestones for SGA-1
To anticipate for reviewing
Deliverables (due date):
- Report with legal aspects: Month 12 (January 2017)
- Technical and methodological report describing web scraping, prediction and inference procedures: Month 18 (July 2017)
Milestone:
- Progress and technical report of first internal WP-meeting: Month 4 (May 2016)
10
Gantt: Proposal
Timeline: M1 (Feb), M3 (April), M6 (July), M9 (October), M12 (Jan)
- Task 1: Data Access (1.1 Inventory, 1.2 URLs, 1.3 Legal aspects)
- Task 2: Data Handling (2.1 Use cases, 2.2 IT architecture, 2.3 Scraping)
- Task 3: Testing (3.1 Proof of Concept)
[Gantt chart: task bars not reproduced]
11
Agenda of the meeting 23 March 2016
14:00-14:30 Overview of WP2 workplan (M. Scannapieco, Istat)
14:30-16:30 Sharing of previous experiences on scraping
    - Istat’s experience (G. Barcaroli)
    - ONS’s experience (R. Breton)
    - CBS’s experience (O. ten Bosch)
16:30-17:30 Characteristics of National Business Registers (all participants)
12
Agenda of the meeting 24 March 2016
9:30-10:30 The issues of URLs retrieval (G. Barcaroli, M. Scannapieco, Istat)
10:30-11:30 Legal issues (all participants)
11:30-12:30 Use case definition and stakeholder involvement (all participants)
Lunch break
14:00-15:00 Working environment and tools (Istat)
15:00-15:30 Interaction with WP1 (M. Scannapieco/R. Breton)
15:30-16:00 Wrap-up and To Do activities (M. Scannapieco)