ESSnet Big Data WP2 Workplan
Monica Scannapieco
Istituto Nazionale di Statistica – Istat
Web Scraping Enterprises Characteristics: Objectives
Main objectives:
- to demonstrate whether business registers can be improved by predicting the values of some key variables from scraped data
- to verify whether statistical outputs can be produced from the predicted data
Web Scraping Enterprises Characteristics: Use cases
Initial set of use cases in the proposal:
- whether or not an enterprise performs e-commerce
- whether an enterprise manages job vacancies on its site
- presence in social media
- contact information: location, contact emails, etc.
- profiling information: type of activity, links with other enterprises, etc.
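As an illustration of how the use cases above could map to signals in a scraped page, here is a minimal sketch. It is not the WP2 method: the keyword lists, the `extract_indicators` function and the sample page are all hypothetical, and a real study would need per-language term lists and far more robust HTML handling.

```python
import re

# Hypothetical keyword lists, chosen only for illustration.
ECOMMERCE_TERMS = ("add to cart", "shopping cart", "checkout", "basket")
JOB_TERMS = ("job vacancies", "careers", "work with us", "open positions")
SOCIAL_DOMAINS = ("facebook.com", "twitter.com", "linkedin.com", "youtube.com")

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
TAG_RE = re.compile(r"<[^>]+>")


def extract_indicators(html: str) -> dict:
    """Return crude per-use-case indicators for one scraped page."""
    lowered = html.lower()
    # Strip tags and collapse whitespace to get the visible text.
    text = re.sub(r"\s+", " ", TAG_RE.sub(" ", lowered))
    return {
        "ecommerce": any(term in text for term in ECOMMERCE_TERMS),
        "job_vacancies": any(term in text for term in JOB_TERMS),
        # Social media links are searched in the raw HTML (href attributes).
        "social_media": sorted(d for d in SOCIAL_DOMAINS if d in lowered),
        "emails": sorted(set(EMAIL_RE.findall(html))),
    }


# Invented sample page, not real scraped data.
SAMPLE = """<html><body>
<a href="https://www.facebook.com/example">Follow us</a>
<button>Add to cart</button>
<p>Contact: info@example.com</p>
</body></html>"""

result = extract_indicators(SAMPLE)
```

Indicators of this kind would typically serve as input features for a prediction step rather than as final answers.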
Work organization - 1
Four tasks:
- Task 1 – Data access
- Task 2 – Data handling
- Task 3 – Testing of Methods and Techniques
- Task 4 – Finalization of Methods and Techniques
Tasks 1, 2 and 3 in SGA-1 (by 31/7/2017)
Task 4 foreseen for SGA-2
Work organization - 2
Participants (Effort P/M):
- IT – 92
- BG – 200
- NL – 45
- PL – 100
- SE – 55
- UK – 50
Task 1: Data access
1.1 Inventory of the enterprises targeted by the web scraping
- Depends on Task 2: use case refinement and "specialization" for each country
1.2 Identification of URLs
- Ad hoc software tools to retrieve them when not available
1.3 Legal aspects and privacy issues
- Jointly with WP1
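One way to picture the URL identification step (1.2) is as a ranking problem: given an enterprise name and some candidate URLs (for example from a search engine), pick the candidate whose domain most resembles the name. The sketch below is only an assumption about how such a tool might start, not the tool used in WP2. The helper names, the list of legal suffixes and the example enterprise and URLs are all invented.

```python
import re
from difflib import SequenceMatcher
from urllib.parse import urlparse

# Hypothetical list of legal-form suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"spa", "srl", "ltd", "bv", "ab", "plc", "gmbh"}


def normalize_name(name: str) -> str:
    """Lowercase, drop dots and legal suffixes, and join the remaining tokens."""
    tokens = re.findall(r"[a-z0-9]+", name.lower().replace(".", ""))
    return "".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def domain_label(url: str) -> str:
    """Return the first label of the host, ignoring a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host.split(".")[0]


def score(name: str, url: str) -> float:
    """Similarity in [0, 1] between the enterprise name and the domain label."""
    return SequenceMatcher(None, normalize_name(name), domain_label(url)).ratio()


def best_url(name: str, candidates: list) -> str:
    """Pick the candidate URL whose domain best matches the enterprise name."""
    return max(candidates, key=lambda u: score(name, u))


# Invented enterprise and candidate URLs, for illustration only.
candidates = [
    "https://www.yellowpages.example/rossi",
    "https://www.rossicostruzioni.example",
    "https://www.facebook.com/rossicostruzioni",
]
chosen = best_url("Rossi Costruzioni S.r.l.", candidates)
```

Taking only the first host label is a simplification: it would mishandle sites hosted on subdomains, and a realistic tool would also check the page content (address, VAT number) before accepting a match.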
Task 2: Data handling
2.1 Detailed use case definition
- Coordination with ESS.VIP "European System of Interoperable Statistical Business Registers"
2.2 Choice of techniques and technologies, and setup of the working environment
- Sandbox?
2.3 Carrying out scraping activities and sharing results among participants
Task 3: Testing of Methods and Techniques
Testing activity that will be enriched and finalized in SGA-2
- Select, from the defined use cases, a subset that is well representative of the overall potential statistical outputs and of the information that can enrich business registers.
- Build a proof of concept for the selected use cases, predicting characteristics of the enterprises by applying text and data mining techniques.
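To make "predicting characteristics by applying text mining techniques" concrete, the following is a minimal sketch of one possible technique: a multinomial naive Bayes classifier with add-one smoothing over page text. The choice of naive Bayes is an assumption for illustration, not the method selected by WP2, and the tiny training set is invented.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text: str) -> list:
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text: str) -> str:
        tokens = tokenize(text)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each token.
            log_prob = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokens:
                count = self.word_counts[label][tok]
                log_prob += math.log((count + 1) / (total_words + len(self.vocab)))
            if log_prob > best_score:
                best_label, best_score = label, log_prob
        return best_label


# Invented toy training data: scraped text paired with a known register value.
train_texts = [
    "add to cart checkout secure payment",
    "shop online basket delivery checkout",
    "our company history and mission",
    "contact our office for a quote",
]
train_labels = ["ecommerce", "ecommerce", "no_ecommerce", "no_ecommerce"]

model = NaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("online shop with basket and checkout")
```

In a proof of concept the training labels would come from enterprises whose characteristic is already known (for example from a survey), and the fitted model would then be applied to the scraped text of the remaining enterprises in the register.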
Deliverables and milestones for SGA-1
Deliverables (to be anticipated for reviewing):
- Report with legal aspects. Due date: Month 12 (January 2017)
- Technical and methodological report describing web scraping, prediction and inference procedures. Due date: Month 18 (July 2017)
Milestone:
- Progress and technical report of first internal WP meeting. Due date: Month 4 (May 2016)
Gantt: Proposal
[Gantt chart. Timeline columns: M1 (Feb), M3 (April), M6 (July), M9 (October), M12 (Jan). Rows:]
- Task 1: Data Access: 1.1 Inventory, 1.2 URLs, 1.3 Legal aspects
- Task 2: Data Handling: 2.1 Use cases, 2.2 IT architecture, 2.3 Scraping
- Task 3: Testing: 3.1 Proof of Concept
Agenda of the meeting: 23 March 2016
14:00-14:30 Overview of WP2 workplan (M. Scannapieco, Istat)
14:30-16:30 Sharing of previous experiences on scraping
- Istat's experience (G. Barcaroli)
- ONS's experience (R. Breton)
- CBS's experience (O. ten Bosch)
16:30-17:30 Characteristics of National Business Registers (all participants)
Agenda of the meeting: 24 March 2016
9:30-10:30 The issues of URL retrieval (G. Barcaroli, M. Scannapieco, Istat)
10:30-11:30 Legal issues (all participants)
11:30-12:30 Use case definition and stakeholder involvement (all participants)
Lunch break
14:00-15:00 Working environment and tools (Istat)
15:00-15:30 Interaction with WP1 (M. Scannapieco, R. Breton)
15:30-16:00 Wrap-up and to-do activities (M. Scannapieco)