WP1: Web scraping Job Vacancies - ELSTAT
Hellenic Statistical Authority
Christina Pierrakou – Eleni Bisioti
ESSnet Big Data: WP1: Web scraping / Job Vacancies
Overview
- Introduction
- Web Scraping Tools
- Deduplication
- Web Scraping Experiment
- Matching Results
- Next Steps
Introduction
- Scrape ads directly from two job portals using a web scraping tool
- Collect key variables:
  - Job title
  - Job description
  - Location
  - Company name
  - Posted date
  - Salary and job type (full time/temporary)
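The key variables listed above could be held in one record per ad. The sketch below is illustrative only: the class name, field names, and the sample values are assumptions, not the structure actually used in the project.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class JobAd:
    """One scraped advertisement; all field names are hypothetical."""
    portal: str                           # which job portal the ad came from
    title: str
    description: str
    location: Optional[str] = None
    company_name: Optional[str] = None    # may be absent from an ad
    posted_date: Optional[date] = None
    salary: Optional[str] = None          # kept as raw text; formats vary
    job_type: Optional[str] = None        # e.g. "full time", "temporary"


# Made-up example record, not real data.
ad = JobAd(portal="portal_a", title="Accountant",
           description="Monthly reporting.", location="Athens",
           posted_date=date(2016, 6, 20), job_type="full time")
```

Optional fields default to None so that an ad with a missing company name or salary can still be stored and counted.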
Web Scraping Tools
- Import.io: “point and click” tool for general scraping purposes
- Content Grabber: setting scraping agents, waiting for selectors, handling error cases
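Handling error cases usually means retrying a page that failed to load and moving on if it keeps failing. The sketch below shows that idea in plain Python; it does not reproduce how Import.io or Content Grabber work internally, and the function name and parameters are assumptions.

```python
import time
from typing import Callable, Optional


def fetch_with_retries(fetch: Callable[[str], str], url: str,
                       attempts: int = 3,
                       delay_seconds: float = 1.0) -> Optional[str]:
    """Call fetch(url) up to `attempts` times; return None if every try fails."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose: this is only a sketch
            print(f"attempt {attempt} for {url} failed: {error}")
            if attempt < attempts:
                time.sleep(delay_seconds)  # pause before the next try
    return None
```

The `fetch` argument can be any function that takes a URL and returns the page text, for example a small wrapper around `urllib.request.urlopen`. Passing it in keeps the retry logic independent of the download method.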
Deduplication
- In the structure of each job portal there is a specific “point” from which data can be scraped to produce data sets without duplicates.
- This approach worked well for each portal.
- More work is needed on removing duplicates in the joint data set.
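One possible starting point for the joint data set is to compare ads on a normalised key. The sketch below is an assumption about how this could be done, not the method used so far; the choice of title, company name, and location as the key is hypothetical.

```python
import re
import unicodedata
from typing import Dict, Iterable, List, Optional, Tuple

Ad = Dict[str, Optional[str]]


def normalise(text: str) -> str:
    """Lower-case, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_accents = "".join(ch for ch in decomposed
                         if not unicodedata.combining(ch))
    return " ".join(re.sub(r"[^\w\s]", " ", no_accents).split())


def dedup_key(ad: Ad) -> Tuple[str, str, str]:
    """Hypothetical key: normalised title, company name and location."""
    return tuple(normalise(ad.get(field) or "")
                 for field in ("title", "company_name", "location"))


def deduplicate(ads: Iterable[Ad]) -> List[Ad]:
    """Keep the first ad seen for each key, preserving input order."""
    seen = set()
    unique = []
    for ad in ads:
        key = dedup_key(ad)
        if key not in seen:
            seen.add(key)
            unique.append(ad)
    return unique
```

A key of this kind treats two ads with the same title and location but no company name as duplicates, so ads without a company name may need a stricter key, for example one that also includes the posted date.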
Web Scraping Experiment
Matching businesses advertising on job portals with enterprises contained in the:
- Sample of the Job Vacancy Survey
- Statistical Business Register
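Matching of this kind can be sketched as name cleaning followed by a similarity score. The example below uses the standard-library `difflib`; the legal-form list, the 0.9 threshold, and the function names are assumptions for illustration and do not describe the matching procedure actually applied.

```python
import re
from difflib import SequenceMatcher
from typing import Dict, Optional

# Illustrative legal-form tokens only; a real list would be longer.
LEGAL_FORMS = {"sa", "ltd", "αε", "επε", "ικε", "οε"}


def clean_name(name: str) -> str:
    """Lower-case, drop punctuation and legal-form tokens."""
    tokens = re.sub(r"[^\w\s]", "", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_FORMS)


def match_to_register(ad_name: str, register: Dict[str, str],
                      threshold: float = 0.9) -> Optional[str]:
    """Return the id of the most similar register name, or None.

    `register` maps a (hypothetical) register id to an enterprise name.
    """
    target = clean_name(ad_name)
    if not target:
        return None
    best_id, best_score = None, 0.0
    for register_id, register_name in register.items():
        score = SequenceMatcher(None, target,
                                clean_name(register_name)).ratio()
        if score > best_score:
            best_id, best_score = register_id, score
    return best_id if best_score >= threshold else None
```

The same function can be run once against the Job Vacancy Survey sample and once against the Statistical Business Register. A high threshold favours fewer false matches at the cost of leaving more names unmatched.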
Matching Results (1/3)
- Sample: 3060 deduplicated ads
- Period: 15.6.2016 - 15.8.2016
- For 55% of ads, the employing enterprise was identified
- For 45% of ads, the employing enterprise was not identified
Matching Results (2/3)
- No company name available for 77% of ads
- Such ads open in a systematic way, such as “Leading Company…” or “Well Known Firm…”
- Probably “ghost vacancies”
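Ads that open with such generic phrases could be flagged automatically. The sketch below checks only the two phrases quoted on this slide; the function name is hypothetical, and whether the phrase appears in the company field or in the ad text is an assumption.

```python
import re

# Only the two openings quoted on the slide; extend as new patterns appear.
ANONYMOUS_OPENINGS = re.compile(
    r"^\s*(leading company|well[- ]known firm)\b", re.IGNORECASE)


def looks_anonymous(text: str) -> bool:
    """True when the text is empty or starts with a generic phrase."""
    if not text or not text.strip():
        return True
    return bool(ANONYMOUS_OPENINGS.match(text))
```

Flagged ads can then be set aside before name matching, since they carry no enterprise name to match on.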
Matching Results (3/3)
- 256 enterprises were identified
- 9% of enterprises were matched to the sample of the Job Vacancy Survey
- 30% of enterprises were matched to the Statistical Business Register
Classification of Companies by Economic Activities (NACE rev.2)
Classification of Ads by Economic Activities (NACE rev.2)
Next Steps
- The main focus is to continue matching enterprise names from job portals with Job Vacancy Survey data and the Statistical Business Register, in order to understand job portal coverage.
Thank you very much for your attention
c.pierrakou@statistics.gr
e.bisioti@statistics.gr