ESSNet Pilot: Web Scraping for Job Vacancy Statistics
Rationale
  Current official estimates (survey) compared with web data
  Frequency: monthly (survey); real-time? (web data)
  Breakdowns: industry sector, enterprise size, job type / skills, sub-national, national totals
  More frequent, more timely, more granular, cheaper???
Participants (SGA-1)
  United Kingdom (lead)
  Germany
  Sweden
  Slovenia
  Italy
  Greece
Broad Approach
  Understand the landscape of web-based job vacancy data in each country.
  Focus first on job portals; later explore enterprise websites.
  Try to replicate existing outputs, then investigate opportunities to produce new types of output.
  Develop specific approaches that are appropriate to the circumstances in each country.
  Develop common approaches where possible.
Data Access
  1. Web scraping job portals
  2. Job portal APIs
  3. Web scraping enterprise websites
  4. Public sector agencies
  5. Commercial suppliers
Job Portals – Evaluation Criteria
  What
    1. Position
    2. Occupation
    3. Education
    4. Type of job (temporary or permanent, full-time or part-time)
  When
    5. Date of advertised vacancy
    6. Date of application deadline
    7. Date to fill a vacancy
  Where
    8. Location of job
  Who
    9. Direct employer or agency
    10. Economic activity of employer (NACE)
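One way to record how a portal scores against the ten criteria is a small checklist structure. The sketch below is illustrative only: the field names, the `PortalEvaluation` class and the portal name are assumptions made for this example, not part of the pilot's tooling.

```python
from dataclasses import dataclass, field

# The ten criteria, grouped as What / When / Where / Who.
# The identifiers are hypothetical short names for the slide's items.
CRITERIA = {
    "what": ["position", "occupation", "education", "job_type"],
    "when": ["date_advertised", "application_deadline", "date_to_fill"],
    "where": ["location"],
    "who": ["employer_or_agency", "employer_nace"],
}


@dataclass
class PortalEvaluation:
    portal: str                                   # hypothetical portal name
    available: set = field(default_factory=set)   # criteria the portal exposes

    def coverage(self) -> dict:
        """Share of criteria available within each group (0.0 to 1.0)."""
        return {
            group: sum(name in self.available for name in names) / len(names)
            for group, names in CRITERIA.items()
        }

    def missing(self) -> list:
        """Criteria the portal does not expose, in slide order."""
        return [
            name
            for names in CRITERIA.values()
            for name in names
            if name not in self.available
        ]


example = PortalEvaluation(
    "example-portal",
    {"position", "job_type", "date_advertised", "location", "employer_or_agency"},
)
print(example.coverage())
print(example.missing())
```

Comparing the `coverage()` results of several portals gives a quick view of which ones carry enough information to replicate existing survey breakdowns.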
Classification of Job Portals
  1. Job boards
  2. Job search engines
  3. Hybrid
Conceptual Definitions
  Job Ad
  Job Vacancy
Conceptual Definitions
  Job Ad
  Job Vacancy
  “Ghost” Vacancy
Coverage Issues
  Target population: all job vacancies
  Advertised on a job portal
  Advertised on enterprise website
  Advertised through agency
  Employing business identifiable
  ‘Ghost’ vacancies
Assessing Coverage
  [Diagram labels: Job Portal (×3), Enterprise, Enterprise Matching, Business Register, Job Vacancy Survey]
  Matching problems:
    Advertising employer differs from reporting unit
    Trading name differs from legal name
    Duplicate names on business register
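The matching problems listed above can be illustrated with a small name-matching sketch. It assumes a simple approach (normalise names, then compare with the standard library's `difflib.SequenceMatcher`); the suffix list, the threshold, the register IDs and the enterprise names are all made up for illustration and are not the pilot's actual method or data.

```python
import re
from difflib import SequenceMatcher

# Illustrative legal-form suffixes to drop before comparing names (assumption).
LEGAL_SUFFIXES = {"ltd", "limited", "plc", "llp", "gmbh", "ab"}


def normalise(name: str) -> str:
    """Lower-case, strip punctuation and remove legal-form suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def match_enterprise(advertised: str, register: dict, threshold: float = 0.85) -> list:
    """Return the register IDs that best match an advertised employer name.

    An empty list means no match above the threshold; more than one ID
    means the name is ambiguous (e.g. duplicate names on the register).
    """
    target = normalise(advertised)
    scores = {
        unit_id: SequenceMatcher(None, target, normalise(legal_name)).ratio()
        for unit_id, legal_name in register.items()
    }
    best = max(scores.values(), default=0.0)
    if best < threshold:
        return []
    return sorted(unit_id for unit_id, score in scores.items() if score == best)


# Hypothetical business register extract.
register = {
    "BR001": "Acme Widgets Limited",
    "BR002": "Acme Widgets Ltd.",
    "BR003": "Northern Bakeries PLC",
}

print(match_enterprise("ACME Widgets", register))          # ambiguous: two units
print(match_enterprise("Northern Bakery", register))       # trading-name variant
print(match_enterprise("Totally Different Co", register))  # no match
```

Returning every tied ID rather than picking one keeps the ambiguous cases visible so they can be resolved manually or with additional variables such as location.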
Removing Duplicates
  [Diagram labels: Job Portal, Concatenated list, Deduplicate, Final deduplicated list]
  1. Create common variable list: Job_title, Job_description, Location_city, Location_region, Date_posted, Enterprise name
  2. Clean data, e.g. " .NET Developer - Stoke-On-Trent - £35-£40K "
  3. Run dedup to produce candidate matches
  4. Active learning step (manual coding of > 100 records)
  5. Rerun to automatically remove “duplicate” job ads
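The cleaning and candidate-matching steps can be sketched as follows. This is a simplified stand-in, not the tool used in the pilot: it uses `difflib.SequenceMatcher` from the standard library with a fixed similarity threshold in place of the active learning step, and the function names, compared fields, threshold and sample ads are assumptions for illustration.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations

# Fields compared when looking for duplicates (assumed subset of the common variables).
COMPARE_FIELDS = ("job_title", "location_city", "enterprise_name")


def clean_title(raw: str) -> dict:
    """Split a raw title like ' .NET Developer - Stoke-On-Trent - £35-£40K '."""
    parts = [p.strip() for p in raw.strip().split(" - ")]
    location = salary = None
    for part in parts[1:]:
        if re.search(r"[£€$]\s*\d", part):   # looks like a salary
            salary = part
        elif location is None:               # first non-salary part
            location = part
    return {"job_title": parts[0], "location_city": location, "salary": salary}


def similarity(a: dict, b: dict) -> float:
    """Case-insensitive string similarity over the compared fields."""
    def key(ad):
        return " ".join(str(ad.get(f, "")).lower() for f in COMPARE_FIELDS)
    return SequenceMatcher(None, key(a), key(b)).ratio()


def candidate_duplicates(ads: list, threshold: float = 0.9) -> list:
    """Index pairs of ads whose similarity reaches the threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(ads), 2)
        if similarity(a, b) >= threshold
    ]


def deduplicate(ads: list, threshold: float = 0.9) -> list:
    """Keep an ad only if it is not similar to an ad already kept."""
    kept = []
    for ad in ads:
        if all(similarity(ad, k) < threshold for k in kept):
            kept.append(ad)
    return kept


# Hypothetical concatenated list from two portals.
first = clean_title(" .NET Developer - Stoke-On-Trent - £35-£40K ")
first["enterprise_name"] = "Example Ltd"
ads = [
    first,
    {"job_title": ".Net developer", "location_city": "Stoke-on-Trent",
     "enterprise_name": "Example Ltd"},
    {"job_title": "Data Analyst", "location_city": "Newport",
     "enterprise_name": "Example Ltd"},
]

print(candidate_duplicates(ads))
print(len(deduplicate(ads)))
```

Comparing every pair is quadratic in the number of ads, so a real concatenated list would need blocking (for example, comparing only ads in the same region) before the pairwise step.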
Conclusion
  Job portal data is very rich, but complex and messy.
  Difficult to align to established statistical concepts.
  Need to understand coverage issues and how to tackle them.
  Making progress, but a long way to go.
Future Steps
  Produce measures of job portal coverage.
  Explore approaches for enhancing coverage (including web scraping enterprise websites).
  Develop methods for combining vacancy survey and job ads from the web.
  Develop methods for feature extraction and coding/classifying textual data (to enrich existing survey data).
  Explore other uses of online job vacancy data.
Future Steps
  Additional ESS partners joining from July 2017:
    Portugal
    Belgium
    France
    Denmark
  … the beginnings of a longer-term network?
Thank you for your attention!