WEB SCRAPING FOR JOB STATISTICS


WEB SCRAPING FOR JOB STATISTICS Boro Nikic International Conference on Big Data for Official Statistics Dublin, September 2016

Big data implementation strategy at SURS
Short-term goal: a classical statistical survey combined with Big Data sources (1, 2, ...) feeds data processing, leading to improvement of existing statistics (nonresponse, precision, reduced burden, ...).
Long-term goal: existing sources (1 to n) and Big Data sources (1, 2, ...) feed integration and processing of data, leading to existing statistics, flash estimates and new statistics, with an occasional classical statistical survey for quality assessment.

Current Survey on JV
Observed unit: enterprise (legal unit)
Sample: probability sample (8942 units), stratified (activity, size)
Administrative part of sample (3633 units): enterprises under majority government control
Reference point: last day of the middle month of each quarter
Mode of data collection: web, CATI, administrative source, contact center (email, telephone, ...)
Statistics: number of JV ads broken down by activity and size

Current Survey on JV
Observed population: legal units (LU) with business activities in sectors B to S
Reference period: last working day of the middle month of every quarter
Statistics: total number of advertised job vacancies (JV) on the observed population and on domains defined by size and activity
- 2 size classes: 1-9 employees; 10+ employees
- 18 activity classes: B-S
Totals of advertised job vacancies are produced for:
- the population
- the population broken down by activities
- units with 10+ employees
- units with 10+ employees broken down by activities
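The four domains listed above can be illustrated with a small aggregation sketch. The records, field layout and the weight column are assumptions made for illustration only; the slides do not describe the estimator used by SURS.

```python
from collections import defaultdict

# Hypothetical reported records:
# (activity section, number of employees, job vacancies, weight).
# The weight is an assumed design weight, not something stated in the slides.
records = [
    ("C", 5, 1, 10.0),
    ("C", 120, 4, 2.0),
    ("G", 8, 0, 12.0),
    ("G", 35, 2, 3.0),
]

def size_class(employees):
    """Map an employee count to the two size classes named on the slide."""
    return "1-9" if employees < 10 else "10+"

def totals(rows):
    """Weighted JV totals for the four domains listed on the slide."""
    out = {
        "population": 0.0,
        "by_activity": defaultdict(float),
        "10+": 0.0,
        "10+_by_activity": defaultdict(float),
    }
    for activity, employees, jv, weight in rows:
        contribution = weight * jv
        out["population"] += contribution
        out["by_activity"][activity] += contribution
        if size_class(employees) == "10+":
            out["10+"] += contribution
            out["10+_by_activity"][activity] += contribution
    return out

result = totals(records)
```

Keeping the size-class rule in one function means the 10+ domain and its activity breakdown always use the same cut-off.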

Methodology of scraping of Job portals (1)
Tasks:
- Identify the most important job portals in Slovenia
- Identify web scraping tools for scraping structured data
- Collect the (mostly) structured data from the identified job portals
- Record linkage with the Business Register (ID of enterprise, activity, size, ...)
- Remove the duplicates, detect the specialized agencies
- Analysis of data (sample coverage, combining with other sources, ...)
- Detect other important variables (job title, skills, ...) (not implemented yet)

Methodology of scraping of Job portals (2)
There are around 30 job portals in Slovenia. The two most important ones cover more than 95% of JV ads. Data have been collected weekly from those two portals since May 2016.

Structure of the scraped data

Record linkage with BR
1 - Merging by unique short name of enterprise
2 - Merging by unique complete name of enterprise
3 - Merging by non-unique short name and location of enterprise
4 - Record linkage by using distance function (short name of enterprise)
5 - Merging by non-unique complete name and location of enterprise
6 - Record linkage by using distance function (complete name of enterprise)
7 - Manual (agencies, bigger enterprises)
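A minimal sketch of such a cascade, limited to the short-name steps (1, 3 and 4) with a fallback to manual review. The register rows, enterprise names, the 0.85 threshold and the use of Python's difflib similarity ratio as the "distance function" are all assumptions for illustration; the slides do not say which distance function was used.

```python
from difflib import SequenceMatcher

# Hypothetical Business Register extract (fictional enterprises).
REGISTER = [
    {"id": 1, "short_name": "alfa servis", "location": "Ljubljana"},
    {"id": 2, "short_name": "beta gradnje", "location": "Maribor"},
    {"id": 3, "short_name": "gama transport", "location": "Ljubljana"},
    {"id": 4, "short_name": "gama transport", "location": "Maribor"},
]

def normalise(name):
    """Lower-case and collapse whitespace before comparing names."""
    return " ".join(name.lower().split())

def similarity(a, b):
    """String similarity in [0, 1]; stands in for the distance function."""
    return SequenceMatcher(None, a, b).ratio()

def link(ad_name, ad_location, register=REGISTER, threshold=0.85):
    """Return (register id, step used), or (None, 'manual') if no step matches."""
    name = normalise(ad_name)
    same_name = [r for r in register if r["short_name"] == name]
    # Step 1: exact match on a name that is unique in the register.
    if len(same_name) == 1:
        return same_name[0]["id"], "unique name"
    # Step 3: non-unique name disambiguated by location.
    same_place = [r for r in same_name if r["location"] == ad_location]
    if len(same_place) == 1:
        return same_place[0]["id"], "name and location"
    # Step 4: closest register name, accepted only above the threshold.
    if not same_name:
        best = max(register, key=lambda r: similarity(name, r["short_name"]))
        if similarity(name, best["short_name"]) >= threshold:
            return best["id"], "distance"
    # Step 7: everything else is left for manual linkage.
    return None, "manual"
```

Ordering the steps from exact to fuzzy keeps the cheap, reliable matches from being overridden by approximate ones; the same pattern would be repeated for complete names (steps 2, 5 and 6).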

Duplicates
Key for merging: name of enterprise, job title, location
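Removing duplicates on this key could look roughly like the sketch below. The field names and the normalisation (lower-casing, collapsing whitespace) are assumptions; the slide only names the three key variables.

```python
KEY_FIELDS = ("enterprise", "job_title", "location")  # hypothetical field names

def dedup_key(ad):
    """Key named on the slide: enterprise name, job title, location."""
    return tuple(" ".join(ad[field].lower().split()) for field in KEY_FIELDS)

def remove_duplicates(ads):
    """Keep the first ad seen for each key, preserving input order."""
    seen = set()
    unique = []
    for ad in ads:
        key = dedup_key(ad)
        if key not in seen:
            seen.add(key)
            unique.append(ad)
    return unique

# Fictional ads, e.g. the same vacancy scraped from two portals.
ads = [
    {"enterprise": "Alfa Servis", "job_title": "Welder", "location": "Ljubljana"},
    {"enterprise": "alfa  servis", "job_title": "welder", "location": "ljubljana"},
    {"enterprise": "Alfa Servis", "job_title": "Waiter", "location": "Ljubljana"},
]
unique_ads = remove_duplicates(ads)
```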

IT tools involved in scraping Job portals
Tools: Data Scraping Studio, SAS Contextual Analytics
Pipeline stages: scraping, output file, storage, statistical production

Methodology of scraping of enterprise websites (1)
- Identify URL links of enterprises
- Identify sublinks which potentially contain JV ads
- Employ machine learning techniques to detect JV ads from the contents of the sublinks
- Detect variables (location, job title, skills, ...) (not implemented yet)

Methodology of scraping of enterprise websites (2)
- 2486 URL links of enterprises (out of 8942 units in the sample)
- Sources: ICT Survey, Business Register, phonebook, ...
- Excluded: international enterprises, webshops, media, job portals, specialized agencies
- Pilot on units in the ICT Survey

Methodology of scraping of enterprise websites (3)
Concept of keywords; sublinks of a website are searched up to depth 3:
Depth 0:
  http://ikos.si/
Depth 1:
  http://ikos.si/storitve/
  http://ikos.si/jobseekers/
  http://ikos.si/reference/
Depth 2:
  http://ikos.si/jobseekers database/
  http://ikos.si/job vacancies/
Depth 3:
  http://ikos.si/job vacancies/6/welder/
  http://ikos.si/prosta_delovna_mesta/3/waiter/
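The depth-limited search can be sketched as a breadth-first traversal. To keep the example self-contained it walks a hypothetical in-memory link graph (example.test) instead of fetching pages, and the keyword list is an assumption; only the depth limit of 3 and the keyword idea come from the slide.

```python
from collections import deque

# Hypothetical site structure: page URL -> links found on that page.
LINKS = {
    "http://example.test/": [
        "http://example.test/services/",
        "http://example.test/jobs/",
    ],
    "http://example.test/services/": [],
    "http://example.test/jobs/": ["http://example.test/jobs/vacancies/"],
    "http://example.test/jobs/vacancies/": [
        "http://example.test/jobs/vacancies/6/welder/",
    ],
    "http://example.test/jobs/vacancies/6/welder/": [
        "http://example.test/jobs/vacancies/6/welder/apply/",
    ],
}

# Assumed keywords that mark a sublink as a potential JV page.
KEYWORDS = ("job", "vacanc", "prosta_delovna_mesta")

def crawl(start, get_links, max_depth=3):
    """Breadth-first search; returns {url: depth} for every page reached."""
    depth_of = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if depth_of[url] == max_depth:
            continue  # do not follow links beyond the depth limit
        for nxt in get_links(url):
            if nxt not in depth_of:
                depth_of[nxt] = depth_of[url] + 1
                queue.append(nxt)
    return depth_of

def candidates(depth_of):
    """Reached URLs whose address contains one of the keywords."""
    return sorted(u for u in depth_of if any(k in u.lower() for k in KEYWORDS))

pages = crawl("http://example.test/", lambda url: LINKS.get(url, []))
jv_candidates = candidates(pages)
```

In a real crawler `get_links` would download the page and extract its anchors; passing it in as a function keeps the traversal logic independent of the fetching code.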

Methodology of scraping of enterprise websites (3)
- For each webpage, the HTML source of the page is checked
- If a link begins with ftp or sftp, it is excluded
- The domain must remain the same, otherwise the link is excluded (for example, a link from https://mercator.si/ to http://www.mercator-ip.si/job/)
- Links to images, PDF or Word documents, video, Facebook or Twitter accounts, ... are excluded
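These exclusion rules could be expressed as a single link filter, sketched below. The suffix list, the social-media host list and the decision to ignore a leading "www." when comparing domains are assumptions; the slide only names the categories.

```python
from urllib.parse import urlparse

# Assumed file types and hosts standing in for the categories on the slide.
EXCLUDED_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif", ".pdf",
                     ".doc", ".docx", ".mp4", ".avi")
EXCLUDED_HOSTS = ("facebook.com", "twitter.com")

def host(url):
    """Lower-cased host name without a leading 'www.' (an assumption)."""
    name = (urlparse(url).hostname or "").lower()
    return name[4:] if name.startswith("www.") else name

def keep_link(link, start_url):
    """Return True if the link passes all exclusion rules."""
    parts = urlparse(link)
    if parts.scheme in ("ftp", "sftp"):
        return False
    link_host = host(link)
    if any(link_host == h or link_host.endswith("." + h) for h in EXCLUDED_HOSTS):
        return False
    if link_host != host(start_url):
        return False  # the domain must remain the same
    if parts.path.lower().endswith(EXCLUDED_SUFFIXES):
        return False
    return True
```

With the slide's own example, `keep_link("http://www.mercator-ip.si/job/", "https://mercator.si/")` is rejected because the domain changes.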

Methodology of scraping of enterprise websites (4)

Methodology of scraping of enterprise websites (5)
Only text is scraped from every sub-link that potentially contains JV ads

Methodology of scraping of enterprise websites (6)
- Machine learning techniques are employed to detect JV ads
- Hand-defined list of 1000 examples in February (200 ads and 800 non-ads)
- A logistic regression model is used
- Results: 1349 potential sub-links (397 enterprises) are detected
- Manually checked: 160 JV ads (484 JV)
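To illustrate the idea of a logistic regression classifier on page text, here is a hand-rolled sketch in plain Python. It is not the model used at SURS: the vocabulary, the six training texts, the learning rate and the number of epochs are all invented for illustration, and a real application would train on the 1000 labelled examples with a dedicated tool.

```python
import math

# Hypothetical vocabulary; each feature is the count of one word in the text.
VOCAB = ["vacancy", "apply", "salary", "candidate", "product", "news", "contact"]

def features(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, epochs=500, lr=0.5):
    """Fit weights and bias by stochastic gradient descent on the log-loss."""
    weights = [0.0] * len(VOCAB)
    bias = 0.0
    data = [(features(text), label) for text, label in samples]
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            error = p - y
            bias -= lr * error
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    return weights, bias

def predict(model, text):
    """Estimated probability that the text is a JV ad."""
    weights, bias = model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, features(text))))

# Invented training texts: label 1 = JV ad, label 0 = not an ad.
TRAINING = [
    ("vacancy for welder apply now", 1),
    ("we seek a candidate salary negotiable", 1),
    ("apply for this vacancy candidate wanted", 1),
    ("our product range and news", 0),
    ("contact us for product support", 0),
    ("company news and contact details", 0),
]

model = train(TRAINING)
```

Pages whose predicted probability exceeds a chosen cut-off (0.5 is a common default) would then be passed on for manual checking, as the slide describes.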

Methodology of scraping of enterprise websites (6) Results: May, 2016

IT tools involved in scraping enterprise websites
Tools: SAS Contextual Analytics, Orange (http://orange.biolab.si/)
Pipeline stages: scraping, output file, identifying JV ads, statistical production

Coverage: Sample vs. scraped & admin data
                   Reported data   Scraped & admin data   Job portals   Enterprise websites
Number of JV ads   4312            2321                   1073          262
Percentage         100%            54%                    25%           6%
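The percentages follow directly from the counts, each expressed as a share of the 4312 reported JV ads and rounded to whole percent. A short check, with variable names chosen here for illustration:

```python
reported = 4312  # JV ads reported in the sample survey

sources = {
    "Scraped & admin data": 2321,
    "Job portals": 1073,
    "Enterprise websites": 262,
}

# Share of the reported total covered by each source, in whole percent.
shares = {name: round(100 * count / reported) for name, count in sources.items()}
```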

Foreseen activities
- Identify URLs of enterprises
- Improve the scalability of the application for identifying sub URL links (Sandbox)
- Text mining (unstructured data)
- Special treatment of specialized JV agencies (Adecco, Hill international, ...)
- Identify other channels which are used for advertising JV (additional questions in the questionnaire)
- Include the scraped data in the phases of collecting and processing sample data

Thank you for your attention! boro.nikic@gov.si