1
Use of Web scraping for Enterprises Characteristics
First Specific Grant Agreement on "Big Data Pilots I"
Bruxelles Preparatory Meeting, 7-8 January 2015
Work Package 2 – Use of Web scraping for Enterprises Characteristics
Giulio Barcaroli, Stefano De Francisci, Paolo Righi, Monica Scannapieco
Italian National Institute of Statistics (Istat)
2
Outline
• Description of Workpackage 2
  • Purposes
  • Tasks (SGA1 and SGA2)
• Legal, technological and methodological challenges
• Possible use cases
• Connections with Workpackage 1
3
Workpackage 2 – Use of web scraping for enterprises characteristics
In this ESSnet, one of the pilot projects concerns the use of websites as a source for official statistics. It is divided into two packages:
• use of web scraping for job vacancy statistics;
• use of web scraping for enterprises characteristics.
While the first is more oriented towards an inferential approach, the second is more oriented towards a predictive one. Nonetheless, the second should also investigate the possibility of producing statistical outputs. A number of activities common to the two packages will be illustrated.
4
Description: purposes
The general aim is to investigate whether web scraping, in connection with techniques such as text mining and machine learning, can be used to collect information concerning enterprises. This can be done at different levels:
(1) at micro level, in order to enrich the information in business registers;
(2) at aggregate level, in order to produce statistical output.
At both levels, the quality (accuracy) of results needs to be evaluated. While (1) is the primary target of this pilot, (2) is also relevant, and can be carried out jointly with (1).
5
Description: tasks
Four different tasks:
Data access:
• for each involved partner: (i) identification of the enterprise population of interest and of the related available registers; (ii) retrieval of the URLs pertaining to each target enterprise;
• investigation of legal aspects concerning web scraping (in cooperation with Eurostat).
Data collection:
• identification of information needs (in cooperation with Eurostat): consultation of stakeholders and users;
• review of available techniques for massive web scraping (Jsoup, HTTrack, etc.) in coordination with UNECE Sandbox, and development of one or more applications;
• scraping and storage of enterprise website content in the involved countries.
6
Description: tasks
Four different tasks (continued):
Data analysis and processing:
• on a sample of enterprises, independent identification of characteristics by means of other sources (e.g. ICT survey) or manual inspection of websites (possibly using crowdsourcing platforms);
• application of text mining and machine learning techniques to predict characteristics of the enterprises;
• calculation of quality indicators (accuracy, sensitivity, specificity).
Further developments (in SGA2):
• choice of the best predictor of enterprise characteristics, to be added to Enterprise Registers;
• using predicted values, calculation of estimates of population parameters by adopting sound inference procedures;
• evaluation of the related Mean Square Error.
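The three quality indicators named above can be computed from a comparison between independently identified characteristics and predicted ones. The following is a minimal sketch, assuming a binary characteristic; the function name `quality_indicators` and the eight-enterprise sample are hypothetical and serve only as illustration.

```python
from typing import Dict, Sequence


def quality_indicators(actual: Sequence[bool], predicted: Sequence[bool]) -> Dict[str, float]:
    """Accuracy, sensitivity and specificity of binary predictions."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # true positives
    tn = sum(1 for a, p in pairs if not a and not p)  # true negatives
    fp = sum(1 for a, p in pairs if not a and p)      # false positives
    fn = sum(1 for a, p in pairs if a and not p)      # false negatives
    return {
        # share of all enterprises classified correctly
        "accuracy": (tp + tn) / len(pairs),
        # share of true positives among enterprises that have the characteristic
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        # share of true negatives among enterprises that lack the characteristic
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


# Hypothetical sample: "does the enterprise sell online?"
# `survey` stands for the independent source, `model` for the predictions.
survey = [True, True, True, False, False, False, False, True]
model = [True, True, False, False, False, True, False, True]
indicators = quality_indicators(survey, model)
print(indicators)
```

In this made-up sample there are 3 true positives, 3 true negatives, 1 false positive and 1 false negative, so all three indicators equal 0.75.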
7
Challenges
This pilot is meant to investigate three different issues:
• Legal
• Technological
• Methodological
8
Legal aspects
Websites are, by definition, open to everyone, and their content should in principle not be subject to privacy concerns. Nonetheless:
• member states' statistical laws may set restrictions on mass scraping of websites without informing owners about the use of collected data;
• owners may protect their websites by preventing access by spiders.
All legal aspects have to be considered at both European and member state level (in coordination with Eurostat).
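One common way for owners to restrict spiders is a robots.txt file. As a minimal sketch, a scraper could check such rules before fetching a page, here with Python's standard `urllib.robotparser`. The robots.txt content, the user agent name `nsi-scraper` and the example.com URLs are assumptions made for illustration; the presentation does not say how the pilot handles such restrictions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as a site owner might publish it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network access is needed
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


allowed_home = is_allowed(ROBOTS_TXT, "nsi-scraper", "https://www.example.com/index.html")
allowed_private = is_allowed(ROBOTS_TXT, "nsi-scraper", "https://www.example.com/private/staff.html")
print(allowed_home, allowed_private)
```

Respecting robots.txt addresses only the technical side of the second point; it does not settle the legal questions raised by national statistical laws.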
9
Technologies for web scraping
We can distinguish two different kinds of web scraping:
• specific web scraping, when both the structure and the content of the websites to be scraped are perfectly known, and crawlers just have to replicate the behavior of a human being visiting the website and collecting the information of interest. Typical area of application: data collection for consumer price indices (ONS, CBS, Istat);
• generic web scraping, when no a priori knowledge of the content is available, and the whole website is scraped and subsequently processed in order to infer the information of interest.
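In generic web scraping, a typical first processing step is to reduce each downloaded page to its visible text, which can then be passed to text mining. The sketch below uses Python's standard `html.parser` as a stand-in; it is not one of the tools tested in the pilot, and the class name `VisibleTextExtractor` and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text that appears outside <script> and <style> elements."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    """Return the visible text of an HTML page as one space-separated string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page content
PAGE = (
    "<html><head><title>ACME Srl</title><style>p {color: red}</style></head>"
    "<body><h1>Our products</h1><p>Add to cart and pay online.</p>"
    "<script>var x = 1;</script></body></html>"
)
text = extract_text(PAGE)
print(text)
```

No knowledge of the page layout is required, which is what distinguishes this from specific scraping, where a crawler would target known elements directly.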
10
Technologies for web scraping
Different solutions for web scraping will be investigated. For instance, in Istat experiments we have already tested:
• the Apache suite Nutch/Solr, for crawling, content extraction, indexing and searching results. Nutch is a highly extensible and scalable open source web crawler;
• HTTrack, a free and open source software tool that makes it possible to "mirror" a web site locally by downloading each page that composes its structure. In technical terms it is a web crawler and an offline browser;
• JSOUP, which makes it possible to parse an HTML document and extract its structure. It has been integrated in a specific step of the ADaMSoft system, which was selected because it already includes facilities for handling huge data sets and textual information.
11
Technologies for web scraping
These techniques will be evaluated by taking into account:
• efficiency: number of websites actually scraped out of the total, and execution performance;
• effectiveness: completeness and richness of the collected text, which can influence the quality level of prediction.

Solution | # websites reached   | Average number of webpages per site | Time spent | Type of storage                       | Storage dimensions
Nutch    | 7020 / 8550 = 82.1%  | 15.2                                | 32.5 hours | Binary files on HDFS                  | 2.3 GB (data), 5.6 GB (index)
HTTrack  | 7710 / 8550 = 90.2%  | 43.5                                | 6.7 days   | HTML files on file system             | 16.1 GB
JSOUP    | 7835 / 8550 = 91.6%  | 68                                  | 11 hours   | HTML ADaMSoft compressed binary files | 500 MB
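The efficiency figures in the table can be reproduced with simple arithmetic. The sketch below recomputes the share of websites reached and derives a rough pages-per-hour figure. That throughput is not reported in the presentation: it is an illustrative measure that assumes the average number of pages refers to reached sites only, and that 6.7 days means 6.7 × 24 hours.

```python
TOTAL_SITES = 8550

# (websites reached, average pages per site, hours spent), taken from the table
RUNS = {
    "Nutch": (7020, 15.2, 32.5),
    "HTTrack": (7710, 43.5, 6.7 * 24),  # 6.7 days converted to hours (assumption)
    "JSOUP": (7835, 68.0, 11.0),
}


def efficiency(reached, pages_per_site, hours, total=TOTAL_SITES):
    """Return (percentage of sites reached, approximate pages scraped per hour)."""
    reach_rate = round(100 * reached / total, 1)
    pages_per_hour = round(reached * pages_per_site / hours)
    return reach_rate, pages_per_hour


summary = {name: efficiency(*run) for name, run in RUNS.items()}
for name, (reach_rate, pages_per_hour) in summary.items():
    print(f"{name}: {reach_rate}% of sites reached, ~{pages_per_hour} pages/hour")
```

The reach rates match the table (82.1%, 90.2%, 91.6%). The derived throughput should be read with caution, since the three runs may not have used comparable hardware or crawl settings.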
12
Methodological aspects
The aim is:
• at micro level, to enrich the information in business registers;
• at aggregate level, to produce aggregate statistical output.
Both activities imply the definition of a methodological framework that makes it possible to predict characteristics of each enterprise and to infer the values of parameters in the population. This framework should also make it possible to assess the quality of both predictions and estimates. The joint use of instruments such as text mining and machine learning, on the one side, and statistical models, on the other, is typical of data science.
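For the quality of estimates, the tasks mention the Mean Square Error. As a reminder, the standard decomposition for an estimator of a population parameter is given below; the symbols are generic notation and are not taken from the presentation.

```latex
\mathrm{MSE}(\hat{\theta})
  = E\!\left[(\hat{\theta} - \theta)^2\right]
  = \mathrm{Var}(\hat{\theta}) + \left[\mathrm{Bias}(\hat{\theta})\right]^2,
\qquad
\mathrm{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta
```

Here $\theta$ is the population parameter and $\hat{\theta}$ its estimate. When estimates are built from predicted values, systematic prediction errors would plausibly show up in the bias term, so both components need to be assessed.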
13
Use cases
Some specific use cases of interest concern the estimation, from enterprises' websites, of the following information:
• whether or not an enterprise performs e-commerce
• whether an enterprise manages job vacancies on its site
• presence in social media (Facebook, Twitter, …)
• contact information: location, contacts, etc.
• profiling information: type of activity, links with other enterprises, etc.
The final determination should be made in coordination with Eurostat.
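Two of these use cases can be illustrated with a simple rule-based baseline. This is a minimal sketch and not the pilot's method, which relies on text mining and machine learning: it looks for links to social media domains and for e-commerce keywords in page text. The keyword list, the domain list and the names `SiteSignals` and `detect_signals` are assumptions chosen for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical cue lists; a real application would need a much richer, multilingual set.
SOCIAL_DOMAINS = {"facebook.com": "Facebook", "twitter.com": "Twitter"}
ECOMMERCE_CUES = re.compile(r"\b(add to cart|shopping cart|checkout)\b", re.IGNORECASE)


class SiteSignals(HTMLParser):
    """Collects social media links and page text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.social = set()
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = (dict(attrs).get("href") or "").lower()
            for domain, name in SOCIAL_DOMAINS.items():
                if domain in href:
                    self.social.add(name)

    def handle_data(self, data):
        self.text_parts.append(data)


def detect_signals(html: str) -> dict:
    """Return the social media platforms linked and whether an e-commerce cue occurs."""
    parser = SiteSignals()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text_parts)
    return {
        "social_media": sorted(parser.social),
        "ecommerce_cue": bool(ECOMMERCE_CUES.search(text)),
    }


# Hypothetical page content
PAGE = '<body><a href="https://www.facebook.com/acme">Follow us</a><p>Add to cart</p></body>'
signals = detect_signals(PAGE)
print(signals)
```

Signals of this kind could also serve as input features for a trained classifier, whose predictions would then be checked against survey data or manual inspection.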
14
Connection with Workpackage 1
15
Thank you for your attention