Sharing of previous experiences on scraping Istat’s experience

ESSnet on Big Data, WP2 “Webscraping / Enterprise Characteristics”, Kickoff meeting, Rome, March

We can distinguish two different kinds of web scraping:
- specific web scraping, when both the structure and the content of the websites to be scraped are perfectly known, and crawlers just have to replicate the behaviour of a human being visiting the website and collecting the information of interest. Typical area of application: data collection for consumer price indices (ONS, CBS, Istat);
- generic web scraping, when no a priori knowledge of the content is available, and the whole website is scraped and subsequently processed in order to infer the information of interest: this is the case of the “ICT usage in enterprises” pilot.

So far, Istat has had a number of web scraping experiences, involving:
- prices for a set of goods and services;
- agritourism farms’ portals;
- enterprises’ websites in the Enterprises ICT survey.
In the first two cases, specific scraping was employed, while generic scraping was used for the enterprises’ websites.

Web scraping data acquisition for the Harmonized Index of Consumer Prices (European project “Multipurpose price statistics”)
It is currently performed on two groups of products:
- consumer electronics;
- airfares.
These prices are collected by simulating online purchases. The applications make use of iMacros. Next: use of open source software such as “rvest” (an R package for web scraping).

Use of web scraping for agritourism data collection
In Italy there are about 20,000 agritourism farms (AF). An annual survey is carried out to collect information on them. The same information, and more, can be obtained by directly accessing and scraping their websites. Instead of accessing each single website, a limited set of «hubs» would be scraped, each one containing information on many AFs. In this case, «specific» scraping would be used. A relevant problem here would be the treatment of duplicated and inconsistent information, as the same AF can be present in more than one hub. A critical step is «record linkage», as the information collected from different hubs must be correctly attributed to a given AF.
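The record linkage step can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each hub yields records with hypothetical `hub`, `name`, `municipality` and `rooms` fields, and it links records that share the same normalized name and municipality. This exact-key approach is only one possible choice; real hub data would probably need fuzzy matching as well.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return " ".join(text.split())


def link_records(records):
    """Group records from different hubs that share the same linkage key."""
    groups = defaultdict(list)
    for rec in records:
        key = (normalize(rec["name"]), normalize(rec["municipality"]))
        groups[key].append(rec)
    return dict(groups)


# Hypothetical records, as they might be scraped from two different hubs.
records = [
    {"hub": "hub_a", "name": "Agriturismo La Quercia", "municipality": "Siena", "rooms": 6},
    {"hub": "hub_b", "name": "AGRITURISMO  LA QUERCIA", "municipality": "siena", "rooms": 8},
    {"hub": "hub_b", "name": "Podere Il Sole", "municipality": "Orvieto", "rooms": 4},
]

linked = link_records(records)
for key, recs in linked.items():
    # Flag linked records whose hubs disagree on an attribute.
    rooms = {r["rooms"] for r in recs}
    status = "incoherent" if len(rooms) > 1 else "ok"
    print(key, [r["hub"] for r in recs], status)
```

Once duplicates are grouped under one key, conflicting attribute values (here the number of rooms) become visible and can be resolved by whatever rule is adopted.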

ICT usage in enterprises
The web questionnaire is used to collect information on the characteristics of the websites owned or used by the enterprises. In a first phase, the aim of the experiment was to predict the values of questions B8a to B8g using machine learning techniques applied to texts (text mining) scraped from the websites. Particular effort was dedicated to question B8a (“Web sales facilities” or “e-commerce”).


ICT usage: web scraping
Different solutions for web scraping will be investigated. For instance, in Istat experiments we have already tested:
- the Apache suite Nutch/Solr, for crawling, content extraction, indexing and searching results; Nutch is a highly extensible and scalable open source web crawler;
- HTTrack, a free and open source software tool that makes it possible to “mirror” a website locally by downloading each page that composes its structure. In technical terms it is a web crawler and an offline browser;
- JSOUP, which makes it possible to parse an HTML document and extract its structure. It has been integrated in a specific step of the ADaMSoft system, which was selected because it already includes facilities for handling huge data sets and textual information.
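As an illustration of the parsing step, the sketch below extracts the visible text from an HTML page. It is not the JSOUP/ADaMSoft pipeline described above: it uses Python's standard `html.parser` module as a stand-in, and the sample page and the `TextExtractor` and `extract_text` names are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page content, standing in for a scraped enterprise website.
page = """<html><head><title>Shop</title><style>p {color: red}</style></head>
<body><h1>Online store</h1><p>Add to <b>cart</b> and pay by credit card.</p>
<script>var x = 1;</script></body></html>"""

text = extract_text(page)
print(text)
```

The extracted text is what would then be passed on to the text mining step.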

ICT usage: web scraping
These techniques will be evaluated by taking into account:
- efficiency: number of websites actually scraped out of the total, and execution performance;
- effectiveness: completeness and richness of the collected text, which can influence the quality of prediction.

Solution | # websites reached | Average number of webpages per site | Time spent | Type of storage | Storage dimensions
Nutch | 7020 / 8550 = 82.1% | 15.2 | 32.5 hours | Binary files on HDFS | 2.3 GB (data), 5.6 GB (index)
HTTrack | 7710 / 8550 = 90.2% | 43.5 | 6.7 days | HTML files on file system | 16.1 GB
JSOUP | 7835 / 8550 = 91.6% | 68 | 11 hours | HTML ADaMSoft compressed binary files | 500 MB

Under testing: an ad hoc solution based on Jsoup and Jcrawler.
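The efficiency figures reported above can be recomputed with a short Python sketch. The reach rate follows directly from the reported counts. The pages-per-hour figure is only a rough derived indicator, under the assumption (not stated in the source) that the average number of webpages per site refers to the sites actually reached.

```python
TOTAL_SITES = 8550

# (websites reached, average webpages per site, elapsed hours), as reported above.
runs = {
    "Nutch": (7020, 15.2, 32.5),
    "HTTrack": (7710, 43.5, 6.7 * 24),  # 6.7 days expressed in hours
    "JSOUP": (7835, 68, 11),
}


def summarize(reached, pages_per_site, hours):
    """Return (reach rate in %, approximate pages collected per hour)."""
    reach_rate = 100 * reached / TOTAL_SITES
    pages = reached * pages_per_site  # assumption: average taken over reached sites
    return round(reach_rate, 1), round(pages / hours)


summary = {name: summarize(*values) for name, values in runs.items()}
for name, (rate, pages_per_hour) in summary.items():
    print(f"{name}: {rate}% of sites reached, about {pages_per_hour} pages/hour")
```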

ICT usage: text mining
Both the 2013 and 2014 rounds of the survey were used in the experiment. For all respondents declaring that they own a website, the website was scraped, and the collected texts were submitted to classical text mining procedures in order to build a terms/documents matrix. Different learners were applied in order to predict the values of target variables (for instance, “e-commerce (yes/no)”) on the basis of relevant terms identified in the websites. The relevance of the terms (and the consequent selection of 1,200 out of 50,000) was based on the importance of each term as measured in a correspondence analysis. 2013 data were used as the “train” dataset, while 2014 data were used as the “test” dataset. The performance of each learner was evaluated by means of the usual quality indicators:
- accuracy: rate of correctly classified cases out of the total;
- sensitivity: rate of correctly classified positive cases out of total positive cases;
- specificity: rate of correctly classified negative cases out of total negative cases.
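The three quality indicators can be computed from observed and predicted labels as in the following Python sketch. The label vectors are made-up toy data, not results from the pilot, and the `quality_indicators` function name is hypothetical.

```python
def quality_indicators(actual, predicted):
    """Compute accuracy, sensitivity and specificity for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # true positives
    tn = sum(1 for a, p in pairs if not a and not p)  # true negatives
    fp = sum(1 for a, p in pairs if not a and p)      # false positives
    fn = sum(1 for a, p in pairs if a and not p)      # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Toy example: 1 = e-commerce present, 0 = absent (hypothetical values).
actual = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0]
predicted = [1, 0, 0, 0, 1, 1, 0, 0, 0, 0]

indicators = quality_indicators(actual, predicted)
print(indicators)
```

When positives are rare, as with e-commerce, a learner can reach high accuracy and specificity while missing most positive cases, which is why sensitivity is reported alongside them.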

ICT usage: e-commerce prediction
Learner quality indicators for e-commerce. Table columns: Learner | Accuracy | Sensitivity | Specificity | Proportion of e-commerce (observed) | Proportion of e-commerce (predicted). Values are listed in the order reported for each learner:
GLM (Logistic): 0.69, 0.68, 0.19, 0.22
Random Forest: 0.79, 0.63, 0.83, 0.25
Neural Network: 0.70, 0.62, 0.72, 0.20
Boosting: 0.67, 0.66
Bagging: 0.82, 0.38, 0.92
Naïve Bayes: 0.75, 0.55, 0.21
LDA: 0.71, 0.65, 0.28
RPART (Tree): 0.95, 0.16


ICT usage: prediction of other variables

So far, the pilot has explored the possibility of replicating the information collected by the questionnaire by using the scraped content of the websites and applying the best predictor (reduction of respondent burden). A more relevant possibility is to combine survey data and Big Data in order to improve the quality of the estimates.

The aim is to adopt a full predictive approach with a combined use of data:
1. all the websites owned by the whole population of enterprises are identified and their content is collected by web scraping (= Big Data);
2. survey data (the “ground truth”) are combined with Big Data in order to establish relations (models) between the values of the target variables and the terms collected in the corresponding scraped websites;
3. the estimated models obtained in step 2 are applied to the whole set of texts obtained in step 1 in order to produce estimates of the target variables.
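The three steps can be sketched end to end in Python. This is only a toy illustration under stated assumptions: the site names, texts and survey answers are invented, and a small multinomial Naive Bayes classifier with Laplace smoothing stands in for whichever learner would actually be chosen.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def fit_naive_bayes(texts, labels):
    """Fit a multinomial Naive Bayes model with Laplace smoothing."""
    vocab = {term for text in texts for term in tokenize(text)}
    term_counts = {label: Counter() for label in set(labels)}
    for text, label in zip(texts, labels):
        term_counts[label].update(tokenize(text))
    priors = {label: labels.count(label) / len(labels) for label in term_counts}
    return vocab, term_counts, priors


def predict(model, text):
    """Return the label with the highest log-posterior score."""
    vocab, term_counts, priors = model
    scores = {}
    for label, counts in term_counts.items():
        total = sum(counts.values()) + len(vocab)
        score = math.log(priors[label])
        for term in tokenize(text):
            if term in vocab:  # ignore terms never seen in training
                score += math.log((counts[term] + 1) / total)
        scores[label] = score
    return max(scores, key=scores.get)


# Step 1: scraped texts for the whole population (hypothetical toy data).
scraped = {
    "site_a": "add to cart checkout payment",
    "site_b": "company history contact us",
    "site_c": "shop online cart secure payment",
    "site_d": "our history and mission contact",
    "site_e": "online shop checkout",
    "site_f": "contact our company",
}

# Step 2: survey answers (ground truth) exist only for the sampled enterprises.
survey = {"site_a": True, "site_b": False, "site_c": True, "site_d": False}
model = fit_naive_bayes([scraped[s] for s in survey], list(survey.values()))

# Step 3: apply the model to every scraped site and estimate the proportion.
predictions = {site: predict(model, text) for site, text in scraped.items()}
estimate = sum(predictions.values()) / len(predictions)
print(predictions, estimate)
```

The final proportion is computed over all scraped sites, not only over survey respondents, which is what distinguishes this approach from simply replicating the questionnaire.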

Thank you for your attention
