
1 Web Scraping for Collecting Price Data: Are We Doing It Right?
Riccardo Giannini, Federico Polidoro, Francesco Pugliese, Antonino Virgillito (Istat – Istituto Nazionale di Statistica)

2 Introduction
The collection of data from the Internet through the extraction of structured content from web pages is an established technique for statistical data collection:
- It replaces repetitive centralized tasks
- It opens the possibility of getting more data
Price data is particularly attractive: there are a lot of prices on the Internet! Web scraping of prices is a common practice in NSIs (National Statistical Institutes). But… are we doing it right?

3 Web Scraping Tools
Ad-hoc development: programs written in a general-purpose programming language (such as Java or Python). A low-level approach in which an IT developer writes a specific program for each web site being scraped, instructed to extract the data from the pages and to navigate through the whole site. This is the most general approach and can lead to better quality of the output data, because data cleaning and pre-processing can be incorporated.
Browser automation: tools that record sequences of actions on a browser and reproduce them indefinitely in automated mode (e.g. iMacros). They replace the tedious manual collection activity needed for extracting prices. The downside of these tools is that the recording activity has to be repeated for every page and, as such, they do not scale well over several web sites.
Point-and-click: tools capable of detecting the portions of a page containing data and of making these data accessible in a structured, tabular format (e.g. import.io). They can work in a completely unassisted way or can be instructed by users whenever needed. The success of the scraping operation is highly dependent on how the page is designed.

4 Web Scraping Tools
The three approaches trade off differently along three dimensions: difficulty of maintenance, dependence on IT, and flexibility. With all the tools, scrapers have to be trained for each different target web site!
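To make the ad-hoc approach concrete, here is a minimal sketch in Java using the jsoup library. The URL and CSS selectors are hypothetical placeholders: a real scraper has to be trained on the actual structure of each target site.

```java
// A minimal sketch of the ad-hoc approach, using the jsoup library.
// The URL and CSS selectors below are hypothetical placeholders.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AdHocPriceScraper {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the product listing page
        Document doc = Jsoup.connect("https://www.example-shop.test/electronics")
                .userAgent("IstatPriceScraper/1.0")
                .get();

        // Extract name and price from each product entry; the selectors
        // are assumptions about how the page is laid out
        for (Element item : doc.select("div.product")) {
            String name = item.select("span.product-name").text();
            String price = item.select("span.price").text();
            // Data cleaning and pre-processing could be applied here
            System.out.println(name + " -> " + price);
        }
    }
}
```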

5 Web Scraping Prices at Istat
17 types of products are currently collected through scraping in production:
- Consumer electronics: collection of prices from 4 different e-commerce web sites, including Amazon.
- Energy sector: collection of gas tariffs from the conditions published on the web site of one major Italian energy provider.
- Fiscal sector: current tax rates collected from the Ministry of Finance web site.
- Financial sector: cost of bank accounts, collected from a web portal that lets consumers compare offerings and costs of bank accounts.
- Transport sector: cost of tickets for trains and flights (experimental).
Scrapers are developed in iMacros, plus a wrapper in Java that automates launching the scraper and storing the collected data in an RDBMS.
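A minimal sketch of such a wrapper is shown below, assuming (hypothetically) that the iMacros scraper can be launched from the command line and writes a semicolon-separated CSV file. The executable name, file paths, JDBC URL, and table layout are all assumptions, not the actual Istat setup.

```java
// A sketch of a Java wrapper that launches a recorded scraper and loads the
// output into an RDBMS. All names and paths are hypothetical.
import java.math.BigDecimal;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class ScraperWrapper {
    public static void main(String[] args) throws Exception {
        // 1. Launch the recorded macro (command-line syntax is an assumption)
        new ProcessBuilder("imacros.exe", "-macro", "train_tickets.iim")
                .inheritIO().start().waitFor();

        // 2. Store the collected data in an RDBMS through JDBC
        try (Connection con = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/prices", "user", "pwd");
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO scraped_prices(product, price, collected_on) "
                     + "VALUES (?, ?, CURRENT_DATE)")) {
            for (String line : Files.readAllLines(Paths.get("output.csv"))) {
                String[] fields = line.split(";");
                ps.setString(1, fields[0]);
                ps.setBigDecimal(2, new BigDecimal(fields[1]));
                ps.executeUpdate();
            }
        }
    }
}
```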

6 Example of iMacros execution
(Train tickets) Training the scraper is a reverse-engineering activity: semantics must be inferred from style attributes. Scrapers are easier to implement and work better when sites are well designed.
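As a toy illustration of inferring semantics from a style attribute, the jsoup snippet below assumes (hypothetically) that the lowest fare is the only cell rendered in bold in the results table; the URL and selector are assumptions, not a real site's markup.

```java
// Semantics recovered from presentation: we assume bold marks the best offer.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class StyleBasedExtraction {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://www.example-rail.test/results").get();
        // Select the table cell whose inline style contains "bold"
        Element cheapest = doc.select("td[style*=bold]").first();
        if (cheapest != null) {
            System.out.println("Cheapest fare: " + cheapest.text());
        }
    }
}
```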

7 Maintenance
Ordinary: statistical users frequently change the input data.
Extraordinary: IT developers change the macro when web sites change. Occurrence: once a month on average (not so extraordinary). Some scrapers have never been touched since first implementation, but others change frequently (and sometimes temporarily!), e.g. for Amazon "Prime Day".

8 Problems
Sustainability: the more we develop scraping solutions, the more maintenance is required.
Scale: scraping for prices is substantially a replacement for manual collection activity. It is difficult to collect data at a large scale, and data must be selected manually before collection. Is it Big Data?...

9 Collecting Air Fares
Air ticket prices are highly volatile, and their dynamics are difficult to analyze when sampling prices on a single day per month. Web sites selling flight tickets (airlines, online travel portals) present characteristics in the structure of their pages that make the use of automation tools difficult.
We developed a custom program to collect air fares on two web sites (EasyJet and Edreams):
- Based on the Selenium tool for Java
- Designed to run without an operator: a "headless browser" executes the scraper automatically, without the need for a human in front of a computer
Pro: can collect a large number of prices on a daily basis.
Cons: high dependence on a Java programmer for maintenance of the (complex) code; only two web sites supported, since training is costly.
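Below is a minimal sketch of this kind of unattended collector, using Selenium WebDriver for Java with a headless Chrome browser. The URL and element locator are hypothetical; the real program must also drive the search forms and result pages of each supported site.

```java
// A sketch of a headless Selenium collector; URL and locator are assumptions.
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class AirFareCollector {
    public static void main(String[] args) {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new"); // no human in front of a computer
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get("https://www.example-airline.test/search?from=FCO&to=LHR");
            // The CSS locator is an assumption about the results page
            for (WebElement fare : driver.findElements(By.cssSelector(".fare-price"))) {
                System.out.println(fare.getText());
            }
        } finally {
            driver.quit(); // always release the headless browser
        }
    }
}
```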

10 An Idea for Future Work
Scrape the entire content of a web site in order to ideally extract all the prices contained in it, implementing a machine learning approach that exploits a simplified text mining technique for the universal detection of price data on a page.
This is simpler than a general text mining problem:
- Product information and prices are normally inserted into a structure (such as a table)
- Relevant data is easily recognizable from syntactic characteristics (e.g. the presence of the euro sign or the number format), as in the sketch below
Expected benefits: simplified training and maintenance; possible application to a large number of pages; real collection of Big Data!
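A minimal sketch of the syntactic detection idea in Java: scan the text of a page and keep the substrings matching a euro-amount pattern in the Italian number format. The regular expression and the sample text are assumptions, not the actual detection rules.

```java
// Detect candidate prices by their syntactic shape (euro sign + number format).
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PriceDetector {
    // Matches e.g. "€ 1.299,00", "19,90 €", "EUR 4,50"
    private static final Pattern PRICE = Pattern.compile(
            "(?:€|EUR)\\s*\\d{1,3}(?:\\.\\d{3})*(?:,\\d{2})?"
            + "|\\d{1,3}(?:\\.\\d{3})*(?:,\\d{2})?\\s*(?:€|EUR)");

    public static void main(String[] args) {
        String pageText = "Smartphone XY 64GB € 1.299,00 - spedizione 4,90 €";
        Matcher m = PRICE.matcher(pageText);
        while (m.find()) {
            System.out.println("Candidate price: " + m.group());
        }
    }
}
```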

11 Conclusions
Web scraping is a consolidated method for collecting price data. Significant improvements in efficiency have been achieved through tools and techniques that are now mature and familiar. The risk is that we are not able to reach the next level of scale and exploit the full potential of web data. Are we ready to try new approaches? This is a good topic for collaboration.

