Uses of web scraping for official statistics


Uses of web scraping for official statistics
ESTP course on Big Data Sources – Web, Social Media and Text Analytics, Day 1
Olav ten Bosch, Statistics Netherlands
THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION

Outline
- Introduction
- Web scraping and official statistics
- Examples at Statistics Netherlands
- How it works
- Challenges
- Legal
- Wrap up

Introduction (1)
- Web scraping is automatically retrieving (and processing) information from websites.
- Web scraping is as old as the internet itself.
- There would not be any search engine without web scraping.
- Web scrapers use the same internet techniques that browsers use to visit web sites.
- Web scrapers use the structure and contents of web pages to identify and scrape relevant information.
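To illustrate how a scraper can use the structure of a page to pick out relevant information, here is a minimal sketch using only the Python standard library. The page fragment, the `price` class name and the helper names are assumptions made for this example; they do not come from the presentation.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collect the text of every element whose class attribute contains 'price'."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if "price" in classes:
            self._in_price = True

    def handle_endtag(self, tag):
        # Simplification: assumes price elements contain no nested tags
        self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


# Hypothetical page fragment; a real scraper would download this over HTTP.
SAMPLE_HTML = """
<ul>
  <li><span class="name">T-shirt</span> <span class="price">9.95</span></li>
  <li><span class="name">Jeans</span> <span class="price">49.00</span></li>
</ul>
"""


def extract_prices(html):
    """Return all prices found in the HTML as floats."""
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    return [float(p) for p in parser.prices]


print(extract_prices(SAMPLE_HTML))  # [9.95, 49.0]
```

The scraper relies entirely on the page's markup (here a class name), which is why a change in site structure can silently break it.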

Introduction (2)
There are many different flavours of web scraping, varying from:
- harvesting very diverse information from multiple sites to very dedicated scrapers focussed on particular items on a single site
- fully automatic scrapers operating silently to programs that require user interaction (an assistant)
- one-time scraping for a research project to scraping for use in production
Web scraping has many names: crawlers, harvesting, spiders, bots, internet robots

Web scraping and official statistics
- Web sites contain detailed and frequently updated information that may be useful to official statistics.
- We can enrich the results of the traditional methods with automatically collected information.
- Web data may enable new types of indicators that are not feasible with traditional methods.
- Automated collection and analysis is much faster than performing these tasks manually.

[Diagram: sources for official statistics: surveys; administrative sources (tax, social security, municipalities/provinces, supermarkets, …); internet sources. Labels in the diagram: "Faster, better, more efficient", "New indicators", "Less!!!"]

Competitors?

Some use cases
Internet sources:
- Prices of airline tickets
- Real estate sites for housing statistics
- Internet vacancies for job statistics
- Social media sentiment for consumer confidence
- Trade in second-hand goods as economic indicators
- Use Wikipedia for improving the business register
- Internet prices for CPI

Ticket prices according to robot and manual collection from web sites

New Analysis: Volatility of flight prices of four destinations starting 116 days before departure.

Tracking housing market (2011)
- Identifying the source web sites and understanding the characteristics of the data they provide was a challenge.
- We scraped 5 sites, about 250,000 observations per week, for 2 years.
- Dutch property market statistics use the results of the research done with these robots.
- After the research phase, CBS obtained the data directly.

Volatility of the content of one of the housing websites. Positive bars are properties added, negative are properties deleted.
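The added and deleted counts in such a chart can be derived by comparing the listing identifiers seen in two consecutive scrapes. The sketch below assumes each property has a stable identifier on the site; the identifiers and the function name are made up for illustration.

```python
def listing_changes(previous_ids, current_ids):
    """Compare two scrapes and return (added, deleted) sets of listing ids."""
    previous, current = set(previous_ids), set(current_ids)
    added = current - previous      # seen now, not seen before
    deleted = previous - current    # seen before, gone now
    return added, deleted


# Hypothetical listing ids from two consecutive weekly scrapes
week1 = ["A1", "A2", "A3"]
week2 = ["A2", "A3", "B7", "B8"]

added, deleted = listing_changes(week1, week2)
# Positive bar for additions, negative bar for deletions
print(len(added), -len(deleted))  # 2 -1
```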

Bulk price collection for CPI (2012 onwards)
- Mainly clothing
- Software scrapes all prices and product data (id, name, description, category, colour, size, …)
- 2016: about 500,000 price observations daily from 10 sites
- Data from 3 sites used in production of the Dutch CPI
- Price collection process embedded in the organisation
- Plans to extend to more than 20 sites and other domains

How it works (1)
[Diagram: you use a browser, which sends requests over the internet and receives website code, images, style, data etc.; the browser presents the result to you as graphical markup.]

How it works (2)
[Diagram: a scraper (robot, crawler, spider) takes the place of the browser. It handles navigation, sends requests over the internet, receives website code, images, style, data etc., and delivers data to you.]

How it works (3)
Live demo of website communication
[Diagram: a scraper (robot, crawler, spider) handles navigation and sends requests over the internet, receiving website code, images, style, data etc.; monitoring the data stream shows lots of data.]
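The request and navigation loop in these diagrams can be sketched roughly as follows. This is an illustration under stated assumptions, not the scraper used at Statistics Netherlands: the markup patterns (`class="item"`, `class="next"`), the `example.test` URLs and the in-memory `FAKE_SITE` are all hypothetical, and regular expressions are used only to keep the example short.

```python
import re
import urllib.request
from urllib.parse import urljoin


def fetch_over_http(url):
    """Download a page the way a browser would: one HTTP request per URL."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


# Assumed markup: items in <li class="item">, pagination via <a class="next" href=...>
ITEM = re.compile(r'<li class="item">([^<]+)</li>')
NEXT_LINK = re.compile(r'<a[^>]*class="next"[^>]*href="([^"]+)"')


def crawl(start_url, fetch=fetch_over_http, max_pages=10):
    """Request a page, extract the data, navigate to the next page, repeat."""
    url, items, pages = start_url, [], 0
    while url and pages < max_pages:
        html = fetch(url)                                   # request
        items.extend(m.strip() for m in ITEM.findall(html))  # data
        match = NEXT_LINK.search(html)                      # navigation
        url = urljoin(url, match.group(1)) if match else None
        pages += 1
    return items


# In-memory stand-in for a website, so the sketch runs without network access
FAKE_SITE = {
    "http://example.test/list?page=1":
        '<li class="item">House A</li><a class="next" href="/list?page=2">next</a>',
    "http://example.test/list?page=2":
        '<li class="item">House B</li>',
}

print(crawl("http://example.test/list?page=1", fetch=FAKE_SITE.__getitem__))
# ['House A', 'House B']
```

Passing the fetch function as a parameter keeps the navigation logic testable offline; the `max_pages` limit guards against endless pagination.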

Challenges in web scraping (1)
Which data?
- It is not always easy to know which site to scrape
- It starts with detecting the most up-to-date and complete sources
- The internet is full of copies of data
- Which data is relevant, up to date, reliable?
- Sometimes choose between scraping the data owner or an aggregator site
- Can we get the data from the data owner directly, without scraping?
- Advice: explore the data flows among sites

Challenges in web scraping (2)
The internet is dynamic
- Each web site has a particular structure, which may change at any time
- The technology used on websites changes continuously (example: infinite scroll)
- Sometimes websites do not comply with standards
- Advice: try to build scrapers as robust as possible
Data is volatile
- Be aware of changing data patterns over time
- Advice: monitor data frequently

Example of volatile data Number of products per product category (right axis, red line) and the average price based on these products (left axis, blue line)
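Monitoring of this kind can be approximated by summarising each scrape per category and flagging sudden changes in the number of products. The sketch below uses made-up observations and an arbitrary 50% threshold; both are assumptions for illustration only.

```python
from collections import defaultdict
from statistics import mean


def summarise(observations):
    """Return {category: (number of products, average price)}."""
    prices = defaultdict(list)
    for category, price in observations:
        prices[category].append(price)
    return {c: (len(p), mean(p)) for c, p in prices.items()}


def flag_changes(previous, current, max_relative_change=0.5):
    """List categories whose product count moved by more than the threshold."""
    flagged = []
    for category, (old_count, _) in previous.items():
        new_count = current.get(category, (0, None))[0]
        if abs(new_count - old_count) / old_count > max_relative_change:
            flagged.append(category)
    return sorted(flagged)


# Hypothetical (category, price) observations from two daily scrapes
day1 = summarise([("shirts", 10.0), ("shirts", 20.0), ("shirts", 30.0),
                  ("shirts", 40.0), ("jeans", 50.0), ("jeans", 60.0)])
day2 = summarise([("shirts", 40.0), ("jeans", 50.0), ("jeans", 62.0)])

print(day1["shirts"], day2["shirts"])  # (4, 25.0) (1, 40.0)
print(flag_changes(day1, day2))        # ['shirts']
```

In this toy data the average shirt price jumps from 25 to 40 only because three products disappeared, which is the kind of pattern that frequent monitoring is meant to catch.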

Challenges in web scraping (3)
- Legal issues and informing web site owners
- When using internet data in production, how to organise your scraping process?
- Advice: Manage the organisation of the collection process and transform collected data into a standardized format.
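One possible reading of "a standardized format" is a single record layout that every site-specific scraper maps its output into. The sketch below is an assumption about what that could look like: the site names, raw field names and the `PriceObservation` layout are hypothetical and not taken from the presentation.

```python
from dataclasses import dataclass, asdict
from datetime import date


@dataclass(frozen=True)
class PriceObservation:
    """Common record layout shared by all scrapers (illustrative only)."""
    site: str
    product_id: str
    name: str
    category: str
    price: float
    observed_on: str  # ISO 8601 date


# Hypothetical mapping from the common field names to each site's own names
FIELD_MAPS = {
    "shop_a": {"product_id": "sku", "name": "title",
               "category": "cat", "price": "price_eur"},
    "shop_b": {"product_id": "id", "name": "productName",
               "category": "group", "price": "amount"},
}


def standardise(site, raw, observed_on):
    """Convert one raw scraped record into the common layout."""
    fields = FIELD_MAPS[site]
    return PriceObservation(
        site=site,
        product_id=str(raw[fields["product_id"]]),
        name=raw[fields["name"]].strip(),
        category=raw[fields["category"]].strip().lower(),
        # Assumes a decimal comma or point and no thousands separators
        price=float(str(raw[fields["price"]]).replace(",", ".")),
        observed_on=observed_on.isoformat(),
    )


record = standardise(
    "shop_b",
    {"id": 17, "productName": " Jeans ", "group": "Trousers", "amount": "49,95"},
    date(2016, 3, 23),
)
print(asdict(record))
```

Keeping the site-specific knowledge in one mapping table means downstream processing only ever sees one format, whichever site the data came from.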

Legal (1)
- Legislation for scraping may be country specific
- The points below are inspired by the Dutch situation
- From the National Statistics Law: enterprises have to provide data to the NSI on request. Scraping may reduce response burden
- Database legislation: commercial re-use of scraped intellectual property from internet sources is forbidden
- NSIs usually use data for official statistics only

Legal (2)
Privacy:
- Legislation on protection of personal information
- At this moment we only scrape public sources and process data within the NSI's safe environment, just as with other (privacy-sensitive) data internally
Netiquette (practical):
- Respect the Robots Exclusion Protocol, also known as robots.txt
- Identify yourself (user-agent)
- Do not overload servers: use some idle time between requests, run crawlers at night / morning
- Inform website owners if feasible
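The netiquette points can be sketched with Python's standard `urllib.robotparser` module. The user-agent string, the robots.txt content and the URLs below are hypothetical; a real crawler would load the site's own file with `set_url()` and `read()` and send the same user-agent in its HTTP requests.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical bot name; identify yourself with something site owners can trace
USER_AGENT = "example-nsi-statistics-bot"

# Hypothetical robots.txt content, parsed from a string so the sketch runs offline
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a rule set."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def polite_urls(urls, rules, default_delay=1.0, sleep=time.sleep):
    """Yield only permitted URLs, pausing between consecutive requests."""
    delay = rules.crawl_delay(USER_AGENT) or default_delay
    first = True
    for url in urls:
        if not rules.can_fetch(USER_AGENT, url):
            continue  # respect the Robots Exclusion Protocol
        if not first:
            sleep(delay)  # idle time so the server is not overloaded
        first = False
        yield url


rules = build_rules(ROBOTS_TXT)
candidates = [
    "http://example.test/products",
    "http://example.test/private/x",
    "http://example.test/prices",
]
waits = []
allowed = list(polite_urls(candidates, rules, sleep=waits.append))
print(allowed)  # the /private/ URL is skipped
print(waits)    # [2]
```

The `sleep` parameter is injected only so the example can run instantly; in production the default `time.sleep` would apply the delay for real.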

Wrap up
- We have explained the basics of web scraping, retrieving information from the internet
- Web scraping has many different flavours, we have seen a few
- Web scraping is useful for official statistics in different ways, not only as primary source
- There are still challenges in scraping, data processing and also legal issues

Thanks for listening! Any comment or question?