Web scraping tools, a real life application


ESTP course on Big Data Sources – Web, Social Media and Text Analytics, Day 1
Guido van den Heuvel, Dick Windmeijer, Olav ten Bosch, Statistics Netherlands
THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION

Aim of this afternoon
- Build a web scraper for a web site of your choice with the CBS Robot Framework
- Learn about web technology (HTML, CSS, XPath)
- Learn about the Robot Framework
- Introduce some useful tools for inspecting web sites
- Hands-on experience with configuring and running the Robot Framework

Overview
- Introducing the Robot Framework
- Data extraction
- Coffee break
- Site navigation

The CBS Robot Framework
- Used for automated site navigation and data extraction
- Rule-based configuration
  - Does not require programming
  - But: allows programming for advanced use
- Uses a full-blown browser (PhantomJS)
  - Works with rendered pages, not page source
  - Includes a JavaScript engine
- Generates CSV data files and extensive logs

Framework config
- Format: JSON (actually, a Node.js JavaScript module)
- Different sections:
  - startUrls
  - extractionRules
  - navigationRules
  - (and some others, which are for advanced use)
- Show a real-life config file as an example

JSON quick reference
  name:value    assign value to named property
  "string"      character string
  number        number
  { }           object (set of properties)
  [ ]           array of values
See also: http://www.json.org/
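The building blocks in the quick reference can be tried out with any JSON parser. The following is a minimal sketch using Python's standard `json` module; the property names and values are invented for illustration and are not taken from a framework config.

```python
import json

# A small JSON document that uses every construct from the quick reference.
# All names and values are made up for this example.
text = """
{
  "name": "example",
  "count": 3,
  "price": 9.95,
  "tags": ["a", "b"],
  "nested": {"url": "http://www.example.com/"}
}
"""

data = json.loads(text)

# Objects become dicts, arrays become lists, strings and numbers map directly.
print(type(data).__name__)            # dict
print(type(data["tags"]).__name__)    # list
print(data["nested"]["url"])          # http://www.example.com/
```

Note that strict JSON requires property names in double quotes. The unquoted names on the slides are presumably possible because the config file is a JavaScript module rather than pure JSON.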

StartUrls
- One or more start URLs
- Each start URL is a separate object
  - Must have a unique name
  - Must contain url property
  - May contain extractionContext and/or navigationContext properties
- Show example from the Ikea.nl config

StartUrls quick reference
  startUrls: {
    startVariable: "site",
    <any_site_name>: {
      url: "http://... ",
      extractionContext: "overview",
      navigationContext: "menu"
    },
    ...
  }
See also: Framework user manual, section 2.3
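The rules for start URLs (each one a named object with a mandatory url property) can be checked mechanically before running a robot. The sketch below is not part of the Robot Framework: `check_start_urls` is a hypothetical helper, and the site names `shop` and `broken` are invented. It works on a Python dict that mirrors the shape shown in the quick reference.

```python
def check_start_urls(start_urls):
    """Return a list of problems found in a startUrls-like dict (hypothetical helper)."""
    problems = []
    for name, entry in start_urls.items():
        if not isinstance(entry, dict):
            # Plain settings such as startVariable are not start URL objects.
            continue
        url = entry.get("url")
        if not isinstance(url, str) or not url:
            problems.append(f"{name}: missing 'url' property")
    return problems


# Example data in the shape of the quick reference; all names are assumptions.
config = {
    "startVariable": "site",
    "shop": {
        "url": "http://www.example.com/products",
        "extractionContext": "overview",
        "navigationContext": "menu",
    },
    "broken": {
        "extractionContext": "overview",
    },
}

for problem in check_start_urls(config):
    print(problem)
```

Uniqueness of names is not checked here, because a Python dict (like a JavaScript object) can hold each property name only once.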

Running the Framework
- Config directory: RobotConfig\ESTP
- The following commands are available:
  - newrobot <robotname>   initialises a new, empty framework config
  - runrobot <robotname>   runs a robot
- Output directories:
  - RobotOutput\ESTP\<robotname>\data
  - RobotOutput\ESTP\<robotname>\log

Exercise 1: "Hello, world"
a) Initialise a config file and run it. Inspect the output generated.
b) Choose a site to scrape, and choose one page of this site to extract data from.
c) Add the URL from b) as the start URL to your config file and run again. Once more, inspect the output. What has changed since the previous run?

Items and properties
- Item: some item of interest on a web page
  - Example, web shop: products sold
  - Example, news site: articles published
- Property: one piece of information about an item
  - Examples, web shop: name, description, brand, price
  - Examples, news site: title, body text, author, date
- Example: Ikea.nl overview page

HTML syntax
- Tags
  - Important tags: <a>, <p>, <h?>, <div>, <span>, <ul> / <li>, <table> / <tr> / <td>, <body>, <html>
- Text content
- Attributes
  - id
  - class
- Show participants an example of the HTML code of a web page using Firebug. Use Ikea as the example website of choice throughout the presentation.

HTML Tags quick reference
  <a>             Hyperlink
  <p>             Paragraph
  <h?>            Heading; "?" is a single digit between 1 and 6
  <div>           Section; rectangular block of content
  <span>          Inline run of text within a line
  <ul> / <li>     Unordered list / list item
  <table>         Table
  <tr> / <td>     Table row / table cell
  <body>          Document body: visible part of the page
  <html>          The entire HTML document
See also: http://www.w3schools.com/tags/
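To see tags and their id and class attributes without a browser tool, a page can be walked with Python's standard `html.parser` module. This is only an illustrative sketch: the `TagLister` class and the sample page are invented, and the framework itself works on pages rendered by a browser rather than on source text like this.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Collect (tag, id, class) for every opening tag encountered."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        self.tags.append((tag, attributes.get("id"), attributes.get("class")))


# Invented sample page using some of the tags from the quick reference.
PAGE = (
    '<html><body>'
    '<h1 id="title">Chairs</h1>'
    '<ul class="menu"><li><a href="/sofas">Sofas</a></li></ul>'
    '<p class="price">10</p>'
    '</body></html>'
)

lister = TagLister()
lister.feed(PAGE)

for tag, id_value, class_value in lister.tags:
    print(tag, id_value, class_value)
```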

CSS selectors
- Originally used in "Cascading Style Sheets" to denote which tags have specific layout
  - In conjunction with HTML class attribute
- Layout often has semantic meaning
  - E.g., product names, prices, ... have specific layouts
  - Class name often reflects this meaning
- Used in scrapers to select specific parts of web pages
- Show an example of CSS and the use of class attributes on an example Ikea.nl web page
- Show an example of a CSS selector in Firepath to select all the items on a product overview page on Ikea.nl

CSS Selectors quick reference
  tag                    select tags with indicated tag name
  #id                    select tag with the indicated id
  .class                 select tags with indicated class
  [attr=value]           select tags for which attribute equals value
  tag.class              select tags with indicated tag name and class
  selector1 selector2    select tags obeying selector2 within tags obeying selector1
  selector1>selector2    as previous, but children only
  selector1,selector2    select tags obeying selector1 or selector2
See also: http://www.w3schools.com/cssref/css_selectors.asp
Again, illustrate by means of the Ikea.nl example
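Python's standard library has no CSS selector engine, so the sketch below implements a deliberately tiny one to show what a few entries of the quick reference mean: tag, #id, .class, tag.class and the descendant combinator (selector1 selector2). The functions `matches` and `select` and the sample page are invented for illustration; attribute selectors, ">" and "," are not handled, and the input must be well-formed XML.

```python
import re
import xml.etree.ElementTree as ET

# One simple selector: optional tag, optional #id, zero or more .class parts.
SIMPLE = re.compile(r"(?P<tag>[\w-]+)?(?:#(?P<id>[\w-]+))?(?P<classes>(?:\.[\w-]+)*)")


def matches(element, simple):
    """Test a single element against a simple selector such as 'div.product'."""
    m = SIMPLE.fullmatch(simple)
    if not simple or m is None:
        raise ValueError(f"unsupported selector: {simple!r}")
    if m.group("tag") and element.tag != m.group("tag"):
        return False
    if m.group("id") and element.get("id") != m.group("id"):
        return False
    present = element.get("class", "").split()
    wanted = m.group("classes").split(".")[1:]
    return all(name in present for name in wanted)


def select(root, selector):
    """Return elements matching simple selectors joined by descendant combinators."""
    parts = selector.split()
    current = [el for el in root.iter() if matches(el, parts[0])]
    for part in parts[1:]:
        found = []
        for scope in current:
            for el in scope.iter():
                # scope.iter() also yields scope itself; a descendant must differ.
                if el is not scope and matches(el, part) and el not in found:
                    found.append(el)
        current = found
    return current


# Invented sample page.
PAGE = """
<html><body>
  <div id="products">
    <div class="product featured">
      <span class="name">Chair</span><span class="price">10</span>
    </div>
    <div class="product">
      <span class="name">Table</span><span class="price">25</span>
    </div>
  </div>
  <p class="price">Prices include VAT</p>
</body></html>
""".strip()

root = ET.fromstring(PAGE)
names = [el.text for el in select(root, "div.product span.name")]
print(names)
print(len(select(root, ".price")), len(select(root, "span.price")))
```

In practice the same selectors would be tested in a browser tool, as suggested on the slides; the point here is only that `.price` matches any tag with that class, while `span.price` and `div.product span.name` narrow the selection.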

extractionRules
- First select items from which to extract data
- Then select, for each item, elements to extract
- Selection by means of CSS selectors
- extractionContext links start URLs and extraction rules
  - Use the extraction rules with the same name as the extraction context
- Discuss the example extraction rules from Ikea.nl

extractionRules quick reference
  extractionRules: {
    <extraction_context_name>: {
      cssSelector: "<item selector>",
      <column_name>: {
        cssSelector: "<property selector>",
        operation: "getXmlValue"
      }
    }
  }
See also: Framework user manual, section 2.7
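The two-step idea behind extraction rules (select the items first, then select each property within an item, one output column per property) can be sketched outside the framework. The code below is an assumption-laden illustration, not the framework's implementation: it uses the path expressions of Python's `xml.etree.ElementTree` in place of CSS selectors, and `RULES`, `extract`, `to_csv` and the sample page are invented.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Invented sample page; the second product deliberately has no price.
PAGE = """
<html><body>
  <div class="product">
    <span class="name">Chair</span><span class="price">10</span>
  </div>
  <div class="product">
    <span class="name">Table</span>
  </div>
</body></html>
""".strip()

# Hypothetical rules: one item selector plus one selector per output column.
# Note that [@class='x'] only matches when the class attribute is exactly 'x'.
RULES = {
    "item": ".//div[@class='product']",
    "columns": {
        "name": ".//span[@class='name']",
        "price": ".//span[@class='price']",
    },
}


def extract(page, rules):
    """Return one dict per item, with one key per configured column."""
    root = ET.fromstring(page)
    rows = []
    for item in root.findall(rules["item"]):
        row = {}
        for column, path in rules["columns"].items():
            element = item.find(path)  # searched within the item only
            row[column] = "" if element is None else "".join(element.itertext()).strip()
        rows.append(row)
    return rows


def to_csv(rows, columns):
    """Render the extracted rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract(PAGE, RULES)
print(to_csv(rows, list(RULES["columns"])))
```

Searching for properties relative to each item, rather than over the whole page, is what keeps a name and a price from different products out of the same row.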

Exercise 2: Items of Interest
a) Identify the items on your chosen web page that you want to extract data from.
b) Compose a CSS selector to select these items. Test with Firebug & Firepath.
c) Add an extraction context to the config and include this CSS selector as item selector.
d) Run the robot with this config.
e) Inspect the output: what has changed since the previous run?

Exercise 3: Gathering Data
a) Identify a single property from the items selected in exercise 2 that you want to extract.
b) Compose a CSS selector for this property.
c) Include this property in the config.
d) Run the config and inspect the output.
e) Repeat a) to d) with other properties of interest.

Site navigation overview
- Menus
  - Top / side menu: often hyperlinks
  - Pulldown / mouseover menu: combination of CSS and JavaScript
  - Multi-level menus
- Next page button
  - Often implemented in JavaScript: AJAX
- Filters, facets
  - Almost always implemented in JavaScript, sometimes client-side

XPath selectors
- XPath: language to select tags in [X/HT]ML code
- Similar to CSS selectors, but much more powerful
- Syntax somewhat comparable to directory names
  - HTML can be seen as a hierarchy, just like a file system
  - Example: html/body/div/h1/a

XPath syntax overview
  /tag           find tags as children of the current tag
  //tag          find tags as descendants of current tag
  [n]            select the nth tag of the indicated type
  [condition]    select tags which obey the given condition
  @attribute     select the indicated attribute of the current tag
  text()         select the text contents of the current tag
  =, !=          comparison operators: equal to / not equal to
  id('<id>')     select the tag with the indicated id
See:
http://www.w3schools.com/xsl/xpath_syntax.asp
http://www.w3schools.com/xsl/xpath_operators.asp

XPath examples
- //ul[@class='nav2']//a[text()='Politics']
  Select all hyperlinks with link text "Politics" inside a <ul> tag with class "nav2"
- //div[contains(@class, 'next')]
  Select all <div> tags for which the class attribute contains the substring "next"
- (id('main-menu')//ul/li)[3]
  First, select all <li> tags which are children of <ul> tags inside a tag with id "main-menu", then select the 3rd of these.
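The three examples can be approximated with Python's standard `xml.etree.ElementTree`, which implements only a subset of XPath: it has no `text()`, `contains()` or `id()`, and no parenthesised indexing. The sketch below therefore substitutes the closest available construct for each and says so in the comments; the sample page is invented and must be well-formed XML. A full XPath engine, such as the one in a browser, accepts the expressions as written on the slide.

```python
import xml.etree.ElementTree as ET

# Invented sample page.
PAGE = """
<html><body>
  <div id="main-menu">
    <ul class="nav2">
      <li><a href="/news">News</a></li>
      <li><a href="/politics">Politics</a></li>
      <li><a href="/sport">Sport</a></li>
    </ul>
  </div>
  <div class="pager next-page"><a href="/page/2">Next</a></div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# //ul[@class='nav2']//a[text()='Politics']
# ElementTree has no text(); [.='Politics'] (Python 3.7+) compares the
# element's complete text content instead.
politics = root.findall(".//ul[@class='nav2']//a[.='Politics']")

# //div[contains(@class, 'next')]
# contains() is unavailable, so the substring test is done in Python.
next_divs = [div for div in root.iter("div") if "next" in div.get("class", "")]

# (id('main-menu')//ul/li)[3]
# id() and the outer parentheses are unavailable: select all matching <li>
# tags first, then take the third one (XPath counts from 1, Python from 0).
items = root.findall(".//*[@id='main-menu']//ul/li")
third = items[2]

print(politics[0].get("href"))
print(next_divs[0].get("class"))
print(third.find("a").text)
```

The second example shows why "contains" is a substring test: the class value "pager next-page" matches even though "next" is not a class name of its own.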

Exercise 4: One small step
a) Find the link (probably in a menu) you followed to the web page you used in ex. 1-3. This link should be on a different page on the same site.
b) Compose an XPath selector to select this link.
c) Add a navigation rule with this XPath selector to the config and run it. What other parts of the config do you need to change for this test?
d) Inspect the output.

Exercise 5: A giant leap
a) Find some other pages on the site you chose for which you would like to extract data. Do they have the same structure as the one from ex. 1-3?
b) Find out how to navigate to these pages.
c) Add extra navigation rules to your config to visit these pages.
d) If necessary, add extra extraction contexts / rules.
e) Run the config after each change and inspect the output.