Web scraping tools, a real life application
ESTP course on Big Data Sources – Web, Social Media and Text Analytics, Day 1
Guido van den Heuvel, Dick Windmeijer, Olav ten Bosch, Statistics Netherlands
THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION
Aim of this afternoon
- Build a web scraper for a web site of your choice with the CBS Robot Framework
- Learn about web technology (HTML, CSS, XPath)
- Learn about the Robot Framework
- Introduce some useful tools for inspecting web sites
- Hands-on experience with configuring and running the Robot Framework
Overview
- Introducing the Robot Framework
- Data extraction
- Coffee break
- Site navigation
The CBS Robot Framework
- Used for automated site navigation and data extraction
- Rule-based configuration
  - Does not require programming
  - But: allows programming for advanced use
- Uses a full-blown browser (PhantomJS)
  - Works with rendered pages, not page source
  - Includes a JavaScript engine
- Generates CSV data files and extensive logs
Framework config
- Format: JSON (actually, a Node.js JavaScript module)
- Different sections:
  - startUrls
  - extractionRules
  - navigationRules
  - (and some others, which are for advanced use)
Show a real-life config file as an example
JSON quick reference
  name:value    assign value to named property
  "string"      character string
  number        number
  { }           object (set of properties)
  [ ]           array of values
See also: http://www.json.org/
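As a quick illustration of the notation above, the following Python sketch parses a small JSON text with the standard `json` module and shows what each construct turns into. The property names (`site`, `maxItems`, `pages`, `options`) are made up for this example and are not part of the framework.

```python
import json

# A made-up JSON text that uses each construct from the quick reference.
text = """
{
  "site": "example",
  "maxItems": 25,
  "pages": ["overview", "detail"],
  "options": { "followLinks": true }
}
"""

config = json.loads(text)

print(type(config).__name__)               # an object becomes a dict
print(config["site"])                      # a string becomes a str
print(config["maxItems"] + 1)              # a number becomes an int or float
print(config["pages"][0])                  # an array becomes a list
print(config["options"]["followLinks"])    # nested object; true becomes True
```

Strict JSON requires property names in double quotes. The quick references on these slides leave names unquoted, which is presumably possible because the config is a JavaScript module rather than pure JSON.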
StartUrls
- One or more start URLs
- Each start URL is a separate object
  - Must have a unique name
  - Must contain url property
  - May contain extractionContext and/or navigationContext properties
Show example from the Ikea.nl config
StartUrls quick reference
startUrls: {
    startVariable: "site",
    <any_site_name>: {
        url: "http://... ",
        extractionContext: "overview",
        navigationContext: "menu"
    },
    ...
}
see also: Framework user manual, section 2.3
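The rules for start URLs (each one a named object that must contain a `url` property) can be sketched as a small check. This is only an illustration in Python, not part of the framework: the site names, the URLs and the `check_start_urls` helper are hypothetical, and the real config is a JavaScript module.

```python
# Hypothetical startUrls section, mirrored as a Python dict.
start_urls = {
    "startVariable": "site",
    "example_shop": {
        "url": "http://www.example.com/products",
        "extractionContext": "overview",
        "navigationContext": "menu",
    },
    "example_news": {
        "url": "http://www.example.com/news",
    },
}

def check_start_urls(section):
    """Return a list of problems found in a startUrls-like dict."""
    problems = []
    for name, entry in section.items():
        if name == "startVariable":
            continue  # a setting, not a start URL itself
        if not isinstance(entry, dict):
            problems.append(f"{name}: expected an object")
        elif "url" not in entry:
            problems.append(f"{name}: missing url")
    return problems

print(check_start_urls(start_urls))                       # no problems found
print(check_start_urls({"broken": {"extractionContext": "overview"}}))
```

Unique names come for free here, because dictionary keys (like object property names) cannot repeat.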
Running the Framework
- Config directory: RobotConfig\ESTP
- The following commands are available:
    newrobot <robotname>    initialises a new, empty framework config
    runrobot <robotname>    runs a robot
- Output directories:
    RobotOutput\ESTP\<robotname>\data
    RobotOutput\ESTP\<robotname>\log
Exercise 1: "Hello, world"
a) Initialise a config file and run it. Inspect the output generated.
b) Choose a site to scrape, and choose one page of this site to extract data from.
c) Add the URL from b) as the start URL to your config file and run again.
d) Once more, inspect the output. What has changed since the previous run?
Items and properties
- Item: some item of interest on a web page
  - Example, web shop: products sold
  - Example, news site: articles published
- Property: one piece of information about an item
  - Examples, web shop: name, description, brand, price
  - Examples, news site: title, body text, author, date
Example: Ikea.nl overview page
HTML syntax
- Tags
  - Important tags: <a>, <p>, <h?>, <div>, <span>, <ul> / <li>, <table> / <tr> / <td>, <body>, <html>
- Text content
- Attributes
  - id
  - class
Show participants an example of the HTML code of a web page using Firebug. Use Ikea as the example website of choice throughout the presentation.
HTML Tags quick reference
  <a>              Hyperlink
  <p>              Paragraph
  <h?>             Header. “?” is a single digit between 1 and 6
  <div>            Section; rectangular block of content
  <span>           Line of text
  <ul> / <li>      Unordered list / List item
  <table>          Table
  <tr> / <td>      Table row / Table cell
  <body>           Document body: visible part of the page
  <html>           The entire HTML document
See also: http://www.w3schools.com/tags/
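To see tags, attributes and text content side by side, the sketch below walks a small, made-up HTML fragment with Python's standard `html.parser` module. The page content and the `TagLister` class are invented for illustration. The framework itself works on pages rendered in a browser, so this only shows the structure of the markup.

```python
from html.parser import HTMLParser

class TagLister(HTMLParser):
    """Record every opening tag with its attributes, and every text node."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))

    def handle_data(self, data):
        if data.strip():                     # skip whitespace between tags
            self.texts.append(data.strip())

# Made-up page: a product list inside a <div> with an id.
page = """
<html><body>
  <div id="products">
    <h1>Chairs</h1>
    <ul>
      <li class="product"><a href="/chair-1">Chair one</a> <span class="price">10</span></li>
    </ul>
  </div>
</body></html>
"""

lister = TagLister()
lister.feed(page)
for tag, attrs in lister.tags:
    print(tag, attrs)        # e.g. the <div> is listed with its id attribute
print(lister.texts)          # the text content found between the tags
```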
CSS selectors
- Originally used in “Cascading Style Sheets” to denote which tags have specific layout
  - In conjunction with HTML class attribute
- Layout often has semantic meaning
  - E.g., product names, prices, … have specific layouts
  - Class name often reflects this meaning
- Used in scrapers to select specific parts of web pages
Show an example of CSS and the use of class attributes on an example Ikea.nl web page. Show an example of a CSS selector in Firepath to select all the items on a product overview page on Ikea.nl.
CSS Selectors quick reference
  tag                   Select tags with indicated tag name
  #id                   Select tag with the indicated id
  .class                Select tags with indicated class
  [attr=value]          Select tags for which attribute equals value
  tag.class             Select tags with indicated tag name and class
  selector1 selector2   Select tags obeying selector2 within tags obeying selector1
  selector1>selector2   As previous, but children only
  selector1,selector2   Select tags obeying selector1 or selector2
See also: http://www.w3schools.com/cssref/css_selectors.asp
Again, illustrate by means of the Ikea.nl example
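To make the selector forms above concrete, here is a toy matcher in Python. It is not a real CSS engine and not what the framework uses: it only handles `tag`, `#id`, `.class`, `tag.class`, the descendant form `selector1 selector2` and the alternative form `selector1,selector2`, and it leaves out `>` and `[attr=value]`. The page content and the `matches` and `select` helpers are invented for this sketch.

```python
import xml.etree.ElementTree as ET

def matches(elem, selector):
    """Test one simple selector: tag, #id, .class or tag.class."""
    if selector.startswith("#"):
        return elem.get("id") == selector[1:]
    tag, _, cls = selector.partition(".")
    if tag and elem.tag != tag:
        return False
    if cls and cls not in elem.get("class", "").split():
        return False
    return True

def select(root, selector):
    """Handle 'a', 'a b' (descendant) and 'a,b' (either) forms."""
    results = []
    for alternative in selector.split(","):
        scopes = None                        # None: search the whole document
        for part in alternative.split():
            if scopes is None:
                candidates = list(root.iter())
            else:
                candidates = [d for s in scopes for d in s.iter() if d is not s]
            scopes = []
            for elem in candidates:
                if matches(elem, part) and elem not in scopes:
                    scopes.append(elem)
        for elem in scopes or []:
            if elem not in results:
                results.append(elem)
    return results

# Made-up, well-formed page fragment.
page = ET.fromstring("""
<div id="products">
  <h1 class="title">Chairs</h1>
  <ul>
    <li class="product sale"><span class="name">Chair one</span><span class="price">10</span></li>
    <li class="product"><span class="name">Chair two</span><span class="price">25</span></li>
  </ul>
</div>
""")

print(len(select(page, "li")))                               # both list items
print(len(select(page, "li.sale")))                          # only the item on sale
print([e.text for e in select(page, "#products .name")])     # names inside the div
print([e.text for e in select(page, "h1,.price")])           # header or prices
```

In practice, selectors are composed and tested in the browser, as the exercises do with Firebug and Firepath.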
extractionRules
- First select items from which to extract data
- Then select, for each item, elements to extract
- Selection by means of CSS selectors
- extractionContext links start URLs and extraction rules
  - Use the extraction rules with the same name as the extraction context
Discuss the example extraction rules from Ikea.nl
extractionRules quick reference
extractionRules: {
    <extraction_context_name>: {
        cssSelector: "<item selector>",
        <column_name>: {
            cssSelector: "<property selector>",
            operation: "getXmlValue"
        }
    }
}
see also: Framework user manual, section 2.7
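The two-step idea behind extraction rules (first select the items, then select each property within an item) can be sketched in Python. This is an illustration under stated assumptions, not the framework's implementation: it uses ElementTree path expressions where the framework uses CSS selectors, and the `itemPath` and `columns` keys, the page content and the `extract` helper are all hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up, well-formed overview page.
page = ET.fromstring("""
<div id="products">
  <ul>
    <li class="product"><span class="name">Chair one</span><span class="price">10</span></li>
    <li class="product"><span class="name">Chair two</span><span class="price">25</span></li>
  </ul>
</div>
""")

# Hypothetical rules: one item path, then one path per output column.
rules = {
    "overview": {
        "itemPath": ".//li[@class='product']",
        "columns": {
            "name": "span[@class='name']",
            "price": "span[@class='price']",
        },
    }
}

def extract(root, context_rules):
    """Return one dict per item, with one entry per configured column."""
    rows = []
    for item in root.findall(context_rules["itemPath"]):          # step 1: items
        row = {}
        for column, path in context_rules["columns"].items():     # step 2: properties
            node = item.find(path)
            row[column] = node.text if node is not None else ""
        rows.append(row)
    return rows

rows = extract(page, rules["overview"])

# Write the rows as CSV text, one line per item.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(rules["overview"]["columns"]))
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Each property path is evaluated relative to its item, which is what keeps the name and price of one product together on the same output row.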
Exercise 2: Items of Interest
a) Identify the items on your chosen web page that you want to extract data from.
b) Compose a CSS selector to select these items. Test with Firebug & Firepath.
c) Add an extraction context to the config and include this CSS selector as item selector.
d) Run the robot with this config.
e) Inspect the output: what has changed since the previous run?
Exercise 3: Gathering Data
a) Identify a single property from the items selected in exercise 2 that you want to extract.
b) Compose a CSS selector for this property.
c) Include this property in the config.
d) Run the config and inspect the output.
e) Repeat a) to d) with other properties of interest.
Site navigation overview
- Menus
  - Top / side menu: often hyperlinks
  - Pulldown / mouseover menu: combination of CSS and JavaScript
  - Multi-level menus
- Next page button
  - Often implemented in JavaScript: AJAX
- Filters, facets
  - Almost always implemented in JavaScript, sometimes client-side
XPath selectors
- XPath: language to select tags in [X/HT]ML code
- Similar to CSS selectors, but much more powerful
- Syntax somewhat comparable to directory names
  - HTML can be seen as a hierarchy, just like a file system
- Example: html/body/div/h1/a
XPath syntax overview
  /tag           find tags as children of the current tag
  //tag          find tags as descendants of the current tag
  [n]            select the nth tag of the indicated type
  [condition]    select tags which obey the given condition
  @attribute     select the indicated attribute of the current tag
  text()         select the text contents of the current tag
  =, !=          comparison operators: equal to / not equal to
  id('<id>')     select the tag with the indicated id
See:
http://www.w3schools.com/xsl/xpath_syntax.asp
http://www.w3schools.com/xsl/xpath_operators.asp
XPath examples
- //ul[@class='nav2']//a[text()='Politics']
  Selects all hyperlinks with link text "Politics" inside a <ul> tag with class "nav2"
- //div[contains(@class, 'next')]
  Selects all <div> tags for which the class attribute contains the word "next"
- (id('main-menu')//ul/li)[3]
  First selects all <li> tags which are children of <ul> tags inside a tag with id "main-menu", then selects the 3rd of these
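The three examples above can be tried out on a small, made-up page. The sketch below uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath, so two of the expressions are approximated in plain Python as noted in the comments. It assumes Python 3.7 or later for the `[.='text']` predicate. The page content is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed page with a menu and a "next page" block.
page = ET.fromstring("""
<html><body>
  <div id="main-menu">
    <ul class="nav2">
      <li><a href="/home">Home</a></li>
      <li><a href="/politics">Politics</a></li>
      <li><a href="/sport">Sport</a></li>
    </ul>
  </div>
  <div class="pager next"><a href="/page2">Next</a></div>
</body></html>
""")

# //ul[@class='nav2']//a[text()='Politics']
# ElementTree writes the text test as [.='Politics'].
links = page.findall(".//ul[@class='nav2']//a[.='Politics']")
print([a.get("href") for a in links])

# //div[contains(@class, 'next')]
# contains() is not available here, so the substring test is done in Python.
next_divs = [d for d in page.iter("div") if "next" in d.get("class", "")]
print(len(next_divs))

# (id('main-menu')//ul/li)[3]
# id() and indexing a grouped result are not available either:
# select by the id attribute, then take the third element of the list.
items = page.findall(".//*[@id='main-menu']//ul/li")
third = items[2]
print(third.find("a").text)
```

A full XPath engine, such as the one in a browser or in Firepath, accepts the original expressions unchanged.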
Exercise 4: One small step
a) Find the link (probably in a menu) you followed to the web page you used in ex. 1-3. This link should be on a different page on the same site.
b) Compose an XPath selector to select this link.
c) Add a navigation rule with this XPath selector to the config and run it. What other parts of the config do you need to change for this test?
d) Inspect the output.
Exercise 5: A giant leap
a) Find some other pages on the site you chose for which you would like to extract data. Do they have the same structure as the one from ex. 1-3?
b) Find out how to navigate to these pages.
c) Add extra navigation rules to your config to visit these pages.
d) If necessary, add extra extraction contexts / rules.
e) Run the config after each change and inspect the output.