SESUG Web Scraping in SAS: A Macro-Based Approach

Slides:



Advertisements
Similar presentations
Introduction to Web Design Lecture number:. Todays Aim: Introduction to Web-designing and how its done. Modelling websites in HTML.
Advertisements

PART IV - EMBED VIDEO, AUDIO, AND DOCUMENTS. Find a video on Youtube.com: Search for a video, then look for the Embed code. Copy this code into the HTML/JavaScript.
Copyright © 2008 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 9 Using Perl for CGI Programming.
KompoZer. This is what KompoZer will look like with a blank document open. As you can see, there are a lot of icons for beginning users. But don't be.
JavaScript FaaDoOEngineers.com FaaDoOEngineers.com.
The Librarian Web Page Carol Wolf CS396X. Create new controller  To create a new controller that can manage more than just books, type ruby script/generate.
CSCI 3100 Tutorial 6 Web Development Tools 1 Cuiyun GAO 1.
Html: getting started HTML is hyper text markup language. It is what web browsers look at on the Internet. HTML documents should be created in a simple.
Outline Proc Report Tricks Kelley Weston. Outline Examples 1.Text that spans columnsText that spans columns 2.Patient-level detail in the titlesPatient-level.
Introduction to HTML Lists, Images, and Links. Before We Begin Save another file in Notepad Save another file in Notepad Open your HTML, then do File>Save.
World Wide Web1 Applications World Wide Web. 2 Introduction What is hypertext model? Use of hypertext in World Wide Web (WWW) – HTML. WWW client-server.
Multiple Tiers in Action
Computer Science 103 Chapter 2 HyperText Markup Language (HTML)
CGI Programming: Part 1. What is CGI? CGI = Common Gateway Interface Provides a standardized way for web browsers to: –Call programs on a server. –Pass.
Welcome to SAS…Session..!. What is SAS..! A Complete programming language with report formatting with statistical and mathematical capabilities.
JAVASCRIPT HOW TO PROGRAM -2 DR. JOHN P. ABRAHAM UTPA.
8 Chapter Eight Server-side Scripts. 8 Chapter Objectives Create dynamic Web pages that retrieve and display database data using Active Server Pages Process.
Advanced Web 2012 Lecture 4 Sean Costain PHP Sean Costain 2012 What is PHP? PHP is a widely-used general-purpose scripting language that is especially.
SAS SQL SAS Seminar Series
1 Outline 3.1 Introduction 3.2 Editing HTML 3.3 First HTML Example 3.4 W3C HTML Validation Service 3.5 Headers 3.6 Linking 3.7 Images 3.8 Special Characters.
Server-side Scripting Powering the webs favourite services.
4-1 INTERNET DATABASE CONNECTOR Colorado Technical University IT420 Tim Peterson.
Stacking Rich Text Format (RTF) - %SRiT Duong Tran – Independent Contractor, London, UK Stacking Rich Text Format (RTF) - %SRiT Duong Tran – Independent.
Lesson 6 Links. Creating Folders  For every web site/page, you need to create a separate folder  The computer cannot find links if they are not stored.
2 1 Sending Data Using a Hyperlink CGI/Perl Programming By Diane Zak.
© Copyright by Deitel & Associates, Inc. and Pearson Education Inc. All Rights Reserved. 1 Tutorial 30 – Bookstore Application: Client Tier Introducing.
Images in HTML PowerPoint How images are used in HTML.
Introduction to HTML. What is a HTML File?  HTML stands for Hyper Text Markup Language  An HTML file is a text file containing small markup tags  The.
Chapter 8 Cookies And Security JavaScript, Third Edition.
Multiple Uses for a Simple SQL Procedure Rebecca Larsen University of South Florida.
INTRODUCTION. What is HTML? HTML is a language for describing web pages. HTML stands for Hyper Text Markup Language HTML is not a programming language,
SAS Efficiency Techniques and Methods By Kelley Weston Sr. Statistical Programmer Quintiles.
Creating Dynamic Web Pages Using PHP and MySQL CS 320.
1 John Magee 9 November 2012 CS120 Lecture 17: The World Wide Web and HTML Web Publishing.
Define your Own SAS® Command Line Commands Duong Tran – Independent Contractor, London, UK Define your Own SAS® Command Line Commands Duong Tran – Independent.
Chapter 4 Applets Cop Why Applets? WWW makes huge information available to anyone with web browser. Web server send web pages and images to your.
1 WWW. 2 World Wide Web Major application protocol used on the Internet Simple interface Two concepts –Point –Click.
Javadoc Summary. Javadoc comments Delemented by /** and */ Used to document – Classes – Methods – Fields Must be placed immediately above the feature.
27.1 Chapter 27 WWW and HTTP Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Introduction to HTML. _______________________________________________________________________________________________________________ 2 Outline Key issues.
8 Chapter Eight Server-side Scripts. 8 Chapter Objectives Create dynamic Web pages that retrieve and display database data using Active Server Pages Process.
Lesson 6 Links. Creating Folders  For every web site/page, you need to create a separate folder  The computer cannot find links if they are not stored.
Copyright © 2004, SAS Institute Inc. All rights reserved. SASHELP Datasets A real life example Barb Crowther SAS Consultant October 22, 2004.
Compare and Contrast : Blackboard & a Personal Web Page www3.ltu.edu/~s_schneider/howto/faculty.htm You’ll find this presentation (and another) here :
1 More About HTML Images and Links. 22 Objectives You will be able to Include images in your HTML page. Create links to other pages on your HTML page.
General Architecture of Retrieval Systems 1Adrienn Skrop.
Technical SEO tips for Web Developers Richa Bhatia Singsys Pte. Ltd.
111 State Management Beginning ASP.NET in C# and VB Chapter 4 Pages
Using the Macro Facility to Create HTML, XML and RTF output Rick Langston, SAS Institute Inc.
Data Visualization with Tableau
Session 1 Retrieving Data From a Single Table
Web Systems & Technologies
Introduction to HTML.
Computing with C# and the .NET Framework
By Sasikumar Palanisamy
Introduction to Dynamic Web Programming
Images in HTML PowerPoint How images are used in HTML
Jonathan W. Duggins; James Blum NC State University; UNC Wilmington
CISC103 Web Development Basics: Web site:
Essentials of Web Pages
Chapter 27 WWW and HTTP.
Basic HTML and Embed Codes
3 Iterative Processing.
Introduction to JavaScript
Hunter Glanz & Josh Horstman
Introduction to Programming and JavaScript
Web Client Side Technologies Raneem Qaddoura
PHP-II.
Introduction to JavaScript
Images in HTML PowerPoint How images are used in HTML
Presentation transcript:

SESUG 236-2018 Web Scraping in SAS: A Macro-Based Approach Jonathan W. Duggins; James Blum NC State University; UNC Wilmington

Agenda Introduction Data Source Single-Page Scraping Multi-Page Scraping Conclusion

Introduction Motivation Scenario Common request from students and academic consulting clients Scenario Popular web comic, xkcd.com, chosen as case study

Data Source: Planning Ahead Robots.txt Where is the data? Stored directly in a file? Embedded in HTML code on a single page Embedded in HTML code across multiple pages

Single-Page Scraping Direct access not very common Embedded HTML code  parsing Access source code Identify tags, line numbers, etc. Separate HTML code from data Two main approaches: DATA step vs. SQL

Single-Page: Access Source Code Temporarily store source code in logical memory PROC HTTP! FILENAME source TEMP; PROC HTTP URL = "https://xkcd.com/archive/" OUT = source METHOD = "GET"; RUN; Web address Fileref Request sent to URL

Single-Page: Identify Data <head> <link rel="stylesheet" type="text/css" href="/s/b0dcca.css" title="Default"/> <title>xkcd - A webcomic of romance, sarcasm, math, and language - By Randall Munroe</title> <meta http-equiv="X-UA-Compatible" content="IE=edge"/> <link rel="shortcut icon" href="/s/919f27.ico" type="image/x-icon"/> <link rel="icon" href="/s/919f27.ico" type="image/x-icon"/> <link rel="alternate" type="application/atom+xml" title="Atom 1.0" href="/atom.xml"/> <link rel="alternate" type="application/rss+xml" title="RSS 2.0" href="/rss.xml"/> <script type="text/javascript" src="/s/b66ed7.js" async></script> <script type="text/javascript" src="/s/1b9456.js" async></script> </head> <h1>Comics:</h1><br/> (Hover mouse over title to view publication date)<br /><br /> <a href="/2050/" title="2018-9-24">6/6 Time</a><br/> <a href="/2049/" title="2018-9-21">Unfulfilling Toys</a><br/> ~8000 lines total ~25% contain data I want to scrape Data I want is in these lines

Single-Page: Parsing Source Code <a href="/2050/" title="2018-9-24">6/6 Time</a><br/> DATA work.xkcdArchive(DROP = line); INFILE source LENGTH = recLen LRECL = 32767; INPUT line $VARYING32767. recLen; RUN; Read in source code from temporary fileref LENGTH= required for $VARYING informat

Single-Page: Parsing Source Code <a href="/2050/" title="2018-9-24">6/6 Time</a><br/> DATA work.xkcdArchive(DROP = line); INFILE source LENGTH = recLen LRECL = 32767; INPUT line $VARYING32767. recLen; FORMAT num 4. date DATE9.; LENGTH title $ 50; RUN; Data I’m parsing includes: Comic number, Date published, and Comic Title

Single-Page: Parsing Source Code <a href="/2050/" title="2018-9-24">6/6 Time</a><br/> DATA work.xkcdArchive(DROP = line); INFILE source LENGTH = recLen LRECL = 32767; INPUT line $VARYING32767. recLen; FORMAT num 4. date DATE9.; LENGTH title $ 50; IF FIND(line, '<a href=') GT 0 AND FIND(line, 'title=') GT 0 RUN; Identify lines like those shown above: Anchor tags with HREF= and TITLE= sufficient here

Single-Page: Parsing Source Code <a href="/2050/" title="2018-9-24">6/6 Time</a><br/> DATA work.xkcdArchive(DROP = line); INFILE source LENGTH = recLen LRECL = 32767; INPUT line $VARYING32767. recLen; FORMAT num 4. date DATE9.; LENGTH title $ 50; IF FIND(line, '<a href=') GT 0 AND FIND(line, 'title=') GT 0 THEN DO; num = INPUT(scan(strip(line),2,'/'),4.); date = INPUT(scan(strip(line),4,'"'),YYMMDD10.); title = SCAN(strip(line),2,'<>'); OUTPUT; END; RUN; Extract data

Single-Page: Results

Multi-Page Scraping Necessitated by large and/or changing data Must determine how pages are defined Including URL Parameters www.example.com/?page=2 or www.example.com/?Year=2012&?Month=5 Updating Web Server Path www.example.com/page2 or www.example.com/2012/5 Macro for looping

Multi-Page Scraping: Macro Variables xkcd.com uses web server path based on comic number Get list and count of comic numbers PROC SQL NOPRINT; SELECT DISTINCT num, count(DISTINCT num) INTO :indexList separated BY ' ', :indexCount FROM work.xkcdArchive ; QUIT; Save as macro variables Previously scraped single page

Multi-Page Scraping: Macro Setup %macro Scrape(start=1, stop=&indexCount, out=, append=, sleep=5); %if %sysfunc(compress(%upcase(%substr(&append,1,1)))) = N %then %do; proc datasets library = work nolist; delete all; run; quit; %end; %else %put QC_ERROR: Append status must be Yes or No. &=append; %do i = &start %to &stop; %let index = %scan(&indexList, &i); filename source temp; proc http url="https://xkcd.com/&index/" out=source method="GET"; data work.xkcdRawTemp; infile source length=len lrecl=32767; input line $varying32767. len; LineNum = _n_; Appending? Begin loop Access source for a single page

Multi-Page Scraping: SQL proc sql noprint; create table work.xkcdParsedTemp as select &index as comicNum, a.*, b.*, c.*, d.*, e.* from (select htmldecode(scan(line,2,'<>')) as title from work.xkcdRawTemp where line contains 'div id="ctitle"') as a, (select distinct htmldecode(scan(line,4,'"')) as Hover, htmldecode(scan(line,6,'"')) as AltHover where line contains '<img src=' and line contains 'title') as b, (select htmldecode(strip(scan(line,6,' <'))) as PermLink where line contains 'Permanent link to this comic:') as c, (select htmldecode(strip(scan(line,5,' '))) as EmbedLink where line contains 'Image URL (for hotlinking/embedding)') as d, (select LineNum, htmldecode(line) as test from work.xkcdrawTemp where (LineNum ge (select min(LineNum) where line contains '<div id="transcript"') and LineNum le (select min(LineNum) where line contains '</div>' and LineNum ge (select min(LineNum) where line contains '<div id="transcript"')))) as e; quit; Subqueries and inline views used to build records

Multi-Page Scraping: Data Cleaning data work.xkcdClean; set work.xkcdParsedTemp; if find(test,'[[') gt 0 then do; startsq = find(test,'[[')+2; test = strip(compress(substr(test,startsq),'[]')); end; if find(test,'{{') gt 0 or find(upcase(test), 'ALT TEXT') gt 0 then do; startcu = find(test,'{{')+1; test = strip(compress(substr(test,startcu),'{}')); if find(test,'<div') gt 0 then test = strip(scan(test,2,'>')); test = strip(tranwrd(tranwrd(tranwrd(tranwrd(test,']]',''),'}}',''),'</div>',''),'</div','')); run; proc append base = work.all data = work.xkcdClean; data _null_; call sleep(&sleep,1); %end; proc transpose data = work.all out = work.&out (drop = _name_) prefix = Line; by comicnum title hover altHover permLink EmbedLink; id linenum; var test; %mend; Data cleaning Append and wait Rotate

Single-Page: Results

Conclusion Beware potential pitfalls DATA step vs SQL Inconsistent HTML code, website updates Efficiency DATA step vs SQL Versatility of PROC HTTP