MIS 324 -- Professor Sandvig 12/31/2018
Screen Scraping
MIS 424 Professor Sandvig
Today
- What is screen scraping (also called web scraping)
- When to use it
- How
- Legal issues
What is Screen Scraping
- Programmatically “scraping” information from a web page
- Two steps:
  - Retrieve the page
  - Scrape the desired information (regular expressions)
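The two steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the course handout: the function names `retrieve_page` and `scrape_links`, the link pattern, and the sample HTML are all assumptions made for the example. Step 2 is run on an inline sample so the sketch works without network access.

```python
import re
import urllib.request


def retrieve_page(url: str) -> str:
    """Step 1: download the page and return its HTML as text."""
    with urllib.request.urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Step 2: a regular expression that captures each link's href and text.
# Assumes double-quoted href attributes; real pages may need a looser pattern.
LINK_PATTERN = re.compile(
    r'<a\s+[^>]*href="([^"]+)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def scrape_links(html: str) -> list[tuple[str, str]]:
    """Return (href, link text) pairs found in the HTML."""
    return [(href, text.strip()) for href, text in LINK_PATTERN.findall(html)]


# Hypothetical page content standing in for a retrieved page.
SAMPLE_HTML = (
    '<p><a href="/news/1">First story</a> and '
    '<a class="x" href="/news/2">Second story</a></p>'
)

links = scrape_links(SAMPLE_HTML)
for href, text in links:
    print(href, text)
```

Against a live site, the two steps would be chained as `scrape_links(retrieve_page(url))`. Regular expressions are brittle when a page's markup changes, so the pattern usually needs to be rechecked over time.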
When to Use
- Data not available via more direct methods:
  - APIs (designed to expose data)
  - Structured web services
  - RSS
  - Database
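For contrast, here is a rough sketch of why a direct method such as RSS is preferable when it exists: the data is already structured, so no pattern matching against page markup is needed. The feed content and the `read_feed` helper are hypothetical and only illustrate the idea.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed; a real feed would be downloaded from the site.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item><title>First story</title><link>https://example.com/1</link></item>
    <item><title>Second story</title><link>https://example.com/2</link></item>
  </channel>
</rss>"""


def read_feed(rss_text):
    """Return (title, link) pairs for every item in the feed."""
    root = ET.fromstring(rss_text)
    return [
        (item.findtext("title"), item.findtext("link"))
        for item in root.iter("item")
    ]


items = read_feed(SAMPLE_RSS)
for title, link in items:
    print(title, link)
```

Because the feed names its own fields, this code keeps working when the site's page layout changes, which is not true of a scraper.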
When to Use: Examples
- Search engines: Google, Bing, Yahoo, …
- News sites: Google news, Yahoo news, …
- PadMapper, MapCraigs: scrape Craigslist
- Interface with legacy systems: no support for web services, RSS, etc.
How
- Handout: ScreenScrape
- Example: scrape CBE Faculty/Staff Directory
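As a rough idea of what a directory scrape might look like, the sketch below pulls names and email addresses out of a table with a regular expression. It is not the ScreenScrape handout. The markup, the names, and the `scrape_directory` helper are all invented for illustration, and the real directory's HTML will differ.

```python
import re

# Hypothetical directory markup; the actual page structure is an assumption.
SAMPLE_DIRECTORY_HTML = """
<table>
<tr><td class="name">Ada Example</td>
    <td><a href="mailto:ada@example.edu">ada@example.edu</a></td></tr>
<tr><td class="name">Bob Sample</td>
    <td><a href="mailto:bob@example.edu">bob@example.edu</a></td></tr>
</table>
"""

# Capture the name cell, then the first mailto address that follows it.
ROW_PATTERN = re.compile(
    r'<td class="name">(?P<name>[^<]+)</td>.*?mailto:(?P<email>[^"]+)"',
    re.DOTALL,
)


def scrape_directory(html):
    """Return one dict per directory row matched in the HTML."""
    return [
        {"name": m.group("name").strip(), "email": m.group("email")}
        for m in ROW_PATTERN.finditer(html)
    ]


entries = scrape_directory(SAMPLE_DIRECTORY_HTML)
for entry in entries:
    print(entry["name"], entry["email"])
```

Named groups keep the pattern readable and make it easier to adjust when the page markup turns out to be different from what was assumed.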
Legal Issues
- Potential to violate copyright laws
- Many lawsuits:
  - LinkedIn sues 100 individuals for scraping user data (Oct. 2016)
  - Europe battles Google News over 'snippet tax' proposal
  - Belgian Newspapers Claim Retaliation By Google After Copyright Victory
Legal Issues
- MapCraigs.com
  - Scraped Craigslist real estate
  - Displayed on Google maps
  - Blocked IP
- PadMapper vs. Craigslist lawsuit
  - Paid Craigslist $1,000,000
- History: Is Web Scraping Legal?
- Use cautiously
Summary
- Screen scraping
  - Useful tool for collecting data from web pages
  - When API not available
- Many legal uses:
  - Search engines
  - Legacy systems
- Can violate copyrights