CSCE 590 Web Scraping – Scrapy III


CSCE 590 Web Scraping – Scrapy III
Topics: The Scrapy framework revisited
Readings: Scrapy user manual – https://scrapy.org/doc/ and https://doc.scrapy.org/en/1.3/
March 14, 2017

Scrapy Review

scrapy startproject tut3                          # create a new project
scrapy genspider postLoginForm "www.example.com"  # generate a spider from a template
scrapy crawl postLoginForm                        # run the spider by its name
scrapy shell "http://www.example.com"             # interactive shell; takes a URL or file, not a spider name

Scrapy Tutorial quotes_spider.py

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)

https://doc.scrapy.org/en/1.3/

quotes in http://quotes.toscrape.com

<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>

$ scrapy shell 'http://quotes.toscrape.com'
>>> quote = response.css("div.quote")[0]
>>> title = quote.css("span.text::text").extract_first()
>>> title
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> author = quote.css("small.author::text").extract_first()
>>> author
'Albert Einstein'

https://doc.scrapy.org/en/1.3/

>>> tags = quote.css("div.tags a.tag::text").extract()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']

https://doc.scrapy.org/en/1.3/

>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").extract_first()
...     author = quote.css("small.author::text").extract_first()
...     tags = quote.css("div.tags a.tag::text").extract()
...     print(dict(text=text, author=author, tags=tags))
{'tags': ['change', 'deep-thoughts', 'thinking', 'world'], 'author': 'Albert Einstein', 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
{'tags': ['abilities', 'choices'], 'author': 'J.K. Rowling', 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'}
... a few more of these, omitted for brevity
>>>

https://doc.scrapy.org/en/1.3/

<ul class="pager"> <li class="next"> <a href="/page/2/"> Next <span aria-hidden="true"> → </span></a> </li> </ul>

AuthorSpider page 1

# -*- coding: utf-8 -*-
"""
Created on Tue Feb 28 12:07:25 2017
@author:

Scrapy Tutorial quotes example  author_spider
"""
import scrapy

class AuthorSpider(scrapy.Spider):
    name = 'author'
    start_urls = ['http://quotes.toscrape.com/']

AuthorSpider page 2

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author+a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_author)

        # follow pagination links
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

AuthorSpider page 3

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

scrapy shell <url | file>  – open the Scrapy shell on a live URL or a local file
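For example (the local filename here is made up for illustration):

$ scrapy shell 'http://quotes.toscrape.com'   # fetch and inspect a live page
$ scrapy shell ./saved_page.html              # load a previously saved local HTML file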

Response attributes and methods

body, copy(), css(), encoding, flags, headers, meta, replace(), request, selector, status, text, url, urljoin(), xpath()
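A quick look at a few of these in the shell against quotes.toscrape.com (a sketch; exact reprs can vary by Scrapy version):

>>> response.status
200
>>> response.url
'http://quotes.toscrape.com'
>>> response.urljoin('/page/2/')
'http://quotes.toscrape.com/page/2/'
>>> response.selector.xpath('//title/text()').extract_first()
'Quotes to Scrape'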

Running the spider

scrapy crawl quotes   # "quotes" is the spider's name attribute (set inside quotes_spider.py), not the filename

https://doc.scrapy.org/en/1.3/
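For spiders that yield items rather than writing files (like the AuthorSpider above), the tutorial also shows collecting output with the -o flag; the feed exporter picks the format from the file extension:

$ scrapy crawl author -o authors.json   # serialize yielded items as JSON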

scrapy genspider -l

Available templates: basic, crawl, csvfeed, xmlfeed

Basic spider template

# -*- coding: utf-8 -*-
import scrapy

class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["www.yahoo.com"]
    start_urls = ['http://www.yahoo.com/']

    def parse(self, response):
        pass

Crawl spider template

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# Note: naming the spider "crawl" makes genspider emit a class that shadows
# the imported CrawlSpider; renaming it (e.g., YahooCrawlSpider) is clearer.
class CrawlSpider(CrawlSpider):
    name = 'crawl'
    allowed_domains = ['www.yahoo.com']
    start_urls = ['http://www.yahoo.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

Crawl spider template continued

    def parse_item(self, response):
        i = {}
        i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        i['name'] = response.xpath('//div[@id="name"]').extract()
        i['description'] = response.xpath('//div[@id="description"]').extract()
        return i

Csv spider template

# -*- coding: utf-8 -*-
from scrapy.spiders import CSVFeedSpider

class CsvfeedSpider(CSVFeedSpider):
    name = 'csvfeed'
    allowed_domains = ['www.yahoo.com']
    start_urls = ['http://www.yahoo.com/feed.csv']
    # headers = ['id', 'name', 'description', 'image_link']
    # delimiter = '\t'

    # Do any adaptations you need here
    #def adapt_response(self, response):
    #    return response

Csv spider template continued

    def parse_row(self, response, row):
        i = {}
        #i['url'] = row['url']
        #i['name'] = row['name']
        #i['description'] = row['description']
        return i
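A minimal sketch of a filled-in parse_row, assuming the feed's header row really contains url, name, and description columns (those column names are hypothetical here):

    def parse_row(self, response, row):
        # row is a dict keyed by the CSV headers declared (or uncommented) above
        return {
            'url': row['url'],
            'name': row['name'],
            'description': row['description'],
        }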

Fragile projects – scrapers break when:
- the website changes its format
- the browser updates its version (a Selenium-only problem)

Project example from Stack Overflow
http://stackoverflow.com/questions/39243009/scrapy-tutorial-example
A fragile scraping project. Diagnosis:
- Run the spider
- Run it in the scrapy shell
- view(response) – to see what the browser would show
- Fix the XPath searches to adjust to the changes in the website
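A sketch of that diagnosis loop in the shell, assuming (as in this Stack Overflow case) the old selector stopped matching after the site redesign:

$ scrapy shell 'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/'
>>> view(response)                                     # open the fetched page in a browser
>>> response.xpath('//ul[@class="directory-url"]/li')  # old selector: no longer matches
[]
>>> response.xpath('//div[@class="title-and-desc"]')   # selector for the redesigned markup
[<Selector xpath='//div[@class="title-and-desc"]' data='<div class="title-and-desc">...'>, ...]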

Debugging Fragile scraper

Debugging Fragile scraper results

Debug xpath expressions with Scrapy Shell

Auxiliary lambda function to test XPaths

Testing XPaths can be a lot of typing:
response.xpath('//ul[@class="directory-url"]/li')
Instead, define
xp = lambda x: response.xpath(x).extract_first()
Then you just have to type the path to test:
xp('//ul[@class="directory-url"]/li')
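A companion helper for pulling every match instead of just the first (the name xpall is made up here):

xp = lambda x: response.xpath(x).extract_first()   # first match only
xpall = lambda x: response.xpath(x).extract()      # all matches, as a list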

Corrected version

import scrapy

class MozSpider(scrapy.Spider):
    name = "moz"
    allowed_domains = ["www.dmoz.org"]
    start_urls = ['http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
                  'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/']

    def parse(self, response):
        sites = response.xpath('//div[@class="title-and-desc"]')
        for site in sites:
            name = site.xpath('a/div[@class="site-title"]/text()').extract_first()
            url = site.xpath('a/@href').extract_first()
            description = site.xpath('div[@class="site-descr "]/text()').extract_first().strip()
            yield {'Name': name, 'URL': url, 'Description': description}

# Login then scrape

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class LoginSpider(CrawlSpider):
    name = 'login'
    allowed_domains = ['www.example.com']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

Login then scrape page 2

    def start_requests(self):
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests
        # for each of them, with another callback
        pass

    def parse_item(self, response):
        pass
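A sketch of what logged_in might do once the session cookie is set; the CSS selector is made up and depends entirely on the site's post-login markup:

    def logged_in(self, response):
        # hypothetical selector for links visible only after login
        for href in response.css('a.item-link::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_item)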

scrapy shell "www.yahoo.com/finance"

Finance.yahoo.com

Inspect Element (Firefox/Firebug)
Position the mouse over an element, then right-click and choose "Inspect Element".

27.3. pdb — The Python Debugger

Source code: Lib/pdb.py

The module pdb defines an interactive source code debugger for Python programs. It supports setting (conditional) breakpoints and single stepping at the source line level, inspection of stack frames, source code listing, and evaluation of arbitrary Python code in the context of any stack frame. It also supports post-mortem debugging and can be called under program control. The debugger is extensible – it is actually defined as the class Pdb. This is currently undocumented but easily understood by reading the source. The extension interface uses the modules bdb and cmd. The debugger's prompt is (Pdb).

Typical usage of pdb

>>> import pdb
>>> import mymodule
>>> pdb.run('mymodule.test()')
> <string>(0)?()
(Pdb) continue
> <string>(1)?()
NameError: 'spam'
(Pdb)
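One way to combine pdb with a spider (a sketch; pdb.set_trace() is standard library, but the spider itself is a hypothetical example):

import pdb
import scrapy

class DebugSpider(scrapy.Spider):
    name = "debug"   # made-up spider, for illustration only
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pdb.set_trace()   # pauses here; inspect `response` interactively at the (Pdb) prompt
        for quote in response.css("div.quote"):
            yield {'text': quote.css("span.text::text").extract_first()}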

XPath

>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').extract_first()
'Quotes to Scrape'

https://doc.scrapy.org/en/1.3/
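The quote markup shown earlier can be queried with XPath instead of CSS; a sketch against the same page:

>>> response.xpath('//div[@class="quote"]/span[@class="text"]/text()').extract_first()
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> response.xpath('//small[@class="author"]/text()').extract_first()
'Albert Einstein'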