590 Scraping – NER shape features
Topics: Scrapy – items.py
Readings: Scrapy documentation
April 4, 2017
Today
Scrapers from scrapy_documentation
loggingSpider.py
openAllLinks.py
Cleaning NLTK data
Removing common words
Testing in Python
unittest
Testing websites
Scrapy notes
Focused narrow scrape (one domain)
Broad scrapes – better suited to
Dealing with JavaScript in Scrapy
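Selenium and Scrapy

from scrapy.http import HtmlResponse
from selenium import webdriver


class JSMiddleware(object):
    def process_request(self, request, spider):
        # Render the page in a headless browser so its JavaScript runs,
        # then hand Scrapy the resulting HTML instead of downloading it.
        driver = webdriver.PhantomJS()
        try:
            driver.get(request.url)
            body = driver.page_source
            url = driver.current_url
        finally:
            # Shut the browser down so each request does not leak a process.
            driver.quit()
        return HtmlResponse(url, body=body, encoding='utf-8', request=request)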
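A downloader middleware such as JSMiddleware only runs once it is listed in the project settings. A minimal sketch of that entry follows; the module path conifers.middlewares and the order value 543 are assumptions for illustration, not taken from the slides.

```python
# settings.py (fragment)
# Assumption: JSMiddleware is defined in conifers/middlewares.py.
# The integer orders this middleware relative to the others: lower values
# sit closer to the engine, higher values closer to the downloader.
DOWNLOADER_MIDDLEWARES = {
    'conifers.middlewares.JSMiddleware': 543,
}
```

Because process_request returns a response, Scrapy skips the normal download for that request and passes the rendered page on to the spider.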
Cfg file
Populating the settings
Settings can be populated using different mechanisms, each with a different precedence. Here they are in decreasing order of precedence:
1. Command line options (highest precedence)
2. Settings per-spider
3. Project settings module
4. Default settings per-command
5. Default global settings (lowest precedence)
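The precedence order can be pictured as each source carrying a rank, with a value replaced only by a source of equal or higher rank. The sketch below is a toy model, not Scrapy's own implementation; the ToySettings class, the source names, and the numeric ranks are illustrative assumptions.

```python
# Toy model of settings precedence (not Scrapy's implementation).
# A larger number means higher precedence, following the order in the list.
PRIORITIES = {
    'default': 0,   # default global settings
    'command': 10,  # default settings per-command
    'project': 20,  # project settings module
    'spider': 30,   # settings per-spider
    'cmdline': 40,  # command line options
}


class ToySettings:
    def __init__(self):
        self._values = {}  # setting name -> (value, priority)

    def set(self, name, value, source):
        priority = PRIORITIES[source]
        current = self._values.get(name)
        # Keep the new value only if its source ranks at least as high.
        if current is None or priority >= current[1]:
            self._values[name] = (value, priority)

    def get(self, name):
        return self._values[name][0]


settings = ToySettings()
settings.set('DOWNLOAD_DELAY', 0, 'default')
settings.set('DOWNLOAD_DELAY', 3, 'project')
settings.set('DOWNLOAD_DELAY', 0.25, 'cmdline')
settings.set('DOWNLOAD_DELAY', 1, 'spider')  # ignored: cmdline outranks spider
print(settings.get('DOWNLOAD_DELAY'))  # prints 0.25
```

The order in which sources are applied does not matter here: the command line value survives even though the per-spider value is set after it.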
Command line settings
scrapy crawl myspider -s LOG_FILE=scrapy.log
DEPTH_LIMIT
Default: 0
Scope: scrapy.spidermiddlewares.depth.DepthMiddleware
The maximum depth that will be allowed to crawl for any site. If zero, no limit will be imposed.
DEPTH_PRIORITY
Default: 0
Scope: scrapy.spidermiddlewares.depth.DepthMiddleware
An integer that is used to adjust the request priority based on its depth:
- if zero (default), no priority adjustment is made from depth
- a positive value will decrease the priority, i.e. higher depth requests will be processed later; this is commonly used when doing breadth-first crawls (BFO)
- a negative value will increase priority, i.e. higher depth requests will be processed sooner (DFO)
See also: "Does Scrapy crawl in breadth-first or depth-first order?" about tuning Scrapy for BFO or DFO.
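The effect of the sign of DEPTH_PRIORITY can be seen in a small simulation. This is a simplified model under stated assumptions: every request starts at priority 0, the adjustment is priority minus depth times DEPTH_PRIORITY (which matches the sign behaviour described for this setting), and a plain heap stands in for the scheduler. The crawl_order function and the page names are hypothetical.

```python
import heapq


def crawl_order(requests, depth_priority):
    """Return request names in the order a priority queue would release them.

    requests: list of (name, depth) pairs, each starting at priority 0.
    """
    heap = []
    for seq, (name, depth) in enumerate(requests):
        # Assumed adjustment: priority drops by depth * DEPTH_PRIORITY.
        priority = 0 - depth * depth_priority
        # heapq pops the smallest tuple, so negate the priority to release
        # the highest priority first; seq keeps arrival order among ties.
        heapq.heappush(heap, (-priority, seq, name))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


pending = [('page-d2', 2), ('page-d0', 0), ('page-d1', 1)]
print(crawl_order(pending, 1))   # ['page-d0', 'page-d1', 'page-d2']
print(crawl_order(pending, -1))  # ['page-d2', 'page-d1', 'page-d0']
print(crawl_order(pending, 0))   # ['page-d2', 'page-d0', 'page-d1']
```

A positive value releases shallow pages first (breadth-first flavour), a negative value releases deep pages first (depth-first flavour), and zero leaves the arrival order untouched. In a real crawl the queue type used by the scheduler also influences the order, which this model ignores.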
DOWNLOAD_DELAY
Default: 0
The amount of time (in seconds) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard. Decimal numbers are supported. Example:
DOWNLOAD_DELAY = 0.25 # 250 ms of delay
This setting is also affected by the RANDOMIZE_DOWNLOAD_DELAY setting (which is enabled by default). By default, Scrapy doesn’t wait a fixed amount of time between requests, but uses a random interval between 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY.
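The randomized interval is easy to reproduce outside Scrapy. The sketch below only illustrates the arithmetic; next_delay is a hypothetical helper, not a Scrapy function.

```python
import random


def next_delay(download_delay, randomize=True):
    """Seconds to wait before the next request to the same site."""
    if randomize:
        # Uniform draw between 0.5x and 1.5x the configured delay.
        return random.uniform(0.5 * download_delay, 1.5 * download_delay)
    return download_delay


DOWNLOAD_DELAY = 0.25
samples = [next_delay(DOWNLOAD_DELAY) for _ in range(1000)]
# With a 0.25 s delay, every sample falls between 0.125 s and 0.375 s.
print(min(samples), max(samples))
```

With randomization switched off, the helper returns the configured delay unchanged, which corresponds to a fixed wait between requests.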
Conifers
Items.py

import scrapy


class ConifersItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    genus = scrapy.Field()
    species = scrapy.Field()
Middleware.py

from scrapy import signals


class ConifersSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None
    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass
    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
BOT_NAME = 'conifers'

SPIDER_MODULES = ['conifers.spiders']
NEWSPIDER_MODULE = 'conifers.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'conifers (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
Pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class ConifersPipeline(object):
    def process_item(self, item, spider):
        return item
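The generated pipeline passes items through unchanged. As a sketch of what a pipeline could do in this project: the conifersExtract spider in these slides fills each field with .extract(), which returns a list of strings, so every field arrives as a list. The hypothetical FlattenFieldsPipeline below (not part of the slides) joins each list into one trimmed string. It treats the item as a plain mapping, so a dict stands in for ConifersItem, and the sample values are made up.

```python
class FlattenFieldsPipeline(object):
    """Hypothetical pipeline: turn list-valued fields into clean strings."""

    def process_item(self, item, spider):
        for key in list(item.keys()):
            value = item[key]
            if isinstance(value, (list, tuple)):
                # Join the extracted text nodes and trim stray whitespace.
                item[key] = ' '.join(part.strip() for part in value).strip()
        return item


# A plain dict stands in for ConifersItem; the values are made-up sample data.
raw = {'name': [' Korean fir '], 'genus': ['Abies'], 'species': ['koreana']}
clean = FlattenFieldsPipeline().process_item(raw, spider=None)
print(clean)  # {'name': 'Korean fir', 'genus': 'Abies', 'species': 'koreana'}
```

To run in a project, such a class would need an entry in the ITEM_PIPELINES setting, for example under the assumed path conifers.pipelines.FlattenFieldsPipeline.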
coniferSpider

import scrapy

from conifers.items import ConifersItem


class ConiferSpider(scrapy.Spider):
    name = "conifer"
    allowed_domains = ["greatplantpicks.org"]
    start_urls = ['http://greatplantpicks.org/by_plant_type/conifer']

    def parse(self, response):
        # Save the raw page to disk so it can be inspected offline.
        #filename = response.url.split("/")[-2] + '.html'
        filename = 'conifers' + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
import scrapy

from conifers.items import ConifersItem
#from scrapy.selector import Selector
#from scrapy.http import HtmlResponse


class ConifersextractSpider(scrapy.Spider):
    name = "conifersExtract"
    allowed_domains = ["greatplantpicks.org"]
    start_urls = ['http://www.greatplantpicks.org/plantlists/by_plant_type/conifer']
    def parse(self, response):
        # Each table row describes one plant; build one item per row.
        for sel in response.xpath('//tbody/tr'):
            item = ConifersItem()
            item['name'] = sel.xpath('td[@class="common-name"]/a/text()').extract()
            item['genus'] = sel.xpath('td[@class="plantname"]/a/span[@class="genus"]/text()').extract()
            item['species'] = sel.xpath('td[@class="plantname"]/a/span[@class="species"]/text()').extract()
            yield item
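The row-by-row extraction can be tried without running a crawl. The sketch below swaps in the standard library's xml.etree.ElementTree for Scrapy's selectors, so it is a stand-in and not the spider's actual code path: ElementTree supports only a subset of XPath, needs well-formed markup, and has no text() step, so findtext() is used instead. The HTML fragment is hypothetical, modelled on the class names in the parse method.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment modelled on the class names used above.
SAMPLE = """
<table>
  <tbody>
    <tr>
      <td class="plantname"><a href="#"><span class="genus">Abies</span> <span class="species">koreana</span></a></td>
      <td class="common-name"><a href="#">Korean fir</a></td>
    </tr>
    <tr>
      <td class="plantname"><a href="#"><span class="genus">Pinus</span> <span class="species">mugo</span></a></td>
      <td class="common-name"><a href="#">mugo pine</a></td>
    </tr>
  </tbody>
</table>
"""

root = ET.fromstring(SAMPLE)
rows = []
# One dict per table row, mirroring the three item fields.
for tr in root.findall('.//tbody/tr'):
    rows.append({
        'name': tr.findtext("td[@class='common-name']/a"),
        'genus': tr.findtext("td[@class='plantname']/a/span[@class='genus']"),
        'species': tr.findtext("td[@class='plantname']/a/span[@class='species']"),
    })
print(rows)
```

One difference from the spider: findtext() returns a single string (or None when nothing matches), whereas .extract() returns a list of every matching text node.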