1
590 Scraping – NER shape features
Topics Scrapy – items.py Readings: Scrapy documentation April 4, 2017
2
Today
Scrapers from scrapy_documentation: loggingSpider.py, openAllLinks.py
Cleaning NLTK data: removing common words
Testing in Python: unittest
Testing websites
3
Scrapy notes
Focused narrow scrape (one domain)
Broad scrapes – better suited to
Dealing with JavaScript in Scrapy
4
Selenium and Scrapy

from scrapy.http import HtmlResponse
from selenium import webdriver

class JSMiddleware(object):
    def process_request(self, request, spider):
        # Load the page in a headless browser so that its JavaScript runs.
        driver = webdriver.PhantomJS()
        driver.get(request.url)
        # Hand the rendered HTML back to Scrapy in place of a normal download.
        body = driver.page_source
        return HtmlResponse(driver.current_url, body=body, encoding='utf-8', request=request)
5
Cfg file
Populating the settings
Settings can be populated using different mechanisms, each with a different precedence. In decreasing order of precedence:
1. Command line options (highest precedence)
2. Settings per-spider
3. Project settings module
4. Default settings per-command
5. Default global settings (lowest precedence)
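The precedence list above can be modelled in a few lines of plain Python. This is only a sketch of the idea, not Scrapy's own settings class: `LayeredSettings`, the source labels, and the numeric priorities are assumptions chosen to mirror the ordering on the slide.

```python
# Toy model of settings precedence. The numbers only encode the ordering
# from the slide; they are an assumption of this sketch.
PRIORITIES = {
    "default": 0,   # default global settings
    "command": 10,  # default settings per-command
    "project": 20,  # project settings module
    "spider": 30,   # settings per-spider
    "cmdline": 40,  # command line options
}


class LayeredSettings:
    """Keep, for each setting, the value from the highest-precedence source."""

    def __init__(self):
        self._values = {}  # name -> (priority, value)

    def set(self, name, value, source):
        priority = PRIORITIES[source]
        current = self._values.get(name)
        # A lower-precedence source never overwrites a higher one.
        if current is None or priority >= current[0]:
            self._values[name] = (priority, value)

    def get(self, name, default=None):
        entry = self._values.get(name)
        return default if entry is None else entry[1]


settings = LayeredSettings()
settings.set("LOG_FILE", None, "default")
settings.set("LOG_FILE", "from_cmdline.log", "cmdline")
settings.set("LOG_FILE", "from_project.log", "project")  # ignored: lower precedence
settings.set("DOWNLOAD_DELAY", 0, "default")
settings.set("DOWNLOAD_DELAY", 2, "spider")

print(settings.get("LOG_FILE"))
print(settings.get("DOWNLOAD_DELAY"))
```

The order in which sources are applied does not matter in this model; only their precedence does.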
6
Command line settings
scrapy crawl myspider -s LOG_FILE=scrapy.log
7
DEPTH_LIMIT
Default: 0
Scope: scrapy.spidermiddlewares.depth.DepthMiddleware
The maximum depth that will be allowed to crawl for any site. If zero, no limit will be imposed.
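To see what a depth limit does, here is a small stand-alone sketch that walks a made-up link graph. It does not use Scrapy; the `LINKS` graph and the `crawl` function are hypothetical, and the start page is assumed to have depth 0.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/b": [],
    "/a/1": ["/a/1/x"],
    "/a/1/x": [],
}


def crawl(start, depth_limit=0):
    """Return visited pages in crawl order; depth_limit=0 means no limit."""
    seen = {start}
    order = []
    queue = deque([(start, 0)])  # (page, depth)
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        for link in LINKS.get(page, []):
            next_depth = depth + 1
            if depth_limit and next_depth > depth_limit:
                continue  # deeper than the limit: the request is dropped
            if link not in seen:
                seen.add(link)
                queue.append((link, next_depth))
    return order


print(crawl("/", depth_limit=1))  # ['/', '/a', '/b']
print(crawl("/"))                 # ['/', '/a', '/b', '/a/1', '/a/1/x']
```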
8
DEPTH_PRIORITY
Default: 0
Scope: scrapy.spidermiddlewares.depth.DepthMiddleware
An integer that is used to adjust the request priority based on its depth:
- if zero (default), no priority adjustment is made from depth
- a positive value will decrease the priority, i.e. higher depth requests will be processed later; this is commonly used when doing breadth-first crawls (BFO)
- a negative value will increase priority, i.e. higher depth requests will be processed sooner (DFO)
See also: "Does Scrapy crawl in breadth-first or depth-first order?" about tuning Scrapy for BFO or DFO.
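The effect of the sign can be illustrated with a toy scheduler. The formula below (priority minus depth times DEPTH_PRIORITY) is an assumption consistent with the description above; the request list and helper names are hypothetical, and the sketch does not model Scrapy's actual queues.

```python
def adjusted_priority(base_priority, depth, depth_priority):
    """Positive depth_priority lowers the priority of deeper requests."""
    return base_priority - depth * depth_priority


# Hypothetical pending requests as (url, depth), all with base priority 0.
pending = [("/a/1/x", 3), ("/", 0), ("/a", 1), ("/a/1", 2)]


def processing_order(requests, depth_priority):
    # Higher priority is processed first. sorted() is stable, so requests
    # with equal priority keep their original order.
    ranked = sorted(
        requests,
        key=lambda r: adjusted_priority(0, r[1], depth_priority),
        reverse=True,
    )
    return [url for url, _depth in ranked]


print(processing_order(pending, 1))   # shallow pages first (breadth-first flavour)
print(processing_order(pending, -1))  # deep pages first (depth-first flavour)
print(processing_order(pending, 0))   # unchanged order
```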
9
DOWNLOAD_DELAY
Default: 0
The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard. Decimal numbers are supported.
Example:
DOWNLOAD_DELAY = 0.25 # 250 ms of delay
This setting is also affected by the RANDOMIZE_DOWNLOAD_DELAY setting (which is enabled by default). By default, Scrapy doesn’t wait a fixed amount of time between requests, but uses a random interval between 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY.
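The randomized interval can be sketched with the standard library. `next_delay` is a hypothetical helper written for this note, not a Scrapy function; the seeded generator is used only to make the run repeatable.

```python
import random


def next_delay(download_delay, randomize=True, rng=random):
    """Return the wait in seconds before the next request to the same site."""
    if randomize:
        # Uniform draw between 0.5x and 1.5x the configured delay.
        return rng.uniform(0.5 * download_delay, 1.5 * download_delay)
    return download_delay


rng = random.Random(0)  # seeded only so the sketch is repeatable
delays = [next_delay(0.25, rng=rng) for _ in range(5)]
print([round(d, 3) for d in delays])
print(next_delay(0.25, randomize=False))
```

With DOWNLOAD_DELAY = 0.25, every draw falls between 0.125 and 0.375 seconds.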
10
Conifers
11
Items.py

import scrapy

class ConifersItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    genus = scrapy.Field()
    species = scrapy.Field()
12
Middleware.py

from scrapy import signals

class ConifersSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None
13
    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass
14
    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
15
BOT_NAME = 'conifers'
SPIDER_MODULES = ['conifers.spiders']
NEWSPIDER_MODULE = 'conifers.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'conifers (+

# Obey robots.txt rules
ROBOTSTXT_OBEY = True
16
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
17
Pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See:

class ConifersPipeline(object):
    def process_item(self, item, spider):
        return item
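The generated pipeline passes items through unchanged. As an illustration of what a pipeline might do instead, here is a minimal sketch that normalizes whitespace in every field. `CleanWhitespacePipeline` and the sample item are hypothetical; a plain dict stands in for a ConifersItem so the logic can be tried without Scrapy.

```python
class CleanWhitespacePipeline:
    """Hypothetical pipeline: collapse and trim whitespace in string fields."""

    @staticmethod
    def _clean(value):
        # Collapse runs of whitespace and strip the ends; leave non-strings alone.
        return " ".join(value.split()) if isinstance(value, str) else value

    def process_item(self, item, spider):
        for key, value in list(item.items()):
            if isinstance(value, list):
                # extract() style results are lists of strings.
                item[key] = [self._clean(v) for v in value]
            else:
                item[key] = self._clean(value)
        return item


# Stand-in item with made-up values.
item = {"name": "  Noble  fir\n", "genus": ["Abies "], "species": " procera"}
cleaned = CleanWhitespacePipeline().process_item(item, spider=None)
print(cleaned)
```

As the template comment says, a pipeline like this only runs once it is listed in the ITEM_PIPELINES setting.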
18
coniferSpider

import scrapy
from conifers.items import ConifersItem

class ConiferSpider(scrapy.Spider):
    name = "conifer"
    allowed_domains = ["greatplantpicks.org"]
    start_urls = ['

    def parse(self, response):
        # Save the raw page body to a local file for inspection.
        #filename = response.url.split("/")[-2] + '.html'
        filename = 'conifers' + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
19
import scrapy
from conifers.items import ConifersItem
#from scrapy.selector import Selector
#from scrapy.http import HtmlResponse

class ConifersextractSpider(scrapy.Spider):
    name = "conifersExtract"
    allowed_domains = ["greatplantpicks.org"]
    start_urls = ['
20
    def parse(self, response):
        for sel in response.xpath('//tbody/tr'):
            item = ConifersItem()
            item['name']= text()').extract()
            item['genus'] =
            item['species'] =
            yield item
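The row-by-row extraction pattern can be tried outside Scrapy. The sketch below uses the standard library's ElementTree as a stand-in for Scrapy selectors; the sample table, the `extract_rows` helper, and the column order (name, genus, species) are all assumptions, since the XPath expressions on the slide did not survive.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample table. The real page layout and the
# slide's original XPath expressions are not known.
SAMPLE = """
<table>
  <tbody>
    <tr><td>Noble fir</td><td>Abies</td><td>procera</td></tr>
    <tr><td>Korean fir</td><td>Abies</td><td>koreana</td></tr>
  </tbody>
</table>
"""


def extract_rows(markup):
    """Yield one dict per table row, assuming name/genus/species columns."""
    root = ET.fromstring(markup.strip())
    for row in root.findall(".//tbody/tr"):
        cells = [(td.text or "").strip() for td in row.findall("td")]
        if len(cells) < 3:
            continue  # skip rows that do not match the assumed layout
        yield {"name": cells[0], "genus": cells[1], "species": cells[2]}


rows = list(extract_rows(SAMPLE))
for row in rows:
    print(row)
```

ElementTree needs well-formed markup and supports only a subset of XPath, so real pages are better handled by Scrapy's own selectors.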