Presentation on theme: "Scrapy Web Crawler Instructor: Bei Kang." — Presentation transcript:

1 Scrapy Web Crawler Instructor: Bei Kang

2 What is web crawling? Web crawling is a software technique for extracting information from websites. This is accomplished either by directly implementing the Hypertext Transfer Protocol (on which the Web is based) or by embedding a web browser.

3 Web crawling tools Lots of tools, frameworks and online services…
Python web scraping frameworks and libraries: Scrapy, beautifulsoup4, selenium

4 Scrapy Scrapy is a fast, open-source web crawling framework written in Python, used to extract data from web pages with the help of selectors based on XPath. Scrapy was first released on June 26, 2008 under the BSD license, with the milestone 1.0 release following in June 2015. It lets you scrape websites in a fast, simple, yet extensible way.

5 Why Use Scrapy? It makes it easier to build and scale large crawling projects. It has a built-in mechanism, called Selectors, for extracting data from websites. It handles requests asynchronously, so it is fast. It automatically adjusts the crawling speed using the AutoThrottle mechanism (see the settings sketch below). It is designed with developer accessibility in mind.
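A minimal sketch of how AutoThrottle can be enabled in a project's settings.py; the specific delay values below are illustrative, not taken from the slides:
# settings.py -- enable Scrapy's AutoThrottle extension (illustrative values)
AUTOTHROTTLE_ENABLED = True            # turn the extension on
AUTOTHROTTLE_START_DELAY = 1.0         # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 10.0          # upper bound on the delay under high latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # average parallel requests per remote site
AUTOTHROTTLE_DEBUG = False             # set to True to log every throttling decision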

6 Features of Scrapy Scrapy is an open-source, free-to-use web crawling framework. Scrapy generates feed exports in formats such as JSON, CSV, and XML. Scrapy has built-in support for selecting and extracting data from sources using either XPath or CSS expressions (see the example below). Being crawler-based, Scrapy can extract data from web pages automatically.
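A small sketch of the two selector styles, assuming a response object such as the one available inside scrapy shell; the .title class used here is only an illustrative example:
# the same extraction expressed with a CSS selector and with a (roughly equivalent) XPath selector
titles_css = response.css(".title::text").extract()
titles_xpath = response.xpath('//*[@class="title"]/text()').extract()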

7 Advantages of Using Scrapy
Scrapy is easily extensible, fast, and powerful. It is a cross-platform application framework (Windows, Linux, Mac OS, and BSD). Scrapy requests are scheduled and processed asynchronously. Scrapy has an accompanying service called Scrapyd, which lets you upload projects and control spiders through a JSON web service. It is possible to scrape any website, even if that website does not provide an API for raw data access.
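A sketch of how a spider run might be scheduled through Scrapyd's JSON API, assuming Scrapyd is running locally on its default port (6800) and that the requests package is installed; the project and spider names are placeholders:
import requests

# schedule.json is Scrapyd's endpoint for starting a spider run
resp = requests.post(
    "http://localhost:6800/schedule.json",
    data={"project": "ourfirstscraper", "spider": "redditbot"},  # placeholder names
)
print(resp.json())  # typically {"status": "ok", "jobid": "..."} on success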

8 (Scrapy architecture diagram)

9 Scrapy Architecture
1. The Engine gets the initial Requests to crawl from the Spider.
2. The Engine schedules the Requests in the Scheduler and asks for the next Requests to crawl.
3. The Scheduler returns the next Requests to the Engine.
4. The Engine sends the Requests to the Downloader, passing through the Downloader Middlewares (see process_request()).
5. Once the page finishes downloading, the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middlewares (see process_response()).
6. The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (see process_spider_input()).
7. The Spider processes the Response and returns scraped items and new Requests (to follow) to the Engine, passing through the Spider Middleware (see process_spider_output()).
8. The Engine sends processed items to the Item Pipelines, then sends processed Requests to the Scheduler and asks for possible next Requests to crawl.
9. The process repeats (from step 1) until there are no more requests from the Scheduler.
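As an illustration of the process_request()/process_response() hooks mentioned above, here is a minimal sketch of a custom downloader middleware; the class name and header value are assumptions, and the middleware would still need to be activated via DOWNLOADER_MIDDLEWARES in settings.py:
# middlewares.py -- illustrative downloader middleware (not from the slides)
class CustomHeaderMiddleware:
    def process_request(self, request, spider):
        # called for every Request on its way to the Downloader
        request.headers.setdefault("X-Example", "demo")
        return None  # returning None lets the request continue normally

    def process_response(self, request, response, spider):
        # called for every Response on its way back to the Engine and Spider
        spider.logger.debug("Got %s for %s", response.status, request.url)
        return response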

10 Install Scrapy To install Scrapy, run the following command:
pip install Scrapy
Then try:
> scrapy shell

11 Scrapy Fetch Inside the scrapy shell:
1. type: fetch("
2. then type: view(response)
3. type: print(response.text)
4. type: response.css(".title::text").extract()

12 Create a Scrapy Project scrapy startproject ourfirstscraper

13 Project File Structure
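The generated project typically has the following layout (shown here for the ourfirstscraper example above):
ourfirstscraper/
    scrapy.cfg            # deploy configuration file
    ourfirstscraper/      # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # folder where your spiders live
            __init__.py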

14 Create a Scrapy Project For now, the two most important files are:
settings.py – This file contains the settings for your project; you will be dealing with it a lot.
spiders/ – This folder is where all your custom spiders are stored. Every time you ask Scrapy to run a spider, it looks for it in this folder.

15 Creating a spider scrapy genspider redditbot This will create a new spider “redditbot.py” in your spiders/ folder with a basic template.
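For reference, the generated template looks roughly like the sketch below; the domain and start URL are placeholders (assumptions, not from the slides):
# spiders/redditbot.py -- roughly what genspider generates
import scrapy

class RedditbotSpider(scrapy.Spider):
    name = "redditbot"
    allowed_domains = ["example.com"]      # placeholder domain
    start_urls = ["https://example.com/"]  # placeholder start URL

    def parse(self, response):
        pass  # extraction logic goes here (see the next slide)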

16 def parse(self, response):
    # Extracting the content using css selectors
    titles = response.css('.title.may-blank::text').extract()
    votes = response.css('.score.unvoted::text').extract()
    times = response.css('time::attr(title)').extract()
    comments = response.css('.comments::text').extract()
    # Give the extracted content row wise
    for item in zip(titles, votes, times, comments):
        # create a dictionary to store the scraped info
        scraped_info = {
            'title': item[0],
            'vote': item[1],
            'created_at': item[2],
            'comments': item[3],
        }
        # yield or give the scraped info to scrapy
        yield scraped_info

17 Check the output >> scrapy crawl redditbot
If you want to store the data in a formatted file, you can follow the trick below.
Exporting scraped data as a CSV: open the settings.py file and add the following lines to it:
#Export as CSV Feed
FEED_FORMAT = "csv"
FEED_URI = "reddit.csv"
Then run: scrapy crawl redditbot
Or, without editing settings.py, you can simply do:
scrapy crawl redditbot -o reddit.csv
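Note that in newer Scrapy releases (2.1 and later), FEED_FORMAT and FEED_URI are deprecated in favor of the single FEEDS setting; a roughly equivalent configuration would be:
# settings.py -- feed export via the newer FEEDS setting
FEEDS = {
    "reddit.csv": {"format": "csv"},
}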

18 Define Items In items.py, you can add the following lines:
titles = scrapy.Field()
votes = scrapy.Field()
times = scrapy.Field()
comments = scrapy.Field()
Then, in redditbot.py, change the parse function to fill in the item fields:
item['titles'] = titles
item['votes'] = votes
item['times'] = times
item['comments'] = comments
yield item
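Putting the two pieces together, a minimal sketch might look like this; the RedditItem class name, the import path, and the per-row assignment are assumptions, since the slides only list the field definitions:
# items.py
import scrapy

class RedditItem(scrapy.Item):  # class name is an assumption
    titles = scrapy.Field()
    votes = scrapy.Field()
    times = scrapy.Field()
    comments = scrapy.Field()

# redditbot.py -- the parse() method inside the spider class, yielding one item per row
import scrapy
from ourfirstscraper.items import RedditItem  # project package name assumed

class RedditbotSpider(scrapy.Spider):
    name = "redditbot"
    # allowed_domains / start_urls as generated earlier

    def parse(self, response):
        titles = response.css('.title.may-blank::text').extract()
        votes = response.css('.score.unvoted::text').extract()
        times = response.css('time::attr(title)').extract()
        comments = response.css('.comments::text').extract()
        for row in zip(titles, votes, times, comments):
            item = RedditItem()
            item['titles'] = row[0]
            item['votes'] = row[1]
            item['times'] = row[2]
            item['comments'] = row[3]
            yield item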

19 Note If a website does not allow simple header-less requests, we need to use start_requests() instead of start_urls to send the request with custom headers (a complete sketch follows below):
from scrapy import Request

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/ (KHTML, like Gecko) Chrome/ Safari/537.36',
}

def start_requests(self):
    url = '
    yield Request(url, headers=self.headers)
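A self-contained sketch of that pattern; the spider name, URL, and User-Agent string below are placeholders, not taken from the slides:
# headers_spider.py -- illustrative spider using start_requests() with custom headers
import scrapy
from scrapy import Request

class HeadersSpider(scrapy.Spider):
    name = "headersbot"  # placeholder spider name
    headers = {
        # placeholder User-Agent string
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    }

    def start_requests(self):
        url = "https://example.com/"  # placeholder URL
        yield Request(url, headers=self.headers, callback=self.parse)

    def parse(self, response):
        self.logger.info("Fetched %s with status %s", response.url, response.status)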

20 Try to download craigslist data using a scrapy spider
Creating a Scrapy Spider In your terminal, navigate to the folder of the Scrapy project we created in the previous step. Since we called it craigslist, the folder has the same name, and the commands are simply:
>> cd craigslist
>> scrapy genspider jobs

21 Try to download craigslist data using a scrapy spider
# -*- coding: utf-8 -*-
import scrapy

class JobsSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["craigslist.org"]
    start_urls = ['

    def parse(self, response):
        pass

22 Try to download craigslist data using a scrapy spider
Editing the parse() Function Instead of pass, add this line to the parse() function:
titles = response.xpath('//a[@class="result-title hdrlnk"]/text()').extract()

23 Try to download craigslist data using a scrapy spider
titles is a [list] of text portions extracted based on a rule. response holds everything that was retrieved from the page. Actually, "response" has a deeper meaning: if you print(response) you will get a short summary (the HTTP status code and the URL), which means "you have managed to connect to this web page"; however, if you print(response.body) you will get the whole source code. Anyhow, when you use XPath expressions to extract HTML nodes, you should directly use response.xpath(). xpath is how we will extract portions of text, and it follows rules. XPath is a detailed topic and we will dedicate a separate article to it, but generally try to notice the following:

24 Try to download craigslist data using a scrapy spider
Open the URL in your browser, move the cursor over any job title, right-click, and select "Inspect". You can now see HTML code like this:
<a href="/brk/egr/ html" data-id=" " class="result-title hdrlnk">Chief Engineer</a>
So you want to extract "Chief Engineer", which is the text of an <a> tag, and as you can see this <a> tag has the class "result-title hdrlnk", which distinguishes it from the other <a> tags on the web page. Let's explain the XPath rule we have (a shell example follows below):
// means: instead of starting from <html>, just start from the tag that I will specify after it.
/a simply refers to the <a> tag.
[@class="result-title hdrlnk"], which comes directly after /a, means the <a> tag must have this class name.
text() refers to the text of the <a> tag, which is "Chief Engineer".
extract() means: extract every instance on the web page that follows the same XPath rule into a [list].
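Putting the rule together inside scrapy shell (the expression below is reconstructed from the class name shown above; a roughly equivalent CSS selector is included for comparison):
# inside scrapy shell, after fetching the job-listing page
titles = response.xpath('//a[@class="result-title hdrlnk"]/text()').extract()
titles_css = response.css('a.result-title.hdrlnk::text').extract()  # roughly equivalent CSS form
print(titles[:5])  # first few job titles, e.g. "Chief Engineer"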

25 Try to download craigslist data using a scrapy spider
# -*- coding: utf-8 -*-
import scrapy

class JobsSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["craigslist.org"]
    start_urls = ['

    def parse(self, response):
        titles = response.xpath('//a[@class="result-title hdrlnk"]/text()').extract()
        for title in titles:
            yield {'Title': title}

Run the spider and export the titles with:
scrapy crawl jobs -o job_titles.csv

