Slide 1: Scrapy Web Crawler (Instructor: Bei Kang)
Slide 2: Try to download craigslist data using a Scrapy spider
Creating a Scrapy Spider
In your terminal, navigate to the folder of the Scrapy project we created in the previous step. Since we called it craigslist, the folder has the same name, and the commands are simply:
>> cd craigslist
>> scrapy genspider findingjobs craigslist.org
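If you want to confirm where genspider put the new file, a quick check like the sketch below works. It assumes the default Scrapy project layout, where new spiders land in craigslist/spiders/<name>.py inside the project folder; the script name is just an example.

# list_spiders.py -- minimal sketch; run it from inside the project folder
# (the one you entered with "cd craigslist"). Assumes the default layout,
# where genspider writes new spiders to craigslist/spiders/<name>.py.
from pathlib import Path

for spider_file in sorted(Path("craigslist/spiders").glob("*.py")):
    if spider_file.name != "__init__.py":
        print("Found spider module:", spider_file)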
Slide 3: Try to download craigslist data using a Scrapy spider
# -*- coding: utf-8 -*-
import scrapy


class FindingJobsSpider(scrapy.Spider):
    name = "findingjobs"
    allowed_domains = ["craigslist.org"]
    start_urls = ['...']  # the listing URL was cut off on the slide; use the craigslist jobs page you want to scrape

    def parse(self, response):
        pass
Slide 4: Try to download craigslist data using a Scrapy spider
Editing the parse() Function
Instead of pass, add this line to the parse() function:
titles = response.css('.hdrlnk::text').extract()
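For context, the parse() method then looks like the sketch below. It belongs inside the FindingJobsSpider class; the logging line is an optional addition (not on the slide) that reports how many titles were matched.

def parse(self, response):
    # all text nodes of elements with the class "hdrlnk", as a list of strings
    titles = response.css('.hdrlnk::text').extract()
    self.logger.info("Matched %d job titles on %s", len(titles), response.url)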
Slide 5: Try to download craigslist data using a Scrapy spider
titles is a [list] of text portions extracted according to a rule. response is, roughly speaking, the whole HTML source code retrieved from the page. Actually, "response" carries more than that: if you print(response) you get something like <200 https://…>, which means "you have managed to connect to this web page"; if you print(response.body), however, you get the whole source code. In any case, when you use CSS expressions to extract HTML nodes, you apply them directly with response.css().
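You can see the difference without running a crawl by building a response object by hand. The sketch below uses Scrapy's HtmlResponse with a tiny made-up HTML snippet and an assumed URL, purely for illustration:

from scrapy.http import HtmlResponse

html = b'<html><body><a class="result-title hdrlnk">Part time accountant for tea business</a></body></html>'
response = HtmlResponse(url="https://craigslist.org/", body=html, encoding="utf-8")

print(response)                                 # <200 https://craigslist.org/> -- status code and URL
print(response.body)                            # the whole (here: tiny) HTML source
print(response.css('.hdrlnk::text').extract())  # ['Part time accountant for tea business']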
Slide 6: Try to download craigslist data using a Scrapy spider
Open the URL in your browser, move the cursor over any job title, right-click, and select "Inspect". You will see HTML code like this:
<a href="/brk/egr/ html" data-id=" " class="result-title hdrlnk">Part time accountant for tea business</a>
So, you want to extract "Part time accountant for tea business", which is the text of an <a> tag, and as you can see this <a> tag has the class "result-title hdrlnk", which distinguishes it from other <a> tags on the web page. Using CSS selector syntax we can extract the text of the hdrlnk elements from the response.
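The sketch below shows why the class matters. It runs Scrapy's Selector on a small fragment adapted from the slide (the second link is invented for illustration): the '.hdrlnk::text' rule picks out only the job title, while a selector on every <a> tag is too broad.

from scrapy.selector import Selector

html = '''
<a href="/brk/egr/..." class="result-title hdrlnk">Part time accountant for tea business</a>
<a href="/about" class="nav-link">about craigslist</a>
'''
sel = Selector(text=html)
print(sel.css('.hdrlnk::text').extract())   # ['Part time accountant for tea business']
print(sel.css('a::text').extract())         # both link texts -- too broad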
Slide 7: Try to download craigslist data using a Scrapy spider
# -*- coding: utf-8 -*-
import scrapy


class FindingJobsSpider(scrapy.Spider):
    name = "findingjobs"
    allowed_domains = ["craigslist.org"]
    start_urls = ['...']  # the listing URL was cut off on the slide; use the craigslist jobs page you want to scrape

    def parse(self, response):
        titles = response.css('.hdrlnk::text').extract()
        for title in titles:
            yield {'Title': title}

Run the spider and export the results to a CSV file:
>> scrapy crawl findingjobs -o job_titles.csv
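After the crawl finishes, you can inspect the exported file from plain Python. This is a small sketch assuming job_titles.csv was written to the current directory with the Title column produced by the yield above:

import csv

with open("job_titles.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

print("Number of job titles scraped:", len(rows))
for row in rows[:5]:
    print(row["Title"])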
Slide 8: Try to download craigslist data using a Scrapy spider
Now, please try the following function:

def parse(self, response):
    blocks = response.css('.result-info')
    for block in blocks:
        title = block.css('.hdrlnk::text').extract()
        datetime = block.css('time::attr(datetime)').extract()
        url = block.css('a[href*=https]::attr(href)').extract()
        location = block.css('.result-hood::text').extract()
        yield {'Title': title, 'DateTime': datetime, 'Location': location, 'URL': url}

Then run it and export the results:
>> scrapy crawl findingjobs -o job_details.csv
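One optional refinement, sketched below and not on the slide: because each field is extracted within a single result block, you can use get() instead of extract() so the CSV contains plain strings rather than one-element lists (get() is the Scrapy selector method that returns the first match, or None if nothing matches).

def parse(self, response):
    for block in response.css('.result-info'):
        yield {
            'Title': block.css('.hdrlnk::text').get(),
            'DateTime': block.css('time::attr(datetime)').get(),
            'Location': block.css('.result-hood::text').get(),
            'URL': block.css('a[href*=https]::attr(href)').get(),
        }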