Slide 1: Scrapy Web Crawler (Instructor: Bei Kang)
Slide 2: Try to download craigslist data using a Scrapy spider
Creating a Scrapy Spider
In your terminal, navigate to the folder of the Scrapy project we created in the previous step. Since we called it craigslist, the folder has the same name, and the commands are simply:
>> cd craigslist
>> scrapy genspider findingjobs craigslist.org
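If you want to confirm where genspider put the new file, a quick check like the sketch below works. It assumes the default Scrapy project layout, where new spiders land in craigslist/spiders/<name>.py inside the project folder; the script name is just an example.

# list_spiders.py -- minimal sketch; run it from inside the project folder
# (the one you entered with "cd craigslist"). Assumes the default layout,
# where genspider writes new spiders to craigslist/spiders/<name>.py.
from pathlib import Path

for spider_file in sorted(Path("craigslist/spiders").glob("*.py")):
    if spider_file.name != "__init__.py":
        print("Found spider module:", spider_file)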
Slide 3: Try to download craigslist data using a Scrapy spider
# -*- coding: utf-8 -*-
import scrapy


class FindingJobsSpider(scrapy.Spider):
    name = "findingjobs"
    allowed_domains = ["craigslist.org"]
    start_urls = ['...']  # the listing URL was cut off on the slide; use the craigslist jobs page you want to scrape

    def parse(self, response):
        pass
Slide 4: Try to download craigslist data using a Scrapy spider
Editing the parse() Function
Instead of pass, add this line to the parse() function:
titles = response.css('.hdrlnk::text').extract()
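For context, the parse() method then looks like the sketch below. It belongs inside the FindingJobsSpider class; the logging line is an optional addition (not on the slide) that reports how many titles were matched.

def parse(self, response):
    # all text nodes of elements with the class "hdrlnk", as a list of strings
    titles = response.css('.hdrlnk::text').extract()
    self.logger.info("Matched %d job titles on %s", len(titles), response.url)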
Slide 5: Try to download craigslist data using a Scrapy spider
titles is a [list] of text portions extracted according to a rule. response is, roughly speaking, the whole HTML source code retrieved from the page. Actually, "response" carries more than that: if you print(response) you get something like <200 https://…>, which means "you have managed to connect to this web page"; if you print(response.body), however, you get the whole source code. In any case, when you use CSS expressions to extract HTML nodes, you apply them directly with response.css().
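You can see the difference without running a crawl by building a response object by hand. The sketch below uses Scrapy's HtmlResponse with a tiny made-up HTML snippet and an assumed URL, purely for illustration:

from scrapy.http import HtmlResponse

html = b'<html><body><a class="result-title hdrlnk">Part time accountant for tea business</a></body></html>'
response = HtmlResponse(url="https://craigslist.org/", body=html, encoding="utf-8")

print(response)                                 # <200 https://craigslist.org/> -- status code and URL
print(response.body)                            # the whole (here: tiny) HTML source
print(response.css('.hdrlnk::text').extract())  # ['Part time accountant for tea business']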
Slide 6: Try to download craigslist data using a Scrapy spider
Open the URL in your browser, move the cursor over any job title, right-click, and select "Inspect". You will see HTML code like this:
<a href="/brk/egr/ html" data-id=" " class="result-title hdrlnk">Part time accountant for tea business</a>
So, you want to extract "Part time accountant for tea business", which is the text of an <a> tag, and as you can see this <a> tag has the class "result-title hdrlnk", which distinguishes it from other <a> tags on the web page. Using CSS selector syntax we can extract the text of the hdrlnk elements from the response.
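The sketch below shows why the class matters. It runs Scrapy's Selector on a small fragment adapted from the slide (the second link is invented for illustration): the '.hdrlnk::text' rule picks out only the job title, while a selector on every <a> tag is too broad.

from scrapy.selector import Selector

html = '''
<a href="/brk/egr/..." class="result-title hdrlnk">Part time accountant for tea business</a>
<a href="/about" class="nav-link">about craigslist</a>
'''
sel = Selector(text=html)
print(sel.css('.hdrlnk::text').extract())   # ['Part time accountant for tea business']
print(sel.css('a::text').extract())         # both link texts -- too broad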
Slide 7: Try to download craigslist data using a Scrapy spider
# -*- coding: utf-8 -*-
import scrapy


class FindingJobsSpider(scrapy.Spider):
    name = "findingjobs"
    allowed_domains = ["craigslist.org"]
    start_urls = ['...']  # the listing URL was cut off on the slide; use the craigslist jobs page you want to scrape

    def parse(self, response):
        titles = response.css('.hdrlnk::text').extract()
        for title in titles:
            yield {'Title': title}

Run the spider and export the results to a CSV file:
>> scrapy crawl findingjobs -o job_titles.csv
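After the crawl finishes, you can inspect the exported file from plain Python. This is a small sketch assuming job_titles.csv was written to the current directory with the Title column produced by the yield above:

import csv

with open("job_titles.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

print("Number of job titles scraped:", len(rows))
for row in rows[:5]:
    print(row["Title"])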
Slide 8: Try to download craigslist data using a Scrapy spider
Now, please try the following function:

def parse(self, response):
    blocks = response.css('.result-info')
    for block in blocks:
        title = block.css('.hdrlnk::text').extract()
        datetime = block.css('time::attr(datetime)').extract()
        url = block.css('a[href*=https]::attr(href)').extract()
        location = block.css('.result-hood::text').extract()
        yield {'Title': title, 'DateTime': datetime, 'Location': location, 'URL': url}

Then run it and export the results:
>> scrapy crawl findingjobs -o job_details.csv
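One optional refinement, sketched below and not on the slide: because each field is extracted within a single result block, you can use get() instead of extract() so the CSV contains plain strings rather than one-element lists (get() is the Scrapy selector method that returns the first match, or None if nothing matches).

def parse(self, response):
    for block in response.css('.result-info'):
        yield {
            'Title': block.css('.hdrlnk::text').get(),
            'DateTime': block.css('time::attr(datetime)').get(),
            'Location': block.css('.result-hood::text').get(),
            'URL': block.css('a[href*=https]::attr(href)').get(),
        }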