
1 Basic Web Scraping with Python
Everything you need to get data from the web

2 Python is great for collecting data from the web
Python is unique as a programming language, in that it has very strong communities for both web programming and data management/analysis. As a result, your entire computational research pipeline can be integrated. Instead of having separate applications for data collection, storage, management, analysis and visualization, you have one code-base.

3 Approaches to Web Scraping in Python
There are two primary approaches to web scraping in Python:
1. Customize a canned spider using ScraPy
2. Create a fully custom spider using requests, lxml, sqlalchemy and celery
In general, unless you’re trying to do something really unusual - such as distributed, high-throughput crawling - ScraPy is the right choice.

4 How should I manage my data?
Before we start talking about the specifics of each approach, we need to address our data management strategy. The right answer is basically the same for everyone: store it in a PostgreSQL database. PostgreSQL is available for all operating systems, and is easy to set up. Research computing also offers managed PostgreSQL databases. You will want to create a table for each scraping project, with columns for the URL of the scraped resource, the date and time it was scraped, and the various data fields you are collecting. If the data you’re collecting doesn’t fit easily into a standard field, you can create a JSON column and store it there.
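As a rough sketch, such a table might be defined with SQL Alchemy along these lines (the table name, column names and connection string below are just placeholders, not part of the slides):

import sqlalchemy
from sqlalchemy.dialects.postgresql import JSONB

metadata = sqlalchemy.MetaData()

# one table per scraping project: the URL, when it was scraped, and the fields you collect
scraped_pages = sqlalchemy.Table(
    'scraped_pages', metadata,
    sqlalchemy.Column('id', sqlalchemy.Integer, primary_key=True),
    sqlalchemy.Column('url', sqlalchemy.Text, nullable=False),
    sqlalchemy.Column('scraped_at', sqlalchemy.DateTime, nullable=False),
    sqlalchemy.Column('title', sqlalchemy.Text),
    sqlalchemy.Column('extra', JSONB),  # catch-all for data that doesn't fit a standard field
)

# replace the connection string with your own database's details
engine = sqlalchemy.create_engine('postgresql://user:password@localhost/my_database')
metadata.create_all(engine)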

5 Creating a spider using ScraPy
Once you’ve installed ScraPy, the first step is to create a new project:

$ scrapy startproject <your_project_name>

Next, change directory to the newly created project directory, and create a spider:

$ cd <your_project_name>
$ scrapy genspider <name> <domain>

6 Creating a spider using ScraPy (continued…)
Once that’s done, a directory will be created with the name of your project, and below that, a directory called spiders, with a Python file having the name of the spider you generated. If you open it up it will look something like this:

import scrapy


class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        pass

7 Creating a spider using ScraPy (continued…)
In most cases, all you need to do is add data and link extraction code to the parse method, for example:

def parse(self, response):
    for h3 in response.xpath('//h3').extract():
        yield {"title": h3}

    for url in response.xpath('//a/@href').extract():
        yield scrapy.Request(url, callback=self.parse)

8 Creating a spider using ScraPy (continued…)
If you want to add custom code to generate starting URLs for scraping, you can do that by overriding the start_requests method, like this:

def start_requests(self):
    yield scrapy.Request('http://www.example.com/1.html', self.parse)
    yield scrapy.Request('http://www.example.com/2.html', self.parse)
    yield scrapy.Request('http://www.example.com/3.html', self.parse)

9 ScraPy canned spider templates
ScraPy includes a number of spider templates that make building scrapers for common types of sites and data formats easy, including:
CrawlSpider - generic spider, allows you to specify rules for following links and extracting items rather than write extraction code (see the sketch below).
XMLFeedSpider - spider for crawling XML feeds.
CSVFeedSpider - like the XMLFeedSpider, but for CSV document feeds.
SitemapSpider - spider that crawls based on links listed in a sitemap.
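For instance, a CrawlSpider might look something like this sketch (the domain and the /articles/ link rule are placeholder assumptions, not from the slides):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MyCrawlSpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    # follow internal links and hand matching pages to parse_item
    rules = (
        Rule(LinkExtractor(allow=r'/articles/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        for h3 in response.xpath('//h3/text()').extract():
            yield {'title': h3}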

10 Using ScraPy spiders
Once you’ve set up your ScraPy project, the next step (from the project directory) is to test and run it. You can test your extraction logic by running the command: scrapy shell --spider=<spider> <url>. This will start an interactive shell session, so you can verify that your logic is extracting data and links from the page as intended. Once you’re satisfied that your spider is working properly, you can crawl a site by running the command: scrapy crawl <spider> -o results.json.

11 ScraPy shell
The ScraPy shell is a good place to verify that you’ve written your spider correctly. From the shell you can test your parse function and extraction code like so:

# Produce a list of all links and data extracted from the target url
list(spider.parse(response))

# You can also test xpath selectors here
for anchor_text in response.xpath("//a/text()").extract():
    print(anchor_text)

12 What if ScraPy doesn’t do what you need?
First, it is worth your time to double-check: ScraPy is very configurable, and you want to avoid reinventing the wheel if at all possible. If you are absolutely certain ScraPy isn’t up to the task, the next step is a custom spider. For this, you will need Requests, LXML, SQL Alchemy and maybe Celery.

13 Libraries used to create a custom spider
Requests - This is Python’s best HTTP request library. You will be using it to fetch web pages using get/post/etc requests.
LXML - This is Python’s best HTML/XML parsing and processing library. You will be using it to extract data and links from the fetched web pages.
SQL Alchemy - This is Python’s best database library. You will be using it to write scraping results into a database.
Celery - This is a task queue library. Using Celery is optional, unless you want multiple spiders running on the same domain simultaneously.

14 Requests
Using requests is drop dead simple. For example:

>>> import requests
>>> r = requests.get('https://github.com/timeline.json')
>>> r.text
u'[{"repository":{"open_issues":0,"url":"https://github.com/...'
>>> r = requests.post('http://httpbin.org/post', data={'key': 'value'})
>>> requests.get('https://api.github.com/user', auth=('user', 'pass'))
<Response [200]>

That is pretty much all there is to it!

15 LXML
LXML is a large, powerful library that can take a while to master. You will need to become familiar with xpath in order to take advantage of LXML’s power.

from lxml import etree

# here r is a requests response object for an HTML page, as in the previous example
document = etree.HTML(r.text)
anchor_tags = document.xpath("//a")
hrefs = [a.attrib.get("href") for a in anchor_tags]
anchor_text = [a.xpath("text()") for a in anchor_tags]

16 SQL Alchemy
SQL Alchemy is a large, powerful library for interacting with databases in Python. There is a lot to learn, but initially you just need to know how to insert data.

import sqlalchemy

# replace the connection string with your own database's details
engine = sqlalchemy.create_engine('postgresql://user:password@localhost/my_database')
connection = engine.connect()
tables = sqlalchemy.MetaData(bind=engine, reflect=True).tables
connection.execute(tables["my_table"].insert(), column_1="random", column_2="data")

17 Celery
This part is optional. You don’t need Celery unless you have several spiders running on a site simultaneously. Celery provides you with a task queue. This queue might hold all the pages you want to scrape. You would then have tasks that scrape a single page, extract data from it and push any found links to the task queue. You can add jobs to the task queue in one part of your program, and separate worker processes complete jobs from the queue. Celery requires a separate RabbitMQ (or equivalent) server.

18 Celery Worker
Celery uses worker processes to handle jobs from the queue. For example, in worker.py:

import requests
from celery import Celery

# point the broker at your RabbitMQ (or equivalent) server
app = Celery('tasks', broker='amqp://guest@localhost//')

@app.task
def scrape(url):
    # extract_data, add_data_to_database and extract_links are your own helper functions
    text = requests.get(url).text
    data = extract_data(text)
    add_data_to_database(data)
    for link in extract_links(text):
        scrape.delay(link)
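To start a crawl, some other part of your program just pushes the first job onto the queue and the workers take it from there. A minimal sketch, assuming the worker module above and a placeholder start URL:

from worker import scrape

# enqueue the first page; the scrape task enqueues any links it finds
scrape.delay('http://example.com/')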

19 Running a Celery worker
If you want your tasks to be completed, you need to have one or more worker processes running. Starting a worker process is as simple as:

$ celery -A tasks worker
