1
Packages: Scrapy, Beautiful Soup
Scrapy website: http://doc.scrapy.org/en/latest/index.html and http://scrapy.org
Scrapy is an application framework for crawling web sites and extracting structured data. It can also be used to extract data through Application Programming Interfaces (APIs), such as Amazon Associates Web Services and the Twitter API.
2
A webpage: http://stackoverflow.com/questions?sort=votes
3
A short Python script:
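The script itself appeared only as a screenshot; what follows is a close reconstruction based on the spider example the scrapy.org front page carried at the time (the CSS selectors and field names are assumptions inferred from the JSON output on the next slide):

import scrapy

class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['http://stackoverflow.com/questions?sort=votes']

    def parse(self, response):
        # follow the link to each top-voted question
        for href in response.css('.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        # emit one dict per question; the keys match the JSON on the next slide
        yield {
            'title': response.css('h1 a::text').extract()[0],
            'votes': response.css('.question .vote-count-post::text').extract()[0],
            'body': response.css('.question .post-text').extract()[0],
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url,
        }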
5
{"body": "... LONG HTML HERE...", "votes": "12842", "title": "Why is processing a sorted array faster than an unsorted array?", "link": "http://stackoverflow.com/questio ns/11227809/why-is-processing-a- sorted-array-faster-than-an- unsorted-array", "tags": ["java", "c++", "performance", "optimization", "branch-prediction"]}
6
The installation steps assume that you have the following things installed:
Python 2.7
pip and setuptools Python packages
lxml
OpenSSL
You can install Scrapy using pip (which is the canonical way to install Python packages): http://doc.scrapy.org/en/latest/intro/install.html
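With the prerequisites above in place, the install itself is the one pip command the linked guide documents:

>>> pip install Scrapy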
7
Install pip: https://pip.pypa.io/en/latest/installing.html
You might run into a problem during installation.
8
You can solve the problem by using another package manager, such as Homebrew or MacPorts, as a substitute for pip.
9
Type the install command and enter your password when prompted. After Scrapy is installed, run the upgrade command as shown below; this will allow you to use all the goodies from Scrapy 1.0.
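The exact commands were shown only as screenshots; they were most likely the standard sudo-based pip install and upgrade (an assumption, not confirmed by the slides):

>>> sudo pip install scrapy             # prompts for your password
>>> sudo pip install --upgrade scrapy   # brings the install up to Scrapy 1.0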
10
Scrapy is controlled through the “scrapy” command-line tool. The Scrapy tool provides several commands, for multiple purposes, and each one accepts a different set of arguments and options.
12
Go to the folder where you want to store your project, and type:
>>> scrapy startproject projectname
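startproject generates a skeleton project. The layout below is the standard one Scrapy creates, using the projectname placeholder from above:

projectname/
    scrapy.cfg            # deploy configuration file
    projectname/          # the project's Python module
        __init__.py
        items.py          # project Item definitions
        pipelines.py      # project pipelines
        settings.py       # project settings
        spiders/          # directory where you will put your spiders
            __init__.py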
13
>>> scrapy genspider spidername weblink
14
To define a common output data format, Scrapy provides the Item class. Item objects are simple containers used to collect the scraped data: they hold structured data extracted from unstructured sources (see the sketch below).
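A minimal Item for the Stack Overflow example above (the field names are assumptions that mirror the earlier JSON output):

import scrapy

class QuestionItem(scrapy.Item):
    # each Field declares one attribute the spider will fill in
    title = scrapy.Field()
    votes = scrapy.Field()
    body = scrapy.Field()
    tags = scrapy.Field()
    link = scrapy.Field()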
15
>>> scrapy genspider spidername weblink
The generated class extends scrapy.Spider. This is the simplest spider, and the one from which every other spider must inherit (including spiders that come bundled with Scrapy, as well as spiders that you write yourself). Its key attributes are name, allowed_domains, start_urls, etc.
16
Save the following code in a file named dmoz_spider.py under the spiders directory (genspider creates such a file for you); it is reconstructed below.
parse(): in charge of processing the response and returning scraped data and/or more URLs to follow. This method, as well as any other Request callback, must return an iterable of Request and/or dicts or Item objects.
Response: the response object downloaded for each start URL.
The spider also defines the name of the spider and the webpages to crawl, and this version writes the data to files.
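The code on this slide is, in all likelihood, the spider from the official Scrapy tutorial, reconstructed here (the tutorial version saves each page's raw HTML to a file):

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"                      # name of the spider
    allowed_domains = ["dmoz.org"]
    start_urls = [                     # webpages to crawl
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # write data to files: one HTML file per start URL
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)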
17
Go to the project's top-level folder and run the crawl command:
>>> scrapy crawl dmoz
18
Check the two output files: Resources.html and Books.html
19
To extract data from the HTML source, there are several libraries available to achieve this:
BeautifulSoup is a very popular web scraping library among Python programmers. It constructs a Python object based on the structure of the HTML code and also deals with bad markup reasonably well, but it has one drawback: it's slow.
lxml is an XML parsing library (which also parses HTML) with a pythonic API based on ElementTree.
Scrapy comes with its own mechanism for extracting data. They're called selectors because they "select" certain parts of the HTML document, specified either by XPath or CSS expressions. XPath is a language for selecting nodes in XML documents, which can also be used with HTML; a short example follows.
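A minimal sketch of the two selector styles, as run in the Scrapy shell (the title query is only an illustration):

>>> response.xpath('//title/text()').extract()   # XPath expression
>>> response.css('title::text').extract()        # equivalent CSS expression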
20
The code on this slide does three things: import the package, extract the content of the webpage, and write the data to files (a sketch follows).
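The screenshot is missing; a hedged reconstruction, assuming the same DMOZ pages as before and an illustrative XPath query:

import scrapy                                    # import package

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # extract content of the webpage: link texts in the directory listing
        titles = response.xpath('//ul/li/a/text()').extract()
        # write data to files: one .txt file per start URL
        filename = response.url.split("/")[-2] + '.txt'
        with open(filename, 'w') as f:
            for title in titles:
                f.write(title.encode('utf-8') + '\n')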
21
Check the two output files: Resources.txt and Books.txt
22
HTML source code: selectors. They can extract the text of all matching elements from an HTML response body, returning a list of unicode strings.
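For example, constructing a Selector by hand (this snippet follows the Scrapy selector documentation):

>>> from scrapy.selector import Selector
>>> sel = Selector(text='<html><body><span>good</span></body></html>')
>>> sel.xpath('//span/text()').extract()
[u'good']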
23
Selectors with regular expressions. A regular expression is a sequence of characters that defines a search pattern, mainly used for pattern matching with strings.
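Selectors expose this through their .re() method; a small sketch with made-up markup:

>>> sel = Selector(text='<a href="#">Name: My image 1</a>')
>>> sel.xpath('//a/text()').re(r'Name:\s*(.*)')
[u'My image 1']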
25
xpath(query): find nodes matching the XPath query and return the result as a SelectorList instance with all elements flattened. List elements implement the Selector interface too.
css(query): apply the given CSS selector and return a SelectorList instance.
extract(): serialize and return the matched nodes as a list of unicode strings.
re(regex): apply the given regex and return a list of unicode strings with the matches.
register_namespace(prefix, uri): register the given namespace to be used in this Selector. Without registering namespaces you can't select or extract data from non-standard namespaces.
remove_namespaces(): remove all namespaces, allowing you to traverse the document using namespace-less XPaths.
__nonzero__(): return True if there is any real content selected, or False otherwise.
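These methods chain naturally, since xpath() and css() return SelectorLists that implement the same interface; a short illustrative session (the markup is made up):

>>> from scrapy.selector import Selector
>>> sel = Selector(text='<div><a href="/img1.html">Image 1</a></div>')
>>> sel.xpath('//div').css('a::attr(href)').extract()
[u'/img1.html']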
26
Test if the extracted data is as we expect:
>>> scrapy shell url
E.g.: scrapy shell http://www.dmoz.org/Computers/Programming/Languages/Python/Books/
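Inside the shell the downloaded page is pre-bound to the name response, so queries can be tried interactively (the XPath expressions here are illustrative):

>>> response.xpath('//title/text()').extract()   # the page title as a list
>>> response.xpath('//ul/li/a/@href').extract()  # candidate link URLs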
28
The code on this slide does three things: import the packages, select the source code of the webpage, and search it for structured data to extract (a sketch follows).
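The screenshot is missing; if this is the Beautiful Soup example promised on the first slide, it likely resembled the following Python 2.7 sketch (the URL and tags are assumptions):

import urllib2                       # import packages
from bs4 import BeautifulSoup

# select the source code of the webpage
url = 'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/'
html = urllib2.urlopen(url).read()

# searching for structured data and extracting it
soup = BeautifulSoup(html, 'lxml')
for link in soup.find_all('a'):
    print link.get('href'), link.get_text()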
30
http://doc.scrapy.org/en/latest/index.html
http://www.w3.org/TR/xpath/