Packages: Scrapy, Beautiful Soup

Presentation transcript:

1 Packages: Scrapy, Beautiful Soup. Scrapy website: http://scrapy.org; documentation: http://doc.scrapy.org/en/latest/index.html. Scrapy is an application framework for crawling web sites and extracting structured data. It can also be used to extract data through Application Programming Interfaces (APIs), such as Amazon Associates Web Services and the Twitter API.

2 A webpage: http://stackoverflow.com/questions?sort=votes

3 A piece of Python script

4

5 {"body": "... LONG HTML HERE...", "votes": "12842", "title": "Why is processing a sorted array faster than an unsorted array?", "link": "http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array", "tags": ["java", "c++", "performance", "optimization", "branch-prediction"]}

6 The installation steps assume that you have the following installed:
- Python 2.7
- the pip and setuptools Python packages
- lxml
- OpenSSL
You can install Scrapy using pip (which is the canonical way to install Python packages).
http://doc.scrapy.org/en/latest/intro/install.html

7 Install pip: https://pip.pypa.io/en/latest/installing.html. You might encounter a problem.

8 You can work around the problem by using another package manager, such as Homebrew or MacPorts, in place of pip. Or, all you need to do is ... Then, you can see ... Next ... But ...

9 Type the command ... and input your password. You will see ... After Scrapy is installed, you need to type the following ... This will allow you to use all the goodies from Scrapy 1.0.

10 Scrapy is controlled through the "scrapy" command-line tool. The Scrapy tool provides several commands for multiple purposes, and each one accepts a different set of arguments and options.

11

12 Find the folder where you want to store your project, and type:
>>> scrapy startproject projectname
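For reference, a Scrapy 1.0 startproject run generates a skeleton roughly like this (projectname is the placeholder from the command above):

```
projectname/
    scrapy.cfg            # deploy configuration file
    projectname/          # the project's Python module
        __init__.py
        items.py          # item definitions
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # put your spiders here
            __init__.py
```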

13 >>> scrapy genspider spidername weblink

14 To define a common output data format, Scrapy provides the Item class. Item objects are simple containers used to collect the scraped data: they extract structured data from unstructured sources.

15 >>> scrapy genspider spidername weblink
This generates the simplest kind of spider, the one from which every other spider must inherit (including spiders that come bundled with Scrapy, as well as spiders that you write yourself). It defines attributes such as name, allowed_domains, and start_urls.

16 Save the following code in a file named dmoz_spider.py under the spiders directory (created by "genspider").
parse() is in charge of processing the response and returning scraped data and/or more URLs to follow. This method, as well as any other Request callback, must return an iterable of Request and/or dicts or Item objects.
A Response object is received for each start URL. The slide annotates the name of the spider, the webpages to fetch, and the code that writes data to files.

17 Go to the project folder and type the command.

18 Check the two output files: Resources.html and Books.html.

19 To extract data from the HTML source, there are several libraries available:
- BeautifulSoup is a very popular web-scraping library among Python programmers. It constructs a Python object based on the structure of the HTML code and also deals with bad markup reasonably well, but it has one drawback: it is slow.
- lxml is an XML parsing library (which also parses HTML) with a pythonic API based on ElementTree.
- Scrapy comes with its own mechanism for extracting data. These are called selectors because they "select" certain parts of the HTML document, specified either by XPath or CSS expressions.
- XPath is a language for selecting nodes in XML documents, which can also be used with HTML.
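Since the deck's extraction code is shown only as screenshots, here is a minimal XPath example using lxml, one of the libraries listed above (the HTML snippet is made up):

```python
from lxml import html

snippet = """
<ul>
  <li><a href="http://example.com/book1">Example Book One</a></li>
  <li><a href="http://example.com/book2">Example Book Two</a></li>
</ul>
"""
doc = html.fromstring(snippet)
titles = doc.xpath("//li/a/text()")   # text nodes under each <li><a>
links = doc.xpath("//li/a/@href")     # href attribute of each <li><a>
print(titles)  # ['Example Book One', 'Example Book Two']
print(links)   # ['http://example.com/book1', 'http://example.com/book2']
```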

20 Import the package; extract the content of the webpage; write the data to files.

21 Check the two output files: Resources.txt and Books.txt.

22 Selectors on HTML source code: extract the text of all elements from an HTML response body, returning a list of unicode strings.

23 Selectors with regular expressions: a regular expression is a sequence of characters that defines a search pattern, mainly used for pattern matching with strings.
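A plain-Python illustration of the idea: Scrapy's .re() applies a regex to the selected nodes, and the stdlib re.findall gives the same flavor of result on a raw string (the snippet is invented):

```python
import re

body = "<p>Name: John Doe</p><p>Name: Jane Roe</p><p>no match here</p>"
# findall returns the capture group, so the pattern keeps only the name part.
names = re.findall(r"Name:\s*(\w+ \w+)", body)
print(names)  # ['John Doe', 'Jane Roe']
```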

24

25 Selector reference:
- xpath(query): find nodes matching the XPath query and return the result as a SelectorList instance with all elements flattened. List elements implement the Selector interface too.
- css(query): apply the given CSS selector and return a SelectorList instance.
- extract(): serialize and return the matched nodes as a list of unicode strings.
- re(): apply the given regex and return a list of unicode strings with the matches.
- register_namespace(prefix, uri): register the given namespace to be used in this Selector. Without registering namespaces you can't select or extract data from non-standard namespaces.
- remove_namespaces(): remove all namespaces, allowing traversal of the document using namespace-less XPaths.
- __nonzero__(): returns True if there is any real content selected, False otherwise.

26 Test whether the extracted data is as we expect:
>>> scrapy shell url
E.g.: scrapy shell http://www.dmoz.org/Computers/Programming/Languages/Python/Books/

27

28 Import the packages; select the source code of the webpage; search for structured data and extract it.
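The script on this slide is a screenshot; assuming it uses Beautiful Soup (the other package named in the deck's title), its three annotated steps look roughly like this, with the page inlined instead of downloaded:

```python
from bs4 import BeautifulSoup  # import the packages

# select the source code of the webpage (a made-up snippet stands in here)
page = """
<html><body>
  <div class="book"><a href="http://example.com/b1">Example Book</a></div>
  <div class="book"><a href="http://example.com/b2">Another Book</a></div>
</body></html>
"""

# search for structured data and extract it
soup = BeautifulSoup(page, "html.parser")
for div in soup.find_all("div", class_="book"):
    link = div.find("a")
    print(link.get_text(), link["href"])
```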

29

30 http://doc.scrapy.org/en/latest/index.html
http://www.w3.org/TR/xpath/

31

