
1 Web Scraping Lecture 7 - Topics: More Beautiful Soup, Crawling
Crawling by hand. Readings: Chapter 3. January 26, 2017

2 Overview
Last time (Lecture 6, slides 1-29): BeautifulSoup revisited; crawling
Today (Chapter 3; Lecture 6, slides 29-40): 3-crawlSite.py, 4-getExternalLinks.py, 5-getAllExternalLinks.py; warnings; Chapter 4: APIs, JSON, JavaScript
References: Scrapy site

3 Pythex.org -- Revisited

4 Getting the code from the text again!

5 Warnings
Chapter 2, pp 26: Regular expressions are not always "regular"
pp 35: Handle your exceptions
pp 38: Recursion limit - depth limit (ridiculous)
pp 40: Multiple statements in a try block can make it unclear which one raised the exception
pp 41: "Unknown Waters Ahead" - be prepared to run into sites that are not respectable
pp 43: "Don't put example programs into production"

6 Regular Expressions are not always regular
ls a*c     // Unix command-line glob
dir T*c    // Windows cmd wildcard
Regex flavors differ too: POSIX standard (sh, bash, csh) vs. Perl
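A quick illustration of why glob patterns and regexes differ, using Python's fnmatch and re modules (a minimal sketch):

import fnmatch
import re

# As a shell glob, a*c means "a, then anything, then c"
print(fnmatch.fnmatch("abc", "a*c"))       # True

# As a regular expression, a*c means "zero or more a's, then c"
print(bool(re.fullmatch("a*c", "abc")))    # False - 'b' is not allowed
print(bool(re.fullmatch("a*c", "aaac")))   # True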

7 Recursion limit = 1000

# Ackermann function - recursion depth explodes with its arguments
def ackermann(m, n):
    if m == 0:
        return n + 1
    elif m > 0 and n == 0:
        return ackermann(m - 1, 1)
    elif m > 0 and n > 0:
        return ackermann(m - 1, ackermann(m, n - 1))
    else:
        print("Should not reach here, unless bad arguments are passed.")

print("ackermann(3,5)=", ackermann(3, 5))
print("ackermann(4,2)=", ackermann(4, 2))

# Fibonacci - recursion depth grows with the limit
def fib(current=1, previous=0, limit=100):
    new = current + previous
    print(new)
    if new < limit:
        fib(previous, new, limit)

fib(1, 0)
print("completed")
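Python caps recursion depth at 1000 frames by default, so calls like ackermann(4, 2) die with RecursionError long before they finish. A small sketch of inspecting, hitting, and raising the limit:

import sys

print(sys.getrecursionlimit())    # typically 1000

def countdown(n):
    if n > 0:
        countdown(n - 1)

try:
    countdown(10**6)              # far deeper than the default limit allows
except RecursionError as err:
    print("hit the recursion limit:", err)

# The limit can be raised, at the risk of overflowing the C stack
sys.setrecursionlimit(5000)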

8 Anaconda3 sudo apt-get install anaconda3

9 Crawling with Scrapy
"Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival."
Tutorial: see the Scrapy site
Download: pip install scrapy

10

11 Walk-through of the example spider
Starting the example:
$ scrapy startproject wikiSpider
"You can start your first spider with:
    cd wikiSpider
    scrapy genspider example example.com"

12 Walk-through of an example spider
$ scrapy startproject wikiSpider
scrapy.cfg - configuration file for our Scrapy project
wikiSpider/ - code directory for our new project

13
from scrapy.selector import Selector
from scrapy import Spider
from wikiSpider.items import Article

class ArticleSpider(Spider):
    name = "article"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Main_Page",
                  "http://en.wikipedia.org/wiki/Python_%28programming_language%29"]

14
# parse() method of ArticleSpider
def parse(self, response):
    item = Article()
    title = response.xpath('//h1/text()')[0].extract()
    print("Title is: " + title)
    item['title'] = title
    return item
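The spider imports Article from wikiSpider.items, so the project needs an item definition. A minimal items.py consistent with the parse() code above (a sketch; title is the only field the spider uses):

# wikiSpider/items.py
from scrapy import Item, Field

class Article(Item):
    # parse() stores the page title in this field
    title = Field()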

15 Running our crawler $ scrapy crawl article

16 Logging with Scrapy Add logging level to the file settings.py
LOG_LEVEL = 'ERROR'
There are five levels of logging in Scrapy, listed here from most to least severe: CRITICAL, ERROR, WARNING, INFO, DEBUG
$ scrapy crawl article -s LOG_FILE=wiki.log

17 Varying the format of the output
$ scrapy crawl article -o articles.csv -t csv $ scrapy crawl article -o articles.json -t json $ scrapy crawl article -o articles.xml -t xml

18 Chapter 4: Using APIs
API: In computer programming, an application programming interface (API) is a set of subroutine definitions, protocols, and tools for building application software. A web API is an application programming interface for either a web server or a web browser. The program sends its request over HTTP; the response comes back in XML or JSON.

19 ECMA-404 The JSON Data Interchange Standard
JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write, and easy for machines to parse and generate. It is based on a subset of the JavaScript programming language. JSON is built on two structures:
A collection of name/value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array.
An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.

20 JSON objects
Similar to Python dictionaries. Example:
{ "first": "John", "last": "Donne", "Phone number": }
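A minimal sketch of that correspondence using Python's json module (the phone value below is a made-up placeholder):

import json

text = '{"first": "John", "last": "Donne", "phone": "555-0100"}'  # placeholder number
record = json.loads(text)     # JSON object -> Python dict
print(record["last"])         # Donne
print(json.dumps(record))     # dict -> JSON text again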

21 JSON arrays and values

22 JSON strings

23 JSON numbers: note there is no hex and no octal notation

24 FreeGeoIP – Where is this IP address?
{"ip":" ","country_code":"US","country_name":"United States","region_code":"MA","region_name":"Massachusetts","city":"Boston","zip_code":"02116","time_zone":"America/New_York","latitude": ,"longitude": ,"metro_code":506} {"ip":" ","country_code":"US","country_name":"United States","region_code":"SC","region_name":"South Carolina","city":"Columbia","zip_code":"29208","time_zone":"America/New_York","latitude": ,"longitude":-81.02,"metro_code":546}

25 HTTP protocol revisited -- History
The term hypertext was coined by Ted Nelson in 1965 in the Xanadu Project, which was in turn inspired by Vannevar Bush's vision (1930s) of the microfilm-based information retrieval and management "memex" system described in his essay As We May Think (1945). Tim Berners-Lee and his team at CERN are credited with inventing the original HTTP along with HTML and the associated technology for a web server and a text-based web browser. Berners-Lee first proposed the "WorldWideWeb" project in 1989 — now known as the World Wide Web. The first version of the protocol had only one method, namely GET, which would request a page from a server.[3] The response from the server was always an HTML page.[4]

26 GET and HEAD packets
GET: The GET method requests a representation of the specified resource. Requests using GET should only retrieve data and should have no other effect.
HEAD: The HEAD method asks for a response identical to that of a GET request, but without the response body. This is useful for retrieving meta-information written in response headers, without having to transport the entire content.
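A small sketch with the Requests library (previewed on the last slide) showing that HEAD returns headers but no body:

import requests

url = "https://en.wikipedia.org/wiki/Main_Page"

head = requests.head(url, allow_redirects=True)
print(head.status_code, head.headers.get("content-type"))
print(len(head.content))    # 0 - HEAD carries no response body

get = requests.get(url)
print(len(get.content))     # the full page body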

27 POST and PUT packets
POST: The POST method requests that the server accept the entity enclosed in the request as a new subordinate of the web resource identified by the URI. The data POSTed might be, for example, an annotation for existing resources; a message for a bulletin board, newsgroup, mailing list, or comment thread; a block of data that is the result of submitting a web form to a data-handling process; or an item to add to a database.[14]
PUT: The PUT method requests that the enclosed entity be stored under the supplied URI. If the URI refers to an already existing resource, it is modified; if the URI does not point to an existing resource, then the server can create the resource with that URI.[15]
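A minimal sketch of both methods with Requests; httpbin.org is a public echo service used here purely for illustration:

import requests

# POST: submit form-style data as a new subordinate resource
r = requests.post("https://httpbin.org/post", data={"comment": "hello"})
print(r.status_code)        # 200

# PUT: store the enclosed entity at the supplied URI
r = requests.put("https://httpbin.org/put", data={"title": "draft"})
print(r.status_code)        # 200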

28 Other HTTP commands
DELETE: The DELETE method deletes the specified resource.
TRACE: The TRACE method echoes the received request so that a client can see what (if any) changes or additions have been made by intermediate servers.
OPTIONS: The OPTIONS method returns the HTTP methods that the server supports for the specified URL. This can be used to check the functionality of a web server by requesting '*' instead of a specific resource.
CONNECT: The CONNECT method converts the request connection to a transparent TCP/IP tunnel, usually to facilitate SSL-encrypted communication (HTTPS) through an unencrypted HTTP proxy.[16][17][18] See HTTP CONNECT tunneling.
PATCH: The PATCH method applies partial modifications to a resource.

29 Authentication
Identify users - for charges, rate limits, etc.
developer.echonest.com/api/v4/artist/songs?api_key=<your api key here>&name=guns%20n%27%20roses&format=json&start=0&results=100
Using urlopen:

import urllib.request
from urllib.request import urlopen

token = "<your api key>"
webRequest = urllib.request.Request("http://myapi.com", headers={"token": token})
html = urlopen(webRequest)

30 XML versus JSON
XML:
<user><firstname>Ryan</firstname><lastname>Mitchell</lastname><username>Kludgist</username></user>
which clocks in at 98 characters.
JSON:
{"user":{"firstname":"Ryan","lastname":"Mitchell","username":"Kludgist"}}
which clocks in at 73 characters.
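The character counts are easy to verify (a quick check, not from the book):

xml_version = ("<user><firstname>Ryan</firstname>"
               "<lastname>Mitchell</lastname>"
               "<username>Kludgist</username></user>")
json_version = '{"user":{"firstname":"Ryan","lastname":"Mitchell","username":"Kludgist"}}'
print(len(xml_version))     # 98
print(len(json_version))    # 73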

31 XML versus JSON; now prettified
XML:
<user>
  <firstname>Ryan</firstname>
  <lastname>Mitchell</lastname>
  <username>Kludgist</username>
</user>
JSON:
{"user": {"firstname": "Ryan", "lastname": "Mitchell", "username": "Kludgist"}}

32 Syntax of API calls
When retrieving data through a GET request, the URL path describes how you would like to drill down into the data, while the query parameters serve as filters or additional requests tacked onto the search. Example below.
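A sketch of building such a URL programmatically; the host, path, and parameter names below are made up for illustration:

from urllib.parse import urlencode

# the path drills down into the data; the query parameters filter it
base = "http://api.example.com/v1/artists/monty-python/songs"
params = {"format": "json", "start": 0, "results": 100}
url = base + "?" + urlencode(params)
print(url)
# http://api.example.com/v1/artists/monty-python/songs?format=json&start=0&results=100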

33 Echo Nest
Identifies music by web scraping, rather than by human tagging as Pandora does.
developer.echonest.com/api/v4/artist/search?api_key=<your api key>&name=monty%20python
This produces the following result:
{"response": {"status": {"version": "4.2", "code": 0, "message": "Success"}, "artists": [{"id": "AR5HF791187B9ABAF4", "name": "Monty Python"}, {"id": "ARWCIDE13925F19A33", "name": "Monty Python's SPAMALOT"}, {"id": "ARVPRCC12FE", "name": "Monty Python's Graham Chapman"}]}}
Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web. O'Reilly Media. Kindle Edition.
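Given a response like that, a few lines of the json module drill down to the artist names (a sketch; the response text is abridged from the slide):

import json

response_text = '''{"response": {"status": {"version": "4.2", "code": 0,
  "message": "Success"}, "artists": [
  {"id": "AR5HF791187B9ABAF4", "name": "Monty Python"},
  {"id": "ARWCIDE13925F19A33", "name": "Monty Python's SPAMALOT"}]}}'''

data = json.loads(response_text)
for artist in data["response"]["artists"]:
    print(artist["id"], artist["name"])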

34 Twitter
Twitter is notoriously protective of its API, and rightfully so. With over 230 million active users and revenue of over $100 million a month, the company is hesitant to let just anyone come along and have any data they want. Twitter's rate limits (the number of calls it allows each user to make) fall into two categories: 15 calls per 15-minute period and 180 calls per 15-minute period, depending on the type of call. For instance, you can make up to 12 calls a minute to retrieve basic information about Twitter users, but only one call a minute to retrieve lists of those users' Twitter followers.
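One simple way to stay under a one-call-per-minute limit is to sleep between requests. A minimal sketch (the endpoint is hypothetical):

import time
import requests

# hypothetical follower-list endpoint, limited to 15 calls per 15 minutes
pages = ["https://api.example.com/followers?page=%d" % i for i in range(3)]
for url in pages:
    r = requests.get(url)
    print(r.status_code)
    time.sleep(60)    # 15 calls / 15 minutes = 1 call per minute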

35

36 Google Developers APIs

37

38

39 Yield in Python

def _get_child_candidates(self, distance, min_dist, max_dist):
    # yield each child subtree that could still contain a match
    if self._leftchild and distance - max_dist < self._median:
        yield self._leftchild
    if self._rightchild and distance + max_dist >= self._median:
        yield self._rightchild

# caller: walk the tree, consuming the generator as it goes
result, candidates = list(), [self]
while candidates:
    node = candidates.pop()
    distance = node._get_dist(obj)
    if distance <= max_dist and distance >= min_dist:
        result.extend(node._values)
    candidates.extend(node._get_child_candidates(distance, min_dist, max_dist))
return result

40 Iterators and Generators
When you create a list, you can read its items one by one. Reading its items one by one is called iteration:
>>> mylist = [1, 2, 3]
>>> for i in mylist:
...     print(i)
1
2
3
Generators are iterators, but you can only iterate over them once. This is because they do not store all the values in memory; they generate the values on the fly:
>>> mygenerator = (x*x for x in range(3))
>>> for i in mygenerator:
...     print(i)
0
1
4
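The "only once" part is easy to demonstrate: a second pass over an exhausted generator yields nothing.

mygenerator = (x * x for x in range(3))
print(list(mygenerator))    # [0, 1, 4]
print(list(mygenerator))    # [] - the generator is already exhausted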

41 Yield
yield is a keyword used like return, except the function returns a generator.
>>> def createGenerator():
...     mylist = range(3)
...     for i in mylist:
...         yield i*i
...
>>> mygenerator = createGenerator()  # create a generator
>>> print(mygenerator)  # mygenerator is an object!
<generator object createGenerator at 0xb7555c34>
>>> for i in mygenerator:
...     print(i)
0
1
4

42 Next Time Requests Library and DB
Requests: HTTP for Humans
>>> import requests
>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text
u'{"type":"User"...'
>>> r.json()
{u'private_gists': 419, u'total_private_repos': 77, ...}

