Web Scraping
Lecture 8 – Storing Data
Topics: Storing data, Downloading, CSV, MySQL
Readings: Chapters 4 and 5
February 2, 2017
Overview
Last Time: Lecture 6 slides 30-end; Lecture 7 slides 1-31
Crawling from Chapter 3: Lecture 6 slides 29-40
Getting code again: https://github.com/REMitchell/python-scraping
3-crawlSite.py
4-getExternalLinks.py
5-getAllExternalLinks.py
Chapter 4
APIs
JSON
Today: Iterators, generators and yield
Javascript
References - Scrapy site/user manual
Reg Expressions – Lookahead patterns
(?=...) Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
(?!...) Matches if ... doesn’t match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it’s not followed by 'Asimov'.
(?#...) A comment; the contents of the parentheses are simply ignored.
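The two lookahead forms can be tried directly with Python's re module. This is a small sketch using the Isaac/Asimov example from the slide, plus the article-link pattern that appears in the Wikipedia crawler later in the lecture; the variable names are made up for illustration.

```python
import re

# Positive lookahead: match 'Isaac ' only when 'Asimov' comes next.
positive = re.compile(r"Isaac (?=Asimov)")
# Negative lookahead: match 'Isaac ' only when 'Asimov' does NOT come next.
negative = re.compile(r"Isaac (?!Asimov)")

text_a = "Isaac Asimov"
text_b = "Isaac Newton"

# The lookahead text is checked but not consumed, so the match is just 'Isaac '.
print(repr(positive.match(text_a).group()))
print(positive.match(text_b))
print(negative.match(text_a))
print(repr(negative.match(text_b).group()))

# Pattern from the crawler slides: '/wiki/' followed by characters,
# each of which is checked by (?!:) to make sure it is not a colon.
article_link = re.compile(r"^(/wiki/)((?!:).)*$")
print(bool(article_link.match("/wiki/Kevin_Bacon")))
print(bool(article_link.match("/wiki/Category:Actors")))
```

The colon check is what keeps namespace pages such as Category: or Talk: out of the list of article links.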
Chapter 4: Using APIs
API - In computer programming, an application programming interface (API) is a set of subroutine definitions, protocols, and tools for building application software.
https://en.wikipedia.org/wiki/Application_programming_interface
A web API is an application programming interface (API) for either a web server or a web browser.
The program sends its request over HTTP
The response comes back as XML or JSON
Authentication
Identify users – for charges etc.
http://developer.echonest.com/api/v4/artist/songs?api_key=<your api key here>%20&name=guns%20n%27%20roses&format=json&start=0&results=100
Using urlopen
    import urllib.request
    from urllib.request import urlopen

    token = "<your api key>"
    webRequest = urllib.request.Request("http://myapi.com", headers={"token": token})
    html = urlopen(webRequest)
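Both ways of passing a key (in the query string, or in a request header) can be assembled without hand-escaping. The sketch below only builds the URL and the Request object and sends nothing over the network; the key is a placeholder, and the header name "token" is simply the one used on the slide, since a real API documents its own header name.

```python
from urllib.parse import urlencode
from urllib.request import Request

API_KEY = "<your api key>"  # placeholder, not a real key

# Option 1: key in the query string. urlencode escapes spaces and quotes.
params = {
    "api_key": API_KEY,
    "name": "guns n' roses",
    "format": "json",
    "start": 0,
    "results": 100,
}
url = "http://developer.echonest.com/api/v4/artist/songs?" + urlencode(params)
print(url)

# Option 2: key in a request header ("token" is the slide's example name).
web_request = Request("http://myapi.com", headers={"token": API_KEY})
# urllib stores header names capitalized, so the lookup key is "Token".
print(web_request.get_header("Token"))
```

Passing web_request to urlopen would then send the header along with the request.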
Google Developers APIs
Mining the Social Web; so Twitter Later
Yield in Python
    def _get_child_candidates(self, distance, min_dist, max_dist):
        if self._leftchild and distance - max_dist < self._median:
            yield self._leftchild
        if self._rightchild and distance + max_dist >= self._median:
            yield self._rightchild

    # The caller:
    result, candidates = list(), [self]
    while candidates:
        node = candidates.pop()
        distance = node._get_dist(obj)
        if distance <= max_dist and distance >= min_dist:
            result.extend(node._values)
        candidates.extend(node._get_child_candidates(distance, min_dist, max_dist))
    return result
https://pythontips.com/2013/09/29/the-python-yield-keyword-explained/
http://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do
Iterators and Generators
When you create a list, you can read its items one by one. Reading its items one by one is called iteration:
    >>> mylist = [1, 2, 3]
    >>> for i in mylist:
    ...     print(i)
Generators are iterators, but you can only iterate over them once. That is because they do not store all the values in memory; they generate the values on the fly:
    >>> mygenerator = (x*x for x in range(3))
    >>> for i in mygenerator:
    ...     print(i)
https://pythontips.com/2013/09/29/the-python-yield-keyword-explained/
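The "only once" point is easy to check. A short sketch (names are illustrative) that walks a generator twice and a list twice:

```python
# A generator expression produces values lazily and keeps no copy of them.
squares_gen = (x * x for x in range(3))
first_pass = list(squares_gen)    # consumes the generator
second_pass = list(squares_gen)   # nothing left to produce

# A list stores its values, so it can be iterated any number of times.
squares_list = [x * x for x in range(3)]
list_first = list(squares_list)
list_second = list(squares_list)

print(first_pass, second_pass)
print(list_first, list_second)
```

The second pass over the generator comes back empty, while the list yields the same three values both times.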
Yield
Yield is a keyword that is used like return, except the function will return a generator.
    >>> def createGenerator():
    ...     mylist = range(3)
    ...     for i in mylist:
    ...         yield i*i
    ...
    >>> mygenerator = createGenerator() # create a generator
    >>> print(mygenerator) # mygenerator is an object!
    <generator object createGenerator at 0xb7555c34>
    >>> for i in mygenerator:
    ...     print(i)
https://pythontips.com/2013/09/29/the-python-yield-keyword-explained/
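Because the function body only runs when a value is requested, a generator can describe a sequence that never ends. A minimal sketch, where the function name squares is made up for illustration:

```python
from itertools import islice


def squares():
    """Yield 0, 1, 4, 9, ... without ever building the whole sequence."""
    n = 0
    while True:
        yield n * n   # execution pauses here until the next value is requested
        n += 1


gen = squares()                     # nothing in the body has run yet
first = next(gen)                   # runs the body up to the first yield
next_three = list(islice(gen, 3))   # resumes where the generator left off

print(first, next_three)
```

This is the property that makes generators useful for crawlers: links can be handed to the caller one at a time instead of being collected into one large list first.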
#Chapter 4: 4-DecodeJson.py
#Web Scraping with Python by Ryan Mitchell
    import json
    from urllib.request import urlopen

    def getCountry(ipAddress):
        response = urlopen("http://freegeoip.net/json/"+ipAddress).read().decode('utf-8')
        responseJson = json.loads(response)
        return responseJson.get("country_code")

    print(getCountry("50.78.253.58"))
#Chapter 4: 5-jsonParsing.py
    import json

    jsonString = '{"arrayOfNums":[{"number":0},{"number":1},{"number":2}], "arrayOfFruits":[{"fruit":"apple"},{"fruit":"banana"},{"fruit":"pear"}]}'
    jsonObj = json.loads(jsonString)

    print(jsonObj.get("arrayOfNums"))
    print(jsonObj.get("arrayOfNums")[1])
    print(jsonObj.get("arrayOfNums")[1].get("number")+jsonObj.get("arrayOfNums")[2].get("number"))
    print(jsonObj.get("arrayOfFruits")[2].get("fruit"))
Wiki Editing Histories – from where?
The Map --
#Chapter 4: 6-wikiHistories.py
    from urllib.request import urlopen
    from urllib.error import HTTPError
    from bs4 import BeautifulSoup
    import datetime
    import json
    import random
    import re

    # Seed with a number: recent Python versions reject a datetime object as a seed
    random.seed(datetime.datetime.now().timestamp())

    def getLinks(articleUrl):
        html = urlopen("http://en.wikipedia.org"+articleUrl)
        bsObj = BeautifulSoup(html, "html.parser")
        # Article links only: /wiki/ paths that contain no colon
        return bsObj.find("div", {"id":"bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))
    def getHistoryIPs(pageUrl):
        #Format of revision history pages is:
        #http://en.wikipedia.org/w/index.php?title=Title_in_URL&action=history
        pageUrl = pageUrl.replace("/wiki/", "")
        historyUrl = "http://en.wikipedia.org/w/index.php?title="+pageUrl+"&action=history"
        print("history url is: "+historyUrl)
        html = urlopen(historyUrl)
        bsObj = BeautifulSoup(html, "html.parser")
        #finds only the links with class "mw-anonuserlink" which has IP addresses
        #instead of usernames
        ipAddresses = bsObj.findAll("a", {"class":"mw-anonuserlink"})
        addressList = set()
        for ipAddress in ipAddresses:
            addressList.add(ipAddress.get_text())
        return addressList
    def getCountry(ipAddress):
        try:
            response = urlopen("http://freegeoip.net/json/"+ipAddress).read().decode('utf-8')
        except HTTPError:
            return None
        responseJson = json.loads(response)
        return responseJson.get("country_code")

    links = getLinks("/wiki/Python_(programming_language)")
Output
Chapter 5 - Storing Data Files Csv files Json, xml
Downloading images: to copy or not
As you are scraping, do you download images or just store links?
Advantages to not copying?
- Scrapers run much faster, and require much less bandwidth, when they don’t have to download files.
- You save space on your own machine by storing only the URLs.
- It is easier to write code that only stores URLs and doesn’t need to deal with additional file downloads.
- You can lessen the load on the host server by avoiding large file downloads.
Disadvantages to not copying?
- Embedding these URLs in your own website or application is known as hotlinking and doing it is a very quick way to get you in hot water on the Internet.
- You do not want to use someone else’s server cycles to host media for your own applications.
- The file hosted at any particular URL is subject to change. This might lead to embarrassing effects if, say, you’re embedding a hotlinked image on a public blog.
- If you’re storing the URLs with the intent to store the file later, for further research, it might eventually go missing or be changed to something completely irrelevant at a later date.
Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 1877-1882). O'Reilly Media. Kindle Edition.
Downloading the “logo” (image)
    from urllib.request import urlretrieve
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://www.pythonscraping.com")
    # Name the parser explicitly so BeautifulSoup does not have to guess
    bsObj = BeautifulSoup(html, "html.parser")
    imageLocation = bsObj.find("a", {"id": "logo"}).find("img")["src"]
    urlretrieve(imageLocation, "logo.jpg")
Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 1889-1896). O'Reilly Media. Kindle Edition.
#Chapter 5: 1-getPageMedia.py
#Web Scraping with Python by Ryan Mitchell
    import os
    from urllib.request import urlretrieve
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    downloadDirectory = "downloaded"
    baseUrl = "http://pythonscraping.com"
    def getAbsoluteURL(baseUrl, source):
        if source.startswith("http://www."):
            url = "http://"+source[11:]
        elif source.startswith("http://"):
            url = source
        elif source.startswith("www."):
            # Drop the leading "www." and add the scheme
            url = "http://"+source[4:]
        else:
            url = baseUrl+"/"+source
        # Skip anything hosted on another site
        if baseUrl not in url:
            return None
        return url
    def getDownloadPath(baseUrl, absoluteUrl, downloadDirectory):
        path = absoluteUrl.replace("www.", "")
        path = path.replace(baseUrl, "")
        path = downloadDirectory+path
        directory = os.path.dirname(path)
        if not os.path.exists(directory):
            os.makedirs(directory)
        return path
    html = urlopen("http://www.pythonscraping.com")
    bsObj = BeautifulSoup(html, "html.parser")
    downloadList = bsObj.findAll(src=True)

    for download in downloadList:
        fileUrl = getAbsoluteURL(baseUrl, download["src"])
        if fileUrl is not None:
            print(fileUrl)
            urlretrieve(fileUrl, getDownloadPath(baseUrl, fileUrl, downloadDirectory))
Run with Caution
Warnings on downloading unknown files from the internet
This script just downloads EVERYTHING!! Bash scripts, .exe files, malware
Never scrape as root
What if an image downloads to ../../../../usr/bin/python
And then the next time someone runs Python !?!?
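Two cheap guards address the warnings above: refuse any destination path that resolves outside the download directory, and only fetch file types you expect. This is a sketch under assumptions, not part of the book's script; safe_download_path, has_allowed_extension and the ALLOWED_EXTENSIONS set are hypothetical names chosen for illustration.

```python
import os
from urllib.parse import urlparse

# Hypothetical allowlist: adjust to the media types you actually want.
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}


def has_allowed_extension(url):
    """True if the URL path ends in one of the allowed extensions."""
    extension = os.path.splitext(urlparse(url).path)[1].lower()
    return extension in ALLOWED_EXTENSIONS


def safe_download_path(download_directory, relative_path):
    """Return an absolute path inside download_directory, or None if
    relative_path would escape it (for example through '../' segments)."""
    root = os.path.realpath(download_directory)
    candidate = os.path.realpath(os.path.join(root, relative_path.lstrip("/")))
    if os.path.commonpath([root, candidate]) != root:
        return None
    return candidate


print(has_allowed_extension("http://pythonscraping.com/img/logo.jpg"))
print(has_allowed_extension("http://pythonscraping.com/install.sh"))
print(safe_download_path("downloaded", "img/logo.jpg"))
print(safe_download_path("downloaded", "../../../../usr/bin/python"))
```

A loop like the one in 1-getPageMedia.py could call both checks before urlretrieve and skip any file for which either one fails.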
Storing to CSV files
#Chapter 5: 2-createCsv.py
    import csv
    #from os import open

    csvFile = open("../files/test.csv", 'w+', newline='')
    try:
        writer = csv.writer(csvFile)
        writer.writerow(('number', 'number plus 2', 'number times 2'))
        for i in range(10):
            writer.writerow((i, i+2, i*2))
    finally:
        csvFile.close()
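The same table can be written with a with block, which closes the file even if a write fails, and then read back to confirm what landed on disk. This is a sketch that writes to a temporary directory instead of ../files/ so it runs anywhere; the variable names are illustrative.

```python
import csv
import os
import tempfile

# Write to a throwaway directory so the sketch does not depend on ../files/.
path = os.path.join(tempfile.mkdtemp(), "test.csv")

with open(path, "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(("number", "number plus 2", "number times 2"))
    writer.writerows((i, i + 2, i * 2) for i in range(10))

# Read the file back; csv.reader returns every cell as a string.
with open(path, newline="") as csv_file:
    read_back = list(csv.reader(csv_file))

print(read_back[0])
print(read_back[1])
print(len(read_back))
```

Note that the numbers come back as strings, so a consumer of the file has to convert them again.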
Retrieving HTML tables
Doing it once? Use Excel and save as CSV
Doing it 50 times? Write a Python script
#Chapter 5: 3-scrapeCsv.py
    import csv
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://en.wikipedia.org/wiki/Comparison_of_text_editors")
    bsObj = BeautifulSoup(html, "html.parser")
    #The main comparison table is currently the first table on the page
    table = bsObj.findAll("table",{"class":"wikitable"})[0]
    rows = table.findAll("tr")
    csvFile = open("files/editors.csv", 'wt', newline='', encoding='utf-8')
    writer = csv.writer(csvFile)
    try:
        for row in rows:
            csvRow = []
            for cell in row.findAll(['td', 'th']):
                csvRow.append(cell.get_text())
            writer.writerow(csvRow)
    finally:
        csvFile.close()
Storing in Databases
MySQL
Microsoft SQL Server
Oracle’s DBMS
Why use MySQL?
Free
Used by the big boys: YouTube, Twitter, Facebook
So: ubiquity, price, “out of the box usability”
Relational Databases
SQL – Structured Query Language? SELECT * FROM users WHERE firstname = "Ryan"
Installing
    $ sudo apt-get install mysql-server
Some Basic MySQL commands
    CREATE DATABASE scraping;
    USE scraping;
    CREATE TABLE pages;   -- error: a table needs at least one column
    CREATE TABLE pages (id BIGINT(7) NOT NULL AUTO_INCREMENT, title VARCHAR(200), content VARCHAR(10000), created TIMESTAMP DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY (id));
    DESCRIBE pages;
    > INSERT INTO pages (title, content) VALUES ("Test page title", "This is some test page content. It can be up to 10,000 characters long.");
Of course, we can override these defaults:
    INSERT INTO pages (id, title, content, created) VALUES (3, "Test page title", "This is some test page content. It can be up to 10,000 characters long.", "2014-09-21 10:25:32");
Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 2097-2101). O'Reilly Media. Kindle Edition.
    SELECT * FROM pages WHERE id = 2;
    SELECT * FROM pages WHERE title LIKE "%test%";
    SELECT id, title FROM pages WHERE content LIKE "%page content%";
Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 2106-2115). O'Reilly Media. Kindle Edition.
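To try the CREATE / INSERT / SELECT cycle without installing a server, the statements can be run against an in-memory SQLite database from Python's standard library. SQLite is swapped in here purely as a stand-in for MySQL: its dialect differs slightly (AUTOINCREMENT instead of AUTO_INCREMENT, ? placeholders instead of pymysql's %s), and the second row's title and content are invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway database, nothing written to disk
cur = conn.cursor()

# SQLite spelling of the pages table from the slides.
cur.execute(
    "CREATE TABLE pages ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "title VARCHAR(200), "
    "content VARCHAR(10000), "
    "created TIMESTAMP DEFAULT CURRENT_TIMESTAMP)"
)

# Let id and created take their defaults.
cur.execute(
    "INSERT INTO pages (title, content) VALUES (?, ?)",
    ("Test page title", "This is some test page content."),
)
# Override the defaults explicitly.
cur.execute(
    "INSERT INTO pages (id, title, content, created) VALUES (?, ?, ?, ?)",
    (3, "Another title", "Different text.", "2014-09-21 10:25:32"),
)
conn.commit()

# LIKE with % wildcards on both sides matches the text anywhere in the column.
cur.execute("SELECT id, title FROM pages WHERE title LIKE ?", ("%test%",))
matches = cur.fetchall()

cur.execute("SELECT created FROM pages WHERE id = ?", (3,))
created = cur.fetchone()[0]
conn.close()

print(matches)
print(created)
```

Only the first row matches, because its title contains "Test" and LIKE ignores ASCII case in SQLite.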
Read once; decode; treat as file line by line
#Chapter 5: 3-mysqlBasicExample.py
    import pymysql

    conn = pymysql.connect(host='127.0.0.1', unix_socket='/tmp/mysql.sock', user='root', passwd=None, db='mysql')
    cur = conn.cursor()
    cur.execute("USE scraping")
    cur.execute("SELECT * FROM pages WHERE id=1")
    print(cur.fetchone())
    cur.close()
    conn.close()
# 5-storeWikiLinks.py
    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import re
    import datetime
    import random
    import pymysql

    conn = pymysql.connect(host='127.0.0.1', unix_socket='/tmp/mysql.sock', user='root', passwd=None, db='mysql', charset='utf8')
    cur = conn.cursor()
    cur.execute("USE scraping")
    def getLinks(articleUrl):
        html = urlopen("http://en.wikipedia.org"+articleUrl)
        bsObj = BeautifulSoup(html, "html.parser")
        title = bsObj.find("h1").get_text()
        content = bsObj.find("div", {"id":"mw-content-text"}).find("p").get_text()
        store(title, content)
        return bsObj.find("div", {"id":"bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))
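getLinks calls store(title, content), but the definition of store does not appear on these slides. One plausible shape is sketched below, as an assumption and not the book's exact code: it takes the cursor as an explicit first argument (the slide's call passes only title and content and would rely on the global cur), and it uses %s placeholders so the driver escapes the values. RecordingCursor is a hypothetical stand-in so the sketch runs without a MySQL server.

```python
def store(cur, title, content):
    """Insert one row into pages; cur is a DB-API cursor such as pymysql's."""
    # Pass values separately so the driver escapes them;
    # never build the SQL string by concatenating scraped text.
    cur.execute(
        "INSERT INTO pages (title, content) VALUES (%s, %s)",
        (title, content),
    )
    cur.connection.commit()


class RecordingCursor:
    """Hypothetical stand-in for a database cursor: records calls only."""

    def __init__(self):
        self.calls = []
        self.commits = 0
        self.connection = self   # lets cur.connection.commit() work here too

    def execute(self, query, args=None):
        self.calls.append((query, args))

    def commit(self):
        self.commits += 1


demo_cursor = RecordingCursor()
store(demo_cursor, "Kevin Bacon", "First paragraph of the article.")
print(demo_cursor.calls)
print(demo_cursor.commits)
```

With a real connection, the same function would be called with the pymysql cursor created at the top of the script.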
    links = getLinks("/wiki/Kevin_Bacon")
    try:
        while len(links) > 0:
            newArticle = links[random.randint(0, len(links)-1)].attrs["href"]
            print(newArticle)
            links = getLinks(newArticle)
    finally:
        cur.close()
        conn.close()
Next Time
Requests Library and DB
Requests: HTTP for Humans
    >>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
    >>> r.status_code
    200
    >>> r.headers['content-type']
    'application/json; charset=utf8'
    >>> r.encoding
    'utf-8'
    >>> r.text
    u'{"type":"User"...'
    >>> r.json()
    {u'private_gists': 419, u'total_private_repos': 77, ...}
Python-Tips https://pythontips.com/2013/09/01/best-python-resources/ …