Web Scraping Lecture 8 – Storing Data


Web Scraping Lecture 8 – Storing Data. Topics: storing data, downloading, CSV, MySQL. Readings: Chapters 5 and 4. February 2, 2017

Overview
Last Time: Lecture 6 slides 30–end; Lecture 7 slides 1–31. Crawling from Chapter 3: Lecture 6 slides 29–40. Getting the code again: https://github.com/REMitchell/python-scraping (3-crawlSite.py, 4-getExternalLinks.py, 5-getAllExternalLinks.py). Chapter 4: APIs, JSON.
Today: Iterators, generators and yield; JavaScript.
References: Scrapy site/user manual

Regular Expressions – Lookahead patterns
(?=...) Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
(?!...) Matches if ... doesn’t match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it’s not followed by 'Asimov'.
(?#...) A comment; the contents of the parentheses are simply ignored.
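A quick illustration of how these assertions behave in Python's re module (the sample string is just for demonstration):

import re

text = "Isaac Asimov and Isaac Newton"

# Positive lookahead: matches the first 'Isaac ' (the one followed by 'Asimov');
# 'Asimov' itself is not consumed by the match
print(re.findall(r"Isaac (?=Asimov)", text))   # ['Isaac ']

# Negative lookahead: matches the second 'Isaac ' (the one followed by 'Newton')
print(re.findall(r"Isaac (?!Asimov)", text))   # ['Isaac ']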

Chapter 4: Using APIs
API – In computer programming, an application programming interface (API) is a set of subroutine definitions, protocols, and tools for building application software. https://en.wikipedia.org/wiki/Application_programming_interface
A web API is an application programming interface (API) for either a web server or a web browser. The program makes its request over HTTP; the response comes back as XML or JSON.

Authentication
Identify users – for charges etc.
http://developer.echonest.com/api/v4/artist/songs?api_key=<your api key here>&name=guns%20n%27%20roses&format=json&start=0&results=100
Using urlopen:
token = "<your api key>"
webRequest = urllib.request.Request("http://myapi.com", headers={"token": token})
html = urlopen(webRequest)

Google Developers APIs

Mining the Social Web; so Twitter Later

Yield in Python
def _get_child_candidates(self, distance, min_dist, max_dist):
    if self._leftchild and distance - max_dist < self._median:
        yield self._leftchild
    if self._rightchild and distance + max_dist >= self._median:
        yield self._rightchild

result, candidates = list(), [self]
while candidates:
    node = candidates.pop()
    distance = node._get_dist(obj)
    if distance <= max_dist and distance >= min_dist:
        result.extend(node._values)
    candidates.extend(node._get_child_candidates(distance, min_dist, max_dist))
return result

https://pythontips.com/2013/09/29/the-python-yield-keyword-explained/
http://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do

Iterators and Generators
When you create a list, you can read its items one by one. Reading its items one by one is called iteration:
>>> mylist = [1, 2, 3]
>>> for i in mylist:
...     print(i)
1
2
3
Generators are iterators, but you can only iterate over them once. That is because they do not store all the values in memory; they generate the values on the fly:
>>> mygenerator = (x*x for x in range(3))
>>> for i in mygenerator:
...     print(i)
0
1
4
https://pythontips.com/2013/09/29/the-python-yield-keyword-explained/

Yield
yield is a keyword that is used like return, except the function will return a generator.
>>> def createGenerator():
...     mylist = range(3)
...     for i in mylist:
...         yield i*i
...
>>> mygenerator = createGenerator()  # create a generator
>>> print(mygenerator)  # mygenerator is an object!
<generator object createGenerator at 0xb7555c34>
>>> for i in mygenerator:
...     print(i)
0
1
4
https://pythontips.com/2013/09/29/the-python-yield-keyword-explained/

#Chapter 4: 4-DecodeJson.py
#Web Scraping with Python by Ryan Mitchell
import json
from urllib.request import urlopen

def getCountry(ipAddress):
    response = urlopen("http://freegeoip.net/json/"+ipAddress).read().decode('utf-8')
    responseJson = json.loads(response)
    return responseJson.get("country_code")

print(getCountry("50.78.253.58"))

#Chapter 4: 5-jsonParsing.py
import json

jsonString = '{"arrayOfNums":[{"number":0},{"number":1},{"number":2}], "arrayOfFruits":[{"fruit":"apple"},{"fruit":"banana"},{"fruit":"pear"}]}'
jsonObj = json.loads(jsonString)

print(jsonObj.get("arrayOfNums"))
print(jsonObj.get("arrayOfNums")[1])
print(jsonObj.get("arrayOfNums")[1].get("number") + jsonObj.get("arrayOfNums")[2].get("number"))
print(jsonObj.get("arrayOfFruits")[2].get("fruit"))

Wiki Editing Histories – from where?

The Map --

#Chapter 4: 6-wikiHistories.py
from urllib.request import urlopen
from urllib.request import HTTPError
from bs4 import BeautifulSoup
import datetime
import json
import random
import re

random.seed(datetime.datetime.now())

def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org"+articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    return bsObj.find("div", {"id":"bodyContent"}).findAll("a",
        href=re.compile("^(/wiki/)((?!:).)*$"))

def getHistoryIPs(pageUrl):
    #Format of revision history pages is:
    #http://en.wikipedia.org/w/index.php?title=Title_in_URL&action=history
    pageUrl = pageUrl.replace("/wiki/", "")
    historyUrl = "http://en.wikipedia.org/w/index.php?title="+pageUrl+"&action=history"
    print("history url is: "+historyUrl)
    html = urlopen(historyUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    #finds only the links with class "mw-anonuserlink" which has IP addresses
    #instead of usernames
    ipAddresses = bsObj.findAll("a", {"class":"mw-anonuserlink"})
    addressList = set()
    for ipAddress in ipAddresses:
        addressList.add(ipAddress.get_text())
    return addressList

def getCountry(ipAddress):
    try:
        response = urlopen("http://freegeoip.net/json/"+ipAddress).read().decode('utf-8')
    except HTTPError:
        return None
    responseJson = json.loads(response)
    return responseJson.get("country_code")

links = getLinks("/wiki/Python_(programming_language)")
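The slide stops at the call to getLinks; a minimal sketch of the driving loop that ties getLinks, getHistoryIPs, and getCountry together might look like the following (a reconstruction for illustration, not necessarily the book's exact code):

while len(links) > 0:
    for link in links:
        print("-------------------")
        historyIPs = getHistoryIPs(link.attrs["href"])
        for historyIP in historyIPs:
            country = getCountry(historyIP)
            if country is not None:
                print(historyIP + " is from " + country)
    # pick a random internal article link and keep crawling from there
    newLink = links[random.randint(0, len(links)-1)].attrs["href"]
    links = getLinks(newLink)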

Output

Chapter 5 – Storing Data
Files: CSV files; JSON, XML
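The slides that follow show CSV and MySQL; JSON and XML are only named here, so as a rough sketch, here is one way scraped records could be dumped to a JSON file (the file name and record fields are invented for illustration):

import json

# Hypothetical scraped records; the field names are illustrative only
records = [
    {"title": "Page one", "url": "http://example.com/1"},
    {"title": "Page two", "url": "http://example.com/2"},
]

# Write the records out as JSON
with open("scraped.json", "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2, ensure_ascii=False)

# Read them back to confirm the round trip
with open("scraped.json", encoding="utf-8") as f:
    print(json.load(f))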

Downloading images: to copy or not
As you are scraping, do you download images or just store links? Advantages to not copying:
Scrapers run much faster, and require much less bandwidth, when they don’t have to download files.
You save space on your own machine by storing only the URLs.
It is easier to write code that only stores URLs and doesn’t need to deal with additional file downloads.
You can lessen the load on the host server by avoiding large file downloads.

Disadvantages to not copying?
Embedding these URLs in your own website or application is known as hotlinking, and doing it is a very quick way to get you in hot water on the Internet. You do not want to use someone else’s server cycles to host media for your own applications.
The file hosted at any particular URL is subject to change. This might lead to embarrassing effects if, say, you’re embedding a hotlinked image on a public blog. If you’re storing the URLs with the intent to store the file later, for further research, it might eventually go missing or be changed to something completely irrelevant at a later date.
Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 1877-1882). O'Reilly Media. Kindle Edition.

Downloading the “logo” (image)
from urllib.request import urlretrieve
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com")
bsObj = BeautifulSoup(html)
imageLocation = bsObj.find("a", {"id": "logo"}).find("img")["src"]
urlretrieve(imageLocation, "logo.jpg")
Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 1889-1896). O'Reilly Media. Kindle Edition.

#Chapter 5: 1-getPageMedia.py
#Web Scraping with Python by Ryan Mitchell
import os
from urllib.request import urlretrieve
from urllib.request import urlopen
from bs4 import BeautifulSoup

downloadDirectory = "downloaded"
baseUrl = "http://pythonscraping.com"

def getAbsoluteURL(baseUrl, source):
    if source.startswith("http://www."):
        url = "http://"+source[11:]
    elif source.startswith("http://"):
        url = source
    elif source.startswith("www."):
        source = source[4:]
        url = "http://"+source
    else:
        url = baseUrl+"/"+source
    if baseUrl not in url:
        return None
    return url

def getDownloadPath(baseUrl, absoluteUrl, downloadDirectory):
    path = absoluteUrl.replace("www.", "")
    path = path.replace(baseUrl, "")
    path = downloadDirectory+path
    directory = os.path.dirname(path)
    if not os.path.exists(directory):
        os.makedirs(directory)
    return path

html = urlopen("http://www. pythonscraping html = urlopen("http://www.pythonscraping.com") bsObj = BeautifulSoup(html, "html.parser") downloadList = bsObj.findAll(src=True) for download in downloadList: fileUrl = getAbsoluteURL(baseUrl, download["src"]) if fileUrl is not None: print(fileUrl) urlretrieve(fileUrl, getDownloadPath(baseUrl, fileUrl, downloadDirectory))

Run with Caution
Warnings on downloading unknown files from the Internet: this script downloads EVERYTHING!! Bash scripts, .exe files, malware.
Never scrape as root. Imagine an "image" downloading to ../../../../usr/bin/python, and the next time someone runs Python!?!?
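One way to guard against the path-traversal risk mentioned above is to check that the resolved download path stays inside the download directory. A minimal sketch (the function name and approach are my own, not from the book):

import os

def isSafeDownloadPath(downloadDirectory, path):
    # Resolve both paths and require the target to live inside the download directory
    base = os.path.realpath(downloadDirectory)
    target = os.path.realpath(path)
    return target == base or target.startswith(base + os.sep)

print(isSafeDownloadPath("downloaded", "downloaded/pages/logo.jpg"))        # True
print(isSafeDownloadPath("downloaded", "downloaded/../../usr/bin/python"))  # False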

Storing to CSV files
#Chapter 5: 2-createCsv.py
import csv
#from os import open

csvFile = open("../files/test.csv", 'w+', newline='')
try:
    writer = csv.writer(csvFile)
    writer.writerow(('number', 'number plus 2', 'number times 2'))
    for i in range(10):
        writer.writerow((i, i+2, i*2))
finally:
    csvFile.close()

Retrieving HTML tables
Doing it once: use Excel and save as CSV.
Doing it 50 times: write a Python script.

#Chapter 5: 3-scrapeCsv.py
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://en.wikipedia.org/wiki/Comparison_of_text_editors")
bsObj = BeautifulSoup(html, "html.parser")

#The main comparison table is currently the first table on the page
table = bsObj.findAll("table", {"class":"wikitable"})[0]
rows = table.findAll("tr")

csvFile = open("files/editors csvFile = open("files/editors.csv", 'wt', newline='', encoding='utf-8') writer = csv.writer(csvFile) try: for row in rows: csvRow = [] for cell in row.findAll(['td', 'th']): csvRow.append(cell.get_text()) writer.writerow(csvRow) finally: csvFile.close()

Storing in Databases
MySQL, Microsoft SQL Server, Oracle's DBMS
Why use MySQL? Free; used by the big boys: YouTube, Twitter, Facebook.
So: ubiquity, price, "out of the box usability".

Relational Databases

SQL – Structured Query Language? SELECT * FROM users WHERE firstname = "Ryan"

Installing
$ sudo apt-get install mysql-server

Some Basic MySQL commands
CREATE DATABASE scraping;
USE scraping;
CREATE TABLE pages;   -- error: a table needs at least one column defined
CREATE TABLE pages (id BIGINT(7) NOT NULL AUTO_INCREMENT, title VARCHAR(200), content VARCHAR(10000), created TIMESTAMP DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY (id));
DESCRIBE pages;

> INSERT INTO pages (title, content) VALUES ("Test page title", "This is some test page content. It can be up to 10,000 characters long.");
Of course, we can override these defaults:
INSERT INTO pages (id, title, content, created) VALUES (3, "Test page title", "This is some test page content. It can be up to 10,000 characters long.", "2014-09-21 10:25:32");
Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 2097-2101). O'Reilly Media. Kindle Edition.

SELECT * FROM pages WHERE id = 2;
SELECT * FROM pages WHERE title LIKE "%test%";
SELECT id, title FROM pages WHERE content LIKE "%page content%";
Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 2106-2115). O'Reilly Media. Kindle Edition.

Read once; decode; treat as file line by line
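This slide title refers to the pattern of fetching a remote CSV once, decoding the bytes, and then treating the decoded string as a file object that can be read line by line. A minimal sketch, with an illustrative URL:

from urllib.request import urlopen
from io import StringIO
import csv

# Fetch the CSV once and decode the bytes into a string
data = urlopen("http://pythonscraping.com/files/MontyPythonAlbums.csv").read().decode("ascii", "ignore")
# Wrap the string so csv.reader can treat it as a file, line by line
dataFile = StringIO(data)
csvReader = csv.reader(dataFile)

for row in csvReader:
    print(row)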

#Chapter 5: 3-mysqlBasicExample.py
import pymysql

conn = pymysql.connect(host='127.0.0.1', unix_socket='/tmp/mysql.sock',
                       user='root', passwd=None, db='mysql')
cur = conn.cursor()
cur.execute("USE scraping")
cur.execute("SELECT * FROM pages WHERE id=1")
print(cur.fetchone())
cur.close()
conn.close()

# 5-storeWikiLinks.py
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import datetime
import random
import pymysql

conn = pymysql.connect(host='127.0.0.1', unix_socket='/tmp/mysql.sock',
                       user='root', passwd=None, db='mysql', charset='utf8')
cur = conn.cursor()
cur.execute("USE scraping")

def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org"+articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    title = bsObj.find("h1").get_text()
    content = bsObj.find("div", {"id":"mw-content-text"}).find("p").get_text()
    store(title, content)
    return bsObj.find("div", {"id":"bodyContent"}).findAll("a",
        href=re.compile("^(/wiki/)((?!:).)*$"))
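getLinks above calls a store(title, content) helper that does not appear in the transcript. A minimal sketch of what it might look like, assuming the pages table created earlier (a reconstruction, not necessarily the book's exact code):

def store(title, content):
    # Insert the scraped title and first paragraph into the pages table
    cur.execute("INSERT INTO pages (title, content) VALUES (%s, %s)", (title, content))
    cur.connection.commit()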

links = getLinks("/wiki/Kevin_Bacon") try: while len(links) > 0: newArticle = links[random.randint(0, len(links)-1)].attrs["href"] print(newArticle) links = getLinks(newArticle) finally: cur.close() conn.close()

Next Time Requests Library and DB Requests: HTTP for Humans >>> r = requests.get('https://api.github.com/user', auth=('user', 'pass')) >>> r.status_code 200 >>> r.headers['content-type'] 'application/json; charset=utf8' >>> r.encoding 'utf-8' >>> r.text u'{"type":"User"...' >>> r.json() {u'private_gists': 419, u'total_private_repos': 77, ...}

Python-Tips https://pythontips.com/2013/09/01/best-python-resources/ …