Web Scraping Lecture 11 – Document Encoding
Topics: file extensions; txt, UTF-8, PDF, docx. Readings: Chapter 6. January 26, 2017
Overview
Last Time: Lecture 10 – Selenium WebDriver; software architecture of systems
Today: Chapter 6 – document encodings; thoughts on Test 2
References: Chapter 6
File Extensions
f.jpg, f.txt, f.doc, f.pdf, f.docx, f.html
The Internet Engineering Task Force (IETF) stores all of its published documents as HTML, PDF, and text files (see rfc/rfc1149.txt).
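A file extension is often enough to guess a document's media type before deciding how to parse it. As a small sketch, Python's standard mimetypes module maps the extensions above to MIME types:

```python
import mimetypes

# Guess a document's media type from its file extension alone.
for name in ["f.jpg", "f.txt", "f.pdf", "f.docx", "f.html"]:
    mime, _ = mimetypes.guess_type(name)
    print(name, "->", mime)
```

Note this inspects only the name, not the file contents; a server can of course mislabel a file.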
Text
from urllib.request import urlopen

# the textbook's example URL (the slide's line was truncated)
textPage = urlopen("http://www.pythonscraping.com/pages/warandpeace/chapter1.txt")
print(textPage.read())
Unicode
In the early 1990s, the Unicode Consortium attempted to bring about a universal text encoding by establishing encodings for every character that needs to be used in any text document, in any language. The goal was to include everything from the Latin alphabet this book is written in, to Cyrillic (кириллица), Chinese pictograms (象形), math and logic symbols (⨊, ≥), and even emoticons and "miscellaneous" symbols, such as the biohazard sign (☣) and peace symbol (☮). The resulting encoding, UTF-8, stands for, confusingly, "Universal Character Set – Transformation Format 8 bit".
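A quick way to see what "8 bit" means in practice: ASCII characters still occupy a single byte in UTF-8, while other scripts and symbols take two to four bytes. A small sketch using the characters from the paragraph above:

```python
# ASCII stays one byte in UTF-8; Cyrillic takes two bytes,
# Chinese pictograms and symbols like the peace sign take three.
for ch in ["C", "к", "象", "☮"]:
    encoded = ch.encode("utf-8")
    print(ch, "->", len(encoded), "byte(s):", encoded.hex())
```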
Recall ASCII
0100 0011 – 'C' // a leading 0 bit means a one-byte ASCII character
Non-ASCII characters require multi-byte UTF-8 sequences, signaled by leading 1 bits in the first byte.
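The leading bits of the first byte tell a UTF-8 decoder how many bytes a character occupies: 0xxxxxxx is plain ASCII, 110xxxxx opens a two-byte sequence, 1110xxxx a three-byte one. A small sketch to inspect this:

```python
# Show the bit pattern of the first UTF-8 byte of a character.
def first_byte_bits(ch):
    return format(ch.encode("utf-8")[0], "08b")

print("C  ->", first_byte_bits("C"))   # leading 0   => one-byte ASCII
print("к  ->", first_byte_bits("к"))   # leading 110 => two-byte sequence
print("象 ->", first_byte_bits("象"))  # leading 1110 => three-byte sequence
```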
2-getUtf8Text.py
from urllib.request import urlopen
from bs4 import BeautifulSoup

# the textbook's example URL (the slide's line was truncated)
html = urlopen("http://en.wikipedia.org/wiki/Python_(programming_language)")
bsObj = BeautifulSoup(html, "html.parser")
content = bsObj.find("div", {"id": "mw-content-text"}).get_text()
content = bytes(content, "UTF-8")
content = content.decode("UTF-8")
print(content)
Meta tag
Most English-language sites declare: <meta charset="utf-8" />
For international sites, the declared charset varies (e.g., iso-8859-1, gb2312), so check the page's meta tag before decoding.
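One way to read the declared encoding is to look for the meta tag with BeautifulSoup. A minimal sketch, parsing a small inline snippet rather than a live page:

```python
from bs4 import BeautifulSoup

# Read the charset a page declares about itself. An international site
# might declare iso-8859-1, gb2312, etc., instead of utf-8.
html = b'<html><head><meta charset="utf-8" /></head><body>hola</body></html>'
soup = BeautifulSoup(html, "html.parser")
meta = soup.find("meta", charset=True)   # first meta tag with a charset attribute
print(meta["charset"])
```

Older pages may instead declare the charset inside `<meta http-equiv="Content-Type" ...>`, so a robust scraper checks both forms.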
CSV again
CSV - accessing individual columns
# 3-readingCsv.py
from urllib.request import urlopen
from io import StringIO
import csv

# the textbook's example URL (the slide's line was truncated)
data = urlopen("http://pythonscraping.com/files/MontyPythonAlbums.csv").read().decode('ascii', 'ignore')
dataFile = StringIO(data)
csvReader = csv.reader(dataFile)
for row in csvReader:
    print("The album \""+row[0]+"\" was released in "+str(row[1]))
4-readingCsvDict.py
from urllib.request import urlopen
from io import StringIO
import csv

# the textbook's example URL (the slide's line was truncated)
data = urlopen("http://pythonscraping.com/files/MontyPythonAlbums.csv").read().decode('ascii', 'ignore')
dataFile = StringIO(data)
dictReader = csv.DictReader(dataFile)
print(dictReader.fieldnames)
for row in dictReader:
    print(row)
PDF – Portable Document Format
Created by Adobe in 1993, motivated in part by the pain of dealing with Microsoft "doc" files.
PDFMiner3K Python library
PDFMiner 5-readPdf.py
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO
from io import open
from urllib.request import urlopen
def readPDF(pdfFile):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    process_pdf(rsrcmgr, device, pdfFile)
    device.close()
    content = retstr.getvalue()
    retstr.close()
    return content

# the textbook's example URL (the slide's line was truncated)
pdfFile = urlopen("http://pythonscraping.com/pages/warandpeace/chapter1.pdf")
outputString = readPDF(pdfFile)
print(outputString)
pdfFile.close()
Microsoft .doc and .docx
The proprietary .doc binary-file format was difficult to read and poorly supported by other word processors. In 2008, in an effort to get with the times and adopt a standard used by many other pieces of software, Microsoft adopted the XML-based Office Open XML standard, which made the files compatible with open source and other software.
Reading .docx files
from zipfile import ZipFile
from urllib.request import urlopen
from io import BytesIO
from bs4 import BeautifulSoup

# the textbook's example URL; a .docx file is a zip archive of XML files
wordFile = urlopen("http://pythonscraping.com/pages/AWordDocument.docx").read()
wordFile = BytesIO(wordFile)
document = ZipFile(wordFile)
xml_content = document.read('word/document.xml')
wordObj = BeautifulSoup(xml_content.decode('utf-8'), "html.parser")
textStrings = wordObj.findAll("w:t")
for textElem in textStrings:
    print(textElem.text)
Test 1 Thursday
Test is Thursday, Feb 16. You may bring a cheat sheet of notes to Test 1, subject to the following restrictions:
8 ½ x 11 sheet of paper
One side only
Handwritten (nothing electronically generated)
Not in black ink
The sheet will be signed and turned in with your test.
Lectures
Lecture 1 – Overview
Lecture 2 – Python classes, dictionaries, sets, etc.
Lecture 3 – BeautifulSoup
Lecture 4 – Regular Expressions
Lecture 5 – Regular Expressions II
Lecture 6 – Crawling
Lecture 7 – Scrapy
Lecture 8 – Storing data
Lecture 9 – Requests library
Lecture 10 – Selenium WebDriver
Homework 2: Regular expressions
CSCE 590 HW 2 – Regular expressions, due Sunday, Jan 29, at 11:55 PM
Give regular expressions that denote the languages:
a) strings x such that x starts with 'aa', followed by any number of b's and c's, and then ends in an 'a'
b) phone numbers with optional area codes and optional extensions of the form "ext 432"
c) addresses
d) a Python function definition that just has pass as the body
e) a link in a web page
What languages (sets of strings that match the RE) are denoted by the following regular expressions:
(a|b)[cde]{3,4}b
\w+\W+
\d{3}-\d{2}-\d{4}
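A practical habit for exercises like these is to test a candidate pattern against strings that should and should not match. As an illustration for part (a), the pattern below is my own sketch, not the official solution:

```python
import re

# Candidate for part (a): starts with 'aa', then any number of
# b's and c's, then ends in 'a'. (A sketch, not the official answer.)
pattern = re.compile(r"aa[bc]*a")

for s in ["aaa", "aabcbca", "aba", "aabc"]:
    print(s, "->", "matches" if pattern.fullmatch(s) else "no match")
```

Using re.fullmatch (rather than re.search) ensures the whole string belongs to the language, not just a substring.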
Give a regular expression that extracts the "login" for a USC student from their email address (after the match one could use login = match.group(1)).
Write a Python program that processes a file line by line and cleans it by removing (re.sub):
social security numbers
addresses
phone numbers
(replacing each with a suitable placeholder)
For extra credit: instead of removing Soc-Sec numbers, leave the number but swap digits, replacing the first three digits with the last three and the last three with the first three of the original string.
Homework 3, due Monday, Feb 6, at 11:55 PM
Write a short program named 3_1.py (less than 10 lines) that imports only urlopen and BeautifulSoup and then builds a list of all links (the <a> tag). Then it should process the list one element at a time and print each link. Use the URL . Finally it should print a count of the number of links.
Modify the previous program to obtain 3_2.py, which a) writes to the file "allLinks.txt" and b) writes only the URL, i.e., the value of the href attribute.
Run 5-getAllExternalLinks.py from Chapter 3 on the URL . Modify the code to handle the exceptions that occur by logging them, then ignoring and continuing to handle the other links.
Modify 5-getAllExternalLinks.py to check a website for "Bad Links" (a 404 check is sufficient).
Homework 4
Copy the table from the Master Schedule of CSCE courses online (^A ^C to select all, then copy) and paste (^V) into Excel; then save it as a CSV file, sched.csv.
Write a program, table.py, to grab this same table. Use Requests to log in to the CSE site and then use BeautifulSoup to prettify the page returned.
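The login-then-fetch flow above can be sketched with a requests Session, which keeps the login cookies across requests. The URL and form field names below are placeholders, not the CSE site's actual ones; inspect the real login form for its action URL and input names:

```python
import requests
from bs4 import BeautifulSoup

# Sketch: log in with a Session (cookies persist across requests),
# then fetch a page and prettify it. URLs and field names are
# hypothetical placeholders.
def fetch_schedule(login_url, schedule_url, username, password):
    session = requests.Session()
    session.post(login_url, data={"username": username, "password": password})
    page = session.get(schedule_url)
    return BeautifulSoup(page.text, "html.parser").prettify()
```

Calling fetch_schedule with the real URLs and credentials would return the prettified HTML of the schedule page.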