Flat text 3 Day 8 - 9/16/16 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University
Course organization http://www.tulane.edu/~howard/NLP/ 1.1.7. Schedule of assignments NLP, Prof. Howard, Tulane University 16-Sep-2016
Review
Project Gutenberg http://www.gutenberg.org/ebooks/28554
Add this code to the function
# In "textProc.py"
def gutenLoader(url, name):
    import requests
    try:
        download = requests.get(url).text
    except requests.exceptions.RequestException:
        # Stop here: without a download there is nothing to slice or write
        return 'Download failed!'
    # Locate the Project Gutenberg start marker, then the end of its line
    lineIndex = download.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
    startIndex = download.index('\n', lineIndex)
    # Locate the end marker
    endIndex = download.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
    # Keep only the text between the two markers
    text = download[startIndex:endIndex]
    with open(name, 'w') as tempFile:
        text = text.encode('utf8')
        tempFile.write(text)
    return 'File was written.'
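The slicing step in gutenLoader can be tried without any download. The following is a minimal sketch, written for Python 3 rather than the Python 2 used on the slides. The helper name extract_body and the sample string are made up for illustration; only the two marker strings come from the slide.

```python
# Marker strings copied from gutenLoader
START_MARKER = '*** START OF THIS PROJECT GUTENBERG EBOOK'
END_MARKER = '*** END OF THIS PROJECT GUTENBERG EBOOK'


def extract_body(download):
    """Return the text between the start-marker line and the end marker."""
    line_index = download.index(START_MARKER)
    # Skip to the end of the marker line so the marker itself is dropped
    start_index = download.index('\n', line_index)
    end_index = download.index(END_MARKER)
    return download[start_index:end_index]


# Hypothetical stand-in for a downloaded Project Gutenberg file
sample = (
    'Header and license notes\n'
    '*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n'
    'The story itself.\n'
    '*** END OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n'
    'More license text\n'
)

body = extract_body(sample)
print(repr(body))
```

Note that str.index raises ValueError when a marker is missing, so a file with differently worded markers would need its own handling.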
Call the function as before
(>>> import textProc)
>>> reload(textProc)
>>> from textProc import gutenLoader
>>> print gutenLoader(url, name)
Practices
Practice 1: try this with another Project Gutenberg text.
Practice 2: add comments to the function.
5. Flat text
The code for this chapter can be downloaded as nlp5.py, presumably to your Downloads folder, and then moved to pyScripts, from where you can open it in Spyder and run it line by line.
Install textract & chardet
http://textract.readthedocs.io/en/latest/installation.html
Mac
    Install homebrew: http://brew.sh/
    $ brew install poppler antiword unrtf tesseract
All
    ($ pip install tesseract)
    $ pip install textract
    $ pip install chardet
EPUBs
Convert EPUBs
>>> from requests import get
>>> url = 'http://www.gutenberg.org/ebooks/28554.epub.noimages'
>>> response = get(url)
>>> type(response)
<class 'requests.models.Response'>
More about the Response object
>>> response.headers
{'content-length': '16922', 'x-varnish': '1988218503', 'x-powered-by': '3', 'set-cookie': 'session_id=c91e2c01ad330b816664af3600b141ed13f5be94; Domain=.gutenberg.org; expires=Thu, 15 Sep 2016 13:05:50 GMT; Path=/', 'age': '0', 'server': 'Apache', 'x-connection': 'Close', 'via': '1.1 varnish', 'x-rate-limiter': 'ratelimiter2.php57', 'date': 'Thu, 15 Sep 2016 12:35:50 GMT', 'x-frame-options': 'sameorigin', 'content-type': 'application/epub+zip'}
>>> response.text[:150]
>>> response.content[:150]
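In requests, response.text is the body decoded to a string, while response.content is the raw bytes. The content-type header here is application/epub+zip, so the body is a zip archive and only the bytes are meaningful. The sketch below uses the standard library only, with a small in-memory zip archive standing in for the downloaded EPUB. The archive contents are made up for illustration, and the code is written for Python 3.

```python
import io
import zipfile

# Build a tiny zip archive in memory as a stand-in for response.content
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('mimetype', 'application/epub+zip')
    # Stored uncompressed, so these non-text bytes appear verbatim
    archive.writestr('data.bin', b'\xff\xfe\x00')
raw_bytes = buffer.getvalue()

# Zip data starts with the signature bytes 'PK'
print(raw_bytes[:2])  # b'PK'

# Decoding binary data as text is lossy: undecodable bytes get replaced
as_text = raw_bytes.decode('utf8', errors='replace')
round_trip = as_text.encode('utf8')
print(round_trip == raw_bytes)  # False

# The untouched bytes are still a readable archive
print(zipfile.is_zipfile(io.BytesIO(raw_bytes)))  # True
```

This is why binary downloads are written to disk from the bytes, with the file opened in 'wb' mode.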
Textract expects to read a file from disk
with open('Wub.epub','wb') as tempFile:
    tempFile.write(response.content)
from textract import process
rawText = process('Wub.epub')
type(rawText)
from chardet import detect
detect(rawText)
len(rawText) # 34361
rawText[:150]
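chardet's detect returns a dictionary with 'encoding' and 'confidence' keys, which can then be used to turn the byte string into text. A minimal sketch, assuming rawText is a byte string: the helper decode_with_detection, the sample bytes, and the hand-written detection dictionary are all made up for illustration, so that the code runs without textract or chardet installed (Python 3).

```python
def decode_with_detection(raw, detection, fallback='utf8'):
    """Decode bytes using a chardet-style result dictionary."""
    # detect() reports None as the encoding when it cannot guess
    encoding = detection.get('encoding') or fallback
    return raw.decode(encoding, errors='replace')


# Hypothetical stand-ins for rawText and for detect(rawText)
raw_text = 'caf\u00e9 au lait'.encode('utf8')
detection = {'encoding': 'utf-8', 'confidence': 0.99}

text = decode_with_detection(raw_text, detection)
print(text)
print(len(raw_text), len(text))
```

The two lengths differ because len counts bytes before decoding and characters afterwards.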
Review
from requests import get
from textract import process
url = 'http://www.cwanderson.org/wp-content/uploads/2011/11/Philip-K-Dick-The-Minority-Report.pdf'
response = get(url)
with open('MinorityReport.pdf', "wb") as tempFile:
    tempFile.write(response.content)
rawText = process('MinorityReport.pdf')
PDFs
Download & convert a PDF
url = 'http://www.cwanderson.org/wp-content/uploads/2011/11/Philip-K-Dick-The-Minority-Report.pdf'
response = get(url)
with open('MinorityReport.pdf', "wb") as tempFile:
    tempFile.write(response.content)
rawText = process('MinorityReport.pdf')
Images
Download & write an image
from requests import get
url = 'http://i.stack.imgur.com/t3qWG.png'
try:
    response = get(url)
except Exception:
    print 'Download failed!'
else:
    # Only write the file if the download succeeded
    with open('ocrTest.png', "wb") as tempFile:
        tempFile.write(response.content)
Try to OCR it
>>> from textract import process
>>> rawText = process('ocrTest.png')
# Switch to Terminal for tesseract
$ cd /Users/harryhow/Documents/pyScripts
$ tesseract ocrTest.png ocrText
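The Terminal command above can also be launched from Python with the standard subprocess module. This is only a sketch of one possible approach, not something the slides do: the helper tesseract_command is made up for illustration, and the code is written for Python 3. It runs tesseract only if the executable and ocrTest.png are both present, and relies on tesseract adding the .txt extension to the output name.

```python
import os
import shutil
import subprocess


def tesseract_command(image_path, output_base):
    """Build the argument list for: tesseract IMAGE OUTPUT_BASE."""
    return ['tesseract', image_path, output_base]


command = tesseract_command('ocrTest.png', 'ocrText')
print(' '.join(command))

# Run only when tesseract is installed and the image has been downloaded
if shutil.which('tesseract') and os.path.exists('ocrTest.png'):
    subprocess.run(command, check=True)
    # tesseract writes its result to ocrText.txt
    with open('ocrText.txt', encoding='utf8') as result:
        print(result.read()[:150])
else:
    print('tesseract or ocrTest.png not found; nothing was run.')
```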
Q1 stats
      MIN   AVG   MAX
Q1    6.0   9.3   10.0
Next time
Q2
Regex