Flat text 3
Day 8 - 9/16/16
LING 3820 & 6820 Natural Language Processing
Harry Howard
Tulane University

Course organization
http://www.tulane.edu/~howard/NLP/
Schedule of assignments
NLP, Prof. Howard, Tulane University 16-Sep-2016

Review

Project Gutenberg
http://www.gutenberg.org/ebooks/28554

Add this code to the function
# In "textProc.py"
def gutenLoader(url, name):
    import requests
    # Download the page; stop here if the request fails,
    # because there would be nothing to slice or write.
    try:
        download = requests.get(url).text
    except requests.exceptions.RequestException:
        print 'Download failed!'
        return 'File was not written.'
    # Keep only the text between the Project Gutenberg markers.
    lineIndex = download.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
    startIndex = download.index('\n', lineIndex)
    endIndex = download.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
    text = download[startIndex:endIndex]
    with open(name, 'w') as tempFile:
        text = text.encode('utf8')
        tempFile.write(text)
    return 'File was written.'
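The slicing step in gutenLoader can be tried without a network connection. What follows is a minimal sketch written for Python 3 (the slides use Python 2). The helper name extract_body and the sample string are hypothetical, made up for illustration; the two marker strings are the ones from the slide.

```python
START = '*** START OF THIS PROJECT GUTENBERG EBOOK'
END = '*** END OF THIS PROJECT GUTENBERG EBOOK'

def extract_body(download):
    """Return the text between the Project Gutenberg start and end markers."""
    line_index = download.index(START)
    # Jump to the end of the marker line so the rest of that line is dropped.
    start_index = download.index('\n', line_index)
    end_index = download.index(END)
    return download[start_index:end_index]

# Hypothetical miniature "download" used only for this illustration.
sample = (
    'Header and license text\n'
    '*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n'
    'First line of the book.\n'
    'Last line of the book.\n'
    '*** END OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n'
    'Footer text\n'
)
body = extract_body(sample)
print(repr(body))
```

Note that str.index raises ValueError when a marker is missing, so a text whose markers are worded differently would stop the function at that line.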

Call the function as before
(>>> import textProc)
>>> reload(textProc)
>>> from textProc import gutenLoader
>>> print gutenLoader(url, name)
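The reload step matters because Python keeps the first imported version of a module in memory, so edits to the file are not picked up by a second import. Here is a small sketch of that behavior for Python 3, where reload lives in importlib (the slides use the Python 2 built-in). The module name demo_mod and its greet function are hypothetical and are written to a temporary folder only for this demonstration.

```python
import importlib
import os
import sys
import tempfile

# Write a throwaway module to disk; "demo_mod" is a made-up name.
work_dir = tempfile.mkdtemp()
module_path = os.path.join(work_dir, 'demo_mod.py')
with open(module_path, 'w') as f:
    f.write('def greet():\n    return "old"\n')

sys.path.insert(0, work_dir)
importlib.invalidate_caches()
import demo_mod
first = demo_mod.greet()

# Edit the file, as you would after changing textProc.py in Spyder.
with open(module_path, 'w') as f:
    f.write('def greet():\n    return "edited version"\n')

importlib.reload(demo_mod)   # Python 3 spelling of Python 2's reload()
second = demo_mod.greet()
print(first, second)
```

A name brought in with "from module import name" still points at the old object after a reload, which is why the slide repeats the from-import line after reloading.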

Practices
Practice 1: Try this with another PG text.
Practice 2: Add comments.

5. Flat text
The code for this chapter can be downloaded as nlp5.py, presumably to your Downloads folder, and then moved to pyScripts, where you can open it in Spyder and run each line one by one.

Install textract & chardet
.html
Mac
Install homebrew:
$ brew install poppler antiword unrtf tesseract
All
($ pip install tesseract)
$ pip install textract
$ pip install chardet
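After running the install commands, one way to confirm that Python can find the packages is to ask the import system directly. This is a small sketch for Python 3 using only the standard library; the helper name is_installed is hypothetical.

```python
import importlib.util

def is_installed(package_name):
    """Return True if Python can locate a top-level package with this name."""
    return importlib.util.find_spec(package_name) is not None

# Report on the two packages this chapter relies on.
for package in ('textract', 'chardet'):
    print(package, is_installed(package))
```

A False result usually means pip installed the package into a different Python than the one Spyder is running.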

EPUBs

Convert EPUBs
>>> from requests import get
>>> url = ' s'
>>> response = get(url)
>>> type(response)
<class 'requests.models.Response'>

More about the Response object
>>> response.headers
{'content-length': '16922',
 'x-varnish': ' ',
 'x-powered-by': '3',
 'set-cookie': 'session_id=c91e2c01ad330b816664af3600b141ed13f5be94; Domain=.gutenberg.org; expires=Thu, 15 Sep :05:50 GMT; Path=/',
 'age': '0',
 'server': 'Apache',
 'x-connection': 'Close',
 'via': '1.1 varnish',
 'x-rate-limiter': 'ratelimiter2.php57',
 'date': 'Thu, 15 Sep :35:50 GMT',
 'x-frame-options': 'sameorigin',
 'content-type': 'application/epub+zip'}
>>> response.text[:150]
>>> response.content[:150]
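The 'content-type' value above, application/epub+zip, reflects that an EPUB is packaged as a ZIP archive. The download is therefore binary data: response.content holds the raw bytes, while response.text is the result of trying to decode those bytes as text. The sketch below, for Python 3 with the standard library only, builds a tiny stand-in archive in memory instead of downloading anything. The names fake_content and chapter1.txt are hypothetical, and the archive is not a complete, valid EPUB.

```python
import io
import zipfile

# Build a small ZIP archive in memory as a stand-in for response.content.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('mimetype', 'application/epub+zip')
    archive.writestr('chapter1.txt', 'Hypothetical chapter text.')
fake_content = buffer.getvalue()

# ZIP data begins with the bytes 'PK', not with readable prose.
signature = fake_content[:4]
print(signature)

# The archive members can be listed straight from the bytes.
with zipfile.ZipFile(io.BytesIO(fake_content)) as archive:
    names = archive.namelist()
print(names)
```

This is why the file has to be written in binary mode ('wb') before a converter can read it.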

Textract expects to read a file from disk
with open('Wub.epub','wb') as tempFile:
    tempFile.write(response.content)
from textract import process
rawText = process('Wub.epub')
type(rawText)
from chardet import detect
detect(rawText)
len(rawText) # 34361
rawText[:150]
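Once an encoding has been guessed, the byte string can be decoded into text. The following is a minimal sketch for Python 3 that runs without textract or chardet installed: raw_text stands in for the bytes returned by process, and guess is a hand-written stand-in for the dictionary returned by detect. The helper decode_with_guess and the sample sentence are hypothetical.

```python
# Stand-in for the bytes returned by process(); the words are made up.
raw_text = 'Café wub'.encode('utf-8')

# Hand-written stand-in for the dictionary returned by detect(raw_text).
guess = {'encoding': 'utf-8', 'confidence': 0.99}

def decode_with_guess(raw, guess, fallback='utf-8'):
    """Decode bytes with the guessed encoding, or with fallback if none was found."""
    encoding = guess.get('encoding') or fallback
    # errors='replace' substitutes a marker for undecodable bytes instead of raising.
    return raw.decode(encoding, errors='replace')

text = decode_with_guess(raw_text, guess)
print(len(raw_text), len(text))
```

The two lengths differ because len counts bytes before decoding and characters afterwards, so a byte count such as the one on the slide is not necessarily the number of characters in the book.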

Review
from requests import get
from textract import process
url = ' content/uploads/2011/11/Philip-K-Dick-The-Minority-Report.pdf'
response = get(url)
with open('MinorityReport.pdf', "wb") as tempFile:
    tempFile.write(response.content)
rawText = process('MinorityReport.pdf')

PDFs

Download & convert a PDF
url = ' content/uploads/2011/11/Philip-K-Dick-The-Minority-Report.pdf'
response = get(url)
with open('MinorityReport.pdf', "wb") as tempFile:
    tempFile.write(response.content)
rawText = process('MinorityReport.pdf')
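The file name passed to open is typed by hand above. When trying several documents, it can be convenient to take the name from the URL instead. Here is a small sketch for Python 3 using the standard library; the helper filename_from_url and the example.com address are hypothetical and are not the URL used in class.

```python
import posixpath
from urllib.parse import urlsplit

def filename_from_url(url):
    """Return the last path component of a URL, ignoring any query string."""
    return posixpath.basename(urlsplit(url).path)

# Hypothetical address, used only to show the idea.
example_url = 'https://example.com/files/2011/11/Some-Story.pdf?download=1'
local_name = filename_from_url(example_url)
print(local_name)
```

Keeping the original extension matters here, since a converter that reads from disk has only the file on disk to identify the format.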

Images

Download & write an image
from requests import get
url = '
try:
    response = get(url)
except:
    print 'Download failed!'
else:
    # Write the image only if the download succeeded.
    with open('ocrTest.png', "wb") as tempFile:
        tempFile.write(response.content)
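If the download fails, there is no response to write, so the file-writing step should be skipped. One way to arrange that is to put both steps in a function that returns early on failure. This is a minimal sketch for Python 3 that runs offline: save_download, fake_fetch and broken_fetch are hypothetical names, and the fetch argument stands in for a real download such as a function that returns requests.get(url).content.

```python
import os
import tempfile

def save_download(fetch, url, path):
    """Fetch bytes from url and write them to path; return True on success."""
    try:
        data = fetch(url)
    except Exception as error:
        print('Download failed!', error)
        return False          # stop here: there is nothing to write
    with open(path, 'wb') as temp_file:
        temp_file.write(data)
    return True

# Stand-ins for a real download, so the sketch runs without a network.
def fake_fetch(url):
    return b'\x89PNG fake image bytes'

def broken_fetch(url):
    raise IOError('no network')

target = os.path.join(tempfile.mkdtemp(), 'ocrTest.png')
ok = save_download(fake_fetch, 'https://example.com/ocrTest.png', target)
failed = save_download(broken_fetch, 'https://example.com/ocrTest.png', target + '.missing')
print(ok, failed)
```

Passing the download step in as an argument is only a convenience for testing; in class code the function could call get directly.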

Try to OCR it
>>> from textract import process
>>> rawText = process('ocrTest.png')
# Switch to Terminal for tesseract
$ cd /Users/harryhow/Documents/pyScripts
$ tesseract ocrTest.png ocrText
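The Terminal command can also be launched from Python, which avoids switching windows. The following is a sketch for Python 3 using the standard library; the function names tesseract_command and run_ocr are hypothetical, and run_ocr assumes the tesseract program is installed and on the PATH. It checks for that before running anything.

```python
import shutil
import subprocess

def tesseract_command(image_path, output_base):
    """Build the argument list for: tesseract IMAGE OUTPUT_BASE."""
    return ['tesseract', image_path, output_base]

def run_ocr(image_path, output_base):
    """Run tesseract if it is installed; return the text file path or None."""
    if shutil.which('tesseract') is None:
        print('tesseract not found on PATH')
        return None
    subprocess.run(tesseract_command(image_path, output_base), check=True)
    # tesseract appends ".txt" to the output base name.
    return output_base + '.txt'

command = tesseract_command('ocrTest.png', 'ocrText')
print(' '.join(command))
```

Calling run_ocr('ocrTest.png', 'ocrText') from the pyScripts folder would then correspond to the Terminal command on the slide.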

Q1 stats
      MIN   AVG   MAX
Q1    6.0   9.3   10.0

Next time
Q2
Regex
