1
Flat text 3
Day 8 - 9/16/16
LING 3820 & 6820 Natural Language Processing
Harry Howard, Tulane University
2
Course organization http://www.tulane.edu/~howard/NLP/
Schedule of assignments NLP, Prof. Howard, Tulane University 16-Sep-2016
3
Review
4
Project Gutenberg http://www.gutenberg.org/ebooks/28554
5
Add this code to the function
# In "textProc.py"
def gutenLoader(url, name):
    import requests
    # Download the page; stop here if the request fails.
    try:
        download = requests.get(url).text
    except requests.exceptions.RequestException:
        return 'Download failed!'
    # Find the end of the line that holds the start marker.
    lineIndex = download.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
    startIndex = download.index('\n', lineIndex)
    endIndex = download.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
    # Keep only the text between the two markers.
    text = download[startIndex:endIndex]
    # Write the text to disk as UTF-8 bytes.
    with open(name, 'w') as tempFile:
        text = text.encode('utf8')
        tempFile.write(text)
    return 'File was written.'
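The marker-slicing step of gutenLoader can be tried without a network connection. What follows is a minimal sketch, not the course's code: the helper name strip_gutenberg_wrapper and the sample string are hypothetical, and the block is written for Python 3, whereas the slides use Python 2.

```python
START_MARKER = '*** START OF THIS PROJECT GUTENBERG EBOOK'
END_MARKER = '*** END OF THIS PROJECT GUTENBERG EBOOK'


def strip_gutenberg_wrapper(download):
    """Return the text between the Project Gutenberg start and end markers.

    Hypothetical helper for illustration; str.index raises ValueError
    if either marker is missing.
    """
    line_index = download.index(START_MARKER)
    # Skip the rest of the marker line, which also carries the book title.
    start_index = download.index('\n', line_index)
    end_index = download.index(END_MARKER)
    return download[start_index:end_index]


# Invented sample that imitates the layout of a Project Gutenberg text file.
sample = (
    'Legal header\n'
    '*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n'
    'First line of the book.\n'
    'Last line of the book.\n'
    '*** END OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n'
    'Legal footer\n'
)
body = strip_gutenberg_wrapper(sample)
print(body.strip())
```

Because the slice starts at the newline that ends the start-marker line, the result begins with a newline; calling strip() on it, as in the last line, removes that.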
6
Call the function as before
(>>> import textProc)
>>> reload(textProc)
>>> from textProc import gutenLoader
>>> print gutenLoader(url, name)
7
Practices
Practice 1: Try this with another Project Gutenberg (PG) text.
Practice 2: Add comments to the function.
8
5. Flat text
The code for this chapter can be downloaded as nlp5.py, presumably to your Downloads folder, and then moved to pyScripts, where you can open it in Spyder and run it line by line.
9
Install textract & chardet
Mac
Install Homebrew, then:
$ brew install poppler antiword unrtf tesseract
All
($ pip install tesseract)
$ pip install textract
$ pip install chardet
10
EPUBs
11
Convert EPUBs
>>> from requests import get
>>> url = ' s'
>>> response = get(url)
>>> type(response)
<class 'requests.models.Response'>
12
More about the Response object
>>> response.headers
{'content-length': '16922', 'x-varnish': ' ', 'x-powered-by': '3',
 'set-cookie': 'session_id=c91e2c01ad330b816664af3600b141ed13f5be94; Domain=.gutenberg.org; expires=Thu, 15 Sep :05:50 GMT; Path=/',
 'age': '0', 'server': 'Apache', 'x-connection': 'Close', 'via': '1.1 varnish',
 'x-rate-limiter': 'ratelimiter2.php57', 'date': 'Thu, 15 Sep :35:50 GMT',
 'x-frame-options': 'sameorigin', 'content-type': 'application/epub+zip'}
>>> response.text[:150]
>>> response.content[:150]
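The difference between response.text and response.content matters here because the content-type is application/epub+zip: an EPUB is a ZIP archive, so it is binary data rather than text. The following is a minimal offline sketch, assuming Python 3 and only the standard library. It builds a small ZIP archive in memory as a stand-in for response.content (the variable names and archive contents are made up for illustration) and checks the bytes the way one might check a download before writing it to disk.

```python
import io
import zipfile

# Build a tiny ZIP archive in memory as a stand-in for response.content.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('mimetype', 'application/epub+zip')
content = buffer.getvalue()

# The raw bytes of a ZIP archive begin with the signature b'PK'.
starts_with_pk = content[:2] == b'PK'
# zipfile can confirm that the bytes form a readable archive.
looks_like_zip = zipfile.is_zipfile(io.BytesIO(content))

print(type(content), starts_with_pk, looks_like_zip)
```

This is why the next slide writes response.content, not response.text, to a file opened in 'wb' mode.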
13
Textract expects to read a file from disk
with open('Wub.epub','wb') as tempFile:
    tempFile.write(response.content)
from textract import process
rawText = process('Wub.epub')
type(rawText)
from chardet import detect
detect(rawText)
len(rawText) # 34361
rawText[:150]
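The point of calling detect on rawText is that process returns a byte string, which has to be decoded with the right encoding before it can be handled as text. Below is a minimal sketch of that decoding step, assuming Python 3. To stay runnable without textract or chardet installed, it uses hand-written stand-ins: raw_text imitates the bytes from process, and detected imitates the shape of the dictionary returned by detect, with invented values.

```python
# Stand-in for the byte string returned by textract's process().
raw_text = 'Caf\u00e9 on the wub'.encode('utf-8')

# Hand-written stand-in for the dictionary that chardet's detect() returns;
# the values are invented for this sketch.
detected = {'encoding': 'utf-8', 'confidence': 0.99}

# Decode the bytes into text using the reported encoding.
text = raw_text.decode(detected['encoding'])

# The byte count and the character count differ when non-ASCII
# characters are present.
print(len(raw_text), len(text))
```

The same distinction applies to len(rawText) on the slide: it counts bytes, which may not equal the number of characters once the text is decoded.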
14
Review
from requests import get
from textract import process
url = ' content/uploads/2011/11/Philip-K-Dick-The-Minority-Report.pdf'
response = get(url)
with open('MinorityReport.pdf', "wb") as tempFile:
    tempFile.write(response.content)
rawText = process('MinorityReport.pdf')
15
PDFs
16
Download & convert a PDF
url = ' content/uploads/2011/11/Philip-K-Dick-The-Minority-Report.pdf'
response = get(url)
with open('MinorityReport.pdf', "wb") as tempFile:
    tempFile.write(response.content)
rawText = process('MinorityReport.pdf')
17
Images
18
Download & write an image
from requests import get
url = ' '
try:
    response = get(url)
except Exception:
    print 'Download failed!'
else:
    # Write the image only if the download succeeded.
    with open('ocrTest.png', "wb") as tempFile:
        tempFile.write(response.content)
19
Try to OCR it
>>> from textract import process
>>> rawText = process('ocrTest.png')
# Switch to Terminal for tesseract
$ cd /Users/harryhow/Documents/pyScripts
$ tesseract ocrTest.png ocrText
20
Q1 stats
      MIN   AVG   MAX
Q1    6.0   9.3   10.0
21
Next time
Q2
Regex