Flat text 2
Day 7 - 9/14/16
LING 3820 & 6820 Natural Language Processing
Harry Howard
Tulane University
Course organization
http://www.tulane.edu/~howard/NLP/
1.1.7. Schedule of assignments
NLP, Prof. Howard, Tulane University 14-Sep-2016
Review
I am going to review everything, because I have expanded on what I said Monday.
5. Flat text
The code for this chapter can be downloaded as nlp5.py, presumably to your Downloads folder, and then moved to pyScripts, where you can open it in Spyder and run it line by line.
How to navigate folders with os

# check your current working directory in Python
>>> import os
>>> os.getcwd()
'/Users/harryhow/Documents/pyScripts'
>>> os.listdir('.')
# if the path is not to your pyScripts folder, then change it:
>>> os.chdir('/Users/{your_user_name}/Documents/pyScripts')
# if you have no pyScripts folder
>>> os.chdir('/Users/{your_user_name}/Documents/')
>>> os.makedirs('pyScripts')
>>> os.path.exists('/Users/{your_user_name}/Documents/pyScripts')
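A minimal sketch of the same folder checks, written for Python 3 (the slides use Python 2) and run inside a throwaway temporary directory so that nothing under your Documents folder is touched. The folder name pyScripts comes from the slide; the helper name ensure_folder is hypothetical.

```python
import os
import tempfile

def ensure_folder(parent, name):
    """Create parent/name if it is missing and return its full path."""
    path = os.path.join(parent, name)
    # exist_ok=True keeps makedirs from raising if the folder is already there
    os.makedirs(path, exist_ok=True)
    return path

with tempfile.TemporaryDirectory() as sandbox:
    scripts = ensure_folder(sandbox, 'pyScripts')
    # calling it a second time is harmless
    scripts_again = ensure_folder(sandbox, 'pyScripts')
    folder_exists = os.path.exists(scripts)
    listing = os.listdir(sandbox)
    print(folder_exists, listing)
```

Building the path with os.path.join avoids having to call os.chdir first, so the current working directory stays where it was.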
Project Gutenberg
http://www.gutenberg.org/ebooks/28554
How to download a file with requests and convert it to a string with .text

>>> import requests
>>> url = 'http://www.gutenberg.org/cache/epub/28554/pg28554.txt'
>>> download = requests.get(url).text
# find out about it
>>> type(download)
>>> len(download) # 35739?
>>> download[:150]
($> pip install chardet)
>>> import chardet
>>> chardet.detect(download)
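The "find out about it" steps can be bundled into one helper. This is only a sketch for Python 3 (the slides use Python 2), with a short stand-in string in place of the real download so that it runs offline; the name describe and the stand-in text are hypothetical.

```python
def describe(text, head=150):
    """Summarize a string: its type name, its length, and its first characters."""
    return {
        'type': type(text).__name__,
        'length': len(text),
        'head': text[:head],
    }

# stand-in for requests.get(url).text, which is also a str in Python 3
sample = 'The Project Gutenberg EBook (stand-in text for an offline demo)'
summary = describe(sample, head=21)
print(summary)
```

In Python 3, chardet.detect works on bytes rather than on decoded text, so it would be given requests.get(url).content instead of .text.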
How to start a function for recurring file operations

# In "textProc.py"
def gutenLoader(url, name):
    import requests
    download = requests.get(url).text
How to use try to catch errors

>>> try:
...     download = requests.get(url).text
... except:
...     print 'Download failed!'
...
Add the try block to gutenLoader():

# In "textProc.py"
def gutenLoader(url, name):
    import requests
    try:
        download = requests.get(url).text
    except:
        print 'Download failed!'
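A bare except also swallows KeyboardInterrupt and SystemExit. The sketch below, for Python 3 (the slides use Python 2), catches Exception by name, reports what went wrong, and returns None so that the caller can tell that nothing was downloaded. The download step is passed in as a function so that the example runs offline; safe_fetch, fake_ok, fake_fail and the example.org URL are hypothetical.

```python
def safe_fetch(fetch, url):
    """Call fetch(url); return its result, or None if it raises an error."""
    try:
        return fetch(url)
    except Exception as err:
        # unlike a bare except, this lets KeyboardInterrupt through
        print('Download failed!', err)
        return None

def fake_ok(url):
    # stands in for a download that succeeds
    return 'some text from ' + url

def fake_fail(url):
    # stands in for a download that cannot connect
    raise IOError('no connection')

good = safe_fetch(fake_ok, 'http://example.org/a.txt')
bad = safe_fetch(fake_fail, 'http://example.org/a.txt')
print(good, bad)
```

With requests installed, the function passed in could be something like lambda u: requests.get(u).text.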
Warning
Project Gutenberg keeps track of how frequently you access it and will ask you to prove that you are human with a captcha. You will know that this has happened if the text that you downloaded is actually a bunch of HTML, as illustrated in the appendix "A snippet of Project Gutenberg’s captcha page". Since requests still downloads text of a sort, it does not throw an exception.
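Because no exception is raised, the check has to look at the text itself. The following is a rough heuristic for Python 3, not a feature of requests or of Project Gutenberg: it only tests whether the download begins like an HTML page. The name looks_like_html and the sample strings are hypothetical.

```python
def looks_like_html(text, window=500):
    """Guess whether a download is an HTML page rather than plain text."""
    # drop a leading byte-order mark and whitespace before looking
    head = text[:window].lstrip('\ufeff \t\r\n').lower()
    return head.startswith('<!doctype html') or head.startswith('<html')

captcha_page = '<!DOCTYPE html>\n<html><head><title>Are you human?</title></head></html>'
plain_text = '\ufeffThe Project Gutenberg EBook (stand-in text)'

print(looks_like_html(captcha_page), looks_like_html(plain_text))
```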
How to save a file to your hard drive

>>> tempF = open('Wub.txt','w')
>>> tempF.write(download.encode('utf8'))
>>> tempF.close()
>>> tempF
>>> import os
>>> os.listdir('.')
How to read a file from your hard drive

>>> tempFile = open('Wub.txt','r')
>>> doc = tempFile.read()
>>> tempFile.close()
# these can be combined:
>>> doc = open('Wub.txt', 'r').read()
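In Python 3 the encoding step moves into open() itself. The sketch below (an adaptation, since the slides use Python 2) writes a short stand-in string to a temporary folder, then reads it back both as raw bytes and as text to show what the encoding does. The stand-in string is hypothetical.

```python
import os
import tempfile

text = 'caf\u00e9 Gutenberg'  # stand-in for the downloaded string

with tempfile.TemporaryDirectory() as sandbox:
    path = os.path.join(sandbox, 'Wub.txt')

    # text mode with an explicit encoding: Python encodes on write
    tempF = open(path, 'w', encoding='utf8')
    tempF.write(text)
    tempF.close()

    # binary mode shows the bytes that actually landed on disk
    tempF = open(path, 'rb')
    raw = tempF.read()
    tempF.close()

    # text mode with the same encoding gives the original string back
    tempF = open(path, 'r', encoding='utf8')
    doc = tempF.read()
    tempF.close()

print(len(text), len(raw), doc == text)
```

The byte count is one larger than the character count because the accented letter takes two bytes in UTF-8.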
Find out about it

>>> type(doc)
>>> len(doc)
>>> import chardet
>>> chardet.detect(doc)
How to read from and write to a file

>>> doc1 = open('Wub.txt', 'r').read()
>>> tempText = doc1.replace('Gutenberg', 'GUTENBERG')
>>> tempText = tempText.encode('utf8')
>>> tempFile = open('Wub2.txt','w')
>>> tempFile.write(tempText)
>>> tempFile.close()
# examine result
>>> doc3 = open('Wub2.txt', 'r').read()
>>> doc3[:150]
How to simplify file operations with the with statement

>>> with open('Wub.txt','r') as tempFile:
...     text = tempFile.read()
...     text = text.replace('Gutenberg', 'GUTENBERG')
...
>>> with open('Wub3.txt','w') as tempFile:
...     tempFile.write(text)
...
# test
>>> doc4 = open('Wub3.txt', 'r').read()
>>> doc4[:150]
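What the with statement buys you is that the file is closed as soon as the block ends, even if an error occurs inside it. A small Python 3 sketch of this (the slides use Python 2), run in a temporary folder with a hypothetical stand-in text:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as sandbox:
    source = os.path.join(sandbox, 'Wub.txt')
    target = os.path.join(sandbox, 'Wub3.txt')

    # set up a small file to work on
    with open(source, 'w', encoding='utf8') as tempFile:
        tempFile.write('Project Gutenberg stand-in text')

    with open(source, 'r', encoding='utf8') as tempFile:
        text = tempFile.read()
    # the block has ended, so the file is already closed
    closed_after_block = tempFile.closed

    text = text.replace('Gutenberg', 'GUTENBERG')

    with open(target, 'w', encoding='utf8') as tempFile:
        tempFile.write(text)

    # test
    with open(target, 'r', encoding='utf8') as tempFile:
        doc4 = tempFile.read()

print(closed_after_block, doc4)
```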
Add the with block to gutenLoader():

# In "textProc.py"
def gutenLoader(url, name):
    import requests
    try:
        download = requests.get(url).text
    except:
        print 'Download failed!'
    with open(name,'w') as tempFile:
        download = download.encode('utf8')
        tempFile.write(download)
How to refresh your script with reload()

(>>> import textProc)
>>> reload(textProc)
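reload() is a built-in only in Python 2; in Python 3 the same job is done by importlib.reload. The sketch below demonstrates it on a throwaway module written to a temporary folder. The module name textProcDemo and its contents are hypothetical, and bytecode caching is switched off only so that two edits in quick succession are both picked up.

```python
import importlib
import os
import sys
import tempfile

sys.dont_write_bytecode = True  # avoid a stale cached .pyc in this quick demo

with tempfile.TemporaryDirectory() as sandbox:
    module_path = os.path.join(sandbox, 'textProcDemo.py')
    with open(module_path, 'w') as f:
        f.write("MESSAGE = 'first version'\n")

    # make the temporary folder importable, then import the module
    sys.path.insert(0, sandbox)
    importlib.invalidate_caches()
    import textProcDemo
    before = textProcDemo.MESSAGE

    # edit the script on disk, as you would in Spyder
    with open(module_path, 'w') as f:
        f.write("MESSAGE = 'second version, edited'\n")

    # re-run the module's code in place
    importlib.reload(textProcDemo)
    after = textProcDemo.MESSAGE
    sys.path.remove(sandbox)

print(before, '->', after)
```

Names imported earlier with from ... import still point at the old objects after a reload, which is why the slides repeat the from textProc import gutenLoader line.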
How to call your function

>>> url = 'http://www.gutenberg.org/cache/epub/31516/pg31516.txt'
>>> name = 'Eyes.txt'
>>> from textProc import gutenLoader
>>> gutenLoader(url, name)
How to get your function to communicate with the outside world with return

# In "textProc.py"
def gutenLoader(url, name):
    import requests
    try:
        download = requests.get(url).text
    except:
        print 'Download failed!'
    with open(name,'w') as tempFile:
        download = download.encode('utf8')
        tempFile.write(download)
    return 'File was written.'
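The difference between printing inside a function and returning a value can be seen in a two-function sketch, written for Python 3 (the slides use Python 2). The names announce and report are hypothetical.

```python
def announce(name):
    # prints a message but returns nothing, so the caller receives None
    print('File was written: ' + name)

def report(name):
    # hands the message back to the caller, who decides what to do with it
    return 'File was written: ' + name

printed = announce('Eyes.txt')
returned = report('Eyes.txt')
print(printed, returned)
```

Only the returned string can be stored, tested, or passed on to other code.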
Call it by way of print

>>> reload(textProc)
>>> from textProc import gutenLoader
>>> print gutenLoader(url, name)
# open it
>>> with open('Wub.txt','r') as tempFile:
...     download = tempFile.read()
...
How to slice away what you don’t need

>>> download.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
499
>>> lineIndex = download.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
>>> startIndex = download.index('\n',lineIndex)
>>> download[:startIndex]
>>> download.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
>>> endIndex = download.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
>>> text = download[startIndex:endIndex]
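The slicing steps can be wrapped in a helper that runs on any string. This is a sketch for Python 3 under the assumption that the marker wording matches the slide; the wording may differ from file to file, so both markers are parameters. It uses str.find, which returns -1 for a missing marker where str.index would raise ValueError. The name strip_boilerplate and the sample text are hypothetical.

```python
START = '*** START OF THIS PROJECT GUTENBERG EBOOK'
END = '*** END OF THIS PROJECT GUTENBERG EBOOK'

def strip_boilerplate(download, start=START, end=END):
    """Return the text between the start-marker line and the end marker.

    Returns None if either marker is missing.
    """
    lineIndex = download.find(start)
    if lineIndex == -1:
        return None
    # the body begins at the line break that ends the start-marker line
    startIndex = download.find('\n', lineIndex)
    endIndex = download.find(end, lineIndex)
    if startIndex == -1 or endIndex == -1:
        return None
    return download[startIndex:endIndex]

sample = ('Header and license\n'
          '*** START OF THIS PROJECT GUTENBERG EBOOK DEMO ***\n'
          'The story itself.\n'
          '*** END OF THIS PROJECT GUTENBERG EBOOK DEMO ***\n'
          'More license text\n')
body = strip_boilerplate(sample)
missing = strip_boilerplate('no markers here')
print(repr(body), missing)
```

The slice keeps the line breaks on either side of the body; calling .strip() on the result removes them.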
Add this code to the function

# In "textProc.py"
def gutenLoader(url, name):
    import requests
    try:
        download = requests.get(url).text
    except:
        print 'Download failed!'
    lineIndex = download.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
    startIndex = download.index('\n',lineIndex)
    endIndex = download.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
    text = download[startIndex:endIndex]
    with open(name,'w') as tempFile:
        text = text.encode('utf8')
        tempFile.write(text)
    return 'File was written.'
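In the version on the slide, a failed download prints a message and then carries on to download.index(...), where the unassigned name download raises a NameError. One possible way around this, sketched for Python 3 (the slides use Python 2), is to return early with a message. This is an illustration, not the course's own solution: the download step is passed in as the hypothetical parameter fetch so that the example runs offline, and guten_loader, fake_fetch, broken_fetch and the example.org URL are likewise hypothetical.

```python
import os
import tempfile

START = '*** START OF THIS PROJECT GUTENBERG EBOOK'
END = '*** END OF THIS PROJECT GUTENBERG EBOOK'

def guten_loader(url, name, fetch):
    """Fetch url with fetch(url), trim the Gutenberg wrapper, save to name."""
    try:
        download = fetch(url)
    except Exception as err:
        # stop here: without a download there is nothing to slice or save
        return 'Download failed: ' + str(err)
    try:
        lineIndex = download.index(START)
        startIndex = download.index('\n', lineIndex)
        endIndex = download.index(END)
    except ValueError:
        # str.index raises ValueError when a marker is absent
        return 'Markers not found; nothing was written.'
    with open(name, 'w', encoding='utf8') as tempFile:
        tempFile.write(download[startIndex:endIndex])
    return 'File was written.'

def fake_fetch(url):
    # stands in for a successful download
    return 'License\n' + START + ' DEMO ***\nStory text.\n' + END + ' DEMO ***\n'

def broken_fetch(url):
    # stands in for a download that cannot connect
    raise IOError('no connection')

with tempfile.TemporaryDirectory() as sandbox:
    name = os.path.join(sandbox, 'Eyes.txt')
    demo_url = 'http://example.org/demo.txt'
    ok = guten_loader(demo_url, name, fake_fetch)
    with open(name, 'r', encoding='utf8') as tempFile:
        saved = tempFile.read()
    failed = guten_loader(demo_url, name, broken_fetch)
    no_markers = guten_loader(demo_url, name, lambda u: '<html>captcha</html>')

print(ok, failed, no_markers)
```

The third call shows that an HTML captcha page, which has no markers, is reported rather than saved.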
Call the function as before

(>>> import textProc)
>>> reload(textProc)
>>> from textProc import gutenLoader
>>> print gutenLoader(url, name)
Next time
We more or less did Practice 1 today. Do Practice 2.
Other sources of flat text.