Flat text 2 Day 7 - 9/14/16 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University.



Course organization
http://www.tulane.edu/~howard/NLP/
1.1.7. Schedule of assignments
NLP, Prof. Howard, Tulane University 14-Sep-2016

Review
I am going to review everything, because I have expanded on what I said on Monday.

5. Flat text
The code for this chapter can be downloaded as nlp5.py, presumably to your Downloads folder, and then moved to pyScripts, from where you can open it in Spyder and run it line by line.

How to navigate folders with os
# check your current working directory in Python
>>> import os
>>> os.getcwd()
'/Users/harryhow/Documents/pyScripts'
>>> os.listdir('.')
# if the path is not to your pyScripts folder, then change it:
>>> os.chdir('/Users/{your_user_name}/Documents/pyScripts')
# if you have no pyScripts folder
>>> os.chdir('/Users/{your_user_name}/Documents/')
>>> os.makedirs('pyScripts')
>>> os.path.exists('/Users/{your_user_name}/Documents/pyScripts')
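The same folder check can be sketched in Python 3 without changing the working directory. This is an illustration only: it creates pyScripts inside a temporary directory (an assumption, so that the example leaves your Documents folder alone), and it passes exist_ok=True so that os.makedirs does not raise an error when the folder is already there.

```python
import os
import tempfile

# Assumption: a throwaway base folder stands in for /Users/{your_user_name}/Documents
base = tempfile.mkdtemp()
scripts = os.path.join(base, 'pyScripts')

# exist_ok=True makes the call safe to repeat if the folder already exists
os.makedirs(scripts, exist_ok=True)
os.makedirs(scripts, exist_ok=True)

print(os.path.exists(scripts))
print(os.listdir(base))
```

Building the path with os.path.join avoids typing the separators by hand, which matters if the script is later run on Windows.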

Project Gutenberg
http://www.gutenberg.org/ebooks/28554

How to download a file with requests and convert it to a string with .text
>>> import requests
>>> url = 'http://www.gutenberg.org/cache/epub/28554/pg28554.txt'
>>> download = requests.get(url).text
# find out about it
>>> type(download)
>>> len(download) # 35739?
>>> download[:150]
($> pip install chardet)
>>> import chardet
>>> chardet.detect(download)

How to start a function for recurring file operations
# In "textProc.py"
def gutenLoader(url, name):
    import requests
    download = requests.get(url).text

How to use try to catch errors
>>> try:
...     download = requests.get(url).text
... except:
...     print 'Download failed!'
...

Add the try block to gutenLoader():
# In "textProc.py"
def gutenLoader(url, name):
    import requests
    try:
        download = requests.get(url).text
    except:
        print 'Download failed!'
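One thing to watch in the version above: after 'Download failed!' is printed, the function carries on, so any later line that uses download will fail. A possible Python 3 sketch (not the course's code) returns None instead. The names guten_download and fetch are hypothetical; fetch is a hook added here only so that the example can be tried without a network connection.

```python
def guten_download(url, fetch=None):
    """Return the text at url, or None if the download fails.

    fetch is a hypothetical hook: any function that takes a URL and
    returns a string. It defaults to a requests-based download.
    """
    if fetch is None:
        import requests  # third-party; only needed for the default path

        def fetch(u):
            response = requests.get(u, timeout=30)
            response.raise_for_status()  # turn HTTP errors such as 404 into exceptions
            return response.text

    try:
        return fetch(url)
    except Exception as err:  # narrower than a bare except: lets Ctrl-C through
        print('Download failed!', err)
        return None


# Stand-in fetchers so the sketch can be tried offline
def broken_fetch(u):
    raise IOError('no connection')

print(guten_download('http://example.invalid/book.txt', fetch=broken_fetch))
print(guten_download('http://example.invalid/book.txt', fetch=lambda u: 'some text'))
```

The caller can then test the result with "if text is None:" before trying to slice or save it.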

Warning
Project Gutenberg keeps track of how frequently you access it and will ask you to prove that you are human with a captcha. You will know that this has happened if the text that you downloaded is actually a bunch of HTML, as illustrated in the appendix "A snippet of Project Gutenberg’s captcha page". Since requests does download a sort of text, it does not throw an exception.
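Because no exception is raised in this case, the check has to be made on the text itself. The helper below is a rough heuristic, not a documented test: the name looks_like_html is hypothetical, and the tags it looks for are an assumption about what an HTML page usually starts with, since the captcha page's markup is not reproduced here.

```python
def looks_like_html(text):
    """Guess whether a download is an HTML page rather than plain text."""
    # Only the beginning matters; lower-case it so <HTML> and <html> both match
    head = text[:500].lstrip().lower()
    return (head.startswith('<!doctype html')
            or head.startswith('<html')
            or '<head' in head)


page = '<!DOCTYPE html>\n<html><head><title>Are you human?</title></head></html>'
book = 'The Project Gutenberg EBook of Beyond Lies the Wub'

print(looks_like_html(page))
print(looks_like_html(book))
```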

How to save a file to your hard drive
>>> tempF = open('Wub.txt','w')
>>> tempF.write(download.encode('utf8'))
>>> tempF.close()
>>> tempF
>>> import os
>>> os.listdir('.')

How to read a file from your hard drive
>>> tempFile = open('Wub.txt','r')
>>> doc = tempFile.read()
>>> tempFile.close()
# these can be combined:
>>> doc = open('Wub.txt', 'r').read()

Find out about it
>>> type(doc)
>>> len(doc)
>>> import chardet
>>> chardet.detect(doc)
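What type() and len() report depends on whether you are looking at raw bytes or at decoded text. The sketch below assumes Python 3 (the slides use Python 2) and a small sample file standing in for Wub.txt. It uses only the standard library, so chardet is not needed.

```python
import os
import tempfile

# Assumption: a small sample file stands in for Wub.txt
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('na\u00efve Wub')      # contains one non-ASCII character

with open(path, 'rb') as f:        # 'rb' returns the raw bytes on disk
    raw = f.read()
doc = raw.decode('utf-8')          # decoding turns bytes into a str

print(type(raw), len(raw))         # length counted in bytes
print(type(doc), len(doc))         # length counted in characters
```

The two lengths differ because the one non-ASCII character takes two bytes in UTF-8.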

How to read from and write to a file
>>> doc1 = open('Wub.txt', 'r').read()
>>> tempText = doc1.replace('Gutenberg', 'GUTENBERG')
>>> tempText = tempText.encode('utf8')
>>> tempFile = open('Wub2.txt','w')
>>> tempFile.write(tempText)
>>> tempFile.close()
# examine result
>>> doc3 = open('Wub2.txt', 'r').read()
>>> doc3[:150]

How to simplify file operations with the with statement
>>> with open('Wub.txt','r') as tempFile:
...     text = tempFile.read()
...     text = text.replace('Gutenberg', 'GUTENBERG')
...
>>> with open('Wub3.txt','w') as tempFile:
...     tempFile.write(text)
...
# test
>>> doc4 = open('Wub3.txt', 'r').read()
>>> doc4[:150]
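For comparison, here is how the same read, replace and write cycle might look in Python 3, where open() accepts an encoding argument and the separate encode('utf8') step is no longer needed. This is a sketch under assumptions: the files live in a temporary folder, and a single sample line stands in for the downloaded book.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
source = os.path.join(folder, 'Wub.txt')
target = os.path.join(folder, 'Wub3.txt')

# Assumption: one sample line stands in for the downloaded book
with open(source, 'w', encoding='utf-8') as temp_file:
    temp_file.write('The Project Gutenberg EBook of Beyond Lies the Wub')

# The file is closed automatically when each with block ends
with open(source, 'r', encoding='utf-8') as temp_file:
    text = temp_file.read()
text = text.replace('Gutenberg', 'GUTENBERG')

with open(target, 'w', encoding='utf-8') as temp_file:
    temp_file.write(text)

with open(target, 'r', encoding='utf-8') as temp_file:
    print(temp_file.read()[:150])
```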

Add the with block to gutenLoader():
# In "textProc.py"
def gutenLoader(url, name):
    import requests
    try:
        download = requests.get(url).text
    except:
        print 'Download failed!'
    with open(name,'w') as tempFile:
        download = download.encode('utf8')
        tempFile.write(download)

How to refresh your script with reload()
(>>> import textProc)
>>> reload(textProc)
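In Python 3, reload() is no longer a built-in and lives in the importlib module. The sketch below shows the edit-then-reload cycle on a throwaway module. The module name demo_textproc and its VERSION variable are hypothetical stand-ins for textProc.py, written to a temporary folder so that nothing in pyScripts is touched.

```python
import importlib
import os
import sys
import tempfile

# Assumption: a throwaway module named demo_textproc stands in for textProc.py
folder = tempfile.mkdtemp()
module_path = os.path.join(folder, 'demo_textproc.py')
sys.path.insert(0, folder)

with open(module_path, 'w') as f:
    f.write('VERSION = 1\n')
importlib.invalidate_caches()
import demo_textproc
first = demo_textproc.VERSION

# Edit the file, then reload the module that is already imported
with open(module_path, 'w') as f:
    f.write('VERSION = 22  # edited\n')
importlib.reload(demo_textproc)
second = demo_textproc.VERSION

print(first, second)
```

Names brought in with "from module import name" are not refreshed by a reload, so that import has to be repeated afterwards.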

How to call your function
>>> url = 'http://www.gutenberg.org/cache/epub/31516/pg31516.txt'
>>> name = 'Eyes.txt'
>>> from textProc import gutenLoader
>>> gutenLoader(url, name)

How to get your function to communicate with the outside world with return
# In "textProc.py"
def gutenLoader(url, name):
    import requests
    try:
        download = requests.get(url).text
    except:
        print 'Download failed!'
    with open(name,'w') as tempFile:
        download = download.encode('utf8')
        tempFile.write(download)
    return 'File was written.'

Call it by way of print
>>> reload(textProc)
>>> from textProc import gutenLoader
>>> print gutenLoader(url, name)
# open it
>>> with open('Wub.txt','r') as tempFile:
...     download = tempFile.read()
...

How to slice away what you don’t need
>>> download.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
499
>>> lineIndex = download.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
>>> startIndex = download.index('\n',lineIndex)
>>> download[:startIndex]
>>> download.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
>>> endIndex = download.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
>>> text = download[startIndex:endIndex]
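Note that index() raises a ValueError when a marker is missing, which can happen if a file words its header differently or if the download was a captcha page. The sketch below uses find() instead, which returns -1, and falls back to the whole string. It assumes Python 3. The function name slice_body and the sample string are made up for illustration, and the extra strip() call, which trims the blank space left around the body, is an addition to the steps shown above.

```python
START = '*** START OF THIS PROJECT GUTENBERG EBOOK'
END = '*** END OF THIS PROJECT GUTENBERG EBOOK'


def slice_body(download):
    """Return the text between the START line and the END marker.

    Falls back to the whole string if the markers cannot be located.
    """
    line_index = download.find(START)
    end_index = download.find(END)
    if line_index == -1 or end_index == -1:
        return download
    start_index = download.find('\n', line_index)  # end of the START line
    if start_index == -1 or start_index > end_index:
        return download
    return download[start_index:end_index].strip()


# Made-up sample standing in for a downloaded book
sample = ('Header text\n'
          '*** START OF THIS PROJECT GUTENBERG EBOOK BEYOND LIES THE WUB ***\n'
          'First line of the story.\n'
          'Last line of the story.\n'
          '*** END OF THIS PROJECT GUTENBERG EBOOK BEYOND LIES THE WUB ***\n'
          'License text\n')

print(slice_body(sample))
```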

Add this code to the function
# In "textProc.py"
def gutenLoader(url, name):
    import requests
    try:
        download = requests.get(url).text
    except:
        print 'Download failed!'
    lineIndex = download.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
    startIndex = download.index('\n',lineIndex)
    endIndex = download.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
    text = download[startIndex:endIndex]
    with open(name,'w') as tempFile:
        text = text.encode('utf8')
        tempFile.write(text)
    return 'File was written.'

Call the function as before
(>>> import textProc)
>>> reload(textProc)
>>> from textProc import gutenLoader
>>> print gutenLoader(url, name)

Next time
We more or less did Practice 1 today. Do Practice 2.
Other sources of flat text.