Flat text 2 Day 7 - 9/14/16 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University.


Course organization
http://www.tulane.edu/~howard/NLP/
1.1.7. Schedule of assignments
NLP, Prof. Howard, Tulane University 14-Sep-2016

Review
I am going to review everything, because I have expanded on what I said Monday.

5. Flat text
The code for this chapter can be downloaded as nlp5.py, presumably to your Downloads folder, and then moved to pyScripts, from where you can open it in Spyder and run it line by line.

How to navigate folders with os
# check your current working directory in Python
>>> import os
>>> os.getcwd()
'/Users/harryhow/Documents/pyScripts'
>>> os.listdir('.')
# if the path is not to your pyScripts folder, then change it:
>>> os.chdir('/Users/{your_user_name}/Documents/pyScripts')
# if you have no pyScripts folder
>>> os.chdir('/Users/{your_user_name}/Documents/')
>>> os.makedirs('pyScripts')
>>> os.path.exists('/Users/{your_user_name}/Documents/pyScripts')
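The folder check above can be folded into one helper. This is only a sketch, written for Python 3 (the slides use Python 2). The name ensure_folder is made up for illustration, and the demo uses a throwaway temporary folder rather than a real Documents path.

```python
import os
import tempfile

def ensure_folder(path):
    """Create path if it is missing, then make it the working directory."""
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    os.chdir(path)
    return os.getcwd()

# Demo in a temporary location (an assumption, not the course's real path).
base = tempfile.mkdtemp()
scripts = os.path.join(base, 'pyScripts')
cwd = ensure_folder(scripts)
print(os.path.exists(scripts))
```

Calling ensure_folder() a second time on the same path is harmless, because exist_ok=True suppresses the error that os.makedirs() would otherwise raise for an existing folder.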

Project Gutenberg
http://www.gutenberg.org/ebooks/28554

How to download a file with requests and convert it to a string with .text
>>> import requests
>>> url = 'http://www.gutenberg.org/cache/epub/28554/pg28554.txt'
>>> download = requests.get(url).text
# find out about it
>>> type(download)
>>> len(download) # 35739?
>>> download[:150]
# install chardet from the shell if you do not have it yet
$> pip install chardet
>>> import chardet
>>> chardet.detect(download)

How to start a function for recurring file operations
# In "textProc.py"
def gutenLoader(url, name):
    import requests
    download = requests.get(url).text

How to use try to catch errors
>>> try:
...     download = requests.get(url).text
... except:
...     print 'Download failed!'
...

Add the try block to gutenLoader():
# In "textProc.py"
def gutenLoader(url, name):
    import requests
    try:
        download = requests.get(url).text
    except:
        print 'Download failed!'
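A bare except: also swallows typos and keyboard interrupts, so it can help to name the exception you expect. The sketch below is for Python 3 and swaps in the standard library's urllib for requests so that it runs without a network connection; with requests, the analogous class to catch would be requests.exceptions.RequestException. The function name fetch_text and the file:// demo URLs are assumptions made for illustration.

```python
import pathlib
import tempfile
import urllib.error
import urllib.request

def fetch_text(url):
    """Return the decoded body of url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url) as response:
            return response.read().decode('utf8')
    except urllib.error.URLError as err:
        # Only download problems land here; other bugs still surface.
        print('Download failed!', err)
        return None

# Network-free demo: a file:// URL stands in for a web address.
sample = pathlib.Path(tempfile.mkdtemp()) / 'sample.txt'
sample.write_text('The Wub', encoding='utf8')
good = fetch_text(sample.as_uri())
bad = fetch_text((sample.parent / 'missing.txt').as_uri())
```

Returning None on failure lets the caller decide what to do next, instead of carrying on with a variable that was never assigned.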

Warning
Project Gutenberg keeps track of how frequently you access it and will ask you to prove that you are human with a captcha. You will know that this has happened if the text that you downloaded is actually a bunch of HTML, as illustrated in the appendix, "A snippet of Project Gutenberg’s captcha page". Because requests still receives text of a sort, it does not raise an exception.
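Since no exception is raised, one way to notice the captcha page is to look at how the download begins. This is a rough heuristic sketched for Python 3, not something the slides provide. The function name looks_like_html and both sample strings are invented for the demo.

```python
def looks_like_html(text):
    """Heuristic: does the downloaded text start like an HTML page?"""
    # Drop a leading byte-order mark and whitespace before checking.
    head = text.lstrip('\ufeff \t\r\n')[:200].lower()
    return head.startswith('<!doctype html') or head.startswith('<html')

# Invented samples: a captcha-like page and the start of a plain-text book.
page = '<!DOCTYPE html>\n<html><body>Are you human?</body></html>'
book = 'The Project Gutenberg EBook of a sample title'
print(looks_like_html(page), looks_like_html(book))
```

A check like this could run right after the download, before anything is written to disk.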

How to save a file to your hard drive
>>> tempF = open('Wub.txt','w')
>>> tempF.write(download.encode('utf8'))
>>> tempF.close()
>>> tempF
>>> import os
>>> os.listdir('.')

How to read a file from your hard drive
>>> tempFile = open('Wub.txt','r')
>>> doc = tempFile.read()
>>> tempFile.close()
# these can be combined:
>>> doc = open('Wub.txt', 'r').read()

Find out about it
>>> type(doc)
>>> len(doc)
>>> import chardet
>>> chardet.detect(doc)
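chardet makes a statistical guess about the encoding. If it is not installed, a much cruder stand-in is to try a short list of candidate encodings in order. The sketch below is for Python 3 and uses only the standard library. The function name decode_bytes and the candidate list are assumptions, and this is not how chardet works internally.

```python
def decode_bytes(raw, encodings=('utf-8', 'latin-1')):
    """Try each encoding in turn; return (text, encoding_that_worked)."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue  # this candidate does not fit, try the next one
    raise ValueError('none of the candidate encodings fit')

# The same word stored under two different encodings.
utf8_bytes = 'café'.encode('utf-8')
latin_bytes = 'café'.encode('latin-1')
print(decode_bytes(utf8_bytes))
print(decode_bytes(latin_bytes))
```

Order matters: latin-1 accepts any byte sequence, so it only makes sense as the last candidate.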

How to read from and write to a file
>>> doc1 = open('Wub.txt', 'r').read()
>>> tempText = doc1.replace('Gutenberg', 'GUTENBERG')
>>> tempText = tempText.encode('utf8')
>>> tempFile = open('Wub2.txt','w')
>>> tempFile.write(tempText)
>>> tempFile.close()
# examine result
>>> doc3 = open('Wub2.txt', 'r').read()
>>> doc3[:150]

How to simplify file operations with the with statement
>>> with open('Wub.txt','r') as tempFile:
...     text = tempFile.read()
...     text = text.replace('Gutenberg', 'GUTENBERG')
...
>>> with open('Wub3.txt','w') as tempFile:
...     tempFile.write(text)
...
# test
>>> doc4 = open('Wub3.txt', 'r').read()
>>> doc4[:150]
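In Python 3 the same pattern usually passes the encoding to open() rather than calling encode() by hand. The sketch below assumes Python 3 and writes its own small stand-in file to a temporary folder, so the file contents and paths are invented for the demo.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
source = os.path.join(folder, 'Wub.txt')
target = os.path.join(folder, 'Wub3.txt')

# Stand-in content so the sketch runs without a download.
with open(source, 'w', encoding='utf8') as temp_file:
    temp_file.write('Project Gutenberg’s sample text')

# Read, edit, and write; each file is closed when its block ends.
with open(source, 'r', encoding='utf8') as temp_file:
    text = temp_file.read().replace('Gutenberg', 'GUTENBERG')

with open(target, 'w', encoding='utf8') as temp_file:
    temp_file.write(text)

with open(target, 'r', encoding='utf8') as temp_file:
    result = temp_file.read()
    was_closed_inside = temp_file.closed  # still open inside the block

print(result, temp_file.closed)
```

The point of with is the last line: the file object is closed automatically once the block is left, even if an error occurs inside it.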

Add the with block to gutenLoader():
# In "textProc.py"
def gutenLoader(url, name):
    import requests
    try:
        download = requests.get(url).text
    except:
        print 'Download failed!'
    with open(name,'w') as tempFile:
        download = download.encode('utf8')
        tempFile.write(download)

How to refresh your script with reload()
# import textProc first if you have not done so already
>>> import textProc
>>> reload(textProc)
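In Python 3, reload() is no longer a built-in and lives in importlib. The sketch below assumes Python 3 and builds a tiny throwaway module in a temporary folder. The module name demo_textproc and its greet() function are invented for the demo. It also shows why a name brought in with from ... import has to be imported again after a reload.

```python
import importlib
import os
import sys
import tempfile

# Create a throwaway module on disk (hypothetical name: demo_textproc).
folder = tempfile.mkdtemp()
sys.path.insert(0, folder)
module_path = os.path.join(folder, 'demo_textproc.py')
with open(module_path, 'w') as f:
    f.write('def greet():\n    return "first version"\n')

importlib.invalidate_caches()
import demo_textproc
from demo_textproc import greet
before = demo_textproc.greet()

# Edit the file, as you would in Spyder, then reload the module.
with open(module_path, 'w') as f:
    f.write('def greet():\n    return "second, edited version"\n')
importlib.reload(demo_textproc)

after = demo_textproc.greet()  # sees the edited code
stale = greet()                # the old name still points at the old function
print(before, '|', after, '|', stale)
```

The stale result is the reason for re-running the from ... import line after every reload.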

How to call your function
>>> url = 'http://www.gutenberg.org/cache/epub/31516/pg31516.txt'
>>> name = 'Eyes.txt'
>>> from textProc import gutenLoader
>>> gutenLoader(url, name)

How to get your function to communicate with the outside world with return
# In "textProc.py"
def gutenLoader(url, name):
    import requests
    try:
        download = requests.get(url).text
    except:
        print 'Download failed!'
    with open(name,'w') as tempFile:
        download = download.encode('utf8')
        tempFile.write(download)
    return 'File was written.'

Call it by way of print
>>> reload(textProc)
>>> from textProc import gutenLoader
>>> print gutenLoader(url, name)
# open it
>>> with open('Wub.txt','r') as tempFile:
...     download = tempFile.read()
...

How to slice away what you don’t need
>>> download.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
499
>>> lineIndex = download.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
>>> startIndex = download.index('\n',lineIndex)
>>> download[:startIndex]
>>> download.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
>>> endIndex = download.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
>>> text = download[startIndex:endIndex]
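index() raises a ValueError when a marker is missing, which is what would happen if the download were a captcha page instead of a book. The sketch below, written for Python 3, uses find() so that a missing marker can be reported calmly. The function name strip_boilerplate and the miniature sample text are assumptions made for illustration.

```python
START = '*** START OF THIS PROJECT GUTENBERG EBOOK'
END = '*** END OF THIS PROJECT GUTENBERG EBOOK'

def strip_boilerplate(download):
    """Return the text between the START line and the END marker, or None."""
    line_index = download.find(START)
    end_index = download.find(END)
    if line_index == -1 or end_index == -1:
        return None  # at least one marker is missing
    # Skip to the end of the START line, as the slide does.
    start_index = download.find('\n', line_index)
    if start_index == -1 or start_index > end_index:
        return None
    return download[start_index:end_index].strip()

# Invented miniature of a Project Gutenberg file.
sample = ('License header\n'
          '*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n'
          'Body of the book.\n'
          '*** END OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n'
          'License footer\n')
print(strip_boilerplate(sample))
```

Returning None for a malformed download leaves it to the caller to decide whether to retry or to stop.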

Add this code to the function
# In "textProc.py"
def gutenLoader(url, name):
    import requests
    try:
        download = requests.get(url).text
    except:
        print 'Download failed!'
    lineIndex = download.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
    startIndex = download.index('\n',lineIndex)
    endIndex = download.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
    text = download[startIndex:endIndex]
    with open(name,'w') as tempFile:
        text = text.encode('utf8')
        tempFile.write(text)
    return 'File was written.'

Call the function as before
# import textProc first if you have not done so already
>>> import textProc
>>> reload(textProc)
>>> from textProc import gutenLoader
>>> print gutenLoader(url, name)

Next time
We more or less did Practice 1 today. Do Practice 2.
Other sources of flat text.