Flat text Day 6 - 9/12/16 LING 3820 & 6820 Natural Language Processing


Flat text Day 6 - 9/12/16 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University

Course organization
http://www.tulane.edu/~howard/NLP/
1.1.7. Schedule of assignments
Is there anyone here that wasn't here last week?
NLP, Prof. Howard, Tulane University 12-Sep-2016

Review
The quiz was the review.

5. Flat text
Now that you have gotten a taste of Python, let us turn to the main course: textual computing, or the computational analysis of text. But we do not have a text to work with yet, so let's go and find one.

7.1. How to get a text from an on-line archive
The first step is to figure out where to put the file.

How to navigate folders with os
# check your current working directory in Python
>>> import os
>>> os.getcwd()
'/Users/harryhow/Documents/pyScripts'
>>> os.listdir('.')
# if the path is not to your pyScripts folder, then change it:
>>> os.chdir('/Users/{your_user_name}/Documents/pyScripts')
# if you have no pyScripts folder, create it and check that it exists:
>>> os.chdir('/Users/{your_user_name}/Documents/')
>>> os.makedirs('pyScripts')
>>> os.path.exists('/Users/{your_user_name}/Documents/pyScripts')
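The folder check above can be wrapped in a small helper. This is a minimal sketch that assumes Python 3 (the exist_ok argument is not available in Python 2). The function name ensure_folder is hypothetical, and a temporary directory stands in for your Documents folder so that the example runs anywhere.

```python
import os
import tempfile


def ensure_folder(parent, name="pyScripts"):
    """Create parent/name if it is missing and return its full path."""
    path = os.path.join(parent, name)
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return path


# A temporary directory stands in for /Users/{your_user_name}/Documents
demo_parent = tempfile.mkdtemp()
folder = ensure_folder(demo_parent)

print(os.path.exists(folder))   # does the folder exist now?
print(os.listdir(demo_parent))  # what is inside the parent folder?
```

Calling ensure_folder a second time with the same arguments is harmless, which is the main difference from a bare os.makedirs('pyScripts').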

Project Gutenberg
http://www.gutenberg.org/ebooks/28554

How to download a file with requests and convert it to a string with .text
>>> import requests
>>> url = 'http://www.gutenberg.org/cache/epub/28554/pg28554.txt'
>>> download = requests.get(url).text
# find out about it
>>> type(download)
>>> len(download) # 35739?
>>> download[:150]
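The download step can also be written as a function that checks whether the request succeeded before returning the text. The sketch below is one possible shape, not the course's own code. The name download_text and its get parameter are hypothetical, and FakeResponse is a stand-in that lets the example run without a network connection or the requests package.

```python
def download_text(url, get=None):
    """Return the body of url as a string.

    get is a hypothetical hook for swapping in another fetcher;
    by default the third-party requests library is used.
    """
    if get is None:
        import requests  # third-party; imported only when actually needed
        get = requests.get
    response = get(url)
    response.raise_for_status()  # raise an error on a failed download
    return response.text


class FakeResponse:
    """Stand-in for a requests response so the demo runs offline."""
    text = "sample text"

    def raise_for_status(self):
        pass  # the fake download always succeeds


sample = download_text("http://example.invalid/sample.txt",
                       get=lambda url: FakeResponse())
print(type(sample), len(sample))
```

For a real download, call download_text(url) with the Project Gutenberg address and no second argument.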

How to save a file to your hard drive
# it is assumed that Python is looking at your pyScripts folder
# encode() turns the string into UTF-8 bytes; this works as written in Python 2,
# while Python 3 needs mode 'wb' to write bytes
>>> tempF = open('Wub.txt','w')
>>> tempF.write(download.encode('utf8'))
>>> tempF.close()
# shows the (now closed) file object
>>> tempF

How to read a file from your hard drive
>>> tempF = open('Wub.txt','r')
>>> doc = tempF.read()
>>> tempF.close()
# these can be combined:
>>> doc = open('Wub.txt', 'r').read()
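Saving and reading can be sketched in Python 3 with a with block, which closes the file automatically. This is an illustration under stated assumptions rather than the course's method: the helper names save_text and load_text are hypothetical, the encoding is handed to open() instead of calling encode(), and a temporary folder replaces pyScripts so that the example is self-contained.

```python
import os
import tempfile


def save_text(text, path):
    """Write a string to path as UTF-8."""
    with open(path, "w", encoding="utf8") as f:
        f.write(text)


def load_text(path):
    """Read a UTF-8 file back into a string."""
    with open(path, "r", encoding="utf8") as f:
        return f.read()


# A temporary folder stands in for your pyScripts folder
demo_path = os.path.join(tempfile.mkdtemp(), "Wub.txt")
save_text("caf\u00e9 wub\n", demo_path)
doc = load_text(demo_path)
print(type(doc), len(doc))
```

Naming the encoding on both the write and the read keeps non-ASCII characters intact regardless of the operating system's default encoding.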

Find out about it
>>> type(doc)
>>> len(doc)
>>> import chardet
>>> chardet.detect(doc)

How to slice away what you don't need
# text is assumed to hold the document as a string (the slides do not show where it is assigned)
>>> text.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
499
>>> lineIndex = text.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
# find the end of the START line, so that the story begins after it
>>> startIndex = text.index('\n',lineIndex)
# everything up to startIndex is the header
>>> text[:startIndex]
>>> text.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
>>> endIndex = text.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
>>> story = text[startIndex:endIndex]
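The slicing logic can be tried out on a tiny made-up string before running it on a whole book. The sketch below follows the same steps, with two labeled assumptions: the function name strip_gutenberg is hypothetical, and the search for the END marker starts at the start index rather than at the beginning of the text. The sample string is invented for illustration and is not taken from any Project Gutenberg file.

```python
def strip_gutenberg(text,
                    start_marker="*** START OF THIS PROJECT GUTENBERG EBOOK",
                    end_marker="*** END OF THIS PROJECT GUTENBERG EBOOK"):
    """Return the part of text between the START line and the END line."""
    line_index = text.index(start_marker)            # ValueError if the marker is missing
    start_index = text.index("\n", line_index)       # end of the START line
    end_index = text.index(end_marker, start_index)  # beginning of the END line
    return text[start_index:end_index]


# Invented sample with the same layout as a Project Gutenberg file
sample = ("header line\n"
          "*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n"
          "Once upon a time.\n"
          "*** END OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n"
          "footer line\n")

story = strip_gutenberg(sample)
print(repr(story))
```

The markers are parameters because the exact wording of the START and END lines may differ from file to file.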

Now save it as “Wub.txt”
# it is assumed that Python is looking at your pyScripts folder
# this overwrites the unsliced download saved earlier under the same name
>>> tempFile = open('Wub.txt','w')
>>> tempFile.write(story.encode('utf8'))
>>> tempFile.close()

Homework
Get another text from Project Gutenberg onto your computer.
(NOT YET) Turn the commands reviewed above into a function in a script that takes a URL and the name of a text file as arguments and saves the Project Gutenberg file to your pyScripts folder without the Project Gutenberg header and footer.

Next time
Other sources of flat text