Published by Amy Ball. Modified over 6 years ago.
1
Flat text Day 6 - 9/12/16 LING 3820 & 6820 Natural Language Processing
Harry Howard Tulane University
2
Course organization http://www.tulane.edu/~howard/NLP/
Schedule of assignments
Is there anyone here that wasn't here last week?
NLP, Prof. Howard, Tulane University 12-Sep-2016
3
Review
The quiz was the review.
4
5. Flat text
Now that you have gotten a taste of Python, let us turn to the main course, textual computing or the computational analysis of text. But we do not have a text to work with yet, so let’s go and find one.
5
7.1. How to get a text from an on-line archive
The first step is to figure out where to put the file.
6
How to navigate folders with os
# check your current working directory in Python
>>> import os
>>> os.getcwd()
'/Users/harryhow/Documents/pyScripts'
>>> os.listdir('.')
# if the path is not to your pyScripts folder, then change it:
>>> os.chdir('/Users/{your_user_name}/Documents/pyScripts')
# if you have no pyScripts folder
>>> os.chdir('/Users/{your_user_name}/Documents/')
>>> os.makedirs('pyScripts')
>>> os.path.exists('/Users/{your_user_name}/Documents/pyScripts')
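The folder steps can also be collected into one small helper. This is only a sketch, assuming Python 3.2 or later (for `exist_ok`); the name `ensure_folder` is made up for illustration, and the demo runs in a throwaway temporary directory rather than a real Documents folder.

```python
import os
import tempfile

def ensure_folder(parent, name="pyScripts"):
    """Create parent/name if needed, make it the working directory, return its path."""
    folder = os.path.join(parent, name)
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    os.chdir(folder)
    return folder

# Demonstrate in a temporary directory (an assumption made for safety).
original_cwd = os.getcwd()
demo_parent = tempfile.mkdtemp()
folder = ensure_folder(demo_parent)
folder_exists = os.path.exists(folder)
in_folder = os.path.samefile(os.getcwd(), folder)
os.chdir(original_cwd)  # put the working directory back where it was
```

With a real account, the call would look like `ensure_folder('/Users/{your_user_name}/Documents')`, with your own user name filled in.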
7
Project Gutenberg http://www.gutenberg.org/ebooks/28554
8
How to download a file with requests and convert it to a string with .text
>>> import requests
>>> url = ' txt'
>>> download = requests.get(url).text
# find out about it
>>> type(download)
>>> len(download) # 35739?
>>> download[:150]
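If `requests` is not installed, the same download can be sketched with the standard library, where `read()` returns bytes that are then decoded into a string. This swaps `urllib.request` in for `requests` and assumes Python 3 and a UTF-8 file; the function name `fetch_text` and its parameters are hypothetical.

```python
from urllib.request import urlopen

def fetch_text(url, encoding="utf-8", timeout=10):
    """Download url and return its contents as a string."""
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()      # bytes, exactly as sent by the server
    return raw.decode(encoding)    # str, assuming the stated encoding is right

# Usage, with the address of a plain-text file stored in url:
#     download = fetch_text(url)
#     download[:150]
```

Nothing is downloaded until `fetch_text` is called, so the definition can sit in a script without touching the network.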
9
How to save a file to your hard drive
# it is assumed that Python is looking at your pyScripts folder
>>> tempF = open('Wub.txt','w')
>>> tempF.write(download.encode('utf8'))
>>> tempF.close()
>>> tempF
10
How to read a file from your hard drive
>>> tempF = open('Wub.txt','r')
>>> doc = tempF.read()
>>> tempF.close()
# these can be combined:
>>> doc = open('Wub.txt', 'r').read()
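Saving and reading can both be written with a `with` block, which closes the file automatically. The sketch below assumes Python 3, where the encoding is passed to `open()` and no `.encode('utf8')` call is needed before writing; the helper names `save_text` and `load_text` and the sample sentence are invented for the demo.

```python
import os
import tempfile

def save_text(path, text):
    """Write a string to path as UTF-8."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)

def load_text(path):
    """Read the whole UTF-8 file at path into one string."""
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read()

# Round trip in a temporary directory (an assumption, to avoid touching pyScripts).
demo_path = os.path.join(tempfile.mkdtemp(), "Wub.txt")
sample = "A made-up sentence with “curly quotes” in it."
save_text(demo_path, sample)
doc = load_text(demo_path)
```

Naming the encoding on both sides keeps non-ASCII characters such as curly quotes intact, whatever the platform default happens to be.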
11
Find out about it
>>> type(doc)
>>> len(doc)
>>> import chardet
>>> chardet.detect(doc)
12
How to slice away what you don’t need
# text is assumed to hold the downloaded string, e.g. text = download
>>> text.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
499
>>> lineIndex = text.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
>>> startIndex = text.index('\n',lineIndex)
>>> text[:startIndex]
>>> text.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
>>> endIndex = text.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
>>> story = text[startIndex:endIndex]
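The slicing steps can be tried out on a tiny made-up text before running them on a whole book. This is a sketch only: the function name `strip_gutenberg` and the sample string are invented, the final `.strip()` is an addition not in the slide, and the two marker strings are the ones used above, which may not match every Project Gutenberg file.

```python
START_MARKER = "*** START OF THIS PROJECT GUTENBERG EBOOK"
END_MARKER = "*** END OF THIS PROJECT GUTENBERG EBOOK"

def strip_gutenberg(text):
    """Return the part of text between the START line and the END line."""
    line_index = text.index(START_MARKER)       # raises ValueError if absent
    start_index = text.index("\n", line_index)  # end of the START line
    end_index = text.index(END_MARKER, start_index)
    return text[start_index:end_index].strip()  # trim the blank lines left over

# A miniature stand-in for a downloaded book (made up for the demo).
sample = (
    "Header and license notes\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "\n"
    "The story itself.\n"
    "\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "Footer and license notes\n"
)
story = strip_gutenberg(sample)
```

Because `index()` raises a `ValueError` when a marker is missing, a file with differently worded markers fails loudly instead of being cut in the wrong place.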
13
Now save it as “Wub.txt”
# it is assumed that Python is looking at your pyScripts folder
>>> tempFile = open('Wub.txt','w')
>>> tempFile.write(story.encode('utf8'))
>>> tempFile.close()
14
Homework Get another text from Project Gutenberg onto your computer.
(NOT YET) Turn the commands reviewed above into a function in a script that takes a url and the name of a text file as arguments and results in a Project Gutenberg file being saved to your pyScripts folder without the Project Gutenberg header & footer.
15
Next time Other sources of flat text