NLTK & Python
Day 7
LING Computational Linguistics
Harry Howard
Tulane University
09-Sept-2009, LING, Prof. Howard, Tulane University

Course organization
I have requested that NLTK be installed on the computers in this room.
NLPP §2 Accessing text corpora and lexical resources
§2.1 Accessing text corpora
What's that word
What is a corpus (plural: corpora)?
"large bodies of linguistic data"
Some corpora in NLTK
- The Project Gutenberg electronic text archive
    - 25k free electronic books
- Web and chat text
- The Brown corpus
    - The first 1M-word electronic corpus, drawn from 500 sources
- The Reuters corpus
- The Inaugural Address corpus
- Annotated text corpora
- Corpora in other languages
Using corpora in NLTK
Only the texts in nltk.book come ready-made as NLTK text objects, so only they can be used directly with NLTK's text functions.
To use those functions on another corpus, wrap its words first:
    your_text_name = nltk.Text(corpus_name)
where corpus_name is a sequence of words, such as the result of gutenberg.words('austen-emma.txt').
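A minimal sketch of the wrapping step, assuming NLTK is installed. The token list here is made up for illustration so that no corpus download is needed; in class the words would come from a corpus such as gutenberg.

```python
import nltk

# Hypothetical hand-made token list standing in for the words of a corpus.
tokens = ["Call", "me", "Ishmael", ".", "Some", "years", "ago",
          ",", "call", "me", "late", "."]

# A plain Python list has no NLTK text methods.
list_has_concordance = hasattr(tokens, "concordance")

# Wrapping the list in nltk.Text adds them.
text = nltk.Text(tokens)
text_has_concordance = hasattr(text, "concordance")

n_tokens = len(text)     # number of tokens in the text
n_me = text.count("me")  # how often "me" occurs (case-sensitive)
```

After wrapping, calls such as text.concordance("me") are available on the new object, while the original list is unchanged.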
Basic corpus functions (Table 2.3)

Example                      Description
fileids()                    the files of the corpus
categories()                 the categories of the corpus
fileids([categories])        the files of the corpus corresponding to these categories
categories([fileids])        the categories of the corpus corresponding to these files
raw()                        the raw content of the corpus
raw(fileids=[f1,f2,f3])      the raw content of the specified files
raw(categories=[c1,c2])      the raw content of the specified categories
Basic corpus functions (Table 2.3, continued)

Example                      Description
words()                      the words of the whole corpus
words(fileids=[f1,f2,f3])    the words of the specified fileids
words(categories=[c1,c2])    the words of the specified categories
sents()                      the sentences of the whole corpus
sents(fileids=[f1,f2,f3])    the sentences of the specified fileids
sents(categories=[c1,c2])    the sentences of the specified categories
Code to get started
    >>> import nltk
    >>> from nltk.corpus import gutenberg
    >>> emma = gutenberg.words('austen-emma.txt')
    >>> emma = nltk.Text(emma)
    >>> emma.collocations()
    Frank Churchill; Miss Woodhouse; Miss Bates; Jane Fairfax; Miss
    Fairfax; young man; great deal; John Knightley; Maple Grove; Miss
    Smith; Miss Taylor; Robert Martin; Colonel Campbell; Box Hill; Harriet
    Smith; William Larkins; Brunswick Square; young lady; young woman;
    Miss Hawkins
Loading your own corpus (Table 2.3, continued)

Example                      Description
abspath(fileid)              the location of the file on disk
encoding(fileid)             the encoding of the file (if known)
open(fileid)                 open a stream for reading the given corpus file
root()                       the path to the root of the locally installed corpus
readme()                     the contents of the README file of the corpus
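To make the function names in Table 2.3 concrete, here is a toy stand-in written with the standard library only. ToyCorpus is a hypothetical class invented for this sketch, not NLTK's own corpus reader, and its tokenizer is deliberately crude. It only shows what fileids(), abspath(), raw() and words() return for a folder of plain-text files.

```python
import re
import tempfile
from pathlib import Path


class ToyCorpus:
    """Hypothetical stand-in for a plain-text corpus reader (not NLTK's class)."""

    def __init__(self, root):
        self._root = Path(root)

    def root(self):
        # The folder that holds the corpus files.
        return self._root

    def fileids(self):
        # The files of the corpus, by name.
        return sorted(p.name for p in self._root.glob("*.txt"))

    def abspath(self, fileid):
        # The location of one file on disk.
        return self._root / fileid

    def raw(self, fileids=None):
        # The raw text of the given files, or of the whole corpus.
        ids = self.fileids() if fileids is None else fileids
        return "".join(self.abspath(f).read_text(encoding="utf-8") for f in ids)

    def words(self, fileids=None):
        # Crude tokenizer: runs of word characters, or runs of punctuation.
        return re.findall(r"\w+|[^\w\s]+", self.raw(fileids))


# Build a two-file corpus in a temporary folder and query it.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("Hello, world.\n", encoding="utf-8")
    Path(tmp, "b.txt").write_text("A second file.\n", encoding="utf-8")
    corpus = ToyCorpus(tmp)
    file_ids = corpus.fileids()
    words_a = corpus.words(["a.txt"])
    all_raw = corpus.raw()
```

Here file_ids lists both file names, words_a holds the tokens of a.txt only, and all_raw is the text of the two files joined together.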
NLPP §2 Accessing text corpora and lexical resources
§2.2 Conditional frequency distributions
Back to frequency
FreqDist(mylist) counts the occurrences of each item in 'mylist'.
ConditionalFreqDist(mypairs) counts the occurrences of each pair of items in 'mypairs'. The first member of each pair is the condition and the second is an item of text, so the pairing might be author & word, genre & word, topic & word, etc.
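The difference between the two kinds of count can be sketched with the standard library alone. This is a stand-in for illustration, not NLTK's code: Counter plays the role of FreqDist, and a dictionary of Counters plays the role of ConditionalFreqDist. The word list and the (genre, word) pairs are made up.

```python
from collections import Counter, defaultdict

# Plain frequency: one count per item.
mylist = ["the", "cat", "saw", "the", "dog"]
freq = Counter(mylist)

# Conditional frequency: one separate count per condition.
# Each pair is (condition, item); here the condition is a genre.
mypairs = [
    ("news", "the"), ("news", "could"),
    ("romance", "the"), ("romance", "the"), ("romance", "might"),
]
cond_freq = defaultdict(Counter)
for condition, item in mypairs:
    cond_freq[condition][item] += 1

the_overall = freq["the"]                    # count in the flat list
the_in_romance = cond_freq["romance"]["the"] # count under one condition
conditions = sorted(cond_freq)               # the conditions seen
```

Looking up a condition first and a word second gives the count of that word within that condition only.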
An example
    >>> import nltk
    >>> from nltk.corpus import brown
    >>> cfd = nltk.ConditionalFreqDist(
    ...     (genre, word)
    ...     for genre in brown.categories()
    ...     for word in brown.words(categories=genre))
Next time
NLPP: §2.3ff
Do "Your Turn" up to p. 55
Exercises , 2.8.8