NLTK & Python Day 5 LING Computational Linguistics Harry Howard Tulane University
31-Aug-2009 LING , Prof. Howard, Tulane University
Course organization
I have requested that Python and NLTK be installed on the computers in this room.
NLPP §1.2 A Closer Look at Python: Texts as Lists of Words
Variables
variable = expression
>>> my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode',
...            'forth', 'from', 'Camelot', '.']
>>> noun_phrase = my_sent[1:4]
>>> noun_phrase
['bold', 'Sir', 'Robin']
>>> wOrDs = sorted(noun_phrase)
>>> wOrDs
['Robin', 'Sir', 'bold']
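The order of the sorted result can look odd at first: Python compares strings by character code, so every capitalized word sorts before every lowercase one. The sketch below reuses the slide's list; the `key=str.lower` variant is an addition for illustration and is not on the slide.

```python
# Sketch: why sorted() puts 'Robin' and 'Sir' before 'bold'.
my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode',
           'forth', 'from', 'Camelot', '.']
noun_phrase = my_sent[1:4]          # items at positions 1, 2 and 3

# Default sort compares character codes: 'R' (82) < 'S' (83) < 'b' (98).
default_order = sorted(noun_phrase)

# Assumption for illustration: ignore case by sorting on lowercased copies.
caseless_order = sorted(noun_phrase, key=str.lower)

print(default_order)
print(caseless_order)
```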
How to name variables
Valid names (or identifiers) ...
must start with a letter or an underscore, optionally followed by letters, digits, or underscores;
are case-sensitive;
cannot contain whitespace (use an underscore instead) or a dash (it means minus);
cannot be a reserved word.
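These naming rules can be checked from Python itself. The following is a minimal sketch for Python 3; `is_valid_name` and the list of candidate names are hypothetical, made up for illustration.

```python
import keyword

def is_valid_name(name):
    """Return True if `name` could be used as a Python variable name."""
    # str.isidentifier() checks the spelling rules (first character, no
    # spaces or dashes); keyword.iskeyword() rules out reserved words.
    return name.isidentifier() and not keyword.iskeyword(name)

# Hypothetical candidates: the first three are fine, the rest break a rule.
candidates = ['my_sent', 'wOrDs', 'text2', '2text', 'my sent', 'my-sent', 'for']
for name in candidates:
    print(name, is_valid_name(name))
```

Case sensitivity is not something this check can show: `wOrDs` and `words` are both valid, but they are two different variables.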
Strings
A string is a sequence of characters, such as an individual word. Several list operations (indexing, slicing, multiplication, concatenation) also work on strings.
Some operations and methods for strings
>>> name = 'Monty'
>>> name[0]
'M'
>>> name[:4]
'Mont'
>>> name * 2
'MontyMonty'
>>> name + '!'
'Monty!'
>>> ' '.join(['Monty', 'Python'])
'Monty Python'
>>> 'Monty Python'.split()
['Monty', 'Python']
NLPP §1.3. Computing with Language: Simple Statistics
Frequency distribution
What is a frequency distribution?
It tells us the frequency of each vocabulary item in a text. It is a "distribution" because it tells us how the total number of word tokens in the text is distributed across the vocabulary items.
What function in NLTK calculates it?
>>> fdist1 = FreqDist(text_name)
What expression lists the vocabulary items of the distribution?
>>> fdist1.keys()
(Recent NLTK versions also offer fdist1.most_common(n), which lists the n most frequent items with their counts.)
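To see what a frequency distribution holds without loading a corpus, here is a minimal sketch that uses the standard library's `Counter` as a stand-in for `FreqDist`. The toy token list is an assumption for illustration. In current NLTK releases `FreqDist` is built on `Counter`, so the same calls should carry over.

```python
from collections import Counter

# Toy text standing in for an NLTK text object (an assumption for illustration).
tokens = ['the', 'whale', 'and', 'the', 'sea', 'and', 'the', 'ship']

fdist = Counter(tokens)             # maps each vocabulary item to its count

print(fdist['the'])                 # frequency of one item
print(fdist.most_common(2))         # most frequent items with their counts
print(sum(fdist.values()) == len(tokens))  # counts add up to the token total
```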
Very frequent words
How would you describe the 50 most frequent elements in Moby Dick?
>>> fdist1.plot(50, cumulative=True)
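The cumulative plot answers the question "how much of the text do the top n words account for?". The same number can be computed directly. This is a sketch on a toy token list; the `coverage` helper is hypothetical and not part of NLTK.

```python
from collections import Counter

def coverage(tokens, n):
    """Fraction of all tokens accounted for by the n most frequent items."""
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(tokens)

# Toy text (an assumption for illustration).
tokens = ['the', 'whale', 'and', 'the', 'sea', 'and', 'the', 'ship']
print(coverage(tokens, 2))   # 'the' (3) + 'and' (2) out of 8 tokens
```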
Very infrequent words
Words that occur only once are called hapaxes.
>>> fdist1.hapaxes()
In Moby Dick: "lexicographer", "cetological", "contraband", "expostulations", and about 9,000 others.
How would you describe them?
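A hapax is simply a vocabulary item whose count is 1, so the idea behind `hapaxes()` can be sketched with a plain `Counter`. The toy token list is an assumption for illustration.

```python
from collections import Counter

# Toy text (an assumption for illustration).
tokens = ['the', 'whale', 'and', 'the', 'sea', 'and', 'the', 'ship']
counts = Counter(tokens)

# Keep only the items that occur exactly once.
hapaxes = sorted(w for w, c in counts.items() if c == 1)
print(hapaxes)
```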
Summary
                   Most frequent       Least frequent
Length             short               long
Meaning            very general        very specific
Coverage of text   large proportion    small proportion
Question
Which group would you look in to find words that help you understand what the text is about?
Neither.
Fine-grained word selection
Some Python expressions are based on set theory.
a) {w | w ∈ V & P(w)}
b) [w for w in V if p(w)], though this returns a list, not a set. (What's the difference?)
Real NLTK
>>> V = set(text1)
>>> long_words = [w for w in V if len(w) > 15]
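One way to explore the list-versus-set question is to run the same condition through both kinds of comprehension. On the slide, V is already a set, so the resulting list has no duplicates. The sketch below iterates over raw tokens instead (a toy list, assumed for illustration) so that the difference becomes visible.

```python
# Toy text (an assumption for illustration).
tokens = ['the', 'whale', 'and', 'the', 'sea', 'and', 'the', 'ship']

# List comprehension: order is kept and duplicates survive.
as_list = [w for w in tokens if len(w) == 3]

# Set comprehension: same condition, but duplicates collapse and order is not kept.
as_set = {w for w in tokens if len(w) == 3}

print(as_list)
print(sorted(as_set))   # sorted only to get a stable printout
```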
Finding words that characterize a text
Not too short (>?) and not too infrequent (>?)
>>> fdist1 = FreqDist(text1)
>>> informative_words = [w for w in V if len(w) > 7 and fdist1[w] > 7]
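The selection combines two tests per word: its length and its count in the frequency distribution. Below is a sketch of the same pattern on a toy token list, with `Counter` standing in for `FreqDist`. The thresholds are scaled down to suit the toy data and are an assumption, not the slide's values.

```python
from collections import Counter

# Toy text (an assumption for illustration).
tokens = ['the', 'whale', 'and', 'the', 'sea', 'and', 'the', 'ship',
          'whale', 'whale', 'cetology']
counts = Counter(tokens)
V = set(tokens)

# Thresholds scaled down for the toy text (the slide uses 7 and 7).
min_length, min_count = 4, 2
informative_words = sorted(w for w in V
                           if len(w) > min_length and counts[w] > min_count)
print(informative_words)
```

Here 'cetology' is long enough but too rare, and 'the' is frequent enough but too short, so only 'whale' passes both tests.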
Finding groups of words
What is the name for a sequence of two words?
Bigram ~ bigrams()
>>> list(bigrams(['more', 'is', 'said', 'than', 'done']))
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
(In recent NLTK versions bigrams() returns a generator, hence the call to list().)
What is the name for a sequence of words that occur together unusually often?
Collocation ~ collocations()
Collocations are essentially bigrams that occur more often than we would expect based on the frequency of the individual words.
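The phrase "more often than we would expect" can be made concrete by comparing a bigram's observed count with the count expected if its two words were independent. The sketch below is only a crude ratio on a toy token list, both assumed for illustration. It is not the measure that NLTK's `collocations()` uses, and `observed_over_expected` is a hypothetical helper.

```python
from collections import Counter

# Toy text (an assumption for illustration).
tokens = ['new', 'york', 'is', 'big', 'and', 'new', 'york', 'is', 'new']

# Bigrams: pair each word with the word that follows it.
pairs = list(zip(tokens, tokens[1:]))

word_counts = Counter(tokens)
pair_counts = Counter(pairs)

def observed_over_expected(w1, w2):
    """Observed bigram count divided by the count expected if w1 and w2
    were independent (a crude ratio, not NLTK's actual measure)."""
    expected = word_counts[w1] * word_counts[w2] / len(tokens)
    return pair_counts[(w1, w2)] / expected

print(pairs[:3])
print(observed_over_expected('new', 'york'))
```

A ratio well above 1 suggests that the pair co-occurs more often than its individual word frequencies alone would predict.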
Example
>>> text4.collocations()
Building collocations list
United States; fellow citizens; years ago; Federal Government; General Government;
American people; Vice President; Almighty God; Fellow citizens; Chief Magistrate;
Chief Justice; God bless; Indian tribes; public debt; foreign nations;
political parties; State governments; National Government; United Nations; public money
Counting Other Things
Next time
First quiz/project
NLPP: finish §1 and do all exercises; do up to Ex 8 in §2