Python for NLP and the Natural Language Toolkit CS1573: AI Application Development, Spring 2003 (modified from Edward Loper’s notes)
Outline Review: Introduction to NLP (knowledge of language, ambiguity, representations and algorithms, applications) HW 2 discussion Tutorials: Basics, Probability
Python and Natural Language Processing
Python is a great language for NLP:
  Simple
  Easy to debug
    Exceptions
    Interpreted language
  Easy to structure
    Modules
    Object-oriented programming
  Powerful string manipulation
Modules and Packages
Python modules “package program code and data for reuse.” (Lutz)
  Similar to a library in C or a package in Java.
Python packages are hierarchical modules (i.e., modules that contain other modules).
Three commands for accessing modules:
  import
  from…import
  reload
Modules and Packages: import
The import command loads a module:
    # Load the regular expression module
    >>> import re
To access the contents of a module, use dotted names:
    # Use the search function from the re module (text is any string)
    >>> re.search(r'\w+', text)
To list the contents of a module, use dir:
    >>> dir(re)
    ['DOTALL', 'I', 'IGNORECASE', ...]
Modules and Packages: from…import
The from…import command loads individual functions and objects from a module:
    # Load the search function from the re module
    >>> from re import search
Once an individual function or object is loaded with from…import, it can be used directly:
    # Use the search function from the re module (text is any string)
    >>> search(r'\w+', text)
Import vs. from…import
import
  Keeps module functions separate from user functions.
  Requires the use of dotted names.
  Works with reload.
from…import
  Puts module functions and user functions together.
  More convenient names.
  Does not work with reload.
Modules and Packages: reload
If you edit a module, you must use the reload command before the changes become visible in Python:
    >>> import mymodule
    ...
    >>> reload(mymodule)
The reload command only affects modules that have been loaded with import; it does not update individual functions and objects loaded with from...import.
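The difference between import and from...import under reload can be seen in a small experiment. This is a minimal sketch, assuming Python 3, where reload lives in importlib rather than being a built-in as on the slides. The module name mymodule_demo and its greet function are hypothetical, created in a temporary directory only for this demonstration.

```python
import importlib
import os
import sys
import tempfile

# Skip bytecode caching so the reload always re-reads the edited source file.
sys.dont_write_bytecode = True

# Write a throwaway module (hypothetical name) into a temporary directory.
module_dir = tempfile.mkdtemp()
module_path = os.path.join(module_dir, "mymodule_demo.py")
with open(module_path, "w") as f:
    f.write("def greet():\n    return 'old'\n")
sys.path.insert(0, module_dir)

import mymodule_demo                 # dotted access: mymodule_demo.greet
from mymodule_demo import greet      # direct access: greet

# Edit the module on disk, then reload it.
with open(module_path, "w") as f:
    f.write("def greet():\n    return 'edited'\n")
importlib.invalidate_caches()
importlib.reload(mymodule_demo)

via_module = mymodule_demo.greet()   # looks the function up in the reloaded module
via_from_import = greet()            # still bound to the original function object
```

The dotted name picks up the edited code, while the name bound by from...import keeps pointing at the function object that existed before the reload.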
Introduction to NLTK
The Natural Language Toolkit (NLTK) provides:
  Basic classes for representing data relevant to natural language processing.
  Standard interfaces for performing tasks, such as tokenization, tagging, and parsing.
  Standard implementations of each task, which can be combined to solve complex problems.
NLTK: Example Modules
  nltk.token: processing individual elements of text, such as words or sentences.
  nltk.probability: modeling frequency distributions and probabilistic systems.
  nltk.tagger: tagging tokens with supplemental information, such as parts of speech or WordNet sense tags.
  nltk.parser: high-level interface for parsing texts.
  nltk.chartparser: a chart-based implementation of the parser interface.
  nltk.chunkparser: a regular-expression based surface parser.
NLTK: Top-Level Organization
NLTK is organized as a flat hierarchy of packages and modules.
Each module provides the tools necessary to address a specific task.
Modules contain two types of classes:
  Data-oriented classes are used to represent information relevant to natural language processing.
  Task-oriented classes encapsulate the resources and methods needed to perform a specific task.
To the First Tutorials Tokens and Tokenization Frequency Distributions
The Token Module It is often useful to think of a text in terms of smaller elements, such as words or sentences. The nltk.token module defines classes for representing and processing these smaller elements. What might be other useful smaller elements?
Tokens and Types
The term word can be used in two different ways:
  To refer to an individual occurrence of a word
  To refer to an abstract vocabulary item
For example, the sentence “my dog likes his dog” contains five occurrences of words, but four vocabulary items.
To avoid confusion, use more precise terminology:
  Word token: an occurrence of a word
  Word type: a vocabulary item
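The token/type distinction can be checked directly on the example sentence. A minimal sketch in plain Python, assuming whitespace-separated words and no NLTK classes:

```python
sentence = "my dog likes his dog"

# Each whitespace-separated occurrence is a word token.
word_tokens = sentence.split()

# The distinct vocabulary items are the word types.
word_types = set(word_tokens)

num_tokens = len(word_tokens)   # occurrences, counting "dog" twice
num_types = len(word_types)     # vocabulary items, counting "dog" once
```

Here num_tokens is 5 and num_types is 4, matching the counts given for the sentence.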
Tokens and Types (continued)
In NLTK, tokens are constructed from their types using the Token constructor:
    >>> from nltk.token import *
    >>> my_word_type = 'dog'
    >>> my_word_type
    'dog'
    >>> my_word_token = Token(my_word_type)
    >>> my_word_token
    'dog'@[?]
Token member functions include type and loc.
Text Locations
A text location @[s:e] specifies a region of a text:
  s is the start index
  e is the end index
The text location @[s:e] specifies the text beginning at s, and including everything up to (but not including) the text at e.
This definition is consistent with Python slices.
Think of indices as appearing between elements:
      I   saw   a   man
    0   1     2   3     4
Shorthand notation when location width = 1.
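Because locations follow the same convention as Python slices, the "I saw a man" example can be reproduced with ordinary list slicing. This is a sketch with word-based indices; text_at is a hypothetical helper, not an NLTK function.

```python
words = "I saw a man".split()

def text_at(tokens, start, end):
    """Return the tokens covered by the location @[start:end] (hypothetical helper)."""
    # Python slices include start and exclude end, just like text locations.
    return tokens[start:end]

span = text_at(words, 1, 3)      # covers the words between index 1 and index 3
single = text_at(words, 3, 4)    # a location of width 1
```

With this indexing, span is ['saw', 'a'] and single is ['man'].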
Text Locations (continued)
Indices can be based on different units:
  character
  word
  sentence
Locations can be tagged with sources (files or other text locations; e.g., the first word of the first sentence in the file).
Location member functions:
  start
  end
  unit
  source
Tokenization
The simplest way to represent a text is with a single string.
  It is difficult to process text in this format.
Often, it is more convenient to work with a list of tokens.
The task of converting a text from a single string to a list of tokens is known as tokenization.
Tokenization (continued)
Tokenization is harder than it seems:
  I’ll see you in New York.
  The aluminum-export ban.
The simplest approach is to use “graphic words” (i.e., separate words using whitespace).
Another approach is to use regular expressions to specify which substrings are valid words.
NLTK provides a generic tokenization interface: TokenizerI
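The two approaches can be compared on the example sentences using only the standard library. This is a rough sketch, not NLTK's tokenizers; the regular expression is one illustrative choice of what counts as a valid word.

```python
import re

text = "I'll see you in New York. The aluminum-export ban."

# Approach 1: "graphic words", i.e. split on whitespace.
graphic_words = text.split()

# Approach 2: a regular expression that says what counts as a word.
# This pattern (an illustrative choice) keeps internal apostrophes and
# hyphens inside a word and treats other punctuation as separate tokens.
WORD_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")
regex_words = WORD_PATTERN.findall(text)
```

Whitespace splitting leaves punctuation attached ("York.", "ban."), while the pattern separates it. Neither approach treats "New York" as a single token, which is part of why tokenization is harder than it seems.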
TokenizerI
Defines a single method, tokenize, which takes a string and returns a list of tokens.
The tokenize method is independent of the level of tokenization and of the implementation algorithm.
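The idea of one interface with interchangeable implementations can be sketched with an abstract base class. All class and function names below are hypothetical stand-ins, not NLTK's classes, and tokens are plain strings rather than Token objects.

```python
import re
from abc import ABC, abstractmethod

class TokenizerInterface(ABC):
    """Hypothetical stand-in for a tokenizer interface: one method, tokenize."""

    @abstractmethod
    def tokenize(self, text):
        """Return a list of tokens (plain strings here) for the given string."""

class WhitespaceTokenizer(TokenizerInterface):
    """Word-level tokenization by splitting on whitespace."""
    def tokenize(self, text):
        return text.split()

class SentenceTokenizer(TokenizerInterface):
    """Sentence-level tokenization: naive split after '.', '!' or '?'."""
    def tokenize(self, text):
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def count_tokens(tokenizer, text):
    # Client code depends only on the interface, not on the level or algorithm.
    return len(tokenizer.tokenize(text))

sample = "Hello world. This is a test file."
word_count = count_tokens(WhitespaceTokenizer(), sample)
sentence_count = count_tokens(SentenceTokenizer(), sample)
```

The same count_tokens function works for word-level and sentence-level tokenizers because it only calls tokenize.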
Example
    from nltk.token import WSTokenizer
    from nltk.draw.plot import Plot

    # Extract a list of words from the corpus
    corpus = open('corpus.txt').read()
    tokens = WSTokenizer().tokenize(corpus)

    # Count up how many times each word length occurs
    wordlen_count_list = []
    for token in tokens:
        wordlen = len(token.type())
        # Add zeros until wordlen_count_list is long enough
        while wordlen >= len(wordlen_count_list):
            wordlen_count_list.append(0)
        # Increment the count for this word length
        wordlen_count_list[wordlen] += 1

    Plot(wordlen_count_list)
Next Tutorial: Probability
An experiment is any process which leads to a well-defined outcome.
A sample is any possible outcome of a given experiment.
Rolling a die?
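Rolling a die fits these definitions: the roll is the experiment and each face is a sample. A small sketch, assuming a fair six-sided die and a seeded random generator so that runs are repeatable:

```python
import random

# Experiment: roll a six-sided die once. Its possible outcomes are the samples.
SAMPLES = (1, 2, 3, 4, 5, 6)

def roll_die(rng):
    """Run the experiment once and return its outcome."""
    return rng.choice(SAMPLES)

rng = random.Random(0)                          # fixed seed, chosen arbitrarily
outcomes = [roll_die(rng) for _ in range(10)]   # ten runs of the experiment
```

Every value in outcomes is one of the six samples; which ones appear depends on the seed.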
Outline Review Basics Probability Experiments and Samples Frequency Distributions Conditional Frequency Distributions
Review: NLTK Goals Classes for NLP data Interfaces for NLP tasks Implementations, easily combined (what is an example?)
Accessing NLTK
What is the relation to Python?
Words
Types and Tokens
Text Locations
Member Functions
Tokenization
TokenizerI
Implementations
    >>> tokenizer = WSTokenizer()
    >>> tokenizer.tokenize(text_str)
    ['Hello'@[0w], 'world.'@[1w], 'This'@[2w], 'is'@[3w],
     'a'@[4w], 'test'@[5w], 'file.'@[6w]]
Word Length Freq. Distribution Example
    from nltk.token import WSTokenizer
    from nltk.probability import SimpleFreqDist

    # Extract a list of words from the corpus
    corpus = open('corpus.txt').read()
    tokens = WSTokenizer().tokenize(corpus)

    # Construct a frequency distribution of word lengths
    wordlen_freqs = SimpleFreqDist()
    for token in tokens:
        wordlen_freqs.inc(len(token.type()))

    # Extract the set of word lengths found in the corpus
    wordlens = wordlen_freqs.samples()
Frequency Distributions
A frequency distribution records the number of times each outcome of an experiment has occurred.
    >>> freq_dist = FreqDist()
    >>> for token in document:
    ...     freq_dist.inc(token.type())
Constructor, then initialization by storing experimental outcomes.
Methods
The freq method returns the frequency of a given sample.
We can find the number of times a given sample occurred with the count method.
We can find the total number of sample outcomes recorded by a frequency distribution with the N method.
The samples method returns a list of all samples that have been recorded as outcomes by a frequency distribution.
We can find the sample with the greatest number of outcomes with the max method.
Examples of Methods
    >>> freq_dist.count('the')
    6
    >>> freq_dist.freq('the')
    0.012
    >>> freq_dist.N()
    500
    >>> freq_dist.max()
    'the'
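How these methods relate to one another can be seen in a small stand-in class built on collections.Counter. This is an illustrative sketch, not the NLTK implementation; the class name is hypothetical, and only the method names (inc, count, freq, N, samples, max) are taken from the slides.

```python
from collections import Counter

class SimpleFrequencyDistribution:
    """Illustrative stand-in (not the NLTK class) built on collections.Counter."""

    def __init__(self):
        self._counts = Counter()

    def inc(self, sample):
        """Record one more occurrence of sample."""
        self._counts[sample] += 1

    def count(self, sample):
        """Number of times sample was recorded."""
        return self._counts[sample]

    def N(self):
        """Total number of outcomes recorded."""
        return sum(self._counts.values())

    def freq(self, sample):
        """Fraction of recorded outcomes equal to sample."""
        total = self.N()
        return self._counts[sample] / total if total else 0.0

    def samples(self):
        """All distinct samples recorded so far."""
        return list(self._counts)

    def max(self):
        """A sample with the highest count."""
        return self._counts.most_common(1)[0][0]

freq_dist = SimpleFrequencyDistribution()
for word in "the cat saw the dog near the tree".split():
    freq_dist.inc(word)
```

In this sketch freq(sample) is simply count(sample) divided by N(), which is also the relationship between the values on the slide (6 / 500 = 0.012).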
Simple Word Length Example
    >>> from nltk.token import WSTokenizer
    >>> from nltk.probability import FreqDist
    >>> corpus = open('corpus.txt').read()
    >>> tokens = WSTokenizer().tokenize(corpus)
    # What is the distribution of word lengths in a corpus?
    >>> freq_dist = FreqDist()
    >>> for token in tokens:
    ...     freq_dist.inc(len(token.type()))
What is the "outcome" for our experiment?
Simple Word Length Example
    >>> from nltk.token import WSTokenizer
    >>> from nltk.probability import FreqDist
    >>> corpus = open('corpus.txt').read()
    >>> tokens = WSTokenizer().tokenize(corpus)
    # What is the distribution of word lengths in a corpus?
    >>> freq_dist = FreqDist()
    >>> for token in tokens:
    ...     freq_dist.inc(len(token.type()))
The word length is the "outcome" for our experiment, so we use inc() to increment its count in a frequency distribution.
Complex Word Length Example
    # define vowels as "a", "e", "i", "o", and "u"
    >>> VOWELS = ('a', 'e', 'i', 'o', 'u')
    # distribution for words ending in vowels?
    >>> freq_dist = FreqDist()
    >>> for token in tokens:
    ...     if token.type()[-1].lower() in VOWELS:
    ...         freq_dist.inc(len(token.type()))
What is the condition?
More Complex Example
    # What is the distribution of word lengths for
    # words following words that end in vowels?
    >>> ended_in_vowel = 0  # Did last word end in vowel?
    >>> freq_dist = FreqDist()
    >>> for token in tokens:
    ...     if ended_in_vowel:
    ...         freq_dist.inc(len(token.type()))
    ...     ended_in_vowel = token.type()[-1].lower() in VOWELS
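The same loop can be traced on a tiny word list without NLTK. This sketch substitutes a collections.Counter for FreqDist and a made-up sentence for the corpus, so the words and counts are illustrative only.

```python
from collections import Counter

VOWELS = ("a", "e", "i", "o", "u")
words = "the apple is so red and we like it".split()   # made-up sample text

length_counts = Counter()
ended_in_vowel = False            # did the previous word end in a vowel?
for word in words:
    if ended_in_vowel:
        # Only count words whose predecessor ended in a vowel.
        length_counts[len(word)] += 1
    # Remember the answer for the next iteration.
    ended_in_vowel = word[-1].lower() in VOWELS
```

The counted words are "apple", "is", "red", "like", and "it", since each follows a word ending in a vowel. The flag has to be updated on every iteration, outside the if, or the loop would never start counting.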
Conditional Frequency Distributions
A condition specifies the context in which an experiment is performed.
A conditional frequency distribution is a collection of frequency distributions for the same experiment, run under different conditions.
The individual frequency distributions are indexed by the condition.
NLTK ConditionalFreqDist class:
    >>> cfdist = ConditionalFreqDist()
    >>> cfdist
    <ConditionalFreqDist with 0 conditions>
Conditional Frequency Distributions (continued)
To access the frequency distribution for a condition, use the indexing operator:
    >>> cfdist['a']
    <FreqDist with 0 outcomes>
    # Record lengths of some words starting with 'a'
    >>> for word in 'apple and arm'.split():
    ...     cfdist['a'].inc(len(word))
    # How many are 3 characters long?
    >>> cfdist['a'].freq(3)
    0.66667
To list accessed conditions, use the conditions method:
    >>> cfdist.conditions()
    ['a']
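The structure behind a conditional frequency distribution, one frequency distribution per condition, can be approximated with a defaultdict of Counters. This is a sketch of the idea under that assumption, not the ConditionalFreqDist class itself.

```python
from collections import Counter, defaultdict

# One Counter per condition, created the first time the condition is indexed.
cfdist = defaultdict(Counter)

for word in "apple and arm".split():
    condition = word[0].lower()     # initial letter
    outcome = len(word)             # word length
    cfdist[condition][outcome] += 1

counts_for_a = cfdist["a"]
# Relative frequency of 3-letter words among words starting with 'a'.
freq_of_3 = counts_for_a[3] / sum(counts_for_a.values())
conditions = list(cfdist)
```

Two of the three words ("and", "arm") have length 3, so freq_of_3 is 2/3, in line with the 0.66667 on the slide.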
Example: Conditioning on a Word’s Initial Letter
    >>> from nltk.token import WSTokenizer
    >>> from nltk.probability import ConditionalFreqDist
    >>> from nltk.draw.plot import Plot
    # Extract a list of words from the corpus
    >>> corpus = open('corpus.txt').read()
    >>> tokens = WSTokenizer().tokenize(corpus)
    >>> cfdist = ConditionalFreqDist()
Example (continued)
    # How does initial letter affect word length?
    >>> for token in tokens:
    ...     outcome = len(token.type())
    ...     condition = token.type()[0].lower()
    ...     cfdist[condition].inc(outcome)
What are the condition and the outcome?
  Condition = the initial letter of the token
  Outcome = its word length
Prediction
Prediction is the problem of deciding a likely outcome for a given run of an experiment.
To predict the outcome, we first examine a training corpus.
Training corpus
  The context and outcome for each run are known.
  Given a new run, we choose the outcome that occurred most frequently for the context.
  A conditional frequency distribution finds the most frequent occurrence.
Prediction Example: Outline
Record each outcome in the training corpus, using the context that the experiment was under as the condition.
Access the frequency distribution for a given context with the indexing operator.
Use the max() method to find the most likely outcome.
Example: Predicting Words
Predict a word's type, based on the preceding word's type.
    >>> from nltk.token import WSTokenizer
    >>> from nltk.probability import ConditionalFreqDist
    >>> corpus = open('corpus.txt').read()
    >>> tokens = WSTokenizer().tokenize(corpus)
    >>> cfdist = ConditionalFreqDist()  # empty
Example (continued)
    >>> context = None  # The type of the preceding word
    >>> for token in tokens:
    ...     outcome = token.type()
    ...     cfdist[context].inc(outcome)
    ...     context = token.type()
Example (continued)
    >>> cfdist['prediction'].max()
    'problems'
    >>> cfdist['problems'].max()
    'in'
    >>> cfdist['in'].max()
    'the'
What are we predicting here?
Example (continued)
We predict the most likely word for any context.
Generation application:
    >>> word = 'prediction'
    >>> for i in range(15):
    ...     print word,
    ...     word = cfdist[word].max()
    prediction problems in the frequency distribution of the frequency distribution of the frequency distribution of
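The whole train-predict-generate pipeline can be sketched without NLTK by keeping one Counter of following words per context word. The function names and the tiny training text below are hypothetical, chosen only to make the behavior easy to follow.

```python
from collections import Counter, defaultdict

def train_bigram_predictor(words):
    """Map each context word to a Counter of the words that followed it."""
    cfdist = defaultdict(Counter)
    context = None                  # no preceding word at the start
    for word in words:
        cfdist[context][word] += 1
        context = word
    return cfdist

def predict_next(cfdist, context):
    """Most frequent follower of context, or None if context was never seen."""
    followers = cfdist.get(context)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

def generate(cfdist, start, length):
    """Repeatedly predict the next word, starting from start."""
    output, word = [], start
    for _ in range(length):
        if word is None:            # no prediction available: stop early
            break
        output.append(word)
        word = predict_next(cfdist, word)
    return output

# Made-up training text for illustration.
training = "the cat sat on the mat and the cat sat on the rug".split()
model = train_bigram_predictor(training)
generated = generate(model, "on", 6)
```

Because each step always picks the single most frequent follower, generation quickly falls into a cycle (here "on the cat sat on the"), the same effect as the repeated "the frequency distribution of" in the slide's output.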
For Next Time
HW3
  To run NLTK from unixs.cis.pitt.edu, you should add /afs/cs.pitt.edu/projects/nltk/bin to your search path.
Regular Expressions (J&M handout, NLTK tutorial)