Python for NLP and the Natural Language Toolkit

Python for NLP and the Natural Language Toolkit CS1573: AI Application Development, Spring 2003 (modified from Edward Loper’s notes)

Outline
Review: Introduction to NLP (knowledge of language, ambiguity, representations and algorithms, applications). HW 2 discussion. Tutorials: Basics, Probability.

Python and Natural Language Processing
Python is a great language for NLP. It is simple. It is easy to debug (exceptions, interpreted language). It is easy to structure (modules, object-oriented programming). It has powerful string manipulation.

Modules and Packages
Python modules “package program code and data for reuse” (Lutz). They are similar to a library in C or a package in Java. Python packages are hierarchical modules (i.e., modules that contain other modules). Three commands for accessing modules: import, from…import, and reload.

Modules and Packages: import
The import command loads a module:
    # Load the regular expression module
    >>> import re
To access the contents of a module, use dotted names:
    # Use the search function from the re module
    >>> re.search('\w+', str)
To list the contents of a module, use dir:
    >>> dir(re)
    ['DOTALL', 'I', 'IGNORECASE', …]

Modules and Packages: from…import
The from…import command loads individual functions and objects from a module:
    # Load the search function from the re module
    >>> from re import search
Once an individual function or object is loaded with from…import, it can be used directly:
    # Use the search function from the re module
    >>> search('\w+', str)

Import vs. from…import
import: keeps module functions separate from user functions; requires the use of dotted names; works with reload.
from…import: puts module functions and user functions together; gives more convenient names; does not work with reload.
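The trade-off above can be seen with the standard-library re module. This is a minimal sketch in Python 3; the sample string is a made-up assumption, not from the slides.

```python
# Two ways to reach the same function in the standard-library re module.
import re                  # dotted access: re.search
from re import search      # direct access: search

text = "natural language toolkit"  # hypothetical sample string

# Both names refer to the same function object.
match_dotted = re.search(r"\w+", text)
match_direct = search(r"\w+", text)
first_word = match_dotted.group()

# dir() lists the names a module defines.
has_ignorecase = "IGNORECASE" in dir(re)
```

The dotted form makes it obvious where search comes from; the direct form is shorter but shares one namespace with your own functions.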

Modules and Packages: reload
If you edit a module, you must use the reload command before the changes become visible in Python:
    >>> import mymodule
    ...
    >>> reload(mymodule)
The reload command only affects modules that have been loaded with import; it does not update individual functions and objects loaded with from…import.
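The slides use the Python 2 built-in reload. In Python 3 the same operation is importlib.reload, and the sketch below uses it to show the difference between the two import styles. The module name mymodule_demo and its greet function are hypothetical, written to a temporary directory just for this demonstration.

```python
import importlib
import os
import sys
import tempfile

sys.dont_write_bytecode = True  # avoid stale cached bytecode in this demo

# Write a throwaway module (hypothetical name) into a temporary directory.
module_dir = tempfile.mkdtemp()
module_path = os.path.join(module_dir, "mymodule_demo.py")
with open(module_path, "w") as f:
    f.write("def greet():\n    return 'hello'\n")

sys.path.insert(0, module_dir)
importlib.invalidate_caches()
import mymodule_demo
from mymodule_demo import greet

# Edit the module on disk, then reload it.
with open(module_path, "w") as f:
    f.write("def greet():\n    return 'goodbye, world'\n")
importlib.invalidate_caches()
importlib.reload(mymodule_demo)

via_module = mymodule_demo.greet()   # dotted name sees the edited code
via_from_import = greet()            # name bound by from...import is still the old function
```

After the reload, the dotted name picks up the new definition, while the name bound by from…import still points at the old function object.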

Introduction to NLTK
The Natural Language Toolkit (NLTK) provides:
Basic classes for representing data relevant to natural language processing.
Standard interfaces for performing tasks, such as tokenization, tagging, and parsing.
Standard implementations of each task, which can be combined to solve complex problems.

NLTK: Example Modules
nltk.token: processing individual elements of text, such as words or sentences.
nltk.probability: modeling frequency distributions and probabilistic systems.
nltk.tagger: tagging tokens with supplemental information, such as parts of speech or WordNet sense tags.
nltk.parser: high-level interface for parsing texts.
nltk.chartparser: a chart-based implementation of the parser interface.
nltk.chunkparser: a regular-expression based surface parser.

NLTK: Top-Level Organization
NLTK is organized as a flat hierarchy of packages and modules. Each module provides the tools necessary to address a specific task. Modules contain two types of classes:
Data-oriented classes are used to represent information relevant to natural language processing.
Task-oriented classes encapsulate the resources and methods needed to perform a specific task.

To the First Tutorials
Tokens and Tokenization. Frequency Distributions.

The Token Module
It is often useful to think of a text in terms of smaller elements, such as words or sentences. The nltk.token module defines classes for representing and processing these smaller elements. What might be other useful smaller elements?

Tokens and Types
The term "word" can be used in two different ways: to refer to an individual occurrence of a word, or to refer to an abstract vocabulary item. For example, the sentence “my dog likes his dog” contains five occurrences of words, but four vocabulary items. To avoid confusion, use more precise terminology:
Word token: an occurrence of a word.
Word type: a vocabulary item.
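The token/type distinction can be checked directly on the example sentence. This is a minimal sketch in plain Python 3 that uses whitespace splitting as a stand-in for a real tokenizer.

```python
sentence = "my dog likes his dog"

# Word tokens: every occurrence counts.
word_tokens = sentence.split()

# Word types: each distinct vocabulary item counts once.
word_types = set(word_tokens)

num_tokens = len(word_tokens)
num_types = len(word_types)
```

Here num_tokens is 5 and num_types is 4, because "dog" occurs twice but is a single vocabulary item.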

Tokens and Types (continued)
In NLTK, tokens are constructed from their types using the Token constructor:
    >>> from nltk.token import *
    >>> my_word_type = 'dog'
    'dog'
    >>> my_word_token = Token(my_word_type)
    'dog'@[?]
Token member functions include type and loc.

Text Locations
A text location @[s:e] specifies a region of a text, where s is the start index and e is the end index. The text location @[s:e] specifies the text beginning at s and including everything up to (but not including) the text at e. This definition is consistent with Python slices. Think of indices as appearing between elements:
    0 I 1 saw 2 a 3 man 4
A shorthand notation is available when the location width is 1.
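Because locations follow Python slice conventions, the "indices between elements" picture can be tried with an ordinary list. This is a sketch with word-level indices; the helper name text_at is an illustrative assumption, not an NLTK function.

```python
words = "I saw a man".split()

def text_at(tokens, start, end):
    """Return the tokens from index start up to, but not including, end."""
    return tokens[start:end]

middle = text_at(words, 1, 3)      # the region between index 1 and index 3
first = text_at(words, 0, 1)       # a location of width 1
whole = text_at(words, 0, 4)       # the full sentence
```

With indices 0 to 4 sitting between the four words, the region from 1 to 3 covers "saw" and "a".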

Text Locations (continued)
Indices can be based on different units: character, word, or sentence. Locations can be tagged with sources (files or other text locations; e.g., the first word of the first sentence in the file). Location member functions: start, end, unit, source.

Tokenization
The simplest way to represent a text is with a single string, but it is difficult to process text in this format. Often, it is more convenient to work with a list of tokens. The task of converting a text from a single string to a list of tokens is known as tokenization.

Tokenization (continued)
Tokenization is harder than it seems. Consider “I’ll see you in New York.” and “The aluminum-export ban.” The simplest approach is to use “graphic words” (i.e., separate words using whitespace). Another approach is to use regular expressions to specify which substrings are valid words. NLTK provides a generic tokenization interface: TokenizerI.
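The two approaches can be compared on the example sentences. This is a rough sketch using only Python 3's standard library; the regular expression is one illustrative choice of what counts as a "valid word", not the pattern used by NLTK.

```python
import re

def whitespace_tokenize(text):
    """Graphic words: split on runs of whitespace."""
    return text.split()

# Assumed pattern: a word, optionally continued by hyphen- or
# apostrophe-joined parts, or else a single punctuation character.
WORD_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def regexp_tokenize(text):
    """Return every substring that matches WORD_PATTERN."""
    return WORD_PATTERN.findall(text)

graphic = whitespace_tokenize("I'll see you in New York.")
by_regexp = regexp_tokenize("I'll see you in New York.")
hyphenated = regexp_tokenize("The aluminum-export ban.")
```

Whitespace splitting leaves the period glued to "York.", while the pattern separates it. Neither approach treats "New York" as a single unit, which is part of why tokenization is harder than it seems.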

TokenizerI
Defines a single method, tokenize, which takes a string and returns a list of tokens. The tokenize method is independent of the level of tokenization and of the implementation algorithm.
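The idea of one interface with interchangeable implementations can be sketched with an abstract base class. All class names below are hypothetical stand-ins, not NLTK's classes, and tokens are represented as simple (type, (start, end)) pairs.

```python
from abc import ABC, abstractmethod

class TokenizerSketch(ABC):
    """Hypothetical stand-in for a tokenizer interface."""

    @abstractmethod
    def tokenize(self, text):
        """Return a list of (type, (start, end)) pairs."""

class WhitespaceTokenizerSketch(TokenizerSketch):
    """Word-level tokens, indexed by word position."""
    def tokenize(self, text):
        return [(word, (i, i + 1)) for i, word in enumerate(text.split())]

class LineTokenizerSketch(TokenizerSketch):
    """Line-level tokens, indexed by line position."""
    def tokenize(self, text):
        return [(line, (i, i + 1)) for i, line in enumerate(text.splitlines())]

def count_tokens(tokenizer, text):
    """Works with any tokenizer, whatever its level or algorithm."""
    return len(tokenizer.tokenize(text))

word_tokens = WhitespaceTokenizerSketch().tokenize("Hello world.")
num_lines = count_tokens(LineTokenizerSketch(), "a b\nc d\ne")
```

Code such as count_tokens depends only on the tokenize method, so the level of tokenization can change without touching the caller.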

Example
    from nltk.token import WSTokenizer
    from nltk.draw.plot import Plot
    # Extract a list of words from the corpus
    corpus = open('corpus.txt').read()
    tokens = WSTokenizer().tokenize(corpus)
    # Count up how many times each word length occurs
    wordlen_count_list = []
    for token in tokens:
        wordlen = len(token.type())
        # Add zeros until wordlen_count_list is long enough
        while wordlen >= len(wordlen_count_list):
            wordlen_count_list.append(0)
        # Increment the count for this word length
        wordlen_count_list[wordlen] += 1
    Plot(wordlen_count_list)

Next Tutorial: Probability
An experiment is any process which leads to a well-defined outcome. A sample is any possible outcome of a given experiment. Rolling a die?

Outline Review Basics Probability Experiments and Samples Frequency Distributions Conditional Frequency Distributions

Review: NLTK Goals
Classes for NLP data. Interfaces for NLP tasks. Implementations, easily combined (what is an example?).

Accessing NLTK
What is the relation to Python?

Words
Types and Tokens. Text Locations. Member Functions.

Tokenization
TokenizerI. Implementations.
    >>> tokenizer = WSTokenizer()
    >>> tokenizer.tokenize(text_str)
    ['Hello'@[0w], 'world.'@[1w], 'This'@[2w], 'is'@[3w], 'a'@[4w], 'test'@[5w], 'file.'@[6w]]

Word Length Freq. Distribution Example
    from nltk.token import WSTokenizer
    from nltk.probability import SimpleFreqDist
    # Extract a list of words from the corpus
    corpus = open('corpus.txt').read()
    tokens = WSTokenizer().tokenize(corpus)
    # Construct a frequency distribution of word lengths
    wordlen_freqs = SimpleFreqDist()
    for token in tokens:
        wordlen_freqs.inc(len(token.type()))
    # Extract the set of word lengths found in the corpus
    wordlens = wordlen_freqs.samples()

Frequency Distributions
A frequency distribution records the number of times each outcome of an experiment has occurred.
    >>> freq_dist = FreqDist()
    >>> for token in document:
    ...     freq_dist.inc(token.type())
Call the constructor first, then initialize the distribution by storing experimental outcomes.

Methods
The freq method returns the frequency of a given sample, i.e., its count divided by the total number of outcomes. The count method returns the number of times a given sample occurred. The N method returns the total number of sample outcomes recorded by a frequency distribution. The samples method returns a list of all samples that have been recorded as outcomes by a frequency distribution. The max method returns the sample with the greatest number of outcomes.
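To make the behavior of these methods concrete, here is a small stand-in built on collections.Counter from Python 3's standard library. The class name FreqDistSketch is hypothetical; it imitates the method names listed on the slide and is not NLTK's implementation.

```python
from collections import Counter

class FreqDistSketch:
    """Hypothetical stand-in mirroring the method names on the slide."""

    def __init__(self):
        self._counts = Counter()

    def inc(self, sample):
        """Record one more occurrence of sample."""
        self._counts[sample] += 1

    def count(self, sample):
        """Number of times sample was recorded."""
        return self._counts[sample]

    def N(self):
        """Total number of recorded outcomes."""
        return sum(self._counts.values())

    def freq(self, sample):
        """Count of sample divided by the total number of outcomes."""
        total = self.N()
        return self._counts[sample] / total if total else 0.0

    def samples(self):
        """All distinct samples recorded so far."""
        return list(self._counts)

    def max(self):
        """The sample with the greatest count, or None if empty."""
        return self._counts.most_common(1)[0][0] if self._counts else None

fd = FreqDistSketch()
for word in "the cat saw the dog".split():
    fd.inc(word)
```

With this five-word input, fd.count('the') is 2, fd.N() is 5, and fd.freq('the') is 2/5.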

Examples of Methods
    >>> freq_dist.count('the')
    6
    >>> freq_dist.freq('the')
    0.012
    >>> freq_dist.N()
    500
    >>> freq_dist.max()
    'the'

Simple Word Length Example
    >>> from nltk.token import WSTokenizer
    >>> from nltk.probability import FreqDist
    >>> corpus = open('corpus.txt').read()
    >>> tokens = WSTokenizer().tokenize(corpus)
    # What is the distribution of word lengths in a corpus?
    >>> freq_dist = FreqDist()
    >>> for token in tokens:
    ...     freq_dist.inc(len(token.type()))
What is the "outcome" for our experiment?

Simple Word Length Example
    >>> from nltk.token import WSTokenizer
    >>> from nltk.probability import FreqDist
    >>> corpus = open('corpus.txt').read()
    >>> tokens = WSTokenizer().tokenize(corpus)
    # What is the distribution of word lengths in a corpus?
    >>> freq_dist = FreqDist()
    >>> for token in tokens:
    ...     freq_dist.inc(len(token.type()))
The word length is the "outcome" for our experiment, so we use inc() to increment its count in a frequency distribution.
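The same experiment can be tried without NLTK installed. The sketch below uses collections.Counter and a short inline string as an assumed stand-in for corpus.txt, with whitespace splitting in place of WSTokenizer.

```python
from collections import Counter

# Assumed stand-in for the contents of corpus.txt.
corpus = "Hello world. This is a test file."
tokens = corpus.split()

# The outcome of each "experiment" is the length of one token.
length_counts = Counter(len(token) for token in tokens)
total = sum(length_counts.values())
```

Punctuation stays attached to the tokens under whitespace splitting, so "world." counts as length 6.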

Complex Word Length Example
    # Define vowels as "a", "e", "i", "o", and "u"
    >>> VOWELS = ('a', 'e', 'i', 'o', 'u')
    # What is the distribution for words ending in vowels?
    >>> freq_dist = FreqDist()
    >>> for token in tokens:
    ...     if token.type()[-1].lower() in VOWELS:
    ...         freq_dist.inc(len(token.type()))
What is the condition?

More Complex Example
    # What is the distribution of word lengths for
    # words following words that end in vowels?
    >>> ended_in_vowel = 0  # Did last word end in vowel?
    >>> freq_dist = FreqDist()
    >>> for token in tokens:
    ...     if ended_in_vowel:
    ...         freq_dist.inc(len(token.type()))
    ...     ended_in_vowel = token.type()[-1].lower() in VOWELS
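The flag-based loop is easy to get wrong, so here is a runnable Python 3 sketch of the same logic using collections.Counter. The sample sentence is made up for illustration, and plain strings stand in for NLTK tokens.

```python
from collections import Counter

VOWELS = ("a", "e", "i", "o", "u")
tokens = "the idea of a zebra amused me".split()  # hypothetical sample text

length_counts = Counter()
ended_in_vowel = False  # did the previous word end in a vowel?
for token in tokens:
    # Count this word only if the previous word ended in a vowel.
    if ended_in_vowel:
        length_counts[len(token)] += 1
    # Update the flag on every iteration, counted or not.
    ended_in_vowel = token[-1].lower() in VOWELS

counted = sum(length_counts.values())
```

The flag update sits outside the if so that it runs for every token; nesting it inside would freeze the flag at its initial value.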

Conditional Frequency Distributions
A condition specifies the context in which an experiment is performed. A conditional frequency distribution is a collection of frequency distributions for the same experiment, run under different conditions. The individual frequency distributions are indexed by the condition. NLTK provides the ConditionalFreqDist class:
    >>> cfdist = ConditionalFreqDist()
    <ConditionalFreqDist with 0 conditions>

Conditional Frequency Distributions (continued)
To access the frequency distribution for a condition, use the indexing operator:
    >>> cfdist['a']
    <FreqDist with 0 outcomes>
    # Record lengths of some words starting with 'a'
    >>> for word in 'apple and arm'.split():
    ...     cfdist['a'].inc(len(word))
    # How many are 3 characters long?
    >>> cfdist['a'].freq(3)
    0.66667
To list accessed conditions, use the conditions method:
    >>> cfdist.conditions()
    ['a']
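A conditional frequency distribution can be approximated in standard-library Python 3 as a dictionary of counters. This is a sketch only; relative_freq is a hypothetical helper, and defaultdict(Counter) is a stand-in for ConditionalFreqDist rather than its implementation.

```python
from collections import Counter, defaultdict

# One Counter per condition, created on first access.
cfdist = defaultdict(Counter)

# Record lengths of some words starting with 'a'.
for word in "apple and arm".split():
    cfdist["a"][len(word)] += 1

def relative_freq(dist, sample):
    """Count of sample divided by the total count in dist."""
    total = sum(dist.values())
    return dist[sample] / total if total else 0.0

share_len3 = relative_freq(cfdist["a"], 3)
conditions = list(cfdist)
```

Two of the three words ("and", "arm") have length 3, which gives the 0.66667 shown on the slide. As with the slide's "accessed conditions", indexing the defaultdict is what creates a condition.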

Example: Conditioning on a Word’s Initial Letter
    >>> from nltk.token import WSTokenizer
    >>> from nltk.probability import ConditionalFreqDist
    >>> from nltk.draw.plot import Plot
    >>> corpus = open('corpus.txt').read()
    >>> tokens = WSTokenizer().tokenize(corpus)
    >>> cfdist = ConditionalFreqDist()

Example (continued)
    # How does initial letter affect word length?
    >>> for token in tokens:
    ...     outcome = len(token.type())
    ...     condition = token.type()[0].lower()
    ...     cfdist[condition].inc(outcome)
What are the condition and the outcome?

Example (continued)
    # How does initial letter affect word length?
    >>> for token in tokens:
    ...     outcome = len(token.type())
    ...     condition = token.type()[0].lower()
    ...     cfdist[condition].inc(outcome)
What are the condition and the outcome?
Condition = the initial letter of the token.
Outcome = its word length.
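The condition/outcome split can be reproduced with a dictionary of counters in Python 3. The inline string is an assumed stand-in for corpus.txt, and plain strings stand in for NLTK tokens.

```python
from collections import Counter, defaultdict

tokens = "Hello world. This is a test file.".split()  # stand-in corpus

cfdist = defaultdict(Counter)
for token in tokens:
    outcome = len(token)            # the word length
    condition = token[0].lower()    # the initial letter
    cfdist[condition][outcome] += 1

lengths_for_t = dict(cfdist["t"])
seen_conditions = sorted(cfdist)
```

Both "This" and "test" fall under the condition 't' with outcome 4, so that condition's distribution holds a single length counted twice.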

Prediction
Prediction is the problem of deciding a likely outcome for a given run of an experiment. To predict the outcome, we first examine a training corpus, in which the context and outcome for each run are known. Given a new run, we choose the outcome that occurred most frequently for the context. A conditional frequency distribution finds the most frequent occurrence.

Prediction Example: Outline
Record each outcome in the training corpus, using the context that the experiment was under as the condition. Access the frequency distribution for a given context with the indexing operator. Use the max() method to find the most likely outcome.

Example: Predicting Words
Predict a word's type, based on the preceding word's type.
    >>> from nltk.token import WSTokenizer
    >>> from nltk.probability import ConditionalFreqDist
    >>> corpus = open('corpus.txt').read()
    >>> tokens = WSTokenizer().tokenize(corpus)
    >>> cfdist = ConditionalFreqDist()  # empty

Example (continued)
    >>> context = None  # The type of the preceding word
    >>> for token in tokens:
    ...     outcome = token.type()
    ...     cfdist[context].inc(outcome)
    ...     context = token.type()

Example (continued)
    >>> cfdist['prediction'].max()
    'problems'
    >>> cfdist['problems'].max()
    'in'
    >>> cfdist['in'].max()
    'the'
What are we predicting here?

Example (continued)
We predict the most likely word for any context. Generation application:
    >>> word = 'prediction'
    >>> for i in range(15):
    ...     print word,
    ...     word = cfdist[word].max()
    prediction problems in the frequency distribution of the frequency distribution of the frequency distribution of
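The training and generation steps can be combined into one runnable Python 3 sketch. The training text is invented for illustration, and predict and generate are hypothetical helper names; a dictionary of counters stands in for ConditionalFreqDist.

```python
from collections import Counter, defaultdict

# Hypothetical training text; a real run would read a corpus file.
training_text = "the cat sat on the mat and the cat sat on the sofa"
tokens = training_text.split()

# Training: condition on the preceding word, record the word that follows.
cfdist = defaultdict(Counter)
context = None
for token in tokens:
    cfdist[context][token] += 1
    context = token

def predict(context):
    """Most frequent word seen after context, or None if context is unseen."""
    dist = cfdist.get(context)
    if not dist:
        return None
    return dist.most_common(1)[0][0]

def generate(start, length):
    """Repeatedly append the most likely next word, as in the slide's loop."""
    words = [start]
    for _ in range(length - 1):
        next_word = predict(words[-1])
        if next_word is None:
            break
        words.append(next_word)
    return words

generated = generate("the", 6)
```

Always choosing the single most likely next word makes the output fall into a cycle quickly, which is the same effect as the repeated "the frequency distribution of" in the slide's output.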

For Next Time
HW3. To run NLTK from unixs.cis.pitt.edu, you should add /afs/cs.pitt.edu/projects/nltk/bin to your search path. Regular Expressions (J&M handout, NLTK tutorial).