Python 3 March 15, 2011

NLTK import nltk nltk.download()
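
In current NLTK releases the same setup can be scripted instead of clicking through the interactive downloader; a minimal sketch, assuming NLTK 3+ is installed and that only the data used by the book examples is wanted:

import nltk
# Fetch just the "book" collection (Moby Dick and the other texts used below),
# skipping the interactive downloader window.
nltk.download('book')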

NLTK import nltk from nltk.book import * texts() 1. Look at the lists of available texts

NLTK import nltk from nltk.book import * print text1[0:50] 2. Check out what the text1 (Moby Dick) object looks like

NLTK import nltk from nltk.book import * print text1[0:50] Looks like a list of word tokens 2. Check out what the text1 (Moby Dick) object looks like
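
The slides use Python 2 print statements; under Python 3 the same check looks roughly like the sketch below, assuming the book data from the previous step is installed:

import nltk
from nltk.book import *   # defines text1 (Moby Dick), text2, ... and texts()
# text1 behaves like a list of word tokens, so it can be sliced and measured.
print(text1[0:50])   # first 50 tokens of Moby Dick
print(len(text1))    # total number of tokens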

NLTK 3. Get list of top most frequent word TOKENS import nltk from nltk.book import * fd=FreqDist(text1) print fd.keys()[0:10]

NLTK import nltk from nltk.book import * fd=FreqDist(text1) print fd.keys()[0:10] FreqDist is an object defined by NLTK. Give it a list of word tokens and it will be automatically sorted; print the first 10 keys. 3. Get list of top most frequent word TOKENS
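
In NLTK 3, FreqDist is a Counter subclass and keys() is no longer returned in frequency order, so the usual way to get the top tokens is most_common(); a sketch under that assumption:

import nltk
from nltk.book import *
fd = nltk.FreqDist(text1)      # count every token in Moby Dick
# most_common(10) returns the ten most frequent tokens as (token, count) pairs,
# replacing the old frequency-sorted fd.keys()[0:10] idiom.
print(fd.most_common(10))
print(fd[','])                 # count for one specific token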

NLTK import nltk from nltk.book import * text1.concordance("and") 4. Now get a concordance of the third most common word

NLTK import nltk from nltk.book import * text1.concordance("and") concordance is a method defined for an nltk text: concordance(self, word, width=79, lines=25) prints a concordance for word with the specified context window. 4. Now get a concordance of the third most common word
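
A runnable form of the same call, with the keyword arguments from the signature spelled out (a sketch; the defaults shown are the ones listed on the slide):

import nltk
from nltk.book import *
# Print a concordance for "and": each output line shows the word in context,
# width characters wide, up to the given number of lines.
text1.concordance("and", width=79, lines=25)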

import nltk from nltk.book import * mobyDick=[x.replace(",","") for x in text1] mobyDick=[x.replace(";","") for x in mobyDick] mobyDick=[x.replace(".","") for x in mobyDick] mobyDick=[x.replace("'","") for x in mobyDick] mobyDick=[x.replace("-","") for x in mobyDick] mobyDick=[x for x in mobyDick if len(x)>1] fd=FreqDist(mobyDick) print fd.keys()[0:10] 5. What if you don't want punctuation in your list? First, simple way to fix it: String Operations

import nltk from nltk.book import * mobyDick=[x.replace(",","") for x in text1] mobyDick=[x.replace(";","") for x in mobyDick] mobyDick=[x.replace(".","") for x in mobyDick] mobyDick=[x.replace("'","") for x in mobyDick] mobyDick=[x.replace("-","") for x in mobyDick] mobyDick=[x for x in mobyDick if len(x)>1] fd=FreqDist(mobyDick) print fd.keys()[0:10] 5. What if you don't want punctuation in your list? First, simple way to fix it: Make a new list of tokens String Operations

import nltk from nltk.book import * mobyDick=[x.replace(",","") for x in text1] mobyDick=[x.replace(";","") for x in mobyDick] mobyDick=[x.replace(".","") for x in mobyDick] mobyDick=[x.replace("'","") for x in mobyDick] mobyDick=[x.replace("-","") for x in mobyDick] mobyDick=[x for x in mobyDick if len(x)>1] fd=FreqDist(mobyDick) print fd.keys()[0:10] 5. What if you don't want punctuation in your list? First, simple way to fix it: Make a new list of tokens Call it mobyDick String Operations

import nltk from nltk.book import * mobyDick=[x.replace(",","") for x in text1] mobyDick=[x.replace(";","") for x in mobyDick] mobyDick=[x.replace(".","") for x in mobyDick] mobyDick=[x.replace("'","") for x in mobyDick] mobyDick=[x.replace("-","") for x in mobyDick] mobyDick=[x for x in mobyDick if len(x)>1] fd=FreqDist(mobyDick) print fd.keys()[0:10] 5. What if you don't want punctuation in your list? First, simple way to fix it: Make a new list of tokens Call it mobyDick For each token x in the original list… String Operations

import nltk from nltk.book import * mobyDick=[x.replace(",","") for x in text1] mobyDick=[x.replace(";","") for x in mobyDick] mobyDick=[x.replace(".","") for x in mobyDick] mobyDick=[x.replace("'","") for x in mobyDick] mobyDick=[x.replace("-","") for x in mobyDick] mobyDick=[x for x in mobyDick if len(x)>1] fd=FreqDist(mobyDick) print fd.keys()[0:10] 5. What if you don't want punctuation in your list? First, simple way to fix it: Make a new list of tokens Call it mobyDick For each token x in the original list… Copy the token into the new list, except replace each "," with nothing String Operations

import nltk from nltk.book import * mobyDick=[x.replace(",","") for x in text1] mobyDick=[x.replace(";","") for x in mobyDick] mobyDick=[x.replace(".","") for x in mobyDick] mobyDick=[x.replace("'","") for x in mobyDick] mobyDick=[x.replace("-","") for x in mobyDick] mobyDick=[x for x in mobyDick if len(x)>1] fd=FreqDist(mobyDick) print fd.keys()[0:10] 5. What if you don't want punctuation in your list? First, simple way to fix it: Make a new list of tokens Call it mobyDick For each token x in the original list… Copy the token into the new list, except replace each "," with nothing Then, finally, keep only the tokens longer than one character (what was originally just "." is now empty) String Operations

import nltk from nltk.book import * mobyDick=[x.replace(",","") for x in text1] mobyDick=[x.replace(";","") for x in mobyDick] mobyDick=[x.replace(".","") for x in mobyDick] mobyDick=[x.replace("'","") for x in mobyDick] mobyDick=[x.replace("-","") for x in mobyDick] mobyDick=[x for x in mobyDick if len(x)>1] fd=FreqDist(mobyDick) print fd.keys()[0:10] 5. What if you don't want punctuation in your list? First, simple way to fix it: Make a new list of tokens Call it mobyDick For each token x in the original list… Copy the token into the new list, except replace each "," with nothing Make a new FreqDist with the new list of tokens, call it fd Then, finally, keep only the tokens longer than one character (what was originally just "." is now empty) String Operations

import nltk from nltk.book import * mobyDick=[x.replace(",","") for x in text1] mobyDick=[x.replace(";","") for x in mobyDick] mobyDick=[x.replace(".","") for x in mobyDick] mobyDick=[x.replace("'","") for x in mobyDick] mobyDick=[x.replace("-","") for x in mobyDick] mobyDick=[x for x in mobyDick if len(x)>1] fd=FreqDist(mobyDick) print fd.keys()[0:10] 5. What if you don't want punctuation in your list? First, simple way to fix it: Make a new list of tokens Call it mobyDick For each token x in the original list… Copy the token into the new list, except replace each "," with nothing Print it like before Make a new FreqDist with the new list of tokens, call it fd Then, finally, keep only the tokens longer than one character (what was originally just "." is now empty) String Operations

import nltk from nltk.book import * mobyDick=[x.replace(",","") for x in text1] mobyDick=[x.replace(";","") for x in mobyDick] mobyDick=[x.replace(".","") for x in mobyDick] mobyDick=[x.replace("'","") for x in mobyDick] mobyDick=[x.replace("-","") for x in mobyDick] mobyDick=[x for x in mobyDick if len(x)>1] fd=FreqDist(mobyDick) print fd.keys()[0:10] 5. What if you don't want punctuation in your list? First, simple way to fix it:
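
A Python 3 version of the same string-operations approach might look like the sketch below; most_common() stands in for the old frequency-sorted keys():

import nltk
from nltk.book import *
# Strip a few punctuation characters from every token, one replace() at a time,
# then keep only the tokens that are longer than one character.
mobyDick = [x.replace(",", "") for x in text1]
for ch in (";", ".", "'", "-"):
    mobyDick = [x.replace(ch, "") for x in mobyDick]
mobyDick = [x for x in mobyDick if len(x) > 1]
fd = nltk.FreqDist(mobyDick)
print(fd.most_common(10))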

Regular Expressions import nltk from nltk.book import * import re punctuation = re.compile("[,.; '-]") punctuationRemoved=[punctuation.sub("",x) for x in text1] fd=FreqDist([x for x in punctuationRemoved if len(x)>1]) print fd.keys()[0:10] 6. Now the more complicated, but less typing way:

Regular Expressions import nltk from nltk.book import * import re punctuation = re.compile("[,.; '-]") punctuationRemoved=[punctuation.sub("",x) for x in text1] fd=FreqDist([x for x in punctuationRemoved if len(x)>1]) print fd.keys()[0:10] 6. Now the more complicated, but less typing way: Import regular expression module

Regular Expressions import nltk from nltk.book import * import re punctuation = re.compile("[,.; '-]") punctuationRemoved=[punctuation.sub("",x) for x in text1] fd=FreqDist([x for x in punctuationRemoved if len(x)>1]) print fd.keys()[0:10] 6. Now the more complicated, but less typing way: Compile a regular expression

Regular Expressions import nltk from nltk.book import * import re punctuation = re.compile("[,.; '-]") punctuationRemoved=[punctuation.sub("",x) for x in text1] fd=FreqDist([x for x in punctuationRemoved if len(x)>1]) print fd.keys()[0:10] 6. Now the more complicated, but less typing way: The RegEx will match any of the characters inside the brackets

Regular Expressions import nltk from nltk.book import * import re punctuation = re.compile("[,.; '-]") punctuationRemoved=[punctuation.sub("",x) for x in text1] fd=FreqDist([x for x in punctuationRemoved if len(x)>1]) print fd.keys()[0:10] 6. Now the more complicated, but less typing way: Call the “sub” function associated with the RegEx named punctuation

Regular Expressions import nltk from nltk.book import * import re punctuation = re.compile("[,.; '-]") punctuationRemoved=[punctuation.sub("",x) for x in text1] fd=FreqDist([x for x in punctuationRemoved if len(x)>1]) print fd.keys()[0:10] 6. Now the more complicated, but less typing way: Replace anything that matches the RegEx with nothing

Regular Expressions import nltk from nltk.book import * import re punctuation = re.compile("[,.; '-]") punctuationRemoved=[punctuation.sub("",x) for x in text1] fd=FreqDist([x for x in punctuationRemoved if len(x)>1]) print fd.keys()[0:10] 6. Now the more complicated, but less typing way: As before, do this to each token in the text1 list

Regular Expressions import nltk from nltk.book import * import re punctuation = re.compile("[,.; '-]") punctuationRemoved=[punctuation.sub("",x) for x in text1] fd=FreqDist([x for x in punctuationRemoved if len(x)>1]) print fd.keys()[0:10] 6. Now the more complicated, but less typing way: Call this new list punctuationRemoved

Regular Expressions import nltk from nltk.book import * import re punctuation = re.compile("[,.; '-]") punctuationRemoved=[punctuation.sub("",x) for x in text1] fd=FreqDist([x for x in punctuationRemoved if len(x)>1]) print fd.keys()[0:10] 6. Now the more complicated, but less typing way: Get a FreqDist of all tokens with length >1

Regular Expressions import nltk from nltk.book import * import re punctuation = re.compile("[,.; '-]") punctuationRemoved=[punctuation.sub("",x) for x in text1] fd=FreqDist([x for x in punctuationRemoved if len(x)>1]) print fd.keys()[0:10] 6. Now the more complicated, but less typing way: Print the top 10 word tokens as usual

Regular Expressions import nltk from nltk.book import * import re punctuation = re.compile("[,.; '-]") punctuationRemoved=[punctuation.sub("",x) for x in text1] fd=FreqDist([x for x in punctuationRemoved if len(x)>1]) print fd.keys()[0:10] 6. Now the more complicated, but less typing way: Regular Expressions are Really Powerful and Useful!
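
For Python 3 the regular-expression version can be written as the sketch below; the raw string avoids escape warnings, and the character class mirrors the slide's, minus the space (word tokens contain no spaces), with the hyphen kept last so it is treated literally:

import re
import nltk
from nltk.book import *
# One character class covers all the punctuation handled one replace() at a time above.
punctuation = re.compile(r"[,.;'-]")
punctuationRemoved = [punctuation.sub("", x) for x in text1]
fd = nltk.FreqDist(x for x in punctuationRemoved if len(x) > 1)
print(fd.most_common(10))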

Quick Diversion import nltk from nltk.book import * import re print fd.keys()[-10:] 7. What if you wanted to see the least common word tokens?

Quick Diversion import nltk from nltk.book import * import re print fd.keys()[-10:] 7. What if you wanted to see the least common word tokens? Print the tokens from position -10 to the end

Quick Diversion import nltk from nltk.book import * import re print [(k, fd[k]) for k in fd.keys()[0:10]] 8. And what if you wanted to see the frequencies with the words? For each key “k” in the FreqDist, print it and look up its value (fd[k])
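
With a Counter-based FreqDist (NLTK 3), the rare end of the distribution and the counts themselves can be inspected as in this sketch:

import nltk
from nltk.book import *
fd = nltk.FreqDist(text1)
# most_common() with no argument returns every (token, count) pair sorted from
# most to least frequent, so the tail of that list holds the rarest tokens.
print(fd.most_common()[-10:])
# Tokens that occur exactly once ("hapaxes") are another view of the rare end.
print(fd.hapaxes()[:10])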

Back to Regular Expressions import re myString="I have red shoes and blue pants and a green shirt. My phone number is and my friend's phone number is (800) and my cell number is You could also call me at if you'd like." colorsRegEx=re.compile("blue|red|green") print colorsRegEx.sub("color",myString) 9. Another simple example

Back to Regular Expressions import re myString="I have red shoes and blue pants and a green shirt. My phone number is and my friend's phone number is (800) and my cell number is You could also call me at if you'd like." colorsRegEx=re.compile("blue|red|green") print colorsRegEx.sub("color",myString) 9. Another simple example Looks similar to the RegEx that matched punctuation before

Back to Regular Expressions import re myString="I have red shoes and blue pants and a green shirt. My phone number is and my friend's phone number is (800) and my cell number is You could also call me at if you'd like." colorsRegEx=re.compile("blue|red|green") print colorsRegEx.sub("color",myString) 9. Another simple example This RegEx matches the substring “blue” or the substring “red” or the substring “green”

Back to Regular Expressions import re myString="I have red shoes and blue pants and a green shirt. My phone number is and my friend's phone number is (800) and my cell number is You could also call me at if you'd like." colorsRegEx=re.compile("blue|red|green") print colorsRegEx.sub("color",myString) 9. Another simple example Here, substitute anything that matches the RegEx with the string “color”
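
The quoted sentence on the slide lost some of its text in transcription, so here is a self-contained sketch of the same alternation idea with a made-up example string:

import re
# Hypothetical stand-in for the slide's myString.
myString = "I have red shoes and blue pants and a green shirt."
colorsRegEx = re.compile(r"blue|red|green")
# Every substring matching any of the three alternatives is replaced by "color".
print(colorsRegEx.sub("color", myString))
# -> I have color shoes and color pants and a color shirt.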

Back to Regular Expressions import re myString="I have red shoes and blue pants and a green shirt. My phone number is and my friend's phone number is (800) and my cell number is You could also call me at if you'd like." 10. A more interesting example What if we wanted to identify all of the phone numbers in the string?

Back to Regular Expressions import re myString="I have red shoes and blue pants and a green shirt. My phone number is and my friend's phone number is (800) and my cell number is You could also call me at if you'd like." phoneNumbersRegEx=re.compile('\d{11}') print phoneNumbersRegEx.findall(myString) 10. A more interesting example Note that \d is a digit, and {11} matches 11 digits in a row. This is a start. Output: [' ']

Back to Regular Expressions import re myString="I have red shoes and blue pants and a green shirt. My phone number is and my friend's phone number is (800) and my cell number is You could also call me at if you'd like." phoneNumbersRegEx=re.compile('\d{11}') print phoneNumbersRegEx.findall(myString) 10. A more interesting example findall will return a list of all substrings of myString that match the RegEx
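
Since the phone numbers in myString did not survive transcription, a self-contained sketch with invented digits shows what findall returns:

import re
# Hypothetical example text; the digits are made up for illustration.
text = "My phone number is 18005551234 and my cell number is 15205559876."
phoneNumbersRegEx = re.compile(r"\d{11}")
# findall returns every non-overlapping substring that matches the pattern.
print(phoneNumbersRegEx.findall(text))   # ['18005551234', '15205559876']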

Back to Regular Expressions import re myString="I have red shoes and blue pants and a green shirt. My phone number is and my friend's phone number is (800) and my cell number is You could also call me at if you'd like." phoneNumbersRegEx=re.compile('\d{11}') print phoneNumbersRegEx.findall(myString) 10. A more interesting example Also will need to know: “?” will match 0 or 1 repetitions of the previous element. Note: find lots more information on regular expressions here:
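
A tiny illustration of the "?" quantifier (0 or 1 of the preceding element):

import re
# "u?" makes the u optional, so both spellings match; "colouur" does not.
print(re.findall(r"colou?r", "color colour colouur"))   # ['color', 'colour']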

Back to Regular Expressions import re myString="I have red shoes and blue pants and a green shirt. My phone number is and my friend's phone number is (800) and my cell number is You could also call me at if you'd like." phoneNumbersRegEx=re.compile('1?-?\(?\d{3}\)?-?\d{3}-?\d{4}') print phoneNumbersRegEx.findall(myString) 10. A more interesting example Answer is here, but let’s derive it together
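
Put together as runnable code, the derived pattern looks like the sketch below; the test string is hypothetical, since the slide's original numbers were lost in transcription:

import re
# Invented numbers for illustration only.
myString = ("My phone number is 18005551234, my friend's is (800)5554321, "
            "and you could also call 1-800-555-6789.")
# Optional leading 1, optional dashes and parentheses, then 3 + 3 + 4 digits.
phoneNumbersRegEx = re.compile(r"1?-?\(?\d{3}\)?-?\d{3}-?\d{4}")
print(phoneNumbersRegEx.findall(myString))
# -> ['18005551234', '(800)5554321', '1-800-555-6789']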