Python 3 March 15, 2011. NLTK import nltk nltk.download()

Presentation on theme: "Python 3 March 15, 2011. NLTK import nltk nltk.download()"— Presentation transcript:

1 Python 3 March 15, 2011

2 NLTK

    import nltk
    nltk.download()

3 NLTK
1. Look at the list of available texts

    import nltk
    from nltk.book import *
    texts()

4-5 NLTK
2. Check out what the text1 (Moby Dick) object looks like

    import nltk
    from nltk.book import *
    print text1[0:50]

Looks like a list of word tokens.

6-7 NLTK
3. Get a list of the most frequent word TOKENS

    import nltk
    from nltk.book import *
    fd=FreqDist(text1)
    print fd.keys()[0:10]

FreqDist is an object defined by NLTK: http://www.opendocs.net/nltk/0.9.5/api/nltk.probability.FreqDist-class.html
Give it a list of word tokens.
It will be automatically sorted, most frequent token first.
Print the first 10 keys.
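
The slide depends on fd.keys() coming back sorted by frequency, which is how the NLTK release linked above behaved. A minimal sketch of the same idea in Python 3, using collections.Counter from the standard library as a stand-in for FreqDist (current NLTK releases build FreqDist on Counter, so most_common should carry over, but treat that as an assumption to check against your installed version). The token list is a hypothetical sample, not text1.

```python
from collections import Counter

# Hypothetical stand-in for text1: a short list of word tokens.
tokens = "the whale , the sea , and the ship".split()

# Count how often each token occurs.
fd = Counter(tokens)

# most_common(n) returns the n most frequent (token, count) pairs,
# most frequent first. In Python 3, keys() is a view and cannot be
# sliced, so this replaces fd.keys()[0:10].
top = fd.most_common(3)
print(top)
```

Note that punctuation counts as a token here, just as it does in text1.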

8-9 NLTK
4. Now get a concordance of the third most common word

    import nltk
    from nltk.book import *
    text1.concordance("and")

concordance is a method defined for an nltk text: http://nltk.googlecode.com/svn/trunk/doc/api/nltk.text.Text-class.html#concordance

    concordance(self, word, width=79, lines=25)
    Print a concordance for word with the specified context window.
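
To make clear what a concordance is, here is a rough sketch of the idea in plain Python 3. It is not NLTK's implementation: the function name, the `window` parameter (counted in tokens, whereas NLTK's `width` is counted in characters), and the sample tokens are all assumptions made for illustration.

```python
def concordance(tokens, word, window=3):
    """Return (left context, match, right context) for each occurrence of word."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():
            # Up to `window` tokens on either side of the match.
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines

# Hypothetical sample standing in for text1.
tokens = "Call me Ishmael . Some years ago , never mind how long precisely".split()
hits = concordance(tokens, "me")
for left, tok, right in hits:
    print(left, "[" + tok + "]", right)
```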

10-18 String Operations
5. What if you don't want punctuation in your list? First, a simple way to fix it:

    import nltk
    from nltk.book import *
    mobyDick=[x.replace(",","") for x in text1]
    mobyDick=[x.replace(";","") for x in mobyDick]
    mobyDick=[x.replace(".","") for x in mobyDick]
    mobyDick=[x.replace("'","") for x in mobyDick]
    mobyDick=[x.replace("-","") for x in mobyDick]
    mobyDick=[x for x in mobyDick if len(x)>1]
    fd=FreqDist(mobyDick)
    print fd.keys()[0:10]

Make a new list of tokens.
Call it mobyDick.
For each token x in the original list…
Copy the token into the new list, except replace each "," with nothing.
Repeat for each of the other punctuation marks.
Then, finally, just look at the nonempty tokens (not what was originally “.” and is now empty). Note that len(x)>1 also drops one-character tokens such as "a" and "I".
Make a new FreqDist with the new list of tokens, call it fd.
Print it like before.
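
The same string operations can be tried on a small list without loading the corpus. This is a sketch in Python 3 on a hypothetical token sample. It folds the five replace lines into a loop, which is a restructuring for brevity rather than what the slide does.

```python
# Hypothetical sample standing in for text1.
tokens = ["Call", "me", "Ishmael", ".", "Some", "years", "ago", ",",
          "never", "mind", "a", "sea-going", "whale", "'s"]

cleaned = tokens
for mark in [",", ";", ".", "'", "-"]:
    # Copy every token, replacing this punctuation mark with nothing.
    cleaned = [x.replace(mark, "") for x in cleaned]

# Keep tokens longer than one character. This drops the now-empty
# punctuation tokens, but also "a" and the leftover "s" from "'s".
cleaned = [x for x in cleaned if len(x) > 1]
print(cleaned)
```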

19-29 Regular Expressions
6. Now the more complicated, but less typing way:

    import nltk
    from nltk.book import *
    import re
    punctuation = re.compile("[,.; '-]")
    punctuationRemoved=[punctuation.sub("",x) for x in text1]
    fd=FreqDist([x for x in punctuationRemoved if len(x)>1])
    print fd.keys()[0:10]

Import regular expression module.
Compile a regular expression.
The RegEx will match any of the characters inside the brackets.
Call the “sub” function associated with the RegEx named punctuation.
Replace anything that matches the RegEx with nothing.
As before, do this to each token in the text1 list.
Call this new list punctuationRemoved.
Get a FreqDist of all tokens with length >1.
Print the top 10 word tokens as usual.
Regular Expressions are Really Powerful and Useful!
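
A small sketch of the regular expression step on its own, in Python 3 and without NLTK. The token list is a hypothetical sample, and the snake_case names are my own. The pattern is the one from the slide, written as a raw string.

```python
import re

# Character class: comma, period, semicolon, space, apostrophe, hyphen.
# The hyphen is last inside the brackets, so it is a literal hyphen.
punctuation = re.compile(r"[,.; '-]")

# Hypothetical sample standing in for text1.
tokens = ["Ishmael", ".", "sea-going", "whale", "'s", ";", "ago", ","]

# Replace every matching character in every token with nothing.
punctuation_removed = [punctuation.sub("", x) for x in tokens]

# Keep tokens with length > 1, as on the slide.
kept = [x for x in punctuation_removed if len(x) > 1]
print(kept)
```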

30-31 Quick Diversion
7. What if you wanted to see the least common word tokens?

    import nltk
    from nltk.book import *
    import re
    print fd.keys()[-10:]

Here fd is the FreqDist built in the previous step.
Print the tokens from position -10 to the end.

32 Quick Diversion
8. And what if you wanted to see the frequencies with the words?

    import nltk
    from nltk.book import *
    import re
    print [(k, fd[k]) for k in fd.keys()[0:10]]

For each of the first 10 keys “k” in the FreqDist, print it and look up its value (fd[k]).
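
Both diversions (least common tokens, and tokens paired with their counts) can be sketched in Python 3 with collections.Counter standing in for FreqDist. The sample sentence is hypothetical. The order among tokens with equal counts follows first appearance in current CPython, which is an implementation detail worth keeping in mind.

```python
from collections import Counter

# Hypothetical sample standing in for the cleaned token list.
tokens = "the whale the sea the ship and the whale".split()
fd = Counter(tokens)

# All (token, count) pairs, most frequent first.
ranked = fd.most_common()

top_two = ranked[:2]       # most common, with their counts
bottom_two = ranked[-2:]   # least common: slice from position -2 to the end

# Equivalent to the slide's lookup: pair each key with fd[key].
pairs = [(k, fd[k]) for k, _ in top_two]
print(top_two, bottom_two)
```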

33-36 Back to Regular Expressions
9. Another simple example

    import re
    myString="I have red shoes and blue pants and a green shirt. My phone number is 8005551234 and my friend's phone number is (800)-565-7568 and my cell number is 1-800-123-4567. You could also call me at 18005551234 if you'd like."
    colorsRegEx=re.compile("blue|red|green")
    print colorsRegEx.sub("color",myString)

Looks similar to the RegEx that matched punctuation before.
This RegEx matches the substring “blue” or the substring “red” or the substring “green”.
Here, substitute anything that matches the RegEx with the string “color”.
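
A short Python 3 sketch of the color substitution on a shortened sample string. Because the RegEx matches substrings, it also fires inside longer words. The second pattern adds `\b` word boundaries, which is my addition and not part of the slides.

```python
import re

my_string = "I have red shoes and blue pants and a green shirt."

# Alternation: match "blue" or "red" or "green" anywhere in the string.
colors = re.compile("blue|red|green")
result = colors.sub("color", my_string)
print(result)

# Substring matching also hits "red" inside "bored".
surprise = colors.sub("color", "I was bored")

# Assumption for illustration: \b restricts matches to whole words.
whole_words = re.compile(r"\b(?:blue|red|green)\b")
unchanged = whole_words.sub("color", "I was bored")
```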

37-40 Back to Regular Expressions
10. A more interesting example
What if we wanted to identify all of the phone numbers in the string?

    import re
    myString="I have red shoes and blue pants and a green shirt. My phone number is 8005551234 and my friend's phone number is (800)-565-7568 and my cell number is 1-800-123-4567. You could also call me at 18005551234 if you'd like."
    phoneNumbersRegEx=re.compile('\d{11}')
    print phoneNumbersRegEx.findall(myString)

Note that \d is a digit, and {11} matches 11 digits in a row.
findall will return a list of all non-overlapping substrings of myString that match the RegEx.
This is a start. Output: ['18005551234']
Also will need to know: “?” will match 0 or 1 repetitions of the previous element.
Note: find lots more information on regular expressions here: http://docs.python.org/library/re.html
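
A quick Python 3 sketch of the two building blocks just introduced, `{n}` and `?`, on a shortened sample string. The second pattern, `1?\d{10}`, is an illustrative intermediate step of my own and does not appear on the slides.

```python
import re

my_string = ("My phone number is 8005551234 and my cell number is "
             "1-800-123-4567. You could also call me at 18005551234.")

# Exactly 11 digits in a row: only the last number qualifies.
eleven = re.findall(r"\d{11}", my_string)

# "1?" makes the leading 1 optional (0 or 1 repetitions), so the
# 10-digit number now matches too. The dashed number still does not.
optional_one = re.findall(r"1?\d{10}", my_string)

print(eleven)
print(optional_one)
```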

41 Back to Regular Expressions
10. A more interesting example

    import re
    myString="I have red shoes and blue pants and a green shirt. My phone number is 8005551234 and my friend's phone number is (800)-565-7568 and my cell number is 1-800-123-4567. You could also call me at 18005551234 if you'd like."
    phoneNumbersRegEx=re.compile('1?-?\(?\d{3}\)?-?\d{3}-?\d{4}')
    print phoneNumbersRegEx.findall(myString)

Answer is here, but let's derive it together.
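
The transcript ends before the derivation, so the following is one possible way to build the pattern up, not necessarily the route taken in the lecture. It is a Python 3 sketch. The intermediate patterns and the step labels are my own, and only the last pattern comes from the slide.

```python
import re

my_string = ("I have red shoes and blue pants and a green shirt. "
             "My phone number is 8005551234 and my friend's phone number is "
             "(800)-565-7568 and my cell number is 1-800-123-4567. "
             "You could also call me at 18005551234 if you'd like.")

# Each step relaxes the previous pattern a little.
steps = [
    ("ten digits",            r"\d{3}\d{3}\d{4}"),
    ("optional dashes",       r"\d{3}-?\d{3}-?\d{4}"),
    ("optional parentheses",  r"\(?\d{3}\)?-?\d{3}-?\d{4}"),
    ("optional leading 1",    r"1?-?\(?\d{3}\)?-?\d{3}-?\d{4}"),  # pattern from the slide
]

results = {}
for label, pattern in steps:
    results[label] = re.findall(pattern, my_string)
    print(label, results[label])
```

The early steps return truncated pieces such as '1800555123', which is what motivates adding the optional leading 1 at the end.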

