Published by Kelley Cain. Modified over 9 years ago.
1
TEXT STATISTICS 7 DAY 30 - 11/05/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University
2
Course organization
03-Nov-2014 NLP, Prof. Howard, Tulane University
http://www.tulane.edu/~howard/LING3820/ The syllabus is under construction.
http://www.tulane.edu/~howard/CompCultEN/ Chapter numbering:
3.7. How to deal with non-English characters
4.5. How to create a pattern with Unicode characters
6. Control
3
Final project
4
Open Spyder
5
Review
6
ConditionalFreqDist
>>> from nltk.corpus import brown
>>> from nltk.probability import ConditionalFreqDist
>>> cat = ['news', 'romance']
>>> # Pair every word with its category: (condition, sample)
>>> catWord = [(c, w)
...            for c in cat
...            for w in brown.words(categories=c)]
>>> cfd = ConditionalFreqDist(catWord)
7
Conditional frequency distribution
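A conditional frequency distribution can be pictured as one frequency count per condition. The sketch below uses only Python's standard library on a made-up list of (condition, sample) pairs. It illustrates the idea and is not NLTK's implementation; the names build_cfd and pairs are hypothetical.

```python
from collections import Counter, defaultdict


def build_cfd(pairs):
    """Group (condition, sample) pairs into one Counter per condition."""
    cfd = defaultdict(Counter)
    for condition, sample in pairs:
        cfd[condition][sample] += 1
    return cfd


# Toy data invented for illustration: (category, word) pairs
pairs = [
    ("news", "the"), ("news", "will"), ("news", "the"),
    ("romance", "could"), ("romance", "the"), ("romance", "could"),
]

cfd = build_cfd(pairs)
print(sorted(cfd))                    # the conditions
print(cfd["news"]["the"])             # count of 'the' under 'news'
print(cfd["romance"].most_common(1))  # most frequent sample under 'romance'
```

Each condition owns its own counter, so the same word can have a different count under each category.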
8
A more interesting example
          can  could  may  might  must  will
news       93     86   66     38    50   389
religion   82     59   78     12    54    71
hobbies   268     58  131     22    83   264
sci fi     16     49    4     12     8    16
romance    74    193   11     51    45    43
humor      16     30    8      8     9    13
9
Conditions = categories, sample = modal verbs
>>> # from nltk.corpus import brown
>>> # from nltk.probability import ConditionalFreqDist
>>> cat = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> mod = ['can', 'could', 'may', 'might', 'must', 'will']
>>> catWord = [(c, w)
...            for c in cat
...            for w in brown.words(categories=c)
...            if w in mod]
>>> cfd = ConditionalFreqDist(catWord)
>>> cfd.tabulate()
>>> cfd.plot()
10
cfd.tabulate()
                 can  could  may  might  must  will
news              93     86   66     38    50   389
religion          82     59   78     12    54    71
hobbies          268     58  131     22    83   264
science_fiction   16     49    4     12     8    16
romance           74    193   11     51    45    43
humor             16     30    8      8     9    13
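The tabulated counts can be read row by row. As a sketch, the snippet below copies the numbers from the table into a plain dictionary (rather than recomputing them from the Brown corpus) and reports the most frequent modal for each category. The names counts and top_modal are invented for this illustration.

```python
modals = ["can", "could", "may", "might", "must", "will"]

# Counts copied from the tabulate() output, in the column order of `modals`
counts = {
    "news":            [93, 86, 66, 38, 50, 389],
    "religion":        [82, 59, 78, 12, 54, 71],
    "hobbies":         [268, 58, 131, 22, 83, 264],
    "science_fiction": [16, 49, 4, 12, 8, 16],
    "romance":         [74, 193, 11, 51, 45, 43],
    "humor":           [16, 30, 8, 8, 9, 13],
}


def top_modal(row):
    """Return the modal with the highest count in one table row."""
    return modals[row.index(max(row))]


for category, row in counts.items():
    print(f"{category:<16} {top_modal(row)}")
```

Raw counts depend on the size of each category, so comparisons across rows are only rough.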
11
cfd.plot()
12
Another example
The task is to find the frequency of 'America' and 'citizen' in NLTK's corpus of presidential inaugural addresses:
>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', ..., '2009-Obama.txt']
13
cfd2.plot()
14
First try
from nltk.corpus import inaugural
from nltk.probability import ConditionalFreqDist
keys = ['america', 'citizen']
# Pair each matching word with the year taken from the file name
keyYear = [(w, title[:4])
           for title in inaugural.fileids()
           for w in inaugural.words(title)
           if w.lower() in keys]
cfd2 = ConditionalFreqDist(keyYear)
cfd2.plot()
15
cfd2.plot()
16
Second try
from nltk.corpus import inaugural
from nltk.probability import ConditionalFreqDist
keys = ['america', 'citizen']
# Use the key itself as the condition, so every word form counts toward its key
keyYear = [(k, title[:4])
           for title in inaugural.fileids()
           for w in inaugural.words(title)
           for k in keys
           if w.lower().startswith(k)]
cfd3 = ConditionalFreqDist(keyYear)
cfd3.plot()
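The difference between matching whole words and matching by prefix can be checked on a handful of tokens without loading the corpus. The token list below is made up for illustration; only the two matching conditions come from the slides.

```python
keys = ["america", "citizen"]

# Hypothetical tokens standing in for inaugural.words(title)
tokens = ["America", "Americans", "citizens", "Citizen", "fellow", "american"]

# Exact match on the lowercased word
exact = [w for w in tokens if w.lower() in keys]

# Prefix match: any word that starts with one of the keys
prefix = [(k, w) for w in tokens for k in keys if w.lower().startswith(k)]

print(exact)
print(prefix)
```

The exact test keeps only 'America' and 'Citizen', while the prefix test also picks up 'Americans', 'citizens' and 'american'.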
17
cfd3.plot()
18
Stemming
19
Third try
from nltk.stem.snowball import EnglishStemmer
stemmer = EnglishStemmer()
from nltk.corpus import inaugural
from nltk.probability import ConditionalFreqDist
keys = ['america', 'citizen']
# Keep the words whose stem is one of the keys
keyYear = [(w, title[:4])
           for title in inaugural.fileids()
           for w in inaugural.words(title)
           if stemmer.stem(w) in keys]
cfd4 = ConditionalFreqDist(keyYear)
cfd4.plot()
20
cfd4.plot()
21
Next time: Twitter