Published by Juniper Rich. Modified over 6 years ago.
1
Text statistics 6 Day 29 - 11/03/14
LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University
2
Course organization http://www.tulane.edu/~howard/LING3820/
The syllabus is under construction.
Chapter numbering
3.7. How to deal with non-English characters
4.5. How to create a pattern with Unicode characters
6. Control
NLP, Prof. Howard, Tulane University 03-Nov-2014
3
Open Spyder
4
Review of dictionaries & FreqDist
5
The plot
6
What to do about the most frequent words
>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your',
 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their',
 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that',
 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being',
 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an',
 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of',
 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through',
 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then',
 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both',
 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not',
 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will',
 'just', 'don', 'should', 'now']
7
Usage
>>> from corpFunctions import textLoader
>>> text = textLoader('Wub.txt')
>>> from nltk.probability import FreqDist
>>> from nltk.corpus import stopwords
>>> # build the stopword set once instead of reloading the list for every word
>>> stop = set(stopwords.words('english'))
>>> wubFD = FreqDist(word.lower() for word in text if word.lower() not in stop)
>>> wubFD.plot(50)
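The filtering step can be tried without the course files. What follows is a minimal sketch, assuming a hand-made token list in place of Wub.txt and a three-word stopword set in place of NLTK's English list. collections.Counter stands in for FreqDist (in NLTK 3, FreqDist is a subclass of Counter). The names tokens, stop and content_counts are illustrative only.

```python
from collections import Counter

# Hypothetical stand-ins: a toy token list instead of Wub.txt and a
# tiny stopword set instead of stopwords.words('english').
tokens = ['The', 'wub', 'sat', 'on', 'the', 'deck', 'and', 'the', 'Wub', 'slept']
stop = {'the', 'on', 'and'}

# Lowercase every token, then count only the ones that are not stopwords.
content_counts = Counter(t.lower() for t in tokens if t.lower() not in stop)

print(content_counts['wub'])      # tally for one surviving word
print('the' in content_counts)    # stopwords never enter the tally
```

Lowercasing before the membership test matters: without it, 'The' would slip past a stopword list that only contains 'the'.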
8
Plot without stopwords
9
A corpus with genres or categories
The Brown corpus has 1,161,192 samples (words) divided into 15 genres or categories:
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
10
Two simultaneous tallies
Every token of the corpus is paired with a category label:
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ...]
These can be understood as (condition, sample).
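To make the (condition, sample) idea concrete, here is a small sketch that builds such pairs and keeps one tally per condition. It assumes a made-up two-category corpus (toy_corpus) rather than Brown, and uses a defaultdict of Counter objects as a plain-Python approximation of a conditional frequency distribution. It is not NLTK's implementation.

```python
from collections import Counter, defaultdict

# Hypothetical miniature corpus: category label -> list of tokens.
toy_corpus = {
    'news': ['The', 'jury', 'said', 'the', 'vote', 'could', 'stand'],
    'romance': ['She', 'could', 'not', 'say', 'what', 'she', 'could', 'do'],
}

# Pair every token with its category: (condition, sample).
pairs = [(cat, word) for cat, words in toy_corpus.items() for word in words]

# One Counter per condition; each pair adds 1 to the tally for its condition.
tallies = defaultdict(Counter)
for condition, sample in pairs:
    tallies[condition][sample] += 1

print(pairs[:3])
print(tallies['news']['could'], tallies['romance']['could'])
```

The same word ('could') is counted separately under each condition, which is what "two simultaneous tallies" amounts to.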
11
ConditionalFreqDist
# from nltk.corpus import brown
>>> from nltk.probability import ConditionalFreqDist
>>> cat = ['news', 'romance']
>>> catWord = [(c, w) for c in cat for w in brown.words(categories=c)]
>>> cfd = ConditionalFreqDist(catWord)
12
Conditional frequency distribution
13
Check results
>>> len(catWord)
170576
>>> catWord[:4]
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')]
>>> catWord[-4:]
[('romance', 'afraid'), ('romance', 'not'), ('romance', "''"), ('romance', '.')]
>>> cfd
<ConditionalFreqDist with 2 conditions>
>>> cfd.conditions()
['news', 'romance']
>>> cfd['news']
<FreqDist with outcomes>
>>> cfd['romance']
<FreqDist with outcomes>
>>> cfd['romance']['could']
193
>>> list(cfd['romance'])
[',', '.', 'the', 'and', 'to', 'a', 'of', '``', "''", 'was', 'with', 'you', 'for', 'at', 'He', 'on', 'him', 'said', '!', 'I', 'in', 'he', 'had', '?', 'her', 'that', 'it', 'his', 'she', ...]
14
A more interesting example
             can  could    may  might   must   will
news          93     86     66     38     50    389
religion      82     59     78     12     54     71
hobbies      268     58    131     22     83    264
sci fi        16     49      4    ...      8    ...
romance       74    193     11     51     45     43
humor        ...     30    ...    ...      9     13
15
Conditions = categories, sample = modal verbs
# from nltk.corpus import brown
# from nltk.probability import ConditionalFreqDist
>>> cat = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> mod = ['can', 'could', 'may', 'might', 'must', 'will']
>>> catWord = [(c, w) for c in cat for w in brown.words(categories=c) if w in mod]
>>> cfd = ConditionalFreqDist(catWord)
>>> cfd.tabulate()
>>> cfd.plot()
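A rough idea of what a tabulating step does with such counts: one row per condition, one column per sample. The sketch below is an assumption-laden illustration in plain Python. The pairs are invented, and the helper tabulate() is hypothetical, written here only to show the row/column layout. It is not the code behind cfd.tabulate().

```python
from collections import Counter, defaultdict

# Hypothetical (condition, sample) pairs, standing in for the Brown data.
pairs = [
    ('news', 'will'), ('news', 'will'), ('news', 'can'),
    ('romance', 'could'), ('romance', 'could'), ('romance', 'can'),
]
samples = ['can', 'could', 'will']

counts = defaultdict(Counter)
for condition, sample in pairs:
    counts[condition][sample] += 1

def tabulate(counts, samples):
    """Return text rows: a header, then one row per condition."""
    width = max(len(c) for c in counts)
    rows = [' ' * width + ''.join(f'{s:>6}' for s in samples)]
    for condition in sorted(counts):
        # A Counter returns 0 for a sample it has never seen.
        cells = ''.join(f'{counts[condition][s]:>6}' for s in samples)
        rows.append(f'{condition:>{width}}' + cells)
    return rows

print('\n'.join(tabulate(counts, samples)))
```

Fixing the column order with an explicit samples list keeps the rows comparable even when a condition never uses one of the words.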
16
cfd.tabulate()
Columns: can, could, may, might, must, will
Rows: news, religion, hobbies, science_fiction, romance, humor
17
cfd.plot()
18
Another example
The task is to find the frequency of 'america' and 'citizen' in NLTK's corpus of presidential inaugural addresses:
>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', ..., '2009-Obama.txt']
19
The code
# from nltk.corpus import inaugural
# from nltk.probability import ConditionalFreqDist
>>> keys = ['america', 'citizen']
>>> # pair each matching key with the year taken from the file name
>>> keyYear = [(k, title[:4]) for title in inaugural.fileids() for w in inaugural.words(title) for k in keys if w.lower().startswith(k)]
>>> cfd2 = ConditionalFreqDist(keyYear)
>>> cfd2.tabulate()
>>> cfd2.plot()
>>> annos = [fileid[:4] for fileid in inaugural.fileids()]
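The prefix-matching step is easy to get wrong, so here is a self-contained sketch of it. It assumes two invented file names and word lists (toy_speeches) in place of inaugural.fileids() and inaugural.words(). Each token is tested against each key in turn, because str.startswith accepts a string or a tuple of strings, not a list.

```python
# Hypothetical stand-ins for inaugural.fileids() and inaugural.words().
toy_speeches = {
    '1789-Washington.txt': ['Fellow', 'Citizens', 'of', 'America'],
    '1793-Washington.txt': ['American', 'citizens', 'and', 'citizenship'],
}
keys = ['america', 'citizen']

# For each token, test each key separately; record (key, year) on a match.
# The year is the first four characters of the file name.
key_year = [
    (key, fileid[:4])
    for fileid, words in toy_speeches.items()
    for w in words
    for key in keys
    if w.lower().startswith(key)
]

print(key_year)
```

Recording the key rather than the token means 'American', 'America' and 'citizenship' all collapse onto the two conditions of interest.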
20
cfd2.tabulate() cannot be read
america citizen
21
It's something like …
1789 1793 1797 1801 1805 …
america 2 1 8
citizen 5 9 7 10
22
Same thing, changing axes
america citizen
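Changing axes comes down to swapping the order of each pair, so that the year becomes the condition and the word becomes the sample. The sketch below illustrates this with invented (key, year) pairs. The helper tally() is hypothetical and uses a defaultdict of Counter objects instead of ConditionalFreqDist.

```python
from collections import Counter, defaultdict

# Hypothetical (key, year) pairs of the kind built for the inaugural corpus.
key_year = [('america', '1789'), ('citizen', '1789'),
            ('citizen', '1793'), ('citizen', '1793')]

def tally(pairs):
    """Group pairs into {condition: Counter of samples}."""
    table = defaultdict(Counter)
    for condition, sample in pairs:
        table[condition][sample] += 1
    return table

# Conditions are the words, samples are the years.
by_key = tally(key_year)
# Swap each pair: conditions are the years, samples are the words.
by_year = tally((year, key) for key, year in key_year)

print(by_key['citizen']['1793'])
print(by_year['1793']['citizen'])
```

The counts are identical either way; only the grouping changes, which decides what ends up as rows and what as columns.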
23
cfd2.plot()
24
Next time
Q8
Twitter maybe