TEXT STATISTICS 5
10/29/14
LING 3820 & 6820 Natural Language Processing
Harry Howard
Tulane University
Course organization
The syllabus is under construction. Chapter numbering:
3.7. How to deal with non-English characters
4.5. How to create a pattern with Unicode characters
6. Control
Open Spyder
Review of dictionaries & FreqDist
>>> from corpFunctions import textLoader
>>> text = textLoader('Wub.txt')
>>> from nltk.probability import FreqDist
>>> wubFD = FreqDist(word.lower() for word in text)
>>> wubFD.plot()
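Since a FreqDist behaves like a dictionary mapping each word to its count, the usual lookups work on wubFD. The lines below are a minimal sketch of that idea; they assume 'Wub.txt' sits in the working directory and use a plain whitespace split in place of the course's textLoader, which is not reproduced here.

from nltk.probability import FreqDist

# Sketch: build a FreqDist directly and try its dictionary-like API.
raw = open('Wub.txt').read()        # assumes the file is in the working directory
text = raw.split()                  # crude tokenization standing in for textLoader
wubFD = FreqDist(word.lower() for word in text)

print(wubFD['wub'])       # count of a single key, exactly like a dict lookup
print(wubFD.N())          # total number of tokens counted
print(wubFD.max())        # the single most frequent token
print(wubFD.freq('the'))  # relative frequency of 'the'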
wubFD.plot(50)
[Figure: output of wubFD.plot(50)]
wubFD.plot(50, cumulative=True)
[Figure: output of wubFD.plot(50, cumulative=True)]
We want to graph frequency by rank, logarithmically.
[Figure: rank on the x-axis, frequency on the y-axis]
Tuples, cf. the (k, v) pairs of freqdist.items()
>>> singleton = (1,)   # note the trailing comma: (1) alone is just the integer 1
>>> double = (1,2)
>>> triple = (1,2,3)
>>> quadruple = (1,2,3,4)
>>> singleton
>>> singleton[0]
>>> double[0]
>>> double[1]
>>> double[2]      # IndexError: indexing starts at 0, so the last index of double is 1
>>> triple[3]      # IndexError
>>> quadruple[4]   # IndexError
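The "cf. freqdist.items()" in the title is because .items() hands back exactly this kind of object: every item is a (key, value) tuple. A minimal sketch, assuming the wubFD built earlier:

# Sketch: FreqDist items are (word, count) tuples, so tuple indexing
# and unpacking apply to them directly. Assumes wubFD from above.
pairs = list(wubFD.items())
first = pairs[0]        # one 2-tuple: (word, count)
word, count = first     # tuple unpacking
print(first[0])         # the word, by index
print(first[1])         # its count, by index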
range()
>>> range(5)
[0, 1, 2, 3, 4]
# how would you get [1, 2, 3, 4, 5]?
>>> range(1,6)   # range(0+1, 5+1)
[1, 2, 3, 4, 5]
How to plot values on a logarithmic scale
The task is to extract the values (outcomes) from the frequency distribution so that they can be graphed against their rank, without the words themselves. The values must be sorted from high to low so that their position reflects their rank order.
>>> freq = [v for (k,v) in wubFD.items()]
>>> freq = sorted(freq, reverse=True)
>>> rank = range(1, len(freq)+1)
>>> import matplotlib.pyplot as plt
>>> plt.loglog(rank, freq)
>>> plt.title("Logarithmic rank-frequency plot of 'Beyond Lies the Wub'")
>>> plt.xlabel('Rank')
>>> plt.ylabel('Frequency')
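The same steps as a self-contained script, for reference. This is a sketch: it assumes wubFD has already been built as above, and it adds plt.show(), which is needed to display the figure when running outside an interactive console.

import matplotlib.pyplot as plt

# Sketch of the rank-frequency plot as a script. Assumes wubFD exists.
freq = sorted(wubFD.values(), reverse=True)   # counts only, highest first
rank = range(1, len(freq) + 1)                # 1, 2, ..., number of distinct words

plt.loglog(rank, freq)
plt.title("Logarithmic rank-frequency plot of 'Beyond Lies the Wub'")
plt.xlabel('Rank')
plt.ylabel('Frequency')
plt.show()   # display the figure when running as a script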
The plot
[Figure: logarithmic rank-frequency plot of 'Beyond Lies the Wub']
Homework
Make a logarithmic rank-frequency plot of the words in the vampire novel. Is it straighter?
The problem of the most frequent words
What to do about the most frequent words
>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
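A common use of this list, sketched below under the assumption that text is the token list loaded earlier from 'Wub.txt': filter the stop words out before counting, so that the top of the frequency distribution is no longer dominated by function words.

from nltk.corpus import stopwords
from nltk.probability import FreqDist

# Sketch: drop English stop words (and non-alphabetic tokens) before counting.
# Assumes `text` is the token list from 'Wub.txt' used earlier.
stops = set(stopwords.words('english'))
content = [w.lower() for w in text
           if w.lower() not in stops and w.isalpha()]
contentFD = FreqDist(content)
contentFD.plot(50)   # the most frequent items are now content words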
Conditional frequency distribution
A corpus with genres or categories
The Brown corpus has 1,161,192 samples (words) divided into 15 genres or categories:
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
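To see how those samples are spread across the categories, a quick sketch (the exact counts are whatever your installed copy of the Brown corpus reports):

from nltk.corpus import brown

# Sketch: number of word tokens in each Brown category, plus the total.
for c in brown.categories():
    print(c, len(brown.words(categories=c)))
print('total', len(brown.words()))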
Two simultaneous tallies
Every token of the corpus is paired with a category label:
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ...]
These pairs can be understood as (condition, sample).
ConditionalFreqDist
>>> from nltk.probability import ConditionalFreqDist
>>> cat = ['news', 'romance']
>>> catWords = [(c,w)
                for c in cat
                for w in brown.words(categories=c)]
>>> cfd = ConditionalFreqDist(catWords)
Check results
>>> len(catWords)
>>> catWords[:4]
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')]
>>> catWords[-4:]
[('romance', 'afraid'), ('romance', 'not'), ('romance', "''"), ('romance', '.')]
>>> cfd
>>> cfd.conditions()
['news', 'romance']
>>> cfd['news']
>>> cfd['romance']
>>> cfd['romance']['could']
>>> list(cfd['romance'])
[',', '.', 'the', 'and', 'to', 'a', 'of', '``', "''", 'was', 'with', 'you', 'for', 'at', 'He', 'on', 'him', 'said', '!', 'I', 'in', 'he', 'had', '?', 'her', 'that', 'it', 'his', 'she', ...]
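Since cfd[condition] is itself a FreqDist, counts of particular words can be compared across the two genres. The sketch below follows the NLTK book's modal-verb comparison, which is not on the slide above, and assumes cfd was built from catWords as shown:

# Sketch: compare counts of a few modal verbs in the two genres.
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=['news', 'romance'], samples=modals)

# Individual lookups, since cfd[condition] is a FreqDist:
print(cfd['news']['could'])
print(cfd['romance']['could'])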
Next time
Q7: Conditional frequency