Text statistics 6
Day 29 - 11/03/14
LING 3820 & 6820 Natural Language Processing
Harry Howard, Tulane University
Course organization
http://www.tulane.edu/~howard/LING3820/ The syllabus is under construction.
http://www.tulane.edu/~howard/CompCultEN/
Chapter numbering
3.7. How to deal with non-English characters
4.5. How to create a pattern with Unicode characters
6. Control
Open Spyder
Review of dictionaries & FreqDist
The plot
What to do about the most frequent words
>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
Usage
>>> from corpFunctions import textLoader
>>> text = textLoader('Wub.txt')
>>> from nltk.probability import FreqDist
>>> from nltk.corpus import stopwords
>>> wubFD = FreqDist(word.lower() for word in text if word.lower() not in stopwords.words('english'))
>>> wubFD.plot(50)
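A practical note: the comprehension above calls stopwords.words('english') once per token, and punctuation still passes the filter. A minimal variant (my addition, not from the slides) caches the stopword list in a set and folds in string.punctuation, which also removes the single-character punctuation tokens:
>>> import string
>>> stops = set(stopwords.words('english')) | set(string.punctuation)
>>> wubFD = FreqDist(word.lower() for word in text if word.lower() not in stops)
>>> wubFD.plot(50)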
Plot without stopwords
A corpus with genres or categories
The Brown corpus has 1,161,192 samples (words) divided into 15 genres or categories:
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
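A quick sanity check on the sizes; the two category totals below reappear later as the outcome counts of the conditional distribution:
>>> len(brown.words())
1161192
>>> len(brown.words(categories='news'))
100554
>>> len(brown.words(categories='romance'))
70022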
Two simultaneous tallies
Every token of the corpus is paired with a category label:
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ...]
These pairs can be understood as (condition, sample).
ConditionalFreqDist
# from nltk.corpus import brown  (already imported above)
>>> from nltk.probability import ConditionalFreqDist
>>> cat = ['news', 'romance']
>>> catWord = [(c, w) for c in cat for w in brown.words(categories=c)]
>>> cfd = ConditionalFreqDist(catWord)
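The list comprehension materializes all 170,576 pairs in memory before handing them over. Since ConditionalFreqDist accepts any iterable of (condition, sample) pairs, a generator expression builds the same distribution without the intermediate list; a minimal equivalent sketch:
>>> cfd = ConditionalFreqDist((c, w) for c in cat for w in brown.words(categories=c))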
Conditional frequency distribution
Check results
>>> len(catWord)
170576
>>> catWord[:4]
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')]
>>> catWord[-4:]
[('romance', 'afraid'), ('romance', 'not'), ('romance', "''"), ('romance', '.')]
>>> cfd
<ConditionalFreqDist with 2 conditions>
>>> cfd.conditions()
['news', 'romance']
>>> cfd['news']
<FreqDist with 100554 outcomes>
>>> cfd['romance']
<FreqDist with 70022 outcomes>
>>> cfd['romance']['could']
193
>>> list(cfd['romance'])
[',', '.', 'the', 'and', 'to', 'a', 'of', '``', "''", 'was', 'with', 'you', 'for', 'at', 'He', 'on', 'him', 'said', '!', 'I', 'in', 'he', 'had', '?', 'her', 'that', 'it', 'his', 'she', ...]
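One caveat: list(fd) returns the samples in decreasing frequency only in older NLTK releases; in current versions the order is not guaranteed. The stable way to inspect the top of a distribution is most_common():
>>> cfd['romance'].most_common(5)   # five most frequent (sample, count) pairs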
A more interesting example
How often does each of the modal verbs can, could, may, might, must, and will occur in each of six Brown genres? The full tabulated counts appear with cfd.tabulate() below.
Conditions = categories, sample = modal verbs
# from nltk.corpus import brown  (already imported)
# from nltk.probability import ConditionalFreqDist  (already imported)
>>> cat = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> mod = ['can', 'could', 'may', 'might', 'must', 'will']
>>> catWord = [(c, w) for c in cat for w in brown.words(categories=c) if w in mod]
>>> cfd = ConditionalFreqDist(catWord)
>>> cfd.tabulate()
>>> cfd.plot()
cfd.tabulate()
                  can could  may might must will
           news    93    86   66    38   50  389
       religion    82    59   78    12   54   71
        hobbies   268    58  131    22   83  264
science_fiction    16    49    4    12    8   16
        romance    74   193   11    51   45   43
          humor    16    30    8     8    9   13
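If you want to restrict the output or fix the row and column order, tabulate() and plot() also accept optional conditions and samples keyword arguments; for example:
>>> cfd.tabulate(conditions=cat, samples=mod)
>>> cfd.plot(conditions=cat, samples=mod)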
cfd.plot()
Another example
The task is to find the frequency of 'america' and 'citizen' in NLTK's corpus of presidential inaugural addresses:
>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', ..., '2009-Obama.txt']
The code
# from nltk.corpus import inaugural  (already imported)
# from nltk.probability import ConditionalFreqDist  (already imported)
>>> keys = ['america', 'citizen']
>>> keyYear = [(key, title[:4]) for title in inaugural.fileids() for w in inaugural.words(title) for key in keys if w.lower().startswith(key)]
>>> cfd2 = ConditionalFreqDist(keyYear)
>>> cfd2.tabulate()
>>> cfd2.plot()
>>> annos = [fileid[:4] for fileid in inaugural.fileids()]   # the year labels, handy for annotating the plot
cfd2.tabulate() cannot be read
        1789 1793 1797 1801 1805 1809 1813 1817 1821 1825 1829 1833 1837 1841 1845 1849 1853 1857 1861 1865 1869 1873 1877 1881 1885 1889 1893 1897 1901 1905 1909 1917 1921 1925 1929 1933 1937 1941 1945 1949 1953 1957 1961 1965 1969 1973 1977 1981 1985 1989 1993 1997 2001 2005 2009
america    2    1    8    0    1    0    1    1    2    0    0    2    2    7    0    2    2    3    2    1    0    0    1    2    4    6    9    9    7    0   12    4   24   11   12    2    5   12    2    4    6    7    7   10   10   23    5   16   21   11   33   31   20   30   15
citizen    5    1    6    7   10    1    4   14   15    3    2    3    7   38   11    2    4    7    7    0    5    3    9    9   13   12   10   10    2    1    6    3    6    5   12    1    2    1    1    1    7    0    5    4    1    1    0    3    6    3    2   10   11    7    2
It's something like …
          1789 1793 1797 1801 1805 …
america      2    1    8    0    1
citizen      5    1    6    7   10
Same thing, changing axes
year america citizen    year america citizen    year america citizen
1789    2      5        1861    2      7        1941   12      1
1793    1      1        1865    1      0        1945    2      1
1797    8      6        1869    0      5        1949    4      1
1801    0      7        1873    0      3        1953    6      7
1805    1     10        1877    1      9        1957    7      0
1809    0      1        1881    2      9        1961    7      5
1813    1      4        1885    4     13        1965   10      4
1817    1     14        1889    6     12        1969   10      1
1821    2     15        1893    9     10        1973   23      1
1825    0      3        1897    9     10        1977    5      0
1829    0      2        1901    7      2        1981   16      3
1833    2      3        1905    0      1        1985   21      6
1837    2      7        1909   12      6        1989   11      3
1841    7     38        1917    4      3        1993   33      2
1845    0     11        1921   24      6        1997   31     10
1849    2      2        1925   11      5        2001   20     11
1853    2      4        1929   12     12        2005   30      7
1857    3      7        1933    2      1        2009   15      2
                        1937    5      2
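The slides do not show the code for this transposed view, but it falls out of swapping the two members of each pair, so that the years become the conditions and the key words the samples; a minimal sketch (yearKey and cfd3 are my names):
>>> yearKey = [(title[:4], key) for title in inaugural.fileids() for w in inaugural.words(title) for key in keys if w.lower().startswith(key)]
>>> cfd3 = ConditionalFreqDist(yearKey)
>>> cfd3.tabulate(samples=keys)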
cfd2.plot()
Next time
Q8
Twitter, maybe