Text statistics 6
Day 29 - 11/03/14
LING 3820 & 6820 Natural Language Processing
Harry Howard, Tulane University
Course organization
http://www.tulane.edu/~howard/LING3820/ The syllabus is under construction.
http://www.tulane.edu/~howard/CompCultEN/
Chapter numbering
3.7. How to deal with non-English characters
4.5. How to create a pattern with Unicode characters
6. Control
Open Spyder
Review of dictionaries & FreqDist
The plot
What to do about the most frequent words
>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
Usage
>>> from corpFunctions import textLoader
>>> text = textLoader('Wub.txt')
>>> from nltk.probability import FreqDist
>>> from nltk.corpus import stopwords
>>> wubFD = FreqDist(word.lower() for word in text if word.lower() not in stopwords.words('english'))
>>> wubFD.plot(50)
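A practical note: the comprehension above calls stopwords.words('english') once per token, and punctuation still passes the filter. A minimal variant (my addition, not from the slides) caches the stopword list in a set and folds in string.punctuation, which also removes the single-character punctuation tokens:
>>> import string
>>> stops = set(stopwords.words('english')) | set(string.punctuation)
>>> wubFD = FreqDist(word.lower() for word in text if word.lower() not in stops)
>>> wubFD.plot(50)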
Plot without stopwords
A corpus with genres or categories
The Brown corpus has 1,161,192 samples (words) divided into 15 genres or categories:
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
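A quick sanity check on the sizes; the two category totals below reappear later as the outcome counts of the conditional distribution:
>>> len(brown.words())
1161192
>>> len(brown.words(categories='news'))
100554
>>> len(brown.words(categories='romance'))
70022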
Two simultaneous tallies
Every token of the corpus is paired with a category label:
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ...]
These pairs can be understood as (condition, sample).
ConditionalFreqDist
# from nltk.corpus import brown  (already imported above)
>>> from nltk.probability import ConditionalFreqDist
>>> cat = ['news', 'romance']
>>> catWord = [(c, w) for c in cat for w in brown.words(categories=c)]
>>> cfd = ConditionalFreqDist(catWord)
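The list comprehension materializes all 170,576 pairs in memory before handing them over. Since ConditionalFreqDist accepts any iterable of (condition, sample) pairs, a generator expression builds the same distribution without the intermediate list; a minimal equivalent sketch:
>>> cfd = ConditionalFreqDist((c, w) for c in cat for w in brown.words(categories=c))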
Conditional frequency distribution
Check results
>>> len(catWord)
170576
>>> catWord[:4]
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')]
>>> catWord[-4:]
[('romance', 'afraid'), ('romance', 'not'), ('romance', "''"), ('romance', '.')]
>>> cfd
<ConditionalFreqDist with 2 conditions>
>>> cfd.conditions()
['news', 'romance']
>>> cfd['news']
<FreqDist with 100554 outcomes>
>>> cfd['romance']
<FreqDist with 70022 outcomes>
>>> cfd['romance']['could']
193
>>> list(cfd['romance'])
[',', '.', 'the', 'and', 'to', 'a', 'of', '``', "''", 'was', 'with', 'you', 'for', 'at', 'He', 'on', 'him', 'said', '!', 'I', 'in', 'he', 'had', '?', 'her', 'that', 'it', 'his', 'she', ...]
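One caveat: list(fd) returns the samples in decreasing frequency only in older NLTK releases; in current versions the order is not guaranteed. The stable way to inspect the top of a distribution is most_common():
>>> cfd['romance'].most_common(5)   # five most frequent (sample, count) pairs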
A more interesting example
How often does each of the modal verbs can, could, may, might, must, and will occur in each of six Brown genres? The full tabulated counts appear with cfd.tabulate() below.
Conditions = categories, sample = modal verbs
# from nltk.corpus import brown  (already imported)
# from nltk.probability import ConditionalFreqDist  (already imported)
>>> cat = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> mod = ['can', 'could', 'may', 'might', 'must', 'will']
>>> catWord = [(c, w) for c in cat for w in brown.words(categories=c) if w in mod]
>>> cfd = ConditionalFreqDist(catWord)
>>> cfd.tabulate()
>>> cfd.plot()
cfd.tabulate()
                  can could  may might must will
           news    93    86   66    38   50  389
       religion    82    59   78    12   54   71
        hobbies   268    58  131    22   83  264
science_fiction    16    49    4    12    8   16
        romance    74   193   11    51   45   43
          humor    16    30    8     8    9   13
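If you want to restrict the output or fix the row and column order, tabulate() and plot() also accept optional conditions and samples keyword arguments; for example:
>>> cfd.tabulate(conditions=cat, samples=mod)
>>> cfd.plot(conditions=cat, samples=mod)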
cfd.plot()
Another example
The task is to find the frequency of 'america' and 'citizen' in NLTK's corpus of presidential inaugural addresses:
>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', ..., '2009-Obama.txt']
The code
# from nltk.corpus import inaugural  (already imported)
# from nltk.probability import ConditionalFreqDist  (already imported)
>>> keys = ['america', 'citizen']
>>> keyYear = [(key, title[:4]) for title in inaugural.fileids() for w in inaugural.words(title) for key in keys if w.lower().startswith(key)]
>>> cfd2 = ConditionalFreqDist(keyYear)
>>> cfd2.tabulate()
>>> cfd2.plot()
>>> annos = [fileid[:4] for fileid in inaugural.fileids()]   # the year labels, handy for annotating the plot
cfd2.tabulate() cannot be read
        1789 1793 1797 1801 1805 1809 1813 1817 1821 1825 1829 1833 1837 1841 1845 1849 1853 1857 1861 1865 1869 1873 1877 1881 1885 1889 1893 1897 1901 1905 1909 1917 1921 1925 1929 1933 1937 1941 1945 1949 1953 1957 1961 1965 1969 1973 1977 1981 1985 1989 1993 1997 2001 2005 2009
america    2    1    8    0    1    0    1    1    2    0    0    2    2    7    0    2    2    3    2    1    0    0    1    2    4    6    9    9    7    0   12    4   24   11   12    2    5   12    2    4    6    7    7   10   10   23    5   16   21   11   33   31   20   30   15
citizen    5    1    6    7   10    1    4   14   15    3    2    3    7   38   11    2    4    7    7    0    5    3    9    9   13   12   10   10    2    1    6    3    6    5   12    1    2    1    1    1    7    0    5    4    1    1    0    3    6    3    2   10   11    7    2
It's something like …
          1789 1793 1797 1801 1805 …
america      2    1    8    0    1
citizen      5    1    6    7   10
Same thing, changing axes
year america citizen    year america citizen    year america citizen
1789    2      5        1861    2      7        1941   12      1
1793    1      1        1865    1      0        1945    2      1
1797    8      6        1869    0      5        1949    4      1
1801    0      7        1873    0      3        1953    6      7
1805    1     10        1877    1      9        1957    7      0
1809    0      1        1881    2      9        1961    7      5
1813    1      4        1885    4     13        1965   10      4
1817    1     14        1889    6     12        1969   10      1
1821    2     15        1893    9     10        1973   23      1
1825    0      3        1897    9     10        1977    5      0
1829    0      2        1901    7      2        1981   16      3
1833    2      3        1905    0      1        1985   21      6
1837    2      7        1909   12      6        1989   11      3
1841    7     38        1917    4      3        1993   33      2
1845    0     11        1921   24      6        1997   31     10
1849    2      2        1925   11      5        2001   20     11
1853    2      4        1929   12     12        2005   30      7
1857    3      7        1933    2      1        2009   15      2
                        1937    5      2
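The slides do not show the code for this transposed view, but it falls out of swapping the two members of each pair, so that the years become the conditions and the key words the samples; a minimal sketch (yearKey and cfd3 are my names):
>>> yearKey = [(title[:4], key) for title in inaugural.fileids() for w in inaugural.words(title) for key in keys if w.lower().startswith(key)]
>>> cfd3 = ConditionalFreqDist(yearKey)
>>> cfd3.tabulate(samples=keys)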
cfd2.plot()
Next time
Q8
Twitter, maybe