Text statistics 2
Day 24 - 10/22/14
LING 3820 & 6820 Natural Language Processing
Harry Howard
Tulane University
Course organization
http://www.tulane.edu/~howard/LING3820/
  The syllabus is under construction.
http://www.tulane.edu/~howard/CompCultEN/
  Chapter numbering:
  3.7. How to deal with non-English characters
  4.5. How to create a pattern with Unicode characters
  6. Control
Open Spyder
Review of NLTK modules
Put it all in a single line

>>> temp = PlaintextCorpusReader('', 'Wub.txt', encoding='utf-8')
>>> temp1 = temp.words()
>>> temp2 = Text(temp1)

>>> text = Text(PlaintextCorpusReader('', 'Wub.txt', encoding='utf-8').words())
The text preparation function

def textLoader(doc):
    from nltk.corpus import PlaintextCorpusReader
    from nltk.text import Text
    # read the file and wrap its words in an NLTK Text object
    return Text(PlaintextCorpusReader('', doc, encoding='utf-8').words())

>>> from corpFunctions import textLoader
>>> text = textLoader('Wub.txt')
A dictionary maps a key to a value

key1:value1
key2:value2
key3:value3
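For instance, a toy dictionary (an illustration, not from the text) maps things to their colors:

>>> color = {'sky': 'blue', 'grass': 'green', 'snow': 'white'}
>>> color['grass']
'green'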
Dictionary methods

>>> tallyDict = {'all':13, 'semantic':1, 'pardon':1, 'switched':1, 'Kindred':1}
>>> type(tallyDict)
>>> len(tallyDict)
>>> str(tallyDict)
>>> 'pardon' in tallyDict
>>> tallyDict.items()
>>> tallyDict.keys()
>>> tallyDict.values()
>>> tallyDict['all']
Make a dictionary in a loop

>>> wubCount = {}
>>> for word in text:
...     if word in wubCount: wubCount[word] += 1
...     else: wubCount[word] = 1
...
>>> len(wubCount)
>>> wubCount.items()[:30]
[(u'all', 13), (u'semantic', 1), (u'pardon', 1), (u'switched', 1), (u'Kindred', 1), (u'splashing', 1), (u'excellent', 1), (u'month', 1), (u'four', 1), (u'sunk', 1), (u'straws', 1), (u'sleep', 1), (u'skin', 1), (u'go', 8), (u'meditation', 2), (u'shrugged', 1), (u'milk', 1), … ]
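The same tally can be written more compactly with dict.get, which supplies a default value for missing keys. This is standard Python, though it is not on the slide:

>>> wubCount2 = {}
>>> for word in text:
...     wubCount2[word] = wubCount2.get(word, 0) + 1   # start from 0 for unseen words
...
>>> wubCount2 == wubCount
True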
8.3.4. How to keep a tally with FreqDist()

FreqDist does all of the work of creating a dictionary of word counts for us. Here we feed it the NLTK text produced by textLoader, though it will in fact count any sequence of tokens.
Create a FreqDist tally with list comprehension syntax

>>> from nltk.probability import FreqDist
>>> wubFD = FreqDist(word for word in text)
>>> wubFD.items()[:30]
[(u'all', 13), (u'semantic', 1), (u'pardon', 1), (u'switched', 1), (u'Kindred', 1), (u'splashing', 1), (u'excellent', 1), (u'month', 1), (u'four', 1), (u'sunk', 1), (u'straws', 1), (u'sleep', 1), (u'skin', 1), (u'go', 8), (u'meditation', 2), (u'shrugged', 1), (u'milk', 1), (u'issues', 1), (u'...."', 1), (u'apartment', 1), (u'to', 57), (u'tail', 3), (u'dejectedly', 1), (u'squeezing', 1), (u'Not', 1), (u'sorry', 2), (u'Now', 2), (u'Eat', 1), (u'fists', 1), (u'And', 5)]
A FreqDist has all the dict methods

>>> type(wubFD)
>>> len(wubFD)
>>> str(wubFD)
>>> 'pardon' in wubFD
>>> wubFD.items()
>>> wubFD.keys()
>>> wubFD.values()
A new terminology from statistics

>>> wubFD
<FreqDist with 929 samples and 3693 outcomes>
>>> len(wubFD.samples())
929
>>> len(wubFD.keys())
>>> wubFD.B()
>>> sum(wubFD.values())
3693
>>> wubFD.N()
A dictionary maps a key to a value; an experiment observes an outcome from a sample.

  dictionary    statistics    FreqDist method
  key           sample        B() counts the samples
  value         outcome       N() counts the outcomes
More new methods

>>> wubFD['the'] > wubFD['wub']
>>> wubFD.max()          # the sample with the most outcomes (from value back to key)
u'.'
>>> wubFD[wubFD.max()]
289
>>> wubFD.Nr(289)        # how many samples have exactly 289 outcomes
1
>>> wubFD.freq('.')
0.07825616030327646
An example

>>> wubFD['wub']
54
>>> wubFD.N()
3693
>>> from __future__ import division
>>> 54/3693
0.01462225832656377
>>> wubFD.freq('wub')
0.01462225832656377
>>> round(wubFD.freq('wub'), 3)
0.015
A hapax is a word that only appears once in a text

>>> wubFD.hapaxes()[:30]
[u'...."', u'1952', u'://', u'Am', u'An', u'Anything', u'Apparently', u'Are', u'Atomic', u'BEYOND', u'Back', u'Be', u'Beyond', u'Blundell', u'By', u'Cheer', u'DICK', u'Dick', u'Distributed', u'Do', u'Earth', u'Earthmen', u'Eat', u'Eating', u'End', u'Extensive', u'Finally', u'For', u'Good', u'Greg']
>>> len(wubFD.hapaxes())
592
>>> wubFD.Nr(1)
592
How to create a table of results with tabulate()

>>> wubFD.tabulate(10)
   .    "  the    ,    I    ' said  The   to   ."
 289  164  146  141   69   66   61   59   57   56
>>> wubFD.tabulate(10, 20)
 wub   it   ,"  and   of  you   ?"   It  his    s
  54   53   48   41   39   37   34   34   34   34
wubFD.plot(50)
[figure: frequencies of the 50 most common samples]
wubFD.plot(50, cumulative=True)
[figure: cumulative frequencies of the 50 most common samples]
Zipf's law
http://en.wikipedia.org/wiki/Zipf's_law

Zipf's law states that, given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on. For example, in the Brown Corpus of American English text, the word "the" is the most frequently occurring word and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf's law, the second-place word "of" accounts for slightly over 3.5% of words (36,411 occurrences), followed by "and" (28,852). Only 135 vocabulary items are needed to account for half of the Brown Corpus.
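We can eyeball Zipf's law on our own tally. A minimal sketch (the variable names follow the session above; the exact numbers depend on the text):

>>> freqs = sorted(wubFD.values(), reverse=True)
>>> freqs[:5]                              # observed counts of the top 5 samples
>>> [freqs[0] // r for r in range(1, 6)]   # counts predicted by Zipf's law: top count / rank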
Other examples
The same relationship occurs in many other rankings unrelated to language, such as the population ranks of cities in various countries, corporation sizes, income rankings, ranks of number of people watching the same TV channel, and so on. The appearance of the distribution in rankings of cities by population was first noticed by Felix Auerbach in 1913.
What to do about the most frequent words

>>> import nltk
>>> stop = nltk.corpus.stopwords.words('english')
>>> stop
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
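A natural next step, sketched here rather than taken from the slides, is to rebuild the tally while skipping stopwords and non-alphabetic tokens:

>>> wubContent = FreqDist(w for w in text if w.lower() not in stop and w.isalpha())
>>> wubContent.tabulate(10)   # the most frequent content words remain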
Next time
Homework: 8.3.3. Practice with dictionaries
Conditional frequency