TEXT STATISTICS 3
10/24/14
LING 3820 & 6820 Natural Language Processing
Harry Howard, Tulane University
Course organization

The syllabus is under construction. Chapter numbering:
3.7. How to deal with non-English characters
4.5. How to create a pattern with Unicode characters
6. Control
Open Spyder
Review of dictionaries & FreqDist

>>> from corpFunctions import textLoader
>>> text = textLoader('Wub.txt')
>>> from nltk.probability import FreqDist
NLTK version

>>> import nltk
>>> print nltk.__version__
Clarification of dictionary methods

>>> tallyDict = {'all':13, 'semantic':1}
>>> tallyDict['all']
>>> tallyDict['pardon']
>>> tallyDict['pardon'] = 1
>>> tallyDict['pardon'] = 2
>>> tallyDict = {'all':13, 'semantic':1, 'semantic':5}
>>> tallyDict[13]
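To make those lines concrete, here is a minimal commented walk-through; the .get() call at the end is standard Python added as an aside, not part of the slide:

tallyDict = {'all': 13, 'semantic': 1}
print(tallyDict['all'])            # 13: look a value up by its key
# print(tallyDict['pardon'])       # would raise KeyError: 'pardon' is not a key yet
tallyDict['pardon'] = 1            # adds the key 'pardon' with value 1
tallyDict['pardon'] = 2            # overwrites the value of an existing key
tallyDict = {'all': 13, 'semantic': 1, 'semantic': 5}
print(tallyDict['semantic'])       # 5: with a duplicate key in a literal, the last value wins
# print(tallyDict[13])             # would raise KeyError: 13 is a value, not a key;
#                                  # a dictionary maps key -> value, never value -> key
print(tallyDict.get('oops', 0))    # 0: .get() looks a key up with a fallback default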
A dictionary maps a key to a value

[Diagram: key1 -> value1, key2 -> value2, key3 -> value3; the keys are types, the values are token counts]
Create a FreqDist tally with a generator expression

>>> from nltk.probability import FreqDist
>>> wubFD = FreqDist(word for word in text)
>>> wubFD.items()[:30]
[(u'all', 13), (u'semantic', 1), (u'pardon', 1), (u'switched', 1), (u'Kindred', 1), (u'splashing', 1), (u'excellent', 1), (u'month', 1), (u'four', 1), (u'sunk', 1), (u'straws', 1), (u'sleep', 1), (u'skin', 1), (u'go', 8), (u'meditation', 2), (u'shrugged', 1), (u'milk', 1), (u'issues', 1), (u'...."', 1), (u'apartment', 1), (u'to', 57), (u'tail', 3), (u'dejectedly', 1), (u'squeezing', 1), (u'Not', 1), (u'sorry', 2), (u'Now', 2), (u'Eat', 1), (u'fists', 1), (u'And', 5)]
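A hedged side note: FreqDist will consume any iterable of tokens directly, so the generator expression is optional, and in NLTK 3 (where FreqDist is a collections.Counter subclass) items() is no longer a frequency-sorted list, so most_common() is the safer way to peek at the top entries. This sketch assumes text is the token list loaded earlier.

from nltk.probability import FreqDist

wubFD = FreqDist(text)        # equivalent to FreqDist(word for word in text)
print(wubFD.most_common(30))  # top 30 (sample, count) pairs, sorted by count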
New terminology from statistics

>>> wubFD
>>> len(wubFD.keys())      # number of distinct words (types)
>>> len(wubFD.samples())   # same (NLTK 2 only): a type is a "sample" in FreqDist's terms
>>> wubFD.B()              # same: B is the number of sample bins
>>> sum(wubFD.values())    # total number of word tokens
>>> wubFD.N()              # same: N is the total number of outcomes observed
A dictionary maps a key to a value; an experiment observes an outcome from a sample

[Diagram: key1 -> value1, key2 -> value2, key3 -> value3; keys (types) correspond to samples, counted by B; values (tokens) correspond to outcomes, counted by N]
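A toy experiment, as a minimal sketch, makes the correspondence concrete:

from nltk.probability import FreqDist

toy = FreqDist('the wub looked at the wub'.split())

print(toy.B())            # 4: distinct samples (types): 'the', 'wub', 'looked', 'at'
print(len(toy.keys()))    # 4: the same number, via the dictionary view
print(toy.N())            # 6: total outcomes (tokens) observed
print(sum(toy.values()))  # 6: the same number, via the dictionary view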
Value to key

>>> wubFD.max()            # the sample with the greatest count
u'.'
>>> wubFD[u'.']
289
>>> wubFD.Nr(289)          # how many samples occur exactly 289 times
1
>>> wubFD.r_Nr()           # the whole table: frequency r -> number of samples Nr
defaultdict(<type 'int'>, {0: 0, 1: 592, 2: 140, 3: 51, 4: 32, 5: 17, 6: 13, 7: 13, 8: 8, 9: 6, 10: 5, 11: 3, 12: 1, 13: 3, 14: 3, 15: 3, 17: 1, 18: 4, 19: 2, 20: 2, 21: 1, 22: 1, 23: 2, 25: 1, 26: 1, 28: 1, 30: 1, 33: 2, 34: 4, 164: 1, 37: 1, 39: 1, 41: 1, 48: 1, 53: 1, 54: 1, 56: 1, 57: 1, 59: 1, 61: 1, 66: 1, 69: 1, 289: 1, 141: 1, 146: 1})
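In other words, r_Nr() inverts the tally: it maps a frequency r to the number of samples Nr that occur exactly r times. A minimal sketch on the toy distribution from above:

from nltk.probability import FreqDist

toy = FreqDist('the wub looked at the wub'.split())

r_Nr = toy.r_Nr()   # maps frequency r -> how many samples occur r times
print(r_Nr[1])      # 2: 'looked' and 'at' each occur once
print(r_Nr[2])      # 2: 'the' and 'wub' each occur twice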
More new methods

>>> wubFD['the'] > wubFD['wub']   # compare the counts of two words
>>> wubFD.freq('.')               # relative frequency: count / N
An example

>>> wubFD['wub']
54
>>> wubFD.N()
>>> from __future__ import division   # so that / is true division in Python 2
>>> 54 / wubFD.N()
>>> wubFD.freq('wub')
>>> round(wubFD.freq('wub'), 3)
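The same computation on the toy distribution, as a hedged sketch; float() sidesteps the Python 2 integer-division issue that from __future__ import division fixes above:

from nltk.probability import FreqDist

toy = FreqDist('the wub looked at the wub'.split())

print(toy['wub'])                   # 2: raw count
print(toy.N())                      # 6: total tokens
print(toy['wub'] / float(toy.N()))  # 0.333...: relative frequency by hand
print(toy.freq('wub'))              # 0.333...: the same thing via freq()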
A hapax is a word that appears only once in a text

>>> wubFD.hapaxes()[:30]
[u'...."', u'1952', u'://', u'Am', u'An', u'Anything', u'Apparently', u'Are', u'Atomic', u'BEYOND', u'Back', u'Be', u'Beyond', u'Blundell', u'By', u'Cheer', u'DICK', u'Dick', u'Distributed', u'Do', u'Earth', u'Earthmen', u'Eat', u'Eating', u'End', u'Extensive', u'Finally', u'For', u'Good', u'Greg']
>>> len(wubFD.hapaxes())
592
>>> wubFD.Nr(1)
592
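hapaxes() and the r_Nr table necessarily agree, since both count the samples with frequency 1; a quick check on the toy distribution:

from nltk.probability import FreqDist

toy = FreqDist('the wub looked at the wub'.split())

print(toy.hapaxes())        # ['looked', 'at'] (order may vary): frequency-1 samples
print(len(toy.hapaxes()))   # 2
print(toy.r_Nr()[1])        # 2: the same number read off the r -> Nr table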
How to create a table of results with tabulate()

>>> wubFD.tabulate(10)
.    "    the    ,    I    '    said    The    to    ."
[tabulate also prints each sample's count beneath it; counts row not preserved]
>>> wubFD.tabulate(10, 20)
wub    it,"    and    of    you    ?"    It    his    s
wubFD.plot(50)
[Plot: frequency of the 50 most frequent samples]

wubFD.plot(50, cumulative=True)
[Plot: cumulative frequency of the 50 most frequent samples]
Zipf's law

Zipf's law states that, given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on. For example, in the Brown Corpus of American English text, the word "the" is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf's law, the second-place word "of" accounts for slightly over 3.5% of words (36,411 occurrences), followed by "and" (28,852). Only 135 vocabulary items are needed to account for half the Brown Corpus.
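Zipf's law is easy to eyeball on our own data: if it holds, rank times frequency should be roughly constant down the frequency table. A hedged sketch, assuming wubFD from earlier and NLTK 3's most_common():

# If frequency is proportional to 1/rank, then rank * count
# should hover around the same value for every row.
for rank, (sample, count) in enumerate(wubFD.most_common(10), 1):
    print('%2d  %-8s %4d  rank*count = %d' % (rank, sample, count, rank * count))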
Other examples

The same relationship occurs in many other rankings unrelated to language, such as the population ranks of cities in various countries, corporation sizes, income rankings, and ranks of the number of people watching the same TV channel. The appearance of the distribution in rankings of cities by population was first noticed by Felix Auerbach in 1913.
What to do about the most frequent words

>>> import nltk
>>> stop = nltk.corpus.stopwords.words('english')
>>> stop
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
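One standard move, sketched here under the assumption that text is the token list loaded earlier: lower-case the tokens, drop stop words and punctuation, and tally what remains, so that content words like 'wub' rise to the top.

import nltk
from nltk.probability import FreqDist

stop = nltk.corpus.stopwords.words('english')

# Keep only alphabetic tokens that are not stop words, then tally.
contentFD = FreqDist(word.lower() for word in text
                     if word.isalpha() and word.lower() not in stop)
print(contentFD.most_common(10))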
Next time

Q7
Conditional frequency