TEXT STATISTICS 1
DAY 23 - 10/20/14
LING 3820 & 6820 Natural Language Processing
Harry Howard, Tulane University
Course organization
http://www.tulane.edu/~howard/LING3820/ (the syllabus is under construction)
http://www.tulane.edu/~howard/CompCultEN/
Chapter numbering:
3.7. How to deal with non-English characters
4.5. How to create a pattern with Unicode characters
6. Control
20-Oct-2014 NLP, Prof. Howard, Tulane University
Open Spyder
Review
The quiz was the review.
Review of NLTK modules
7.5.1. How to pre-process a text with the PlaintextCorpusReader
>>> from nltk.corpus import PlaintextCorpusReader
>>> wubReader = PlaintextCorpusReader('', 'Wub.txt', encoding='utf-8')
>>> wubWords = wubReader.words()
7.5.2. Adding the methods of NLTK Text
>>> from nltk.text import Text
>>> text = Text(wubWords)
Put it all in a single line
>>> text = Text(PlaintextCorpusReader('', 'Wub.txt', encoding='utf-8').words())
Make it a function
def textLoader(doc):
    from nltk.corpus import PlaintextCorpusReader
    from nltk.text import Text
    return Text(PlaintextCorpusReader('', doc, encoding='utf-8').words())
8.3. How to calculate a frequency distribution with FreqDist
Count the number of times that a word occurs in a text
>>> from corpFunctions import textLoader
>>> text = textLoader('Wub.txt')
>>> sample = list(set(text))[:10]
>>> sample
[u'all', u'semantic', u'pardon', u'switched', u'Kindred', u'splashing', u'excellent', u'month', u'four', u'sunk']
>>> tally = []
>>> for word in sample: tally.append(text.count(word))
...
>>> tally
[13, 1, 1, 1, 1, 1, 1, 1, 1, 1]
A table to associate type & count
all       13
semantic   1
pardon     1
switched   1
Kindred    1
8.3.1. How to keep track of disparate types with a dictionary
A Python dictionary is a sequence, enclosed in curly brackets, of key and value pairs joined by a colon, i.e. {key1:value1, key2:value2, ...}.
Make a dictionary by hand
>>> tallyDict = {'all':13, 'semantic':1, 'pardon':1, 'switched':1, 'Kindred':1}
>>> tallyDict['all']
13
>>> tallyDict['some']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'some'
Dictionary methods
>>> type(tallyDict)
>>> len(tallyDict)
>>> str(tallyDict)
>>> tallyDict.has_key('pardon')   # Python 2 only; prefer the next line
>>> 'pardon' in tallyDict
>>> tallyDict.items()
>>> tallyDict.items()[:3]
>>> tallyDict.keys()
>>> tallyDict.keys()[:3]
>>> tallyDict.values()
>>> tallyDict.values()[:3]
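A side note for anyone following along in Python 3: has_key() is gone there, and items(), keys(), and values() return views rather than lists, so they must be wrapped in list() before slicing. A minimal sketch of the same method calls, reusing the hand-made tallyDict from the previous slide:

```python
# Python 3 equivalents of the dictionary methods shown above.
tallyDict = {'all': 13, 'semantic': 1, 'pardon': 1, 'switched': 1, 'Kindred': 1}

print(type(tallyDict))          # <class 'dict'>
print(len(tallyDict))           # 5
print('pardon' in tallyDict)    # True; has_key() no longer exists in Python 3

# Views must be converted to lists before slicing.
print(list(tallyDict.items())[:3])
print(list(tallyDict.keys())[:3])
print(list(tallyDict.values())[:3])
```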
Equalities between dictionary & text
len(dictionary) == len(set(text))
sum(dictionary.values()) == len(text)
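Both equalities can be checked on a toy example before trying them on the Wub text. The token list below is invented for illustration; any list of strings would do:

```python
# Invented stand-in for a tokenized text.
tokens = ['the', 'wub', 'said', 'the', 'captain', 'the', 'wub']

# Build the type:count dictionary in one pass.
counts = {}
for word in tokens:
    counts[word] = counts.get(word, 0) + 1

# The number of keys equals the number of distinct word types...
assert len(counts) == len(set(tokens))
# ...and the counts sum to the total number of tokens.
assert sum(counts.values()) == len(tokens)
```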
8.3.2. How to keep a tally with a dictionary
The algorithm is to create an empty dictionary and then examine every word in the text: if the current word is already in the dictionary, add 1 to its value; otherwise, insert the word into the dictionary with a value of 1. Python follows English so closely that you can practically code this up word for word.
Make a dictionary in a loop
>>> wubDict = {}
>>> for word in text:
...     if word in wubDict: wubDict[word] = wubDict[word] + 1
...     else: wubDict[word] = 1
...
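The if/else in the loop above can be condensed: dict.get(word, 0) returns 0 for an unseen word, so one line covers both branches. The standard library's collections.Counter performs the same single-pass tally in one call. A sketch on an invented token list:

```python
from collections import Counter

tokens = ['to', 'the', 'wub', 'to', 'the', 'to']   # invented sample tokens

# dict.get(word, 0) supplies 0 for unseen words, collapsing the if/else.
wubDict = {}
for word in tokens:
    wubDict[word] = wubDict.get(word, 0) + 1

# collections.Counter builds the same tally in a single call.
assert wubDict == Counter(tokens)
print(wubDict)   # {'to': 3, 'the': 2, 'wub': 1}
```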
Check the equalities
>>> len(wubDict) == len(set(text))
>>> sum(wubDict.values()) == len(text)
View the first 30 items
>>> wubDict.items()[:30]
[(u'all', 13), (u'semantic', 1), (u'pardon', 1), (u'switched', 1), (u'Kindred', 1), (u'splashing', 1), (u'excellent', 1), (u'month', 1), (u'four', 1), (u'sunk', 1), (u'straws', 1), (u'sleep', 1), (u'skin', 1), (u'go', 8), (u'meditation', 2), (u'shrugged', 1), (u'milk', 1), (u'issues', 1), (u'...."', 1), (u'apartment', 1), (u'to', 57), (u'tail', 3), (u'dejectedly', 1), (u'squeezing', 1), (u'Not', 1), (u'sorry', 2), (u'Now', 2), (u'Eat', 1), (u'fists', 1), (u'And', 5)]
8.3.4. How to keep a tally with FreqDist()
FreqDist does all of the work of creating a dictionary of word frequencies for us, with the single caveat that it only works on NLTK text.
Increment a freq dist in a loop
>>> from nltk.probability import FreqDist
>>> wubFD = FreqDist()
>>> for word in text: wubFD.inc(word)   # NLTK 2; inc() was removed in NLTK 3, where wubFD[word] += 1 does the same
...
>>> wubFD.items()[:30]
[(u'.', 289), (u'"', 164), (u'the', 146), (u',', 141), (u'I', 69), (u"'", 66), (u'said', 61), (u'The', 59), (u'to', 57), (u'."', 56), (u'wub', 54), (u'it', 53), (u',"', 48), (u'and', 41), (u'of', 39), (u'you', 37), (u'?"', 34), (u'It', 34), (u'his', 34), (u's', 34), (u'Captain', 33), (u'a', 33), (u'at', 30), (u'in', 28), (u'Peterson', 26), (u'Franco', 25), (u'He', 23), (u'was', 23), (u'he', 22), (u'up', 21)]
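In NLTK 3, FreqDist subclasses collections.Counter, so the whole inc() loop can be replaced by constructing the distribution directly from the token iterable, and most_common(n) replaces items()[:n]. The sketch below uses Counter itself (with an invented token list) so that it runs without NLTK, but FreqDist behaves the same way:

```python
from collections import Counter

# FreqDist in NLTK 3 is a Counter subclass, so Counter stands in for it here.
tokens = ['the', 'wub', 'said', 'the', 'captain', 'the']   # invented tokens

fd = Counter(tokens)        # one call replaces the whole inc() loop
print(fd.most_common(2))    # [('the', 3), ('wub', 1)]
```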
Next time
More on text stats