TEXT STATISTICS 1 DAY 23 - 10/20/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University.



Course organization
20-Oct-2014, NLP, Prof. Howard, Tulane University

- The syllabus is under construction.
- Chapter numbering
  - 3.7. How to deal with non-English characters
  - 4.5. How to create a pattern with Unicode characters
  - 6. Control

Open Spyder

Review
The quiz was the review.

Review of NLTK modules

How to pre-process a text with the PlaintextCorpusReader
>>> from nltk.corpus import PlaintextCorpusReader
>>> wubReader = PlaintextCorpusReader('', 'Wub.txt', encoding='utf-8')
>>> wubWords = wubReader.words()

Adding the methods of NLTK Text
>>> from nltk.text import Text
>>> text = Text(wubWords)

Put it all in a single line
>>> text = Text(PlaintextCorpusReader('', 'Wub.txt', encoding='utf-8').words())

Make it a function
def textLoader(doc):
    # Read the file doc from the current folder and wrap its words in an NLTK Text.
    from nltk.corpus import PlaintextCorpusReader
    from nltk.text import Text
    return Text(PlaintextCorpusReader('', doc, encoding='utf-8').words())

8.3. How to calculate a frequency distribution with FreqDist

Count the number of times that a word occurs in a text
>>> from corpFunctions import textLoader
>>> text = textLoader('Wub.txt')
>>> sample = list(set(text))[:10]
>>> sample
[u'all', u'semantic', u'pardon', u'switched', u'Kindred', u'splashing', u'excellent', u'month', u'four', u'sunk']
>>> tally = []
>>> for word in sample:
...     tally.append(text.count(word))
...
>>> tally
[13, 1, 1, 1, 1, 1, 1, 1, 1, 1]

A table to associate type & count

type      count
all       13
semantic  1
pardon    1
switched  1
Kindred   1
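
The table above pairs each word type with its count. As a minimal sketch, two parallel lists like the slide's sample and tally can be zipped into such pairs. The five entries below are copied from the table, and the name table is introduced here only for illustration.

```python
# Word types and their counts, copied from the table above.
sample = ['all', 'semantic', 'pardon', 'switched', 'Kindred']
tally = [13, 1, 1, 1, 1]

# zip() walks the two lists in step, yielding (type, count) pairs.
table = list(zip(sample, tally))

for word, count in table:
    # Left-align the word in a 10-character column, then print the count.
    print('{:<10}{}'.format(word, count))
```

Keeping the two lists in step by hand is fragile, which is one motivation for storing the pairs in a single structure.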

How to keep track of disparate types with a dictionary
A Python dictionary is a collection of key-value pairs within curly brackets, with each key joined to its value by a colon and the pairs separated by commas, i.e. {key1:value1, key2:value2, …}

Make a dictionary by hand
>>> tallyDict = {'all':13, 'semantic':1, 'pardon':1, 'switched':1, 'Kindred':1}
>>> tallyDict['all']
13
>>> tallyDict['some']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'some'

Dictionary methods
>>> type(tallyDict)
>>> len(tallyDict)
>>> str(tallyDict)
>>> tallyDict.has_key('pardon') # prefer next line
>>> 'pardon' in tallyDict
>>> tallyDict.items()
>>> tallyDict.items()[:3]
>>> tallyDict.keys()
>>> tallyDict.keys()[:3]
>>> tallyDict.values()
>>> tallyDict.values()[:3]
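
The calls on this slide are written for Python 2. As a hedged sketch of how the same checks look under Python 3 (an assumption about the reader's interpreter, not part of the slides): has_key() no longer exists, and items(), keys() and values() return views that have to be wrapped in list() before they can be sliced. The names has_pardon, first_items, first_keys and first_values are introduced here for illustration.

```python
tallyDict = {'all': 13, 'semantic': 1, 'pardon': 1, 'switched': 1, 'Kindred': 1}

# Membership test: the `in` operator replaces has_key().
has_pardon = 'pardon' in tallyDict

# Views cannot be sliced directly, so convert them to lists first.
first_items = list(tallyDict.items())[:3]
first_keys = list(tallyDict.keys())[:3]
first_values = list(tallyDict.values())[:3]

print(type(tallyDict), len(tallyDict))
print(has_pardon)
print(first_items)
```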

Equalities between dict & text
len(dictionary) == len(set(text))
sum(dictionary.values()) == len(text)
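
To see why these two equalities hold, consider a toy text: each distinct word contributes exactly one key, and each token adds 1 to exactly one value. The word list and the hand-built dictionary below are invented for illustration and are not taken from Wub.txt.

```python
# A toy text of six tokens; the words are made up for this example.
text = ['the', 'wub', 'said', 'the', 'wub', 'the']

# The tally that a correct count of the toy text should produce.
dictionary = {'the': 3, 'wub': 2, 'said': 1}

# One key per distinct word (type).
types_match = len(dictionary) == len(set(text))

# The counts add up to the number of tokens.
tokens_match = sum(dictionary.values()) == len(text)

print(types_match, tokens_match)
```

If either comparison comes out False for a real text, the tally has missed or double-counted a word.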

How to keep a tally with a dictionary
The algorithm is to create an empty dictionary; then, examine every word in the text in such a way that if the current word is already in the dictionary, add 1 to its value; otherwise, insert the word in the dictionary with the value of 1. Python follows English so closely that you can practically code this up word for word.

Make a dictionary in a loop
>>> wubDict = {}
>>> for word in text:
...     if word in wubDict: wubDict[word] = wubDict[word]+1
...     else: wubDict[word] = 1
...
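
The if/else test in the loop above can be folded into a single line with the dictionary method get(), which returns a default value when the key is missing. This is a sketch of an alternative, not the slide's own code; the function name tally_words and the sample word list are hypothetical.

```python
def tally_words(words):
    """Return a dictionary mapping each word to the number of times it occurs."""
    counts = {}
    for word in words:
        # get() returns 0 for a word not yet seen, so no if/else is needed.
        counts[word] = counts.get(word, 0) + 1
    return counts


# Hypothetical sample standing in for the NLTK text.
words = ['to', 'go', 'to', 'sleep', 'to']
result = tally_words(words)
print(result)
```

Wrapping the loop in a function also makes it reusable for any list of words, not only the text loaded from Wub.txt.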

Check the equalities
>>> len(wubDict) == len(set(text))
>>> sum(wubDict.values()) == len(text)

View the first 30 items
>>> wubDict.items()[:30]
[(u'all', 13), (u'semantic', 1), (u'pardon', 1), (u'switched', 1), (u'Kindred', 1), (u'splashing', 1), (u'excellent', 1), (u'month', 1), (u'four', 1), (u'sunk', 1), (u'straws', 1), (u'sleep', 1), (u'skin', 1), (u'go', 8), (u'meditation', 2), (u'shrugged', 1), (u'milk', 1), (u'issues', 1), (u'...."', 1), (u'apartment', 1), (u'to', 57), (u'tail', 3), (u'dejectedly', 1), (u'squeezing', 1), (u'Not', 1), (u'sorry', 2), (u'Now', 2), (u'Eat', 1), (u'fists', 1), (u'And', 5)]

How to keep a tally with FreqDist()
FreqDist does all of the work of creating a dictionary of word frequencies for us, with the single caveat that it only works on NLTK text.

Increment a freq dist in a loop
>>> from nltk.probability import FreqDist
>>> wubFD = FreqDist()
>>> for word in text:
...     wubFD.inc(word)
...
>>> wubFD.items()[:30]
[(u'.', 289), (u'"', 164), (u'the', 146), (u',', 141), (u'I', 69), (u"'", 66), (u'said', 61), (u'The', 59), (u'to', 57), (u'."', 56), (u'wub', 54), (u'it', 53), (u',"', 48), (u'and', 41), (u'of', 39), (u'you', 37), (u'?"', 34), (u'It', 34), (u'his', 34), (u's', 34), (u'Captain', 33), (u'a', 33), (u'at', 30), (u'in', 28), (u'Peterson', 26), (u'Franco', 25), (u'He', 23), (u'was', 23), (u'he', 22), (u'up', 21)]
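
The inc() method used above belongs to the NLTK release that was current when these slides were written. In NLTK 3, FreqDist is built on Python's collections.Counter and inc() is no longer available. The sketch below uses Counter itself as a stand-in so that it runs without NLTK; the sample tokens and the names fd, fd_direct and top are hypothetical.

```python
from collections import Counter

# Hypothetical tokens standing in for the NLTK text.
words = ['the', 'wub', 'said', 'the', 'wub', 'the', '.']

# Incrementing in a loop: a missing key counts as 0, so += 1 always works.
fd = Counter()
for word in words:
    fd[word] += 1

# Passing the whole sequence does the same tally in one step.
fd_direct = Counter(words)

# most_common(n) lists the n most frequent items, highest count first.
top = fd.most_common(2)
print(top)
```

With NLTK 3 installed, FreqDist(text) and most_common(30) should play the roles of the loop and of items()[:30] respectively.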

Next time
More on text stats