NLTK & Python, Day 5. LING 681.02 Computational Linguistics. Harry Howard, Tulane University.



31-Aug-2009, LING, Prof. Howard, Tulane University

Course organization
• I have requested that Python and NLTK be installed on the computers in this room.

NLPP §1.2 A Closer Look at Python: Texts as Lists of Words

Variables
• variable = expression
    >>> my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode',
    ...            'forth', 'from', 'Camelot', '.']
    >>> noun_phrase = my_sent[1:4]
    >>> noun_phrase
    ['bold', 'Sir', 'Robin']
    >>> wOrDs = sorted(noun_phrase)
    >>> wOrDs
    ['Robin', 'Sir', 'bold']

How to name variables
• Valid names (or identifiers) …
  • must start with a letter or an underscore, optionally followed by letters, digits, or underscores;
  • are case-sensitive;
  • cannot contain whitespace (use an underscore instead) or a dash (it means minus);
  • cannot be a reserved word.
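The naming rules above can be checked mechanically. The following is a minimal sketch for Python 3; the helper name `is_valid_name` and the list of candidate names are made up for illustration and are not part of the slides.

```python
import keyword

def is_valid_name(name):
    """Return True if `name` can be used as a Python variable name.

    Hypothetical helper: str.isidentifier() checks the character rules,
    and keyword.iskeyword() rules out reserved words such as 'not'.
    """
    return name.isidentifier() and not keyword.iskeyword(name)

# Illustrative candidates: some valid, some breaking one rule each.
candidates = ['my_sent', 'wOrDs', 'word2', '2words', 'my sent', 'my-sent', 'not']
results = {name: is_valid_name(name) for name in candidates}

for name, ok in results.items():
    print(f'{name!r}: {ok}')
```

Names such as `2words` (starts with a digit), `my sent` (whitespace), `my-sent` (dash) and `not` (reserved word) should come out as invalid.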

Strings
• A string is a sequence of characters, for example an individual word. Several list operations, such as indexing and slicing, also work on strings.
• Some operations and methods for strings
    >>> name = 'Monty'
    >>> name[0]
    'M'
    >>> name[:4]
    'Mont'
    >>> name * 2
    'MontyMonty'
    >>> name + '!'
    'Monty!'
    >>> ' '.join(['Monty', 'Python'])
    'Monty Python'
    >>> 'Monty Python'.split()
    ['Monty', 'Python']

NLPP §1.3. Computing with Language: Simple Statistics

Frequency distribution
• What is a frequency distribution?
  • It tells us the frequency of each vocabulary item in a text.
  • It is a "distribution" because it tells us how the total number of word tokens in the text is distributed across the vocabulary items.
• What function in NLTK calculates it?
  • FreqDist(text_name)
• What expression lists the vocabulary items that were counted?
  • fdist1.keys(), where fdist1 = FreqDist(text_name)
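To make the idea concrete without loading a corpus, here is a small sketch that uses `collections.Counter` from the standard library as a stand-in for `FreqDist` (in NLTK 3, `FreqDist` is a subclass of `Counter`). The token list is invented for illustration.

```python
from collections import Counter

# A toy "text": a list of word tokens (made up for this example).
tokens = ['the', 'whale', 'and', 'the', 'sea', 'and', 'the', 'ship']

# Map each vocabulary item (type) to the number of times it occurs.
fdist = Counter(tokens)

# The counts add up to the total number of tokens in the text.
total_tokens = sum(fdist.values())

print(fdist['the'])          # frequency of one vocabulary item
print(sorted(fdist.keys()))  # the vocabulary items themselves
print(total_tokens)          # same as len(tokens)
```

The last line shows why it is called a distribution: the token total is split up across the vocabulary items.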

Very frequent words
• How would you describe the 50 most frequent elements in Moby Dick?
    >>> fdist1.plot(50, cumulative=True)
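A cumulative plot shows how much of the text the top-ranked words account for. The sketch below computes that share as a number rather than a plot, again using `collections.Counter` in place of `FreqDist`; the function name `cumulative_coverage` and the toy sentence are assumptions made for illustration.

```python
from collections import Counter

# Toy text (invented); a real run would use the tokens of a corpus.
tokens = 'the whale and the sea and the ship of the whale'.split()
fdist = Counter(tokens)

def cumulative_coverage(fdist, n):
    """Fraction of all tokens accounted for by the n most frequent items."""
    total = sum(fdist.values())
    top = fdist.most_common(n)  # list of (item, count), most frequent first
    return sum(count for _, count in top) / total

coverage_top2 = cumulative_coverage(fdist, 2)
coverage_all = cumulative_coverage(fdist, len(fdist))
print(round(coverage_top2, 2), coverage_all)
```

Even in this tiny example, two of the six vocabulary items cover more than half of the tokens.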

Very infrequent words
• Words that occur only once are called hapaxes.
    >>> fdist1.hapaxes()
• In Moby Dick: "lexicographer", "cetological", "contraband", "expostulations", and about 9,000 others.
• How would you describe them?
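As a rough sketch of what finding hapaxes involves, the following selects the items whose count is exactly one. It uses `collections.Counter` rather than NLTK, and the toy sentence is invented for illustration.

```python
from collections import Counter

# Toy text (invented); a real run would use the tokens of a corpus.
tokens = 'the whale and the sea and the ship of the whale'.split()
fdist = Counter(tokens)

# Hapaxes: vocabulary items that occur exactly once in the text.
hapaxes = sorted(w for w, count in fdist.items() if count == 1)
print(hapaxes)
```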

Summary

                    Most frequent       Least frequent
Length              short               long
Meaning             very general        very specific
Coverage of text    large proportion    small proportion

Question
• Which group would you look in to find words that help you understand what the text is about?
• Neither.

Fine-grained word selection
• Some Python expressions are based on set theory.
  a) {w | w ∈ V & P(w)}
  b) [w for w in V if p(w)], though this returns a list, not a set. (What's the difference?)
• Real NLTK
    >>> V = set(text1)
    >>> long_words = [w for w in V if len(w) > 15]
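The list-versus-set question can be explored on a small example. This sketch uses an invented token list and a length threshold of 3 chosen only so that a repeated word passes the filter.

```python
# Toy text (invented): 'Call' occurs twice.
text = ['Call', 'me', 'Ishmael', '.', 'Call', 'me', 'again', '.']
V = set(text)  # the vocabulary: each item once, in no particular order

# Filtering the vocabulary: the result is a list, but each word appears once.
long_from_vocab = [w for w in V if len(w) > 3]

# Filtering the text itself: duplicates and original order are kept.
long_from_text = [w for w in text if len(w) > 3]

# A set comprehension is the closest match to the set-builder notation.
long_as_set = {w for w in V if len(w) > 3}

print(sorted(long_from_vocab))
print(long_from_text)
```

A list keeps order and can hold the same word more than once; a set holds each word once and has no order, which is why the vocabulary-based result is sorted before printing.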

Finding words that characterize a text
• Not too short (>?) and not too infrequent (>?)
    >>> informative_words = [w for w in V if len(w) > 7 and fdist1[w] > 7]
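The two conditions (long enough, frequent enough) can be tried on toy data. In this sketch the function name `informative_words`, its parameter names, and the token list are all hypothetical; `collections.Counter` stands in for `FreqDist`.

```python
from collections import Counter

def informative_words(tokens, min_length=7, min_count=7):
    """Words longer than min_length that occur more than min_count times."""
    fdist = Counter(tokens)
    # Look up the count of each individual word; iterate over the vocabulary.
    return sorted(w for w in fdist
                  if len(w) > min_length and fdist[w] > min_count)

# Invented counts: only 'harpooneer' is both long and frequent.
tokens = (['whale'] * 10 + ['harpooneer'] * 8
          + ['lexicographer'] + ['the'] * 50)

result = informative_words(tokens)
print(result)
```

Here `the` and `whale` fail the length test and `lexicographer` fails the frequency test, so neither the most frequent nor the rarest words are selected.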

Finding groups of words
• What is the name for a sequence of two adjacent words?
  • Bigram ~ bigrams()
    >>> list(bigrams(['more', 'is', 'said', 'than', 'done']))
    [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
  • In NLTK 3, bigrams() returns a generator, so wrap it in list() to see the pairs.
• What is the name for a sequence of words that occur together unusually often?
  • Collocation ~ collocations()
  • Collocations are essentially bigrams that occur more often than we would expect based on the frequency of the individual words.
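The phrase "more often than we would expect" can be turned into a score. The sketch below builds bigrams with `zip` and scores them with pointwise mutual information (PMI), which compares a pair's observed probability with the product of its words' probabilities. PMI is used here only as an illustration; the scoring that NLTK's collocations() applies may differ. The local `bigrams` function, the `pmi` function and the toy sentence are assumptions, not NLTK code.

```python
import math
from collections import Counter

def bigrams(tokens):
    """Return the list of adjacent word pairs in `tokens`."""
    return list(zip(tokens, tokens[1:]))

def pmi(pair, unigram_counts, bigram_counts):
    """Pointwise mutual information of a bigram, in bits."""
    w1, w2 = pair
    n_tokens = sum(unigram_counts.values())
    n_bigrams = sum(bigram_counts.values())
    p_pair = bigram_counts[pair] / n_bigrams   # observed pair probability
    p_w1 = unigram_counts[w1] / n_tokens       # probability of each word alone
    p_w2 = unigram_counts[w2] / n_tokens
    return math.log2(p_pair / (p_w1 * p_w2))

pairs = bigrams(['more', 'is', 'said', 'than', 'done'])

# Toy text (invented) in which 'United States' recurs.
tokens = 'the United States and the people of the United States'.split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(bigrams(tokens))

score_us = pmi(('United', 'States'), unigram_counts, bigram_counts)
score_the = pmi(('the', 'United'), unigram_counts, bigram_counts)
print(pairs)
print(score_us > score_the)
```

Both pairs occur twice, but `the` is more frequent on its own, so `the United` is less surprising than `United States`. PMI is known to favour rare pairs, so practical collocation finders usually add a frequency cutoff or use a different statistic.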

Example
    >>> text4.collocations()
    Building collocations list
    United States; fellow citizens; years ago; Federal Government; General
    Government; American people; Vice President; Almighty God; Fellow
    citizens; Chief Magistrate; Chief Justice; God bless; Indian tribes;
    public debt; foreign nations; political parties; State governments;
    National Government; United Nations; public money

Counting Other Things

Next time
• First quiz/project
• NLPP: finish §1 and do all exercises; do up to Ex 8 in §2