NLTK & BASIC TEXT STATS DAY /08/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University
Course organization
08-Oct-2014 NLP, Prof. Howard, Tulane University
The syllabus is under construction.
Chapter numbering
3.7. How to deal with non-English characters
4.5. How to create a pattern with Unicode characters
6. Control
Review of scripts & functions
The quiz as a function in a script
Open Spyder
NLTK
Could you download the archive?
Loading the book's texts
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G. K. Chesterton 1908
>>>
Searching text
Show every token of a word in context, called concordance view:
>>> text1.concordance('monstrous')
Show the words that appear in a similar range of contexts:
>>> text1.similar('monstrous')
Show the contexts that two words share:
>>> text1.common_contexts(['whale','man'])
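To make concrete what a concordance view collects, here is a minimal sketch in plain Python. It is not NLTK's implementation: the function name `concordance_lines`, the `window` parameter, and the toy token list are all made up for illustration, and ignoring case when matching is a choice of this sketch.

```python
def concordance_lines(tokens, word, window=3):
    """Return (left context, token, right context) for each match of `word`."""
    target = word.lower()  # assumption of this sketch: matching ignores case
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines

# Hypothetical toy text, already split into tokens
toy_tokens = "the whale was a monstrous fish and the monstrous sea rose".split()

for left, tok, right in concordance_lines(toy_tokens, "monstrous"):
    # Right-align the left context so the matched word lines up in a column
    print(left.rjust(20) + " " + tok + " " + right)
```

Each match is kept with a few tokens on either side, which is what lets the eye scan down a column of the same word.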
Searching text, cont.
Plot how far each token of a word is from the beginning of a text:
>>> text1.dispersion_plot(['monstrous'])
Generate random text:
>>> text1.generate()
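The data behind a dispersion plot is just the offset of each token of the word from the start of the text. A rough sketch of that idea in plain Python follows; the names `token_offsets` and `dispersion_strip` and the toy tokens are hypothetical, and a one-line text strip stands in for the real plot.

```python
def token_offsets(tokens, word):
    """Return the position (0-based) of every token equal to `word`."""
    return [i for i, tok in enumerate(tokens) if tok == word]

def dispersion_strip(tokens, word):
    """Draw the text as one character per token: '|' for a match, '.' otherwise."""
    return "".join("|" if tok == word else "." for tok in tokens)

# Hypothetical toy text, already split into tokens
toy_tokens = "the whale was a monstrous fish and the monstrous sea rose".split()

print(token_offsets(toy_tokens, "monstrous"))
print(dispersion_strip(toy_tokens, "monstrous"))
```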
Counting vocabulary
Count the word and punctuation tokens in a text:
>>> len(text1)
List the unique words, i.e. the word types, in a text:
>>> set(text1)
Count how many types there are in a text:
>>> len(set(text1))
Count the tokens of a word type:
>>> text1.count('smote')
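The same counts can be tried on an ordinary Python list of tokens, with no NLTK required. The sentence below is a made-up toy example, and the variable names are only for illustration.

```python
# Hypothetical toy text; the final period counts as a token too
toy_tokens = "the whale smote the ship and the ship sank .".split()

n_tokens = len(toy_tokens)           # word and punctuation tokens
types = set(toy_tokens)              # unique tokens, i.e. the types
n_types = len(types)                 # how many types
n_the = toy_tokens.count("the")      # tokens of the type 'the'

print(n_tokens, n_types, n_the)
print(sorted(types))                 # sorted() gives the types a stable order
```

A set has no fixed order, so `sorted()` is used here to make the list of types easy to read.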
Lexical richness or diversity
The lexical richness or diversity of a text can be estimated as tokens per type:
>>> len(text1) / len(set(text1))
The relative frequency of a type can be estimated as its tokens as a percentage of all tokens. In Python 2, '/' does integer division on integers (this affects the ratio above too), so import true division first:
>>> from __future__ import division
>>> 100 * text1.count('a') / len(text1)
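Both ratios can be wrapped in small functions so they are easy to reuse on any list of tokens. This is a sketch under assumptions: the function names `lexical_diversity` and `percentage_of` and the toy tokens are made up here, and `float()` is used so that the division is true division in both Python 2 and Python 3.

```python
def lexical_diversity(tokens):
    """Tokens per type: how many times each type is used on average."""
    return float(len(tokens)) / len(set(tokens))

def percentage_of(tokens, word):
    """Tokens of `word` as a percentage of all tokens."""
    return 100 * float(tokens.count(word)) / len(tokens)

# Hypothetical toy text: 10 tokens, 7 types, 3 tokens of 'the'
toy_tokens = "the whale smote the ship and the ship sank .".split()

print(lexical_diversity(toy_tokens))
print(percentage_of(toy_tokens, "the"))
```

Both functions assume a non-empty list; an empty one would raise `ZeroDivisionError`.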
Next time
There is no quiz for Monday. We will learn how to get our own text into Python & NLTK.