NLTK & BASIC TEXT STATS DAY 19 - 10/08/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University.


1 NLTK & BASIC TEXT STATS DAY 19 - 10/08/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University

2 Course organization
08-Oct-2014, NLP, Prof. Howard, Tulane University
- http://www.tulane.edu/~howard/LING3820/
- The syllabus is under construction.
- http://www.tulane.edu/~howard/CompCultEN/
- Chapter numbering
- 3.7. How to deal with non-English characters
- 4.5. How to create a pattern with Unicode characters
- 6. Control

3 Review of scripts & functions: The quiz as a function in a script

4 Open Spyder

5 NLTK: Could you download the archive?

6 Loading the book's texts
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G. K. Chesterton 1908
>>>

7 Searching text
- Show every token of a word in context, called concordance view:
  >>> text1.concordance('monstrous')
- Show the words that appear in a similar range of contexts:
  >>> text1.similar('monstrous')
- Show the contexts that two words share:
  >>> text1.common_contexts(['whale','man'])
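To make the idea of a concordance view concrete, here is a minimal pure-Python sketch. It is not NLTK's implementation: the function name `concordance`, the `width` parameter, the case-insensitive matching, and the toy sentence are all assumptions made for illustration.

```python
def concordance(tokens, word, width=3):
    """Return each occurrence of `word` with up to `width` tokens of context per side.

    Hypothetical helper for illustration; matching ignores case by assumption.
    """
    target = word.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = tokens[max(0, i - width):i]      # tokens before the hit
            right = tokens[i + 1:i + 1 + width]     # tokens after the hit
            lines.append(" ".join(left + [tok] + right))
    return lines

# Toy token list (made up, not taken from text1)
tokens = "the monstrous whale swam past a most monstrous ship".split()
hits = concordance(tokens, "monstrous", width=2)
for line in hits:
    print(line)
```

Each printed line is one token of the word together with its neighbours, which is the information a concordance view lays out.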

8 Searching text, cont.
- Plot how far each token of a word is from the beginning of a text:
  >>> text1.dispersion_plot(['monstrous'])
- Generate random text:
  >>> text1.generate()
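A dispersion plot is drawn from the positions at which a word occurs. The sketch below only computes those positions for a toy token list; the helper name `offsets` and the sample sentence are assumptions, and no plotting is attempted.

```python
def offsets(tokens, word):
    """Return the index of every token equal to `word`, counted from the start of the text."""
    return [i for i, tok in enumerate(tokens) if tok == word]

# Toy token list (made up, not taken from text1)
tokens = "the monstrous whale swam past a most monstrous ship".split()
positions = offsets(tokens, "monstrous")
print(positions)
```

Plotting one mark per position along a line as long as the text gives the kind of picture the slide describes.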

9 Counting vocabulary
- Count the word and punctuation tokens in a text:
  >>> len(text1)
- List the unique words, i.e. the word types, in a text:
  >>> set(text1)
- Count how many types there are in a text:
  >>> len(set(text1))
- Count the tokens of a word type:
  >>> text1.count('smote')
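The same counts can be tried on any plain Python list of tokens, without loading the NLTK texts. The token list below is a made-up example, and the variable names are chosen only for illustration.

```python
# Toy token list (made up); punctuation marks count as tokens too
tokens = ["the", "whale", "smote", "the", "ship", ",", "and", "the", "ship", "sank", "."]

n_tokens = len(tokens)             # every word and punctuation token
types = set(tokens)                # unique tokens, i.e. the types
n_types = len(types)               # how many types there are
n_smote = tokens.count("smote")    # tokens of one particular type
n_the = tokens.count("the")

print(n_tokens, n_types, n_smote, n_the)
```

Note that the set includes the comma and the period, so "types" here means word and punctuation types alike.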

10 Lexical richness or diversity
- The lexical richness or diversity of a text can be estimated as tokens per type:
  >>> len(text1) / len(set(text1))
- The frequency of a type can be estimated as its tokens as a percentage of all tokens. In Python 2, '/' between integers does integer division, so import true division first:
  >>> from __future__ import division
  >>> 100 * text1.count('a') / len(text1)
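The two measures can be wrapped as small functions. This is a sketch that assumes Python 3, where '/' is already true division; the function names `lexical_diversity` and `percentage` and the toy token list are assumptions for illustration.

```python
def lexical_diversity(tokens):
    """Average number of tokens per type (assumes a non-empty token list)."""
    return len(tokens) / len(set(tokens))

def percentage(word, tokens):
    """Share of all tokens, in percent, that are tokens of `word`."""
    return 100 * tokens.count(word) / len(tokens)

# Toy token list (made up): 11 tokens, 8 types
tokens = ["the", "whale", "smote", "the", "ship", ",", "and", "the", "ship", "sank", "."]
diversity = lexical_diversity(tokens)
the_share = percentage("the", tokens)
print(diversity, the_share)
```

With the course texts loaded, the same functions could be called as lexical_diversity(text1) or percentage('a', text1).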

11 Next time: There is no quiz for Monday. We will learn how to get our own text into Python & NLTK.

