Text Corpora and Lexical Resources: Chapter 2 of Natural Language Processing with Python

So far
We have learned the basics of Python:
– Reading and writing: interactive and files
– Control structures: if, while, for, function and class definitions
– Important data structures: lists, tuples, numeric types (int and float)
– Basic natural language processing techniques

Tonight
– Expanding the scope of textual information we can access
– Additional language constructions for working with text
– Reintroducing some Python structures for organizing programs

Text corpora
A collection of text entities
– Usually there is some unifying characteristic, but not always
– Typical examples:
  All issues of a newspaper for a period of time
  A collection of reports from a particular industry or standards body
– More recent examples:
  The whole collection of posts to Twitter
  All the entries in a blog or set of blogs

Check it out
Go to www.gutenberg.org and take a few minutes to explore the site.
– Look at the top 100 downloads of yesterday
– Can you characterize them? What do you think of this list?

Corpora in NLTK
NLTK includes part of the Gutenberg collection. Find out which texts with:
>>> import nltk
>>> nltk.corpus.gutenberg.fileids()
These are the texts of the Gutenberg collection that are downloaded with the NLTK package.

Accessing other texts
We will explore the files loaded with NLTK, but you may want to explore other texts as well. From help(nltk.corpus):
– If C{item} is one of the unique identifiers listed in the corpus module's C{items} variable, then the corresponding document will be loaded from the NLTK corpus package.
– If C{item} is a filename, then that file will be read.
For now, just a note that we can use these tools on other texts that we download or acquire from any source.

Using the tools we saw before
The particular texts we saw in Chapter 1 were accessed through aliases that simplified the interaction. Now, in the more general case, we have to do a little more. To get the list of words in a text:
>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
Now we have the form we had for the texts of Chapter 1 and can use the tools found there. Try:
>>> len(emma)
Note how often Jane Austen's books are used in examples like these. Why might that be?

Shortened reference
Global context: instead of citing the Gutenberg corpus for each resource,
>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', ...]
>>> emma = gutenberg.words('austen-emma.txt')
So nltk.corpus.gutenberg.words('austen-emma.txt') becomes just gutenberg.words('austen-emma.txt').

Other access options
gutenberg.words('austen-emma.txt') – the words of the text
gutenberg.raw('austen-emma.txt') – the original text, with no separation into tokens (words); one long string
gutenberg.sents('austen-emma.txt') – the text divided into sentences
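A minimal sketch (assuming the Gutenberg data is installed) comparing the three views of the same text:

from nltk.corpus import gutenberg
raw = gutenberg.raw('austen-emma.txt')      # one long string of characters
words = gutenberg.words('austen-emma.txt')  # a list of tokens
sents = gutenberg.sents('austen-emma.txt')  # a list of token lists
print len(raw), len(words), len(sents)      # characters, words, sentences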

Some code to run
Enter and run the code for counting characters, words, and sentences and finding the lexical diversity score of each text in the corpus.

import nltk
from nltk.corpus import gutenberg
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set([w.lower() for w in gutenberg.words(fileid)]))
    print int(num_chars/num_words), int(num_words/num_sents), \
          int(num_words/num_vocab), fileid

Short, simple code. We are already seeing some noticeable time to execute.

Modify the code
Simple change: print out the total number of characters, words, and sentences for each text. One possible solution follows.
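A sketch of one solution; only the print line really changes:

from nltk.corpus import gutenberg
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))    # total characters
    num_words = len(gutenberg.words(fileid))  # total words
    num_sents = len(gutenberg.sents(fileid))  # total sentences
    print num_chars, num_words, num_sents, fileid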

The text corpus
Take a look at your nltk_data directory to see the variety of text materials accessible to you.
– Some are not plain text and we cannot use them yet, but we will
– Of the plain text, note the diversity:
  Classic published materials
  News feeds, movie reviews
  Overheard conversations, internet chat
– All categories of language are needed to understand the language as it is defined and as it is used.

The Brown Corpus
The first 1-million-word corpus
Explore:
– What are the categories?
– Access words or sentences from one or more categories or fileids
>>> from nltk.corpus import brown
>>> brown.categories()
>>> brown.fileids(categories='news')

Stylistics
Enter the code below and run it. What does it give you? What does it mean?
>>> from nltk.corpus import brown
>>> news_text = brown.words(categories='news')
>>> fdist = nltk.FreqDist([w.lower() for w in news_text])
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
...     print m + ':', fdist[m],

Spot check
Repeat the previous code, but look for the use of those same modal words in the religion and government categories.
Now analyze the use of the "wh" words (who, what, where, when, why) in the news category and one other category of your choice. One possible approach appears below.
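A sketch of the "wh" analysis (Python 2, as elsewhere in these slides; 'romance' stands in for whatever second category you choose):

import nltk
from nltk.corpus import brown
wh_words = ['who', 'what', 'where', 'when', 'why']
for genre in ['news', 'romance']:
    # count all words in the genre, lowercased
    fdist = nltk.FreqDist([w.lower() for w in brown.words(categories=genre)])
    print genre + ':',
    for w in wh_words:
        print w + '=' + str(fdist[w]),
    print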

One step comparison
Consider the following code:

import nltk
from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)

Enter and run it. What does it do?

Other corpora
There is some information about the Reuters and Inaugural Address corpora as well. Take a look at them on the online site. (5 minutes or so)

Spot Check
Take a look at Table 2-2 for a list of some of the material available from the NLTK project. (It cannot fit on a slide in any meaningful way.)
Confirm that you downloaded all of these (when you ran nltk.download(), if you selected "all").
Find them in your directory and explore.
– How many languages are represented?
– How would you describe the variety of content? What do you find most interesting/unusual/strange/fun?

Languages
The Universal Declaration of Human Rights is available in 300 languages.
>>> from nltk.corpus import udhr
>>> udhr.fileids()

Organization of Corpora
The organization will vary according to the type of corpus. Knowing the organization may be important for using the corpus.

Table 2.3 – Basic Corpus Functionality in NLTK

fileids() – the files of the corpus
fileids([categories]) – the files of the corpus corresponding to these categories
categories() – the categories of the corpus
categories([fileids]) – the categories of the corpus corresponding to these files
raw() – the raw content of the corpus
raw(fileids=[f1,f2,f3]) – the raw content of the specified files
raw(categories=[c1,c2]) – the raw content of the specified categories
words() – the words of the whole corpus
words(fileids=[f1,f2,f3]) – the words of the specified fileids
words(categories=[c1,c2]) – the words of the specified categories
sents() – the sentences of the whole corpus
sents(fileids=[f1,f2,f3]) – the sentences of the specified fileids
sents(categories=[c1,c2]) – the sentences of the specified categories
abspath(fileid) – the location of the given file on disk
encoding(fileid) – the encoding of the file (if known)
open(fileid) – open a stream for reading the given corpus file
root() – the path to the root of the locally installed corpus
readme() – the contents of the README file of the corpus

from help(nltk.corpus.reader)
Corpus reader functions are named based on the type of information they return. Some common examples, and their return types, are:
- I{corpus}.words(): list of str
- I{corpus}.sents(): list of (list of str)
- I{corpus}.paras(): list of (list of (list of str))
- I{corpus}.tagged_words(): list of (str,str) tuple
- I{corpus}.tagged_sents(): list of (list of (str,str))
- I{corpus}.tagged_paras(): list of (list of (list of (str,str)))
- I{corpus}.chunked_sents(): list of (Tree w/ (str,str) leaves)
- I{corpus}.parsed_sents(): list of (Tree with str leaves)
- I{corpus}.parsed_paras(): list of (list of (Tree with str leaves))
- I{corpus}.xml(): a single xml ElementTree
- I{corpus}.raw(): unprocessed corpus contents
For example, to read a list of the words in the Brown Corpus, use C{nltk.corpus.brown.words()}:
>>> from nltk.corpus import brown
>>> print brown.words()
These are the types of information returned from typical corpus functions.

Spot check
Choose a corpus and exercise some of the functions.
– Look at raw, words, sents, categories, fileids, encoding
Repeat for a source in a different language.
Work in pairs and talk about what you find and what you might want to look for.
– Report out briefly
A sketch of such a session follows.
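One such session might look like this (a sketch using the Brown corpus; the slices just keep the output short):

from nltk.corpus import brown
print brown.fileids()[:5]                  # a few of the files
print brown.categories()                   # all the categories
print brown.words(categories='news')[:10]  # words from one category
print brown.sents(categories='news')[:2]   # sentences from one category
print brown.encoding('ca01')               # encoding of one file, if known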

Working with your own sources
NLTK provides a great many resources, but you will certainly want to access your own collections: other books you download, files you create, etc.
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = '/usr/share/dict'
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*')
>>> wordlists.fileids()
['README', 'connectives', 'propernames', 'web2', 'web2a', 'words']
>>> wordlists.words('connectives')
['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', ...]
In this way you could get the list of files in any directory.

Other corpus readers
There are a number of different readers for different types of corpora. Many files in corpora are "marked up" in various ways, and the reader needs to understand the markings to return meaningful results. We will stick to the PlaintextCorpusReader for now.

Conditional Frequency Distribution
When texts in a corpus are divided into categories, we may want to look at the characteristics by category: word use by author or over time, for example.
[Figure 2.4: Counting Words Appearing in a Text Collection (a conditional frequency distribution)]

Frequency Distributions
A frequency distribution counts some occurrence, such as the use of a word or phrase. A conditional frequency distribution counts some occurrence separately for each of some number of conditions (author, date, genre, etc.). For example:
>>> genre_word = [(genre, word)
...     for genre in ['news', 'romance']
...     for word in brown.words(categories=genre)]
>>> len(genre_word)
Think about this. What exactly is happening? What are those 170,576 things? Run the code, then enter just:
>>> genre_word

For each genre ('news', 'romance'), loop over every word in that genre and produce the pairs showing the genre and the word.
>>> genre_word = [(genre, word)
...     for genre in ['news', 'romance']
...     for word in brown.words(categories=genre)]
>>> len(genre_word)
What type of data is genre_word?

Spot check
Refining the result:
– When you displayed genre_word, you may have noticed that some of the "words" are not words at all; they are punctuation marks.
– Refine the code to eliminate the entries in genre_word in which the word is not all alphabetic.
– Remove duplicate words that differ only in capitalization.
Work together. Talk about what you are doing. Share your ideas and insights. One possible refinement appears below.
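A sketch of one refinement: keep only alphabetic tokens and lowercase everything, so 'The' and 'the' are counted as one word.

from nltk.corpus import brown
genre_word = [(genre, word.lower())
              for genre in ['news', 'romance']
              for word in brown.words(categories=genre)
              if word.isalpha()]      # drop punctuation and numbers
print len(genre_word)                 # fewer pairs than before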

Conditional Frequency Distribution
From the list of pairs we created, we can generate a conditional frequency distribution of words by genre:
>>> cfd = nltk.ConditionalFreqDist(genre_word)
>>> cfd
>>> cfd.conditions()
Run these and look at the results.

Look at the conditional distributions
>>> cfd['news']
>>> cfd['romance']
>>> list(cfd['romance'])
[',', '.', 'the', 'and', 'to', 'a', 'of', '``', "''", 'was', 'I', 'in', 'he', 'had', '?', 'her', 'that', 'it', 'his', 'she', 'with', 'you', 'for', 'at', 'He', 'on', 'him', 'said', '!', '--', 'be', 'as', ';', 'have', 'but', 'not', 'would', 'She', 'The', ...]
>>> cfd['romance']['could']
193

Presenting the results
Plotting and tabulating give concise representations of the frequency distributions.
Tabulate: with no parameters, simply tabulates all the conditions against all the values.
>>> cfd.tabulate()

Look closely
>>> from nltk.corpus import inaugural
>>> cfd = nltk.ConditionalFreqDist(
...     (target, fileid[:4])
...     for fileid in inaugural.fileids()
...     for w in inaugural.words(fileid)
...     for target in ['america', 'citizen']
...     if w.lower().startswith(target))
Reading it piece by piece:
– from nltk.corpus import inaugural: get the text
– (target, fileid[:4]): the two axes (word and year)
– for fileid ... for w ...: all the words in each file
– for target ... if w.lower().startswith(target): narrow the word choice
Remember list comprehensions?

Three elements
For a conditional frequency distribution:
– Two axes:
  the condition or event, something of interest
  some connected characteristic: a year, a place, an author, anything that is related in some way to the event
– Something to count:
  for the condition and the characteristic, what are we counting? Words? Actions? What?
– From the previous example:
  inaugural addresses, specific words
  count the number of times that a form of either of those words occurred in that address
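To make the three elements concrete, here is a toy distribution built from made-up (condition, sample) pairs; the data is hypothetical, purely for illustration:

import nltk
# each pair is one counted event: (condition, sample)
pairs = [('1909', 'citizen'), ('1909', 'citizen'), ('2009', 'america')]
cfd = nltk.ConditionalFreqDist(pairs)
print cfd.conditions()        # the conditions: ['1909', '2009']
print cfd['1909']['citizen']  # count of 'citizen' under condition '1909': 2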

Spot check
Run the code in the previous example. How many times was some version of "citizen" used in the 1909 inaugural address? How many times was "america" mentioned in 2009?
Play with the code. What can you leave off and still get some meaningful output? One way to read off the counts is sketched below.
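A sketch of reading both numbers straight from the distribution built on the previous slide (the conditions are the target words, the samples are the four-digit year strings):

>>> cfd['citizen']['1909']
>>> cfd['america']['2009']
>>> cfd.tabulate(conditions=['america', 'citizen'], samples=['1909', '2009'])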

Another case
A somewhat simpler specification: the distribution of word lengths in several languages, with a restriction on the languages.
>>> from nltk.corpus import udhr
>>> languages = ['Chickasaw', 'English', 'German_Deutsch',
...     'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
>>> cfd = nltk.ConditionalFreqDist(
...     (lang, len(word))
...     for lang in languages
...     for word in udhr.words(lang + '-Latin1'))

Now tabulate
Choose to tabulate only some of the results:
>>> cfd.tabulate(conditions=['English', 'German_Deutsch'],
...     samples=range(10), cumulative=True)
The output shows one row for English and one for German_Deutsch, giving the cumulative number of words of each length from 0 through 9.
Note: so far, I cannot do plots. I hope to get that fixed. If you can do plots, do try some of the examples.

Common methods for Conditional Frequency Distributions

cfdist = ConditionalFreqDist(pairs) – create a conditional frequency distribution from a list of pairs
cfdist.conditions() – alphabetically sorted list of conditions
cfdist[condition] – the frequency distribution for this condition
cfdist[condition][sample] – frequency for the given sample for this condition
cfdist.tabulate() – tabulate the conditional frequency distribution
cfdist.tabulate(samples, conditions) – tabulation limited to the specified samples and conditions
cfdist.plot() – graphical plot of the conditional frequency distribution
cfdist.plot(samples, conditions) – graphical plot limited to the specified samples and conditions
cfdist1 < cfdist2 – test if samples in cfdist1 occur less frequently than in cfdist2

References
This set of slides comes very directly from the book Natural Language Processing with Python.