Classifying text NLTK Chapter 6

Chapter 6 topics
How can we identify particular features of language data that are salient for classifying it?
How can we construct models of language that can be used to perform language processing tasks automatically?
What can we learn about language from these models?

From words to larger units
We looked at how words are identified with a part of speech. That is an essential part of “understanding” textual material.
Now, how can we classify whole documents?
– These techniques are used for spam detection, for identifying the subject matter of a news feed, and for many other tasks related to categorizing text.

A supervised classifier
We saw a smaller version of this in our part-of-speech taggers.

Case study: male and female names
Note that this is language-biased (English).
These distinctions are harder given modern naming conventions – I have a granddaughter named Sydney, for example.

Step 1: features and encoding
Deciding what features to look for and how to represent those features is the first step, and it is critical.
– All the training and classification will be based on these decisions.
Initial choice for name identification: look at the last letter:

>>> def gender_features(word):
...     return {'last_letter': word[-1]}
>>> gender_features('Shrek')
{'last_letter': 'k'}

The function returns a dictionary (note the { }) with a feature name and the corresponding value.

First gender check

import nltk

def gender_features(word):
    return {'last_letter': word[-1]}

name = input("What name shall we check? ")
features = gender_features(name)
print("Gender features for", name, ":", features)

Step 2: Provide training values
We provide a list of examples and their corresponding feature values.

>>> from nltk.corpus import names
>>> import random
>>> names = ([(name, 'male') for name in names.words('male.txt')] +
...          [(name, 'female') for name in names.words('female.txt')])
>>> random.shuffle(names)
>>> names
[('Kate', 'female'), ('Eleonora', 'female'), ('Germaine', 'male'), ('Helen', 'female'), ('Rachelle', 'female'), ('Nanci', 'female'), ('Aleta', 'female'), ('Catherin', 'female'), ('Clementia', 'female'), ('Keslie', 'female'), ('Callida', 'female'), ('Horatius', 'male'), ('Kraig', 'male'), ('Cindra', 'female'), ('Jayne', 'female'), ('Fortuna', 'female'), ('Yovonnda', 'female'), ('Pam', 'female'), ('Vida', 'female'), ('Margurite', 'female'), ('Maryellen', 'female'), …

Try it
Build a classifier, apply it to your name, and try it on the test data to see how it does:

>>> featuresets = [(gender_features(n), g) for (n, g) in names]
>>> train_set, test_set = featuresets[500:], featuresets[:500]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> classifier.classify(gender_features('Sydney'))
'female'
>>> print(nltk.classify.accuracy(classifier, test_set))
0.758

Your turn
Modify the gender_features function to look at more of the name than the last letter. Does it help to look at the last two letters? The first letter? The length of the name? Try a few variations.
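A minimal sketch of one such variation (the feature names here are my own choices, not from the slides): combine the first letter, the last letter, and the name length into one feature dictionary.

```python
def gender_features2(name):
    """Hypothetical richer extractor: first letter, last letter, and length."""
    name = name.lower()
    return {
        'first_letter': name[0],
        'last_letter': name[-1],
        'length': len(name),
    }

# Training works exactly as before, just with the new extractor:
# featuresets = [(gender_features2(n), g) for (n, g) in names]
print(gender_features2('Sydney'))
# → {'first_letter': 's', 'last_letter': 'y', 'length': 6}
```

Because the classifier only sees the dictionary, swapping extractors requires no other code changes.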

What is most useful
There is even a function to show what was most useful in the classification:

>>> classifier.show_most_informative_features(10)
Most Informative Features
last_letter = 'k'    male : female   =   45.7 : 1.0
last_letter = 'a'    female : male   =   38.4 : 1.0
last_letter = 'f'    male : female   =   28.7 : 1.0
last_letter = 'v'    male : female   =   11.2 : 1.0
last_letter = 'p'    male : female   =   11.2 : 1.0
last_letter = 'd'    male : female   =    9.8 : 1.0
last_letter = 'm'    male : female   =    8.9 : 1.0
last_letter = 'o'    male : female   =    8.3 : 1.0
last_letter = 'r'    male : female   =    6.7 : 1.0
last_letter = 'g'    male : female   =    5.6 : 1.0

What features to use
Overfitting:
– Being too specific about the characteristics that you search for.
– The classifier picks up idiosyncrasies of the training data and may not transfer well to the test data.
Choose an initial feature set and then test.
The chair example: what features would you use?
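To make “too specific” concrete, here is a sketch (an assumed illustration, not code from the slides) of an extractor that emits a count and a presence flag for every letter of the alphabet. With 52 features per name, the classifier has plenty of room to memorize accidents of the training names rather than real patterns.

```python
import string

def overfit_features(name):
    # 52 features per name: a count and a presence flag for each letter.
    # Many of these fire on idiosyncrasies of the training data.
    name = name.lower()
    features = {}
    for letter in string.ascii_lowercase:
        features['count(%s)' % letter] = name.count(letter)
        features['has(%s)' % letter] = (letter in name)
    return features

print(len(overfit_features('John')))  # 52 features from a four-letter name
```

A small feature set that generalizes usually beats a large one that memorizes.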

Dev test
Divide the corpus into three parts: training, development testing, and final testing.

Testing stages

>>> train_names = names[1500:]        # from 1500 to the end
>>> devtest_names = names[500:1500]
>>> test_names = names[:500]          # the first 500 items
>>> train_set = [(gender_features(n), g) for (n, g) in train_names]
>>> devtest_set = [(gender_features(n), g) for (n, g) in devtest_names]
>>> test_set = [(gender_features(n), g) for (n, g) in test_names]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print(nltk.classify.accuracy(classifier, devtest_set))

Accuracy noted, but where were the problems?

import nltk
from nltk.corpus import names
import random

def gender_features(word):
    return {'last_letter': word[-1]}

names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])
random.shuffle(names)
print("Number of names:", len(names))

train_names = names[1500:]
devtest_names = names[500:1500]
test_names = names[:500]

train_set = [(gender_features(n), g) for (n, g) in train_names]
devtest_set = [(gender_features(n), g) for (n, g) in devtest_names]
test_set = [(gender_features(n), g) for (n, g) in test_names]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))
classifier.show_most_informative_features(10)

Output from previous code

Number of names:
Most Informative Features
last_letter = 'k'    male : female   =   39.7 : 1.0
last_letter = 'a'    female : male   =   31.4 : 1.0
last_letter = 'f'    male : female   =   16.0 : 1.0
last_letter = 'v'    male : female   =   14.1 : 1.0
last_letter = 'd'    male : female   =   10.3 : 1.0
last_letter = 'p'    male : female   =    9.8 : 1.0
last_letter = 'm'    male : female   =    8.6 : 1.0
last_letter = 'o'    male : female   =    7.8 : 1.0
last_letter = 'r'    male : female   =    6.6 : 1.0
last_letter = 'w'    male : female   =    4.8 : 1.0

Checking where the errors are
Next slide

import nltk
from nltk.corpus import names
import random

def gender_features(word):
    return {'last_letter': word[-1]}

names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])
random.shuffle(names)
print("Number of names:", len(names))

train_names = names[1500:]
devtest_names = names[500:1500]
test_names = names[:500]

train_set = [(gender_features(n), g) for (n, g) in train_names]
devtest_set = [(gender_features(n), g) for (n, g) in devtest_names]
test_set = [(gender_features(n), g) for (n, g) in test_names]

classifier = nltk.NaiveBayesClassifier.train(train_set)

print("Look for error cases:")
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append((tag, guess, name))
for (tag, guess, name) in sorted(errors):
    print('correct= %-8s guess= %-8s name = %-30s' % (tag, guess, name))
print("Number of errors:", len(errors))
print(nltk.classify.accuracy(classifier, devtest_set))

Check the classifier against the known values and see where it failed:

Number of names: 7944
Look for error cases:
correct= female   guess= male     name = Abagail
correct= female   guess= male     name = Adrian
correct= female   guess= male     name = Alex
correct= female   guess= male     name = Amargo
correct= female   guess= male     name = Anabel
correct= female   guess= male     name = Annabal
correct= female   guess= male     name = Annabel
correct= female   guess= male     name = Arabel
correct= female   guess= male     name = Ardelis
…

Finding the error cases
Look through the list of error cases. Do you see any patterns? Are there adjustments that we could make in our feature extractor to make it more accurate?

Error analysis
It turns out that using the last two letters improves the accuracy. Did you find that in your experimentation?
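One way to act on that finding (a sketch; the feature names are arbitrary) is to emit both the one- and two-letter suffixes from the same extractor:

```python
def gender_features(word):
    # Slicing with word[-2:] is safe even for one-letter names:
    # it simply returns the whole (short) string.
    return {'suffix1': word[-1:], 'suffix2': word[-2:]}

print(gender_features('Sydney'))  # {'suffix1': 'y', 'suffix2': 'ey'}
```

Retraining with this extractor, on the same train/devtest split as above, is how you would confirm the accuracy improvement for yourself.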

Summarize the process
Train on a subset of the available data.
– Look for characteristics that relate to the “right” answer.
Write the feature extractor to look at those characteristics.
Run the classifier on other data – whose characteristics are known! – to see how well it performs.
– You have to know the answers to know whether the classifier got them right.
When satisfied with the performance of the classifier, run it on new data for which you do not know the answer.
– How confident can you be? The disease example: if 98% of your cases are disease free …
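The disease example can be made concrete with a little arithmetic (the numbers below are illustrative): on data that is 98% disease-free, a degenerate "classifier" that always answers "disease-free" scores 98% accuracy while detecting nothing at all.

```python
# Hypothetical test set: 98 disease-free cases, 2 with the disease.
truth = ['free'] * 98 + ['disease'] * 2

# A degenerate classifier that always predicts the majority class.
predictions = ['free'] * len(truth)

accuracy = sum(p == t for p, t in zip(predictions, truth)) / len(truth)
print(accuracy)  # 0.98 -- high accuracy, yet zero diseases detected
```

This is why raw accuracy on imbalanced data is a weak measure of confidence; always compare against the majority-class baseline.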

Document classification
So far, we classified names as male/female.
– Not much to work with, not much to look at.
Now, look at whole documents.
– How can you classify a document?
– Subject matter in a syllabus collection, positive and negative movie/restaurant/other reviews, bias in a summary or review, subject matter in a news feed, separating works by author, …
Case study: classifying movie reviews.

Classifying documents
To classify words (names), we looked at letters. Feature extraction for documents will use words.
Find the most common words in the document set and see which words appear in which types of documents.

import nltk
import random
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

cats = list(movie_reviews.categories())
print("Movie review categories:", cats)
print("Number of reviews:", len(documents))

Feature extractor: are the words present in the documents?

import nltk
import random
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for (w, _) in all_words.most_common(2000)]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

print(document_features(movie_reviews.words('pos/cv957_8737.txt')))

Line by line, what does this do? The FreqDist is something different, but we have seen its like before. What is document_features doing?

And if you are not sure …
What do you do?
– Enter the code and run it.
– Go to a search engine and type “Python ”

Compute accuracy and see which feature values are most useful
Just as we did with classifying names: create a feature set, create a training set and a testing set, then apply the classifier to new data.

featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

0.81
Most Informative Features
contains(outstanding) = True    pos : neg   =   11.1 : 1.0
contains(seagal) = True         neg : pos   =    8.3 : 1.0
contains(mulan) = True          pos : neg   =    8.3 : 1.0
contains(damon) = True          pos : neg   =    8.1 : 1.0
contains(wonderfully) = True    pos : neg   =    6.8 : 1.0

Full code for this example:

import nltk
import random
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for (w, _) in all_words.most_common(2000)]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)

From the text
This note from the text attracted my attention. What does it suggest?

Note: The reason that we compute the set of all words in a document, rather than just checking if word in document, is that checking whether a word occurs in a set is much faster than checking whether it occurs in a list (4.7).
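A quick sketch of why that matters (the 2,000-item size matches the feature extractor above; the absolute timings are machine-dependent): a membership test scans a list element by element, but uses a single hash lookup for a set.

```python
import timeit

vocab = ['word%d' % i for i in range(2000)]
vocab_list = list(vocab)
vocab_set = set(vocab)

# Looking up a word near the end of the list forces a near-full scan;
# the set lookup is a single hash probe regardless of size.
t_list = timeit.timeit(lambda: 'word1999' in vocab_list, number=1000)
t_set = timeit.timeit(lambda: 'word1999' in vocab_set, number=1000)
print(t_list > t_set)  # the list version is dramatically slower
```

Since document_features tests every one of the 2,000 feature words against the document, converting the document to a set once pays for itself immediately.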

The time has come …
We have learned a lot of Python, something about object-oriented programming, a bit about text analysis, and a bit about network programming, web crawling, servers, etc.
There is lots more to all of those subjects.
I am happy to review or discuss anything we did this semester. If you are doing some Python programming later and want to discuss it, I will be happy to talk to you about it.