Classifying text NLTK Chapter 6

Chapter 6 topics
How can we identify particular features of language data that are salient for classifying it?
How can we construct models of language that can be used to perform language processing tasks automatically?
What can we learn about language from these models?

From words to larger units
We looked at how words are identified with a part of speech. That is an essential part of “understanding” textual material. Now, how can we classify whole documents? These techniques are used for spam detection, for identifying the subject matter of a news feed, and for many other tasks related to categorizing text.

A supervised classifier
During training, a feature extractor converts each labeled input into a feature set, and the (feature set, label) pairs are used to train a model; during prediction, the same extractor converts new inputs into feature sets that the model then labels. We saw a smaller version of this in our part-of-speech taggers.

Case study: male and female names
Note that this is language-biased (English). These distinctions are harder given modern naming conventions. I have a granddaughter named Sydney, for example.

Step 1: features and encoding
Deciding what features to look for and how to represent those features is the first step, and is critical. All the training and classification will be based on these decisions.
Initial choice for name identification: look at the last letter:
>>> def gender_features(word):
...     return {'last_letter': word[-1]}
>>> gender_features('Shrek')
{'last_letter': 'k'}
The function returns a dictionary (note the { }) with a feature name and the corresponding value.

Step 2: Provide training values
We provide a list of examples and their corresponding feature values.
>>> from nltk.corpus import names
>>> import random
>>> names = ([(name, 'male') for name in names.words('male.txt')] +
...          [(name, 'female') for name in names.words('female.txt')])
>>> random.shuffle(names)
>>> names
[('Kate', 'female'), ('Eleonora', 'female'), ('Germaine', 'male'), ('Helen', 'female'), ('Rachelle', 'female'), ('Nanci', 'female'), ('Aleta', 'female'), ('Catherin', 'female'), ('Clementia', 'female'), ('Keslie', 'female'), ('Callida', 'female'), ('Horatius', 'male'), ('Kraig', 'male'), ('Cindra', 'female'), ('Jayne', 'female'), ('Fortuna', 'female'), ('Yovonnda', 'female'), ('Pam', 'female'), ('Vida', 'female'), ('Margurite', 'female'), ('Maryellen', 'female'), …

Try it
Apply the classifier to your name. Then try it on the test data and see how it does:
>>> featuresets = [(gender_features(n), g) for (n, g) in names]
>>> train_set, test_set = featuresets[500:], featuresets[:500]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> classifier.classify(gender_features('Sydney'))
'female'
>>> print nltk.classify.accuracy(classifier, test_set)
0.758

Your turn
Modify the gender_features function to look at more of the name than the last letter. Does it help to look at the last two letters? The first letter? The length of the name? Try a few variations; some possible starting points are sketched below.
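A few possible variations, sketched here as illustrations only (these extractors and their names are assumptions for experimentation, not code from the slides):

def gender_features_suffix2(word):
    # Use the last two letters instead of one.
    return {'suffix2': word[-2:].lower()}

def gender_features_first_last(word):
    # Combine the first letter and the last letter.
    return {'first_letter': word[0].lower(), 'last_letter': word[-1].lower()}

def gender_features_length(word):
    # Add the length of the name as an extra feature.
    return {'last_letter': word[-1].lower(), 'name_length': len(word)}

Rebuild featuresets with whichever extractor you are testing and retrain the classifier to compare accuracies.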

What is most useful
There is even a function to show what was most useful in the classification:
>>> classifier.show_most_informative_features(10)
Most Informative Features
    last_letter = 'k'    male : female   = 45.7 : 1.0
    last_letter = 'a'    female : male   = 38.4 : 1.0
    last_letter = 'f'    male : female   = 28.7 : 1.0
    last_letter = 'v'    male : female   = 11.2 : 1.0
    last_letter = 'p'    male : female   = 11.2 : 1.0
    last_letter = 'd'    male : female   =  9.8 : 1.0
    last_letter = 'm'    male : female   =  8.9 : 1.0
    last_letter = 'o'    male : female   =  8.3 : 1.0
    last_letter = 'r'    male : female   =  6.7 : 1.0
    last_letter = 'g'    male : female   =  5.6 : 1.0

What features to use
Overfitting: being too specific about the characteristics that you search for. An overfit classifier picks up idiosyncrasies of the training data and may not transfer well to the test data. Choose an initial feature set and then test; an example of an over-specific extractor is sketched below.
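As an illustration of overfitting (a hedged sketch, not code from these slides), consider an extractor that records the first and last letter plus the presence and count of every letter. It is far more specific, so a classifier trained on it tends to memorize quirks of the training names:

def gender_features_overfit(name):
    # Very specific: first/last letter plus a count and presence flag
    # for every letter of the alphabet -- prone to overfitting.
    features = {'first_letter': name[0].lower(), 'last_letter': name[-1].lower()}
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features['count(%s)' % letter] = name.lower().count(letter)
        features['has(%s)' % letter] = (letter in name.lower())
    return features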

Dev test
Divide the corpus into three parts: training, development testing, and final testing.

Testing stages
>>> train_names = names[1500:]
>>> devtest_names = names[500:1500]
>>> test_names = names[:500]
>>> train_set = [(gender_features(n), g) for (n, g) in train_names]
>>> devtest_set = [(gender_features(n), g) for (n, g) in devtest_names]
>>> test_set = [(gender_features(n), g) for (n, g) in test_names]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print nltk.classify.accuracy(classifier, devtest_set)
Accuracy noted, but where were the problems?

Check the classifier against the known values and see where it failed:
>>> errors = []
>>> for (name, tag) in devtest_names:
...     guess = classifier.classify(gender_features(name))
...     if guess != tag:
...         errors.append( (tag, guess, name) )
>>> for (tag, guess, name) in sorted(errors):
...     print 'correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name)
...
correct=female   guess=male     name=Cindely
...
correct=female   guess=male     name=Katheryn
correct=female   guess=male     name=Kathryn
...
correct=male     guess=female   name=Aldrich
...
correct=male     guess=female   name=Mitch

Error analysis
It turns out that using the last two letters improves the accuracy. Did you find that in your experimentation?
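A minimal sketch of retraining with that variant, assuming the train/devtest split from the Testing stages slide is still in place (the feature names suffix1 and suffix2 are just one possible choice, not the slides' own):

>>> def gender_features(word):
...     return {'suffix1': word[-1:], 'suffix2': word[-2:]}
>>> train_set = [(gender_features(n), g) for (n, g) in train_names]
>>> devtest_set = [(gender_features(n), g) for (n, g) in devtest_names]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print nltk.classify.accuracy(classifier, devtest_set)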

Document classification
Many uses. Case study: classifying movie reviews.
>>> from nltk.corpus import movie_reviews
>>> documents = [(list(movie_reviews.words(fileid)), category)
...              for category in movie_reviews.categories()
...              for fileid in movie_reviews.fileids(category)]
>>> random.shuffle(documents)
Feature extraction for documents will use words: find the most common words in the document set and see which words appear in which types of documents.

Feature extractor: are the words present in the documents?
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = all_words.keys()[:2000]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

>>> print document_features(movie_reviews.words('pos/cv957_8737.txt'))
{'contains(waste)': False, 'contains(lot)': False, ...}

Compute accuracy and see what are the most useful feature values:
featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print nltk.classify.accuracy(classifier, test_set)
0.81
>>> classifier.show_most_informative_features(5)
Most Informative Features
    contains(outstanding) = True    pos : neg = 11.1 : 1.0
    contains(seagal) = True         neg : pos =  7.7 : 1.0
    contains(wonderfully) = True    pos : neg =  6.8 : 1.0
    contains(damon) = True          pos : neg =  5.9 : 1.0
    contains(wasted) = True         neg : pos =  5.8 : 1.0
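Once trained, the same classifier can label a new review. A quick sketch, picking an arbitrary file id from the corpus (the choice of file here is an assumption for illustration):

>>> fileid = movie_reviews.fileids('neg')[0]
>>> words = list(movie_reviews.words(fileid))
>>> classifier.classify(document_features(words))   # returns 'pos' or 'neg'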

There is more
As time allows, let's look at other sections of this chapter. We do not have time to do justice to all the topics, but we can take a few and look into them.