©2012 Paula Matuszek
CSC 9010: Text Mining Applications, Lab 3
Dr. Paula Matuszek
(610) 647-9789

Goals

Goals for this lab are:
- More Python
- Run a naive Bayes classifier
- Evaluate the results

Python

- The Natural Language Processing with Python book covers a lot of Python, interspersed with a lot of NLP.
- We are mostly interested in the parts relevant to text mining, so we are skipping a lot.
- Unfortunately that means we skip a lot of the Python, some of which we might want.

(Very) Brief Python Overview

- Borrowing a presentation: Guides/Concise Python.html
- To use the NLTK and do the homework assignments, you don't actually need a lot of Python. Just plunge in.
- If you need more (for your project, for instance), there is a good tutorial at
- You can also work through more of the NLTK book.

Getting Your Documents In

- First step is to get documents into your program.
- Hopefully you have all done this.
- You can give complete paths. If you're working in Windows, either use / instead of \ or use \\ (because \ is the escape character).
- At this point you have one long string.
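A minimal sketch of this step; the path C:/docs/review1.txt is a made-up example, so substitute your own file:

    # Read one document into a single long string.
    # Forward slashes work on Windows too and avoid escape problems.
    f = open("C:/docs/review1.txt")
    text = f.read()   # at this point: one long string
    f.close()
    print(len(text))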

Breaking It Down

- Most of our operations expect a list of tokens, not a single string.
- NLTK has a decent default tokenizer.
- We might also want to do things like stem it.
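A short sketch of tokenizing and stemming, assuming the string text read in above; the Porter stemmer is one common choice, not the only one:

    import nltk

    # NLTK's default tokenizer (may require nltk.download('punkt')).
    tokens = nltk.word_tokenize(text)

    # Optionally stem each token.
    stemmer = nltk.PorterStemmer()
    stems = [stemmer.stem(t) for t in tokens]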

Classifying

Basically we:
- Develop a feature set. NLTK classifiers expect the input to be pairs of (dictionary of features, class), e.g.
  ({'length': 8, 'lastletter': 'e', 'firstletter': 'L'}, 'female')
- Choose training and test documents
- Run a classifier
- Look at the results
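A minimal sketch of that input format and the training call; the second training pair and the name being classified are made-up examples:

    import nltk

    # Each training instance is a (feature dictionary, class) pair.
    train_set = [
        ({'length': 8, 'lastletter': 'e', 'firstletter': 'L'}, 'female'),
        ({'length': 4, 'lastletter': 'k', 'firstletter': 'M'}, 'male'),
        # ... more labeled examples ...
    ]

    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(classifier.classify({'length': 5, 'lastletter': 'a',
                               'firstletter': 'S'}))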

Classifying

Last time we:
- Developed a feature set: a dictionary of (label, value) pairs plus a class, e.g.
  ({'length': 8, 'lastletter': 'e', 'firstletter': 'L'}, 'female')
- Chose training and test documents
- Ran a classifier
- Looked at the results
- The classification task was sorting names into male and female

Goals

Goals for this lab are:
- Use the NLTK Naive Bayes classifier to classify documents based on word frequency
- Evaluate the results

Classifying Documents

- Same set of steps
- Create a feature set:
  - Get a frequency distribution of words in the corpus
  - Pick the 2000 most common
  - Create a true/false feature for whether each of those words is present in a document
- Classify into positive and negative reviews
- Evaluate results

Movie Reviews

- The NLTK corpus includes a set of 2000 movie reviews, classified into directories of positive and negative. (From Cornell, released in 2004.)
- nltk.corpus includes methods to get the categories of reviews, the fileids in each category, and the words in each fileid.
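A sketch of exploring the corpus and pairing each review's words with its category, following the pattern in chapter 6 of the NLTK book:

    from nltk.corpus import movie_reviews

    # May require the corpus data: nltk.download('movie_reviews').
    print(movie_reviews.categories())          # ['neg', 'pos']
    print(len(movie_reviews.fileids('pos')))   # 1000 positive reviews

    # Build a (word list, category) pair for every review.
    documents = [(list(movie_reviews.words(fileid)), category)
                 for category in movie_reviews.categories()
                 for fileid in movie_reviews.fileids(category)]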

Creating the feature set

- Too many terms for us! (almost 40K)
- Get a frequency count and take the most frequent.
- For each of the words in that list, for each document, create a feature:
  'contains(like)': True
- Each document becomes a two-item pair: a dictionary of features and a category.
- The feature set is a list of these pairs.
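A sketch of building the feature set, assuming the documents list from the previous sketch (most_common is the FreqDist method in NLTK 3; older versions used a sorted keys() list instead):

    import nltk
    from nltk.corpus import movie_reviews

    # Frequency count over the whole corpus; keep the 2000 most common.
    all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
    word_features = [w for w, _ in all_words.most_common(2000)]

    # One True/False feature per common word: 'contains(<word>)'.
    def document_features(document):
        document_words = set(document)   # set lookup is much faster
        return {'contains(%s)' % word: (word in document_words)
                for word in word_features}

    # The feature set: one (feature dictionary, category) pair per document.
    featuresets = [(document_features(doc), category)
                   for (doc, category) in documents]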

Doing this for your documents

- Decide your features and your categories!
- Input your documents and their categories.
- Categories could be:
  - the file they are in (like names)
  - the directory they are in (like movie reviews)
  - a tag in the document itself (the first token, for instance)
- Build a feature list for each document: a dictionary of label-value pairs
  - BOW, length, diversity, number of words, etc.
- Create a feature set which contains, for each document:
  - a dictionary of features (label, value pairs)
  - a category
- Randomize and create training and test sets
- Run it and look at the results :-)
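A minimal end-to-end sketch of those last steps, assuming the featuresets list built above and an arbitrary 100-document test set:

    import random
    import nltk

    random.shuffle(featuresets)   # randomize document order first
    train_set, test_set = featuresets[100:], featuresets[:100]

    classifier = nltk.NaiveBayesClassifier.train(train_set)

    # Evaluate: overall accuracy, plus the features that matter most.
    print(nltk.classify.accuracy(classifier, test_set))
    classifier.show_most_informative_features(5)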