©2012 Paula Matuszek CSC 9010: Text Mining Applications Lab 3 Dr. Paula Matuszek
Goals
• Goals for this lab are:
 – More Python
 – Run a naive Bayes classifier
 – Evaluate the results
Python
• The Natural Language Processing with Python book covers a lot of Python, interspersed with a lot of NLP.
• We are mostly interested in the parts relevant to text mining, so we are skipping a lot.
• Unfortunately that means we skip a lot of the Python, some of which we might want.
(Very) Brief Python Overview
• Borrowing a presentation:
• Guides/Concise Python.html
• To use the NLTK and do the homework assignments, you don’t actually need a lot of Python. Just plunge in.
• If you need more (for your project, for instance), there is a good tutorial at
• You can also work through more of the NLTK book.
Getting Your Documents In
• First step is to get documents into your program.
• Hopefully you have all done this.
• You can give complete paths. If you’re working in Windows, either use / instead of \ or use \\ (because \ is the escape character).
• At this point you have one long string.
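A minimal sketch of reading one document into a single string; the path "reviews/review1.txt" is only a placeholder for one of your own files.

```python
# Placeholder path; substitute one of your own documents.
# On Windows, use forward slashes or doubled backslashes in the path,
# e.g. "C:/corpus/review1.txt" or "C:\\corpus\\review1.txt".
with open("reviews/review1.txt") as f:
    raw = f.read()        # raw is now one long string

print(len(raw))           # how many characters we read
print(raw[:75])           # peek at the beginning
```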
Breaking It Down
• Most of our operations expect a list of tokens, not a single string.
• NLTK has a decent default tokenizer.
• We might also want to do things like stem it.
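A sketch of tokenizing and stemming with NLTK's defaults (word_tokenize plus the Porter stemmer), assuming the punkt tokenizer data has been downloaded; the sample string is just for illustration.

```python
import nltk
from nltk.stem import PorterStemmer

raw = "NLTK has a decent default tokenizer; we might also want to stem the tokens."

tokens = nltk.word_tokenize(raw)           # a list of tokens, not one string
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]  # stemmed version of each token

print(tokens[:8])
print(stems[:8])
```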
Classifying
• Basically we:
 – develop a feature set. NLTK classifiers expect the input to be pairs of (dictionary of features, class)
 – ({'length': 8, 'lastletter': 'e', 'firstletter': 'L'}, 'female')
• Choose training and test documents
• Run a classifier
• Look at the results.
Classifying
• Last time we:
 – developed a feature set. NLTK classifiers expect the input to be a dictionary of (label, value) pairs plus a class.
 – ({'length': 8, 'lastletter': 'e', 'firstletter': 'L'}, 'female')
• Chose training and test documents
• Ran a classifier
• Looked at the results.
• Classification task was classifying names into male and female.
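A sketch of last time's names task, along the lines of the NLTK book's example; the particular features in gender_features and the 500-name test split are illustrative choices, not the only ones.

```python
import random
import nltk
from nltk.corpus import names

def gender_features(name):
    # One possible feature set: length, last letter, first letter.
    return {'length': len(name),
            'lastletter': name[-1].lower(),
            'firstletter': name[0]}

# (name, class) pairs from the names corpus, shuffled.
labeled = ([(n, 'male') for n in names.words('male.txt')] +
           [(n, 'female') for n in names.words('female.txt')])
random.shuffle(labeled)

# Each entry is a (dictionary of features, class) pair.
featuresets = [(gender_features(n), g) for (n, g) in labeled]

# Hold out 500 names for testing, train on the rest, and evaluate.
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
```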
Goals
• Goals for this lab are:
 – Use the NLTK Naive Bayes Classifier to classify documents based on word frequency
 – Evaluate the results
Classifying Documents
• Same set of steps
• Create a feature set.
 – Get a frequency distribution of words in the corpus
 – Pick the 2000 most common
 – Create a feature for each of those words: is it present in the document, True or False
• Classify into positive and negative reviews
• Evaluate results
Movie Reviews
• The NLTK corpus includes a set of 2000 movie reviews, classified into directories of positive and negative. (From Cornell, released in 2004.)
• nltk.corpus includes methods to get the categories of reviews, the fileids in each category, and the words in each fileid.
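A sketch of poking at the corpus with those methods; the documents list at the end pairs each review's words with its category for the classification step.

```python
from nltk.corpus import movie_reviews

print(movie_reviews.categories())            # ['neg', 'pos']

neg_ids = movie_reviews.fileids('neg')       # fileids in one category
print(len(neg_ids))
print(movie_reviews.words(neg_ids[0])[:10])  # words in one fileid

# Pair each review's words with its category for classification later.
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
```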
Creating the Feature Set
• Too many terms for us! (almost 40K distinct words)
• Get a frequency count and take the most frequent.
• For each of the words in that list, for each document, create a feature:
 – 'contains(like)': True,
• Each document becomes a pair: a dictionary of features and a category.
• The feature set is a list of these pairs.
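A sketch of building the feature set this way with a current NLTK; most_common and the dictionary comprehension are one way to do it, and the variable names are mine.

```python
import nltk
from nltk.corpus import movie_reviews

# Frequency count over the whole corpus; keep the 2000 most frequent words.
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for (w, _) in all_words.most_common(2000)]

def document_features(document_words):
    # 'contains(word)': True/False for each of the common words.
    words = set(document_words)              # set membership test is fast
    return {'contains(%s)' % w: (w in words) for w in word_features}

# The (word list, category) pairs, as in the corpus sketch above.
documents = [(list(movie_reviews.words(fid)), cat)
             for cat in movie_reviews.categories()
             for fid in movie_reviews.fileids(cat)]

# Each entry pairs a dictionary of features with its category.
featuresets = [(document_features(d), c) for (d, c) in documents]
```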
Doing This for Your Documents
• Decide your features and your categories!
• Input your documents and their categories.
• Categories could be:
 – the file they are in (like names)
 – the directory they are in (like movie reviews)
 – a tag in the document itself (the first token, for instance)
• Build a feature list for each document: a dictionary of label-value pairs
 – BOW, length, diversity, number of words, etc.
• Create a feature set which contains, for each document:
 – a dictionary of features: label, value pairs
 – a category
• Randomize and create training and test sets
• Run it and look at the results :-)
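An end-to-end sketch under one assumed layout: each category is a directory of plain-text files under a corpus_root directory (like the movie reviews). The names corpus_root and bow_features are placeholders; swap in your own layout and feature extractor.

```python
import os
import random
import nltk

# Hypothetical layout: one directory per category under corpus_root,
# e.g. corpus/pos/*.txt and corpus/neg/*.txt.  Adjust for your project.
corpus_root = "corpus"

def bow_features(tokens, vocabulary):
    # Bag-of-words features plus one extra feature (document length).
    words = set(tokens)
    feats = {'contains(%s)' % w: (w in words) for w in vocabulary}
    feats['length'] = len(tokens)
    return feats

# Read every file; its category is the directory it sits in.
docs = []
for category in os.listdir(corpus_root):
    cat_dir = os.path.join(corpus_root, category)
    if not os.path.isdir(cat_dir):
        continue
    for fname in os.listdir(cat_dir):
        with open(os.path.join(cat_dir, fname)) as f:
            docs.append((nltk.word_tokenize(f.read().lower()), category))

# Vocabulary: most common words across all the documents.
all_words = nltk.FreqDist(w for tokens, _ in docs for w in tokens)
vocabulary = [w for (w, _) in all_words.most_common(2000)]

# Feature set: (dictionary of features, category) for each document.
featuresets = [(bow_features(tokens, vocabulary), cat) for tokens, cat in docs]

# Randomize, split into training and test sets, train, and look at results.
random.shuffle(featuresets)
cut = len(featuresets) // 4
test_set, train_set = featuresets[:cut], featuresets[cut:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(10)
```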