Published by Morgan Tyler. Modified over 9 years ago.
1
CSC 9010: Text Mining Applications
Lab 3
Dr. Paula Matuszek
Paula.Matuszek@villanova.edu
Paula.Matuszek@gmail.com
(610) 647-9789
©2012 Paula Matuszek
2
Goals
• Goals for this lab are:
  – More Python
  – Run a naive Bayes classifier
  – Evaluate the results
3
Python
• The Natural Language Processing with Python book covers a lot of Python, interspersed with a lot of NLP.
• We are mostly interested in the parts relevant to text mining, so we are skipping a lot.
• Unfortunately, that means we skip a lot of the Python, some of which we might want.
4
(Very) Brief Python Overview
• Borrowing a presentation:
  – http://www.cis.upenn.edu/~matuszek/Concise Guides/Concise Python.html
• To use the NLTK and do the homework assignments, you don’t actually need a lot of Python. Just plunge in.
• If you need more (for your project, for instance), there is a good tutorial at http://docs.python.org/tutorial/
• You can also work through more of the NLTK book.
5
Getting Your Documents In
• The first step is to get documents into your program.
• Hopefully you have all done this.
• You can give complete paths. If you’re working in Windows, either use / instead of \ or use \\ (because \ is the escape character).
• At this point you have one long string.
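A minimal sketch of this step, assuming the document is a plain-text UTF-8 file. The helper name read_document and the example paths in the comments are hypothetical.

```python
from pathlib import Path


def read_document(path):
    """Return the full contents of a text file as one long string."""
    # Assumption: the file is UTF-8 text; change the encoding if yours differs.
    return Path(path).read_text(encoding="utf-8")


# On Windows, each of these spellings names the same (hypothetical) file:
#   "C:/texts/review1.txt"      forward slashes
#   "C:\\texts\\review1.txt"    doubled backslashes
#   r"C:\texts\review1.txt"     raw string, so \ is not treated as an escape
```

Calling read_document on a path returns the whole file as a single string, which is the starting point for the tokenizing step.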
6
Breaking It Down
• Most of our operations expect a list of tokens, not a single string.
• NLTK has a decent default tokenizer.
• We might also want to do things like stem the tokens.
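One way to go from a string to stemmed tokens, as a sketch. It assumes NLTK is installed and uses wordpunct_tokenize, a regex-based tokenizer that needs no downloaded data, rather than nltk.word_tokenize, which needs the punkt data. The choice of the Porter stemmer and the helper name tokenize_and_stem are assumptions for illustration.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize


def tokenize_and_stem(text):
    """Split a string into tokens, then reduce each token to its stem."""
    # wordpunct_tokenize splits on runs of word characters and punctuation.
    tokens = wordpunct_tokenize(text)
    stemmer = PorterStemmer()
    return [stemmer.stem(token) for token in tokens]


# Example sentence made up for illustration.
tokens = wordpunct_tokenize("The dogs were running.")
stems = tokenize_and_stem("The dogs were running.")
```

Here tokens is a list of strings with punctuation split off as separate tokens, and stems is the same list after stemming.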
7
Classifying
• Basically we:
  – Develop a feature set. NLTK classifiers expect the input to be pairs of (hashmap of features, class):
  – ({'length': 8, 'lastletter': 'e', 'firstletter': 'L'}, 'female')
• Choose training and test documents.
• Run a classifier.
• Look at the results.
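A small sketch of a feature function that produces the kind of pair shown above. The function name gender_features and the name "Lorraine" are illustrative assumptions; the name was picked only because it matches the feature values in the example.

```python
def gender_features(name):
    """Describe one name as a dictionary of feature label -> value."""
    return {
        "length": len(name),
        "lastletter": name[-1],
        "firstletter": name[0],
    }


# One labeled example in the shape a classifier expects:
# (feature dictionary, class).
labeled_example = (gender_features("Lorraine"), "female")
```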
8
Classifying
• Last time we:
  – Developed a feature set. NLTK classifiers expect the input to be a dictionary of (label, value) pairs and a class:
  – ({'length': 8, 'lastletter': 'e', 'firstletter': 'L'}, 'female')
• Chose training and test documents.
• Ran a classifier.
• Looked at the results.
• The classification task was sorting names into male and female.
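The whole sequence for the names task might look roughly like this. It is a sketch only: the twelve names are made-up toy data, far too few for a meaningful accuracy figure, and the 9/3 split and the fixed seed are arbitrary assumptions. It uses nltk.NaiveBayesClassifier.train and nltk.classify.accuracy.

```python
import random

import nltk


def gender_features(name):
    """Feature dictionary for one name."""
    return {
        "length": len(name),
        "lastletter": name[-1],
        "firstletter": name[0],
    }


# Made-up toy data for illustration only.
labeled_names = [
    ("Anna", "female"), ("Maria", "female"), ("Sophie", "female"),
    ("Laura", "female"), ("Emma", "female"), ("Julie", "female"),
    ("John", "male"), ("Peter", "male"), ("David", "male"),
    ("Mark", "male"), ("Kevin", "male"), ("Robert", "male"),
]

random.seed(0)  # fixed seed so the shuffle is repeatable
random.shuffle(labeled_names)

# Each item is (feature dictionary, class).
featuresets = [(gender_features(name), label) for name, label in labeled_names]
train_set, test_set = featuresets[:9], featuresets[9:]

classifier = nltk.NaiveBayesClassifier.train(train_set)
accuracy = nltk.classify.accuracy(classifier, test_set)
guess = classifier.classify(gender_features("Olivia"))
```

With a real list of names, accuracy is the fraction of test names labeled correctly, and guess is the predicted class for an unseen name.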
9
Goals
• Goals for this lab are:
  – Use the NLTK Naive Bayes classifier to classify documents based on word frequency
  – Evaluate the results
10
Classifying Documents
• Same set of steps.
• Create a feature set:
  – Get a frequency distribution of the words in the corpus.
  – Pick the 2000 most common.
  – Create a feature for each of those words: “word is there”, true or false.
• Classify into positive and negative reviews.
• Evaluate the results.
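For the evaluation step, a plain-Python sketch of what "look at the results" can mean: compare the classifier's labels with the true labels, then count the agreements and the kinds of mistakes. The helper names and the five gold/predicted labels are made up for illustration. nltk.classify.accuracy(classifier, test_set) reports the same accuracy figure directly from a trained classifier.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Fraction of documents whose predicted label matches the true label."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


def confusion_counts(gold, predicted):
    """Count each (true label, predicted label) combination."""
    return Counter(zip(gold, predicted))


# Made-up labels for five test reviews.
gold = ["pos", "pos", "neg", "neg", "neg"]
predicted = ["pos", "neg", "neg", "neg", "pos"]

score = accuracy(gold, predicted)
counts = confusion_counts(gold, predicted)
```

The counts show which direction the errors go, for example how many positive reviews were labeled negative, which a single accuracy number hides.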
11
Movie Reviews
• The NLTK corpus includes a set of 2000 movie reviews, classified into directories of positive and negative. (From Cornell, released in 2004.)
• nltk.corpus includes methods to get the categories of reviews, the fileids in each category, and the words in each fileid.
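A sketch of using those three methods together. It assumes the corpus data has already been fetched once with nltk.download("movie_reviews"); the function name load_labeled_reviews is hypothetical.

```python
from nltk.corpus import movie_reviews


def load_labeled_reviews():
    """Return a list of (word list, category) pairs, one per review."""
    # categories() lists the category names, fileids(category) lists the
    # files in one category, and words(fileid) gives the tokens of one file.
    return [
        (list(movie_reviews.words(fileid)), category)
        for category in movie_reviews.categories()
        for fileid in movie_reviews.fileids(category)
    ]
```

Each returned pair holds the tokens of one review and the category it was filed under, ready to be turned into features.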
12
Creating the feature set
• Too many terms for us! (almost 40K)
• Get a frequency count and take the most frequent.
• For each of the words in that list, for each document, create a feature:
  – 'contains(like)': True,
• Each document is a two-item list: dictionary of features, category.
• The feature set is a list of these documents.
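The steps above can be sketched in plain Python. collections.Counter stands in here for NLTK's frequency distribution, and the three tiny documents, the helper names, and the cutoff of 3 words (instead of 2000) are assumptions made to keep the example small.

```python
from collections import Counter


def most_frequent_words(documents, n):
    """Return the n most frequent words across all documents."""
    counts = Counter(
        word.lower() for words, _category in documents for word in words
    )
    return [word for word, _count in counts.most_common(n)]


def document_features(words, word_features):
    """One True/False 'contains(word)' feature per chosen word."""
    present = {word.lower() for word in words}
    return {f"contains({w})": (w in present) for w in word_features}


# Made-up toy corpus: (list of tokens, category).
documents = [
    (["a", "great", "film"], "pos"),
    (["a", "dull", "film"], "neg"),
    (["great", "fun"], "pos"),
]

word_features = most_frequent_words(documents, 3)
featuresets = [
    (document_features(words, word_features), category)
    for words, category in documents
]
```

Every document ends up with the same set of feature labels, which is what lets the classifier compare documents with each other.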
13
Doing this for your documents
• Decide your features and your categories!
• Input your documents and their categories.
• Categories could be:
  – the file they are in (like names)
  – the directory they are in (like movie reviews)
  – a tag in the document itself (first token, for instance)
• Build a feature list for each document: a dictionary of label-value pairs.
  – Bag of words (BOW), length, diversity, number of words, etc.
• Create a feature set which contains, for each document:
  – a dictionary of features: label, value pairs
  – a category
• Randomize and create training and test sets.
• Run it and look at the results :-)
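A sketch of two of these steps for your own files, assuming the "directory is the category" layout (for example root/pos/*.txt and root/neg/*.txt). The helper names, the .txt extension, the whitespace split used as a stand-in tokenizer, and the 25% test fraction are all assumptions.

```python
import random
from pathlib import Path


def load_by_directory(root):
    """Label each .txt file with the name of the directory it sits in."""
    documents = []
    for path in sorted(Path(root).glob("*/*.txt")):
        text = path.read_text(encoding="utf-8")
        # Whitespace split is a stand-in; swap in a real tokenizer as needed.
        documents.append((text.split(), path.parent.name))
    return documents


def split_train_test(featuresets, test_fraction=0.25, seed=0):
    """Shuffle a copy of the feature sets and split off a test portion."""
    shuffled = list(featuresets)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

Shuffling before the split matters because files are usually read one category at a time; without it, the test set could contain only one class.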