Text Mining Lab Adrian and Shawndra December 4, 2012 (version 1)


1 Text Mining Lab Adrian and Shawndra December 4, 2012 (version 1)

2 Outline
1. Download and Install Python
2. Download and Install NLTK
3. Download and Unzip Project Files
4. Simple Naïve Bayes Classifier
5. Demo collecting tweets --> Evaluation
6. Other things you can do ...

3 Download and Install Python
http://www.python.org/getit/ (latest version 2.7.3)
http://pypi.python.org/pypi/setuptools (install setuptools for 2.7)

4 Download and Install NLTK
Install PyYAML: http://pyyaml.org/wiki/PyYAML
Install NumPy: http://numpy.scipy.org/
Install NLTK: http://pypi.python.org/pypi/nltk
Install matplotlib: http://matplotlib.org/

5 Test Installation
Run python. At the prompt, type:
>> import nltk
>> import matplotlib

6 Downloading Models
>> nltk.download()
This opens the GUI downloader.
Select the Models tab and download: maxent_ne_chunker, maxent_treebank_pos_tagger, hmm_treebank_pos_tagger
Select the Corpora tab and download: stopwords
Alternatively, select Collections, click "all", and click the Download button to download everything.

7 Getting Started
Unzip the project directory (lab1.zip) and change to the lab1 directory.
Open a command window in the lab1 directory:
Windows 7 and later – hold SHIFT, right-click in the directory, and select "Open command window here"
Unix/Mac – open a terminal and cd PATH/TO/lab1
Type python, then in the interpreter:
>> import text_processing as tp
>> import nltk
Note: text_processing comes from your lab1 folder, so you must work from your lab1 directory.

8 Downloading Models
>> nltk.download()
This opens the GUI downloader.
Select the Models tab and download: maxent_ne_chunker, maxent_treebank_pos_tagger, hmm_treebank_pos_tagger
Select the Corpora tab and download: stopwords
Alternatively, select Collections, click "all", and click the Download button to download everything.

9 Simple NB Sentiment Classifier

10 Read in tweets
CALL
>> paths = ['neg_examples.txt', 'pos_examples.txt']
>> documentClasses = ['neg', 'pos']
>> tweetSet = [tp.loadTweetText(p) for p in paths]
SAMPLE OUTPUT
>> len(tweetSet[0]), len(tweetSet[1])
(20000, 40000)
>> tweetSet[1][50]
"@davidarchie hey david !me and my bestfriend are forming a band.could you give us any advice please? it's means a lot for us :)"

11 Read in tweets (Code) Reads in a file, treats each line as a tweet, and lower-cases the text.
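The lab's actual implementation lives in text_processing.py and is not reproduced in this transcript; the following is a minimal sketch of what tp.loadTweetText might look like (the function name and behavior come from the slide, everything else is an assumption):

def loadTweetText(path):
    # Read a file of tweets (one per line) and return a list of lower-cased strings.
    tweets = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                tweets.append(line.lower())
    return tweets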

12 Tokenize
CALL
>> tokenSet = [tp.tokenizeTweets(tweets) for tweets in tweetSet]
SAMPLE OUTPUT
>> len(tokenSet[1][50])
31
>> tokenSet[1][50]
['@', 'davidarchie', 'hey', 'david', '!', 'me', 'and', 'my', 'bestfriend', 'are', 'forming', 'a', 'band', '.', 'could', 'you', 'give', 'us', 'any', 'advice', 'please', '?', 'it', "'", 's', 'means', 'a', 'lot', 'for', 'us', ':)']

13 Tokenize (Code) For each tweet, splits the text on whitespace and breaks off punctuation as separate tokens (using nltk.WordPunctTokenizer).
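A sketch of what tp.tokenizeTweets could look like, assuming it simply wraps nltk.WordPunctTokenizer (named on the slide); the lab's exact implementation may differ:

import nltk

_tokenizer = nltk.WordPunctTokenizer()

def tokenizeTweets(tweets):
    # Split each tweet on whitespace and break runs of punctuation into separate tokens,
    # e.g. "it's" -> ['it', "'", 's'] and '@davidarchie' -> ['@', 'davidarchie'].
    return [_tokenizer.tokenize(tweet) for tweet in tweets]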

14 Filter out Non-English
CALL
>> englishSet = [tp.filterOnlyEnglish(tokens) for tokens in tokenSet]
SAMPLE OUTPUT
>> len(englishSet[1][50])
22
>> englishSet[1][50]
['hey', 'david', 'me', 'and', 'my', 'are', 'forming', 'a', 'band', 'could', 'you', 'give', 'us', 'any', 'advice', 'please', 'it', 'means', 'a', 'lot', 'for', 'us']

15 Filter out Non-English (Code) Reads in a dictionary file of English words – wordsEn.txt – and keeps only tokens that appear in that dictionary.
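A possible implementation of tp.filterOnlyEnglish, assuming wordsEn.txt contains one lower-case word per line (the file name comes from the slide; the default argument is hypothetical):

def filterOnlyEnglish(tokenizedTweets, dictionaryPath='wordsEn.txt'):
    # Load the English word list once, then keep only tokens that appear in it.
    with open(dictionaryPath) as f:
        englishWords = set(word.strip().lower() for word in f if word.strip())
    return [[t for t in tokens if t in englishWords] for tokens in tokenizedTweets]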

16 Filter out Stopwords
CALL
>> noStopSet = [tp.removeStopwords(tokens, [':)', ':(']) for tokens in englishSet]
SAMPLE OUTPUT
>> len(noStopSet[1][50])
12
>> noStopSet[1][50]
['hey', 'david', 'forming', 'band', 'could', 'give', 'us', 'advice', 'please', 'means', 'lot', 'us']

17 Filter out Stopwords (Code) Loads the NLTK stop word list and removes any token that appears in it. Additional words can be passed as stop words via the addtlStopwords argument.
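A sketch of tp.removeStopwords, assuming it builds on NLTK's English stop word list (downloaded on slide 6); the addtlStopwords argument is named on the slide, the rest is an assumption:

from nltk.corpus import stopwords

def removeStopwords(tokenizedTweets, addtlStopwords=None):
    # Combine NLTK's English stop words with any caller-supplied extras, then drop them.
    stops = set(stopwords.words('english'))
    if addtlStopwords:
        stops.update(addtlStopwords)
    return [[t for t in tokens if t not in stops] for tokens in tokenizedTweets]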

18 Stem
CALL
>> stemmedSet = [tp.stemTokens(tokens) for tokens in noStopSet]
SAMPLE OUTPUT
>> len(stemmedSet[1][50])
12
>> stemmedSet[1][50]
['hey', 'david', 'form', 'band', 'could', 'give', 'us', 'advic', 'pleas', 'mean', 'lot', 'us']

19 Stem (Code) Loads a Porter stemmer implementation to remove suffixes from tokens. See http://nltk.org/api/nltk.stem.html for more information on NLTK's stemmers.
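A sketch of tp.stemTokens using NLTK's Porter stemmer (the stemmer is named on the slide; the wrapper itself is an assumption):

from nltk.stem.porter import PorterStemmer

_stemmer = PorterStemmer()

def stemTokens(tokenizedTweets):
    # Strip suffixes from every token, e.g. 'advice' -> 'advic', 'forming' -> 'form'.
    return [[_stemmer.stem(t) for t in tokens] for tokens in tokenizedTweets]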

20 Make Bags of Words
CALL
>> bagsOfWords = [tp.makeBagOfWords(tokens, documentClass=docClass) for docClass, tokens in zip(documentClasses, stemmedSet)]
SAMPLE OUTPUT
>> bagsOfWords[1][50][0].items()
[('us', 2), ('advic', 1), ('band', 1), ('could', 1), ('david', 1), ('form', 1), ('give', 1), ('hey', 1), ('lot', 1), ('mean', 1), ('pleas', 1)]

21 Make Bags of Words (Code) For each tweet, constructs a bag of words (FreqDist) that counts the number of times each token occurs. Setting the bigrams argument to True will also include bigrams in the bags.
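Judging from the sample output (bagsOfWords[1][50][0] is a FreqDist), each example is a (features, label) pair in the format NLTK's classifiers expect. A sketch under that assumption:

import nltk

def makeBagOfWords(tokenizedTweets, documentClass, bigrams=False):
    # For each tweet, count how often each token occurs; optionally count bigrams too.
    examples = []
    for tokens in tokenizedTweets:
        features = list(tokens)
        if bigrams:
            features += list(nltk.bigrams(tokens))
        examples.append((nltk.FreqDist(features), documentClass))
    return examples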

22 Make Train and Test
CALL
>> trainSet, testSet = tp.makeTrainAndTest(reduce(lambda x, y: x + y, bagsOfWords), cutoff=0.9)
SAMPLE OUTPUT
>> len(trainSet), len(testSet)
(50697, 5633)

23 Make Train and Test (Code) Given all of your examples, randomly selects a proportion cutoff of the examples for training and the remaining 1 - cutoff for testing.
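A minimal sketch of tp.makeTrainAndTest; the cutoff argument comes from the slide, the shuffling strategy is an assumption:

import random

def makeTrainAndTest(examples, cutoff=0.9):
    # Shuffle the labeled examples, then use the first `cutoff` fraction for training
    # and the remainder for testing.
    examples = list(examples)
    random.shuffle(examples)
    split = int(len(examples) * cutoff)
    return examples[:split], examples[split:]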

24 Train Classifier
CALL
>> import nltk.classify.util
>> nbClassifier = tp.trainNBClassifier(trainSet, testSet)
SAMPLE OUTPUT
>> nbClassifier.show_most_informative_features(n=20)
…..

25 Train Classifier (Code) Trains a Naive Bayes classifier over the input training set. Prints the accuracy over the test set and the most discriminating tokens.
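A sketch of tp.trainNBClassifier built on NLTK's Naive Bayes classifier; the accuracy helper comes from nltk.classify.util (imported on slide 24), the exact output format is an assumption:

import nltk
import nltk.classify.util

def trainNBClassifier(trainSet, testSet):
    # Train Naive Bayes on (FreqDist, label) pairs, then report test accuracy and
    # the most informative features.
    classifier = nltk.NaiveBayesClassifier.train(trainSet)
    print 'Accuracy: %f' % nltk.classify.util.accuracy(classifier, testSet)
    classifier.show_most_informative_features(n=20)
    return classifier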

26 Twitter Collection Demo

27 Directions for Collection Demo
Now, try out twitter_kw_stream.py to collect more tweets over a couple of different classes.
Some possible keywords (with high volume): apple, google, cat, dog, pizza, ice cream
Open a new terminal window in the same directory.
For each keyword KW you search for, type:
python twitter_kw_stream.py --keywords=KW
Wait a minute or so (until you retrieve about 100 tweets).

28 Accuracy given training data size
Assuming the keywords searched for were: apple, google, cat, dog, pizza, ice cream
In the terminal already running the Python interpreter:
>> paths = ['apple.txt', 'google.txt', 'cat.txt', 'dog.txt', 'pizza.txt', 'ice cream.txt']
>> addtlStopwords = ['apple', 'google', 'cat', 'dog', 'pizza', 'ice', 'cream']
>> cutoffs, uniAccs, biAccs = tp.plotAccuracy(paths, addtlStopwords=addtlStopwords)
If matplotlib is installed correctly, this should display the accuracy of the NB classifier while varying the amount of training data, with and without bigrams. The plot is also saved to nbPlot.png.
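A rough sketch of what tp.plotAccuracy might do, assuming it reuses the preprocessing pipeline from the earlier slides and treats each file name as a class label; the cutoff grid and plotting details are assumptions:

import matplotlib.pyplot as plt
import nltk
import nltk.classify.util
import text_processing as tp

def plotAccuracy(paths, addtlStopwords=None, cutoffs=(0.1, 0.3, 0.5, 0.7, 0.9)):
    classes = [p.replace('.txt', '') for p in paths]
    # Run each file through the pipeline: load, tokenize, filter, remove stop words, stem.
    stemmed = []
    for path in paths:
        tokens = tp.tokenizeTweets(tp.loadTweetText(path))
        tokens = tp.removeStopwords(tp.filterOnlyEnglish(tokens), addtlStopwords or [])
        stemmed.append(tp.stemTokens(tokens))
    uniBags = sum([tp.makeBagOfWords(t, documentClass=c) for c, t in zip(classes, stemmed)], [])
    biBags = sum([tp.makeBagOfWords(t, documentClass=c, bigrams=True) for c, t in zip(classes, stemmed)], [])
    uniAccs, biAccs = [], []
    # Vary the fraction of data used for training, with and without bigram features.
    for cutoff in cutoffs:
        for bags, accs in [(uniBags, uniAccs), (biBags, biAccs)]:
            train, test = tp.makeTrainAndTest(bags, cutoff=cutoff)
            clf = nltk.NaiveBayesClassifier.train(train)
            accs.append(nltk.classify.util.accuracy(clf, test))
    plt.plot(cutoffs, uniAccs, label='unigrams')
    plt.plot(cutoffs, biAccs, label='unigrams + bigrams')
    plt.xlabel('fraction of data used for training')
    plt.ylabel('accuracy')
    plt.legend()
    plt.savefig('nbPlot.png')
    return list(cutoffs), uniAccs, biAccs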

29 Other things you can do

30 Get Document Similarity
CALL
>> docs = tp.loadDocuments(paths)
>> sims = tp.getDocumentSimilarities(paths, [p.replace('.txt', '') for p in paths])
SAMPLE OUTPUT
>> sims[('apple', 'dog')]
0.30735795122824466
>> sims[('apple', 'google')]
0.44204540065105324

31 Get Document Similarity (Code) Calculates the cosine similarity for each pair of bags of words: the dot product of the two frequency vectors after normalizing each to a unit vector.
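The per-pair computation can be sketched as follows (this is the cosine formula described on the slide, not the lab's actual getDocumentSimilarities code):

import math

def cosineSimilarity(bag1, bag2):
    # bag1, bag2: token -> count mappings (e.g. FreqDists).
    dot = sum(bag1[t] * bag2[t] for t in bag1 if t in bag2)
    norm1 = math.sqrt(sum(c * c for c in bag1.values()))
    norm2 = math.sqrt(sum(c * c for c in bag2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)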

32 Calculate TF-IDF
CALL
>> tfIdfs = tp.getTfIdfs(docs)
SAMPLE OUTPUT
>> for path, tfIdf in zip(paths, tfIdfs):
...     print 'Top 10 TF-IDF for %s: %s' % (path, '\n'.join([str(t) for t in tfIdf]))
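There is no code slide for getTfIdfs in the transcript; here is a hedged sketch of a standard TF-IDF computation over bags of words (the representation of docs and the exact weighting used by the lab are assumptions):

import math

def getTfIdfs(docs, topN=10):
    # docs: one token -> count mapping per document.
    # Returns the topN (token, tf-idf) pairs for each document,
    # where tf = count / doc length and idf = log(N / document frequency).
    numDocs = len(docs)
    docFreq = {}
    for doc in docs:
        for token in doc:
            docFreq[token] = docFreq.get(token, 0) + 1
    results = []
    for doc in docs:
        total = float(sum(doc.values()))
        scored = [(token, (count / total) * math.log(numDocs / float(docFreq[token])))
                  for token, count in doc.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        results.append(scored[:topN])
    return results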

33 Part-of-Speech Tag
CALL
>> posSet = [[tp.partOfSpeechTag(ts) for ts in classTokens[:100]] for classTokens in tokenSet]
SAMPLE OUTPUT
>> posSet[1][50]
[('@', 'IN'), ('davidarchie', 'NN'), ('hey', 'NN'), ('david', 'VBD'), ('!', '.'), ('me', 'PRP'), ('and', 'CC'), ('my', 'PRP$'), ('bestfriend', 'NN'), ('are', 'VBP'), ('forming', 'VBG'), ('a', 'DT'), ('band', 'NN'), ('.', '.'), ('could', 'MD'), ('you', 'PRP'), ('give', 'VB'), ('us', 'PRP'), ('any', 'DT'), ('advice', 'NN'), ('please', 'NN'), ('?', '.'), ('it', 'PRP'), ("'", "''"), ('s', 'VBZ'), ('means', 'NNS'), ('a', 'DT'), ('lot', 'NN'), ('for', 'IN'), ('us', 'PRP'), (':)', ':')]

34 Part-of-Speech Tag (Code) Very simple: as long as you have a list of tokens, you can just call nltk.pos_tag(tokens) to tag them with parts of speech.
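A sketch of tp.partOfSpeechTag as a thin wrapper around nltk.pos_tag (the NLTK call is named on the slide; it requires the tagger model downloaded on slide 6):

import nltk

def partOfSpeechTag(tokens):
    # Tag a list of tokens with Penn Treebank part-of-speech tags.
    return nltk.pos_tag(tokens)

# e.g. partOfSpeechTag(['forming', 'a', 'band'])
# -> [('forming', 'VBG'), ('a', 'DT'), ('band', 'NN')]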

35 Find Named-Entities
CALL
>> neSet = [[tp.getNamedEntityTree(ts) for ts in classTokens[:100]] for classTokens in tokenSet]
SAMPLE OUTPUT
>> neSet[1][50]
Tree('S', [('@', 'IN'), ('davidarchie', 'NN'), ('hey', 'NN'), ('david', 'VBD'), ('!', '.'), ('me', 'PRP'), ('and', 'CC'), ('my', 'PRP$'), ('bestfriend', 'NN'), ('are', 'VBP'), ('forming', 'VBG'), ('a', 'DT'), ('band', 'NN'), ('.', '.'), ('could', 'MD'), ('you', 'PRP'), ('give', 'VB'), ('us', 'PRP'), ('any', 'DT'), ('advice', 'NN'), ('please', 'NN'), ('?', '.'), ('it', 'PRP'), ("'", "''"), ('s', 'VBZ'), ('means', 'NNS'), ('a', 'DT'), ('lot', 'NN'), ('for', 'IN'), ('us', 'PRP'), (':)', ':')])

36 Find Named-Entities (Code) Similarly simple: just call two NLTK functions. Note, however, that the performance of the POS tagger and NE chunker is quite poor on Twitter messages.
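The two NLTK functions are presumably nltk.pos_tag and nltk.ne_chunk (the latter needs the maxent_ne_chunker model from slide 6); a sketch of tp.getNamedEntityTree under that assumption:

import nltk

def getNamedEntityTree(tokens):
    # POS-tag the tokens, then run NLTK's named-entity chunker, returning an nltk.Tree.
    return nltk.ne_chunk(nltk.pos_tag(tokens))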

