Text Mining CSC 576: Data Mining
Today…
Very Brief Intro to Natural Language Processing (NLP)
What is Text Mining?
Using Naïve Bayes
The "bag of words" model
ROC Curves
What is it? Learning from text, instead of a traditional dataset Extracting useful information from text
Dataset Differences Text is unstructured: no columns, usually no variables; .doc, .pdf, .html file types; metadata headers need to be removed
Terminology “Dataset” is called a corpus: Collection of news articles from Reuters or AP Published works of Shakespeare Yelp corpus of restaurant reviews
Applications
Text summarization: from identifying key phrases to full summarization
Document retrieval: example: Google; given a corpus and the user's "information need"
Text categorization: assign a category to a new document, given the training data; e.g., sentiment analysis using Yelp data
Document clustering
Language identification
Ascribing authorship
Yelp Dataset Challenge http://www.yelp.com/dataset_challenge Natural Language Processing (NLP): How well can you guess a review's rating from its text alone? What are the most common positive and negative words used in our reviews? Are Yelpers a sarcastic bunch? And what kinds of correlations do you see between tips and reviews: could you extract tips from reviews?
Kaggle Sentiment Analysis Competition https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews Classify the sentiment of sentences from the Rotten Tomatoes dataset
Basic NLP Operations
Tokenization: breaking a text file (.doc, .html) into individual words for further processing; removing punctuation and whitespace
Stemming: "fishing" → "fish", "fished" → "fish"
Removing stop words: remove common non-descriptive words: "the", "a", "I", "me"
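A minimal Python sketch of these three steps (the tiny stop-word list and the crude suffix-stripping stemmer below are illustrative stand-ins for real tools such as NLTK's tokenizers and the Porter stemmer):

    import re

    STOP_WORDS = {"the", "a", "an", "i", "me"}  # tiny illustrative list

    def tokenize(text):
        """Lowercase the text and split on non-letters, dropping punctuation and whitespace."""
        return re.findall(r"[a-z]+", text.lower())

    def stem(word):
        """Crude suffix stripping: 'fishing' -> 'fish', 'fished' -> 'fish'.
        Real systems use algorithms such as the Porter stemmer."""
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def preprocess(text):
        return [stem(w) for w in tokenize(text) if w not in STOP_WORDS]

    print(preprocess("I fished the river; fishing the lake"))  # ['fish', 'river', 'fish', 'lake']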
Bag of Words Representation Don't retain ordering of words!
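For instance, a sketch of a bag-of-words representation using Python's collections.Counter, which keeps word counts but no ordering:

    from collections import Counter

    tokens = ["fish", "river", "fish", "lake"]  # e.g., output of the preprocessing step above
    bag = Counter(tokens)
    print(bag)           # Counter({'fish': 2, 'river': 1, 'lake': 1})
    print(bag["fish"])   # 2
    print(bag["ocean"])  # 0 -- words absent from the document have count 0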
Bag of Words for Document Classification
Bag of Words for Document Classification We are interested in the probabilistic relationship between: Document d (bag of words model) Classification c (its sentiment, its topic, whether it's spam, …) Want to predict: P(C|D) How can we learn probabilities P(C|D), statistically, from a corpus?
Application of Bayes' Rule By Bayes' rule, P(C|D) = P(D|C) P(C) / P(D). Document represented as a "bag of words". What is P(D|C)? Under the naive independence assumption, it is the product of per-word probabilities such as, in a spam filtering application, P("homework"|NOTSPAM) and P("password"|SPAM).
Example: Training and Test Data

DocID   Words                                  Class
d1      Chinese Beijing Chinese                c (CHINA)
d2      Chinese Chinese Shanghai               c (CHINA)
d3      Chinese Macao                          c (CHINA)
d4      Tokyo Japan Chinese                    j (JAPAN)
d5      Chinese Chinese Chinese Tokyo Japan    ? (test document)
Example: Priors Calculate priors from the training data: P(c) = 3/4 (c = CHINA), P(j) = 1/4 (j = JAPAN)
Example: Conditional Probabilities Calculate Conditional Probabilities from Training Data: P(Chinese|c)=5/8 P(Tokyo|c)=0/8 P(Japan|c)=0/8 P(Chinese|j)=1/3 P(Tokyo|j)=1/3 P(Japan|j)=1/3
Laplace Smoothing Zero-probability issue! P(Tokyo|c) = 0/8 and P(Japan|c) = 0/8 would zero out the whole product for any document containing those words. One solution: Laplace (add-one) smoothing: P(w|c) = (count(w,c) + 1) / (count(c) + |V|), where |V| is the vocabulary size (here |V| = 6).
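A one-line helper makes the smoothed estimates on the next slide easy to verify (the vocabulary size of 6 counts the example's six distinct words):

    def smoothed_prob(word_count, class_token_total, vocab_size):
        """Add-one (Laplace) smoothed estimate of P(word | class)."""
        return (word_count + 1) / (class_token_total + vocab_size)

    print(smoothed_prob(5, 8, 6))  # P(Chinese|c) = 6/14 = 3/7 ~ 0.429
    print(smoothed_prob(0, 8, 6))  # P(Tokyo|c) = 1/14 ~ 0.071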
Example: Conditional Probabilities Calculate Conditional Probabilities from Training Data, w/ smoothing: P(Chinese|c)=(5+1)/(8+6)=3/7 P(Tokyo|c)=(0+1)/(8+6)=1/14 P(Japan|c)=(0+1)/(8+6)=1/14 P(Chinese|j)=(1+1)/(3+6)=2/9 P(Tokyo|j)=(1+1)/(3+6)=2/9 P(Japan|j)=(1+1)/(3+6)=2/9
Example: Classification Compute (unnormalized) probabilities for the test document:
P(c|d5) ∝ 3/4 * 3/7 * 3/7 * 3/7 * 1/14 * 1/14 ≈ 0.0003
P(j|d5) ∝ 1/4 * 2/9 * 2/9 * 2/9 * 2/9 * 2/9 ≈ 0.0001
Since 0.0003 > 0.0001, d5 should be classified as CHINA.
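Putting the example together, a compact sketch of multinomial Naïve Bayes with Laplace smoothing on the training documents above (a real implementation would sum log-probabilities to avoid numeric underflow on long documents):

    from collections import Counter

    # Training data: (tokens, class); d5 is the test document.
    train = [("Chinese Beijing Chinese".split(), "c"),
             ("Chinese Chinese Shanghai".split(), "c"),
             ("Chinese Macao".split(), "c"),
             ("Tokyo Japan Chinese".split(), "j")]
    d5 = "Chinese Chinese Chinese Tokyo Japan".split()

    classes = {"c", "j"}
    vocab = {w for tokens, _ in train for w in tokens}  # 6 distinct words
    priors = {cl: sum(1 for _, c in train if c == cl) / len(train) for cl in classes}
    counts = {cl: Counter(w for tokens, c in train if c == cl for w in tokens)
              for cl in classes}
    totals = {cl: sum(counts[cl].values()) for cl in classes}  # 8 tokens for c, 3 for j

    def score(doc, cl):
        """Unnormalized P(class | doc) with Laplace smoothing."""
        p = priors[cl]
        for w in doc:
            p *= (counts[cl][w] + 1) / (totals[cl] + len(vocab))
        return p

    for cl in sorted(classes):
        print(cl, score(d5, cl))  # c ~ 0.0003, j ~ 0.0001 -> classify d5 as CHINA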
ROC Curve
Receiver Operating Characteristic (ROC) Curve: a graphical approach for displaying the tradeoff between the true positive rate and the false positive rate of a classifier. Useful for comparing the relative performance among different classifiers.
Drawn in 2 dimensions: X-axis: false positive rate; Y-axis: true positive rate
TPR = TP/(TP+FN): "fraction of positive examples correctly predicted by the model"
FPR = FP/(TN+FP): "fraction of negative examples wrongly predicted as positive"
So, we want a high TPR and a low FPR.
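As a quick sketch, the two rates computed from confusion-matrix counts (the counts passed in the usage lines are made-up values):

    def tpr(tp, fn):
        """True positive rate: fraction of actual positives correctly predicted."""
        return tp / (tp + fn)

    def fpr(fp, tn):
        """False positive rate: fraction of actual negatives wrongly predicted as positive."""
        return fp / (fp + tn)

    print(tpr(tp=4, fn=1))  # 0.8
    print(fpr(fp=1, tn=2))  # 0.3333...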
An ROC curve needs only one classifier drawn, but multiple classifiers can be included. Think of M1 as C1, Classifier #1, rather than Model #1 (e.g., your decision tree w/o pruning).
If an ROC curve passed through these data points, the interpretations would be:
A (TPR=0, FPR=0): model predicts every instance as negative
B (TPR=1, FPR=1): model predicts every instance as positive
C (TPR=1, FPR=0): ideal model, no errors
An ideal model is as close to the top-left corner as possible; random guessing (R) falls along the main diagonal.
ROC curves are useful for comparing the performance of two classifiers. In this graph, M1 is better when FPR < 0.36; otherwise, M2 is superior.
Area Under ROC Curve (AUC) Metric: an ideal model has AUC = 1; random guessing (along the main diagonal) has AUC = 0.5.
Generating an ROC Curve What's necessary? The classifier needs to produce a continuously-valued output that can be used to rank its predictions, from the instance that is most likely positive to the instance that is least likely positive. Classifiers that do this: those that output probabilities or scores, such as Naïve Bayes, Support Vector Machines, and Logistic Regression.

Test Instance #   Model's Probability Output of Instance Being +
1                 0.25
2                 0.85
3                 0.93
4                 0.43
5
6                 0.53
Generating an ROC Curve Each test record in the slide's table has an actual class (+ or -) and a model output value. Sort the test records in increasing order of their output values:

Model Output (sorted): 0.25  0.43  0.53  0.76  0.85  0.87  0.93  0.95
Generating an ROC Curve Select the lowest-ranked test record (output 0.25). Assign the selected record and those ranked above it to the positive class; this is equivalent to classifying all records as +. Count TP and FP at this cutoff: all 5 actual positives are predicted positive (TP = 5), as are all 3 actual negatives (FP = 3).
Generating an ROC Curve Select the next test record. Classify it and those ranked above it as positive; classify those below it as negative. Count TP and FP at this cutoff: the excluded lowest record (0.25) is an actual positive, so TP drops from 5 to 4 while FP stays at 3.
Generating an ROC Curve Repeat for all test records, ending with a cutoff of 1.00 above every output (all records classified as negative: TP = 0, FP = 0). At each cutoff, convert the counts to rates: TPR = TP/5 and FPR = FP/3. As TP counts down through 5, 4, 3, 2, 1, 0, TPR takes the values 1.0, 0.8, 0.6, 0.4, 0.2, 0; FPR is computed the same way from FP.
Generating an ROC Curve Plot the TPR (y-axis) against the FPR (x-axis) at each cutoff; connecting the points traces the ROC curve.
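A sketch of the full procedure, including a trapezoidal-rule AUC. The eight output scores match the example; the per-record class labels are hypothetical (five positives and three negatives, with 0.25 an actual positive as on the earlier slide), since the slide's exact labels are not reproduced here:

    # Each test record: (model's probability output, actual class).
    records = [(0.25, "+"), (0.43, "-"), (0.53, "+"), (0.76, "-"),
               (0.85, "+"), (0.87, "-"), (0.93, "+"), (0.95, "+")]

    P = sum(1 for _, c in records if c == "+")  # actual positives (5)
    N = len(records) - P                        # actual negatives (3)

    records.sort()  # increasing order of output value
    points = []
    # Sweep the cutoff from below the lowest score (everything predicted +)
    # to above the highest (everything predicted -), computing (FPR, TPR).
    for i in range(len(records) + 1):
        predicted_pos = records[i:]             # records at or above the cutoff
        tp = sum(1 for _, c in predicted_pos if c == "+")
        fp = len(predicted_pos) - tp
        points.append((fp / N, tp / P))

    points.sort()   # order by FPR for plotting
    print(points)   # the (FPR, TPR) pairs that trace the ROC curve

    # Trapezoidal-rule area under the swept points:
    auc = sum((x2 - x1) * (y1 + y2) / 2
              for (x1, y1), (x2, y2) in zip(points, points[1:]))
    print("AUC ~", round(auc, 3))  # 0.6 with these illustrative labels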
References
Fundamentals of Machine Learning for Predictive Data Analytics, 1st Edition, Kelleher et al.
Data Mining and Business Analytics with R, 1st Edition, Ledolter
http://handsondatascience.com/TextMiningO.pdf
http://infospace.ischool.syr.edu/2013/04/23/what-is-text-mining/
http://www.cs.waikato.ac.nz/~ihw/papers/04-IHW-Textmining.pdf