
1 Text Mining CSC 576: Data Mining

2 Today…
Very Brief Intro to Natural Language Processing (NLP)
What is Text Mining?
Using Naïve Bayes
“Bag of words” model
ROC Curves

3 What is it?
Learning from text instead of a traditional, structured dataset
Extracting useful information from text

4 Dataset Differences
Text is unstructured:
No columns, usually no predefined variables
Common file types: .doc, .pdf, .html
Metadata headers need to be removed

5 Terminology
A “dataset” is called a corpus, e.g.:
A collection of news articles from Reuters or AP
The published works of Shakespeare
The Yelp corpus of restaurant reviews

6 Applications
Text summarization: from identifying key phrases to full summarization
Document retrieval: given a corpus and the user’s “information need” (example: Google)
Text categorization: assign a category to a new document, given the training data (e.g., sentiment analysis using Yelp data)
Document clustering
Language identification
Ascribing authorship

7 Yelp Dataset Challenge
Natural Language Processing (NLP): How well can you guess a review's rating from its text alone? What are the most common positive and negative words used in our reviews? Are Yelpers a sarcastic bunch? And what kinds of correlations do you see between tips and reviews: could you extract tips from reviews?

8 Kaggle Sentiment Analysis Competition
Classify the sentiment of sentences from the Rotten Tomatoes movie-reviews dataset

9 Basic NLP Operations
Tokenization: breaking a text file (.doc, .html) into individual words for further processing; removing punctuation and whitespace
Stemming: “fishing” → “fish”, “fished” → “fish”
Removing stop words: remove common non-descriptive words such as “the”, “a”, “I”, “me”
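A minimal pure-Python sketch of these three steps (the stop-word list and suffix rules below are illustrative only; a real pipeline would typically use a library such as NLTK or spaCy):

```python
import re

# Tiny illustrative stop-word list; real systems use much larger ones.
STOP_WORDS = {"the", "a", "i", "me", "is", "and", "of", "to"}

def tokenize(text):
    # Lowercase and split on non-letters, discarding punctuation and whitespace.
    return [tok for tok in re.split(r"[^a-z]+", text.lower()) if tok]

def stem(token):
    # Crude suffix stripping for illustration only ("fishing"/"fished" -> "fish").
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    # Tokenize, drop stop words, then stem what remains.
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

print(preprocess("I fished the river; fishing is fun."))
# ['fish', 'river', 'fish', 'fun']
```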

10 Bag of Words Representation
Don’t retain the ordering of words!
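A small sketch of the idea using Python's collections.Counter: two documents containing the same words in a different order get the identical bag-of-words representation (the example strings are hypothetical):

```python
from collections import Counter

def bag_of_words(tokens):
    # Keep only the count of each word; word order is discarded.
    return Counter(tokens)

doc_a = "chinese beijing chinese".split()
doc_b = "beijing chinese chinese".split()   # same words, different order

print(bag_of_words(doc_a))                         # Counter({'chinese': 2, 'beijing': 1})
print(bag_of_words(doc_a) == bag_of_words(doc_b))  # True
```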

11 Bag of Words for Document Classification

12 Bag of Words for Document Classification
We are interested in the probabilistic relationship between:
Document d (bag-of-words model)
Classification c (its sentiment, its topic, whether it’s spam, …)
Want to predict: P(C|D)
How can we learn the probabilities P(C|D), statistically, from a corpus?

13 Application of Bayes’ Rule
Document represented as a “bag of words”. What is P(D|C)?
Spam filtering application: P(“homework”|NOTSPAM), P(“password”|SPAM)
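For reference, a sketch of how Bayes’ rule combines with the bag-of-words assumption (standard naïve Bayes; the class priors P(C) and word likelihoods P(w|C) are what get estimated from the corpus):

```latex
\[
P(C \mid D) \;=\; \frac{P(D \mid C)\,P(C)}{P(D)}
\;\propto\; P(C)\,P(D \mid C)
\;\approx\; P(C)\prod_{i=1}^{n} P(w_i \mid C)
\]
```

where \(w_1,\dots,w_n\) are the words of document \(D\), and the product reflects the naïve assumption that words occur independently given the class.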

14 Example: a training corpus of four labeled documents (three labeled CHINA, one labeled JAPAN) and a test document d5

15 Example: Priors
Calculate priors from the training data:
P(c) = 3/4 (c = CHINA)
P(j) = 1/4 (j = JAPAN)

16 Example: Conditional Probabilities
Calculate conditional probabilities from the training data:
P(Chinese|c) = 5/8    P(Tokyo|c) = 0/8    P(Japan|c) = 0/8
P(Chinese|j) = 1/3    P(Tokyo|j) = 1/3    P(Japan|j) = 1/3

17 Laplace Smoothing
Zero-probability issue! P(Tokyo|c) = 0/8 and P(Japan|c) = 0/8 would give any document containing “Tokyo” or “Japan” zero probability under class c.
One solution: Laplace (add-one) smoothing — add 1 to every word count and add the vocabulary size |V| to the denominator.
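A sketch of the add-one (Laplace) smoothed estimate used on the next slide, where |V| is the vocabulary size (6 in this example):

```latex
\[
P(w \mid c) \;=\; \frac{\mathrm{count}(w, c) + 1}{\sum_{w'} \mathrm{count}(w', c) + |V|}
\]
```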

18 Example: Conditional Probabilities
Calculate conditional probabilities from the training data, with smoothing:
P(Chinese|c) = (5+1)/(8+6) = 3/7    P(Tokyo|c) = (0+1)/(8+6) = 1/14    P(Japan|c) = (0+1)/(8+6) = 1/14
P(Chinese|j) = (1+1)/(3+6) = 2/9    P(Tokyo|j) = (1+1)/(3+6) = 2/9    P(Japan|j) = (1+1)/(3+6) = 2/9

19 Example: Classification
Compute (unnormalized) probabilities for the test document d5:
P(c|d5) ∝ 3/4 × 3/7 × 3/7 × 3/7 × 1/14 × 1/14 ≈ 0.0003
P(j|d5) ∝ 1/4 × 2/9 × 2/9 × 2/9 × 2/9 × 2/9 ≈ 0.0001
d5 should be classified as CHINA
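A runnable sketch of the whole calculation. The training documents on slide 14 are not legible in this transcript, so the corpus below is a hypothetical reconstruction chosen to match the counts implied by the probabilities above (3 CHINA documents with 8 words total, 5 of them “chinese”; 1 JAPAN document with 3 words; vocabulary size 6):

```python
from collections import Counter
import math

# Hypothetical training corpus consistent with the slide's probabilities;
# the actual documents on the original slide image may differ.
train = [
    ("chinese beijing chinese",  "CHINA"),
    ("chinese chinese shanghai", "CHINA"),
    ("chinese macao",            "CHINA"),
    ("tokyo japan chinese",      "JAPAN"),
]
test_doc = "chinese chinese chinese tokyo japan"   # d5

# Class priors and per-class word counts.
class_docs  = Counter(label for _, label in train)
word_counts = {label: Counter() for label in class_docs}
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for text, _ in train for w in text.split()}

def log_score(text, label):
    # log P(label) + sum_i log P(w_i | label), with Laplace (add-one) smoothing.
    prior = class_docs[label] / len(train)
    total = sum(word_counts[label].values())
    score = math.log(prior)
    for w in text.split():
        score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
    return score

for label in class_docs:
    print(label, math.exp(log_score(test_doc, label)))
# CHINA ~0.0003, JAPAN ~0.0001  ->  d5 is classified as CHINA
```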

20 ROC Curve
Receiver Operating Characteristic (ROC) Curve:
Graphical approach for displaying the tradeoff between the true positive rate and false positive rate of a classifier
Useful for comparing the relative performance among different classifiers
Drawn in 2 dimensions: X-axis is the false positive rate, Y-axis is the true positive rate
TPR = TP/(TP+FN): “fraction of positive examples correctly predicted by the model”
FPR = FP/(FP+TN): “fraction of negative examples wrongly predicted as positive”
So, we want a high TPR and a low FPR.
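A minimal sketch of the two rates as functions of confusion-matrix counts (the example counts are taken from the classify-everything-as-positive step on the later slides):

```python
def tpr_fpr(tp, fp, tn, fn):
    tpr = tp / (tp + fn)   # fraction of actual positives predicted positive
    fpr = fp / (fp + tn)   # fraction of actual negatives predicted positive
    return tpr, fpr

# Classifying all 8 test records from the later slides as positive
# (5 actual positives, 3 actual negatives) gives the point (TPR=1, FPR=1).
print(tpr_fpr(tp=5, fp=3, tn=0, fn=0))   # (1.0, 1.0)
```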

21 An ROC curve only needs one classifier drawn
But multiple classifiers can be included on the same plot
Think of M1 as C1, i.e., Classifier #1 instead of Model #1 (e.g., your decision tree without pruning)

22 If an ROC curve passed through these points, the interpretations would be:
A (TPR=0, FPR=0): model predicts every instance as negative
B (TPR=1, FPR=1): model predicts every instance as positive
C (TPR=1, FPR=0): ideal model, no errors

23 If an ROC curve passed through these points, the interpretations would be:
A (TPR=0, FPR=0): model predicts every instance as negative
B (TPR=1, FPR=1): model predicts every instance as positive
C (TPR=1, FPR=0): ideal model, no errors
Ideal model: as close to the top-left corner as possible
R: random guessing (along the main diagonal)

24 ROC curves are useful for comparing the performance of two classifiers
In this graph: M1 is better when FPR < 0.36; otherwise, M2 is superior

25 Area Under the ROC Curve (AUC) Metric:
Ideal model (as close to the top-left corner as possible): AUC = 1
Random guessing (along the main diagonal): AUC = 0.5

26 Generating an ROC Curve
What’s necessary? The classifier needs to produce a continuously-valued output that can be used to rank its predictions, from the instance that is most likely positive to the instance that is least likely positive.
Classifiers that do this: Naïve Bayes, Support Vector Machines, Logistic Regression — anything that outputs probabilities or ranking scores.
Test Instance #    Model’s probability of the instance being +
1                  0.25
2                  0.85
3                  0.93
4                  0.43
5
6                  0.53

27 Generating an ROC Curve
Sort the test records in increasing order of their model output values: 0.25, 0.43, 0.53, 0.76, 0.85, 0.87, 0.93, 0.95. Each record also has an actual class label (+ or -).

28 Generating an ROC Curve
Select the lowest-ranked test record. Assign the selected record and those ranked above it to the positive class, and count TP and FP at this threshold. This is equivalent to classifying all records as +: TP = 5 (every actual positive is caught) and FP equals the number of actual negatives (3).

29 Generating an ROC Curve
Select the next test record. Classify it and those ranked above it as positive, and those below it as negative, then count TP and FP again. Here the TP count drops from 5 to 4, so the lowest-scored record was an actual positive.

30 Generating an ROC Curve
Repeat for all test records, obtaining a (TP, FP) count at each threshold.

31 Generating an ROC Curve
Repeat for all test records, then convert each (TP, FP) count into (TPR, FPR). With 5 actual positives, the TP counts 5, 4, 3, 2, 1 correspond to TPR values 1.0, 0.8, 0.6, 0.4, 0.2; the FPR values are obtained the same way from the FP counts.

32 Generating an ROC Curve
Plot the TPR (y-axis) against the FPR (x-axis) to obtain the ROC curve.
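A sketch of the full procedure in Python. The scores are the eight model outputs from the preceding slides; the +/- labels are hypothetical (5 positives, 3 negatives), since the actual-class row is not legible in this transcript:

```python
def roc_points(scores, labels):
    # Sort records by increasing score; at each threshold, the current record
    # and everything scored above it is predicted positive (as on the slides).
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    pos = sum(labels)               # number of actual positives
    neg = len(labels) - pos         # number of actual negatives
    points = [(1.0, 1.0)]           # lowest threshold: everything predicted +
    tp, fp = pos, neg
    for i in order:
        # Raising the threshold past record i flips its prediction to negative.
        if labels[i] == 1:
            tp -= 1
        else:
            fp -= 1
        points.append((fp / neg, tp / pos))
    return points                   # list of (FPR, TPR) pairs, ending at (0, 0)

# Scores from the slides; labels are hypothetical for illustration.
scores = [0.25, 0.43, 0.53, 0.76, 0.85, 0.87, 0.93, 0.95]
labels = [1,    0,    1,    0,    1,    0,    1,    1]
for fpr, tpr in roc_points(scores, labels):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```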

33 References
Fundamentals of Machine Learning for Predictive Data Analytics, 1st Edition, Kelleher et al.
Data Mining and Business Analytics with R, 1st Edition, Ledolter

