Text Mining CSC 576: Data Mining

Today…
- Very Brief Intro to Natural Language Processing (NLP)
- What is Text Mining?
- Using Naïve Bayes with the “bag of words” model
- ROC Curves

What is it?
- Learning from text, instead of a traditional dataset
- Extracting useful information from text

Dataset Differences
Text is unstructured:
- No columns, usually no variables
- .doc, .pdf, .html file types
- Metadata headers need to be removed

Terminology
The “dataset” is called a corpus:
- A collection of news articles from Reuters or AP
- The published works of Shakespeare
- The Yelp corpus of restaurant reviews

Applications
- Text summarization: from identifying key phrases to full summarization
- Document retrieval (example: Google): given a corpus and the user’s “information need”
- Text categorization: assign a category to a new document, given the training data (e.g., sentiment analysis using Yelp data)
- Document clustering
- Language identification
- Ascribing authorship

Yelp Dataset Challenge
http://www.yelp.com/dataset_challenge
Natural Language Processing (NLP): How well can you guess a review’s rating from its text alone? What are the most common positive and negative words used in our reviews? Are Yelpers a sarcastic bunch? And what kinds of correlations do you see between tips and reviews: could you extract tips from reviews?

Kaggle Sentiment Analysis Competition
https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews
Classify the sentiment of sentences from the Rotten Tomatoes dataset.

Basic NLP Operations
- Tokenization: breaking a text file (.doc, .html) into individual words for further processing; removing punctuation and whitespace
- Stemming: “fishing” → “fish”, “fished” → “fish”
- Removing stop words: remove common non-descriptive words such as “the”, “a”, “I”, “me”
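
A minimal sketch of these three operations, assuming Python with the NLTK library (the slides name no language or library, and the sample sentence is illustrative):

```python
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

nltk.download("punkt")      # tokenizer models (newer NLTK may need "punkt_tab")
nltk.download("stopwords")  # stop word lists

text = "He was fishing and fished near the boats."

# Tokenization: split raw text into words, dropping punctuation tokens
tokens = [t for t in nltk.word_tokenize(text.lower()) if t.isalpha()]

# Stop word removal: drop common non-descriptive words
stops = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stops]

# Stemming: "fishing"/"fished" -> "fish"
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])  # ['fish', 'fish', 'near', 'boat']
```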

Bag of Words Representation
Don’t retain the ordering of words!
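
A minimal sketch of the idea: a bag of words is just a table of word counts with ordering discarded (the example sentence is my own, not from the slides):

```python
from collections import Counter

text = "the quick brown fox jumps over the lazy dog the fox"
bag = Counter(text.lower().split())  # word -> count; word order is lost
print(bag)
# Counter({'the': 3, 'fox': 2, 'quick': 1, 'brown': 1, 'jumps': 1,
#          'over': 1, 'lazy': 1, 'dog': 1})
```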

Bag of Words for Document Classification

Bag of Words for Document Classification
We are interested in the probabilistic relationship between:
- Document d (bag of words model)
- Classification c (its sentiment, its topic, whether it’s spam, …)
Want to predict: P(C|D)
How can we learn the probabilities P(C|D), statistically, from a corpus?

Application of Bayes’ Rule
Document represented as a “bag of words”. What is P(D|C)?
Spam filtering application: P(“homework”|NOTSPAM), P(“password”|SPAM)
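
For reference, Bayes’ rule and the naïve conditional-independence assumption that the following slides rely on (standard identities, written here in plain notation):

P(C|D) = P(D|C) · P(C) / P(D)

P(D|C) ≈ P(w1|C) · P(w2|C) · … · P(wn|C) for the bag of words D = (w1, …, wn),
so P(C|D) ∝ P(C) · ∏ P(wi|C)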

Example: Training and Test Data
Training set:
- d1: “Chinese Beijing Chinese” → class c (CHINA)
- d2: “Chinese Chinese Shanghai” → class c
- d3: “Chinese Macao” → class c
- d4: “Tokyo Japan Chinese” → class j (JAPAN)
Test document:
- d5: “Chinese Chinese Chinese Tokyo Japan” → ?

Example: Priors
Calculate priors from the training data:
P(c) = 3/4 (c = CHINA)
P(j) = 1/4 (j = JAPAN)

Example: Conditional Probabilities
Calculate conditional probabilities from the training data:
P(Chinese|c) = 5/8   P(Tokyo|c) = 0/8   P(Japan|c) = 0/8
P(Chinese|j) = 1/3   P(Tokyo|j) = 1/3   P(Japan|j) = 1/3

Laplace Smoothing
Zero-probability issue! P(Tokyo|c) = 0/8 and P(Japan|c) = 0/8 would zero out the score of any document containing Tokyo or Japan.
One solution: Laplace (add-one) smoothing: add 1 to each word count and add the vocabulary size |V| to each denominator:
P(w|c) = (count(w,c) + 1) / (count(c) + |V|)
Here |V| = 6.

Example: Conditional Probabilities with Smoothing
Calculate conditional probabilities from the training data, with smoothing:
P(Chinese|c) = (5+1)/(8+6) = 3/7   P(Tokyo|c) = (0+1)/(8+6) = 1/14   P(Japan|c) = (0+1)/(8+6) = 1/14
P(Chinese|j) = (1+1)/(3+6) = 2/9   P(Tokyo|j) = (1+1)/(3+6) = 2/9   P(Japan|j) = (1+1)/(3+6) = 2/9

Example: Classification
Compute probabilities for the test document:
P(c|d5) ∝ 3/4 × 3/7 × 3/7 × 3/7 × 1/14 × 1/14 ≈ 0.0003
P(j|d5) ∝ 1/4 × 2/9 × 2/9 × 2/9 × 2/9 × 2/9 ≈ 0.0001
d5 should be classified as CHINA.
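
A minimal end-to-end sketch of this worked example in plain Python (the training documents are those listed above; the variable and function names are my own):

```python
from collections import Counter, defaultdict

# Training data from the worked example
train = [("Chinese Beijing Chinese", "c"),
         ("Chinese Chinese Shanghai", "c"),
         ("Chinese Macao", "c"),
         ("Tokyo Japan Chinese", "j")]
d5 = "Chinese Chinese Chinese Tokyo Japan"

word_counts = defaultdict(Counter)   # per-class word counts
doc_counts = Counter()               # per-class document counts
for text, label in train:
    word_counts[label].update(text.split())
    doc_counts[label] += 1

vocab = {w for c in word_counts.values() for w in c}
V = len(vocab)                       # |V| = 6

def score(label):
    """Prior times Laplace-smoothed likelihood of each test word."""
    s = doc_counts[label] / len(train)        # prior P(label)
    total = sum(word_counts[label].values())  # tokens in this class
    for w in d5.split():
        s *= (word_counts[label][w] + 1) / (total + V)
    return s

print(round(score("c"), 4))  # 0.0003 -> classify d5 as CHINA
print(round(score("j"), 4))  # 0.0001
```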

ROC Curve
Receiver Operating Characteristic (ROC) Curve:
- A graphical approach for displaying the tradeoff between the true positive rate and the false positive rate of a classifier
- Useful for comparing the relative performance among different classifiers
- Drawn in 2 dimensions: X-axis is the false positive rate, Y-axis is the true positive rate
FPR = FP/(TN+FP): “fraction of negative examples wrongly predicted as positive”
TPR = TP/(TP+FN): “fraction of positive examples correctly predicted by the model”
So, we want a high TPR and a low FPR.

An ROC curve needs only one classifier drawn, but multiple classifiers can be included. Think of M1 as C1, Classifier #1, rather than Model #1 (e.g., your decision tree without pruning).

If an ROC curve passed through these data points, the interpretations would be:
- A (TPR=0, FPR=0): model predicts every instance as negative
- B (TPR=1, FPR=1): model predicts every instance as positive
- C (TPR=1, FPR=0): ideal model, no errors

Two reference cases on the ROC plot:
- Ideal model: as close to the top-left corner (point C) as possible
- Random guessing: along the main diagonal (line R)

ROC curves are useful to compare the performance of two classifiers. In this graph: M1 is better when FPR < 0.36; otherwise, M2 is superior.

Area Under the ROC Curve (AUC) Metric:
- Ideal model: AUC = 1
- Random guessing (along the main diagonal): AUC = 0.5

Generating an ROC Curve
What’s necessary? The classifier needs to produce a continuously-valued output that can be used to rank its predictions, from the instance that is most likely positive to the instance that is least likely positive.
Classifiers that do this: Naïve Bayes, Support Vector Machines, and Logistic Regression, which output probabilities or scores.

Test Instance #  |  Model’s Probability of Instance Being +
       1         |  0.25
       2         |  0.85
       3         |  0.93
       4         |  0.43
       5         |
       6         |  0.53

Generating an ROC Curve
Sort the test records in increasing order of their output values.
Before sorting, model outputs: 0.76, 0.93, 0.95, 0.85, 0.25, 0.87, 0.43, 0.53 (each with its actual class, + or -)
After sorting: 0.25, 0.43, 0.53, 0.76, 0.85, 0.87, 0.93, 0.95

Generating an ROC Curve
Select the lowest-ranked test record. Assign the selected record and those ranked above it to the positive class. Record the TP and FP counts for the current record: TP = 5 and FP = 3, since this is equivalent to classifying all 8 records as + (5 actual positives, 3 actual negatives).

Generating an ROC Curve
Select the next test record. Classify it and those above it as positive; classify those below it as negative. Record the TP and FP counts for the current record: TP drops from 5 to 4, since the record that just moved below the threshold was an actual positive.

Generating an ROC Curve
Repeat for all test records, recording the TP and FP counts at each step.

Generating an ROC Curve
Repeat for all test records, then convert the counts to rates: TPR = TP/5 and FPR = FP/3 at each threshold.
[Table in source: TP counts 5, 4, 3, 2, 1, … with TPR values 1.00, 0.8, 0.6, 0.4, 0.2, …; FP counts and FPR values are computed the same way.]
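
A minimal sketch of this threshold sweep in Python (the function, its names, and the no-tied-scores assumption are mine, not from the slides):

```python
def roc_points(labels, scores):
    """Sweep the threshold from lowest to highest score, as described above.

    labels: 1 for actual positive, 0 for actual negative (assumes no tied scores).
    Returns (FPR, TPR) points from (1, 1) down to (0, 0).
    """
    P = sum(labels)                  # number of actual positives
    N = len(labels) - P              # number of actual negatives
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    tp, fp = P, N                    # threshold below everything: all classified +
    points = [(fp / N, tp / P)]      # (FPR, TPR) = (1, 1)
    for i in order:                  # raise the threshold past each record in turn
        if labels[i] == 1:
            tp -= 1                  # an actual positive is now classified -
        else:
            fp -= 1                  # an actual negative is now classified -
        points.append((fp / N, tp / P))
    return points                    # ends at (0, 0): all classified -
```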

Generating an ROC Curve
Plot the TPR (Y-axis) against the FPR (X-axis).
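
A minimal sketch using scikit-learn to produce the same kind of plot; the scores are the sorted outputs from the slides, but the true labels here are illustrative, since the per-record classes did not survive in the transcript:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

scores = np.array([0.25, 0.43, 0.53, 0.76, 0.85, 0.87, 0.93, 0.95])
labels = np.array([1, 0, 1, 0, 1, 0, 1, 1])  # hypothetical: 5 positives, 3 negatives

fpr, tpr, thresholds = roc_curve(labels, scores)
print("AUC =", roc_auc_score(labels, scores))

plt.plot(fpr, tpr, marker="o", label="classifier")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guessing")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```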

References
- Fundamentals of Machine Learning for Predictive Data Analytics, 1st Edition, Kelleher et al.
- Data Mining and Business Analytics with R, 1st Edition, Ledolter
- http://handsondatascience.com/TextMiningO.pdf
- http://infospace.ischool.syr.edu/2013/04/23/what-is-text-mining/
- http://www.cs.waikato.ac.nz/~ihw/papers/04-IHW-Textmining.pdf