Data Mining (and machine learning): ROC curves, Rule Induction, Basics of Text Mining

David Corne, and Nick Taylor, Heriot-Watt University - These slides and related resources:

Two classes is a common and special case

Medical applications: cancer, or not?
Computer Vision applications: landmine, or not?
Security applications: terrorist, or not?
Biotech applications: gene, or not?
…

             Predicted Y      Predicted N
Actually Y   True Positive    False Negative
Actually N   False Positive   True Negative

True Positive: these are ideal. E.g. we correctly detect cancer.
False Positive: to be minimised. These cause false alarms; it can be better to be safe than sorry, but false alarms can be very costly.
False Negative: also to be minimised. Missing a landmine or a cancer is very bad in many applications.
True Negative?

Sensitivity and Specificity: common measures of accuracy in this kind of 2-class task

Sensitivity = TP/(TP+FN): how many of the real ‘Yes’ cases are detected? How well can it detect the condition?
Specificity = TN/(FP+TN): how many of the real ‘No’ cases are correctly classified? How well can it rule out the condition?
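As a minimal sketch of the two formulas above, the following Python computes both measures from confusion-matrix counts. The counts are hypothetical and chosen only for illustration; the function names are not from the slides.

```python
def sensitivity(tp, fn):
    """Fraction of the real 'Yes' cases that are detected: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of the real 'No' cases that are correctly classified: TN / (FP + TN)."""
    return tn / (fp + tn)


# Hypothetical confusion-matrix counts, made up for this illustration.
tp, fn, fp, tn = 15, 1, 6, 6

print(f"Sensitivity: {sensitivity(tp, fn):.1%}")
print(f"Specificity: {specificity(tn, fp):.1%}")
```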

[Figure: YES and NO examples, classified with the line in four different positions]

Sensitivity: 100%, Specificity: 25%
Sensitivity: 93.8%, Specificity: 50%
Sensitivity: 81.3%, Specificity: 83.3%
Sensitivity: 56.3%, Specificity: 100%

100% Sensitivity means: detects all cancer cases (or whatever) but possibly with many false positives
100% Specificity means: misses some cancer cases (or whatever) but no false positives

Sensitivity and Specificity: common measures of accuracy in this kind of 2-class task

Sensitivity = TP/(TP+FN): how many of the real TRUE cases are detected? How sensitive is the classifier to TRUE cases? A highly sensitive test for cancer: if “NO” then you can be sure it’s “NO”.

Specificity = TN/(TN+FP): how sensitive is the classifier to the negative cases? A highly specific test for cancer: if “Y” then you can be sure it’s “Y”.

With many trained classifiers, you can ‘move the line’ in this way. E.g. with NB (Naive Bayes), we could use a threshold indicating how much higher the log likelihood for Y should be than for N.
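The idea of ‘moving the line’ with a threshold can be sketched as follows. This is only an illustration under stated assumptions: the log likelihoods are made-up numbers rather than the output of a trained Naive Bayes model, and `predict` and `sens_spec` are hypothetical helper names.

```python
def predict(log_lik_yes, log_lik_no, threshold=0.0):
    """Predict 'Y' only if the Y log likelihood beats the N one by more than threshold."""
    return "Y" if log_lik_yes - log_lik_no > threshold else "N"


def sens_spec(scored, threshold):
    """scored: list of (log_lik_yes, log_lik_no, actual) with actual in {'Y', 'N'}."""
    tp = fn = fp = tn = 0
    for ll_yes, ll_no, actual in scored:
        pred = predict(ll_yes, ll_no, threshold)
        if actual == "Y":
            tp += pred == "Y"
            fn += pred == "N"
        else:
            fp += pred == "Y"
            tn += pred == "N"
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical log likelihoods for six cases, made up for this illustration.
scored = [
    (-2.0, -5.0, "Y"), (-3.0, -4.0, "Y"), (-4.0, -3.5, "Y"),
    (-3.0, -3.5, "N"), (-5.0, -3.0, "N"), (-6.0, -2.0, "N"),
]

for threshold in (-1.0, 0.0, 1.0):
    sens, spec = sens_spec(scored, threshold)
    print(f"threshold={threshold:+.1f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

A low threshold gives high sensitivity at the cost of specificity, and a high threshold does the reverse. Plotting sensitivity against (1 - specificity) over many thresholds gives the ROC curve.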

ROC curves

Rule Induction

Rules are useful when you want to learn a clear / interpretable classifier, and are less worried about squeezing out as much accuracy as possible.

There are a number of different ways to ‘learn’ rules or rulesets. Before we go there, what is a rule / ruleset?

Rules
IF Condition … Then Class Value is …

Rules are Rectangular

[Figure: YES and NO points, with the region covered by each rule]

IF (X>0)&(X<…)&(Y>0.5)&(Y<5) THEN YES
IF (X>5)&(X<…)&(Y>4.5)&(Y<5.1) THEN NO
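A rule of this kind is just a test of whether a point lies inside an axis-aligned rectangle. The sketch below illustrates that; the bounds passed to `make_rule` are hypothetical values chosen for the example, and `make_rule` itself is an illustrative helper, not something from the slides.

```python
def make_rule(x_min, x_max, y_min, y_max, label):
    """Return a rule that predicts label inside an axis-aligned rectangle."""
    def rule(x, y):
        inside = x_min < x < x_max and y_min < y < y_max
        return label if inside else None  # None: the rule does not fire
    return rule


# Hypothetical bounds, chosen only for illustration.
yes_rule = make_rule(0, 2, 0.5, 5, "YES")

print(yes_rule(1, 3))  # point inside the rectangle
print(yes_rule(3, 3))  # point outside the rectangle
```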

A Ruleset
IF Condition1 … Then Class = A
IF Condition2 … Then Class = A
IF Condition3 … Then Class = B
IF Condition4 … Then Class = C
…

[Figure: YES and NO points] What’s wrong with this ruleset? (two things)

[Figure: YES and NO points] What about this ruleset?

Two ways to interpret a ruleset:

As a Decision List
IF Condition1 … Then Class = A
ELSE IF Condition2 … Then Class = A
ELSE IF Condition3 … Then Class = B
ELSE IF Condition4 … Then Class = C
…
ELSE … predict Background Majority Class

As an unordered set
IF Condition1 … Then Class = A
IF Condition2 … Then Class = A
IF Condition3 … Then Class = B
IF Condition4 … Then Class = C
Check each rule and gather votes for each class
If no winner, predict background majority class
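The two interpretations can give different answers for the same ruleset, as this small sketch shows. The rules, the example and the function names are all hypothetical; treating a tie in votes as "no winner" is an assumption about what the slide intends.

```python
from collections import Counter


def as_decision_list(rules, example, default):
    """First rule whose condition holds decides; otherwise predict the default."""
    for condition, label in rules:
        if condition(example):
            return label
    return default


def as_unordered_set(rules, example, default):
    """Every rule that fires votes for its class; no clear winner means the default."""
    votes = Counter(label for condition, label in rules if condition(example))
    ranked = votes.most_common(2)
    if not ranked or (len(ranked) == 2 and ranked[0][1] == ranked[1][1]):
        return default
    return ranked[0][0]


# Hypothetical rules over a dict with keys "x" and "y".
rules = [
    (lambda e: e["x"] > 5, "A"),
    (lambda e: e["y"] > 5, "A"),
    (lambda e: e["x"] > 3, "B"),
    (lambda e: e["y"] < 1, "C"),
]
example = {"x": 4, "y": 0}

print(as_decision_list(rules, example, default="A"))  # rule 3 fires first
print(as_unordered_set(rules, example, default="A"))  # B and C tie, so the default
```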

Three broad ways to learn rulesets

1. Just build a decision tree with ID3 (or something else) and you can translate the tree into rules!

2. Use any good search/optimisation algorithm. Evolutionary (genetic) algorithms are the most common. You will do this in coursework 3. This means simply guessing a ruleset at random, and then trying mutations and variants, gradually improving them over time.

3. A number of ‘old’ AI algorithms exist that still work well, and/or can be engineered to work with an evolutionary algorithm. The basic idea is: iterated coverage
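Option 2 (guess at random, then keep improving mutations) can be sketched with a very small hill climber. This is not a full evolutionary algorithm and not the coursework method: it evolves a single rectangular rule rather than a ruleset, and the data, mutation size and step count are assumptions made for illustration.

```python
import random


def accuracy(rect, data):
    """Fraction of (x, y, label) points the rule 'inside rect means YES' gets right."""
    x_lo, x_hi, y_lo, y_hi = rect
    hits = 0
    for x, y, label in data:
        inside = x_lo <= x <= x_hi and y_lo <= y <= y_hi
        hits += (label == "YES") == inside
    return hits / len(data)


def evolve_rule(data, steps=2000, seed=0):
    """Guess a rectangle at random, then keep mutations that do not lower accuracy."""
    rng = random.Random(seed)
    best = [rng.uniform(0, 10) for _ in range(4)]
    best_score = accuracy(best, data)
    for _ in range(steps):
        child = list(best)
        child[rng.randrange(4)] += rng.gauss(0, 1)  # mutate one bound
        score = accuracy(child, data)
        if score >= best_score:
            best, best_score = child, score
    return best, best_score


# Hypothetical training data, made up for this illustration.
data = [(1, 1, "YES"), (2, 2, "YES"), (2, 1, "YES"),
        (7, 8, "NO"), (8, 7, "NO"), (9, 9, "NO")]

rect, score = evolve_rule(data)
print(rect, score)
```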

[Figures: YES and NO points, with a rule region growing step by step]

Take each class in turn.
Pick a random member of that class in the training set.
Extend it as much as possible without including another class.
Next class.
And so on…
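The iterated coverage steps above can be sketched as follows. This is one possible reading under stated assumptions, not the slides' exact algorithm: rules are axis-aligned rectangles, and "extend as much as possible" is approximated by greedily absorbing same-class points, nearest first, whenever doing so does not pull in a point of another class.

```python
import random


def covers(rect, p):
    """True if point p lies inside the axis-aligned rectangle rect."""
    return all(lo <= v <= hi for (lo, hi), v in zip(rect, p))


def grow_rule(seed, same, other):
    """Grow a rectangle from seed, never letting it cover a point of another class."""
    rect = [(v, v) for v in seed]
    # Try nearer same-class points first (squared Euclidean distance to the seed).
    for p in sorted(same, key=lambda q: sum((a - b) ** 2 for a, b in zip(q, seed))):
        trial = [(min(lo, v), max(hi, v)) for (lo, hi), v in zip(rect, p)]
        if not any(covers(trial, o) for o in other):
            rect = trial
    return rect


def iterated_coverage(data, rng=None):
    """data: list of (point, label). Returns a list of (rectangle, label) rules."""
    rng = rng or random.Random(0)
    rules = []
    for label in sorted({lab for _, lab in data}):        # take each class in turn
        other = [p for p, lab in data if lab != label]
        uncovered = [p for p, lab in data if lab == label]
        while uncovered:
            seed = rng.choice(uncovered)                  # pick a random member
            rect = grow_rule(seed, uncovered, other)      # extend it
            rules.append((rect, label))
            uncovered = [p for p in uncovered if not covers(rect, p)]
    return rules


# Hypothetical training set, made up for this illustration.
data = [((1, 1), "YES"), ((2, 2), "YES"), ((1, 2), "YES"),
        ((6, 5), "NO"), ((7, 5), "NO")]

for rect, label in iterated_coverage(data):
    print(rect, "THEN", label)
```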

Text as Data: what and why?

“Word Clouds”: word frequency patterns provide useful information
[Figure: Students’ implementation choices for DMML CW1]

… which can be used to predict a class value / category / signal …
[Figure labels: Classify sentiment; Twitter sentiment; ACS Index]
In this case the document(s) are “tweets mentioning our airline over past few hours” and the class value is a satisfaction score, between 0 and 1.

Sentiment map of NYC (york/sentimentmap/): more info from tweets; this time, a “happiness” score.

“similar pages” Based on distances between word frequency patterns

Predicting relationship between two people based on their text messages

Can you predict the class (Desktop, Laptop or LED-TV) from word frequencies of the product description on Amazon?

So, word frequency is important – does this mean that the most frequent words in a text carry the most useful information about its content/category/meaning?

Zipf’s law -- text of Moby Dick --

Frequencies of words from the Corpus of Contemporary American English: 450 million words from fiction books, newspapers, magazines, etc. (esWithoutCollocates.txt)

Rank   Word   Part of speech   Frequency   Dispersion
1      the    a                …           …
2      be     v                …           …
3      and    c                …           …
4      of     i                …           …
5      a      a                …           …
6      in     i                …           …
7      to     t                …           …
8      have   v                …           …
9      to     i                …           …
10     it     p                …           …
11     I      p                …           …
12     that   c                …           …
13     for    i                …           …
14     you    p                …           …
15     he     p                …           …
16     with   i                …           …
17     on     i                …           …
18     do     v                …           …
19     say    v                …           …
20     this   d                …           …
21     they   p                …           …

So, 22,038,615/450,000,000 = 4.9% are ‘the’

From rank 1000: detail (n), method (n), sign (v), somebody (p), magazine (n), hotel (n), soldier (n), reflect (v), heavy (j), sexual (j), cause (n), bag (n), heat (n), fall (n), marriage (n), tough (j)

From rank 4986: kneel (v), vacuum (n), selected (j), dictate (v), stereotype (n), sensor (n), laundry (n), manual (n), pistol (n), naval (j), immigrant (j), plaintiff (n), kid (v), middle-class (j), apology (n), till (i)

Zipf’s law -- text of Moby Dick --

The frequency of a specific word in text X is important, but only if it is not similarly frequent in other texts, in which case it carries little information about X.
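A rank/frequency table like the ones above is easy to produce for any text. The sketch below counts words and lists them by rank; the snippet of text is made up, and a real check of Zipf's law (frequency roughly proportional to 1/rank) would need a long text.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    words = re.findall(r"[a-z']+", text.lower())
    ranked = Counter(words).most_common()
    return [(rank, w, c) for rank, (w, c) in enumerate(ranked, start=1)]


# A made-up snippet; a real check would use a long text such as a whole novel.
text = "the whale and the ship and the sea and the captain"

for rank, word, count in rank_frequency(text):
    print(rank, word, count)
```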

Which leads us to TFIDF

We can do any kind of DMML with text, as soon as we convert text into numbers
- This is usually done with a “TFIDF” encoding
- and almost always done with either TFIDF or a close relation
- TFIDF is basically word frequency, but takes into account ‘background frequency’, so words that are very common have their value reduced.

A one-slide text-mining tutorial

Documents: an essay about sport; an article about politics; another article about politics
Vectors based on word frequencies: (0.1, 0.2, 0, ) (0.4, 0, 0.1, 0...) (0.11, 0.3, 0, )

NOW you can do Clustering, Retrieving similar Documents, Supervised Classification, etc.

One key issue is to choose the right set of words (or other features)

First, a quick illustration to show why word-frequency vectors are useful

How did I get these vectors from these two ‘documents’?

Document 1: “Compilers: lecture 1. This lecture will introduce the concept of lexical analysis, in which the source code is scanned to reveal the basic tokens it contains. For this, we will need the concept of regular expressions (r.e.s).”
Vector: 35, 2, 0

Document 2: “Compilers. The Guardian uses several compilers for its daily cryptic crosswords. One of the most frequently used is Araucaria, and one of the most difficult is Bunthorne.”
Vector: 26, 2, 2

What about these two vectors?

Document 1: 0, 0, 0, 1, 1, 1
Document 2: 1, 1, 1, 0, 0, 0

From this MASTER WORD LIST (ordered): (Crossword, Cryptic, Difficult, Expression, Lexical, Token)

If a document contains ‘crossword’, it gets a 1 in position 1 of the vector, otherwise 0. If it contains ‘lexical’, it gets a 1 in position 5, otherwise 0, and so on.

How similar would the vectors be for two docs about crossword compilers?
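The 0/1 vectors can be reproduced from the two documents and the master word list with a few lines of Python. One assumption is needed: a term counts as present if any word in the document starts with it, so that ‘crosswords’ matches ‘crossword’ and ‘tokens’ matches ‘token’. The function name `binary_vector` is illustrative.

```python
import re

MASTER_WORD_LIST = ["crossword", "cryptic", "difficult", "expression", "lexical", "token"]


def binary_vector(text, terms=MASTER_WORD_LIST):
    """1 in position i if some word in text starts with terms[i], else 0."""
    words = re.findall(r"[a-z]+", text.lower())
    return [int(any(w.startswith(t) for w in words)) for t in terms]


lecture = ("Compilers: lecture 1. This lecture will introduce the concept of lexical "
           "analysis, in which the source code is scanned to reveal the basic tokens "
           "it contains. For this, we will need the concept of regular expressions (r.e.s).")
crosswords = ("Compilers. The Guardian uses several compilers for its daily cryptic "
              "crosswords. One of the most frequently used is Araucaria, and one of "
              "the most difficult is Bunthorne.")

print(binary_vector(lecture))     # [0, 0, 0, 1, 1, 1]
print(binary_vector(crosswords))  # [1, 1, 1, 0, 0, 0]
```

The two vectors share no 1s, even though both documents are about ‘compilers’, which is the point of the illustration.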

Turning a document into a vector

We start with a template for the vector, which needs a master list of terms. A term can be a word, or a number, or anything that appears frequently in documents.

There are almost 200,000 words in English; it would take much too long to process document vectors of that length. Commonly, vectors are made from a small number (50 to 1000) of the most frequently occurring words.

However, the master list usually does not include words from a stoplist, which contains words such as the, and, there, which, etc. … why?
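Building such a master list can be sketched as: count words across the collection, skip stoplist words, and keep the most frequent ones. The stoplist below is a tiny made-up one and the documents are hypothetical; real stoplists are much longer.

```python
import re
from collections import Counter

# Tiny illustrative stoplist; an assumption, not a standard list.
STOPLIST = {"the", "and", "there", "which", "a", "of", "is", "to", "in"}


def master_list(documents, size=50):
    """Return the size most frequent words across documents, excluding the stoplist."""
    counts = Counter()
    for doc in documents:
        words = re.findall(r"[a-z]+", doc.lower())
        counts.update(w for w in words if w not in STOPLIST)
    return [w for w, _ in counts.most_common(size)]


# Hypothetical documents, made up for this illustration.
docs = ["the cat and the dog", "the dog and the bird", "a dog"]
print(master_list(docs, size=2))
```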

The TFIDF Encoding (Term Frequency x Inverse Document Frequency)

A term is a word, or some other frequently occurring item.

Given some term i, and a document j, the term count $n_{i,j}$ is the number of times that term i occurs in document j.

Given a collection of k terms and a set D of documents, the term frequency is:

$tf_{i,j} = \frac{n_{i,j}}{\sum_{l=1}^{k} n_{l,j}}$

that is, the frequency of this word in this doc divided by the total number of words in this doc. Considering only the terms of interest, this is the proportion of document j that is made up from term i.
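As a small sketch of the term frequency definition, the function below divides each term's count by the total count over the terms of interest. The document and term list are made up for illustration, and `term_frequencies` is a hypothetical helper name.

```python
def term_frequencies(doc_words, terms):
    """tf for each term: its count divided by the total count of all terms of interest."""
    counts = [doc_words.count(t) for t in terms]
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


# Hypothetical terms of interest and document.
terms = ["money", "interest", "EU", "designer", "wear"]
doc = "money money interest EU money".split()

print(term_frequencies(doc, terms))
```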

Some made-up data for illustration: TF vectors

money   interest   EU   designer   wear   CATEGORY
…       …          …    …          …      Economics
…       …          …    …          …      Economics
…       …          …    …          …      Fashion
…       …          …    …          …      Fashion

Term frequency is a measure of the importance of this term in this document.

Inverse document frequency (which we see next) is a measure of the discriminatory value of the term in the collection of documents we are looking at. It is a measure of the rarity of this word in this document collection.

E.g. high term frequency for “money” means that money is an important word in a specific document. But high document frequency (low inverse document frequency) for “money”, given a particular set of documents, means that money does not carry much useful information, since it is in many of the documents.

Inverse document frequency of term i is:

$idf_i = \log \frac{|D|}{|C|}$

where D is a master collection of documents, and C is the subset of D whose documents contain term i at least once.

E.g. if we are trying to learn a classifier of news articles into ‘sport’, ‘economics’, etc., D might be a set of 100,000 news articles.

Often, we simply replace idf with ‘background frequencies’ obtained from a corpus such as the CCA corpus.
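A minimal sketch of inverse document frequency, assuming the usual logarithmic form log(|D| / |C|) with a natural log (the base only rescales the values). The four-document collection is made up, and raising an error for a term that appears nowhere is a design choice of this example.

```python
import math


def inverse_document_frequency(term, documents):
    """log(|D| / |C|), where C is the set of documents containing term at least once."""
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        raise ValueError(f"{term!r} does not occur in any document")
    return math.log(len(documents) / containing)


# Hypothetical collection D: each document is a set of words.
D = [
    {"money", "interest", "EU"},
    {"money", "interest"},
    {"money", "designer", "wear"},
    {"money", "wear"},
]

print(inverse_document_frequency("money", D))     # in every document
print(inverse_document_frequency("designer", D))  # in one document out of four
```

A word that appears in every document, like ‘money’ here, gets an idf of zero, so it contributes nothing to a TFIDF vector.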

TFIDF encoding of a document

So, given:
- a background collection of documents (e.g. 100,000 random web pages, all the articles we can find about cancer, 100 student essays submitted as coursework …)
- a specific ordered list (possibly large) of terms

We can encode any document as a vector of TFIDF numbers, where the ith entry in the vector for document j is:

$tfidf_{i,j} = tf_{i,j} \times idf_i$

that is, the term frequency of term i in document j multiplied by the inverse document frequency of term i.
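Putting the pieces together, a document can be encoded as a TFIDF vector roughly as follows. This is a sketch under assumptions: term frequency is taken over the terms of interest only, idf uses a natural log, a term absent from the background collection gets idf 0, and all data is made up.

```python
import math


def tfidf_vector(doc, terms, background):
    """Encode doc (a list of words) as TFIDF numbers, one per term in terms."""
    counts = [doc.count(t) for t in terms]
    total = sum(counts) or 1  # avoid dividing by zero when no term occurs
    vector = []
    for term, count in zip(terms, counts):
        containing = sum(1 for d in background if term in d)
        # Assumption: a term missing from the background collection gets idf 0.
        idf = math.log(len(background) / containing) if containing else 0.0
        vector.append((count / total) * idf)
    return vector


# Hypothetical ordered term list and background collection.
terms = ["money", "interest", "EU", "designer", "wear"]
background = [
    "money interest money EU".split(),
    "money interest EU EU".split(),
    "money designer wear".split(),
    "money wear wear".split(),
]

print(tfidf_vector(background[2], terms, background))
```

In this toy collection ‘money’ occurs in every document, so its entry is reduced to zero, while the rarer ‘designer’ keeps the largest value.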

Some made-up data for illustration: now they are TFIDF vectors, and some have reduced more than others

money   interest   EU   designer   wear   CATEGORY
…       …          …    …          …      Economics
…       …          …    …          …      Economics
…       …          …    …          …      Fashion
…       …          …    …          …      Fashion

Vector representation of documents underpins many areas of automated document analysis, such as:
- automated classification of documents
- clustering and organising document collections
- building maps of the web, and of different web communities
- understanding the interactions between different scientific communities, which in turn will lead to helping with automated WWW-based scientific discovery.

Example / recent work of my PhD student Hamouda Chantar

Three datasets / classification / main issue: Feature Selection

Dataset             Articles in Train / Test   Categories   Distinct words in training set
Al-Jazeera News     1200 / …                   …            …,329
Alwatan             821 / …                    …            …,282
Akhbar-Alkhaleej    1365 / …                   …            …,913

Hamouda’s work Focus on automated classification of an article (e.g. Finance, Economics, Sport, Culture,...) Emphasis on Feature Selection – which words or other features should constitute the vectors, to enable accurate classification?

Example categories: this is the Akhbar-Alkhaleej dataset

Category             Train   Test   Total
International News   …       …      …
Local news           …       …      …
Sport                …       …      …
Economy              …       …      …
Total                …       …      …

We look at 3 pre-classified datasets

Akhbar-Alkhaleej: 5690 Arabic news documents gathered evenly from the online newspaper "Akhbar-Alkhaleej"
Alwatan: 20,291 Arabic news documents gathered from the online newspaper "Alwatan"
Al-Jazeera-News: 1500 documents from the Al-Jazeera news site.

is.gd/arabdata

We look at 3 classification methods (when evaluating feature subsets on the test set)

C4.5: well-known decision tree classifier; we use Weka’s implementation, “J48”
Naive Bayes: It’s Naive, and it’s Bayes
SVM: with a linear kernel

Results: Alwatan dataset

Results on Al Jazeera dataset

Results: Akhbar-Alkhaleej dataset

tara