©2012 Paula Matuszek CSC 9010: Text Mining Applications: Document-Based Techniques Dr. Paula Matuszek

Document Classification
- Document classification: assign documents to pre-defined categories
- Examples
  - Process email into work, personal, junk
  - Process documents from a newsgroup into "interesting", "not interesting", "spam and flames"
  - Process transcripts of bugged phone calls into "relevant" and "irrelevant"
- Issues
  - Real-time?
  - How many categories per document? Flat or hierarchical?
  - Categories defined automatically or by hand?

Document Classification
- Usually
  - relatively few categories
  - well defined; a person could do the task easily
  - categories don't change quickly
- Flat vs. hierarchy
  - Simple classification is into mutually exclusive document collections
  - Richer classification is into a hierarchy with multiple inheritance
    - broader and narrower categories
    - documents can go in more than one place
    - merges into search interfaces such as PubMed

Classification -- Automatic
- Statistical approaches
- A set of "training" documents defines the categories
  - Underlying representation of each document is derived from the text
    - BOW: the features we discussed last time
  - A classification model is trained using machine learning
  - Individual documents are classified by applying the model
- Requires relatively little effort to create categories
- Accuracy is heavily dependent on the training examples
- Typically limited to flat, mutually exclusive categories

Classification: Manual
- Natural language / linguistic techniques
- Categories are defined by people
  - Underlying representation of a document is typically a stream of tokens
  - A category description contains
    - an ontology of terms and relations
    - pattern-matching rules
  - Individual documents are classified by pattern matching
- Defining categories can be very time-consuming
- Typically takes some experimentation to "get it right"
- Can handle much more complex structures

Automatic Classification Framework
Documents → Preprocessing → Feature extraction → Feature filtering → Classification algorithm → Performance measure
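The whole framework can be sketched end to end as a pipeline. This is a minimal sketch assuming scikit-learn (the slides do not prescribe a toolkit), and the documents, labels, and parameter values are toy examples only:

```python
# Sketch of the framework: preprocessing/feature extraction, feature
# filtering, a classifier, and a performance measure (scikit-learn assumed).
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

train_docs = ["cheap meds now", "win money fast", "meeting agenda attached", "project status report"]
train_labels = ["spam", "spam", "work", "work"]

pipeline = Pipeline([
    ("features", TfidfVectorizer(stop_words="english")),  # preprocessing + feature extraction
    ("filter", SelectKBest(chi2, k=5)),                    # feature filtering
    ("classify", MultinomialNB()),                         # classification algorithm
])
pipeline.fit(train_docs, train_labels)

test_docs, test_labels = ["win cheap meds", "status meeting"], ["spam", "work"]
print(accuracy_score(test_labels, pipeline.predict(test_docs)))  # performance measure
```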

Preprocessing
- Preprocessing: transform documents into a representation suitable for the classification task
  - Remove HTML or other tags
  - Remove stop words
  - Perform word stemming (remove suffixes)
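A minimal preprocessing sketch for these three steps, assuming the NLTK package (and its stopwords corpus) is installed; the regular expressions and example sentence are illustrative:

```python
# Strip tags, drop stop words, and stem -- a rough sketch, not a full tokenizer.
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(raw_html):
    text = re.sub(r"<[^>]+>", " ", raw_html)              # remove HTML or other tags
    tokens = re.findall(r"[a-z]+", text.lower())           # crude tokenization
    tokens = [t for t in tokens if t not in stop_words]    # remove stop words
    return [stemmer.stem(t) for t in tokens]                # stemming (remove suffixes)

print(preprocess("<p>The Wildcats were running quickly</p>"))
# ['wildcat', 'run', 'quickli']
```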

Feature Extraction
- The most crucial decision you'll make!
  1. Topic: words, phrases, ...?
  2. Author: stylistic features
  3. Sentiment: adjectives, ...?
  4. Spam: specialized vocabulary
- Features must relate to the categories
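For topic classification, word and phrase features are typically extracted as a bag of words with tf*idf weights. A sketch assuming scikit-learn (other tasks, e.g. authorship, would need different, stylistic features):

```python
# Turn raw documents into tf*idf feature vectors (one sparse row per document).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the Wildcats won the game", "the seminar covers Bayesian methods"]
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the extracted word features
print(X.toarray().round(2))                # their tf*idf weights per document
```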

Feature Filtering
- Feature selection: remove non-informative terms from documents
  => improves classification effectiveness
  => reduces computational complexity
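One common way to filter features is to keep only the terms most strongly associated with the class labels, for example with a chi-squared test. A hedged sketch assuming scikit-learn, with invented toy data:

```python
# Keep the k terms most associated with the labels; drop the rest.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["buy cheap meds", "cheap loans today", "team meeting today", "quarterly report meeting"]
labels = [1, 1, 0, 0]                       # 1 = spam, 0 = work

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

selector = SelectKBest(chi2, k=3).fit(X, labels)
kept = vectorizer.get_feature_names_out()[selector.get_support()]
print(kept)                                  # the 3 most informative terms
```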

Evaluation
- We need to know how well our classification system is performing
  - Recall: % of documents in a class which are correctly classified as that class
    r_i = (correctly classified as i) / (total which are actually i)
  - Precision: % of documents classified into a class which are actually in that class
    p_i = (correctly classified as i) / (total classified as i)

(Diagram: within the corpus, the set of documents classified into the category overlaps the set of documents actually in the category; the overlap is the correctly categorized documents.)

Combined Effectiveness
- Ideally, we want a measure that combines both precision and recall
- F1 = 2pr / (p + r)
- If we accept everything, precision is tiny, so F1 is near 0
- If we accept nothing, recall is 0, so F1 is 0
- For perfect precision and recall, F1 = 1
- If either precision or recall drops, so does F1
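A quick worked check of these formulas (the counts below are invented purely for illustration):

```python
# Recall, precision, and F1 for one class i, from hypothetical counts.
correctly_classified_as_i = 40   # documents of class i that the model labeled i
total_actually_i = 50            # documents that really are class i
total_classified_as_i = 80       # documents the model labeled as class i

recall = correctly_classified_as_i / total_actually_i          # 0.8
precision = correctly_classified_as_i / total_classified_as_i  # 0.5
f1 = 2 * precision * recall / (precision + recall)             # ~0.615
print(recall, precision, round(f1, 3))
```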

Measuring Individual Features
- If we have a large feature set, we may be interested in which features are actually useful
- Informative features: which features give us the biggest separation between two classes
- We can probably omit the least informative features without impacting performance
- Caution: correlation, not causation...

Choice of Evaluation Measure
- For many tasks, F1 gives the best overall measure
  - sorting news stories
  - deciding genre or author
- But it depends on your domain
  - spam filters
  - flagging important ...

Evaluation: Overfitting
- Training a model = predicting the classification of our training set given the data in the set
- Degrees of freedom: with 10 cases and 10 features I can always predict perfectly
- The model may capture chance variations in the set
- This leads to overfitting -- the model is too closely matched to the exact data set it has been given
- More likely with
  - a large number of features
  - small training sets

Evaluation: Training and Test Sets
- To avoid (or at least detect) overfitting we always use separate training and test sets
- The model is trained on one set of examples
- Evaluation measures are calculated on a different set
- The sets should be comparable and each should be representative of the overall corpus
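One way such a split might be made, sketched with scikit-learn (an assumption; the documents and labels below are placeholders). A stratified split keeps the class proportions comparable in both sets:

```python
# Hold out a test set; stratify so both sets reflect the overall class mix.
from sklearn.model_selection import train_test_split

docs = ["d%d" % i for i in range(8)]      # placeholder documents
labels = ["spam", "work"] * 4             # placeholder labels

train_docs, test_docs, train_y, test_y = train_test_split(
    docs, labels, test_size=0.25, stratify=labels, random_state=0)
print(len(train_docs), len(test_docs))    # 6 2
```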

Some classification methods
- Common classification algorithms include
  - nearest neighbor (kNN) methods
  - decision trees
  - naive Bayes classifiers
  - linear classifiers (e.g., SVMs)

K-Nearest-Neighbor Algorithm
- Principle: points (documents) that are close in the space belong to the same class

K-Nearest-Neighbor Algorithm
- Measure the similarity between the test document and each training document
  - count of words shared
  - tf*idf variants
- Select the k nearest neighbors of the test document among the training examples
  - more than 1 neighbor, to avoid the error of a single atypical training example
  - k is typically 3 or 5
- Assign the test document to the class which contains most of the neighbors
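A hedged kNN sketch over tf*idf vectors, assuming scikit-learn; cosine distance is one common similarity choice for text, and the documents and labels are toy examples:

```python
# Classify a test document by majority vote among its 3 nearest training documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

train_docs = ["wildcats win game", "wildcats score late",
              "gene expression study", "protein folding results"]
train_labels = ["sports", "sports", "biology", "biology"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_docs)

knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
knn.fit(X, train_labels)
print(knn.predict(vectorizer.transform(["wildcats game recap"])))  # ['sports']
```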

Analysis of KNN Algorithm
- Advantages:
  - Effective
  - Can handle large, sparse vectors
  - "Training time" is short
  - Can be incremental
- Disadvantages:
  - Classification time is long
  - Difficult to find the optimal value of k

Decision Tree Algorithm
- Decision tree associated with documents:
  - The root node contains all documents
  - Each internal node is a subset of documents, separated according to one attribute
  - Each arc is labeled with a predicate which can be applied to the attribute at its parent
  - Each leaf node is labeled with a class

Example Decision Tree Text no Contains “Villanova” 01 >1 IrrelevantContains Wildcats yes Sports article Academic article General Article

Decision Trees for Text Each node is a single variable -- not useful for very large, very sparse vector such as BOW l Features might include –other document characteristics like diversity –counts for small subset of terms –most frequent –tf*idf –domain-based ontology

Creating a Decision Tree l At each node, choose function which provides maximum separation l If all examples at new node are one class, stop for that node l Recur with each mixed node l Stop when no choice improves separation -- or when you reach predefined level

Analysis of Decision Tree Algorithm
- Advantages:
  - Easy to understand
  - Easy to train
  - Classification is fast
- Disadvantages:
  - Training time is relatively expensive
  - A document is only connected with one branch
  - Once a mistake is made at a higher level, the whole subtree below it is wrong
  - Not suited for very high dimensions

Bayesian Methods l Based on probability l Used widely in probabilistic learning and classification. l Uses prior probability of each category given no information about an item. l Categorization produces a posterior probability distribution over the possible categories given description of item.

Naive Bayes Bayes Theorem says we can determine probability of an event C given another event x based on –the overall probability of event C –the probability of event x given event C P(C|x) = P(x|C) * P(C)/p(x)

Naïve Bayes Algorithm
- Estimate the probability of each class for a document:
  - Compute the posterior probability (Bayes' rule)
  - Assume the words are independent given the class (the "naive" assumption)
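A hedged sketch of this for documents, assuming scikit-learn's multinomial naive Bayes; the training documents and labels are toy examples:

```python
# Naive Bayes over word counts: words are treated as independent given the class.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["cheap meds cheap", "win money now", "meeting agenda", "quarterly report attached"]
train_labels = ["spam", "spam", "work", "work"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_docs)

nb = MultinomialNB().fit(X, train_labels)
test = vectorizer.transform(["cheap money"])
print(nb.predict(test), nb.predict_proba(test).round(2))  # class and posterior probabilities
```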

Analysis of Naïve Bayes Algorithm
- Advantages:
  - Works well on numeric and textual data
  - Easy to implement and compute
  - Has been effective in practice; the typical spam filter, for instance
- Disadvantages:
  - The conditional independence assumption is in fact naive: it is usually violated by real-world data
  - Performs poorly when features are highly correlated

Linear Regression l Classic linear regression: predict the value of some variable based on a weighted sum of other variables l Very common statistical technique for prediction l e.g.: predict college GPA with a weighted sum of SAT verbal and quantitative scores, high school GPA, and a “high school quality” measure

Linear Scoring Methods l Generalization of linear regression to much higher dimensionality l Goal is binary separation of instances into 2 classes l Best known is SVM: support vector machine. –classifier is a separating hyperplane –support vectors are those features which define the plane

Support Vector Machines
- Main idea of SVMs: find the linear separating hyperplane which maximizes the margin, i.e., the optimal separating hyperplane (OSH)
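A hedged linear-SVM sketch over tf*idf features, assuming scikit-learn; LinearSVC fits a separating hyperplane, and a one-vs-rest scheme would handle more than two classes. The documents and labels are toy examples:

```python
# Train a linear SVM on tf*idf vectors and classify a new document.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_docs = ["wildcats win the game", "late goal seals the win",
              "tuition and enrollment rise", "new provost announces curriculum"]
train_labels = ["sports", "sports", "academic", "academic"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_docs)

svm = LinearSVC(C=1.0).fit(X, train_labels)
print(svm.predict(vectorizer.transform(["wildcats seal the win"])))  # ['sports']
```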

SVMS l Advantages: –Handle very large dimensionality –Empirically, have been shown to work well with text classification l Disadvantages –sensitive to noise, such as mislabeled training examples –binary only (but can train multiple SVMs) –implementation is complex: variety of implementation choices (similarity measure, kernel, etc) can require extensive tuning

Summary l Document classification is a common task. Manual rules provide outstanding results and allow complex structures, but very expensive to implement. l Automated methods use labeled cases to train a model –Decision trees and decision rules are easy to understand, but require good feature set tuned to domain –Nearest neighbor simple to implement and quick to train, but slow to classify. Can handle incremental training cases. –Bayes is easy to implement and works well in some domains, but can have problems with highly correlated features –SVMs more complex to implement, but handle very large dimensionality well and have proven to be best choice in many text domains