Evaluation of Decision Forests on Text Categorization

Evaluation of Decision Forests on Text Categorization
Hao Chen, School of Information Mgmt. & Systems, Univ. of California, Berkeley, CA 94720
Tin Kam Ho, Bell Labs, Lucent Technologies, 700 Mountain Avenue, Murray Hill, NJ 07974

Text Categorization
- Text Collection
- Feature Extraction
- Classification
- Evaluation

Text Collection
- Reuters: newswires from Reuters in 1987. Training set: 9603 documents; test set: 3299; categories: 95.
- OHSUMED: abstracts from medical journals. Training set: 12327 documents; test set: 3616; categories: 75 (within the Heart Disease subtree).

Feature Extraction
- Stop word removal: 430 stop words
- Stemming: Porter's stemmer
- Term selection by document frequency: category-independent or category-dependent selection
- Feature values: TF × IDF
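
A minimal sketch of the last two steps, assuming stop-word removal and Porter stemming have already been applied; the token lists, the min_df threshold, and the helper names below are illustrative choices, not the authors' code.

    import math
    from collections import Counter

    def select_terms(docs, min_df=3):
        """Document-frequency term selection: keep terms appearing in at least min_df documents."""
        df = Counter()
        for doc in docs:
            df.update(set(doc))
        return {t for t, n in df.items() if n >= min_df}

    def tfidf_vectors(docs, vocab):
        """Weight each selected term by term frequency times inverse document frequency."""
        n_docs = len(docs)
        df = Counter()
        for doc in docs:
            df.update(set(doc) & vocab)
        vectors = []
        for doc in docs:
            tf = Counter(t for t in doc if t in vocab)
            vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
        return vectors

    # Usage: docs are token lists after stop-word removal and stemming.
    docs = [["share", "price", "rise"], ["price", "fall"], ["share", "dividend", "price"]]
    print(tfidf_vectors(docs, select_terms(docs, min_df=2)))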

Classification
- Method: each document may belong to multiple categories, so each category is treated as a separate binary classification problem.
- Classifiers: kNN (k Nearest Neighbor), C4.5 (Quinlan), Decision Forest
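
A minimal sketch of this per-category binary decomposition; the train_binary and predict interfaces are assumptions for illustration, not part of the paper.

    def train_one_vs_rest(train_vectors, train_labels, categories, train_binary):
        """train_binary(X, y) -> classifier with predict(x) -> bool (assumed interface)."""
        classifiers = {}
        for cat in categories:
            y = [cat in labels for labels in train_labels]  # True if the document carries this category
            classifiers[cat] = train_binary(train_vectors, y)
        return classifiers

    def predict_categories(classifiers, x):
        """Assign every category whose binary classifier answers YES."""
        return {cat for cat, clf in classifiers.items() if clf.predict(x)}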

C4.5
- A method to build decision trees.
- Training: grow the tree by splitting the data set, then prune it back to prevent over-fitting.
- Testing: a test vector goes down the tree and arrives at a leaf, where the probability that the vector belongs to each category is estimated.
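
A minimal sketch of the testing step, with an assumed nested-tuple tree representation (not Quinlan's data structures): the vector descends to a leaf and the class probabilities are the relative label frequencies stored there.

    def leaf_probability(node, x):
        """node is ('leaf', {category: count}) or ('split', feature_index, threshold, left, right)."""
        if node[0] == "leaf":
            counts = node[1]
            total = sum(counts.values())
            return {cat: n / total for cat, n in counts.items()}  # relative frequencies at the leaf
        _, feat, thresh, left, right = node
        return leaf_probability(left if x[feat] <= thresh else right, x)

    # Toy tree over two features:
    tree = ("split", 0, 0.5,
            ("leaf", {"earn": 8, "acq": 2}),
            ("leaf", {"earn": 1, "acq": 9}))
    print(leaf_probability(tree, [0.7, 0.1]))  # -> {'earn': 0.1, 'acq': 0.9}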

Decision Forest
- Consists of many decision trees, combined by averaging the class probability estimates at the leaves.
- Each tree is constructed in a randomly chosen (coordinate) subspace of the feature space.
- An oblique hyperplane is used as the discriminator at each internal node of the trees.
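
A minimal sketch of the combination step only, under simplifying assumptions: each tree is an object with predict_proba(x) -> {category: probability} (an assumed interface), trees are paired with randomly drawn coordinate subsets, and the oblique-hyperplane splits of the actual method are not reproduced here.

    import random
    from collections import defaultdict

    def random_subspaces(n_features, n_trees, subspace_size, seed=0):
        """Draw one random coordinate subset per tree."""
        rng = random.Random(seed)
        return [sorted(rng.sample(range(n_features), subspace_size)) for _ in range(n_trees)]

    def forest_probability(trees, subspaces, x):
        """Average the class-probability estimates of all trees, each applied to its own subspace."""
        totals = defaultdict(float)
        for tree, dims in zip(trees, subspaces):
            for cat, p in tree.predict_proba([x[d] for d in dims]).items():  # assumed per-tree interface
                totals[cat] += p
        return {cat: s / len(trees) for cat, s in totals.items()}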

Why choose these 3 classifiers?
- We do not have a parametric model for the problem (we cannot assume Gaussian distributions, etc.).
- kNN and the decision tree (C4.5) are the most popular nonparametric classifiers; we use them as the baselines for comparison.
- We expect the decision forest to do well, since this is a high-dimensional problem on which it is known to do well from previous studies.

Evaluation
Per-category contingency table:

                   YES is correct   NO is correct
    Assigned YES         a                b
    Assigned NO          c                d

Measurements:
- Precision: p = a / (a+b)
- Recall: r = a / (a+c)
- F1 value: F1 = 2rp / (r+p)

Tradeoff between precision and recall: kNN tends to have higher precision than recall, especially when k becomes larger.
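
A minimal sketch computing these measures from the contingency counts above (the example numbers are made up):

    def prf1(a, b, c):
        """a = assigned YES and correct, b = assigned YES but wrong, c = missed YES."""
        p = a / (a + b) if a + b else 0.0   # precision
        r = a / (a + c) if a + c else 0.0   # recall
        f1 = 2 * r * p / (r + p) if r + p else 0.0
        return p, r, f1

    print(prf1(a=80, b=20, c=40))  # -> (0.8, 0.666..., 0.727...)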

Averaging scores
- Macro-averaging: calculate precision/recall for each category, then average all the per-category values; this assigns equal weight to each category.
- Micro-averaging: sum up the classification decisions of every document, then calculate precision/recall from the summed counts; this assigns equal weight to each document. Micro-averaging was used in the experiment because the number of documents in each category varies considerably.
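
A minimal sketch contrasting the two averages over per-category counts {category: (a, b, c)} as defined on the earlier slide; the category names and counts are invented for illustration.

    def f1(a, b, c):
        p = a / (a + b) if a + b else 0.0
        r = a / (a + c) if a + c else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    def macro_f1(counts):
        """Average the per-category F1 values: every category weighs the same."""
        return sum(f1(*abc) for abc in counts.values()) / len(counts)

    def micro_f1(counts):
        """Sum the raw counts first, then compute F1: every document weighs the same."""
        a = sum(v[0] for v in counts.values())
        b = sum(v[1] for v in counts.values())
        c = sum(v[2] for v in counts.values())
        return f1(a, b, c)

    counts = {"earn": (800, 50, 60), "rare-topic": (2, 1, 7)}
    print(macro_f1(counts), micro_f1(counts))  # the rare category pulls the macro average down far more than the micro average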

Performance in F1 Value

Comparison between Classifiers
- Decision Forest is better than C4.5 and kNN.
- In the category-dependent case, C4.5 is better than kNN.
- In the category-independent case, kNN is better than C4.5.

Category-Dependent vs. Category-Independent Term Selection
- For Decision Forest and C4.5, category-dependent selection is better than category-independent selection.
- But for kNN, category-independent selection is better than category-dependent selection.
- No obvious explanation was found.

Reuters vs. OHSUMED
- All classifiers degrade from Reuters to OHSUMED.
- kNN degrades more sharply (26%) than C4.5 (12%) and DF (12%).

Reuters vs. OHSUMED
OHSUMED is a harder problem because its documents are more evenly distributed across categories. This even distribution hurts kNN's recall more than the other classifiers', because more competing classes fall within the fixed-size neighborhood.

Conclusion
- Decision Forest is substantially better than C4.5 and kNN in text categorization.
- It is difficult to compare with results of other classifiers outside this experiment, because of different ways of splitting the training/test set and different term selection methods.