Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1
Feature selection for text categorization on imbalanced data
Advisor: Dr. Hsu
Presenter: Zih-Hui Lin
Authors: Zhaohui Zheng, Xiaoyun Wu, Rohini Srihari

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 2 Outline
 Motivation
 Objective
 Introduction
 Feature selection framework
 Experimental setup
 Primary results and analysis
 Conclusions
 Future work

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 3 Motivation
 Feature selection using one-sided metrics selects only the features most indicative of membership.
 Feature selection using two-sided metrics implicitly combines the features most indicative of membership (positive features) and of non-membership (negative features) by ignoring the signs of their scores.

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 4 Objective
 We investigate the usefulness of explicit control of that combination within a proposed feature selection framework.
 Our experiments show both the great potential and the actual merits of explicitly combining positive and negative features, in a nearly optimal fashion, on imbalanced data.

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 5 Introduction
 One choice in the feature selection policy is whether to rule out all negative features.
 Some argue for using positive features only; others believe that negative features are numerous, given the imbalanced data set, and quite valuable in practical experience.

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 6 Introduction (cont.)
 Negative features are useful because their presence in a document strongly indicates its non-relevance.
 They help to confidently reject non-relevant documents.
 When deprived of negative features, the performance of all feature selection metrics degrades, which indicates that negative features are essential to high-quality classification.

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 7 Introduction (cont.)
 Neither one-sided nor two-sided metrics themselves allow control of the combination.
 The focus of this paper is to answer the following three questions with empirical evidence:
─ How sub-optimal are two-sided metrics?
─ To what extent can performance be improved by a better combination of positive and negative features?
─ How can the optimal combination be learned in practice?

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 8 Feature selection metrics
 Correlation coefficient (CC) and odds ratio (OR) are one-sided metrics, which select only the features most indicative of membership for a category.
 IG and CHI are two-sided metrics, which consider the features most indicative of either membership (positive features) or non-membership (negative features).

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 9 Feature selection metrics (cont.)
 Example with the term t = 「陳水扁」 (Chen Shui-bian):
─ for the category ci = politics (政治類): t present in a document is a positive feature; t absent is a negative feature;
─ for the category ci = non-politics (非政治類): t present is a negative feature; t absent is a positive feature.

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 10 Feature selection metrics (cont.)
 Information gain (IG)
─ measures the number of bits of information obtained for category prediction by knowing the presence or absence of a term in a document.
 Chi-square (CHI)
─ measures the lack of independence between a term t and a category ci; it can be compared against the chi-square distribution with one degree of freedom to judge extremeness.
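The slide's formulas are not preserved in this transcript; a standard formulation consistent with the descriptions above, using the document counts A, B, C, D from the contingency table on slide 12 with N = A + B + C + D, is:

```latex
IG(t, c_i) = \sum_{c \in \{c_i,\, \bar{c}_i\}} \; \sum_{t' \in \{t,\, \bar{t}\}}
             P(t', c) \, \log \frac{P(t', c)}{P(t')\, P(c)}

\chi^2(t, c_i) = \frac{N \, (AD - BC)^2}{(A+B)\,(C+D)\,(A+C)\,(B+D)}
```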

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 11 Feature selection metrics (cont.)
 Correlation coefficient (CC)
─ the correlation coefficient of a word t with a category ci.
 Odds ratio (OR)
─ the odds of the word occurring in the positive class, normalized by the odds of it occurring in the negative class.
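These formulas are likewise not preserved; standard forms consistent with the descriptions, using the same contingency counts, are as follows (writing OR as a log odds ratio is one common convention and is an assumption here):

```latex
CC(t, c_i) = \frac{\sqrt{N}\,(AD - BC)}{\sqrt{(A+B)\,(C+D)\,(A+C)\,(B+D)}}
\qquad \text{(note } CC^2 = \chi^2 \text{)}

OR(t, c_i) = \log \frac{P(t \mid c_i)\,\bigl(1 - P(t \mid \bar{c}_i)\bigr)}
                       {\bigl(1 - P(t \mid c_i)\bigr)\,P(t \mid \bar{c}_i)}
           \approx \log \frac{AD}{BC}
```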

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 12 Feature selection metrics (cont.)
 IG and CHI are two-sided metrics, whose values are non-negative.
 CC and OR are one-sided metrics, whose positive and negative values correspond to the positive and negative features respectively.
 Contingency table for a term t and a category ci:
                 t present       t absent
   ci            A (positive)    C (negative)
   ¬ci           B (negative)    D (positive)
 We can easily obtain that the sign of a one-sided metric, e.g. CC or OR, is sign(AD - BC).

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 13 Feature selection metrics (cont.)
 A one-sided metric can be converted to a two-sided counterpart by ignoring the sign, while a two-sided metric can be converted to a one-sided counterpart by recovering the sign.
 We propose the two-sided counterpart of OR, namely OR square (ORS), and the one-sided counterpart of IG, namely signed IG (SIG).
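The definitions themselves did not survive the transcript. Assuming OR denotes the log odds ratio given earlier, plausible reconstructions of the two proposed metrics are:

```latex
SIG(t, c_i) = \operatorname{sign}(AD - BC)\; IG(t, c_i)

ORS(t, c_i) = OR(t, c_i)^2
```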

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 14 The imbalanced data problem
 The imbalanced data problem occurs when the training examples are unevenly distributed among different classes.
 There is an overwhelming number of non-relevant training documents, especially when there is a large collection of categories each assigned to a small number of documents.
 This problem presents a particular challenge to classification algorithms, which can achieve high accuracy simply by classifying every example as negative; with 1% relevant documents, for instance, the all-negative classifier reaches 99% accuracy but zero recall.

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 15 The imbalanced data problem (cont.)
 The impact of the imbalanced data problem on standard feature selection can be illustrated as follows, which primarily answers the first question, "how sub-optimal are two-sided metrics?":
─ How to confidently reject the non-relevant documents is important in that case.
─ Given a two-sided metric, the values of positive features are not necessarily comparable with those of negative features.

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 16 The imbalanced data problem (cont.)
─ TP, FP, FN and TN denote true positives, false positives, false negatives and true negatives respectively. Positive features have more effect on TP and FN; negative features have more effect on TN and FP.
─ Feature selection using a two-sided metric combines the positive and negative features so as to optimize accuracy, which is defined to be accuracy = (TP + TN) / (TP + FP + FN + TN).
─ Finally, some performance measures themselves reflect the imbalance: F1, which has been widely used in information retrieval, is F1 = 2TP / (2TP + FP + FN) and, unlike accuracy, ignores TN.

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 17 Feature selection framework – General formulation
 For each category ci, select the l1 terms with the largest F(t; ci) and the remaining terms with the smallest F(t; ci), where:
─ l is the size of the feature set;
─ 0 < l1/l <= 1 is the key parameter of the framework to be set;
─ the scoring function F( . ; ci) should be defined so that the larger F(t; ci) is, the more likely the term t belongs to the category ci.

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 18 Feature selection framework – General formulation (cont.)
 Obviously, one-sided metrics like SIG, CC, and OR can serve as such functions.
─ For the complement category ¬ci we can easily obtain: SIG(t; ¬ci) = -SIG(t; ci); CC(t; ¬ci) = -CC(t; ci); OR(t; ¬ci) = -OR(t; ci).

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 19 Feature selection framework – General formulation (cont.)
 The second step can therefore be rewritten as selecting the l2 terms with the largest F(t; ¬ci).
 In other words, the framework combines
─ the l1 terms with the largest F( . ; ci), and
─ the l2 = l - l1 terms with the smallest F( . ; ci),
as sketched in the code below.
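A minimal Python sketch of these two selection steps follows; the function name, the NumPy-based interface, and the precomputed score array are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def select_features(scores, l, ratio):
    """Explicitly combine positive and negative features for one category.

    scores : 1-D array of one-sided scores F(t; c_i), one entry per term;
             larger values mean the term is more indicative of membership.
    l      : total number of features to keep.
    ratio  : l1 / l, the fraction of the budget given to positive features.
    Returns the indices of the selected terms.
    """
    l1 = int(round(ratio * l))
    order = np.argsort(scores)        # term indices, ascending by score
    positive = order[::-1][:l1]       # l1 terms with the largest F(t; c_i)
    negative = order[:l - l1]         # l - l1 terms with the smallest F(t; c_i)
    return np.concatenate([positive, negative])
```

With ratio = 1 this reduces to purely one-sided selection; intermediate values realize the explicit combination.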

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 20 Feature selection framework – two special cases
 The standard feature selection methods generally fall into one of two groups:
─ select the positive features only, using one-sided metrics, e.g. SIG, CC, and OR; for convenience, we will use CC as the representative of this group.
─ implicitly combine the positive and negative features, using two-sided metrics, e.g. IG, CHI, and ORS; CHI will be chosen to represent this group.

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 21 Feature selection framework – two special cases (cont.)
 The positive subset of Fi is
─ Fi+ = { t in Fi : F(t; ci) > 0 },
 so the size ratio implicitly chosen by a two-sided metric is |Fi+| / |Fi| (cf. Figure 2 below).

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 22 Feature selection framework – Optimization
 The feature selection framework facilitates explicit control of the combination of positive and negative features through the parameter l1/l (the size ratio).
 How should the size ratio be optimized?
─ Ideal scenario: learn, for each category, different models according to different size ratios ranging from 0 to 1, and select the ratio with the best performance.
─ Practical scenario, the one tried in this paper: empirically select, per category, the size ratio with the best performance on the training set.

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 23 Feature selection framework – Optimization (cont.)
 An efficient implementation of the optimization within the framework is as follows (sketched in code below):
─ select l positive features with the greatest F(t; ci), in decreasing order;
─ select l negative features with the smallest F(t; ci), in increasing order;
─ empirically choose the size ratio l1/l such that the feature set constructed by combining the first l1 (0 < l1 < l) positive features and the first l - l1 negative features has the optimal performance.
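The ratio search might look like the following sketch, reusing select_features from the earlier block; train, evaluate, and the grid of candidate ratios are placeholder assumptions, not details from the paper.

```python
import numpy as np

def choose_ratio(scores, l, train, evaluate, grid=np.linspace(0.05, 1.0, 20)):
    """Empirically pick the size ratio l1/l with the best training-set F1.

    train(feature_indices) -> fitted model, evaluate(model) -> F1 on the
    training set; both stand in for whatever classifier and measure are used.
    """
    best_ratio, best_f1 = None, -1.0
    for r in grid:
        f1 = evaluate(train(select_features(scores, l, r)))
        if f1 > best_f1:
            best_ratio, best_f1 = r, f1
    return best_ratio
```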

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 24 Experimental setup – data collection
 We conduct experiments on standard text categorization data using two popular classifiers:
─ naive Bayes (NB) and logistic regression (LR).
 Data collection
─ Reuters-21578 (ModApte split) is used as our data collection;
─ it contains 90 categories, with 7769 training documents and 3019 test documents;
─ after filtering out all numbers, stop-words and words occurring fewer than 3 times, we have 9484 indexing words in the vocabulary;
─ words in document titles are treated the same as words in the document body.

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 25 Experimental setup – Classifiers
 An NB score between a document d and the category ci can be calculated, in standard log-odds form, as:
─ score(d, ci) = log [P(ci) / P(¬ci)] + Σj log [P(fj | ci) / P(fj | ¬ci)],
─ where fj ranges over the features appearing in the document, P(ci) and P(¬ci) are the prior probabilities of relevance and non-relevance respectively, and P(fj | ci) and P(fj | ¬ci) are conditional probabilities estimated with Laplacian smoothing.
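A minimal sketch of this scoring rule, assuming the log-odds form above; the dictionary-based parameters are illustrative, not the paper's data structures.

```python
import math

def nb_score(doc_features, prior_pos, prior_neg, p_pos, p_neg):
    """Log-odds naive Bayes score of a document for one category.

    doc_features : iterable of selected features appearing in the document.
    prior_pos/neg: P(c_i) and P(not c_i).
    p_pos, p_neg : dicts of Laplace-smoothed P(f | c_i) and P(f | not c_i).
    """
    score = math.log(prior_pos / prior_neg)
    for f in doc_features:
        score += math.log(p_pos[f] / p_neg[f])
    return score  # assign d to c_i when the score exceeds a chosen threshold
```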

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 26 Experimental setup – Classifiers (cont.)
 Logistic regression models the conditional probability as:
─ p(y | d) = 1 / (1 + exp(-y wᵀd)).
 The optimization problem for LR is to minimize:
─ Σi log(1 + exp(-yi wᵀdi)) + λ wᵀw,
─ where di is the i-th training example, w is the weight vector, yi ∈ {-1, 1} is the label associated with di, and λ is an appropriately chosen regularization parameter.
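For concreteness, the regularized objective can be written as a short NumPy function; this is a sketch of the stated formula, not the paper's implementation.

```python
import numpy as np

def lr_objective(w, X, y, lam):
    """Regularized logistic loss: sum_i log(1 + exp(-y_i w.x_i)) + lam ||w||^2."""
    margins = y * (X @ w)                        # y_i in {-1, +1}
    # logaddexp(0, -m) computes log(1 + exp(-m)) without overflow
    return np.sum(np.logaddexp(0.0, -margins)) + lam * np.dot(w, w)
```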

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 27 Experimental setup – performance measure
 To measure the performance, we use both precision (p) and recall (r) in their combined form, F1 = 2pr / (p + r). To remain compatible with other results, we report the F1 value at the Break-Even Point (BEP).
 BEP is defined as the point where precision equals recall.
─ In practice it corresponds to the minimum of |FP - FN|.
 We report both micro- and macro-averaged F1 at BEP.
─ The micro-averaged F1 is largely dependent on the most common categories, while the rare categories influence the macro-averaged F1 more.
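A small sketch of micro- versus macro-averaged F1 from per-category contingency counts (names are illustrative): micro-averaging pools the counts across categories before computing F1, while macro-averaging computes F1 per category and then averages.

```python
def f1(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); taken to be 0 when TP = 0."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def micro_macro_f1(counts):
    """counts: list of (TP, FP, FN) tuples, one per category."""
    macro = sum(f1(*c) for c in counts) / len(counts)
    micro = f1(*(sum(col) for col in zip(*counts)))   # pool counts first
    return micro, macro
```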

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 28 Experimental setup – performance measure (slide body not preserved in the transcript)

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 29 Primary results and analysis – ideal scenario
 Table 1: micro-averaged F1 (BEP) values for NB with the feature selection methods at different feature set sizes, over the 58 categories (3rd-60th). The micro-averaged F1 (BEP) for NB without feature selection (using all 9484 features) is 0.641.

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 30 Primary results and analysis – ideal scenario (cont.)
 Table 2: as Table 1, but for macro-averaged F1 (BEP). The macro-averaged F1 (BEP) for NB without feature selection is 0.483.

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 31 Primary results and analysis – ideal scenario (cont.)
 Table 3: micro-averaged F1 (BEP) values for LR with the feature selection methods at different feature set sizes, over the 58 categories. The micro-averaged F1 (BEP) for LR without feature selection is 0.766.

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 32 Primary results and analysis – ideal scenario (cont.)
 Table 4: as Table 3, but for macro-averaged F1 (BEP). The macro-averaged F1 (BEP) for LR without feature selection is 0.676.

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 33 Primary results and analysis – ideal scenario (cont.)
 Figure 2: size ratios implicitly decided by using the two-sided metrics IG, CHI and ORS respectively (58 categories: 3rd-60th, feature size = 50).
 This confirms that feature selection using a two-sided metric is similar to its one-sided counterpart (size ratio = 1) when the feature size is small.

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 34 Primary results and analysis – ideal scenario (cont.)
 Figure 3: BEP F1 on the test set over the first two and last two categories (out of 58) at different l1/l values (NB, F = CC, feature size = 50). The optimal size ratios for money-fx, grain, dmk and lumber are 0.95, 0.95, 0.1 and 0.2 respectively.

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 35 Primary results and analysis – ideal scenario (cont.)
 Figure 4: optimal size ratios of iSIG, iCC and iOR (NB, 58 categories: 3rd-60th, feature size = 50).

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 36 Primary results and analysis – ideal scenario (cont.)
 Figure 5: as Figure 4, but for LR.

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 37 Primary results and analysis – practical scenario (cont.) (slide body not preserved in the transcript)

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 38 Primary results and analysis – Additional results (slide body not preserved in the transcript)

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 39 Primary results and analysis – Additional results (cont.) (slide body not preserved in the transcript)

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 40 Primary results and analysis – Additional results (cont.) (slide body not preserved in the transcript)

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 41 Conclusions
 A novel feature selection framework was presented, in which the positive and negative features are selected separately and combined explicitly.
 We explored three special cases of the framework:
─ 1. consider the positive features only, by using one-sided metrics;
─ 2. implicitly combine the positive and negative features, by using two-sided metrics;
─ 3. explicitly combine the two kinds of features, choosing the size ratio empirically so that optimal performance is obtained.

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 42 Conclusions (cont.)
 Implicitly combining positive and negative features using two-sided metrics is not necessarily optimal, especially on imbalanced data.
 A judicious combination shows great potential and practical merits.
 A good feature selection method should take into consideration the data set, the performance measure, and the classification method.
 Feature selection can significantly improve the performance of both naive Bayes and regularized logistic regression on imbalanced data.