Class Imbalance in Text Classification


1 Class Imbalance in Text Classification
Project ID: 08
Elham Jebalbarezi, Nedjma Ousidhoum

2 Outline
Class Imbalance
Algorithms for Class Imbalance
Text Classification
Feature Selection for Text Classification
Experiments
Results
Discussion

3 The Class Imbalance Problem (1)
A common problem in machine learning: almost all instances belong to one majority class, while the rest belong to the minority class. Imbalance level = |majority class| / |minority class|; it can be huge (on the order of 10^6). Applications: detecting oil spills, text classification, fraud detection, and many medical applications such as automatic diagnosis.

4 The Class Imbalance Problem (2)
Many classification algorithms are sensitive to an imbalanced class distribution, so class imbalance is taken into account in the design of new classifiers. Solutions: cost-sensitive learning, data resampling, feature selection.

5 Cost-Sensitive Algorithms
Penalties are assigned to the mistakes made by the classification algorithm: different, asymmetric misclassification costs are assigned to the classes. The penalty is higher when the mistake is made on the minority class, to emphasize correct classification of minority instances. Cost-sensitive learning does not modify the class distribution.
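A minimal sketch of the idea in scikit-learn, where asymmetric misclassification costs are expressed through the class_weight parameter; the weights and models below are illustrative assumptions, not values from the project.

```python
# Cost-sensitive learning sketch: penalize minority-class errors more heavily
# through per-class weights instead of changing the class distribution.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Assumption: class 1 is the minority class; its errors cost 10x more.
clf = LogisticRegression(class_weight={0: 1, 1: 10})

# "balanced" sets weights inversely proportional to class frequencies.
clf_balanced = LinearSVC(class_weight="balanced")
```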

6 Data Resampling
Learning instances in the majority and minority classes are manipulated in order to balance the class distribution. Effective, but may introduce noise or remove useful information.

7 Data Resampling: Oversampling
Duplicates minority-class instances so that they have more influence on the learning algorithm. Can be effective, but is prone to overfitting. Variants: SMOTE (Synthetic Minority Oversampling Technique), MSMOTE (Modified SMOTE), ...
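A minimal sketch of random oversampling and of the core SMOTE idea (interpolating between a minority example and one of its nearest minority neighbours); this is an illustrative reimplementation with NumPy and scikit-learn, not the code used in the experiments.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def random_oversample(X, y, minority_label, seed=0):
    """Duplicate random minority examples until both classes have equal size."""
    rng = np.random.default_rng(seed)
    minority = np.where(y == minority_label)[0]
    majority = np.where(y != minority_label)[0]
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    return X[idx], y[idx]

def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by interpolating towards a
    random one of the k nearest minority neighbours (the core SMOTE idea)."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neigh = nn.kneighbors(X_min)              # column 0 is the point itself
    base = rng.integers(0, len(X_min), n_new)
    pick = neigh[base, rng.integers(1, k + 1, n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[pick] - X_min[base])
```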

8 Data Resampling: Undersampling
Uses a subset of the majority class to train the classifier. Many majority-class examples are ignored, so the training set becomes more balanced and training becomes faster. Effective, but may discard useful information. There are variants of undersampling, e.g. one-sided undersampling.
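A minimal sketch of random undersampling under the same assumptions as the oversampling sketch above (binary labels, NumPy arrays); one-sided undersampling would additionally filter out borderline or noisy majority examples rather than sampling purely at random.

```python
import numpy as np

def random_undersample(X, y, minority_label, seed=0):
    """Keep all minority examples and an equally sized random subset of the
    majority class, so the training set becomes balanced (and smaller)."""
    rng = np.random.default_rng(seed)
    minority = np.where(y == minority_label)[0]
    majority = np.where(y != minority_label)[0]
    kept_majority = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([minority, kept_majority])
    rng.shuffle(idx)
    return X[idx], y[idx]
```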

9 Bagging/Boosting
Bootstrapping is random sampling with replacement. Bagging aggregates classifiers induced over independently drawn bootstrap samples. Boosting focuses on difficult samples by giving them higher weights.
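A minimal sketch of both ideas with scikit-learn; the base learner and hyperparameters are illustrative assumptions.

```python
# Bagging and boosting sketch with scikit-learn ensemble classifiers.
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging: each tree is trained on an independent bootstrap sample
# (random sampling with replacement) and the predictions are aggregated.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            bootstrap=True, random_state=0)

# Boosting: examples misclassified in earlier rounds receive higher weights,
# so later learners focus on the difficult samples.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)
```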

10 Feature Selection
Feature selection can improve the performance of naive Bayes and regularized logistic regression on imbalanced data. The challenges of feature selection and of imbalanced-data classification meet when the dataset to be analyzed is high-dimensional and has a highly imbalanced class distribution.

11 Text Classification
Sorting natural-language texts or documents into predefined categories based on their content. Applications: automatic indexing, document organization, text filtering, hierarchical categorization of web pages, spam filtering, ... Class imbalance is common in text classification.

12 Feature Selection in Text Classification
Feature selection is common in text classification because it can improve classification performance. Features are selected using different metrics (term frequency, chi-square, information gain) to approach an optimal classification. Positive and/or negative features can be used, and combining positive and negative features might be useful.
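A minimal sketch of metric-based feature selection for text, using term frequency and chi-square as the ranking metrics; the corpus, labels, and value of k are hypothetical placeholders.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy corpus; class 1 is the minority topic of interest.
docs = ["oil spill reported", "quarterly earnings rise", "new oil field found"]
labels = np.array([1, 0, 1])

# Term-frequency selection: keep the k most frequent terms in the corpus.
X_tf = CountVectorizer(max_features=2).fit_transform(docs)

# Chi-square selection: keep the k terms most associated with the class labels.
X_counts = CountVectorizer().fit_transform(docs)
X_chi2 = SelectKBest(chi2, k=2).fit_transform(X_counts, labels)
```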

13 Experiments
We implemented random oversampling, random undersampling, SMOTE, MSMOTE, and one-sided undersampling. Our approach combines feature selection and resampling: we calculate term frequency to select features, then apply a resampling algorithm. Dataset: Reuters. Chosen evaluation metrics: precision = TP / (TP + FP), recall = TP / (TP + FN), F-measure = 2 * precision * recall / (precision + recall).
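A minimal end-to-end sketch of this pipeline (term-frequency feature selection, then resampling, then a standard classifier, evaluated with precision, recall, and F-measure); the toy documents, the naive Bayes classifier, and the scikit-learn components stand in for the actual Reuters setup and are assumptions, not the project's exact code.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import precision_recall_fscore_support
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-ins for the Reuters training/test documents and labels.
train_docs = ["oil spill off the coast", "grain prices fall", "grain exports rise",
              "grain harvest strong", "grain futures steady"]
y_train = np.array([1, 0, 0, 0, 0])            # class 1 (oil) is the minority
test_docs = ["another oil spill", "grain prices rise"]
y_test = np.array([1, 0])

# Step 1: term-frequency feature selection (keep the 100 most frequent terms).
vec = CountVectorizer(max_features=100)
X_train = vec.fit_transform(train_docs)
X_test = vec.transform(test_docs)

# Step 2: resampling -- plain random oversampling of the minority class here;
# SMOTE, MSMOTE, or an undersampling variant would plug into the same slot.
rng = np.random.default_rng(0)
minority = np.where(y_train == 1)[0]
majority = np.where(y_train == 0)[0]
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
idx = np.concatenate([majority, minority, extra])
X_bal, y_bal = X_train[idx], y_train[idx]

# Step 3: train and evaluate with precision = TP/(TP+FP), recall = TP/(TP+FN),
# and F-measure = 2PR/(P+R).
clf = MultinomialNB().fit(X_bal, y_bal)
p, r, f, _ = precision_recall_fscore_support(
    y_test, clf.predict(X_test), pos_label=1, average="binary")
print(f"precision={p:.3f} recall={r:.3f} f-measure={f:.3f}")
```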

14 Experiments: Data

15 Experiments: Random Oversampling

16 Experiments: SMOTE

17 Experiments: MSMOTE

18 Experiments: Random Undersampling

19 Experiments: One-sided Undersampling

20 Results (1): no feature selection
Methods compared: without sampling, random oversampling, random undersampling, one-sided undersampling, SMOTE, MSMOTE.
Precision: 0.0909, 0.0434
Recall: 0.2380
F-measure: 0.2385, 0.1315, 0.0597

21 Results (2): 100 features selected using TF

Method                     Precision   Recall   F-measure
Without sampling           1           0.0476   0.0909
Random oversampling        0.6111      0.5238   0.5641
Random undersampling       0.0884      0.6190   0.1547
One-sided undersampling    0.0851      0.7619   0.1531
SMOTE                      0.5         0.5238   0.5116
MSMOTE                     0.5384      0.3333   0.4117

22 Results (3): 500 features selected using TF
Methods compared: without sampling, random oversampling, random undersampling, one-sided undersampling, SMOTE, MSMOTE.
Precision: 0.0476, 0.1777, 0.0937, 0.1666, 0.4
Recall: 0.2857, 0.5238, 0.3809, 0.5714
F-measure: 0.2424, 0.1411, 0.2528, 0.2318, 0.4705

23 Discussion
Feature selection improves oversampling. Feature selection also improves the recall of undersampling. Adding more features does not always improve the results.

24 Thank you!

