Learning from Imbalanced, Only Positive and Unlabeled Data Yetian Chen 04-29-2009.



Outline
- Introduction and problem statement: the 2008 UC San Diego Data Mining Competition
- Task 1: Supervised learning from imbalanced data sets (over-sampling and under-sampling)
- Task 2: Semi-supervised learning from only positive and unlabeled data (two-step strategy)

Statement of Problems: 2008 UC San Diego Data Mining Competition
Task 1: Standard Binary Classification. A binary classification task involving 20 real-valued features from an experiment in the physical sciences. The training data consist of 40,000 examples, but there are roughly ten times as many negative examples as positive. The test set, however, is evenly distributed between positive and negative examples.
Task 2: Positive-Only Semi-Supervised Task. Also a binary classification task, but most of the training examples are unlabeled; in fact, only a few of the positive examples have labels. The unlabeled set contains both positive and negative examples, with several times as many negatives as positives. This class distribution is reflected in the test set.

Task 1: Learning from Imbalanced Data. Class imbalance is prevalent in many applications: fraud/intrusion detection, risk management, text classification, medical diagnosis/monitoring, etc. Standard classifiers tend to be overwhelmed by the large classes and ignore the small ones; i.e., they tend to produce high predictive accuracy on the majority class but poor predictive accuracy on the minority class.

Solutions to the Class Imbalance Problem
At the data level (re-sampling):
- Over-sampling: increase the number of minority instances by over-sampling them
- Under-sampling: extract a smaller set of majority instances while preserving all the minority instances
At the algorithmic level:
- Cost-sensitive methods: adjust the costs of the various classes so as to counter the class imbalance
- ……
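To make the algorithmic-level idea concrete, here is a minimal sketch of cost-sensitive (class-weighted) logistic regression in NumPy. The weight `w_pos`, the plain gradient-descent loop, and all function names are illustrative assumptions, not part of any competition entry: errors on the minority (positive) class are simply penalized `w_pos` times as heavily as majority errors.

```python
import numpy as np

def weighted_logreg(X, y, w_pos=4.0, lr=0.1, epochs=2000):
    """Cost-sensitive logistic regression sketch: minority-class (y=1)
    errors are weighted w_pos times as heavily as majority errors."""
    X = np.c_[np.asarray(X, float), np.ones(len(X))]  # append a bias column
    y = np.asarray(y, float)
    sw = np.where(y == 1, w_pos, 1.0)                 # per-example cost
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ theta))          # predicted P(y=1 | x)
        grad = X.T @ (sw * (p - y)) / len(y)          # cost-weighted gradient
        theta -= lr * grad
    return theta

def predict(theta, X):
    X = np.c_[np.asarray(X, float), np.ones(len(X))]
    return (X @ theta > 0).astype(int)
```

Raising `w_pos` shifts the decision boundary toward the majority class, trading majority-class accuracy for minority-class recall.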

Over-sampling. SMOTE (Synthetic Minority Over-sampling Technique): the minority class is over-sampled by taking each minority class sample and introducing synthetic examples along the line segments joining it to any/all of its k nearest minority-class neighbors. A simpler alternative is over-sampling by duplicating the minority examples.
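The interpolation step above can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not the original SMOTE code; the brute-force neighbor search and the function name are my own assumptions.

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """SMOTE sketch: generate n_new synthetic minority samples.

    Each synthetic point is formed by picking a random minority sample,
    picking one of its k nearest minority neighbors, and interpolating
    a random point on the line segment between the two."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # brute-force pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # a point is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]           # k nearest minority neighbors
    synthetic = np.empty((n_new, X_min.shape[1]))
    for j in range(n_new):
        i = rng.integers(n)                     # a random minority sample
        nb = X_min[nn[i, rng.integers(k)]]      # one of its k neighbors
        gap = rng.random()                      # position along the segment
        synthetic[j] = X_min[i] + gap * (nb - X_min[i])
    return synthetic
```

Because synthetic points lie between existing minority samples rather than on top of them, SMOTE tends to widen the minority region instead of merely re-weighting it, which is what duplication does.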

Under-sampling. Randomly select a subset from the majority class; the size of the subset is roughly equal to the size of the minority class. After re-sampling, apply standard classifiers to the rebalanced datasets and compare the accuracies: Decision Tree, Naïve Bayes, and Neural Network (one hidden layer).
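Random under-sampling as described above is a one-liner plus bookkeeping; a minimal NumPy sketch (function name and signature are illustrative assumptions):

```python
import numpy as np

def random_undersample(X, y, minority_label=1, rng=None):
    """Keep all minority examples and a random majority subset of equal size."""
    rng = np.random.default_rng(rng)
    X, y = np.asarray(X), np.asarray(y)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    keep = rng.choice(maj_idx, size=len(min_idx), replace=False)
    idx = rng.permutation(np.concatenate([min_idx, keep]))
    return X[idx], y[idx]
```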

Results for Task 1. Accuracy was measured for each classifier (DT, NB, NN) under four settings: regular (no re-sampling), under-sampling (US), over-sampling by duplication (OSbD), and SMOTE. For the Neural Network classifiers, I experimented with different numbers of hidden units (5, 11, 15, 20); 11 gave the best accuracies.

My Ranking: 52nd / 199

Conclusions for Task 1
- For Naïve Bayes classifiers, re-sampling does not improve the accuracy significantly.
- For Decision Tree classifiers, random under-sampling and over-sampling with SMOTE significantly improve the accuracy.
- For the Neural Network, all three re-sampling techniques significantly improve the accuracy.
- The Neural Network classifier with SMOTE over-sampling gives the best accuracy of all classifier/re-sampling combinations.

Task 2: Learning from Only Positive and Unlabeled Data. Positive set: a set P of examples of a class. Unlabeled set: a set U of unlabeled (mixed) examples, containing instances both from P's class and not from it (negative examples). Goal: build a classifier to classify the examples in U and/or future (test) data. Key feature of the problem: there are no labeled negative training data. This problem is called PU-learning.

Examples in Real Life
- Specialized molecular biology databases define a set of positive examples (genes/proteins related to a certain disease or function); there is no information about examples that should not be included, and it is unnatural to build such a set.
- Learning a user's preference for web pages: the user's bookmarks can be considered positive examples, and all other web pages are unlabeled examples.
- Direct marketing: a company's current list of customers serves as the positive examples.
- Text classification: labeling is labor-intensive.

Are Unlabeled Examples Helpful? Suppose the target function is known to be either x1 < 0 or x2 > 0; which one is it? [Figure: positive examples and scattered unlabeled points u in the plane.] The concept is "not learnable" with only positive examples; however, the addition of unlabeled examples makes it learnable.

Two-step Strategy
Step 1: Identify a set of reliable negative examples from the unlabeled set.
- S-EM [Liu et al., 2002] uses a Spy technique
- PEBL [Yu et al., 2002] uses a 1-DNF technique
- Roc-SVM [Li & Liu, 2003] uses the Rocchio algorithm
- …
Step 2: Build a sequence of classifiers by iteratively applying a classification algorithm and then selecting a good classifier.
- S-EM uses the Expectation-Maximization (EM) algorithm, with an error-based classifier selection mechanism
- PEBL uses SVM and returns the classifier at convergence, i.e., there is no classifier selection
- Roc-SVM uses SVM with a heuristic method for selecting the final classifier

[Diagram] Step 1: from the unlabeled set U, identify the reliable negatives RN; the remainder is Q = U − RN. Step 2: using P, RN and Q, build the final classifier iteratively, or use only P and RN to build a classifier.

Step 1: The Spy Technique. Sample a certain percentage of the positive examples and put them into the unlabeled set to act as "spies". Run a classification algorithm assuming all unlabeled examples are negative:
- Through the "spies", we learn how the actual positive examples hidden in the unlabeled set behave.
- Use the Expectation-Maximization (EM) algorithm to assign each unlabeled example a probabilistic class label.
We can then extract reliable negative examples from the unlabeled set more accurately.
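The steps above can be sketched as follows. This is a simplified illustration, not the S-EM algorithm itself: it uses a single Gaussian Naïve Bayes pass in place of the full EM iterations, and thresholds the unlabeled posteriors at the lowest posterior observed among the spies; all function names and the `spy_frac` parameter are assumptions.

```python
import numpy as np

def gaussian_nb_posterior(X_tr, y_tr, X):
    """Minimal Gaussian Naive Bayes: returns P(class 1 | x) for each row of X."""
    X_tr, X = np.asarray(X_tr, float), np.asarray(X, float)
    logp = []
    for c in (0, 1):
        Xc = X_tr[y_tr == c]
        mu, var = Xc.mean(0), Xc.var(0) + 1e-9
        ll = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var).sum(1)
        logp.append(ll + np.log(len(Xc) / len(X_tr)))   # likelihood + log prior
    logp = np.stack(logp, 1)
    logp -= logp.max(1, keepdims=True)                  # stabilize before exp
    p = np.exp(logp)
    return p[:, 1] / p.sum(1)

def spy_reliable_negatives(P, U, spy_frac=0.15, rng=None):
    """Spy-technique sketch: return indices into U of reliable negatives."""
    rng = np.random.default_rng(rng)
    P, U = np.asarray(P, float), np.asarray(U, float)
    n_spy = max(1, int(spy_frac * len(P)))
    spy_idx = rng.choice(len(P), n_spy, replace=False)
    spies = P[spy_idx]
    P_rest = np.delete(P, spy_idx, axis=0)
    # train with U + spies treated as negative, remaining P as positive
    X_tr = np.vstack([P_rest, U, spies])
    y_tr = np.concatenate([np.ones(len(P_rest)), np.zeros(len(U) + n_spy)])
    post_U = gaussian_nb_posterior(X_tr, y_tr, U)
    post_spy = gaussian_nb_posterior(X_tr, y_tr, spies)
    t = post_spy.min()          # spies tell us how true positives score in U
    return np.flatnonzero(post_U < t)
```

Because the spies are genuine positives planted in U, any unlabeled example scoring below every spy is unlikely to be positive, which is exactly what "reliable negative" means.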


Step 2: Building the Final Classifier. Use a Naïve Bayes classifier, with P as the positive class and RN (the reliable negative examples) as the negative class.

Results and Conclusion for Task 2
- Baseline: use P as the positive class and U as the negative class, with SMOTE over-sampling P so that its size is roughly the same as U's. F1 score =
- Two-step algorithm: F1 score =
- The highest score in the competition was F1 = 0.721.
- Conclusion: data with only positive labels and unlabeled examples is learnable with the two-step strategy.
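F1 is the evaluation metric quoted above: the harmonic mean of precision and recall, computed from true positives (tp), false positives (fp), and false negatives (fn). A minimal reference implementation:

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Unlike plain accuracy, F1 ignores true negatives, which makes it a sensible metric when the positive class is rare.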

Future Work. For Task 1, try cost-sensitive methods. For Task 2, within the two-step strategy: Step 1 with 1-DNF or the Rocchio algorithm; Step 2 with SVM.

References
B. Liu, Y. Dai, X. Li, W. S. Lee, and P. S. Yu. Building text classifiers using positive and unlabeled examples. In Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM 2003), pages 179–188, 2003.
B. Liu, W. S. Lee, P. S. Yu, and X. Li. Partially supervised classification of text documents. In Proceedings of the Nineteenth International Conference on Machine Learning (ICML 2002), July 8–12, 2002, Sydney, Australia.
W. S. Lee and B. Liu. Learning with positive and unlabeled examples using weighted logistic regression. In Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), August 21–24, 2003, Washington, DC, USA.
G. H. Nguyen, A. Bouzerdoum, and S. L. Phung. A supervised learning approach for imbalanced data sets. In ICPR 2008, pages 1–4.
N. V. Chawla, N. Japkowicz, and A. Kotcz. Editorial: special issue on learning from imbalanced data sets. SIGKDD Explorations 6(1):1–6, 2004.
N. V. Chawla et al. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.