Intelligent Database Systems Lab, 國立雲林科技大學 (National Yunlin University of Science and Technology): Extreme Re-balancing for SVMs: a case study

Presentation transcript:

Intelligent Database Systems Lab, 國立雲林科技大學 (National Yunlin University of Science and Technology)
Extreme Re-balancing for SVMs: a case study
Advisor: Dr. Hsu
Reporter: Wen-Hsiang Hu
Authors: Bhavani Raskutti and Adam Kowalczyk
Source: SIGKDD Explorations

Outline
- Motivation
- Objective
- Related Research
- Support Vector Machines
- Re-balancing of the Data
  - Sample Balancing
  - Weight Balancing
- Experiments
- Discussion
- Conclusion
- Personal Opinion

Motivation
- A standard recipe for two-class discrimination is to take examples from both classes and then generate a model that discriminates between them. However, there are many applications where obtaining examples of a second class is difficult.
  - e.g. classifying sites of "interest" to a web surfer
- There are also situations where the two classes of interest are very unevenly represented in the data.
  - e.g. fraud detection and information filtering

Objective
- Obtain better performance on extremely unbalanced data by using one-class learners.

Related Research (1/2)
- Many solutions have been proposed to address the imbalance problem, including sampling and weighting examples.
  - Typically, these methods focus on cases where the imbalance ratio of minority to majority class is around 10:90.
- In this paper, we focus on extreme imbalance in very high dimensional input spaces, where at the learning stage the minority class consists of around 1-3% of the data.

Related Research (2/2)
- In earlier work on both image retrieval and document classification:
  - One-class models were much worse than two-class models.
- In this paper, we show that for certain problems, such as the gene knock-out experiments for understanding the AHR (aryl hydrocarbon receptor) signalling pathway:
  - Minority one-class SVMs significantly outperform models learnt using examples from both classes.

Support Vector Machines (1/4)
- Given a training sequence (x_i, y_i) of binary n-vectors x_i and bipolar labels y_i ∈ {-1, +1}
- Our aim is to find a "good" discriminating function f.
- The discriminating function is taken to be a kernel machine.
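The kernel machine formula itself is not preserved in this transcript. As a rough illustration only, the sketch below evaluates the generic textbook form f(x) = sum_i c_i k(x_i, x) + b and takes the sign as the bipolar label. The names `kernel_machine`, `predict`, `coeffs` and `bias` are assumptions made for this example, and the hand-picked coefficients are not the result of any training procedure.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def linear_kernel(x: Vector, z: Vector) -> float:
    """Plain dot product, used here only as a stand-in kernel."""
    return float(sum(a * b for a, b in zip(x, z)))


def kernel_machine(
    x: Vector,
    support: Sequence[Vector],
    coeffs: Sequence[float],
    bias: float,
    kernel: Callable[[Vector, Vector], float] = linear_kernel,
) -> float:
    """Evaluate f(x) = sum_i coeffs[i] * k(support[i], x) + bias."""
    return sum(c * kernel(s, x) for c, s in zip(coeffs, support)) + bias


def predict(x, support, coeffs, bias, kernel=linear_kernel) -> int:
    """Bipolar label from the sign of f(x); a score of zero maps to +1 here."""
    return 1 if kernel_machine(x, support, coeffs, bias, kernel) >= 0 else -1


# Toy example: two binary training vectors with hand-picked (assumed) coefficients.
support_vectors = [(1, 0, 1), (0, 1, 0)]
coefficients = [1.0, -1.0]
score = kernel_machine((1, 0, 0), support_vectors, coefficients, bias=0.0)
label = predict((0, 1, 0), support_vectors, coefficients, bias=0.0)
```

In an actual SVM the coefficients and bias come from minimising a regularised risk; here they are fixed by hand purely to show how the expansion is evaluated.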

Support Vector Machines (2/4)

Support Vector Machines (3/4)

Support Vector Machines (4/4)
- If the kernel k satisfies the assumptions of Mercer's theorem [7; 24; 25], then the minimiser of (2) has an explicit form [equation not shown].
- We shall be using the popular polynomial kernel.
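The exact parameterisation of the polynomial kernel is not preserved in this transcript. A minimal sketch, assuming the common form k(x, z) = (x · z + offset)^degree, could look as follows; the function name and the default values of `degree` and `offset` are assumptions for illustration.

```python
def polynomial_kernel(x, z, degree=2, offset=1.0):
    """Polynomial kernel (x . z + offset) ** degree on two equal-length vectors.

    degree=1 gives a linear kernel (shifted by the offset); higher degrees
    give non-linear kernels.
    """
    if degree < 1:
        raise ValueError("degree must be a positive integer")
    if len(x) != len(z):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, z))
    return float((dot + offset) ** degree)


# Two binary vectors sharing two active features.
k_linear = polynomial_kernel((1, 0, 1), (1, 1, 1), degree=1)
k_quadratic = polynomial_kernel((1, 0, 1), (1, 1, 1), degree=2)
```

With binary input vectors the dot product simply counts the features that are active in both vectors, so the kernel value grows with the number of shared features.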

Re-balancing of the Data - Sample Balancing
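Sample balancing changes the class mixture of the training set by resampling. The sketch below assumes the majority-only variant mentioned later in the feature selection experiments: all minority cases are kept and the majority cases are subsampled to a chosen mixture ratio. The function name, the `ratio` convention (majority examples kept per minority example) and the fixed seed are assumptions for this example.

```python
import random


def majority_only_sample(examples, labels, ratio, minority_label=1, seed=0):
    """Keep every minority example and subsample the majority class.

    `ratio` is the number of majority examples to keep per minority example;
    ratio=0 leaves only the minority class (one-class learning).
    """
    if ratio < 0:
        raise ValueError("ratio must be non-negative")
    rng = random.Random(seed)
    pairs = list(zip(examples, labels))
    minority = [p for p in pairs if p[1] == minority_label]
    majority = [p for p in pairs if p[1] != minority_label]
    # Never ask for more majority examples than are available.
    n_keep = min(len(majority), round(ratio * len(minority)))
    mixed = minority + rng.sample(majority, n_keep)
    rng.shuffle(mixed)
    return [x for x, _ in mixed], [y for _, y in mixed]


# Toy data: 5 positive (minority) and 95 negative (majority) examples.
data = list(range(100))
targets = [1] * 5 + [-1] * 95
xs_balanced, ys_balanced = majority_only_sample(data, targets, ratio=1.0)
xs_one_class, ys_one_class = majority_only_sample(data, targets, ratio=0)
```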

Re-balancing of the Data - Weight Balancing
- B is a parameter called the balance factor.
- B = 0 gives the case of "balanced proportions".
- B = +1 represents the case of learning from positive examples only.
- Similarly, B = -1 gives learning from the negative class only.
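The weighting equation itself is not preserved in this transcript. The sketch below is one hypothetical weighting that is consistent with the three cases listed on the slide: per-example costs are scaled so that B = 0 gives both classes the same total weight, B = +1 removes the negative class and B = -1 removes the positive class. The function name, the argument names and the exact formula are assumptions, not the paper's Equation (4).

```python
def balanced_costs(balance, n_pos, n_neg, c=1.0):
    """Return (cost per positive example, cost per negative example).

    Assumed form: c_pos = c * (1 + B) / (2 * n_pos),
                  c_neg = c * (1 - B) / (2 * n_neg).
    """
    if not -1.0 <= balance <= 1.0:
        raise ValueError("balance factor must lie in [-1, +1]")
    if n_pos <= 0 or n_neg <= 0:
        raise ValueError("both classes need at least one example")
    c_pos = c * (1.0 + balance) / (2.0 * n_pos)
    c_neg = c * (1.0 - balance) / (2.0 * n_neg)
    return c_pos, c_neg


# 10 positive and 990 negative examples, i.e. a 1% minority class.
balanced = balanced_costs(0.0, n_pos=10, n_neg=990)
positive_only = balanced_costs(1.0, n_pos=10, n_neg=990)
negative_only = balanced_costs(-1.0, n_pos=10, n_neg=990)
```

Under this assumed form, intermediate values of B shift weight smoothly from one class to the other, which is how the impact of negative examples can be reduced without discarding them outright.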

Experiments - Real World Data Collections
- AHR data set used for task 2 of KDD Cup 2002
  - an aryl hydrocarbon receptor (AHR) data set
  - for cancer research
  - three classes: change, control, nc
- Reuters data
  - documents

Performance Measures
- We have used AROC, the Area under the Receiver Operating Characteristic (ROC) curve, as our main performance measure.
- AROC can be interpreted as the probability that a randomly chosen x_j from the positive class receives a higher score than a randomly chosen x_i from the negative class.
- The trivial uniform random predictor has an AROC of 0.5, while a perfect predictor has an AROC of 1.
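AROC can be computed directly from scores by comparing every negative example with every positive example. The following is a minimal sketch of that pairwise view; counting ties as half a win is a common convention and is assumed here, as is the function name `aroc`.

```python
def aroc(neg_scores, pos_scores):
    """Fraction of (negative, positive) pairs ranked correctly.

    A pair counts 1 when the positive example scores higher, 0.5 on a tie.
    This brute-force version is quadratic in the number of examples.
    """
    if not neg_scores or not pos_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for n in neg_scores:
        for p in pos_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(neg_scores) * len(pos_scores))


perfect = aroc([0.1, 0.2], [0.8, 0.9])
reversed_ranking = aroc([0.8, 0.9], [0.1, 0.2])
partial = aroc([0.1, 0.6], [0.4, 0.9])
```

Because the measure depends only on the ranking of scores and is normalised by the number of pairs, it is not dominated by the majority class, which makes it a natural choice for heavily unbalanced data.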

Experiments with Real World Data
- The sizes of the training:test data splits were:
  - 50%:50% for the Reuters data
  - 70%:30% for the AHR data

Impact of Regularization Constant
[Figure: curves for positive 1-class, balanced 2-class, un-balanced 2-class and negative 1-class models]

Experiments with Sample Balancing

Impact of feature selection (1/2)
- Feature selection methods:
  - DocFreq (document frequency thresholding)
  - ChiSqua (χ²): measures the lack of independence between a feature and a class of interest
  - MutInfo (mutual information)
  - InfGain (information gain): a term goodness measure
- We have used all of the minority cases and sampled the majority cases at different mixture ratios (MajorityOnly sample balancing).
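As an illustration of one of the measures listed above, the sketch below computes the standard χ² statistic for a single feature and a single class from a 2x2 table of document counts. The function and argument names are assumptions, and how the statistic is then thresholded or ranked in the experiments is not specified in this transcript.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for one feature and one class.

    a: documents in the class that contain the feature
    b: documents outside the class that contain the feature
    c: documents in the class that lack the feature
    d: documents outside the class that lack the feature
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        # A constant feature or an empty class carries no information.
        return 0.0
    return n * (a * d - c * b) ** 2 / denominator


independent = chi_square(10, 10, 10, 10)
fully_dependent = chi_square(10, 0, 0, 10)
```

A value of zero means the feature occurs equally often inside and outside the class; larger values indicate a stronger dependence, so features can be ranked by this score.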

Impact of feature selection (2/2)

Experiments with Weight Balancing
- To understand whether the impact of negative examples can be reduced using the balance factor B in Equation (4), two sets of tests were run:
  - Tests on AHR data
  - Tests on Reuters

Tests on AHR data
- B = 0: balanced 2-class
- B = +1: positive 1-class
- B = -1: negative 1-class

Tests on Reuters
[Figure: results for balanced 2-class and positive 1-class models]

Experiments with Synthetic Data
- S1: n_inf = 1; n_noise = 999
- S2: n_inf = 10; n_noise = 990
- S3: n_inf = 1; n_noise = 19
- Two polynomial kernels were used: a linear kernel and a non-linear kernel.
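The transcript gives only the number of informative and noise features for each synthetic setting, not how the data were generated. The generator below is therefore a purely hypothetical sketch: informative features are noisy copies of the class label, noise features are random bits, and the positive class is rare. The function name, `pos_fraction`, `flip` and the seed are all assumptions for illustration.

```python
import random

# Feature counts taken from the slide; everything else below is assumed.
SETTINGS = {
    "S1": {"n_inf": 1, "n_noise": 999},
    "S2": {"n_inf": 10, "n_noise": 990},
    "S3": {"n_inf": 1, "n_noise": 19},
}


def make_synthetic(n_examples, n_inf, n_noise, pos_fraction=0.02, flip=0.1, seed=0):
    """Generate binary vectors with n_inf informative and n_noise noise features."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n_examples):
        y = 1 if rng.random() < pos_fraction else -1
        signal = 1 if y == 1 else 0
        # Each informative bit copies the label, flipped with probability `flip`.
        informative = [
            1 - signal if rng.random() < flip else signal for _ in range(n_inf)
        ]
        noise = [rng.randint(0, 1) for _ in range(n_noise)]
        xs.append(informative + noise)
        ys.append(y)
    return xs, ys


vectors, targets_s3 = make_synthetic(50, **SETTINGS["S3"])
```

Settings such as S1, with one informative feature among 999 noise features, mimic the high dimensional noisy feature spaces discussed in the conclusion.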

Discussion

Conclusion
- The Reuters data set
  - One-class learning provides quite good results, but using both classes always produces better results.
- The AHR data set
  - The positive one-class learners perform significantly better than the two-class learners.
- One-class learning from positive class examples can be a very robust classification technique when dealing with very unbalanced data and a high dimensional, noisy feature space.

Personal Opinion
- Strength
  - many experiments
- Weakness
  - equations are not clear
- Application
  - SVM document classification
  - Image retrieval