InCoB 2009: A particle swarm based hybrid system for imbalanced medical data sampling. Pengyi Yang, School of Information Technologies.


A particle swarm based hybrid system for imbalanced medical data sampling
Pengyi Yang
School of Information Technologies, The University of Sydney, NSW 2006, Australia
NICTA, Australian Technology Park, Eveleigh NSW 2015, Australia

Road map
- Imbalanced class distribution in medical data
- Sampling
  - Over-sampling
  - Under-sampling
- Converting feature selection techniques into a sampling strategy
- System overview
- Results
- Conclusion

Imbalanced class distribution in medical data
Medical data commonly have an imbalanced class distribution. Why?
- Positive samples are special cases (rare) while negative samples are abundant (or, conversely, only positive samples are collected)
- Data contain subtypes, each with a limited number of samples

Problem
Building a classification model on an imbalanced dataset causes the under-represented class to be overlooked or even ignored. Yet the rare classes often carry important biological implications. The difficulty becomes how to remedy the imbalanced class distribution.

Remedy
- Via sampling (before the model-building process)
  - Over-sampling: increase the sample size of the minority class (could introduce noise and redundancy)
  - Under-sampling: decrease the sample size of the majority class (could remove representative samples)
- Via cost-sensitive learning (within the model-building process)
  - Requires choosing an appropriate cost metric (hard to determine a priori)
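
The two sampling remedies above can be sketched in a few lines. This is a minimal illustration assuming a binary problem with NumPy arrays; the function names, the `majority_label` parameter, and the toy 90/10 data are assumptions made for the example and do not come from the slides.

```python
import numpy as np

def random_undersample(X, y, majority_label, rng):
    """Drop random majority-class rows until both classes have equal size."""
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)
    keep = rng.choice(maj_idx, size=min_idx.size, replace=False)
    idx = np.concatenate([keep, min_idx])
    return X[idx], y[idx]

def random_oversample(X, y, majority_label, rng):
    """Duplicate random minority-class rows until both classes have equal size."""
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)
    extra = rng.choice(min_idx, size=maj_idx.size - min_idx.size, replace=True)
    idx = np.concatenate([maj_idx, min_idx, extra])
    return X[idx], y[idx]

# Toy data (invented for illustration): 90 majority rows, 10 minority rows.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

X_u, y_u = random_undersample(X, y, majority_label=0, rng=rng)
X_o, y_o = random_oversample(X, y, majority_label=0, rng=rng)
```

Under-sampling here discards 80 of the 90 majority rows, which is where representative samples can be lost; over-sampling repeats minority rows, which is where redundancy comes from.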

Current methods
- The most straightforward way: random over-sampling and under-sampling
  - Naive methods, but they work well in many situations
- Clustering and sampling
  - Cluster the dataset and sample according to the characteristics of each cluster
- Synthesizing new examples
  - The most popular is SMOTE, which creates "artificial" samples to increase the size of the minority class
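
To make the idea of synthesizing "artificial" minority samples concrete, here is a simplified SMOTE-style sketch: each new point is placed on the line segment between a minority sample and one of its nearest minority neighbours. It is an assumption-laden illustration (the function name `smote_like`, the brute-force neighbour search, and the toy data are all invented here), not a reference implementation of SMOTE.

```python
import numpy as np

def smote_like(X_min, n_new, k, rng):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (requires k < len(X_min))."""
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))           # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class (invented for illustration): 10 samples, 2 features.
rng = np.random.default_rng(0)
X_min = rng.normal(size=(10, 2))
synthetic = smote_like(X_min, n_new=80, k=3, rng=rng)
```

Because every synthetic point is a convex combination of two real minority samples, it always stays inside the region already spanned by the minority class.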

Our contribution: proposing a novel sampling strategy
- Convert a feature selection technique into a sampling strategy
  - Select a subset of "optimal" samples from the majority class
- Supervised sample selection turns the imbalanced dataset into a balanced dataset

Conceptual representation
[Figure: conceptual diagram. Labels: imbalanced and balanced datasets; majority and minority samples; particle swarm optimization; classifier; train, predict, test set; prediction ranking (high to low).]

Particle swarm optimization
Problem encoding
- Each particle is a subset of samples from the majority class: sample 1, sample 2, …, sample m
- m is the sample size of the majority class
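
One way to read this encoding is as a 0/1 mask of length m, where a 1 means "keep this majority sample". The sketch below pairs that mask with a standard binary PSO update (sigmoid of the velocity gives the probability of a bit being 1). The slides do not specify the update rule, the parameters, or the fitness function, so everything here is an assumption: in particular `fitness` uses a nearest-centroid classifier and balanced accuracy purely as a stand-in, and the toy data are invented.

```python
import numpy as np

def fitness(mask, X_maj, X_min, X_val, y_val):
    """Hypothetical fitness: balanced accuracy of a nearest-centroid classifier
    trained on the selected majority samples (mask == 1) plus all minority samples."""
    if mask.sum() == 0:
        return 0.0
    c_maj = X_maj[mask == 1].mean(axis=0)
    c_min = X_min.mean(axis=0)
    pred = (np.linalg.norm(X_val - c_min, axis=1)
            < np.linalg.norm(X_val - c_maj, axis=1)).astype(int)
    sens = np.mean(pred[y_val == 1] == 1)
    spec = np.mean(pred[y_val == 0] == 0)
    return 0.5 * (sens + spec)

def binary_pso(fit, m, n_particles=20, n_iter=30, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Binary PSO over 0/1 masks of length m; returns the best mask and its fitness."""
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, size=(n_particles, m))   # one row per particle
    v = np.zeros((n_particles, m))
    pbest = x.copy()
    pbest_fit = np.array([fit(p) for p in x])
    g = pbest[np.argmax(pbest_fit)].copy()           # global best mask
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, m))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        # Sigmoid of velocity is the probability that a bit is set to 1.
        x = (rng.random((n_particles, m)) < 1 / (1 + np.exp(-v))).astype(int)
        f = np.array([fit(p) for p in x])
        better = f > pbest_fit
        pbest[better] = x[better]
        pbest_fit[better] = f[better]
        g = pbest[np.argmax(pbest_fit)].copy()
    return g, pbest_fit.max()

# Toy data (invented): 60 majority samples, 10 minority samples, balanced validation set.
rng = np.random.default_rng(1)
X_maj = rng.normal(0, 1, size=(60, 2))
X_min = rng.normal(2, 1, size=(10, 2))
X_val = np.vstack([rng.normal(0, 1, size=(30, 2)), rng.normal(2, 1, size=(30, 2))])
y_val = np.array([0] * 30 + [1] * 30)

best_mask, best_fit = binary_pso(
    lambda p: fitness(p, X_maj, X_min, X_val, y_val), m=len(X_maj))
```

The returned `best_mask` has one entry per majority sample, so `X_maj[best_mask == 1]` is the selected subset that would be combined with the minority class for training.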

Final schema (figure)

Results
(1) PSO achieved better classification results.
(2) Different evaluation metrics can give different indications.

Results (continued)
(3) Different classifiers also perform differently under the same sampling method.

Key observation
The evaluation of a data sampling strategy is confounded by the type of classifier applied and the evaluation metric used. Caution is therefore needed when a conclusion is drawn on the basis of a single type of classifier or a single evaluation metric.
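
A small worked example shows how the choice of metric can reverse a comparison. The confusion-matrix counts below are invented for illustration (a test set with 10 positives and 90 negatives) and are not results from the study; the `metrics` helper is likewise hypothetical.

```python
import math

def metrics(tp, fn, tn, fp):
    """Common evaluation metrics computed from confusion-matrix counts."""
    sens = tp / (tp + fn)                  # recall on the minority (positive) class
    spec = tn / (tn + fp)                  # recall on the majority (negative) class
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": sens,
        "specificity": spec,
        "g_mean": math.sqrt(sens * spec),  # geometric mean of the two recalls
    }

# Hypothetical classifier A: mostly predicts the majority class.
a = metrics(tp=2, fn=8, tn=89, fp=1)
# Hypothetical classifier B: finds most positives at the cost of more false alarms.
b = metrics(tp=8, fn=2, tn=75, fp=15)

for name, m in (("A", a), ("B", b)):
    print(name, {k: round(val, 3) for k, val in m.items()})
```

With these counts, accuracy ranks A above B (0.91 versus 0.83) while the G-mean ranks B well above A, which is the kind of disagreement the observation warns about.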

Conclusion
The study shows that, with proper modification, feature selection techniques can be applied to the sampling of imbalanced data. Applying this technique in the medical domain shows that it can help increase classification accuracy, which is valuable for prediction and decision support systems.

Publication
P. Yang, L. Hsu, B. Zhou, Z. Zhang, A. Zomaya. A particle swarm based hybrid system for imbalanced medical data sampling. Accepted by BMC Genomics.

Questions?