Granular Machine Learning Methods for Biomedical Data Classification
Yanqing Zhang
Department of Computer Science, Georgia State University, Atlanta, GA 30302-3994

Outline
- Granular Machine Learning
- Granular Support Vector Machines: basic idea, motivation, state of the art
- GSVM-RU (Repetitive Undersampling): highly imbalanced classification (data granules)
- GSVM-RFE (Recursive Feature Elimination): high-dimensional classification (feature granules)
- Conclusions

Granular Machine Learning
- Granular Computing (GrC) is a general computation theory for effectively using granules (classes, clusters, subsets, groups, and intervals) to build efficient computational models for complex applications with huge amounts of data, information, and knowledge.
- IEEE International Conference on Granular Computing (IEEE-GrC2006), Georgia State University, Atlanta, May 10-12, 2006 (Dr. Vapnik: SVM+; Dr. Zadeh: Soft Computing; Dr. Smale: Mathematical Learning; Dr. Lin: GrC; etc.)
- GrC + Machine Learning => Granular Machine Learning.
- Major challenge: granular data inputs => granular ML => granular data outputs.

Granular Machine Learning (cont.)
Our work on GML:
- Granular Support Vector Machines, Tang and Zhang (2004-).
- Granular Kernel Machines, Jin and Zhang (2005-).
- Granular Neural Networks, Zhang and Reyaz (2000-).
Major applications:
- binary biomedical data classification (cancer, etc.);
- protein secondary structure prediction;
- highly imbalanced biomedical data classification.
Main goal: design GML methods that intelligently map granular data inputs to crisp/granular data outputs, and then effectively make correct decisions on the data space and the feature space.

A Major Challenge (Optimal Data Granulation and Optimal Feature Granulation)

         Feature 1  ...  Feature m-1  Feature m  Class
Data 1                                            +1
Data 2
Data 3                                            +1
...
Data n

Outline
- Granular Machine Learning
- Granular Support Vector Machines: basic idea, motivation, state of the art
- GSVM-RU (Repetitive Undersampling): highly imbalanced classification (data granules)
- GSVM-RFE (Recursive Feature Elimination): high-dimensional classification (feature granules)
- Conclusions

Binary Classification
- Data Mining
  - Predictive data modeling
    - Classification: binary classification, multi-class classification
    - Regression
  - Descriptive data modeling

Statistical Learning – Support Vector Machines (Vapnik, 1995)
Principles:
- SRM (structural risk minimization) principle
- kernel functions
Challenges for SVMs:
- non-i.i.d. data
- noisy data
- high dimensionality
- imbalanced classes
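For reference, the decision function of a kernel SVM that everything below builds on has the standard form, where the alpha_i are the learned Lagrange multipliers, the x_i with alpha_i > 0 are the support vectors, and K is the kernel:

```latex
f(\mathbf{x}) = \operatorname{sign}\!\left( \sum_{i=1}^{n} \alpha_i \, y_i \, K(\mathbf{x}_i, \mathbf{x}) + b \right)
```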

Granular Computing
Granulation:
- divide-and-conquer for a huge, complicated problem;
- decomposes it into a sequence of similar small tasks.
Knowledge-oriented:
- makes the mining algorithms more effective and/or more efficient.

Granular SVM – Basic Idea
Learning:
- Divide (granulation): subspace-based or subset-based; granules may overlap; one granule may already be the best.
- Conquer: any classification model can be plugged in; here we pick SVMs.
- Aggregation: data fusion, information fusion, decision fusion, or knowledge fusion.
Prediction then uses the aggregated model.

Initiatives – GSVM – Efficiency
- Fast: it is usually more efficient to solve a sequence of subtasks than to solve the original task as a whole.
- Scalable: for massive data, modeling in different granules is easy to parallelize for HPC.

Initiatives – GSVM – Interpretability
- The decision process is easy to understand.
- SVMs and NNs are "black boxes".
- GSVM can extract a few rules or cases from each smaller granule for RBR (rule-based reasoning) or CBR (case-based reasoning).

Initiatives – GSVM – Effectiveness (1)
A hybrid model combines SVMs with other GrC-based models:
- clustering, decision trees, and association rules split the whole feature space into a set of subspaces;
- sampling, bagging, and boosting split the whole dataset into a sequence of subsets;
- new prior-knowledge-based granulation methods.
A hybrid model can combine the powers of multiple models for more reliable prediction.

Initiatives – GSVM – Effectiveness (2)
- If A is helpful to correctly classify B, then A and B should be in the same granule.
- If C is noise that confuses a classifier on B's classification, then C and B should be in different granules.
- Then effectiveness can be improved.


Outline
- Granular Machine Learning
- Granular Support Vector Machines: basic idea, motivation, state of the art
- GSVM-RU (Repetitive Undersampling): highly imbalanced classification (data granules)
- GSVM-RFE (Recursive Feature Elimination): high-dimensional classification (feature granules)
- Conclusions

Case Study: Highly Imbalanced Classification
- Highly skewed class distribution (100:1 or even more).
- Imbalance is ubiquitous.
- The primary interest is to find the rare samples.

Effect of a Highly Imbalanced Distribution
The majority class pushes the "ideal" decision boundary toward the minority class (Wu, et al. 2003).

GSVM-RU: Repetitive Undersampling (Tang and Zhang, 2005, IEEE-GrC 2005)

GSVM-RU for Imbalanced Classification
Target:
- minimize the negative effect of information loss;
- maximize the positive effect of data cleaning.
Assumptions:
- the boundary is pushed toward the minority class;
- a single SVM is able to extract a part of the informative samples.

Granulation (Divide): Repetitive Undersampling with SVMs

Aggregation (Conquer): Discard Old Information Granules

Aggregation (Conquer): Combine All Information Granules

University of Georgia 24 TR(i) SVMu(i) NLSV(i) INFO(i) SVMc(i) Accuracy is improved? TR(i)=TR(i-1)-NLSV(i-1) Y N End Output INFO(i-1), SVMc(i-1) INFO(i-1) TR(1) is the original training dataset INFO(0) is the set of all positive samples in TR(1) NLSV(i) is the set of negative samples which are SVs of SVMu(i)

Effectiveness analysis: yeast data
- 1484 samples, 51 positive (3.44%); the "discard" operation is used.
- 7-6-fold double cross-validation, G-means metric:
  GSVM-RU with "discard" aggregation: 84.2±0.7
  KBA: 82.2±7.1
  RBF-SVM: 59.0±...
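For reference, G-means here is the standard geometric mean of the per-class accuracies, so a degenerate classifier that ignores the minority class scores 0:

```latex
\text{G-means} = \sqrt{\text{sensitivity} \times \text{specificity}}
```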

Effectiveness analysis: abalone data
- 8 features, 4177 samples, 32 positive (0.77%); the "combine" operation is used.
- 7-6-fold double cross-validation, G-means metric:
  GSVM-RU with "combine" aggregation: 73.4±1.6
  KBA: 57.8±5.4
  RBF-SVM: 0.0±0.0

Effectiveness analysis: DMC05 online shopping behavior prediction
- 70 features; ... training samples, 1746 positive (5.82%); ... testing samples.
- GSVM-RU ranks 19th overall (1st in the US) out of 147.

Effectiveness analysis: KDDCUP04 protein homology prediction
- 74 features; ... training samples, 1296 positive (0.89%); ... testing samples.
- GSVM-RU now ranks 2nd out of 107 overall; it ranked 1st before.
- "Discard" aggregation is used; the 1st, 1st, 7th, and 10th granules are used.

Efficiency analysis
By comparison, KBA needs even more time than a standard SVM [Wu, et al. 2003].

GSVM-RU Summary
- GSVM-RU is efficient due to undersampling.
- GSVM-RU is effective due to the retention of informative samples and the elimination of large quantities of redundant or even noisy samples.
- The improvement in effectiveness seems more significant when the degree of imbalance is higher: abalone dataset (0.77%), KDDCUP 2004 protein homology prediction (0.89%).
Future work:
- GSVM-RU + SMOTE [Chawla, et al. 2002]
- parallelization

Outline
- Granular Machine Learning
- Granular Support Vector Machines: basic idea, motivation, state of the art
- GSVM-RU (Repetitive Undersampling): highly imbalanced classification (data granules)
- GSVM-RFE (Recursive Feature Elimination): high-dimensional classification (feature granules)
- Conclusions

Case Study: Gene Selection and Cancer Classification on Microarray Expression Data
- Extremely high dimensionality: the AML/ALL leukemia dataset is 72 samples × 7129 genes, with no more than 10% relevant genes (Golub, et al. 1999).
- Gene selection (a feature-selection problem) enables accurate classification and is helpful for cancer study.
- Data are non-i.i.d., imbalanced, and noisy.
Tasks: gene subset selection and cancer classification.

Gene ranking
- Informative genes: really cancer-related.
- Redundant genes: cancer-related, but other informative genes function similarly and more significantly for cancer classification.
- Irrelevant genes: not cancer-related; their existence does not affect cancer classification.
- Noisy genes: not cancer-related, but they have negative effects on cancer classification.

GSVM-RFE (Tang, et al., IEEE BIBE 2005)
Extract multiple cancer-related gene subsets for:
- reliable cancer classification;
- constructing gene regulation networks.

Fine Granulation with Fuzzy C-Means Clustering
- Cluster genes in the space of the training samples.
- Genes with similar expression patterns have similar functions.
- A gene may have multiple functions, so soft (fuzzy) membership lets one gene belong to several clusters.
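A compact NumPy sketch of the fuzzy c-means step, clustering genes represented by their expression profiles across the training samples; the fuzzifier m, the iteration budget, and the membership threshold are assumed values, not taken from the slides:

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """X: genes x training-samples matrix. Returns (centers, U), U of shape (n_genes, c)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)            # memberships sum to 1 per gene
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # distance of every gene to every cluster center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U_new = d ** (-2.0 / (m - 1.0))          # standard FCM membership update
        U_new /= U_new.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            return centers, U_new
        U = U_new
    return centers, U

# Soft membership is what lets one gene (one row) join several clusters, e.g.:
# clusters = [np.where(U[:, k] > 0.3)[0] for k in range(c)]
```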

Conquer with SVM-based Ranking (Guyon, et al. 2002)
Within each cluster, lower-ranked genes are removed as redundant genes.
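SVM-RFE (Guyon, et al. 2002) iteratively trains a linear SVM and removes the genes with the smallest squared weights. A hedged per-cluster sketch using scikit-learn's RFE; n_keep is an assumed parameter:

```python
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

def rank_genes_in_cluster(X_cluster, y, n_keep=10):
    """X_cluster: samples x (genes in one FCM cluster); y: class labels."""
    svm = SVC(kernel="linear")            # linear kernel so the weight vector w exists
    rfe = RFE(estimator=svm, n_features_to_select=n_keep, step=1)
    rfe.fit(X_cluster, y)
    return rfe.support_, rfe.ranking_     # kept-gene mask, full elimination ranking
```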

Aggregation with Data Fusion
- Pick genes from the different clusters in balance.
- An informative gene is then more likely to survive.
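One plausible reading of "in balance" is a round-robin pick over the per-cluster rankings, so every cluster contributes its best survivors before any cluster contributes its second-best; this is an illustrative assumption, not the authors' exact fusion rule:

```python
def aggregate_in_balance(cluster_rankings, n_total):
    """cluster_rankings: list of gene-id lists, best first. Round-robin merge."""
    picked, seen = [], set()
    depth, max_depth = 0, max(len(r) for r in cluster_rankings)
    while len(picked) < n_total and depth < max_depth:
        for ranking in cluster_rankings:
            if depth < len(ranking) and ranking[depth] not in seen:
                seen.add(ranking[depth])
                picked.append(ranking[depth])
                if len(picked) == n_total:
                    return picked
        depth += 1
    return picked
```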

Flexibility
- In a gene regulation network, many different gene subsets may regulate cancers in different ways.
- Different runs of GSVM-RFE extract different gene subsets.
- Moreover, genes that survive in multiple subsets deserve higher priority for biological study.

The GSVM-RFE pipeline:
1. Original gene set -> relevance-index-based pre-filtering -> relevant gene set.
2. Fuzzy C-means clustering -> gene clusters 1, 2, ..., K.
3. SVM-based gene elimination within each cluster -> survived gene set.
4. If the number of surviving genes is still greater than Nt, apply SVM-based gene elimination to the survived set again; otherwise output the final gene set.

Empirical Studies
Compared algorithms:
- S2N correlation-based algorithm (Furey, et al. 2000)
- SVM-RFE algorithm (Guyon, et al. 2002)
- GSVM-RFE algorithm

Evaluation metrics
- Accuracy
- Sensitivity
- Specificity
- Area under the ROC curve (Bradley, 1997)
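A small scikit-learn sketch of these four metrics for a binary task; y_score stands for a continuous decision value (for example, the output of SVC.decision_function):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

def evaluate(y_true, y_pred, y_score):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy":    accuracy_score(y_true, y_pred),
        "sensitivity": tp / (tp + fn),                  # true positive rate
        "specificity": tn / (tn + fp),                  # true negative rate
        "AUC":         roc_auc_score(y_true, y_score),  # area under ROC curve
    }
```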

Prostate cancer dataset
- High dimensionality.
- Non-i.i.d. (so cross-validation or a random split is not suitable): the sets were prepared under different biological experimental contexts.
- Imbalanced: 4761 P and 110 N if thresholds of 0.5 are used for both RI metrics.

Results: statistical analysis, prostate cancer dataset

Biological Literature Verification: prostate cancer dataset
- 100% leave-one-out validation accuracy on the training dataset.
- 100% prediction accuracy on the testing dataset.

AML/ALL leukemia dataset
- Non-i.i.d. (so cross-validation or a random split is not suitable).
- The two datasets were prepared under different biological experimental conditions.

Results: statistical analysis, AML/ALL leukemia dataset

Biological Literature Verification: AML/ALL leukemia dataset
- 100% leave-one-out validation accuracy on the training dataset.
- 100% prediction accuracy on the testing dataset.

GSVM-RFE Summary (1)
Relevance-index filtering:
- removes most of the irrelevant genes;
- decreases the noisy effect;
- selects genes in balance.
FCM clustering:
- groups genes with similar functions into clusters;
- can assign a gene to multiple clusters;
- extracts multiple informative gene subsets.
SVM ranking:
- removes lower-ranked redundant genes in each cluster.

GSVM-RFE Summary (2)
Reliable cancer classification:
- granulation yields multiple compact "perfect gene subsets";
- using several subsets counters the selection bias present in each single gene subset.
Strong decision support for further cancer study:
- potentially helpful for constructing gene regulation networks.

Outline
- Granular Machine Learning
- Granular Support Vector Machines: basic idea, motivation, state of the art
- GSVM-RU (Repetitive Undersampling): highly imbalanced classification (data granules)
- GSVM-RFE (Recursive Feature Elimination): high-dimensional classification (feature granules)
- Conclusions

Conclusions
- Granular Machine Learning methods can be used to improve the performance of biomedical data classification.
- Data granulation and feature granulation are important for effective biomedical data classification.
Future work:
(1) Data/feature domain granulation optimization: find optimal (or near-optimal) data/feature granules.
(2) Design relevant GML methods.
(3) Biomedical data multi-classification.

References
Y.C. Tang, Y.-Q. Zhang and Z. Huang, "Development of Two-Stage SVM-RFE Gene Selection Strategy for Microarray Expression Data Analysis," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 4, no. 3, July-September 2007.
Y.C. Tang and Y.-Q. Zhang, "Granular SVM with Repetitive Undersampling for Highly Imbalanced Protein Homology Prediction," Proc. of the 2006 IEEE International Conference on Granular Computing (IEEE-GrC2006), Atlanta, May 10-12, 2006.
Y.C. He, Y.C. Tang, Y.-Q. Zhang and R. Sunderraman, "Mining Fuzzy Association Rules from Microarray Gene Expression Data for Leukemia Classification," Proc. of the 2006 IEEE International Conference on Granular Computing (IEEE-GrC2006), Atlanta, May 10-12, 2006.
Y.C. Tang, Y.C. He, Y.-Q. Zhang, Z. Huang, X.H. T. Hu and R. Sunderraman, "A Hybrid CI-Based Knowledge Discovery System on Microarray Gene Expression Data," Proc. of the 2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB2005), San Diego, Nov. 2005.

References (cont.)
Y.C. Tang, Y.-Q. Zhang and Z. Huang, "FCM-SVM-RFE Gene Feature Selection Algorithm for Leukemia Classification from Microarray Gene Expression Data," Proc. of FUZZ-IEEE 2005, Reno, May 22-25, 2005.
Y.C. Tang and Y.-Q. Zhang, "Granular Support Vector Machines with Data Cleaning for Fast and Accurate Biomedical Binary Classification," Proc. of IEEE-GrC 2005, Beijing, July 25-27, 2005.
Y.C. Tang, Y.-Q. Zhang, Z. Huang and X.H. T. Hu, "Granular SVM-RFE Gene Selection Algorithm for Reliable Prostate Cancer Classification on Microarray Expression Data," Proc. of the Fifth IEEE Symposium on Bioinformatics & Bioengineering (BIBE 2005), Minneapolis, Oct. 2005.
G. Wu and E. Y. Chang, "KBA: Kernel Boundary Alignment Considering Imbalanced Data Distribution," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 6, June 2005.

Acknowledgments
- Yuchun Tang was supported by a Molecular Basis for Disease (MBD) Doctoral Fellowship, Georgia State University.
- This work was supported in part by NIH under P20 GM.
- Thanks to Professor Ying Xu and Dr. Huiling Chen!
- Thanks, everyone!

Questions?