A hybrid method for gene selection in microarray datasets
Yungho Leu, Chien-Pan Lee and Ai-Chen Chang
National Taiwan University of Science and Technology
2014/10/22
Outline
Microarray Datasets & Research Objective
Related work & Background
Research method
Experimental result
Conclusion
Microarray datasets
Microarray technology can be used to measure the expression levels of thousands of genes at the same time. A microarray dataset records the gene expression levels of different samples in a table.
Mobile Computing & Data Mining Lab.
Microarray datasets
N: number of samples (40~200)
M: number of genes (2,000~30,000), with M >> N
g_ij: expression level of gene j in sample i
Class label: the class label of the sample (e.g., 0: absent, 1: present)
Example: the Prostate cancer dataset (simplified) is an N-by-M table of expression levels, one row per sample, with a class-label column.
Research objective
M >> N poses challenges for diagnosis (i.e., classification).
Objective: to select a minimal subset of genes with a high classification accuracy rate — a gene selection problem.
Related work
Ding & Peng used the Pearson correlation coefficient to eliminate redundant genes from microarray datasets: "Minimum redundancy feature selection from microarray gene expression data" (2003 & 2005).
Yang et al. proposed using information gain and genetic algorithms for gene selection: "IG-GA: A Hybrid Filter/Wrapper Method for Feature Selection of Microarray Data" (2010).
Related work
Luo et al. clustered genes into groups and treated genes in the same group as redundant genes: "Improving the Computational Efficiency of Recursive Cluster Elimination" (2011).
Background knowledge
Information Gain: proposed by Quinlan as a basis for attribute selection in decision trees. Attributes with larger information gain are better for classification, i.e., for differentiating between the class labels of data samples.
Ecological correlation (Robinson)
Divide the dataset into groups and use the means of the groups to calculate the Pearson correlation coefficient. Averaging reduces the within-group variance and thereby increases the correlation coefficient between attributes.
Example
In the Leukemia1 dataset, samples are grouped by class label (0, 1, 2). For each gene, compute the mean expression in each class; the ecological correlation between gene1 and gene2 is then
Cor(gene1{μ0, μ1, μ2}, gene2{μ0, μ1, μ2}).
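The ecological correlation above can be sketched as follows: replace each gene's values by its per-class means, then correlate the means. This is a minimal illustration, not the authors' implementation; the gene values and labels in the usage example are made up.

```python
from statistics import mean

def pearson(x, y):
    # Plain Pearson correlation coefficient between two equal-length sequences.
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def ecological_correlation(g1, g2, labels):
    # Replace each gene's expression values by its per-class means,
    # then compute the Pearson correlation of the two mean vectors.
    classes = sorted(set(labels))
    m1 = [mean(v for v, c in zip(g1, labels) if c == k) for k in classes]
    m2 = [mean(v for v, c in zip(g2, labels) if c == k) for k in classes]
    return pearson(m1, m2)
```

Because within-class noise is averaged out, the ecological correlation is typically larger in magnitude than the raw gene-to-gene correlation.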
Support Vector Machine
A classification method by Cortes & Vapnik (1995): find a good hyper-plane that separates samples with different class labels. The samples closest to the hyper-plane are the support vectors, and the distance between the margin boundaries is the margin. If |a1 − a2| > |b1 − b2|, hyper-plane a has the larger margin and is better than hyper-plane b.
Research method
Data preprocessing - Normalization
Normalize the dataset using the Z-score. The Z-score of gene expression X_ij is
Z_ij = (X_ij − X̄_j) / S_j
where
- X_ij: the expression of gene j on sample i,
- X̄_j: the mean of gene j's expression over the samples,
- S_j: the standard deviation of gene j's expression over the samples.
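The Z-score step can be sketched as below, normalizing one gene's expression values across all samples (a minimal stdlib-only sketch, not the authors' code):

```python
from statistics import mean, stdev

def z_score_normalize(column):
    # Z-score one gene's expression values across samples:
    # subtract the gene's mean, divide by its standard deviation.
    m, s = mean(column), stdev(column)
    return [(x - m) / s for x in column]
```

After normalization every gene has mean 0 and standard deviation 1, so genes with very different raw scales become comparable.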
Gene filtering by information gain
Gene filtering
Most of the genes have an IG value equal to 0. Select the genes with IG greater than 0 as candidate genes. For example, the Leukemia1 dataset has 5,327 genes; only 263 genes are left after filtering by IG.
Grouping of genes
Grouping of genes
Build the list of candidate genes and set the correlation threshold to 0.8 (strongly positively correlated).
Grouping method: with the first gene on the list as the basis, group each of the remaining genes with the basis gene if their correlation coefficient is greater than 0.8.
Eliminate the genes of the new group from the list; repeat the same procedure on the remaining genes until no gene is left on the list.
Example: if Cor(Gene1, Gene2) = 0.83 while Cor(Gene1, Gene3) = 0.53, Cor(Gene1, Gene4) = 0.32 and Cor(Gene1, Gene5) = 0.13, then Cluster 1 = {Gene1, Gene2}, and Gene3, Gene4 and Gene5 remain on the list.
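The grouping loop above can be sketched as a greedy pass over the candidate list. `corr(a, b)` is assumed to return the (ecological) correlation between two genes; this is an illustrative sketch, not the authors' implementation.

```python
def group_genes(genes, corr, threshold=0.8):
    # genes: ordered list of candidate gene ids.
    # Greedy grouping: the first gene still on the list seeds a group that
    # absorbs every remaining gene whose correlation with it exceeds the
    # threshold; the group is removed and the pass repeats until empty.
    remaining = list(genes)
    groups = []
    while remaining:
        basis = remaining[0]
        group = [basis] + [g for g in remaining[1:] if corr(basis, g) > threshold]
        groups.append(group)
        remaining = [g for g in remaining if g not in group]
    return groups
```

With the correlations from the slide (Cor(Gene1, Gene2) = 0.83, the rest below 0.8), the first group is {Gene1, Gene2} and Gene3, Gene4, Gene5 each end up alone.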
Select one gene from each group
Select the gene with the highest IG from each group.
Eliminate genes with no classification capability
ANOVA: for datasets with three or more class labels, use ANOVA to test whether a gene's class means are all equal.
Hypothesis: H0: μ1 = μ2 = … = μk.
Genes whose means do not differ over the class labels are eliminated.
Eliminate genes
T-test: for two-class datasets, a t-test is used to test whether the class means of a gene differ. Genes whose class means do not differ at significance level α are eliminated.
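The two-class test can be sketched with Welch's t statistic (a minimal stdlib-only sketch; in practice one would use a statistics library such as `scipy.stats.ttest_ind` to obtain the p-value compared against α):

```python
from statistics import mean, variance

def t_statistic(a, b):
    # Welch's t statistic for two independent samples (one gene's
    # expression values in class A and class B). A large |t| suggests
    # the class means differ, i.e., the gene helps discriminate classes.
    na, nb = len(a), len(b)
    return (mean(a) - mean(b)) / (variance(a) / na + variance(b) / nb) ** 0.5
```

A gene with clearly separated class means yields a large |t| and is kept; a gene with identical samples in both classes yields t = 0 and is eliminated.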
Subset refinement using GA
Subset refinement
Encoding: binary encoding, where "0" means the gene is not selected and "1" means the gene is selected. Example: 0110010 selects the 2nd, 3rd and 6th genes from the subset.
Chromosome length: the size of the candidate gene subset from step II.
Population size = 5; number of iterations = 1,000.
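Decoding a chromosome back to gene positions is a one-liner (illustrative sketch; the bit string in the example is hypothetical):

```python
def selected_genes(chromosome):
    # Map a binary chromosome string to the 1-based positions of the
    # genes whose bit is "1" (i.e., the genes selected by this chromosome).
    return [i + 1 for i, bit in enumerate(chromosome) if bit == "1"]
```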
Subset refinement
Fitness function: the SVM classification accuracy rate of the chromosome's gene subset.
Selection method: roulette wheel — the selection probability is proportional to the fitness value of the chromosome.
Single-point crossover and mutation: crossover rate = 0.7, mutation rate = 0.3.
Termination condition (any of the following):
- accuracy rate = 100%
- number of iterations = 1,000
- number of iterations is greater than 100 and the accuracy rates of the last 20 iterations are all the same.
Final solution: the chromosome with the largest fitness value in the last iteration.
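The GA loop described on the last slides can be sketched as below. This is a simplified sketch under assumptions: the fitness callback stands in for the SVM accuracy of a chromosome's gene subset, mutation is interpreted as flipping one random bit with probability 0.3 (one of several common readings of a per-chromosome mutation rate), and only the 100%-accuracy stop and iteration cap are implemented.

```python
import random

def refine_subset(fitness, length, pop_size=5, iterations=1000,
                  cx_rate=0.7, mut_rate=0.3, seed=0):
    # GA sketch following the slides: binary chromosomes, roulette-wheel
    # selection, single-point crossover (rate 0.7), mutation (rate 0.3).
    # `fitness(chromosome)` should return a value in (0, 1], standing in
    # for the SVM accuracy rate of the selected gene subset.
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(iterations):
        fits = [fitness(c) for c in pop]
        if max(fits) >= 1.0:              # terminate at 100% accuracy
            break
        new_pop = []
        for _ in range(pop_size):
            # Roulette wheel: pick parents with probability ~ fitness.
            p1 = rng.choices(pop, weights=fits)[0]
            p2 = rng.choices(pop, weights=fits)[0]
            child = p1[:]
            if rng.random() < cx_rate:    # single-point crossover
                point = rng.randrange(1, length)
                child = p1[:point] + p2[point:]
            if rng.random() < mut_rate:   # mutate: flip one random bit
                i = rng.randrange(length)
                child[i] = 1 - child[i]
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=fitness)
```

With a toy fitness that rewards 1-bits, the loop quickly converges toward the all-ones chromosome, mirroring how it would converge toward high-accuracy gene subsets when the fitness is a real SVM accuracy.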
The datasets
Data set name | # of samples | # of class labels | # of genes
9_Tumors | 60 | 9 | 5,726
Brain_Tumor1 | 90 | 5 | 5,920
Brain_Tumor2 | 50 | 4 | 10,367
Leukemia1 | 72 | 3 | 5,327
Leukemia2 | 72 | 3 | 11,225
Lung_Cancer | 203 | 5 | 12,600
SRBCT | 83 | 4 | 2,308
11_Tumors | 174 | 11 | 12,533
Prostate_Tumor | 102 | 2 | 10,509
DLBCL | 77 | 2 | 5,469
Source: GEMS.
Genes selected in 3 steps
Data Set | # of original genes | after IG | after Grouping | after GA
9_Tumors | 5,726 | … | … | 13
Brain_Tumor1 | 5,920 | … | … | 10
Brain_Tumor2 | 10,367 | … | … | 4
Leukemia1 | 5,327 | … | … | 4
Leukemia2 | 11,225 | 3,097 | 63 | 3
Lung_Cancer | 12,600 | … | … | 18
SRBCT | 2,308 | … | … | 7
11_Tumors | 12,533 | … | … | 255
Prostate_Tumor | 10,509 | … | … | 119
DLBCL | 5,469 | … | … | 84
Compare with other papers
Comparison of our method (Hybrid) with GEPUBLIC, PAM and IG-GA; each cell is accuracy % (# of selected genes):
Data Set | GEPUBLIC | PAM | IG-GA | Hybrid
9_Tumors | 66.67 (19) | 43.33 (47) | 85.00 (52) | 71.67 (13)
Brain_Tumor1 | 84.44 (30) | 85.56 (42) | 93.33 (244) | 91.12 (10)
Brain_Tumor2 | 80.00 (15) | 66.00 (25) | 88.00 (489) | 92.00 (4)
Leukemia1 | 97.22 (11) | 93.06 (11) | … (82) | 97.23 (4)
Leukemia2 | 91.67 (31) | 91.67 (52) | 98.61 (782) | 100.00 (3)
Lung_Cancer | 94.58 (29) | 93.60 (75) | 95.57 (2101) | 97.05 (18)
SRBCT | 98.80 (26) | 98.80 (41) | … (56) | 100.00 (7)
11_Tumors | 86.21 (87) | 81.61 (203) | 92.53 (479) | 91.95 (255)
Prostate_Tumor | 95.10 (4) | 93.14 (13) | 96.08 (343) | 94.12 (119)
DLBCL | 97.40 (13) | 80.52 (70) | … (107) | 97.40 (84)
Conclusion
Each step in our method effectively reduces noisy genes from the previous step. The hybrid method selects fewer genes with a higher classification accuracy rate. The hybrid method still needs further improvement on 2-class microarray datasets.
Q & A
Thank you for listening.
Information Gain
For a dataset D with m different class labels, Info(D) measures how evenly the classes of D are distributed:
Info(D) = − Σ_{i=1}^{m} p_i log2(p_i), where p_i is the probability that a sample in D belongs to class i.
Info_A(D): the equivalent Info (weighted sum) of the subsets of D, where attribute A with values {a_1, a_2, …, a_v} splits D into subsets {D_1, D_2, …, D_v}, and D_j contains the samples with A equal to a_j:
Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)
Gain(A) = Info(D) − Info_A(D)
Attribute Selection: Information Gain (from Data Mining: Concepts and Techniques)
Class P: buys_computer = "yes" (9 samples); Class N: buys_computer = "no" (5 samples)
age | income | student | credit | buys_computer
<=30 | high | no | fair | no
<=30 | high | no | excellent | no
31…40 | high | no | fair | yes
>40 | medium | no | fair | yes
>40 | low | yes | fair | yes
>40 | low | yes | excellent | no
31…40 | low | yes | excellent | yes
<=30 | medium | no | fair | no
<=30 | low | yes | fair | yes
>40 | medium | yes | fair | yes
<=30 | medium | yes | excellent | yes
31…40 | medium | no | excellent | yes
31…40 | high | yes | fair | yes
>40 | medium | no | excellent | no
Counts by age: <=30: P = 2, N = 3; 31…40: P = 4, N = 0; >40: P = 3, N = 2.
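The Gain formulas can be checked on the buys_computer table above. Splitting on age gives the class counts (2,3), (4,0), (3,2), and the textbook values Info(D) ≈ 0.940 and Gain(age) ≈ 0.246 follow directly:

```python
from math import log2

def info(counts):
    # Entropy Info(D) = -sum p_i * log2(p_i) over the class counts.
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def gain(class_counts, split_counts):
    # Gain(A) = Info(D) - sum_j |Dj|/|D| * Info(Dj),
    # where split_counts lists the class counts of each subset Dj.
    total = sum(class_counts)
    info_a = sum(sum(cj) / total * info(cj) for cj in split_counts)
    return info(class_counts) - info_a
```

For the table: `gain([9, 5], [[2, 3], [4, 0], [3, 2]])` reproduces Gain(age) ≈ 0.246, the value that makes age the best first split in the decision tree.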