A hybrid method for gene selection in microarray datasets
Yungho Leu, Chien-Pan Lee and Ai-Chen Chang
National Taiwan University of Science and Technology
2014/10/22
Outline
Microarray Datasets & Research Objective
Related work & Background
Research method
Experimental result
Conclusion
Microarray datasets
Microarray technology can be used to measure the expression levels of thousands of genes at the same time. A microarray dataset records the gene expression levels of different samples in a table.
Mobile Computing & Data Mining Lab.
Microarray datasets
N: number of samples (40~200)
M: number of genes (2,000~30,000), with M >> N
g_ij: expression level of gene j in sample i
Class label: the class label of the sample (e.g., 0: absent, 1: present)
Example: the Prostate cancer dataset (simplified) is an N-by-M table of expression levels, one row per sample, with a class-label column.
Research objective
M >> N poses challenges for diagnosis (i.e., classification).
Objective: to select a minimal subset of genes with a high classification accuracy rate — a gene selection problem.
Related work
Ding & Peng used the Pearson correlation coefficient to eliminate redundant genes from microarray datasets: "Minimum redundancy feature selection from microarray gene expression data" (2003 & 2005).
Yang et al. proposed using information gain and genetic algorithms for gene selection: "IG-GA: A Hybrid Filter/Wrapper Method for Feature Selection of Microarray Data" (2010).
Related work
Luo et al. clustered genes into groups and treated genes in the same group as redundant genes: "Improving the Computational Efficiency of Recursive Cluster Elimination" (2011).
Background knowledge
Information Gain: proposed by Quinlan as a basis for attribute selection in decision trees. Attributes with larger information gain are better for classification, i.e., for differentiating between the class labels of data samples.
Ecological correlation (Robinson)
Divide the dataset into groups and use the means of the groups to calculate the Pearson correlation coefficient. Averaging reduces the within-group variance and thereby increases the correlation coefficient between attributes.
Example
In the Leukemia1 dataset, samples are grouped by class label (0, 1, 2). For each gene, compute the mean expression in each class; the ecological correlation between gene1 and gene2 is then
Cor(gene1{μ0, μ1, μ2}, gene2{μ0, μ1, μ2}).
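The ecological correlation above can be sketched as follows: replace each gene's values by its per-class means, then correlate the means. This is a minimal illustration, not the authors' implementation; the gene values and labels in the usage example are made up.

```python
from statistics import mean

def pearson(x, y):
    # Plain Pearson correlation coefficient between two equal-length sequences.
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def ecological_correlation(g1, g2, labels):
    # Replace each gene's expression values by its per-class means,
    # then compute the Pearson correlation of the two mean vectors.
    classes = sorted(set(labels))
    m1 = [mean(v for v, c in zip(g1, labels) if c == k) for k in classes]
    m2 = [mean(v for v, c in zip(g2, labels) if c == k) for k in classes]
    return pearson(m1, m2)
```

Because within-class noise is averaged out, the ecological correlation is typically larger in magnitude than the raw gene-to-gene correlation.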
Support Vector Machine
A classification method by Cortes & Vapnik (1995): find a good hyper-plane that separates samples with different class labels. The samples closest to the hyper-plane are the support vectors, and the distance between the margin boundaries is the margin. If |a1 − a2| > |b1 − b2|, hyper-plane a has the larger margin and is better than hyper-plane b.
Research method
Data preprocessing - Normalization
Normalize the dataset using the Z-score. The Z-score of gene expression X_ij is
Z_ij = (X_ij − X̄_j) / S_j
where
- X_ij: the expression of gene j on sample i,
- X̄_j: the mean of gene j's expression over the samples,
- S_j: the standard deviation of gene j's expression over the samples.
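The Z-score step can be sketched as below, normalizing one gene's expression values across all samples (a minimal stdlib-only sketch, not the authors' code):

```python
from statistics import mean, stdev

def z_score_normalize(column):
    # Z-score one gene's expression values across samples:
    # subtract the gene's mean, divide by its standard deviation.
    m, s = mean(column), stdev(column)
    return [(x - m) / s for x in column]
```

After normalization every gene has mean 0 and standard deviation 1, so genes with very different raw scales become comparable.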
Gene filtering by information gain
Gene filtering
Most of the genes have an IG value equal to 0. Select the genes with IG greater than 0 as candidate genes. For example, the Leukemia1 dataset has 5,327 genes; only 263 genes are left after filtering by IG.
Grouping of genes
Grouping of genes
Build the list of candidate genes and set the correlation threshold to 0.8 (strongly positively correlated).
Grouping method: with the first gene on the list as the basis, group each of the remaining genes with the basis gene if their correlation coefficient is greater than 0.8.
Eliminate the genes of the new group from the list; repeat the same procedure on the remaining genes until no gene is left on the list.
Example: if Cor(Gene1, Gene2) = 0.83 while Cor(Gene1, Gene3) = 0.53, Cor(Gene1, Gene4) = 0.32 and Cor(Gene1, Gene5) = 0.13, then Cluster 1 = {Gene1, Gene2}, and Gene3, Gene4 and Gene5 remain on the list.
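The grouping loop above can be sketched as a greedy pass over the candidate list. `corr(a, b)` is assumed to return the (ecological) correlation between two genes; this is an illustrative sketch, not the authors' implementation.

```python
def group_genes(genes, corr, threshold=0.8):
    # genes: ordered list of candidate gene ids.
    # Greedy grouping: the first gene still on the list seeds a group that
    # absorbs every remaining gene whose correlation with it exceeds the
    # threshold; the group is removed and the pass repeats until empty.
    remaining = list(genes)
    groups = []
    while remaining:
        basis = remaining[0]
        group = [basis] + [g for g in remaining[1:] if corr(basis, g) > threshold]
        groups.append(group)
        remaining = [g for g in remaining if g not in group]
    return groups
```

With the correlations from the slide (Cor(Gene1, Gene2) = 0.83, the rest below 0.8), the first group is {Gene1, Gene2} and Gene3, Gene4, Gene5 each end up alone.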
Select one gene from each group
Select the gene with the highest IG from each group.
Eliminate genes with no classification capability
ANOVA: for datasets with three or more class labels, use ANOVA to test whether a gene's class means are all equal.
Hypothesis: H0: μ1 = μ2 = … = μk.
Genes whose means do not differ over the class labels are eliminated.
Eliminate genes
T-test: for two-class datasets, a t-test is used to test whether the class means of a gene differ. Genes whose class means do not differ at significance level α are eliminated.
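The two-class test can be sketched with Welch's t statistic (a minimal stdlib-only sketch; in practice one would use a statistics library such as `scipy.stats.ttest_ind` to obtain the p-value compared against α):

```python
from statistics import mean, variance

def t_statistic(a, b):
    # Welch's t statistic for two independent samples (one gene's
    # expression values in class A and class B). A large |t| suggests
    # the class means differ, i.e., the gene helps discriminate classes.
    na, nb = len(a), len(b)
    return (mean(a) - mean(b)) / (variance(a) / na + variance(b) / nb) ** 0.5
```

A gene with clearly separated class means yields a large |t| and is kept; a gene with identical samples in both classes yields t = 0 and is eliminated.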
Subset refinement using GA
Subset refinement
Encoding: binary encoding, where "0" means the gene is not selected and "1" means the gene is selected. Example: 0110010 selects the 2nd, 3rd and 6th genes from the subset.
Chromosome length: the size of the candidate gene subset from step II.
Population size = 5; number of iterations = 1,000.
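Decoding a chromosome back to gene positions is a one-liner (illustrative sketch; the bit string in the example is hypothetical):

```python
def selected_genes(chromosome):
    # Map a binary chromosome string to the 1-based positions of the
    # genes whose bit is "1" (i.e., the genes selected by this chromosome).
    return [i + 1 for i, bit in enumerate(chromosome) if bit == "1"]
```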
Subset refinement
Fitness function: the SVM classification accuracy rate of the chromosome's gene subset.
Selection method: roulette wheel — the selection probability is proportional to the fitness value of the chromosome.
Single-point crossover and mutation: crossover rate = 0.7, mutation rate = 0.3.
Termination condition (any of the following):
- accuracy rate = 100%
- number of iterations = 1,000
- number of iterations is greater than 100 and the accuracy rates of the last 20 iterations are all the same.
Final solution: the chromosome with the largest fitness value in the last iteration.
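The GA loop described on the last slides can be sketched as below. This is a simplified sketch under assumptions: the fitness callback stands in for the SVM accuracy of a chromosome's gene subset, mutation is interpreted as flipping one random bit with probability 0.3 (one of several common readings of a per-chromosome mutation rate), and only the 100%-accuracy stop and iteration cap are implemented.

```python
import random

def refine_subset(fitness, length, pop_size=5, iterations=1000,
                  cx_rate=0.7, mut_rate=0.3, seed=0):
    # GA sketch following the slides: binary chromosomes, roulette-wheel
    # selection, single-point crossover (rate 0.7), mutation (rate 0.3).
    # `fitness(chromosome)` should return a value in (0, 1], standing in
    # for the SVM accuracy rate of the selected gene subset.
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(iterations):
        fits = [fitness(c) for c in pop]
        if max(fits) >= 1.0:              # terminate at 100% accuracy
            break
        new_pop = []
        for _ in range(pop_size):
            # Roulette wheel: pick parents with probability ~ fitness.
            p1 = rng.choices(pop, weights=fits)[0]
            p2 = rng.choices(pop, weights=fits)[0]
            child = p1[:]
            if rng.random() < cx_rate:    # single-point crossover
                point = rng.randrange(1, length)
                child = p1[:point] + p2[point:]
            if rng.random() < mut_rate:   # mutate: flip one random bit
                i = rng.randrange(length)
                child[i] = 1 - child[i]
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=fitness)
```

With a toy fitness that rewards 1-bits, the loop quickly converges toward the all-ones chromosome, mirroring how it would converge toward high-accuracy gene subsets when the fitness is a real SVM accuracy.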
The datasets
Data set name | # of samples | # of class labels | # of genes
9_Tumors | 60 | 9 | 5,726
Brain_Tumor1 | 90 | 5 | 5,920
Brain_Tumor2 | 50 | 4 | 10,367
Leukemia1 | 72 | 3 | 5,327
Leukemia2 | 72 | 3 | 11,225
Lung_Cancer | 203 | 5 | 12,600
SRBCT | 83 | 4 | 2,308
11_Tumors | 174 | 11 | 12,533
Prostate_Tumor | 102 | 2 | 10,509
DLBCL | 77 | 2 | 5,469
Source: GEMS.
Genes selected in 3 steps
Data Set | # of original genes | after IG | after Grouping | after GA
9_Tumors | 5,726 | … | … | 13
Brain_Tumor1 | 5,920 | … | … | 10
Brain_Tumor2 | 10,367 | … | … | 4
Leukemia1 | 5,327 | … | … | 4
Leukemia2 | 11,225 | 3,097 | 63 | 3
Lung_Cancer | 12,600 | … | … | 18
SRBCT | 2,308 | … | … | 7
11_Tumors | 12,533 | … | … | 255
Prostate_Tumor | 10,509 | … | … | 119
DLBCL | 5,469 | … | … | 84
Compare with other papers
Comparison of our method (Hybrid) with GEPUBLIC, PAM and IG-GA; each cell is accuracy % (# of selected genes):
Data Set | GEPUBLIC | PAM | IG-GA | Hybrid
9_Tumors | 66.67 (19) | 43.33 (47) | 85.00 (52) | 71.67 (13)
Brain_Tumor1 | 84.44 (30) | 85.56 (42) | 93.33 (244) | 91.12 (10)
Brain_Tumor2 | 80.00 (15) | 66.00 (25) | 88.00 (489) | 92.00 (4)
Leukemia1 | 97.22 (11) | 93.06 (11) | … (82) | 97.23 (4)
Leukemia2 | 91.67 (31) | 91.67 (52) | 98.61 (782) | 100.00 (3)
Lung_Cancer | 94.58 (29) | 93.60 (75) | 95.57 (2101) | 97.05 (18)
SRBCT | 98.80 (26) | 98.80 (41) | … (56) | 100.00 (7)
11_Tumors | 86.21 (87) | 81.61 (203) | 92.53 (479) | 91.95 (255)
Prostate_Tumor | 95.10 (4) | 93.14 (13) | 96.08 (343) | 94.12 (119)
DLBCL | 97.40 (13) | 80.52 (70) | … (107) | 97.40 (84)
Conclusion
Each step in our method effectively reduces noisy genes from the previous step. The hybrid method selects fewer genes with a higher classification accuracy rate. The hybrid method still needs further improvement on 2-class microarray datasets.
Q & A
Thank you for listening.
Information Gain
For a dataset D with m different class labels, Info(D) measures how evenly the classes of D are distributed:
Info(D) = − Σ_{i=1}^{m} p_i log2(p_i), where p_i is the probability that a sample in D belongs to class i.
Info_A(D): the equivalent Info (weighted sum) of the subsets of D, where attribute A with values {a_1, a_2, …, a_v} splits D into subsets {D_1, D_2, …, D_v}, and D_j contains the samples with A equal to a_j:
Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)
Gain(A) = Info(D) − Info_A(D)
Attribute Selection: Information Gain (from Data Mining: Concepts and Techniques)
Class P: buys_computer = "yes" (9 samples); Class N: buys_computer = "no" (5 samples)
age | income | student | credit | buys_computer
<=30 | high | no | fair | no
<=30 | high | no | excellent | no
31…40 | high | no | fair | yes
>40 | medium | no | fair | yes
>40 | low | yes | fair | yes
>40 | low | yes | excellent | no
31…40 | low | yes | excellent | yes
<=30 | medium | no | fair | no
<=30 | low | yes | fair | yes
>40 | medium | yes | fair | yes
<=30 | medium | yes | excellent | yes
31…40 | medium | no | excellent | yes
31…40 | high | yes | fair | yes
>40 | medium | no | excellent | no
Counts by age: <=30: P = 2, N = 3; 31…40: P = 4, N = 0; >40: P = 3, N = 2.
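The Gain formulas can be checked on the buys_computer table above. Splitting on age gives the class counts (2,3), (4,0), (3,2), and the textbook values Info(D) ≈ 0.940 and Gain(age) ≈ 0.246 follow directly:

```python
from math import log2

def info(counts):
    # Entropy Info(D) = -sum p_i * log2(p_i) over the class counts.
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def gain(class_counts, split_counts):
    # Gain(A) = Info(D) - sum_j |Dj|/|D| * Info(Dj),
    # where split_counts lists the class counts of each subset Dj.
    total = sum(class_counts)
    info_a = sum(sum(cj) / total * info(cj) for cj in split_counts)
    return info(class_counts) - info_a
```

For the table: `gain([9, 5], [[2, 3], [4, 0], [3, 2]])` reproduces Gain(age) ≈ 0.246, the value that makes age the best first split in the decision tree.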