Cancer classification by Regularized Least Squares Classifiers

Annarita D’Addabbo (a), Rosalia Maglietta (a), Sabino Liuni (b), Graziano Pesole (b,c) and Nicola Ancona (a)

a) Istituto di Studi sui Sistemi Intelligenti per l’Automazione, CNR, Via Amendola 122/D-I, Bari, Italy
b) Istituto di Tecnologie Biomediche-Sezione di Bari, CNR, Via Amendola 122/D, Bari, Italy
c) Dipartimento Scienze Biomolecolari e Biotecnologie, Università di Milano, Via Caloria 26, Milano, Italy

Abstract

Support Vector Machines (SVM) [1] are the state-of-the-art supervised learning techniques for cancer classification. Other machine learning approaches, such as Regularized Least Squares (RLS) [2] classifiers, may be a highly suitable alternative because of their simplicity and reliability. We compared the performance of RLS classifiers with that of SVM on three benchmark data sets, also with respect to the number of selected genes and to different gene selection strategies. We show that RLS classifiers perform comparably to SVM classifiers in terms of the leave-one-out (LOO) error. The main advantage of RLS machines is that training reduces to solving a linear system whose order equals the number of training examples. Moreover, RLS machines yield an exact measure of the LOO error with just one training.

Benchmark Data set description

Leukemia data set [3]. 25 examples of Acute Myeloid Leukemia (AML) vs 47 examples of Acute Lymphoblastic Leukemia (ALL), divided into a training set and a test set. Each sample consists of 7129 human gene expression levels.

Colon data set [4]. 40 examples of tumor colon tissue vs 22 normal colon tissue samples. Each sample consists of 2000 human gene expression levels.

Multi-cancer data set [5].
190 examples of cancer tissue, spanning 14 common tumor types, vs 90 normal tissue samples. Each example consists of gene expression levels.

                                        SVM   RLS
LOO error on Leukemia training set        2     2
Leukemia test error                       3     3
LOO error on Leukemia data set            1     2
LOO error on Colon data set               8     9
LOO error on Multi-Cancer data set       88    90

RLS computes the LOO error in just one training by using all the training examples.

GENE SELECTION strategies

Two techniques are used to rank the genes, and a nonparametric permutation test is used to determine how many genes are really important for classifying a given specimen: 999 genes in the Leukemia data set, 500 in the Colon one and 1400 in the Multi-Cancer one.

S2N Statistic and NRFE Statistic, with j = 1, 2, ..., number of genes.

[Figure: Visualization of the S2N Statistic (47 ALL examples, 25 AML examples; labels HP, HN). Observed T_S2N(j) distribution computed on the Leukemia data set, compared to randomly permuted class distinctions.]

[Table: S2N Statistic. Columns genes, SVM, RLS for each of the Leukemia, Colon and Multi-Cancer data sets.]

[Table: NRFE Statistic. Columns genes, SVM, RLS for each of the Leukemia, Colon and Multi-Cancer data sets.]

Conclusions

RLS classifiers perform comparably to SVM classifiers on the problem of cancer classification from gene expression data, and they are a valuable alternative to SVM because they enjoy several interesting properties. RLS machines are fast and easy to implement and, more importantly, they make it possible to measure the exact LOO error with one training only.

References

[1] Vapnik, V. Statistical Learning Theory, John Wiley & Sons, Inc., 1998.
[2] Tikhonov, A.N., Arsenin, V.Y. Solutions of ill-posed problems, W.H.
Winston, Washington D.C., 1977.
[3] Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S. (1999) Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring, Science, 286.
[4] Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., Levine, A.J. (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, PNAS, 96.
[5] Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., Golub, T.R. (2001) Multi-class cancer diagnosis using tumor gene expression signatures, PNAS, 98.
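
As an illustration of the claim that RLS yields the exact LOO error from one training, the following is a minimal sketch and not the authors' implementation. It assumes the common RLS formulation in which the coefficients c solve (K + lambda*I) c = y for a kernel matrix K and labels y in {-1, +1}. The function names, the linear kernel, the value of lambda and the synthetic data are all hypothetical. Under that assumption the LOO output for example i is y_i - c_i / [(K + lambda*I)^-1]_ii, which the sketch compares with explicit retraining.

```python
import numpy as np


def rls_train_with_loo(K, y, lam):
    """Train RLS once; return coefficients and exact LOO outputs.

    Assumed formulation: c solves (K + lam * I) c = y.
    """
    n = K.shape[0]
    G_inv = np.linalg.inv(K + lam * np.eye(n))
    c = G_inv @ y
    # Closed-form LOO output: y_i - c_i / (G^-1)_ii
    loo_outputs = y - c / np.diag(G_inv)
    return c, loo_outputs


def loo_error_count(y, loo_outputs):
    """Number of examples misclassified by their own left-out machine."""
    return int(np.sum(np.sign(loo_outputs) != y))


def brute_force_loo(K, y, lam):
    """Reference check: retrain n times, each time leaving one example out."""
    n = K.shape[0]
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        K_sub = K[np.ix_(keep, keep)]
        c_sub = np.linalg.solve(K_sub + lam * np.eye(n - 1), y[keep])
        out[i] = K[i, keep] @ c_sub
    return out


# Synthetic stand-in for expression data: 20 samples, 50 "genes" (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
y = np.where(rng.random(20) < 0.5, -1.0, 1.0)
X[y > 0] += 0.5          # shift one class so the problem is learnable
K = X @ X.T              # linear kernel (an assumption)

c, loo = rls_train_with_loo(K, y, lam=1.0)
errors = loo_error_count(y, loo)
```

The single n-by-n inversion replaces n separate trainings, which is where the saving comes from when the number of examples is small and the number of genes is large.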
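
The gene ranking and permutation test described under GENE SELECTION strategies can be sketched along similar lines. The sketch assumes the signal-to-noise definition used by Golub et al. [3], that is (mean_pos - mean_neg) / (std_pos + std_neg) for each gene. The per-gene permutation p-value, the 0.05 threshold, the function names and the synthetic data are likewise assumptions for illustration rather than the authors' procedure.

```python
import numpy as np


def s2n(X, y):
    """Signal-to-noise statistic per gene (assumed Golub-style definition)."""
    pos, neg = X[y == 1], X[y == -1]
    num = pos.mean(axis=0) - neg.mean(axis=0)
    den = pos.std(axis=0, ddof=1) + neg.std(axis=0, ddof=1)
    return num / den


def permutation_pvalues(X, y, n_perm=200, seed=0):
    """Per-gene p-values from randomly permuted class labels."""
    rng = np.random.default_rng(seed)
    observed = np.abs(s2n(X, y))
    exceed = np.zeros(X.shape[1])
    for _ in range(n_perm):
        permuted = np.abs(s2n(X, rng.permutation(y)))
        exceed += permuted >= observed
    # Add-one correction keeps p-values strictly positive.
    pvals = (exceed + 1.0) / (n_perm + 1.0)
    return observed, pvals


# Synthetic data: 30 samples, 100 genes, first 5 genes informative (hypothetical).
rng = np.random.default_rng(1)
y = np.array([1] * 15 + [-1] * 15)
X = rng.normal(size=(30, 100))
X[y == 1, :5] += 3.0

observed, pvals = permutation_pvalues(X, y)
ranking = np.argsort(-observed)             # genes ordered by |S2N|
n_selected = int(np.sum(pvals < 0.05))      # genes passing the assumed threshold
```

Genes whose observed statistic is rarely matched under permuted labels receive small p-values, which gives a data-driven count of genes to keep instead of a fixed cutoff.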