Classification of multiple cancer types by multicategory support vector machines using gene expression data
Support Vector Machine
- A classification method that has been applied successfully to cancer diagnosis problems
- Two types:
- Binary SVM: no optimal extension to more than two classes is known, which limits its application to multiple tumor types
- Multicategory SVM (MSVM): recently proposed; demonstrated on leukemia data and on small round blue cell tumors of childhood
DNA microarray technology
- Measures the relative amount of mRNA in isolated cells or biopsied tissues
- SVM-based multiclass approaches solve a series of binary problems (e.g. the DAG SVM algorithm)
- MSVM is applied to two gene expression data sets
Features
- Effectiveness
- Prediction strength
- Effect of data preprocessing
- Gene selection
- Dimension reduction
Binary SVM
MSVM
Procedure: 3-class problem
- Gene expression was monitored for classification of two leukemias: ALL (acute lymphoblastic leukemia) and AML (acute myeloid leukemia)
- ALL is split into two subtypes, B-cell and T-cell, which gives three classes in total
Procedure (cont.)
- Number of genes: 7129
- Training set: 38 samples
- Test set: 34 samples
- Preprocessing steps performed:
- Thresholding (floor 100, ceiling 16000)
- Filtering of genes (max/min <= 5 and max - min <= 500)
- Base 10 logarithmic transformation
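The three preprocessing steps listed on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `preprocess` and its parameter names are hypothetical, and the filter is read as "drop a gene when both conditions hold", which is one possible reading of the slide.

```python
import math

def preprocess(expr, floor=100.0, ceiling=16000.0, ratio_cut=5.0, diff_cut=500.0):
    """expr: list of genes, each a list of raw expression values across samples.
    Returns (indices of kept genes, log10-transformed rows for those genes)."""
    kept, rows = [], []
    for g, values in enumerate(expr):
        # Thresholding: clip every value into [floor, ceiling].
        clipped = [min(max(v, floor), ceiling) for v in values]
        hi, lo = max(clipped), min(clipped)
        # Filtering (assumed reading of the slide): drop the gene when
        # max/min <= 5 and max - min <= 500, i.e. it barely varies.
        if hi / lo <= ratio_cut and hi - lo <= diff_cut:
            continue
        kept.append(g)
        # Base 10 logarithmic transformation of the clipped values.
        rows.append([math.log10(v) for v in clipped])
    return kept, rows

# Toy data (made up): gene 0 is nearly flat, gene 1 varies strongly.
toy = [[50.0, 120.0, 90.0], [100.0, 20000.0, 800.0]]
kept, rows = preprocess(toy)
```

Because the floor is applied before the ratio is computed, the division by the minimum is always safe.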
Procedure (cont.)
- Standardization of each variable
- Variable selection
- Prescreening measure: ratio of the between-class sum of squares to the within-class sum of squares for each gene (genes with the largest ratios are taken)
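As a rough illustration of the standardization and prescreening steps, the sketch below computes the between-class to within-class sum of squares ratio for each gene and keeps the genes with the largest ratios. The function names (`standardize`, `bss_wss_ratio`, `top_genes`) and the toy data are hypothetical, and the use of the population standard deviation is an assumption.

```python
import math

def standardize(values):
    """Center one variable to mean 0 and scale it to unit standard deviation.
    Uses the population SD (an assumption) and requires SD > 0."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def bss_wss_ratio(values, labels):
    """Between-class over within-class sum of squares for one gene.
    Requires a non-zero within-class sum of squares."""
    overall = sum(values) / len(values)
    bss = wss = 0.0
    for c in set(labels):
        group = [v for v, y in zip(values, labels) if y == c]
        mean_c = sum(group) / len(group)
        bss += len(group) * (mean_c - overall) ** 2
        wss += sum((v - mean_c) ** 2 for v in group)
    return bss / wss

def top_genes(expr, labels, n_keep):
    """Indices of the n_keep genes with the largest BSS/WSS ratios."""
    ratios = [bss_wss_ratio(row, labels) for row in expr]
    order = sorted(range(len(expr)), key=lambda g: ratios[g], reverse=True)
    return order[:n_keep]

# Toy data (made up): gene 0 separates the classes, gene 1 does not.
toy_expr = [[1.0, 1.2, 3.0, 3.1], [1.0, 3.0, 1.1, 2.9]]
toy_labels = ["A", "A", "B", "B"]
selected = top_genes(toy_expr, toy_labels, 1)
z = standardize([1.0, 2.0, 3.0])
```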
Heat Map of 40 most important genes in training set
Small round blue cell tumors data (SRBCTs)
- 4 types:
- Neuroblastoma (NB)
- Rhabdomyosarcoma (RMS)
- Non-Hodgkin lymphoma (NHL)
- Ewing family of tumors (EWS)
- Artificial neural networks (ANN) were used
- Training set: 63 samples
- Test set: 20 samples
- Nearest neighbor, weighted voting, and linear SVM were applied to the data
- MSVM was applied for comparison
- Logarithm base 10 of the expression levels was used
Predicted decision vectors
SANN
- For multiclass classification
- Classification results superior to ANN
- ANN uses the backpropagation algorithm
- Why?
- Nonlinear connections
- Inclusion of interactions among the independent (input) variables
- Independence from conventional processes
Limitations
- Learned knowledge is contained in hundreds to thousands of weights (synapses)
- Cannot be analyzed as a single regression formula
Combining several ANNs
- Through ensembles of networks
- An ensemble: a collection of a finite number of different classifiers
- Cascading ANNs
Two-level ANN
- Task: chest radiograms
- With lung nodules (Class A)
- Without lung nodules (Class B)
Two-level architecture carrying lower-level and higher-level concepts
- Task (higher level): differentiate normal cells (class A) from malignant cells (class B)
- Lower level: subclasses of class B
- Class B_1
- Class B_2
- Class B_3
- Class B_4
One vs. all
- Used with SVM
- K binary classifiers, each distinguishing one class from all other classes lumped together
- A sample is assigned to the class whose classifier achieves the greatest output activity
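The decision rule on this slide can be sketched as follows. The linear scorers below are hypothetical stand-ins for K trained "one class vs. the rest" classifiers; their weights, and the function name `one_vs_all_predict`, are made up for illustration only.

```python
def one_vs_all_predict(x, classifiers):
    """classifiers: dict mapping a class label to (weights, bias) of a linear
    'this class vs. all others' scorer. Returns the label whose scorer
    gives the greatest output activity for sample x."""
    def score(weights, bias):
        return sum(w * xi for w, xi in zip(weights, x)) + bias
    return max(classifiers, key=lambda label: score(*classifiers[label]))

# Hypothetical scorers for three classes (weights are invented).
toy_classifiers = {
    "NB": ([1.0, 0.0], 0.0),
    "RMS": ([0.0, 1.0], 0.0),
    "EWS": ([-1.0, -1.0], 0.5),
}
predicted = one_vs_all_predict([0.2, 0.9], toy_classifiers)
```

Only K classifiers are needed, but each one is trained on an unbalanced split (one class against everything else).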
All-pairs approach
- Builds K(K-1)/2 binary classifiers
- Each class is distinguished from the other classes by K-1 binary classifiers
- Output activities are summed; the class with the greatest total activity is the winning class
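A minimal sketch of the all-pairs rule, assuming each pairwise classifier returns a signed activity (positive favours the first class, negative the second). That sign convention, the function names, and the centroid-based toy scorer are assumptions for illustration, not part of the slides.

```python
from itertools import combinations

def all_pairs_predict(x, classes, pair_score):
    """pair_score(a, b, x) returns a signed activity: positive favours a,
    negative favours b (assumed convention). Activities are summed per
    class and the class with the greatest total wins."""
    totals = {c: 0.0 for c in classes}
    pairs = list(combinations(classes, 2))  # K(K-1)/2 pairwise classifiers
    for a, b in pairs:
        s = pair_score(a, b, x)
        totals[a] += s
        totals[b] -= s
    return max(totals, key=totals.get), len(pairs)

# Toy pairwise scorer built from invented one-dimensional class centroids.
toy_centroids = {"A": 0.0, "B": 5.0, "C": 10.0}

def toy_pair_score(a, b, x):
    # Positive when x is closer to the centroid of a than to that of b.
    return (x - toy_centroids[b]) ** 2 - (x - toy_centroids[a]) ** 2

winner, n_classifiers = all_pairs_predict(4.0, ["A", "B", "C"], toy_pair_score)
```

With K = 3 classes this builds 3 pairwise classifiers, and each class takes part in K - 1 = 2 of them.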
SANN
- Oriented to human decision making
- Exclusion is performed: the candidate classes are narrowed down
- The classification made by the first ANN is a preselection for the second, successive ANN
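The preselection idea can be illustrated with a generic two-stage sketch: a first scorer ranks all classes, and a second scorer decides only among the shortlisted ones. This is not the SANN implementation; the function name `two_stage_predict`, the shortlist size `n_keep`, and the two toy scorers are all hypothetical.

```python
def two_stage_predict(x, first_stage, second_stage, n_keep=2):
    """first_stage(x) returns a dict of label -> activity over all classes.
    second_stage(x, shortlist) returns a dict of label -> activity over the
    shortlisted classes only. Returns (final label, shortlist)."""
    scores = first_stage(x)
    # Exclusion step: keep only the n_keep classes with the highest activity.
    shortlist = sorted(scores, key=scores.get, reverse=True)[:n_keep]
    refined = second_stage(x, shortlist)
    return max(refined, key=refined.get), shortlist

# Toy stand-ins for the two networks (activities are invented).
def toy_first(x):
    return {"A": 0.1, "B": 0.5, "C": 0.4}

def toy_second(x, shortlist):
    refined_table = {"A": 0.2, "B": 0.3, "C": 0.7}
    return {label: refined_table[label] for label in shortlist}

final_label, shortlist = two_stage_predict(None, toy_first, toy_second)
```

In this toy run the first stage prefers B, but the second stage, looking only at B and C, selects C.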