Published by Roger Doyle. Modified over 8 years ago.
Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks
From Nature Medicine 7(6), 2001, by Javed Khan et al.
(Summarized by Marcílio Souto, ICMC/USP, São Carlos) marcilio@icmc.usp.br
2 Abstract
- Small, round blue-cell tumors (SRBCTs)
  - Four distinct categories that are hard to discriminate
- cDNA microarrays and artificial neural networks (ANNs)
- Tumor diagnosis and the identification of candidate targets for therapy
3 The Problem
- SRBCTs of childhood
  - Neuroblastoma (NB)
  - Rhabdomyosarcoma (RMS)
  - Non-Hodgkin lymphoma (NHL)
  - The Ewing family of tumors (EWS)
- All four categories look similar in routine histology
- Accurate diagnosis is essential
- In clinical practice
  - Immunohistochemistry: detection of protein expression
  - Reverse transcription-PCR: detection of tumor-specific translocations, such as EWS-FLI1 in EWS and PAX3-FKHR in alveolar RMS (ARMS)
4 The Approach
- Gene-expression profiling using cDNA microarrays
  - Simultaneous analysis of multiple markers
  - Multiple categorical distinctions
- Artificial neural networks (ANNs), previously applied to
  - Diagnosing myocardial infarcts
  - Diagnosing arrhythmias from electrocardiograms
  - Interpreting radiographs
  - Interpreting magnetic resonance images
5 The Experiment
- cDNA microarrays with 6,567 genes
- 63 training examples
  - Tumor biopsy material
  - Cell lines
- Filtering for a minimal level of expression left 2,308 genes
- PCA further reduced the dimensionality
  - The 10 dominant PCA components were used (63% of the variance in the data matrix)
- Three-fold cross-validation
- 3,750 ANNs were constructed (average vote)
- No overfitting and zero classification error on the training samples
6 Data Sets
Table I: number of samples in the train (train and validation) and test sets

Class                    Train   Test
Cancer I (EWS)              23      6
Cancer II (RMS)             20      5
Cancer III (NB)             12      6
Cancer IV (BL)               8      3
Unlabeled (non-SRBCT)        0      5
Total                       63     25
7 The Schematic View of the Analysis Process
8 Data Analysis
- Initial Cuts
- Principal Components Analysis
- Artificial Neural Network Prediction
- Extraction of Relevant Genes
9 Data Analysis: Initial Cuts and PCA
- Initial Cuts
  - A gene is omitted if, for any of the samples, its red intensity (ri) is less than 20
  - From 6567 to 2308 genes
- Principal Components Analysis (PCA)
  - Reduces the dimensionality of the data to 10 components: 2308 genes to 10 inputs
  - This number (10) was found by means of preliminary experiments
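The two steps on this slide can be sketched roughly as follows. This is a minimal illustration on synthetic random data, not the SRBCT measurements and not the authors' code; the function name `filter_and_project`, the array layout (samples in rows, genes in columns) and the use of an SVD to obtain the principal components are all assumptions made for the example.

```python
import numpy as np

def filter_and_project(red_intensity, expression, min_intensity=20.0, n_components=10):
    """Drop low-intensity genes, then project samples onto the top principal components.

    Both inputs are arrays of shape (n_samples, n_genes) (assumed layout).
    Returns (kept_gene_mask, scores, explained_variance_fraction).
    """
    # Keep a gene only if its red intensity is >= min_intensity in every sample.
    keep = (red_intensity >= min_intensity).all(axis=0)
    x = expression[:, keep]

    # Centre each gene, then obtain the principal axes from an SVD.
    centred = x - x.mean(axis=0)
    _, s, vt = np.linalg.svd(centred, full_matrices=False)

    k = min(n_components, len(s))
    scores = centred @ vt[:k].T                      # shape (n_samples, k)
    explained = float((s[:k] ** 2).sum() / (s ** 2).sum())
    return keep, scores, explained


# Synthetic stand-in data (hypothetical values, 63 samples x 500 genes).
rng = np.random.default_rng(0)
ri = rng.uniform(15.0, 200.0, size=(63, 500))
expr = rng.normal(size=(63, 500))
keep, scores, explained = filter_and_project(ri, expr)
print(int(keep.sum()), scores.shape, round(explained, 3))
```

The `scores` matrix plays the role of the 10 network inputs per sample; `explained` corresponds to the fraction of variance retained, which the slides report as 63% for the real data.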
10 Data Analysis: Artificial Neural Network (1/3)
Architecture and Parameters
- Linear perceptron (LP)
  - 10 inputs representing the PCA components
  - 4 output nodes, one for each class of tumor (EWS, BL, NB and RMS)
  - 44 free parameters, including four threshold units
- Calibration (training) was performed using JETNET
  - Learning rate η = 0.7; momentum = 0.3
  - Learning rate decreased after each epoch (factor 0.99)
  - Initial weights randomly chosen from [-r, r], with r = 0.1/F
  - Weights updated after every 10 epochs
  - At most 100 epochs
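The parameter count on this slide follows from 10 × 4 weights plus 4 thresholds. The sketch below checks that arithmetic and trains a small perceptron with the quoted learning rate, momentum and decay. It is not JETNET: the sigmoid outputs, the full-batch update once per epoch, the value r = 0.01 (that is, F assumed to be 10) and the synthetic data are assumptions made only to keep the example short.

```python
import numpy as np

def n_free_parameters(n_inputs, n_outputs):
    # One weight per input-output pair plus one threshold (bias) per output.
    return n_inputs * n_outputs + n_outputs

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def summed_square_error(x, targets, w, b):
    return float(((sigmoid(x @ w + b) - targets) ** 2).sum())

def train_linear_perceptron(x, targets, lr=0.7, momentum=0.3, decay=0.99,
                            epochs=100, r=0.01, seed=0):
    """Gradient descent with momentum on half the summed square error (assumed setup)."""
    rng = np.random.default_rng(seed)
    n_in, n_out = x.shape[1], targets.shape[1]
    w = rng.uniform(-r, r, size=(n_in, n_out))
    b = rng.uniform(-r, r, size=n_out)
    step_w, step_b = np.zeros_like(w), np.zeros_like(b)
    for _ in range(epochs):
        out = sigmoid(x @ w + b)
        delta = (out - targets) * out * (1.0 - out)   # error signal through the sigmoid
        step_w = -lr * (x.T @ delta) / len(x) + momentum * step_w
        step_b = -lr * delta.mean(axis=0) + momentum * step_b
        w += step_w
        b += step_b
        lr *= decay                                   # learning rate shrinks each epoch
    return w, b

# Synthetic 4-class data with 10 inputs (hypothetical, stands in for PCA scores).
rng = np.random.default_rng(1)
labels = np.repeat(np.arange(4), 15)
x_demo = rng.normal(scale=0.5, size=(60, 10))
x_demo[np.arange(60), labels] += 2.0
t_demo = np.eye(4)[labels]
w_demo, b_demo = train_linear_perceptron(x_demo, t_demo)
print(n_free_parameters(10, 4))  # 44
```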
11 Data Analysis: Artificial Neural Network (2/3)
Calibration and Validation
- 3-fold cross-validation
  - The 63 labeled samples are randomly shuffled and split into 3 equally sized groups
  - The network is trained on two of these groups and the third is used for validation
  - This procedure is repeated 3 times, so that each group is used for validation once
- The random shuffling is redone 1250 times, giving 3750 networks
- For validation, the output for a sample is the average over the 1250 networks for which it was in the validation group (committee)
- For test samples, the committee is formed with all 3750 networks
  - 25 samples in the test set
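The bookkeeping behind the 3750 networks can be sketched with the standard library alone. The generator name `three_fold_splits` and the way the shuffled indices are dealt into folds are illustrative choices, not details taken from the paper.

```python
import random

def three_fold_splits(n_samples, n_shuffles, seed=0):
    """Yield (train_indices, validation_indices): three pairs per random shuffle."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    for _ in range(n_shuffles):
        rng.shuffle(indices)
        # Deal the shuffled indices into 3 equally sized groups.
        folds = [indices[i::3] for i in range(3)]
        for k in range(3):
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train, folds[k]

splits = list(three_fold_splits(n_samples=63, n_shuffles=1250))
n_networks = len(splits)  # one network per split: 3 * 1250 = 3750

# Count how many networks see each sample as a validation sample.
validation_counts = [0] * 63
for _, valid in splits:
    for i in valid:
        validation_counts[i] += 1

print(n_networks, set(validation_counts))
```

Each sample lands in the validation group exactly once per shuffle, which is why its validation committee has 1250 members while a test sample can be scored by all 3750 networks.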
12 Data Analysis: Artificial Neural Network (3/3)
Assessing the quality of classifications
- Each sample is classified as belonging to the cancer type with the largest average committee vote
- Rejection of the second-largest class, or of samples that do not belong to any of the classes
  - A distance is defined from a sample to the ideal vote for each cancer type
  - Based on the validation set, an empirical distribution of this distance is generated for each type of cancer
  - For a given test sample, the system can reject a possible classification based on these probability distributions
- Note: the classification, as well as the extraction of important genes, converges using fewer than 100 networks
  - The only reason 3750 networks were used is to have sufficient statistics for these empirical probability distributions
13 Relevant Gene Extraction
- To select relevant genes, the authors proposed a sensitivity measure (S) of the outputs (o) with respect to each of the 2308 input variables, summed over the samples and outputs
  - All 3750 networks are involved
- They also proposed a related measure for a single output
- Thus, genes can be ranked according to their importance for the overall classification and also according to their importance for each disease separately
- They explored the top 6, 12, 24, 48, 96, 192, 384, 768 and 1536 genes
  - For each choice, training (calibration) was redone
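One way to read the sensitivity measure is as the absolute derivative of each output with respect to each gene, accumulated over samples and outputs. The sketch below computes that for a single network by the chain rule through the PCA projection. The sigmoid outputs, the use of absolute values, the absence of any normalisation and all names are assumptions; the slides only state that the measure is summed over samples and outputs and that, in the paper, all 3750 networks contribute.

```python
import numpy as np

def gene_sensitivity(x, pca_axes, w, b):
    """Sensitivity of network outputs to each gene, for one network.

    x:        (n_samples, n_genes) centred expression values
    pca_axes: (n_genes, n_pc) projection onto principal components
    w, b:     (n_pc, n_out) weights and (n_out,) thresholds
    Returns (total, per_output) with shapes (n_genes,) and (n_genes, n_out).
    """
    out = 1.0 / (1.0 + np.exp(-(x @ pca_axes @ w + b)))   # (n_samples, n_out)
    slope = out * (1.0 - out)                             # sigmoid derivative, always > 0
    gene_to_out = pca_axes @ w                            # d(pre-activation) / d(gene)
    # |d o_i / d x_k| summed over samples; slope is positive so it factors out.
    per_output = np.abs(gene_to_out) * slope.sum(axis=0)
    total = per_output.sum(axis=1)
    return total, per_output

# Synthetic example (hypothetical sizes and random values).
rng = np.random.default_rng(0)
n_genes, n_pc, n_out = 50, 10, 4
x_demo = rng.normal(size=(20, n_genes))
axes_demo = np.linalg.qr(rng.normal(size=(n_genes, n_pc)))[0]
w_demo = rng.normal(size=(n_pc, n_out))
b_demo = np.zeros(n_out)

total, per_output = gene_sensitivity(x_demo, axes_demo, w_demo, b_demo)
ranking = np.argsort(-total)   # most sensitive gene first
print(ranking[:5])
```

Sorting on `total` gives a ranking for the overall classification, while sorting on a single column of `per_output` gives a ranking for one disease. For a committee, the same quantity would be accumulated over its networks.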
14 Summed Square Error Graph
15 Optimization of the Genes Used for Classification
- Using the 3,750 trained models, rank all genes according to their significance for the classification
- Determine the classification error rate using increasing numbers of these ranked genes
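The loop over increasing numbers of ranked genes might look like the sketch below. To stay short it swaps in a nearest-centroid classifier where the study recalibrates the ANN committee, and it uses synthetic data with a made-up ranking; both are assumptions for illustration only.

```python
import numpy as np

GENE_COUNTS = [6 * 2 ** i for i in range(9)]  # 6, 12, 24, ..., 1536

def nearest_centroid_error(x_train, y_train, x_valid, y_valid):
    # Stand-in classifier (NOT the ANN committee): assign each sample to the closest class mean.
    classes = np.unique(y_train)
    centroids = np.stack([x_train[y_train == c].mean(axis=0) for c in classes])
    dist = ((x_valid[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    predicted = classes[dist.argmin(axis=1)]
    return float((predicted != y_valid).mean())

def error_by_gene_count(x, y, ranking, train_idx, valid_idx, counts=GENE_COUNTS):
    """Validation error when only the top-n ranked genes are used, for each n."""
    errors = {}
    for n in counts:
        if n > len(ranking):
            break
        top = ranking[:n]
        errors[n] = nearest_centroid_error(
            x[train_idx][:, top], y[train_idx],
            x[valid_idx][:, top], y[valid_idx],
        )
    return errors

# Synthetic data: 63 samples, 200 genes, the first 4 genes carry the class signal.
rng = np.random.default_rng(0)
y_demo = np.arange(63) % 4
x_demo = rng.normal(size=(63, 200))
x_demo[np.arange(63), y_demo] += 8.0
ranking_demo = np.arange(200)          # hypothetical ranking, informative genes first
errors = error_by_gene_count(x_demo, y_demo, ranking_demo,
                             np.arange(0, 42), np.arange(42, 63))
print(errors)
```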
16 Recalibrating the ANNs
- Using only 96 genes, the analysis process was repeated
- Zero classification error
17 Diagnostic Classification
- 25 test examples (5 non-SRBCTs)
- If a sample falls outside the 95th percentile of the probability distribution of distances between samples and their ideal output, its diagnosis is rejected
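The rejection rule can be sketched as a percentile cut on validation distances. The slides do not give the distance formula, so the form used here (half the summed squared difference between the committee vote and the ideal one-hot vote) is an assumption, as are the function names and the toy numbers.

```python
import numpy as np

def distance_to_ideal(votes, class_index):
    # Assumed distance: half the squared difference to the ideal vote (1 for the class, 0 elsewhere).
    ideal = np.zeros(len(votes))
    ideal[class_index] = 1.0
    return 0.5 * float(((np.asarray(votes) - ideal) ** 2).sum())

def diagnose(votes, validation_distances, percentile=95.0):
    """Return (predicted_class, accepted) for one test sample.

    votes: average committee vote per class.
    validation_distances: dict mapping class index to the distances observed
        for validation samples of that class (the empirical distribution).
    """
    predicted = int(np.argmax(votes))
    cutoff = np.percentile(validation_distances[predicted], percentile)
    accepted = distance_to_ideal(votes, predicted) <= cutoff
    return predicted, bool(accepted)

# Toy empirical distributions (hypothetical values, same for all 4 classes).
val_dist = {c: np.linspace(0.0, 0.1, 101) for c in range(4)}

confident = diagnose([0.90, 0.05, 0.03, 0.02], val_dist)
ambiguous = diagnose([0.40, 0.30, 0.20, 0.10], val_dist)
print(confident, ambiguous)
```

A sample with a diffuse vote, as a non-SRBCT sample would be expected to produce, ends up far from every ideal vote and its diagnosis is rejected.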
18 Multi-Dimensional Scaling (MDS) Using 96 genes
19 Hierarchical Clustering of 96 Genes
- 93 unique genes (3 IGF2 and 2 MYC)
- 13 ESTs
- 41 genes have not been reported as associated with these diseases
- Perfect clustering of the four categories
20 Expression of FGFR4 on SRBCT Tissue Array
- At the protein level: immunohistochemistry on SRBCT tissue arrays for the expression of fibroblast growth factor receptor 4 (FGFR4)
- FGFR4
  - Expressed during myogenesis (not in adult muscle)
  - Potential role in tumor growth
  - Prevention of terminal differentiation in muscle
- Strong cytoplasmic immunostaining for FGFR4 was seen in all 26 RMSs tested
21 Discussion
- Current diagnoses of tumors rely on histology (morphology) and immunohistochemistry (protein expression)
- Using cDNA microarrays
  - Multiple markers (robust)
  - Reveal the underlying genetic aberrations or biological processes
- Tumors and cell lines
  - Cell lines for ANN calibration
22 References
- J. Khan et al., "Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks", Nature Medicine, vol. 7, no. 6, June 2001, and the references therein.
- Analysis Methods Supplement for Nature Medicine, vol. 7, no. 6, June 2001. http://medicine.nature.com
- M. Ringnér, C. Peterson and J. Khan, "Analyzing array data using supervised methods", Pharmacogenomics, vol. 3, no. 3, 2002.
- NIH News Release: "Gene Chips Accurately Diagnose Four Complex Childhood Cancers: Artificial Intelligence Used With Gene Expression Microarrays for the First Time". http://www.nih.gov/news/pr/may2001/nhgri-30.htm