1
Diagnosis Analysis of Lung Cancer by Genome Expression Profiles
By Russell Armstrong
Supervisor: Mrs Wei Ji
2
Project Overview
Data mining using Clementine
Finding patterns in genetic data
Supervised machine learning using C5.0 Decision Tree and Neural Networks
Unsupervised machine learning using K-Means Clustering
Follow the CRISP-DM methodology: Business understanding, Data understanding, Data preparation, Modelling, Evaluation, Deployment
3
Business and Data Understanding
Bioinformatics is the term used for extracting, sorting and analysing sequence information about genes, genomes and proteins.
To find patterns in genes we need a technique that selects informative genes and eliminates redundant ones.
The data sources are DatasetA_12600gene.txt and Sample_Data.xls from http://research.dfci.harvard.edu/meyersonlab/lungca/
This mRNA dataset has been used successfully to reveal adenocarcinoma sub-classes in the classification of human lung carcinomas (Bhattacharjee et al., 2001).
12600 genes
203 lung samples, including:
  139 lung adenocarcinomas (AD), including 12 suspected metastases of extrapulmonary origin
  21 squamous cell carcinoma (SQ) cases
  20 pulmonary carcinoid (CO) tumours
  6 small cell lung cancers (SM)
  17 normal lung (NL) samples
4
Data Preparation
12600 genes correlated against each other and the average taken, using Excel and VBA
158,760,000 correlation calculations in total (12600 × 12600)
Proportional selection of genes
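The correlation step can be sketched in Python with NumPy rather than the Excel and VBA used in the project. This is only an illustration under stated assumptions: the function names are hypothetical, the correlation is assumed to be Pearson, and "proportional selection" is assumed to mean picking genes at evenly spaced positions along the ranking by average correlation, which the slides do not spell out.

```python
import numpy as np

def mean_correlation_per_gene(expr):
    """Average Pearson correlation of each gene against every gene.

    expr is a (genes x samples) array. The self-correlation of 1.0 is
    included in the average, matching the 12600 x 12600 count on the slide.
    Genes with constant expression would yield NaN and need filtering first.
    """
    corr = np.corrcoef(expr)      # (genes x genes) correlation matrix
    return corr.mean(axis=1)      # one average per gene

def proportional_selection(scores, fraction):
    """ASSUMED reading of 'proportional selection': take genes at evenly
    spaced positions along the ranking from lowest to highest score."""
    n = len(scores)
    k = max(1, round(n * fraction))
    order = np.argsort(scores)                       # gene indices, low to high
    positions = np.linspace(0, n - 1, k).astype(int)  # evenly spaced ranks
    return np.sort(order[positions])

# Small synthetic example (random values, not the lung cancer data).
rng = np.random.default_rng(0)
expr = rng.normal(size=(200, 30))        # 200 genes, 30 samples
scores = mean_correlation_per_gene(expr)
selected = proportional_selection(scores, 0.02)   # 2% of the genes
```

On the full dataset the correlation matrix has 12600 × 12600 entries, roughly 1.3 GB as 64-bit floats, so it may need to be computed in blocks.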
5
Gene Selection
6
Modelling in Clementine
7
Evaluation
Four-fold cross-validation
  Train the learning techniques on ¼ of the data and test on the remaining ¾
  Each ¼ takes a turn as the training set
  Average the error rate
Balanced sample selection (small cell lung cancer samples were excluded because there were too few of them)
8 datasets, varying in size from 0.25% to 2% of the genes
All datasets contain 17 NL, 20 CO, 21 SQ and between 21 and 58 AD samples
For the supervised learning techniques, the best results were achieved with 2% proportionally selected genes:
  C5.0 Decision Tree: 28% error rate
  Neural Networks: 11% error rate
High Correlation and Low Correlation selections produced much worse error rates
K-Means Clustering was not suitable with proportional selection of genes
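The evaluation scheme described above can be sketched as follows. As on the slide, each quarter is used in turn for training and the other three quarters for testing, which is the reverse of the more common arrangement. The function names are hypothetical, the data is synthetic, and a simple nearest-centroid classifier stands in for the C5.0 and neural network models built in Clementine.

```python
import numpy as np

def stratified_quarters(labels, n_folds=4, seed=0):
    """Split sample indices into folds, dealing each class out evenly."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        for i, sample in enumerate(idx):
            folds[i % n_folds].append(sample)
    return [np.array(f) for f in folds]

def nearest_centroid_predict(X_train, y_train, X_test):
    """Stand-in classifier (NOT C5.0 or a neural network)."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]

def average_error_rate(X, y, n_folds=4, seed=0):
    """Train on one fold, test on the other folds, average the error."""
    folds = stratified_quarters(y, n_folds, seed)
    errors = []
    for k in range(n_folds):
        train = folds[k]
        test = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        pred = nearest_centroid_predict(X[train], y[train], X[test])
        errors.append(np.mean(pred != y[test]))
    return float(np.mean(errors))

# Synthetic, well-separated data with the smallest class counts from the slide.
rng = np.random.default_rng(1)
class_ids = np.repeat(np.arange(4), [17, 20, 21, 21])
y = np.array(["NL", "CO", "SQ", "AD"])[class_ids]
X = class_ids[:, None] * 10.0 + rng.normal(size=(len(y), 5))
error = average_error_rate(X, y)
```

Dealing each class out across the quarters keeps every training quarter balanced, which matters here because the smallest class has only 17 samples.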
8
Further Work
Investigate the effect of smoking and clinical path data on the classifications.
Classify BAC (bronchioloalveolar carcinoma) from genome expression profiling.
Evaluate other Clementine machine learning nodes for classifying samples from genome expression profiles.