1
Predicting Breast Cancer Diagnosis From Fine-Needle Aspiration
STAT 295: Statistical Learning Professor Richard Single Marc Amoroso and Braden McKallagat
2
Cell Nuclei Image Analysis
FNA is a commonly used diagnostic technique: cells are sampled from a lump or mass found in the body and their physical features are examined under a microscope to diagnose cancer and other inflammatory conditions. It is a much less invasive way of diagnosing breast cancer, which can otherwise require surgery to remove a portion of the mass. Confidence that this procedure predicts malignant versus benign tumors as effectively as the alternatives encourages its use and contributes to the wellbeing of the patient. (Notes about SEs on the next slide.) Image courtesy of: Sizilio, G., Leite, C., Guerreiro, A. M., & Doria Neto, A. D. (2012). Fuzzy method for pre-diagnosis of breast cancer from the Fine Needle Aspirate analysis. BioMedical Engineering OnLine.
3
Dataset Breast masses from 569 patients of the University of Wisconsin Hospital. Measurements of physical aspects of cell nuclei. Diagnosis: “M” for malignant, “B” for benign. Class distribution: 357 benign, 212 malignant. Validation set: 70/30 split. We chose a 70/30 split because we wanted to leave enough observations in the validation set for it to be useful.
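The setup above can be sketched in Python: scikit-learn ships a copy of this same 569-patient Wisconsin Diagnostic dataset (note that sklearn encodes the diagnosis as 0 = malignant, 1 = benign rather than “M”/“B”, and the random seed here is an arbitrary choice, not the one the authors used):

```python
from collections import Counter

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# scikit-learn's copy of the 569-patient Wisconsin Diagnostic dataset
data = load_breast_cancer()
X, y = data.data, data.target  # target: 0 = malignant, 1 = benign

# Class distribution matches the slide: 357 benign, 212 malignant
counts = Counter(data.target_names[i] for i in y)
print(counts["benign"], counts["malignant"])  # 357 212

# 70/30 train/validation split, stratified to preserve the class balance
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)
print(X_train.shape, X_val.shape)  # (398, 30) (171, 30)
```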
4
Mean-Worst Correlation
Feature Selection: Mean-Worst Correlation. Ten features per nucleus: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, fractal dimension. Each is reported three ways: mean, SE, and “worst” (the mean of the 3 greatest measurements). We modeled with the mean measurements. We initially misinterpreted the SEs as random measurement error from repeated measurements of the same nucleus, and so dropped them from the analysis altogether; we now realize they compare measurements across different nuclei, and that part of what distinguishes benign from malignant cells is the symmetry and uniformity, or lack thereof, of the cell nuclei. The “worst” measurements are highly correlated with the means, so we went with the means. Interpreted as averages of physical measurements centered around a true value, the mean features made the normality assumption plausible. Limiting the analysis to the mean measurements did not appear to affect it much.
5
We then analyzed the mean features in our training set and identified likely correlation between measurements. As you can see, radius, perimeter, and area are highly correlated, which makes intuitive sense because they enter the calculations of one another, e.g. area depends on radius and vice versa. Since area was less correlated with the other predictors, we kept area.
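This correlation structure is easy to reproduce; a sketch using pandas on the ten mean features (computed here on the full dataset rather than the authors' training split, so the values are close but not identical to theirs):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
# the first ten columns are the mean measurements
means = pd.DataFrame(data.data[:, :10], columns=data.feature_names[:10])

corr = means.corr()
# radius, perimeter, and area are almost perfectly correlated
print(corr.loc["mean radius", ["mean perimeter", "mean area"]].round(3))
```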
6
Not included: radius and perimeter (correlated with area) and concave points (correlated with concavity and many other features). Compactness and concavity are also correlated. Red = Malignant, Black = Benign. Here we see the two-way plots for each pair of variables. The observations are fairly interspersed in two dimensions, with no obviously best linear boundary. We did notice that concavity and compactness were correlated, but less strongly than some of the other correlations present, so we decided to keep both, having already removed the concave points measurement.
7
Decision Tree. 13 terminal nodes. Reduced to four variables: concavity, area, texture, and smoothness. Training error: 3.015%. Test error: 11.12%. Our decision tree only considered concavity, area, texture, and smoothness. The low training error suggests some overfitting, and we hoped to improve on the 11.12% test error using additional techniques.
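A minimal sketch of the unpruned tree fit in Python (the slides' numbers presumably come from R's tree package, so scikit-learn's node count and error rates will differ; the column indices for the seven retained mean predictors are our assumption about sklearn's feature ordering, and the split seed is arbitrary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
# the 7 mean predictors kept above: texture, area, smoothness, compactness,
# concavity, symmetry, fractal dimension (indices in sklearn's ordering)
cols = [1, 3, 4, 5, 6, 8, 9]
X_train, X_val, y_train, y_val = train_test_split(
    X[:, cols], y, test_size=0.30, stratify=y, random_state=1
)

# fully grown tree: near-zero training error, i.e. it overfits
tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
train_err = 1 - tree.score(X_train, y_train)
val_err = 1 - tree.score(X_val, y_val)
print(f"{tree.get_n_leaves()} leaves, train {train_err:.2%}, validation {val_err:.2%}")
```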
8
Cross Validation of Decision Tree
6 terminal nodes. Considered concavity, area, and texture. Training error rate increased to 4.774%. Test misclassification error rate reduced to 9.94%. Using cross validation (default of 10 folds) with misclassification error to guide the pruning, we came up with a 6-node tree, which no longer split on smoothness as the full tree did. The training error rate of the pruned tree increased to 4.774%, but the error rate on the validation set improved to 9.94%. We continued trying to improve the prediction rate.
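Scikit-learn prunes by cost-complexity alpha rather than directly by node count, so a CV-guided pruning sketch looks like this (it will not reproduce the slide's 6-node tree exactly; predictor indices and seed are assumptions as before):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cols = [1, 3, 4, 5, 6, 8, 9]  # assumed indices of the 7 mean predictors
X_train, X_val, y_train, y_val = train_test_split(
    X[:, cols], y, test_size=0.30, stratify=y, random_state=1
)

full = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# candidate pruning strengths from the cost-complexity path
path = full.cost_complexity_pruning_path(X_train, y_train)
alphas = [max(a, 0.0) for a in path.ccp_alphas]  # guard against float round-off

# 10-fold CV over alpha; accuracy = 1 - misclassification rate
search = GridSearchCV(
    DecisionTreeClassifier(random_state=1), {"ccp_alpha": alphas}, cv=10
).fit(X_train, y_train)
pruned = search.best_estimator_
print(pruned.get_n_leaves(), "leaves, validation error",
      round(1 - pruned.score(X_val, y_val), 4))
```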
9
Bagging and Importance
500 trees. OOB error rate of 4.77%. Validation error rate of 9.94%, the same as the pruned tree. Using bagging, we grew 500 trees; the out-of-bag error rate matched the pruned tree's training error, and the validation error was also the same as the pruned tree's. By the Gini index, concavity had the highest importance, but by accuracy, area was the most important. From here we moved on to the random forest method.
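Bagging can be sketched as a random forest that considers every predictor at each split (max_features=None); scikit-learn's feature_importances_ gives the Gini-based ranking described above, and permutation importance would give the accuracy-based one. Exact error rates will differ from the slide's R output; indices and seed are assumptions:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
cols = [1, 3, 4, 5, 6, 8, 9]  # assumed indices of the 7 mean predictors
names = data.feature_names[cols]
X_train, X_val, y_train, y_val = train_test_split(
    data.data[:, cols], data.target,
    test_size=0.30, stratify=data.target, random_state=1
)

# bagging = random forest with all features eligible at every split
bag = RandomForestClassifier(
    n_estimators=500, max_features=None, oob_score=True, random_state=1
).fit(X_train, y_train)

print("OOB error", round(1 - bag.oob_score_, 4),
      "validation error", round(1 - bag.score(X_val, y_val), 4))
print("top Gini importance:", names[np.argmax(bag.feature_importances_)])
```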
10
Random Forest and Importance
500 trees, 2 variables per split. OOB error rate of 5.03%. Validation error rate improved to 8.187%. For the random forest we again grew 500 trees; we found that 2 variables per split gave the best prediction accuracy, with an out-of-bag error rate of 5.03% and an improved validation error rate of 8.187%. The importance rankings matched bagging for the top 2 variables, but further down the order changed: compactness ranked higher under random forest than under bagging, and texture ranked lower.
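The random-forest variant only changes max_features (2 variables per split, as on the slide); again a sketch whose error rates will differ somewhat from R's randomForest, with assumed predictor indices and seed:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
cols = [1, 3, 4, 5, 6, 8, 9]  # assumed indices of the 7 mean predictors
X_train, X_val, y_train, y_val = train_test_split(
    data.data[:, cols], data.target,
    test_size=0.30, stratify=data.target, random_state=1
)

# 500 trees, 2 candidate variables per split, with out-of-bag scoring
rf = RandomForestClassifier(
    n_estimators=500, max_features=2, oob_score=True, random_state=1
).fit(X_train, y_train)

print("OOB error", round(1 - rf.oob_score_, 4),
      "validation error", round(1 - rf.score(X_val, y_val), 4))
# Gini importances, highest first
order = np.argsort(rf.feature_importances_)[::-1]
print(data.feature_names[cols][order])
```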
11
Training and OOB Error Plots
Here we see plots for each method of the training out-of-bag error rates and the class-specific rates for malignant and benign. As more trees are grown, the error estimates become more stable in general. As expected, the error rate for the less frequent class (M) is higher than for B.
12
Boosting Results. Boosting with 10-fold CV. Highest CV accuracy with 150 iterations and tree depth of 2, but this did not improve on our previous CV error rates. These are training (CV) error rates, not errors on the validation set; the best boosting configuration had a CV error rate of 6.54%, which was worse than the previous methods. We did not know how to apply the boosted tree to the validation set.
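Tuning a boosted model over iteration count and tree depth with 10-fold CV, then scoring it on the held-out validation set (the step the slide skipped), can be sketched as follows; gradient boosting here stands in for whichever R boosting routine the slides used, and indices and seed are assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
cols = [1, 3, 4, 5, 6, 8, 9]  # assumed indices of the 7 mean predictors
X_train, X_val, y_train, y_val = train_test_split(
    X[:, cols], y, test_size=0.30, stratify=y, random_state=1
)

# 10-fold CV over the same grid the slide describes: iterations x depth
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=1),
    {"n_estimators": [50, 100, 150], "max_depth": [1, 2, 3]},
    cv=10,
).fit(X_train, y_train)

print("best params:", grid.best_params_)
# GridSearchCV refits the best model on all of X_train, so scoring the
# validation set applies the tuned booster directly
print("validation error:", round(1 - grid.score(X_val, y_val), 4))
```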
13
Linear Discriminant Model
All 7 predictors. Training error: 7.035%. Test error: 8.187%. This slide shows the LDA splits and the training split after LDA was fitted to the training set. The test error rate of 8.187% matches the earlier random forest result.
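An LDA fit on the same 7 predictors is nearly a one-liner with scikit-learn; a sketch under the same assumed indices and seed, so the error rates only approximate the slide's 7.035%/8.187%:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
cols = [1, 3, 4, 5, 6, 8, 9]  # assumed indices of the 7 mean predictors
X_train, X_val, y_train, y_val = train_test_split(
    X[:, cols], y, test_size=0.30, stratify=y, random_state=1
)

# LDA: linear decision boundary from a pooled covariance estimate
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
print("train error", round(1 - lda.score(X_train, y_train), 4),
      "validation error", round(1 - lda.score(X_val, y_val), 4))
```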
14
This slide shows the spread of the validation-set data before and after fitting the LDA predictions. You can see the misclassified observations change color in the bottom plot. Based on the early visualizations of the data, we suspected QDA might classify better, so we tried that next.
15
QDA. Training error: 4.77%. Testing error: 4.68%. These are the decision boundaries for each specific two-variable combination. The next slide shows the actual training vs. test splits.
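QDA differs from LDA only in fitting a separate covariance matrix per class, which bends the decision boundary; a sketch under the same assumed indices and seed:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
cols = [1, 3, 4, 5, 6, 8, 9]  # assumed indices of the 7 mean predictors
X_train, X_val, y_train, y_val = train_test_split(
    X[:, cols], y, test_size=0.30, stratify=y, random_state=1
)

# quadratic boundary from class-specific covariance estimates
qda = QuadraticDiscriminantAnalysis().fit(X_train, y_train)
print("train error", round(1 - qda.score(X_train, y_train), 4),
      "validation error", round(1 - qda.score(X_val, y_val), 4))
```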
16
Actual classes in the QDA test split
17
Predicted class in the QDA test split.
18
Results Recap (error rates from the slides above; OOB rates shown where the slides reported them in place of training error)

Model               | Training error | Test error
Base Decision Tree  | 3.015%         | 11.12%
CV Decision Tree    | 4.774%         | 9.94%
Bagging             | 4.77% (OOB)    | 9.94%
Random Forest       | 5.03% (OOB)    | 8.187%
LDA                 | 7.035%         | 8.187%
QDA                 | 4.77%          | 4.68%

Overall, QDA gave the best results.