Classifying the Thyroid Disease

Introduction: Objective, Project Plan
Data Selection: Property, Preprocessing
Various Approaches to Classify the Thyroid Disease: C4.5 / C5, SVM, ANN, Ensemble
Conclusion
Korea University, Industrial System Information Engineering

Introduction - Objective of the Project
Experience data mining as a part of the KDD process
Focus on using various data mining techniques
Our objective is to find a model (classifier): C = f(A)
Evaluate the constructed models
Tools: R GUI with Tinn-R
Data mining draws on AI, machine learning, pattern recognition, statistics, and database systems

Introduction - Project Plan
4/10 First team meeting
~4/26 Find existing research and a data set for the project
4/28 Submit an initial proposal
5/10 Change the subject of the project
~5/27 Try to get a suitable data set for the project
5/29 Write a modified proposal
6/4 Submit the modified proposal
6/6 Decision tree and SVM classifier modeling
6/10 Ensemble and ANN model construction
6/16 Integrate the results and write the final report
6/18 Submit the final report and presentation

Data Selection - Property of Data set
Thyroid Disease data set from the UCI Machine Learning Repository
Attributes
29 nominal (T/F, M/F, etc.) and ratio attributes
Nominal attributes have text values
Some attributes are highly correlated
Data instances
2800 training instances, some of which contain missing values
972 test instances, which also contain some missing values

Parallel Coordinate Plot
Example code:
    library(lattice)
    # parallelplot() is the current name; older lattice releases call it parallel()
    parallelplot(~hypo.data[, 1:22])
There are too many attributes to analyze the correlation between attributes and classes in a single plot.

Parallel Coordinate Plot
Example code:
    library(lattice)
    attach(hypo.data)
    # One panel per Diagnosis class, restricted to eight selected attributes
    parallelplot(~hypo.data[, c(1, 2, 17, 18, 19, 20, 21, 22)] | Diagnosis,
                 groups = Diagnosis)
According to this plot, the attributes FTI and TT4 may separate primary and compensated hypothyroid.

Data Selection - Data Preprocessing
Dimensionality Reduction: eliminate highly correlated attributes; select meaningful attributes
Control Anomaly/Missing Values: replace these with estimated values
Attribute Transformation: text values to integer values

Dimensionality Reduction (29 attributes to 22)
For each instance, the attributes TSH, T3, TT4, T4U, and FTI are unknown when the corresponding "measured" attribute is FALSE. Replace these unknowns with zero.
e.g.) If TSH measured is FALSE, then the value of TSH is unknown; TSH measured is highly correlated with TSH.
Each "measured" attribute is therefore meaningless: DELETE ATTRIBUTES
The values of TBG measured are all FALSE, and the TBG values are all unknown as well: DELETE ATTRIBUTES
ID is a nominal attribute that only identifies an instance uniquely: DELETE ATTRIBUTES
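The reduction rules above can be sketched in base R. This is only an illustration: the toy data frame, its column names, and its values are assumptions, not the project's actual hypo.data.

```r
# Toy stand-in for hypo.data; column names and values are hypothetical.
toy <- data.frame(
  ID           = c(101, 102, 103),
  TSH.measured = c(TRUE, FALSE, TRUE),
  TSH          = c(1.3, NA, 4.1),
  T3.measured  = c(FALSE, TRUE, TRUE),
  T3           = c(NA, 2.0, 1.9),
  TBG.measured = c(FALSE, FALSE, FALSE),
  TBG          = c(NA, NA, NA)
)

# Replace an unknown lab value with zero when its "measured" flag is FALSE.
for (lab in c("TSH", "T3")) {
  flag <- paste0(lab, ".measured")
  toy[[lab]][!toy[[flag]]] <- 0
}

# Delete the "measured" flags, the all-unknown TBG column, and the ID column.
drop_cols <- c(grep("\\.measured$", names(toy), value = TRUE), "TBG", "ID")
reduced <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
print(reduced)
```

Only the lab-value columns remain afterwards, which mirrors the step from 29 attributes down to 22.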

Anomaly
One age value was entered as 455. It was presumably meant to be 45 or 55, so we replaced 455 with 50.

Missing Value
We decided to choose some patients who are similar to the patient whose age value is missing.
We chose 2 patients using Excel, then replaced the missing age value with the mean of their 2 values.
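The two age repairs could be expressed in R roughly as follows. The ages and the indices of the "similar" patients are hypothetical placeholders, since the slides say those patients were picked by hand in Excel.

```r
# Hypothetical ages; 455 is the anomaly and NA is the missing value.
age <- c(41, 455, 63, NA, 29)

# Anomaly: 455 was presumably meant to be 45 or 55, so use the midpoint (50).
age[which(age == 455)] <- mean(c(45, 55))

# Missing value: average the ages of two similar patients.
# Rows 1 and 3 are placeholders for the two patients chosen by hand.
similar_rows <- c(1, 3)
age[is.na(age)] <- mean(age[similar_rows])
print(age)
```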

Missing Value
Replaced with all possible values, weighted by their probability distribution (1:2)

Attribute Transformation
All nominal attributes except SEX have TRUE/FALSE values.
Transform these text values to the integer values 0 (FALSE) and 1 (TRUE).
The attribute SEX has MALE/FEMALE values, which are also text values.
Transform these to the integer values 1 (MALE) and 2 (FEMALE).
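A minimal sketch of this recoding in base R, assuming the text values are stored as the strings shown on the slide. The data frame and the column name on.thyroxine are used only for illustration.

```r
# Hypothetical rows with text-valued nominal attributes.
toy <- data.frame(
  sex          = c("MALE", "FEMALE", "FEMALE"),
  on.thyroxine = c("FALSE", "TRUE", "FALSE"),
  stringsAsFactors = FALSE
)

# TRUE/FALSE text values become 1/0.
toy$on.thyroxine <- as.integer(toy$on.thyroxine == "TRUE")

# MALE/FEMALE text values become 1/2.
toy$sex <- ifelse(toy$sex == "MALE", 1L, 2L)
print(toy)
```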

Various Approaches to Classify the Thyroid Disease
Various Approaches - C4.5 / C5 Method
We decided to construct our first classification model with a decision tree.
A decision tree is an easy way to build a classifier.
It is based on Hunt's algorithm.
The impurity of the leaf nodes is measured by entropy.
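For reference, the entropy of a node with class proportions p_i is the negative sum of p_i * log2(p_i), which is at most 2 for four classes. The sketch below computes it in base R. The class counts are an assumption pieced together from the result tables in these slides, not figures the slides state directly.

```r
# Entropy (in bits) of a vector of class counts.
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]            # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

# Assumed class counts for the 2800 training instances (illustrative).
class_counts <- c(negative = 2580, compensated = 154,
                  primary = 64, secondary = 2)
print(entropy(class_counts))

# Four equally likely classes give the maximum of 2 bits.
print(entropy(c(1, 1, 1, 1)))
```

A heavily unbalanced class distribution yields a value far below the maximum, which is why the slides call the initial entropy "too low".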

We used the tree library to grow the decision tree in RGui.
Example code:
    library(tree)
    # Fit a classification tree on all attributes
    hypo.tree <- tree(Diagnosis ~ ., data = hypo.data)
    # x: attribute data to classify; y: its actual Diagnosis labels
    pred.tree <- predict(hypo.tree, x, type = "class")
    table(pred.tree, y)
    plot(hypo.tree, type = "uniform")
    text(hypo.tree, cex = 0.7)

Cross Validation of the Decision Tree
According to this result, the optimal model, with low deviance, has 7 leaf nodes.
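One way such a cross-validation might be run with the tree package is sketched below. The slides do not show the actual code, so the calls chosen here (cv.tree and prune.tree) and the use of the built-in iris data as a stand-in for hypo.data are assumptions.

```r
# Runs only if the tree package is installed.
if (requireNamespace("tree", quietly = TRUE)) {
  library(tree)
  set.seed(1)
  fit <- tree(Species ~ ., data = iris)     # iris stands in for hypo.data
  cv  <- cv.tree(fit, FUN = prune.tree)     # cross-validated deviance by tree size
  best_size <- cv$size[which.min(cv$dev)]   # number of leaves with lowest deviance
  pruned <- prune.tree(fit, best = best_size)
  print(best_size)
}
```

Plotting cv$dev against cv$size gives the kind of curve from which a leaf count such as 7 would be read off.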

Decision Tree

Training Set
Accuracy = 2784/2800 ≈ 0.9943
The entropy of the original data set is too low (maximum 2)
Confusion matrix (Actual Class vs. Predicted Class)
Classes: Compensated Hypothyroid, Negative, Primary Hypothyroid, Secondary Hypothyroid
Cell values: 153 3 6 1 2573 2 4 58

Test Set
Accuracy = 968/972 ≈ 0.9959
Confusion matrix (Actual Class vs. Predicted Class)
Classes: Compensated Hypothyroid, Negative, Primary Hypothyroid, Secondary Hypothyroid
Cell values: 40 2 1 898 30
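Accuracy figures of this kind are the diagonal of the confusion matrix divided by its total. A small base R sketch with made-up predictions (not the project's data) shows the calculation, plus the per-class rate that exposes a poorly classified small class.

```r
# Accuracy = correctly classified instances / all instances.
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

# Hypothetical predictions and actual labels for six instances.
lv     <- c("neg", "pos")
pred   <- factor(c("neg", "neg", "pos", "pos", "neg", "pos"), levels = lv)
actual <- factor(c("neg", "pos", "pos", "pos", "neg", "neg"), levels = lv)

cm <- table(Predicted = pred, Actual = actual)
print(cm)
print(accuracy(cm))

# Fraction of each actual class that was predicted correctly.
print(diag(cm) / colSums(cm))
```

Overall accuracy can stay high while a rare class is missed entirely, so the per-class rates are worth checking alongside it.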

Various Approaches - Support Vector Machine(SVM)
A Support Vector Machine (SVM) is an efficient model that classifies instances by finding a linear or non-linear hyperplane.
It is a suitable model when the data set has many dimensions.
It is very hard to visualize all data instances with many attributes. However, by plotting two attributes and fixing the others at some slices, we can visualize the instances, including the relationship between the attributes.

SVM modeling using R
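A minimal sketch of how such a model might be fit and plotted, assuming the e1071 package (the slides do not name the SVM library) and using the built-in iris data as a stand-in for hypo.data.

```r
# Runs only if the e1071 package is installed.
if (requireNamespace("e1071", quietly = TRUE)) {
  library(e1071)
  fit  <- svm(Species ~ ., data = iris)   # radial kernel by default
  pred <- predict(fit, iris)
  print(table(Predicted = pred, Actual = iris$Species))

  # Two attributes on the axes; the remaining ones are held fixed via slice.
  plot(fit, iris, Petal.Width ~ Petal.Length,
       slice = list(Sepal.Width = 3, Sepal.Length = 4))
}
```

The slice argument is what makes a two-dimensional view of a higher-dimensional model possible: each attribute not on an axis is pinned to one value.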

We thought the attributes FTI and TT4 were suitable for separating the instances.
This figure shows how FTI and TT4 separate the instances of the data set, but all records in this area are classified as negative.

Now we change the axes and set some slices, which reduces the number of dimensions.
The area painted light pink indicates that instances in that area would be predicted as primary hypothyroid.

Prediction of Training Set
Accuracy = 2658/2800 ≈ 0.9493
The entropy of the original data set is too low (maximum 2)
Confusion matrix (Actual Class vs. Predicted Class)
Classes: Compensated Hypothyroid, Negative, Primary Hypothyroid, Secondary Hypothyroid
Cell values: 30 2 122 2579 13 1 49

Prediction of Test Set
Accuracy = 933/972 ≈ 0.9599
Confusion matrix (Actual Class vs. Predicted Class)
Classes: Compensated Hypothyroid, Negative, Primary Hypothyroid, Secondary Hypothyroid
Cell values: 11 4 29 901 6 21

Various Approaches - Artificial Neural Networks(ANN)
Concept of ANN
An artificial neural network, usually called a "neural network", is a computational model that tries to simulate the structure and/or functional aspects of biological neural networks.
In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase.
In a neural network model, simple nodes are connected together to form a network of nodes.
Its practical use comes with algorithms designed to alter the strength (weights) of the connections in the network to produce a desired signal flow.

Training / Test error rate
According to this result, the number of hidden nodes to use in the ANN is 18.

Construction of ANN classifier
We used the nnet library.
Example code:
    library(nnet)
    y <- hypo.data$Diagnosis
    # 18 hidden nodes, weight decay 5e-4, at most 300 iterations
    hypo.ann <- nnet(Diagnosis ~ ., data = hypo.data, size = 18,
                     decay = 5e-4, maxit = 300)
    hypo.ann
    summary(hypo.ann)
    # Predict the training data and tabulate against the actual classes
    pred.ann <- predict(hypo.ann, hypo.data, type = "class")
    table(pred.ann, y)

A network with 490 weights
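The figure of 490 can be checked by hand for a single-hidden-layer network: each hidden node has one weight per input plus a bias, and each output node has one weight per hidden node plus a bias. The helper name below is only illustrative.

```r
# Weights in a fully connected network with one hidden layer and bias terms.
n_weights <- function(inputs, hidden, outputs) {
  (inputs + 1) * hidden + (hidden + 1) * outputs
}

# 22 attributes, 18 hidden nodes, 4 classes: 23 * 18 + 19 * 4
print(n_weights(22, 18, 4))
```

With 22 inputs, 18 hidden nodes, and 4 output classes this gives 414 + 76 = 490.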

Network structure: input nodes X1 to X22 plus a bias, hidden nodes Hidden 1 to Hidden 18 plus a bias, and output nodes Class 1 to Class 4

Prediction of Training Set
Accuracy = 2798/2800 ≈ 0.9993
This is the highest training accuracy of all the models
Confusion matrix (Actual Class vs. Predicted Class)
Classes: Compensated Hypothyroid, Negative, Primary Hypothyroid, Secondary Hypothyroid
Cell values: 152 2580 2 64

Prediction of Test Set
Accuracy = 954/972 ≈ 0.9815
Confusion matrix (Actual Class vs. Predicted Class)
Classes: Compensated Hypothyroid, Negative, Primary Hypothyroid, Secondary Hypothyroid
Cell values: 35 7 1 4 891 2 3 28

Various Approaches - Ensemble Methods
Bagging - Algorithm
Also known as bootstrap aggregation
Sampling with replacement
Build a classifier on each bootstrap sample
Step 1: Draw B bootstrap samples of size N from the sample, then construct a classifier model from each bootstrap sample.
Step 2: Aggregate the B decision trees from step 1.
Step 3: Assign each instance the class chosen by the majority of the trees from step 2.
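The three steps can be sketched as below. This is not the project's code: rpart is swapped in for the tree library because it ships with most R installations, iris stands in for hypo.data, and B = 25 is an arbitrary choice.

```r
# Runs only if the rpart package is installed.
if (requireNamespace("rpart", quietly = TRUE)) {
  library(rpart)
  set.seed(1)
  B <- 25                 # number of bootstrap samples (arbitrary)
  N <- nrow(iris)         # iris stands in for hypo.data

  # Step 1: draw B bootstrap samples of size N and fit one tree per sample.
  trees <- lapply(seq_len(B), function(b) {
    idx <- sample(N, N, replace = TRUE)
    rpart(Species ~ ., data = iris[idx, ], method = "class")
  })

  # Step 2: collect every tree's prediction (N rows, B columns).
  votes <- sapply(trees, function(tr) {
    as.character(predict(tr, iris, type = "class"))
  })

  # Step 3: majority vote across the B trees for each instance.
  bagged <- apply(votes, 1, function(v) names(which.max(table(v))))
  print(mean(bagged == iris$Species))
}
```

Each row of votes holds the B individual predictions for one instance, so a single row is enough to trace how the vote for that instance was decided.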

Bagging - Example Code - Ensemble Methods

Example of Majority Vote (Tree 1)

Example of Majority Vote
According to the majority vote, the 80th instance is predicted as NEGATIVE, which is the same as its actual class.

Bagging - Example Code - Ensemble Methods

Result of Bagging (Majority Vote)

Bagging
Accuracy = 2790/2800 ≈ 0.9964
Secondary hypothyroid is misclassified again
Confusion matrix (Actual Class vs. Predicted Class)
Classes: Compensated Hypothyroid, Negative, Primary Hypothyroid, Secondary Hypothyroid
Cell values: 154 4 2572 2 64

Conclusion

Mining data from a raw data set was a valuable experience for us.
A limitation of our project is that the data set we chose does not have a balanced distribution of classes.
e.g.) Only two instances belong to the class secondary hypothyroid.
Because there are not enough instances, the models we constructed may misclassify some classes, especially secondary hypothyroid.

Data mining techniques can be applied to pathology to diagnose disease.
They can also be used for other medical decisions; MRI or CT scans may be good examples.
Because our R programming skills are limited, we could not do everything we wanted.
We therefore referred to research by J.R. Quinlan when growing the decision trees.

Comparing training error, the ANN model was the best classifier; comparing test error, however, the decision tree classified instances best.
Most attributes of the data set we used take TRUE or FALSE values. Because decision trees are strong at handling discrete values, they performed well.
The ensemble of decision trees built with the bagging method was also very accurate, because of its majority voting rule.
However, because the number of instances is too small and the initial entropy is too low, it was hard to classify the small class.
In contrast, only the ANN model classified that class well, even though it contains only two instances.

Diagnosing serious diseases in pathology is fascinating, but critical.
For example, we might diagnose a patient as normal even though he or she has a very critical disease such as lung cancer.
For this reason, we think a very large cost should be assigned to misclassifying patients as normal/negative, and that not only the error rate of the model but also the costs of its predictions should be considered.
Because many considerations go into setting costs, it is hard to estimate them accurately, so we could not apply them to our models.
Even if this classifier can diagnose thyroid disease, the right to the final decision belongs to the doctor.
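The idea of weighing costs alongside the error rate can be illustrated with a cost matrix multiplied cell by cell with a confusion matrix. All counts and costs below are invented for illustration; the project did not estimate any.

```r
classes <- c("negative", "disease")

# Hypothetical confusion matrix: rows are predictions, columns are actual classes.
cm <- matrix(c(90, 2,
                3, 5),
             nrow = 2, byrow = TRUE,
             dimnames = list(Predicted = classes, Actual = classes))

# Hypothetical costs: calling a diseased patient negative is 50 times worse
# than calling a healthy patient diseased.
cost <- matrix(c(0, 50,
                 1,  0),
               nrow = 2, byrow = TRUE,
               dimnames = list(Predicted = classes, Actual = classes))

error_rate <- 1 - sum(diag(cm)) / sum(cm)
total_cost <- sum(cm * cost)
print(error_rate)
print(total_cost)
```

Two classifiers with the same error rate can differ greatly in total cost, depending on which kind of mistake they make.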

Thank you
Any questions?
