1
Classifying the Thyroid Disease
2
Introduction
Outline: Introduction (Objective, Project Plan); Data Selection (Property, Preprocessing); Various Approaches to Classify the Thyroid Disease (C4.5 / C5, SVM, ANN, Ensemble); Conclusion
Korea University, Industrial System Information Engineering
3
Introduction - Objective of the Project
Experience data mining as part of the KDD process
Focus on using various data mining techniques
Our objective is to find a model (classifier), C = f(A)
Evaluate the constructed models
Tools: the R GUI with Tinn-R
Diagram: Data Mining at the intersection of AI, Machine Learning, Pattern Recognition, Statistics, and Database Systems
4
Introduction - Project Plan
4/10 First team meeting
~4/26 Search for existing research and a data set for the project
4/28 Submit an initial proposal
5/10 Change the subject of the project
~5/27 Try to get a suitable data set for the project
5/29 Write a modified proposal
6/4 Submit the modified proposal
6/6 Decision tree and SVM classifier modeling
6/10 Ensemble and ANN model construction
6/16 Integrate the results and write the final report
6/18 Submit the final report and presentation
5
Data Selection
6
Data Selection - Property of Data set
Thyroid Disease data set from the UCI Machine Learning Repository
Attributes
29 attributes: nominal (T/F, M/F, etc.) and ratio
Nominal attributes have text values
Some attributes are highly correlated
Data instances
2800 training instances, some with missing values
972 test instances, also with some missing values
7
Data Selection - Property of Data set
Parallel Coordinate Plot
Example code
    library(lattice)
    # parallelplot() is the current lattice name for the older parallel()
    parallelplot(~hypo.data[, 1:22])
There are too many attributes to analyze the correlation between attributes and classes in a single plot
8
Data Selection - Property of Data set
Parallel Coordinate Plot
Example code
    library(lattice)
    attach(hypo.data)
    # parallelplot() is the current lattice name for the older parallel()
    parallelplot(~hypo.data[, c(1, 2, 17, 18, 19, 20, 21, 22)] | Diagnosis,
                 groups = Diagnosis)
According to this plot, the attributes FTI and TT4 may separate primary and compensated hypothyroid
9
Data Selection - Data Preprocessing
Dimensionality Reduction: eliminate highly correlated attributes; select meaningful attributes
Control Anomaly/Missing Values: replace these with estimated values
Attribute Transformation: text values to integer values
10
Data Selection - Data Preprocessing
Dimensionality Reduction (29 attributes to 22)
For each instance, the attributes TSH, T3, TT4, T4U, and FTI are unknown when the corresponding "measured" attribute is FALSE; e.g., if TSH measured is FALSE, then the value of TSH is unknown
Replace these unknowns with zero
Each "measured" attribute is highly correlated with its value attribute (e.g., TSH measured with TSH), so the "measured" attributes carry no additional information: DELETE ATTRIBUTES
All values of TBG measured are FALSE, and all TBG values are unknown as well: DELETE ATTRIBUTES
ID is a nominal attribute that only identifies an instance uniquely: DELETE ATTRIBUTES
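The reduction described on this slide could be sketched in R roughly as follows. The data frame, its column names, and its values are invented stand-ins for hypo.data, and unknown values are assumed to be stored as NA.

```r
# Toy stand-in for hypo.data; all names and values are invented for illustration.
hypo <- data.frame(
  ID           = c(101, 102, 103),
  TSH.measured = c(TRUE, FALSE, TRUE),
  TSH          = c(1.3, NA, 4.1),
  T3.measured  = c(FALSE, TRUE, TRUE),
  T3           = c(NA, 2.0, 1.9),
  TBG.measured = c(FALSE, FALSE, FALSE),
  TBG          = c(NA, NA, NA)
)

# Replace the unknown lab values (assumed to be NA here) with zero.
for (lab in c("TSH", "T3")) {
  hypo[[lab]][is.na(hypo[[lab]])] <- 0
}

# Drop the "measured" flags, the all-unknown TBG column, and the ID column.
to_drop <- c("ID", "TBG", grep("measured$", names(hypo), value = TRUE))
reduced <- hypo[, !(names(hypo) %in% to_drop), drop = FALSE]
print(reduced)
```

Only the value columns TSH and T3 remain in this toy example; the real data set would keep the other attributes as well.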
11
Data Selection - Data Preprocessing
Anomaly
One age value is 455; the intended value was presumably 45 or 55
Replace 455 with 50
12
Data Selection - Data Preprocessing
Missing Value
We selected patients who are similar to the patient whose age value is missing
We chose 2 such patients using Excel, then replaced the missing age with the mean of their 2 age values
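The same idea can be sketched in R instead of Excel. How "similar" was judged is not stated on the slide, so the sketch below assumes Euclidean distance on two scaled attributes; the records and the choice of TT4 and FTI are invented for illustration.

```r
# Invented toy records; only the age of patient 3 is missing.
patients <- data.frame(
  Age = c(40, 62, NA, 35, 58),
  TT4 = c(110, 95, 97, 130, 99),
  FTI = c(105, 90, 92, 125, 96)
)

missing_row <- which(is.na(patients$Age))
features <- scale(patients[, c("TT4", "FTI")])   # put both columns on one scale

# Euclidean distance from the incomplete record to every other record
# (an assumed notion of similarity, not the one used in the project).
diffs <- sweep(features, 2, features[missing_row, ])
dists <- sqrt(rowSums(diffs^2))
dists[missing_row] <- Inf                        # never pick the record itself

nearest <- order(dists)[1:2]                     # the two most similar patients
patients$Age[missing_row] <- mean(patients$Age[nearest])
print(patients)
```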
13
Data Selection - Data Preprocessing
Missing Value
Replaced with all possible values, weighted by their probability distribution (1:2)
14
Data Selection - Data Preprocessing
Attribute Transformation
All nominal attributes except SEX have TRUE/FALSE text values; transform these to the integer values 0 (FALSE) and 1 (TRUE)
The attribute SEX has the text values MALE/FEMALE; transform these to the integer values 1 (MALE) and 2 (FEMALE)
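A minimal R sketch of this recoding is shown below. The column names and the text codes ("t"/"f", "M"/"F") are assumptions made for the example, not taken from the slides.

```r
# Invented toy rows standing in for the text-valued nominal attributes.
raw <- data.frame(
  sex          = c("M", "F", "F"),
  on.thyroxine = c("f", "t", "f"),
  sick         = c("t", "f", "f"),
  stringsAsFactors = FALSE
)

coded <- raw
# TRUE/FALSE-style text values become 1/0.
for (flag in c("on.thyroxine", "sick")) {
  coded[[flag]] <- ifelse(raw[[flag]] == "t", 1L, 0L)
}
# SEX becomes 1 (male) or 2 (female), as on the slide.
coded$sex <- ifelse(raw$sex == "M", 1L, 2L)
print(coded)
```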
15
Various Approaches to Classify the Thyroid Disease
16
Various Approaches - C4.5 / C5 Method
We decided to construct the first classification model using a decision tree
A decision tree is an easy way to build a classifier
It is based on Hunt's algorithm
The impurity of the leaf nodes is measured by entropy
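As a small illustration of the entropy measure, the following R sketch computes the base-2 entropy of a vector of class counts. Both count vectors are hypothetical; with four classes the maximum is log2(4) = 2.

```r
# Shannon entropy (base 2) of a vector of class counts.
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]            # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

balanced <- c(25, 25, 25, 25)   # hypothetical: four equally likely classes
skewed   <- c(5, 90, 4, 1)      # hypothetical: one dominant class

print(entropy(balanced))        # 2, the maximum for four classes
print(entropy(skewed))          # much lower, because one class dominates
```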
17
Various Approaches - C4.5 / C5 Method
We used the tree library to grow the decision tree in RGui
Example code
    library(tree)
    hypo.tree <- tree(Diagnosis ~ ., data = hypo.data)
    # x and y are not defined on the slide; assumed here to be the
    # training data and its actual classes
    x <- hypo.data
    y <- hypo.data$Diagnosis
    pred.tree <- predict(hypo.tree, x, type = "class")
    table(pred.tree, y)
    plot(hypo.tree, type = "uniform")
    text(hypo.tree, cex = 0.7)
18
Various Approaches - C4.5 / C5 Method
Cross Validation of the Decision Tree
According to this result, the deviance is lowest when the tree has 7 leaf nodes, so a tree of that size is taken as the optimal model
19
Various Approaches - C4.5 / C5 Method
Decision Tree
20
Various Approaches - C4.5 / C5 Method
Training Set
Accuracy = 2784/2800 ≈ 0.9943
Entropy of the original data set is too low ( ; max 2)
Confusion matrix of actual vs. predicted class (Compensated Hypothyroid, Negative, Primary Hypothyroid, Secondary Hypothyroid), cell values in slide order: 153 3 6 1 2573 2 4 58
21
Various Approaches - C4.5 / C5 Method
Test Set
Accuracy = 968/972 ≈ 0.9959
Confusion matrix of actual vs. predicted class (Compensated Hypothyroid, Negative, Primary Hypothyroid, Secondary Hypothyroid), cell values in slide order: 40 2 1 898 30
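The accuracy figures are the correctly classified instances (the diagonal of the confusion matrix) divided by all instances. A short R sketch with a hypothetical 3-class matrix, followed by the two ratios reported for the decision tree:

```r
# Accuracy = sum of the diagonal / sum of all cells.
accuracy <- function(confusion) sum(diag(confusion)) / sum(confusion)

# Hypothetical confusion matrix (rows: predicted, columns: actual).
cm <- matrix(c(8,  1, 0,
               2, 85, 1,
               0,  0, 3),
             nrow = 3, byrow = TRUE,
             dimnames = list(predicted = c("a", "b", "c"),
                             actual    = c("a", "b", "c")))
print(accuracy(cm))            # 96 correct out of 100

# Ratios reported on the slides for the decision tree.
print(round(2784 / 2800, 4))   # training set
print(round(968 / 972, 4))     # test set
```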
22
Various Approaches - Support Vector Machine(SVM)
A Support Vector Machine (SVM) is an efficient model that classifies instances by finding a linear or non-linear separating hyperplane
It is a suitable model when the data set has many dimensions
It is very hard to visualize all data instances with many attributes; however, by plotting two attributes and fixing the others at chosen values (slices), we can visualize the instances and the relationship between the attributes
23
Various Approaches - Support Vector Machine(SVM)
SVM modeling using R
24
Various Approaches - Support Vector Machine(SVM)
We thought the attributes FTI and TT4 would be suitable for separating the instances
This figure shows how FTI and TT4 separate the instances of the data set, but all records in this area are classified as negative
25
Various Approaches - Support Vector Machine(SVM)
Now change the axes and add some slices, which reduce the number of dimensions
The area painted in light pink indicates that instances in that area would be predicted as primary hypothyroid
26
Various Approaches - Support Vector Machine(SVM)
Prediction of Training Set
Accuracy = 2658/2800 ≈ 0.9493
Entropy of the original data set is too low ( ; max 2)
Confusion matrix of actual vs. predicted class (Compensated Hypothyroid, Negative, Primary Hypothyroid, Secondary Hypothyroid), cell values in slide order: 30 2 122 2579 13 1 49
27
Various Approaches - Support Vector Machine(SVM)
Prediction of Test Set
Accuracy = 933/972 ≈ 0.9599
Confusion matrix of actual vs. predicted class (Compensated Hypothyroid, Negative, Primary Hypothyroid, Secondary Hypothyroid), cell values in slide order: 11 4 29 901 6 21
28
Various Approaches - Artificial Neural Networks(ANN)
Concept of ANN
An artificial neural network (ANN), usually called a "neural network", is a computational model that tries to simulate the structure and/or functional aspects of biological neural networks
In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase
In a neural network model, simple nodes are connected together to form a network of nodes
Its practical use comes with algorithms designed to alter the strength (weights) of the connections in the network to produce a desired signal flow
29
Various Approaches - Artificial Neural Networks(ANN)
Training / Test error rate
According to this result, 18 hidden nodes are used in the ANN
30
Various Approaches - Artificial Neural Networks(ANN)
Construction of ANN classifier
Used the nnet library
Example code
    library(nnet)
    y <- hypo.data$Diagnosis
    # 18 hidden nodes, weight decay 5e-4, at most 300 iterations
    hypo.ann <- nnet(Diagnosis ~ ., data = hypo.data,
                     size = 18, decay = 5e-4, maxit = 300)
    hypo.ann
    summary(hypo.ann)
    pred.ann <- predict(hypo.ann, hypo.data, type = "class")
    table(pred.ann, y)
31
Various Approaches - Artificial Neural Networks(ANN)
A network with 490 weights
32
Various Approaches - Artificial Neural Networks(ANN)
A network
Input layer: X1, ..., X22, plus a bias node
Hidden layer: Hidden 1, ..., Hidden 18, plus a bias node
Output layer: Class 1, Class 2, Class 3, Class 4
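The figure of 490 weights follows from the layer sizes in this diagram, assuming a fully connected network with one bias per hidden node and per output node. A short R check:

```r
n_inputs  <- 22   # X1 ... X22
n_hidden  <- 18   # Hidden 1 ... Hidden 18
n_outputs <- 4    # Class 1 ... Class 4

# Each hidden node has one weight per input plus a bias;
# each output node has one weight per hidden node plus a bias.
hidden_weights <- (n_inputs + 1) * n_hidden    # 23 * 18 = 414
output_weights <- (n_hidden + 1) * n_outputs   # 19 * 4  = 76
total_weights  <- hidden_weights + output_weights
print(total_weights)                           # 490
```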
33
Various Approaches - Artificial Neural Networks(ANN)
Prediction of Training Set
Accuracy = 2798/2800 ≈ 0.9993
Highest training accuracy of all the models
Confusion matrix of actual vs. predicted class (Compensated Hypothyroid, Negative, Primary Hypothyroid, Secondary Hypothyroid), cell values in slide order: 152 2580 2 64
34
Various Approaches - Artificial Neural Networks(ANN)
Prediction of Test Set
Accuracy = 954/972 ≈ 0.9815
Confusion matrix of actual vs. predicted class (Compensated Hypothyroid, Negative, Primary Hypothyroid, Secondary Hypothyroid), cell values in slide order: 35 7 1 4 891 2 3 28
35
Various Approaches - Ensemble Methods
Bagging – Algorithm
Also known as bootstrap aggregation
Sample with replacement and build a classifier on each bootstrap sample
Step 1 Draw B bootstrap samples of size N from the training sample, then construct a classifier on each bootstrap sample
Step 2 Aggregate the B decision trees from Step 1
Step 3 Assign each instance the class that receives the majority of the votes from Step 2
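The three steps can be sketched in base R as follows. This is not the project's code: the data are simulated, and a one-feature threshold "stump" (fit_stump and predict_stump, both hypothetical helpers) stands in for the decision trees so that the example runs without extra packages.

```r
set.seed(1)

# Simulated one-feature, two-class data (invented for illustration).
x <- c(rnorm(30, mean = 0), rnorm(30, mean = 3))
y <- rep(c("negative", "primary"), each = 30)

# Stand-in base learner: threshold halfway between the two class means.
fit_stump <- function(x, y) {
  list(cut = mean(c(mean(x[y == "negative"]), mean(x[y == "primary"]))))
}
predict_stump <- function(model, x) ifelse(x > model$cut, "primary", "negative")

B <- 5
N <- length(x)

# Step 1: draw B bootstrap samples (with replacement) and fit one model on each.
models <- lapply(seq_len(B), function(b) {
  idx <- sample(N, N, replace = TRUE)
  fit_stump(x[idx], y[idx])
})

# Step 2: collect the B predictions for every instance (N x B matrix).
votes <- sapply(models, predict_stump, x = x)

# Step 3: assign each instance the class with the most votes.
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))
print(mean(bagged == y))   # training accuracy of the bagged stumps
```

An odd B avoids tied votes when there are only two classes; with four classes, as in the thyroid data, a tie-breaking rule would also be needed.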
36
Bagging - Example Code - Ensemble Methods
37
Various Approaches - Ensemble Methods
Example of Majority Vote (Tree 1)
38
Various Approaches - Ensemble Methods
Example of Majority Vote (Tree 2)
39
Various Approaches - Ensemble Methods
Example of Majority Vote (Tree 3)
40
Various Approaches - Ensemble Methods
Example of Majority Vote (Tree 4)
41
Various Approaches - Ensemble Methods
Example of Majority Vote (Tree 5)
42
Various Approaches - Ensemble Methods
43
Various Approaches - Ensemble Methods
Example of Majority Vote
According to the majority vote, the class of the 80th instance is predicted as NEGATIVE, which is the same as its actual class
44
Bagging - Example Code - Ensemble Methods
45
Various Approaches - Ensemble Methods
Result of Bagging (Majority Vote)
46
Various Approaches - Ensemble Methods
Result of Bagging (Majority Vote)
47
Various Approaches - Ensemble Methods
Bagging
Accuracy = 2790/2800 ≈ 0.9964
Secondary Hypothyroid is misclassified again
Confusion matrix of actual vs. predicted class (Compensated Hypothyroid, Negative, Primary Hypothyroid, Secondary Hypothyroid), cell values in slide order: 154 4 2572 2 64
48
Conclusion
49
Conclusion
Mining data from raw data sets was a valuable experience for us
A limitation of our project is that the data set we chose has an unbalanced class distribution; e.g., only two instances belong to the class secondary hypothyroid
Because there are too few such instances, the models we constructed may misclassify some classes, especially secondary hypothyroid
50
Conclusion
Data mining techniques can be applied to pathology to diagnose disease
They can also support other medical decisions; interpreting MRI or CT scans may be a good example
Because our R programming skills are limited, we could not do everything we wanted
We therefore referred to the research of J.R. Quinlan when growing the decision trees
51
Conclusion
Comparing training error, the ANN model was the best classifier; comparing test error, however, the decision tree classified instances best
Most attributes of the data set we used take TRUE or FALSE values; because decision trees are strong at handling discrete values, the tree performed well
The ensemble model built from decision trees with bagging was also very accurate because of its majority voting rule
However, because the number of instances is small and the initial entropy is low, it was hard to classify the small class
Only the ANN model classified the smallest class well, even though that class has only two instances
52
Conclusion
Diagnosing serious diseases in pathology is fascinating but critical; for example, a model may diagnose a patient as normal even though he/she has a very critical disease such as lung cancer
For this reason, we think misclassifying a patient as normal/negative should carry a very high cost, and a model should be judged not only by its error rate but also by the cost of its predictions
Because assigning costs involves many considerations, it is hard to estimate costs accurately, so we could not apply them to our models
Even if this classifier can diagnose thyroid disease, the final decision rests with the doctor
53
Thank you
Any questions?