Classifying the Thyroid Disease


1 Classifying the Thyroid Disease

2 Introduction
Outline of the presentation:
Introduction: Objective, Project Plan
Data Selection: Property, Preprocessing
Various Approaches to Classify the Thyroid Disease: C4.5 / C5, SVM, ANN, Ensemble
Conclusion
Korea University, Industrial System Information Engineering

3 Introduction - Objective of the Project
Experience data mining as a part of the KDD process
Focus on using various data mining techniques
Our objective is to find a model (classifier), C = f(A)
Evaluate the constructed models
Tools: R GUI with Tinn-R
[Diagram: Data Mining in relation to AI, Machine Learning, Pattern Recognition, Statistics, and Database Systems]

4 Introduction - Project Plan
4/10: First team meeting
~4/26: Find existing research and a data set for the project
4/28: Submit an initial proposal
5/10: Change the subject of the project
~5/27: Try to get a suitable data set for the project
5/29: Write a modified proposal
6/4: Submit the modified proposal
6/6: Decision tree and SVM classifier modeling
6/10: Ensemble and ANN model construction
6/16: Integrate the results and write the final report
6/18: Submit the final report and presentation

5 Data Selection

6 Data Selection - Property of Data set
Thyroid Disease data set from the UCI Machine Learning Repository
Attributes
  29 nominal (T/F, M/F, etc.) and ratio attributes
  Nominal attributes have text values
  Some attributes are highly correlated
Data instances
  2800 training instances, some with missing values
  972 test instances, also with some missing values

7 Data Selection - Property of Data set
Parallel Coordinate Plot
Example code:
    library(lattice)  # parallelplot() was named parallel() in older lattice versions
    parallelplot(~hypo.data[, 1:22])
There are too many attributes to analyze the correlation between attributes and classes in a single plot.

8 Data Selection - Property of Data set
Parallel Coordinate Plot
Example code:
    library(lattice)
    # one panel per Diagnosis class, restricted to eight selected attributes
    parallelplot(~hypo.data[, c(1, 2, 17, 18, 19, 20, 21, 22)] | Diagnosis,
                 data = hypo.data, groups = Diagnosis)
According to this plot, the attributes FTI and TT4 may separate primary and compensated hypothyroid.

9 Data Selection - Data Preprocessing
Dimensionality Reduction
  Eliminate highly correlated attributes
  Select meaningful attributes
Control Anomaly/Missing Values
  Replace these with estimated values
Attribute Transformation
  Text values to integer values

10 Data Selection - Data Preprocessing
Dimensionality Reduction (29 attributes to 22)
For each instance, the attributes TSH, T3, TT4, T4U, and FTI are unknown when the corresponding "measured" attribute is FALSE; e.g., if TSH measured is FALSE, then the value of TSH is unknown. Replace these unknowns with zero.
Each "measured" attribute is highly correlated with its value attribute (e.g., TSH measured with TSH), so the "measured" attributes are meaningless. DELETE ATTRIBUTES
The values of TBG measured are all FALSE, and the TBG values are all unknown as well. DELETE ATTRIBUTES
ID is a nominal attribute that serves only to identify each instance uniquely. DELETE ATTRIBUTES
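
The preprocessing rules on this slide can be sketched in base R. This is a minimal illustration, not the project's code: the toy data frame and its column names (ID, TSH_measured, TSH, TBG_measured, TBG) are assumptions made for the example.

```r
# Hypothetical toy data; column names are assumed, not taken from the project files
hypo.toy <- data.frame(
  ID           = c(101, 102, 103),
  TSH_measured = c(TRUE, FALSE, TRUE),
  TSH          = c(1.3, NA, 4.1),
  TBG_measured = c(FALSE, FALSE, FALSE),
  TBG          = c(NA, NA, NA)
)

# Replace unknown TSH values with zero where TSH was not measured
not.measured <- !hypo.toy$TSH_measured
hypo.toy$TSH[not.measured & is.na(hypo.toy$TSH)] <- 0

# Delete the "measured" flags, the empty TBG attribute, and the ID
drop.cols <- c("ID", "TSH_measured", "TBG_measured", "TBG")
hypo.reduced <- hypo.toy[, !(names(hypo.toy) %in% drop.cols), drop = FALSE]
```

The same pattern would be repeated for T3, TT4, T4U, and FTI.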

11 Data Selection - Data Preprocessing
Anomaly
One instance has an age value of 455, which was presumably meant to be 45 or 55.
Replace 455 with 50.

12 Data Selection - Data Preprocessing
Missing Value
We selected patients who are similar to the patient whose age value is missing.
Using Excel, we chose 2 such patients and replaced the missing age value with the mean of their 2 values.

13 Data Selection - Data Preprocessing
Missing Value
Replaced with all possible values, weighted by their probability distribution (1:2)

14 Data Selection - Data Preprocessing
Attribute Transformation
All nominal attributes except SEX have TRUE/FALSE text values.
  Transform these text values to the integer values 0 (FALSE) and 1 (TRUE).
The attribute SEX has MALE/FEMALE text values.
  Transform these to the integer values 1 (MALE) and 2 (FEMALE).
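
A minimal sketch of this transformation in base R, assuming the values are stored as the strings "TRUE"/"FALSE" and "MALE"/"FEMALE"; the data frame and the column name on_thyroxine are hypothetical.

```r
# Hypothetical toy data with text-valued nominal attributes
toy <- data.frame(
  on_thyroxine = c("TRUE", "FALSE", "TRUE"),
  SEX          = c("MALE", "FEMALE", "FEMALE"),
  stringsAsFactors = FALSE
)

# TRUE/FALSE text to 1/0
toy$on_thyroxine <- as.integer(toy$on_thyroxine == "TRUE")

# MALE/FEMALE text to 1/2
toy$SEX <- ifelse(toy$SEX == "MALE", 1L, 2L)
```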

15 Various Approaches to Classify the Thyroid Disease

16 Various Approaches - C4.5 / C5 Method
We decided to construct the first classification model using a decision tree.
A decision tree is an easy way to build a classifier.
It is based on Hunt's algorithm.
The impurity of leaf nodes is measured by entropy.
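
As an illustration of the impurity measure named above, entropy can be computed from class counts as follows. The function name and the example counts are assumptions for this sketch; with four classes the maximum is log2(4) = 2.

```r
# Entropy (in bits) of a node, given the number of instances per class
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]            # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

balanced <- entropy(c(25, 25, 25, 25))  # four equally frequent classes
skewed   <- entropy(c(97, 1, 1, 1))     # one dominant class (hypothetical counts)
```

A node dominated by one class has entropy close to zero, which is why a heavily skewed data set starts with low entropy.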

17 Various Approaches - C4.5 / C5 Method
We used the tree library to grow a decision tree in RGui.
Example code:
    library(tree)
    # fit a classification tree on all attributes
    hypo.tree <- tree(Diagnosis ~ ., data = hypo.data)
    # x: data to predict on, y: actual classes (both defined elsewhere)
    pred.tree <- predict(hypo.tree, x, type = "class")
    table(pred.tree, y)
    plot(hypo.tree, type = "uniform"); text(hypo.tree, cex = 0.7)

18 Various Approaches - C4.5 / C5 Method
Cross Validation of the Decision Tree
According to this result, the optimal model, the one with low deviance, has 7 leaf nodes.

19 Various Approaches - C4.5 / C5 Method
[Figure: Decision Tree]

20 Various Approaches - C4.5 / C5 Method
Training Set
Accuracy = 2784/2800 = 99.43%
The entropy of the original data set is too low (max 2)
Confusion matrix (Actual Class vs. Predicted Class)
Classes: Compensated Hypothyroid, Negative, Primary Hypothyroid, Secondary Hypothyroid
Cell values as extracted: 153 3 6 1 2573 2 4 58

21 Various Approaches - C4.5 / C5 Method
Test Set
Accuracy = 968/972 = 99.59%
Confusion matrix (Actual Class vs. Predicted Class)
Classes: Compensated Hypothyroid, Negative, Primary Hypothyroid, Secondary Hypothyroid
Cell values as extracted: 40 2 1 898 30

22 Various Approaches - Support Vector Machine(SVM)
A Support Vector Machine (SVM) is an efficient model that classifies instances by finding a linear or non-linear hyperplane.
It is a suitable model when the data set has many dimensions.
It is very hard to visualize all data instances with many attributes; however, by plotting two attributes and fixing the others at chosen slices, we can visualize the instances and the relationship between attributes.

23 Various Approaches - Support Vector Machine(SVM)
[Figure: SVM modeling using R]

24 Various Approaches - Support Vector Machine(SVM)
We thought the attributes FTI and TT4 were suitable for separating instances.
This figure shows how FTI and TT4 separate the instances of the data set, but all records in this area are classified as negative.

25 Various Approaches - Support Vector Machine(SVM)
Now we change the axes and fix some slices, which reduces the number of dimensions.
The area painted light pink indicates that instances falling in that area would be predicted as primary hypothyroid.

26 Various Approaches - Support Vector Machine(SVM)
Prediction of Training Set
Accuracy = 2658/2800 = 94.93%
The entropy of the original data set is too low (max 2)
Confusion matrix (Actual Class vs. Predicted Class)
Classes: Compensated Hypothyroid, Negative, Primary Hypothyroid, Secondary Hypothyroid
Cell values as extracted: 30 2 122 2579 13 1 49

27 Various Approaches - Support Vector Machine(SVM)
Prediction of Test Set
Accuracy = 933/972 = 95.99%
Confusion matrix (Actual Class vs. Predicted Class)
Classes: Compensated Hypothyroid, Negative, Primary Hypothyroid, Secondary Hypothyroid
Cell values as extracted: 11 4 29 901 6 21

28 Various Approaches - Artificial Neural Networks(ANN)
Concept of ANN
An artificial neural network, usually called a "neural network", is a computational model that tries to simulate the structure and/or functional aspects of biological neural networks.
In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase.
In a neural network model, simple nodes are connected together to form a network of nodes.
Its practical use comes with algorithms designed to alter the strength (weights) of the connections in the network to produce a desired signal flow.

29 Various Approaches - Artificial Neural Networks(ANN)
Training / Test error rate
According to this result, the number of hidden nodes to use in the ANN is 18.

30 Various Approaches - Artificial Neural Networks(ANN)
Construction of ANN classifier
Used the nnet library
Example code:
    library(nnet)
    y <- hypo.data$Diagnosis
    # single hidden layer with 18 nodes, weight decay 5e-4, at most 300 iterations
    hypo.ann <- nnet(Diagnosis ~ ., data = hypo.data,
                     size = 18, decay = 5e-4, maxit = 300)
    hypo.ann
    summary(hypo.ann)
    pred.ann <- predict(hypo.ann, hypo.data, type = "class")
    table(pred.ann, y)

31 Various Approaches - Artificial Neural Networks(ANN)
A network with 490 weights

32 Various Approaches - Artificial Neural Networks(ANN)
[Diagram: a network with inputs X1 to X22 plus a bias node, hidden nodes Hidden 1 to Hidden 18 plus a bias node, and outputs Class 1 to Class 4]
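
The figure of 490 weights is consistent with the layout in the diagram. A short check, assuming 22 inputs, 18 hidden nodes, 4 output classes, one bias per layer, and no skip-layer connections:

```r
n.input  <- 22  # X1 to X22
n.hidden <- 18  # Hidden 1 to Hidden 18
n.output <- 4   # Class 1 to Class 4

# Each hidden node receives every input plus a bias;
# each output node receives every hidden node plus a bias.
n.weights <- (n.input + 1) * n.hidden + (n.hidden + 1) * n.output
```

This gives 414 input-to-hidden weights and 76 hidden-to-output weights.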

33 Various Approaches - Artificial Neural Networks(ANN)
Prediction of Training Set
Accuracy = 2798/2800 = 99.93%
The highest training accuracy of all the models
Confusion matrix (Actual Class vs. Predicted Class)
Classes: Compensated Hypothyroid, Negative, Primary Hypothyroid, Secondary Hypothyroid
Cell values as extracted: 152 2580 2 64

34 Various Approaches - Artificial Neural Networks(ANN)
Prediction of Test Set
Accuracy = 954/972 = 98.15%
Confusion matrix (Actual Class vs. Predicted Class)
Classes: Compensated Hypothyroid, Negative, Primary Hypothyroid, Secondary Hypothyroid
Cell values as extracted: 35 7 1 4 891 2 3 28

35 Various Approaches - Ensemble Methods
Bagging: Algorithm
Sampling with replacement
Build a classifier on each bootstrap sample
Also known as bootstrap aggregation
Step 1: Draw B bootstrap samples from the sample of size N, then construct a classifier model from each bootstrap sample.
Step 2: Aggregate the B decision trees from step 1.
Step 3: Assign each instance the class predicted by the majority of the trees from step 2.
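
The three steps can be sketched in R as follows. This is a minimal illustration, not the code shown on the slides: it assumes the built-in iris data in place of hypo.data, B = 5 bootstrap samples, and the rpart package in place of the tree library.

```r
library(rpart)  # assumed stand-in for the tree library used in the project

set.seed(1)
data(iris)      # stand-in data set, not the thyroid data
B <- 5
n <- nrow(iris)

# Step 1: draw B bootstrap samples and fit one tree per sample
trees <- lapply(seq_len(B), function(b) {
  idx <- sample(n, n, replace = TRUE)
  rpart(Species ~ ., data = iris[idx, ], method = "class")
})

# Step 2: collect the class predicted by every tree (n x B character matrix)
votes <- sapply(trees, function(fit) {
  as.character(predict(fit, newdata = iris, type = "class"))
})

# Step 3: majority vote per instance
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))

accuracy <- mean(bagged == iris$Species)
```

An odd B avoids most ties; when a tie does occur, which.max() picks the first class in alphabetical order.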

36 Bagging - Example Code - Ensemble Methods

37 Various Approaches - Ensemble Methods
Example of Majority Vote (Tree 1)

38 Various Approaches - Ensemble Methods
Example of Majority Vote (Tree 2)

39 Various Approaches - Ensemble Methods
Example of Majority Vote (Tree 3)

40 Various Approaches - Ensemble Methods
Example of Majority Vote (Tree 4)

41 Various Approaches - Ensemble Methods
Example of Majority Vote (Tree 5)

42 Various Approaches - Ensemble Methods

43 Various Approaches - Ensemble Methods
Example of Majority Vote
According to the majority vote, the class of the 80th instance is predicted as NEGATIVE, which is the same as its actual class.

44 Bagging - Example Code - Ensemble Methods

45 Various Approaches - Ensemble Methods
Result of Bagging (Majority Vote)

46 Various Approaches - Ensemble Methods
Result of Bagging (Majority Vote)

47 Various Approaches - Ensemble Methods
Bagging
Accuracy = 2790/2800 = 99.64%
Secondary Hypothyroid is misclassified again
Confusion matrix (Actual Class vs. Predicted Class)
Classes: Compensated Hypothyroid, Negative, Primary Hypothyroid, Secondary Hypothyroid
Cell values as extracted: 154 4 2572 2 64
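
The accuracies of the four models can be recomputed from the correct/total counts reported on the slides. The data frame layout and labels below are assumptions for this sketch; only the counts come from the presentation.

```r
# Correct and total counts as reported on the slides
results <- data.frame(
  model   = c("Decision tree", "Decision tree", "SVM", "SVM",
              "ANN", "ANN", "Bagging"),
  set     = c("training", "test", "training", "test",
              "training", "test", "training"),
  correct = c(2784, 968, 2658, 933, 2798, 954, 2790),
  total   = c(2800, 972, 2800, 972, 2800, 972, 2800)
)

# Accuracy in percent, rounded to two decimals
results$accuracy <- round(100 * results$correct / results$total, 2)
print(results)
```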

48 Conclusion

49 Conclusion
It was a valuable experience for us to mine data from raw data sets.
A limitation of our project is that the data set we chose does not contain enough instances of every class; e.g., only two instances belong to the class secondary hypothyroid.
Because there are so few instances, the models we constructed may misclassify some classes, especially secondary hypothyroid.

50 Conclusion
Data mining techniques can be applied to pathology to diagnose disease.
We can also use data mining techniques for other medical decisions; MRI or CT scans may be good examples.
Because our R programming skills are limited, we could not do everything we wanted.
We therefore referred to research by J. R. Quinlan when growing the decision trees.

51 Conclusion
Comparing training error, the ANN model was the best classifier; comparing test error, however, the decision tree classified instances best.
Most attributes of the data set we used are TRUE/FALSE values. Decision trees are strong at handling discrete values, so they performed well.
An ensemble model of decision trees built with the bagging method was also very accurate, thanks to its majority voting rule.
However, because the number of instances is too small and the initial entropy is too low, it was hard to classify the small class.
In contrast, only the ANN model classified the smallest class well, even though that class has only two instances.

52 Conclusion
Diagnosing serious diseases in pathology is fascinating, but critical.
For example, a classifier may diagnose a patient as normal even though he or she has a very serious disease such as lung cancer.
For this reason, we think a very large cost should be assigned to misclassifying patients as normal/negative, and that not only the error rate of the model but also the costs of its predictions should be considered.
Since there are many considerations in assigning costs, it is hard to estimate them accurately, so we could not apply costs to our models.
Even though this classifier can diagnose thyroid disease, the final decision rests with the doctor.
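
The idea of weighing misclassification costs alongside the error rate can be sketched as follows. All numbers here are hypothetical: the two-class confusion matrix and the cost of 20 for a missed disease are assumptions chosen only to illustrate the calculation.

```r
classes <- c("negative", "disease")

# Hypothetical confusion matrix: rows = actual class, columns = predicted class
conf <- matrix(c(90, 5,
                  2, 3),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual = classes, predicted = classes))

# Hypothetical cost matrix: missing a disease costs 20, a false alarm costs 1
cost <- matrix(c( 0, 1,
                 20, 0),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual = classes, predicted = classes))

error.rate <- 1 - sum(diag(conf)) / sum(conf)
total.cost <- sum(conf * cost)  # element-wise product, summed over all cells
```

Here the 2 missed diseases contribute 40 of the total cost of 45, although they are fewer than the 5 false alarms.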

53 Thank you
Any questions?

