Classifying the Thyroid Disease

Introduction
Outline:
Introduction: Objective, Project Plan
Data Selection: Property, Preprocessing
Various Approaches to Classify the Thyroid Disease: C4.5 / C5, SVM, ANN, Ensemble
Conclusion
Korea University, Industrial System Information Engineering 2018-11-17

Introduction - Objective of the Project
Experience data mining as part of the KDD (Knowledge Discovery in Databases) process.
Focus on using various data mining techniques.
Our objective is to find a model (classifier), C = f(A).
Evaluate the constructed models.
Tools: R GUI version 2.9.0 with Tinn-R version 1.17.2.4.
[Figure: Data Mining and its related fields: AI, Machine Learning, Pattern Recognition, Statistics, Database Systems]

Introduction - Project Plan
4/10 First team meeting
~4/26 Find existing research and a data set for the project
4/28 Submit an initial proposal
5/10 Change the subject of the project
~5/27 Try to get a suitable data set for the project
5/29 Write a modified proposal
6/4 Submit the modified proposal
6/6 Decision tree and SVM classifier modeling
6/10 Ensemble and ANN model construction
6/16 Integrate the results and write the final report
6/18 Submit the final report and presentation

Data Selection

Data Selection - Property of Data Set
Thyroid Disease data set from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Thyroid+Disease)
Attributes: 29 nominal (T/F, M/F, etc.) and ratio attributes. Nominal attributes have text values. Some attributes are highly correlated.
Data instances: 2800 training instances, some with missing values; 972 test instances, also with some missing values.

Data Selection - Property of Data Set
Parallel Coordinate Plot
Example code:
    library(lattice)  # provides parallel(); newer lattice releases call this function parallelplot()
    parallel(~hypo.data[, 1:22])
With all 22 attributes plotted at once, there are too many to analyze the correlation between attributes and classes.

Data Selection - Property of Data Set
Parallel Coordinate Plot
Example code:
    attach(hypo.data)
    # plot only the selected attribute columns, one panel per Diagnosis class
    parallel(~hypo.data[, c(1, 2, 17, 18, 19, 20, 21, 22)] | Diagnosis,
             groups = Diagnosis)
According to this plot, the attributes FTI and TT4 may separate primary and compensated hypothyroid.

Data Selection - Data Preprocessing
Dimensionality Reduction: eliminate highly correlated attributes; select meaningful attributes.
Control Anomaly/Missing Values: replace these with estimated values.
Attribute Transformation: convert text values to integer values.

Data Selection - Data Preprocessing
Dimensionality Reduction (29 attributes to 22)
For each instance, the attributes TSH, T3, TT4, T4U, and FTI are unknown when the corresponding "measured" attribute is FALSE. For example, if "TSH measured" is FALSE, then the value of TSH is unknown, so "TSH measured" is highly correlated with TSH. We replaced these unknowns with zero.
Deleted attributes:
Each "measured" attribute, because it is then a meaningless attribute.
TBG and "TBG measured": the values of "TBG measured" are all FALSE, and the TBG values are all unknown as well.
ID: a nominal attribute that only serves to identify each instance uniquely.
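
The replacement and deletion steps above can be sketched in base R. This is only an illustration on a small made-up data frame, not the authors' script: the column names (ID, TSH.measured, TSH) and the use of NA to encode an unknown value are assumptions.

```r
# Hypothetical toy data; column names and the NA-for-unknown encoding are assumptions.
toy <- data.frame(
  ID           = c(101, 102, 103),
  TSH.measured = c(TRUE, FALSE, TRUE),
  TSH          = c(1.3, NA, 4.1)
)

# Replace unknown lab values with zero.
toy$TSH[is.na(toy$TSH)] <- 0

# Drop the "measured" flag and the ID column.
toy <- toy[, !(names(toy) %in% c("ID", "TSH.measured")), drop = FALSE]

print(toy)
```

Using drop = FALSE keeps the result a data frame even when only one column remains.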

Data Selection - Data Preprocessing
Anomaly: one age value is 455, which was presumably meant to be 45 or 55. We replaced 455 with 50.

Data Selection - Data Preprocessing
Missing Value: we looked for patients similar to the patient whose age value was missing. We chose two such patients using Excel and replaced the missing age with the mean of their two ages.

Data Selection - Data Preprocessing
Missing Value: replaced with all possible values, following their probability distribution (1:2).

Data Selection - Data Preprocessing
Attribute Transformation
All nominal attributes except SEX have the text values TRUE/FALSE. These are transformed to the integer values 0 (FALSE) and 1 (TRUE).
The attribute SEX has the text values MALE/FEMALE. These are transformed to the integer values 1 (MALE) and 2 (FEMALE).
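
A minimal base R sketch of this transformation, assuming the text values are stored as character strings. The data frame and the column names (sex, pregnant) are made up for illustration.

```r
# Hypothetical toy data with text-valued nominal attributes (names are assumptions).
toy <- data.frame(
  sex      = c("MALE", "FEMALE", "FEMALE"),
  pregnant = c("FALSE", "TRUE", "FALSE"),
  stringsAsFactors = FALSE
)

# TRUE/FALSE text values become the integers 1/0.
toy$pregnant <- ifelse(toy$pregnant == "TRUE", 1L, 0L)

# SEX becomes 1 (MALE) or 2 (FEMALE), following the coding on the slide.
toy$sex <- ifelse(toy$sex == "MALE", 1L, 2L)

print(toy)
```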

Various Approaches to Classify the Thyroid Disease

Various Approaches - C4.5 / C5 Method
We decided to build our first classification model with a decision tree.
A decision tree is an easy way to build a classifier.
It is based on Hunt's algorithm.
The impurity of the leaf nodes is measured by entropy.
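
As a rough illustration of the entropy measure, the sketch below computes the entropy of a node from its class counts, that is, the negative sum over classes of p * log2(p). The function name and the counts are made up for illustration and are not taken from the thyroid data.

```r
# Entropy (in bits) of a vector of class counts; empty classes contribute nothing.
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Four equally frequent classes give the maximum for four classes, log2(4) = 2.
print(entropy(c(25, 25, 25, 25)))

# A heavily skewed (made-up) class distribution gives a much lower value.
print(entropy(c(970, 10, 15, 5)))

# A pure node, where every instance has the same class, has entropy 0.
print(entropy(c(100, 0, 0, 0)))
```

With four diagnosis classes, the largest possible entropy is 2, which is the maximum quoted alongside the results.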

Various Approaches - C4.5 / C5 Method
We used the tree library to grow the decision tree in RGui.
Example code:
    library(tree)
    hypo.tree <- tree(Diagnosis ~ ., data = hypo.data)
    # x and y are not defined on the slide; presumably the attribute data and the actual classes
    pred.tree <- predict(hypo.tree, x, type = "class")
    table(pred.tree, y)  # confusion matrix: predicted vs. actual
    plot(hypo.tree, type = "uniform")
    text(hypo.tree, cex = 0.7)

Various Approaches - C4.5 / C5 Method
Cross Validation of the Decision Tree
According to this result, the optimal model, the one with low deviance, is estimated to have 7 leaf nodes.

Various Approaches - C4.5 / C5 Method
Decision Tree (figure)

Various Approaches - C4.5 / C5 Method
Training set accuracy = 2784/2800 = 0.9943
The entropy of the original data set is very low (0.4720; maximum 2).
Confusion matrix of actual class vs. predicted class (classes: Compensated Hypothyroid, Negative, Primary Hypothyroid, Secondary Hypothyroid); cell values shown on the slide: 153 3 6 1 2573 2 4 58

Various Approaches - C4.5 / C5 Method
Test set accuracy = 968/972 = 0.9959
Confusion matrix of actual class vs. predicted class (classes: Compensated Hypothyroid, Negative, Primary Hypothyroid, Secondary Hypothyroid); cell values shown on the slide: 40 2 1 898 30
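
The accuracy figures are the diagonal of the confusion matrix divided by the total number of instances. The sketch below shows that calculation in base R on a made-up matrix; the counts are illustrative only and are not the cells of the slide's table. It also computes per-class recall, which is one way to see that a rare class can be missed even when overall accuracy is high.

```r
# Illustrative, made-up confusion matrix (rows = predicted, columns = actual),
# the same layout that table(pred, y) produces.
classes <- c("compensated", "negative", "primary", "secondary")
cm <- matrix(
  c(40,   2,  0, 0,
     1, 900,  1, 1,
     0,   0, 27, 0,
     0,   0,  0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(predicted = classes, actual = classes)
)

# Accuracy is the share of instances on the diagonal.
accuracy <- sum(diag(cm)) / sum(cm)

# Per-class recall: correctly predicted instances divided by actual instances.
recall <- diag(cm) / colSums(cm)

print(round(accuracy, 4))
print(round(recall, 4))
```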

Various Approaches - Support Vector Machine (SVM)
A Support Vector Machine (SVM) is an efficient model that classifies instances by finding a linear or non-linear decision boundary.
It is a suitable model when the data set has many dimensions.
It is very hard to visualize all data instances with many attributes. However, by plotting two attributes and fixing the others at chosen values (slices), we can visualize the instances and the relationship between the attributes.

Various Approaches - Support Vector Machine (SVM)
SVM modeling using R

Various Approaches - Support Vector Machine (SVM)
We expected the attributes FTI and TT4 to be suitable for separating the instances.
This figure shows how FTI and TT4 separate the instances of the data set, but all records in this area are classified as negative.

Various Approaches - Support Vector Machine (SVM)
Now we change the axes and fix some slices, which reduces the number of dimensions shown.
The area shaded light pink indicates that instances in that area would be predicted as primary hypothyroid.

Various Approaches - Support Vector Machine (SVM)
Prediction of Training Set
Accuracy = 2658/2800 = 0.9493
The entropy of the original data set is very low (0.4720; maximum 2).
Confusion matrix of actual class vs. predicted class (classes: Compensated Hypothyroid, Negative, Primary Hypothyroid, Secondary Hypothyroid); cell values shown on the slide: 30 2 122 2579 13 1 49

Various Approaches - Support Vector Machine (SVM)
Prediction of Test Set
Accuracy = 933/972 = 0.9599
Confusion matrix of actual class vs. predicted class (classes: Compensated Hypothyroid, Negative, Primary Hypothyroid, Secondary Hypothyroid); cell values shown on the slide: 11 4 29 901 6 21

Various Approaches - Artificial Neural Networks (ANN)
Concept of ANN
An artificial neural network, usually called a "neural network", is a computational model that tries to simulate the structure and/or functional aspects of biological neural networks.
In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase.
In a neural network model, simple nodes are connected together to form a network of nodes.
Its practical use comes with algorithms designed to alter the strength (weights) of the connections in the network to produce a desired signal flow.

Various Approaches - Artificial Neural Networks (ANN)
Training / Test error rate
According to this result, the number of hidden nodes to use in the ANN is 18.

Various Approaches - Artificial Neural Networks (ANN)
Construction of ANN classifier
We used the nnet library.
Example code:
    library(nnet)
    y <- hypo.data$Diagnosis  # actual classes
    # one hidden layer with 18 nodes, weight decay 5e-4, at most 300 iterations
    hypo.ann <- nnet(Diagnosis ~ ., data = hypo.data, size = 18,
                     decay = 5e-4, maxit = 300)
    hypo.ann
    summary(hypo.ann)
    pred.ann <- predict(hypo.ann, hypo.data, type = "class")
    table(pred.ann, y)  # confusion matrix: predicted vs. actual

Various Approaches - Artificial Neural Networks (ANN)
A 22-18-4 network with 490 weights: (22 + 1) × 18 + (18 + 1) × 4 = 490, counting one bias weight for each hidden node and each output node.

Various Approaches - Artificial Neural Networks (ANN)
[Figure: a 22-18-4 network with inputs X1 to X22 plus a bias, hidden nodes Hidden 1 to Hidden 18 plus a bias, and outputs Class 1 to Class 4]

Various Approaches - Artificial Neural Networks (ANN)
Prediction of Training Set
Accuracy = 2798/2800 = 0.9993
This is the highest training accuracy of all the models.
Confusion matrix of actual class vs. predicted class (classes: Compensated Hypothyroid, Negative, Primary Hypothyroid, Secondary Hypothyroid); cell values shown on the slide: 152 2580 2 64

Various Approaches - Artificial Neural Networks (ANN)
Prediction of Test Set
Accuracy = 954/972 = 0.9815
Confusion matrix of actual class vs. predicted class (classes: Compensated Hypothyroid, Negative, Primary Hypothyroid, Secondary Hypothyroid); cell values shown on the slide: 35 7 1 4 891 2 3 28

Various Approaches - Ensemble Methods
Bagging - Algorithm
Also known as bootstrap aggregation: sample with replacement and build a classifier on each bootstrap sample.
Step 1: Draw B bootstrap samples from the sample of size N, then construct a classifier model from each bootstrap sample.
Step 2: Aggregate the B decision trees from step 1.
Step 3: Assign each instance the class that receives the majority of the votes from step 2.
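
The three steps can be sketched in base R as follows. This is not the code from the slides, which is not preserved in the transcript. To stay self-contained it swaps the decision tree for a much simpler, hypothetical base learner (a one-split "stump" on a single numeric attribute) and uses made-up two-class data; the function names fit_stump and predict_stump are assumptions.

```r
set.seed(1)  # for a reproducible illustration

# Made-up one-attribute data: class "pos" tends to have larger x.
n <- 200
x <- c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 3))
y <- rep(c("neg", "pos"), each = n / 2)

# Hypothetical base learner: a stump that picks the threshold with the
# fewest training errors (predict "pos" when x > threshold).
fit_stump <- function(x, y) {
  candidates <- sort(unique(x))
  errors <- sapply(candidates, function(t) sum(ifelse(x > t, "pos", "neg") != y))
  candidates[which.min(errors)]
}
predict_stump <- function(threshold, x) ifelse(x > threshold, "pos", "neg")

# Step 1: draw B bootstrap samples (with replacement) and fit one model on each.
B <- 25
models <- sapply(seq_len(B), function(b) {
  idx <- sample(n, size = n, replace = TRUE)
  fit_stump(x[idx], y[idx])
})

# Step 2: collect the B predictions for every instance (rows = instances).
votes <- sapply(models, predict_stump, x = x)

# Step 3: assign each instance the class with the most votes.
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))

print(mean(bagged == y))  # training accuracy of the bagged ensemble
```

An odd B avoids tied votes in the two-class case. With more classes, ties are still possible, and which.max then returns the first class in alphabetical order.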

Various Approaches - Ensemble Methods
Bagging - Example Code

Various Approaches - Ensemble Methods
Example of Majority Vote (figures for Tree 1, Tree 2, Tree 3, Tree 4, and Tree 5)

Various Approaches - Ensemble Methods
Example of Majority Vote
According to the majority vote, the class of the 80th instance is predicted to be NEGATIVE, which is the same as its actual class.

Various Approaches - Ensemble Methods
Result of Bagging (Majority Vote)

Various Approaches - Ensemble Methods
Bagging accuracy = 2790/2800 = 0.9964
Secondary Hypothyroid is misclassified again.
Confusion matrix of actual class vs. predicted class (classes: Compensated Hypothyroid, Negative, Primary Hypothyroid, Secondary Hypothyroid); cell values shown on the slide: 154 4 2572 2 64

Conclusion

Conclusion
Mining data from raw data sets was a valuable experience for us.
A limitation of our project is that the data set we chose has a poorly balanced class distribution; for example, only two instances belong to the class secondary hypothyroid.
Because there are not enough instances, the models we constructed may misclassify some classes, especially secondary hypothyroid.

Conclusion
Data mining techniques can be applied to pathology to diagnose disease.
They can also support other medical decisions; interpreting MRI or CT scans may be a good example.
Because our R programming skills are limited, we could not do everything we wanted.
We therefore referred to research results by J.R. Quinlan when growing the decision trees.

Conclusion
In terms of training error, the ANN model was the best classifier (accuracy 0.9993). In terms of test error, the decision tree classified instances best (accuracy 0.9959, versus 0.9815 for the ANN and 0.9599 for the SVM).
Most attributes of the data set we used are TRUE/FALSE values. Because decision trees are strong at handling discrete values, the decision tree performed well.
The ensemble model of decision trees built with the bagging method was also very accurate, because of its majority voting rule.
However, because the number of instances is too small and the initial entropy is too low, it was hard to classify the small class.
In contrast, only the ANN model classified the smallest class well, even though that class has only two instances.

Conclusion
Diagnosing serious diseases in pathology is fascinating, but the stakes are high. For example, a model could diagnose a patient as normal even though he or she has a critical disease such as lung cancer.
For this reason, we think a very high cost should be assigned to misclassifying patients as normal/negative, and that both the error rate of the model and the costs of its predictions should be considered.
Because many factors go into assigning costs, it is hard to estimate them accurately, so we could not apply costs to our models.
Even if this classifier can diagnose thyroid disease, the final decision belongs to the doctor.
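
To make the cost argument concrete, the sketch below weights a confusion matrix by a cost matrix in base R. Both matrices are made up for illustration. In particular, the assumption that a missed disease costs 50 times as much as a false alarm is hypothetical and is not an estimate from this project.

```r
# Illustrative, made-up binary confusion matrix (rows = predicted, columns = actual).
labels <- c("negative", "disease")
cm <- matrix(c(900,  5,
                20, 47),
             nrow = 2, byrow = TRUE,
             dimnames = list(predicted = labels, actual = labels))

# Hypothetical cost matrix with the same layout: predicting "negative" for a
# patient who actually has the disease is assumed to cost 50, a false alarm 1.
cost <- matrix(c(0, 50,
                 1,  0),
               nrow = 2, byrow = TRUE,
               dimnames = list(predicted = labels, actual = labels))

# The error rate treats both kinds of mistake alike ...
error_rate <- 1 - sum(diag(cm)) / sum(cm)

# ... while the total cost weights each cell by how harmful the mistake is.
total_cost <- sum(cm * cost)

print(round(error_rate, 4))
print(total_cost)
```

Two models with the same error rate can differ widely in total cost if one of them makes more of the expensive mistakes.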

Thank you. Any questions?