ROUGH SET THEORY AND FUZZY LOGIC BASED WAREHOUSING OF HETEROGENEOUS CLINICAL DATABASES Yiwen Fan.



Purpose: Warehouse medical databases. Clinical databases have accumulated large quantities of information about patients and their medical conditions. To warehouse these databases and analyze a patient's condition, we need an efficient data mining technique. Data Mining Process: data warehousing, data query and cleaning, and data analysis.

Three major data mining techniques: Regression, Clustering, Classification.

Techniques used in this paper. Two phases: Clustering and Classification. First phase: use Rough Set Theory (RST) for clustering (the clustering technique reduces the complexity of the RST result). Second phase: use Fuzzy Logic to classify the resulting clusters. In short: Rough Set Theory (RST) -> Clustering; Fuzzy Logic -> Classification. Definition of Clustering: a data mining technique for warehousing a heterogeneous database. It groups data with similar characteristics into the same cluster and places data with dissimilar characteristics in other clusters. (Used to handle uncertainty and incomplete information.)

Previous clustering techniques: K-Means, Expectation Maximization, Association Rule, K-Prototype, Fuzzy K-Modes, etc.

Phase 1 – Clustering. Definition: partition data into groups of similar categories or objects. Cluster: a group of objects in the same category. Different clusters: the objects within a cluster are similar to each other and dissimilar to the objects in other clusters. Fewer clusters: 1) Cost: data details are lost; 2) Benefit: simplification. The search for clusters is unsupervised learning. Cluster types: 1. Exclusive clusters: each category or object belongs to only one cluster. 2. Overlapping clusters: a category or object may belong to many clusters. 3. Probabilistic clusters: a category or object belongs to each cluster with a certain probability.

Notations in Rough Set Theory (RST)
Definition 1: Indiscernibility Relation, IND(B)
Definition 2: Equivalence Class, [x_i]_IND(B)
Definition 3: Lower Approximation
Definition 4: Upper Approximation
Definition 5: Roughness
Definition 6: Mean Roughness
Definition 7: Standard Deviation
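The formulas for these definitions did not survive in the transcript. As a reference point, the textbook (Pawlak) forms are sketched below for a universe U, an attribute subset B, and a target set X. The symbols R_B, r_k, \bar{r} and \sigma are assumptions introduced here for illustration. Whether the slides define roughness as the ratio R_B(X) or as 1 - R_B(X) cannot be recovered from the transcript, so both are shown.

```latex
\begin{align*}
\mathrm{IND}(B) &= \{(x, y) \in U \times U : a(x) = a(y) \ \text{for all } a \in B\} \\
[x_i]_{\mathrm{IND}(B)} &= \{\, y \in U : (x_i, y) \in \mathrm{IND}(B) \,\} \\
\underline{B}X &= \{\, x \in U : [x]_{\mathrm{IND}(B)} \subseteq X \,\} \\
\overline{B}X &= \{\, x \in U : [x]_{\mathrm{IND}(B)} \cap X \neq \emptyset \,\} \\
R_B(X) &= \frac{|\underline{B}X|}{|\overline{B}X|}, \qquad
\rho_B(X) = 1 - R_B(X) \\
\bar{r} &= \frac{1}{n} \sum_{k=1}^{n} r_k, \qquad
\sigma = \sqrt{\frac{1}{n} \sum_{k=1}^{n} (r_k - \bar{r})^2}
\end{align*}
```

Here r_1, ..., r_n stand for the roughness values being averaged, for example one value per category of an attribute.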

1) The whole data set is the parent node U.
2) Current number of data sets (clusters): CNC, iterated from 1 to K.
3) For the attributes in A, find those that fall in the same category.
4) Calculate the roughness of these attributes for this category.
5) Find the mean value over all these attributes.
6) Calculate and store the standard deviation for these attributes.
7) The smallest standard deviation is used for the next iteration.
8) If the standard deviation does not match the smallest value, the next smallest value is taken to choose the splitting attribute.
9) Perform binary splitting: split the whole data set into two clusters.
10) Use the Distance of Relevance formula to select the cluster (the one with the largest distance).
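The attribute-selection part of the steps above (roughness, mean, standard deviation, smallest value wins) can be sketched as follows. This is one plausible reading, not the paper's exact procedure: roughness is taken here as the ratio |lower approximation| / |upper approximation|, the toy records and attribute names (`cp`, `fbs`, `sex`) are hypothetical, and the binary split and Distance of Relevance steps are left out because their formulas are not in the transcript.

```python
from collections import defaultdict
from statistics import mean, pstdev


def partition(rows, attr):
    """Equivalence classes of IND({attr}): sets of row indices sharing a value."""
    classes = defaultdict(set)
    for i, row in enumerate(rows):
        classes[row[attr]].add(i)
    return list(classes.values())


def roughness_ratio(target, classes):
    """|lower approximation| / |upper approximation| of target (assumed form)."""
    lower = sum(len(c) for c in classes if c <= target)  # classes inside target
    upper = sum(len(c) for c in classes if c & target)   # classes touching target
    return lower / upper


def pick_splitting_attribute(rows, attrs):
    """Return the attribute with the smallest standard deviation of mean roughness."""
    std_by_attr = {}
    for a in attrs:
        mean_rough = []
        for b in attrs:
            if b == a:
                continue
            classes_b = partition(rows, b)
            # One roughness value per category of attribute a, measured w.r.t. b.
            ratios = [roughness_ratio(x, classes_b) for x in partition(rows, a)]
            mean_rough.append(mean(ratios))
        std_by_attr[a] = pstdev(mean_rough)
    best = min(std_by_attr, key=std_by_attr.get)
    return best, std_by_attr


# Hypothetical categorical records, for illustration only.
rows = [
    {"cp": "typical", "fbs": "high", "sex": "m"},
    {"cp": "typical", "fbs": "high", "sex": "f"},
    {"cp": "atypical", "fbs": "low", "sex": "m"},
    {"cp": "atypical", "fbs": "high", "sex": "f"},
    {"cp": "none", "fbs": "low", "sex": "m"},
]
attrs = ["cp", "fbs", "sex"]
best, stds = pick_splitting_attribute(rows, attrs)
print(best, stds)
```

Ties between attributes are broken by list order here, which is an arbitrary choice of this sketch.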

Phase 2 – Classification. Fuzzy Inference: generating a mapping from a given input to an output using fuzzy logic. The mapping then gives a basis from which decisions can be generated or patterns discerned. Fuzzy Inference System: 1) Fuzzification; 2) Fuzzy Rules Generation; 3) Defuzzification. Fuzzy Inference Process: 1) Membership Functions; 2) Logical Operations; 3) If-Then Rules.
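To make the three ingredients of the fuzzy inference process concrete, here is a minimal sketch using common textbook choices: a triangular membership function, min/max/complement as the logical operations, and one If-Then rule whose firing strength is the AND of its antecedents. The function names, breakpoints and input values are assumptions for illustration; the slides do not say which membership shapes or operators the paper used.

```python
def triangular(x, a, b, c):
    """Triangular membership: 0 at a and c, 1 at the peak b (requires a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def fuzzy_and(u, v):
    """Logical AND as the minimum of two membership degrees."""
    return min(u, v)


def fuzzy_or(u, v):
    """Logical OR as the maximum of two membership degrees."""
    return max(u, v)


def fuzzy_not(u):
    """Logical NOT as the complement of a membership degree."""
    return 1.0 - u


# Hypothetical rule: IF feature_1 is "medium" AND feature_2 is "medium" THEN output is M.
mu_1 = triangular(2.5, 0.0, 5.0, 10.0)  # degree to which feature_1 is "medium"
mu_2 = triangular(5.0, 0.0, 5.0, 10.0)  # degree to which feature_2 is "medium"
rule_strength = fuzzy_and(mu_1, mu_2)
print(mu_1, mu_2, rule_strength)
```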

Fuzzification Conditions. 1. All the "Cluster 1 (C-1)" values are compared with the "Minimum Limit Value (ML(C-1))". Any Cluster 1 values that are less than ML(C-1) are set as L. 2. All the "Cluster 1 (C-1)" values are compared with the "Maximum Limit Value (XL(C-1))". Any Cluster 1 values that are greater than XL(C-1) are set as H. 3. Any "Cluster 1 (C-1)" values that are greater than ML(C-1) and less than XL(C-1) are set as M. Similarly, apply the same conditions to the other cluster, C-2, to generate its fuzzy values.
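The three conditions amount to a threshold mapping, sketched below. Values below the minimum limit map to L, values above the maximum limit map to H, and everything in between maps to M. The function name `fuzzify` and the sample numbers are hypothetical, and treating values exactly equal to a limit as M is an assumption, since the conditions do not cover that case.

```python
def fuzzify(values, ml, xl):
    """Map each cluster value to 'L', 'M' or 'H' using the limits ml and xl."""
    labels = []
    for v in values:
        if v < ml:
            labels.append("L")
        elif v > xl:
            labels.append("H")
        else:
            # Includes v == ml and v == xl, which is an assumption of this sketch.
            labels.append("M")
    return labels


cluster_1 = [1, 5, 9, 3, 7]  # hypothetical cluster values
print(fuzzify(cluster_1, ml=3, xl=7))
```

The same function would be called once per cluster, each with its own pair of limits.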

Fuzzy Rules Generation. General form of a Fuzzy Rule: "IF A THEN B", where the IF part is the antecedent and the THEN part is the conclusion. The FIS is trained on the output values between L and H to generate the Fuzzy Rules. The Fuzzy Rules are generated from the fuzzy values produced for each feature in the Fuzzification process.
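One simple way to turn fuzzified records into "IF A THEN B" rules is to treat each distinct combination of fuzzy feature values as an antecedent and take the most frequent output label seen with it as the conclusion. The sketch below assumes that majority-vote scheme; the slides do not describe how the paper derives its rules, and the function name and sample data are hypothetical.

```python
from collections import Counter, defaultdict


def generate_rules(fuzzy_rows, outputs):
    """Build a rule table: antecedent tuple of fuzzy values -> most frequent output."""
    votes = defaultdict(Counter)
    for row, out in zip(fuzzy_rows, outputs):
        votes[tuple(row)][out] += 1
    return {ante: counts.most_common(1)[0][0] for ante, counts in votes.items()}


# Hypothetical fuzzified training records (two features) and their output labels.
fuzzy_rows = [("L", "H"), ("L", "H"), ("M", "M"), ("L", "H")]
outputs = ["H", "H", "M", "L"]

rules = generate_rules(fuzzy_rows, outputs)
for antecedent, conclusion in rules.items():
    print(f"IF features are {antecedent} THEN output is {conclusion}")
```

A new record would be fuzzified first and then looked up in `rules`; how to handle an antecedent that never appeared in training is left open here.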

Defuzzification. Input: the fuzzy set. Output: a single value, L, M or H, which represents whether the given input dataset is in the Low range, the Medium range or the High range. The FIS is trained using the Fuzzy Rules, and testing is done on the datasets.

Evaluation metrics. Sensitivity: measures the proportion of actual positives that are correctly identified. It relates to the test's ability to identify positive results. Specificity: measures the proportion of negatives that are correctly identified. It relates to the test's ability to identify negative results. Accuracy: from the above results, the accuracy value follows from the standard formula Accuracy = (TP + TN) / (TP + TN + FP + FN), where TP, TN, FP and FN are the counts of true positives, true negatives, false positives and false negatives. These metrics are used to evaluate the effectiveness of the proposed systems and to justify their theoretical and practical developments.
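The three metrics can be computed directly from confusion-matrix counts, as in this short sketch. The function name and the counts are made up for illustration and are not results from the paper.

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                # share of actual positives found
    specificity = tn / (tn + fp)                # share of actual negatives found
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all cases classified correctly
    return sensitivity, specificity, accuracy


# Hypothetical counts, for illustration only.
sens, spec, acc = evaluation_metrics(tp=40, tn=45, fp=5, fn=10)
print(f"sensitivity={sens:.2%} specificity={spec:.2%} accuracy={acc:.2%}")
```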

Results and Discussions. The paper used the heart disease data sets: Cleveland, Hungarian and Switzerland. Total number of attributes: 76. The 14 generally used attributes: age, sex, chest pain type, resting blood pressure, serum cholesterol in mg/dl, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved, exercise induced angina, ST depression, slope of the peak exercise ST segment, number of major vessels, thal, and diagnosis of heart disease.

Clustering Results. The dataset is clustered into two sets. Figure legend: red dots -> Cluster 1; blue dots -> Cluster 2; crosses -> centroids.

Cleveland dataset. Table: Performance evaluation for sensitivity, specificity and accuracy of the Cleveland dataset (columns: Iteration No., Sensitivity (in %), Specificity (in %), Accuracy (in %)). Figure: Graph of the sensitivity, specificity and accuracy of the Cleveland dataset.

Switzerland dataset. Table: Performance evaluation for sensitivity, specificity and accuracy of the Switzerland dataset (columns: Iteration No., Sensitivity (in %), Specificity (in %), Accuracy (in %)). Figure: Graph of the sensitivity, specificity and accuracy of the Switzerland dataset.

Hungarian dataset. Table: Performance evaluation for sensitivity, specificity and accuracy of the Hungarian dataset (columns: Iteration No., Sensitivity (in %), Specificity (in %), Accuracy (in %)). Figure: Graph of the sensitivity, specificity and accuracy of the Hungarian dataset.

Conclusion. Rough Set Theory was used as the clustering algorithm. Fuzzy logic was used to classify the clusters. The experimentation was carried out on heart disease datasets. The evaluation metrics of sensitivity, specificity and accuracy for the proposed work were also analyzed. Result: The Switzerland dataset provided better results compared with the other two datasets. At the highest iteration level, good clustering and classification results were achieved.

Reference:
[1] R. Saravana Kumar, "Rough Set Theory and Fuzzy Logic Based Warehousing of Heterogeneous Clinical Databases".
[2] Duo Chen, Du-Wu Cui, Chao-Xue Wang, and Zhu-Rong Wang, "A Rough Set-Based Hierarchical Clustering Algorithm for Categorical Data", International Journal of Information Technology, Vol. 12, No. 3, pp , 2006.