A three-step approach for STULONG database analysis: characterization of patients’ groups O. Couturier, H. Delalin, H. Fu, E. Kouamou, E. Mephu Nguifo.

Presentation transcript:

A three-step approach for STULONG database analysis: characterization of patients’ groups
O. Couturier, H. Delalin, H. Fu, E. Kouamou, E. Mephu Nguifo
Computer Science Research Center of Lens (CRIL), CNRS - Université d’Artois – IUT de Lens
Discovery Challenge (PKDD 2004)

2 Goal What are the relations between social factors (social characteristics) and the other characteristics of men in the respective groups?

3 Overview
Discovery process
Techniques and results
–Clustering
–Classification
–Association rules
Conclusion and further work

4 Discovery Process
Hypotheses on data
–ENTRY table
–Groups provided by the expert
  Merging groups 1 and 2: Normal group
  Merging groups 3 and 4: Risk group
  Ignoring group 6
–Characteristics
  Considering previous work of the LRI ML research team at previous PKDD Challenges
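The group merging described on this slide can be sketched as a simple lookup. This is a minimal illustration, not the authors' code: the record layout and field names are hypothetical, and the label "Pathological" for group 5 is taken from the later classification slides rather than from this one.

```python
# Hypothetical mapping from the expert-provided group code to the merged label.
# Groups 1+2 and 3+4 follow the slide; the name for group 5 comes from later slides.
MERGED_GROUP = {1: "Normal", 2: "Normal", 3: "Risk", 4: "Risk", 5: "Pathological"}


def merge_group(code):
    """Return the merged label, or None for a group that is ignored (group 6)."""
    return MERGED_GROUP.get(code)


def merge_patients(patients):
    """Attach a merged label to each record and drop records of ignored groups."""
    merged = []
    for patient in patients:
        label = merge_group(patient["group"])
        if label is not None:
            merged.append({**patient, "merged_group": label})
    return merged


# Toy records; "id" and "group" are illustrative names, not ENTRY table columns.
sample = [
    {"id": 1, "group": 1},
    {"id": 2, "group": 4},
    {"id": 3, "group": 6},
    {"id": 4, "group": 5},
]
result = merge_patients(sample)
```

Keeping the mapping in one dictionary makes the hypothesis explicit and easy to revise if the grouping is refined later.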

5 Discovery Process
Can we find a model that fits the provided groups?
Are there strong similarities among instances of different groups?
What kinds of relations exist among group characteristics?

6 Discovery Process
Data: entry data groups
Tasks and resulting knowledge:
–Clustering: generated clusters vs provided ones
–Supervised classification: similarities among instances and groups
–Association rules search: affinity among group characteristics

7 Techniques and Results: Clustering
Goal: can the initial groups be recovered as they were defined?
Data: groups 12, 34 and 5
Clustering systems (WEKA package):
–COBWEB: 2 groups
–EM: 4 groups
–KMEANS: 2 groups
Results: it is difficult to identify properties that allow the initial groups to be retrieved
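One way to check generated clusters against the provided groups is a cross-tabulation plus a purity score. The sketch below is an illustration under stated assumptions: it does not reproduce WEKA's output, the cluster assignments are toy values, and purity is a measure chosen here for illustration, not one named on the slides.

```python
from collections import Counter


def cross_tabulate(cluster_ids, group_labels):
    """Count how many instances of each provided group fall in each cluster."""
    if len(cluster_ids) != len(group_labels):
        raise ValueError("one cluster id is needed per instance")
    return Counter(zip(cluster_ids, group_labels))


def purity(cluster_ids, group_labels):
    """Fraction of instances that belong to the majority group of their cluster."""
    table = cross_tabulate(cluster_ids, group_labels)
    best = {}
    for (cluster, _group), count in table.items():
        best[cluster] = max(best.get(cluster, 0), count)
    return sum(best.values()) / len(cluster_ids)


# Toy data: two clusters found for instances of three provided groups.
clusters = [0, 0, 0, 1, 1, 1]
groups = ["12", "12", "34", "34", "5", "5"]
table = cross_tabulate(clusters, groups)
score = purity(clusters, groups)
```

A purity well below 1 would point to the same difficulty the slide reports: the clusters do not line up with the provided groups.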

8 Techniques and Results: Supervised Classification
Are Risk group patients similar to those in the Normal or the Pathological group?
Data:
–Training set: group 12 and group 5
–Test set: group 34 (Risk)
System (WEKA package):
–Decision tree C4.5
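C4.5 chooses split attributes by gain ratio, that is, information gain divided by split information. The sketch below computes that criterion for a nominal attribute. It illustrates the criterion only, not WEKA's implementation, and the attribute values are toy data.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio(values, labels):
    """Information gain of a nominal attribute divided by its split information."""
    total = len(labels)
    partitions = {}
    for value, label in zip(values, labels):
        partitions.setdefault(value, []).append(label)
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    split_info = entropy(values)
    if split_info == 0:
        return 0.0  # the attribute takes a single value, so it cannot split the data
    return (entropy(labels) - remainder) / split_info


# Toy training labels: group 12 (Normal) vs group 5 (Pathological).
labels = ["5", "5", "12", "12"]
ht = [1, 1, 0, 0]          # perfectly separates the two groups in this toy set
district = [1, 2, 1, 2]    # carries no information about the group here
ht_score = gain_ratio(ht, labels)
district_score = gain_ratio(district, labels)
```

In this toy set the first attribute gets the maximal score and the second gets zero, which is how a tree learner would rank a relevant descriptor above an irrelevant one.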

9 Techniques and Results: Supervised Classification
Training results:
–The HT descriptor is one of the most relevant factors of the disease
–Thirteen instances of the Pathological group are classified as Normal
=== Confusion Matrix ===
a e <-- classified as
| a =
| e = 5
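A confusion matrix of the kind shown on this slide can be tallied as follows. This is a minimal sketch with toy labels; the counts are not the presentation's results.

```python
from collections import Counter


def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes, in the order given."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]


# Toy labels: "12" stands for the Normal group, "5" for the Pathological group.
classes = ["12", "5"]
actual = ["12", "12", "5", "5", "5"]
predicted = ["12", "12", "5", "12", "5"]
matrix = confusion_matrix(actual, predicted, classes)

# Pathological instances classified as Normal: row "5", column "12".
pathological_as_normal = matrix[1][0]
```

The off-diagonal cell in the Pathological row is the quantity the slide reports as thirteen for the real data.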

10 Techniques and Results: Supervised Classification
Test set: Risk group (group 34)
–Health district number is not a relevant factor
–2/5 of Risk patients are similar to Normal group patients
=== Confusion Matrix ===
a c d e <-- classified as
| a =
| c = 3 (odd)
| d = 4 (even)
| e = 5

11 Techniques and Results: Association rules search
Goal: find relations that exist among group characteristics
Data: 1417 patients of groups 12, 34 and 5
System:
–Apriori: B. Goethals implementation
–Preprocessing: binary conversion of the 27 characteristics
–Frequent itemsets search
Results:
–Frequent itemsets common to different groups

12 Techniques and Results: Association rules search
Preprocessing: binary conversion
–BMI: weight / height² (height in m)
  If bmi > 27 then 1 else 0
–Age
  If age > 45 then 1 else 0
–Smoker
  If smokerconsumption != 0 OR duration then 1 else 0
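The three conversion rules on this slide can be sketched as small functions. The function and parameter names are hypothetical. Weight in kilograms follows the usual BMI definition, and the bare "duration" in the smoker rule is read here as "duration != 0", which is an assumption.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight divided by the square of the height in metres."""
    return weight_kg / height_m ** 2


def binarize_bmi(weight_kg, height_m):
    """1 if the BMI exceeds 27, else 0."""
    return 1 if bmi(weight_kg, height_m) > 27 else 0


def binarize_age(age_years):
    """1 if the patient is older than 45, else 0."""
    return 1 if age_years > 45 else 0


def binarize_smoker(consumption, duration):
    """1 if any consumption or any smoking duration is recorded, else 0.

    Assumption: the slide's bare "duration" is interpreted as "duration != 0".
    """
    return 1 if consumption != 0 or duration != 0 else 0


heavy = binarize_bmi(90, 1.75)   # BMI is about 29.4
light = binarize_bmi(70, 1.80)   # BMI is about 21.6
```

The thresholds are strict inequalities, so a patient aged exactly 45 or with a BMI of exactly 27 maps to 0.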

13 Techniques and Results: Association rules search
Preprocessing: binary conversion
–Bolhr (chest pain)
  If bolhr = 1 or bolhr = 6 then 0 else 1
–Chol
  If chol > 2 + (age / 100) then 1 else 0
–Tg
  If tg < 150 then 0 else 1
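The rules on this slide translate directly into code. This is a minimal sketch with hypothetical function names. The slide does not state the measurement units, so the functions take the values as they are recorded.

```python
def binarize_bolhr(bolhr):
    """0 if the chest pain code is 1 or 6, else 1."""
    return 0 if bolhr in (1, 6) else 1


def binarize_chol(chol, age_years):
    """1 if cholesterol exceeds the age-dependent threshold 2 + age/100, else 0."""
    return 1 if chol > 2 + age_years / 100 else 0


def binarize_tg(tg):
    """0 if triglycerides are below 150, else 1."""
    return 0 if tg < 150 else 1


# Example: for a 50-year-old patient the cholesterol threshold is 2.5.
above = binarize_chol(2.6, 50)
below = binarize_chol(2.4, 50)
```

The cholesterol rule is the only one whose threshold depends on another attribute, so it needs the age alongside the measured value.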

14 Techniques and Results: Association rules search
Frequent itemsets search
–Support threshold (MinSup) = 0.10: significant for at least 10% of the population
–Search was done with no MinSup (i.e. MinSup value = 0)
Table columns: Itemsets | support value in Class 12 | Class 34 | Class 5
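Support and a level-wise frequent itemset search can be sketched as below. This is a small Apriori-style illustration, not B. Goethals' implementation. The transactions are toy data, with each patient represented as the set of binary attributes that equal 1.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    if not transactions:
        return 0.0
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_itemsets(transactions, minsup):
    """Level-wise search: size-k candidates are built from frequent (k-1)-itemsets."""
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([item]) for item in items]
    result = {}
    while level:
        kept = {}
        for candidate in level:
            value = support(candidate, transactions)
            if value >= minsup:
                kept[candidate] = value
        result.update(kept)
        size = len(level[0]) + 1
        # Join step, then prune candidates that have an infrequent subset.
        joined = {a | b for a in kept for b in kept if len(a | b) == size}
        level = [
            c for c in joined
            if all(frozenset(sub) in kept for sub in combinations(c, size - 1))
        ]
    return result


# Toy transactions over three of the binary attributes.
patients = [{"AGE", "SMOKER", "CHOL"}, {"AGE", "SMOKER"}, {"AGE"}, {"CHOL"}]
found = frequent_itemsets(patients, minsup=0.5)
everything = frequent_itemsets(patients, minsup=0.0)
```

With `minsup=0.0`, as in the search described on the slide, no candidate is pruned, so every combination of the observed items is enumerated with its support.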

16 Techniques and Results: Association rules search
Frequent itemsets search: results
–Support value of the Alcohol attribute = 1
–1-itemsets
  Attribute IM is false for every patient of groups 12 and 34. The value is true for 33% of patients of group 5.
  HT is false for every patient of group 12.
  STUDY is more frequent in group 12 than in group 5
–3-itemsets
  AGE & SMOKER & CHOL is less frequent in group 12 than in group 5
–etc.
–The support value of group 34 lies between the support values of group 12 and group 5.
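The per-group comparison reported on this slide amounts to computing the support of one itemset separately in each group. The sketch below uses toy transactions, not the STULONG data, and the helper names are hypothetical.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    if not transactions:
        return 0.0
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def support_by_group(itemset, groups):
    """Support of the itemset computed separately within each group."""
    return {name: support(itemset, rows) for name, rows in groups.items()}


def lies_between(value, a, b):
    """True if value falls within the closed interval spanned by a and b."""
    return min(a, b) <= value <= max(a, b)


# Toy groups; each patient is the set of binary attributes that equal 1.
full = {"AGE", "SMOKER", "CHOL"}
groups = {
    "12": [{"AGE"}, {"SMOKER"}, full, {"CHOL"}],
    "34": [full, full, {"AGE"}, {"CHOL"}],
    "5": [full, full, full, {"AGE"}],
}
by_group = support_by_group(frozenset(full), groups)
risk_in_between = lies_between(by_group["34"], by_group["12"], by_group["5"])
```

Running the same check over every itemset would show how often the Risk group's support sits between those of the Normal and Pathological groups.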

17 Conclusion
The Risk group (RG) shows similarity with both the Normal group (NG) and the Pathological group (PG).
3 steps:
–Clustering: the initial groups are not found
–Classification: some attributes characterize the Pathological group, but they were already known
–Frequent itemsets search: concrete results are difficult to highlight, but the information is interesting

18 Further work
Improve the binary conversion
Refine the data set on the population
–for instance, 12 patients died of atherosclerosis while they were in the NG.
Refine our hypotheses
–Data set of the ENTRY table
–Look at the CONTROL table

19 Thanks !