Discovering Constrained Association Rules to Predict Heart Disease


Carlos Ordonez*, Edward Omiecinski, Levien de Braal, Cesar Santana, et al.
Georgia Tech, Emory University
*working for Teradata (NCR)
IEEE ICDM 2001

Motivation
- Goal: help in heart disease diagnosis
- Basic Data Mining technique
- Similar to expert system rules
- Combinatorial: causes=>disease
- Simplicity: easy to interpret
- Privacy preserving
- Reliability: having two statistical measures

Medical data issues
- Rich attribute types. Attributes must be transformed into binary.
- Small data set size: n=655 patients.
- Noisy: there are many missing values.
- Errors in data collection.
- Naive approach: thousands/millions of associations and rules.
- Negation makes the problem worse.

Good rules:
- IF Age>=70, Smokes=Y, Gender=M THEN RCA>=50, s=0.4 c=1
- IF Gender=F, Age<70 THEN LAD>=70, s=0.2 c=1.0
- IF Gender=M, Age<70 THEN RCA<50

Bad rules:
- IF Age>=70 THEN Smokes=Y
- IF LAD>=70 THEN RCA>=50
- IF Gender=M, Age>=60, Smokes=Y THEN LAD, RCA
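The slides do not spell out what s and c mean. Assuming they are the usual support and confidence of a rule X => Y over n patient records (the "two statistical measures" from the Motivation slide), the standard definitions are sketched below. The notation count(Z), for the number of records containing every item in Z, is introduced here for illustration only.

```latex
\[
s(X \Rightarrow Y) = \frac{\mathrm{count}(X \cup Y)}{n},
\qquad
c(X \Rightarrow Y) = \frac{\mathrm{count}(X \cup Y)}{\mathrm{count}(X)}
\]
% With n = 655 patients and a minimum support frequency of 2 records,
% the minimum support as a fraction is
\[
s_{\min} = \frac{2}{n} = \frac{2}{655} \approx 0.0031
\]
```

Under this reading, c=1 means every patient matching the antecedent also matched the consequent.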

Algorithm overview
- Map attributes to items
- Mine association rules (A-priori)
  - Phase 1: generate frequent associations above minimum support
  - Phase 2: generate rules with minimum confidence
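The two phases above can be sketched roughly as follows. This is a minimal textbook-style A-priori sketch, not the authors' implementation; the function names, the tuple layout of a rule, and the toy items in the usage note are assumptions made for illustration.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_count, max_size):
    """Phase 1: level-wise search for itemsets found in >= min_count records."""
    counts = {}
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([item]) for item in items]
    size = 1
    while level and size <= max_size:
        # Count each candidate and keep only the frequent ones.
        current = {}
        for cand in level:
            c = sum(1 for t in transactions if cand <= t)
            if c >= min_count:
                current[cand] = c
        counts.update(current)
        # Join frequent k-itemsets into (k+1)-candidates; prune any candidate
        # that has an infrequent k-subset (downward closure).
        next_level = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == size + 1 and all(
                frozenset(sub) in current for sub in combinations(union, size)
            ):
                next_level.add(union)
        level = list(next_level)
        size += 1
    return counts

def generate_rules(counts, n, min_conf):
    """Phase 2: split each frequent itemset into antecedent => consequent."""
    rules = []
    for itemset, c in counts.items():
        for k in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), k):
                ante = frozenset(ante)
                conf = c / counts[ante]  # every subset of a frequent set is frequent
                if conf >= min_conf:
                    rules.append((ante, itemset - ante, c / n, conf))
    return rules
```

For example, calling `frequent_itemsets` on a list of item sets (one set per patient) with `min_count=2` and `max_size=4` mirrors the settings reported in the experimental results, and `generate_rules(counts, len(transactions), 0.7)` would then keep rules with at least 70% confidence (the 0.7 threshold is illustrative, not from the slides).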

Mapping attributes to binary data
- Attributes are treated uniformly as categorical or numerical
- Manual: ranges are determined by a physician (MD)
- Each categorical value becomes an item
- Each numerical range becomes an item
- Missing info handling simplified
- Each value/range can be negated
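A rough sketch of such a mapping is shown below. The cutoffs in `NUMERIC_RANGES`, the item string format, and the `with_negation` flag are all hypothetical; the slides only say that the real ranges were chosen manually by a physician. Missing values are handled here by simply emitting no item, which is one possible reading of "missing info handling simplified".

```python
from typing import Dict, List, Optional, Set, Tuple, Union

# Hypothetical cutoffs for illustration; not the ranges used in the study.
NUMERIC_RANGES: Dict[str, List[Tuple[float, float]]] = {
    "Age": [(0, 40), (40, 60), (60, 200)],
    "LAD": [(0, 50), (50, 70), (70, 100.1)],
}

Value = Optional[Union[str, float]]

def record_to_items(record: Dict[str, Value], with_negation: bool = False) -> Set[str]:
    """Turn one patient record into a set of binary items."""
    items: Set[str] = set()
    for attr, value in record.items():
        if value is None:
            continue  # missing value: the record contributes no item for attr
        if attr in NUMERIC_RANGES:
            for lo, hi in NUMERIC_RANGES[attr]:
                label = f"{lo}<={attr}<{hi}"
                if lo <= value < hi:
                    items.add(label)            # the range containing the value
                elif with_negation:
                    items.add(f"not({label})")  # negated item for every other range
        else:
            items.add(f"{attr}={value}")        # one item per categorical value
    return items
```

With these assumed ranges, a record such as `{"Age": 65, "Sex": "F", "LAD": None}` yields the items `60<=Age<200` and `Sex=F`; turning on `with_negation` adds one `not(...)` item per non-matching range, which illustrates why negation enlarges the search space.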

Important constraints
- Max rule size: simplicity. Phase 1 faster.
- A: Antecedent, C: Consequent. Medically meaningful. Phase 2 faster.
- G: Group constraint: eliminate trivial or irrelevant associations. Phases 1 and 2 faster.
- Negation: more combinations
- Support = 2/n
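One way these constraints could be checked on a candidate rule is sketched below. The role table, the group table, and the interpretation of the group constraint (at most one item per group) are assumptions for illustration; the slides do not give the exact definitions. Items are modelled as hypothetical `(attribute, label)` pairs so that the attribute can be looked up directly.

```python
from typing import Dict, Set, Tuple

Item = Tuple[str, str]  # (attribute, value or range label), hypothetical layout

# Hypothetical roles: "A" = antecedent only, "C" = consequent only.
ROLE: Dict[str, str] = {
    "Age": "A", "Gender": "A", "Smokes": "A",
    "LAD": "C", "RCA": "C", "LCX": "C",
}

# Hypothetical group overrides; by default each attribute is its own group.
GROUP: Dict[str, str] = {}

def satisfies_constraints(antecedent: Set[Item], consequent: Set[Item],
                          max_size: int = 4) -> bool:
    """Return True if the rule passes the size, A/C, and group constraints."""
    items = antecedent | consequent
    if len(items) > max_size:
        return False  # max rule size
    if any(ROLE.get(attr) == "C" for attr, _ in antecedent):
        return False  # consequent-only attribute used as a cause
    if any(ROLE.get(attr) == "A" for attr, _ in consequent):
        return False  # antecedent-only attribute used as an outcome
    groups = [GROUP.get(attr, attr) for attr, _ in items]
    return len(groups) == len(set(groups))  # assumed: one item per group
```

Under these assumed roles, "IF Age>=70 THEN Smokes=Y" and "IF LAD>=70 THEN RCA>=50" from the bad-rules slide are both rejected, while "IF Age>=70, Smokes=Y, Gender=M THEN RCA>=50" passes. In a real miner the same checks would be applied while candidates are generated rather than afterwards, which is where the speedup in Phases 1 and 2 would come from.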

Medical attributes

Experimental results
- Minimum support frequency: 2
- Max rule size: 4
- Time: 12 minutes
- Associations: 36,982 (10% of time)
- Rules: 2,987 (90% of time)

Medical significance
- Specificity
- Sensitivity
- Gold standard: catheterization
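The slide only names the two measures. As a reminder of the standard definitions, the sketch below computes sensitivity and specificity from a rule's predictions against a gold-standard label such as the catheterization result. The function name and the 0/1 encoding (1 = diseased) are assumptions; how the authors applied these measures is not stated in the transcript.

```python
from typing import Sequence, Tuple

def sensitivity_specificity(predicted: Sequence[int],
                            gold: Sequence[int]) -> Tuple[float, float]:
    """Compare 0/1 predictions with 0/1 gold-standard labels (1 = diseased)."""
    tp = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 1)
    fn = sum(1 for p, g in zip(predicted, gold) if p == 0 and g == 1)
    tn = sum(1 for p, g in zip(predicted, gold) if p == 0 and g == 0)
    fp = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 0)
    sensitivity = tp / (tp + fn)  # share of diseased patients that are flagged
    specificity = tn / (tn + fp)  # share of healthy patients that are cleared
    return sensitivity, specificity
```

The sketch assumes both classes occur in `gold`; otherwise one of the denominators is zero.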

Usage of rules
- Confirming knowledge. Used to validate Expert System IF-THEN rules.
- Discovering knowledge. Surprising to domain expert.
- Distinguish healthy and sick patients

Rules predicting no heart disease
- IF Sex=F THEN 0<=LCX<50, s=22% c=73%
- IF Smokes=N THEN not(70<=RCA<100.1), s=29% c=71%
- IF Age<40, Diab=N THEN 0<=LAD<50, s=2% c=82%
- IF 40<=Age<60, Sex=F, Diab=N THEN RCA<50, s=7% c=80%

Rules predicting heart disease
- IF 0.2<=AP<1.1, PCarSur=Y THEN not(LAD<50) not(RCA<50), s=1% c=80%
- IF 60<Age, 0.2<=AP<1.1, Smokes=Y THEN not(LAD<50), s=10% c=83%
- IF 60<Age, 0.2<=SA<1.1, FHCAD=Y THEN not(LAD<50), s=2% c=100%
- IF 60<Age, 0.2<=AP<1.1, Sex=F THEN not(LAD<50), s=5% c=94%

Conclusions
- Mapping attributes is required
- Constraining is essential
- Some of the findings were unexpected
- Future work: find more useful constraints, finer ranges, improve missing info handling, validate by clustering and decision trees