Discovering Constrained Association Rules to Predict Heart Disease
  Carlos Ordonez*, Edward Omiecinski, Levien de Braal, Cesar Santana, et al.
  Georgia Tech, Emory University
  *working for Teradata (NCR)
  IEEE ICDM 2001
Motivation
  Goal: help in heart disease diagnosis
  Basic data mining technique
  Similar to expert system rules
  Combinatorial: causes => disease
  Simplicity: easy to interpret
  Privacy preserving
  Reliability: two statistical measures (support and confidence)
Medical data issues
  Rich attribute types; attributes must be transformed into binary items
  Small data set: n = 655 patients
  Noisy: many missing values and errors in data collection
  Naive approach: thousands or millions of associations and rules
  Negation makes the problem worse
Good rules:
  IF Age>=70, Smokes=Y, Gender=M THEN RCA>=50  s=0.4 c=1.0
  IF Gender=F, Age<70 THEN LAD>=70  s=0.2 c=1.0
  IF Gender=M, Age<70 THEN RCA<50
Bad rules:
  IF Age>=70 THEN Smokes=Y
  IF LAD>=70 THEN RCA>=50
  IF Gender=M, Age>=60, Smokes=Y THEN LAD, RCA
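The s and c values attached to the rules above are support and confidence. The slides do not spell out the formulas, so the block below assumes the standard association-rule definitions, for a rule X => Y over n patient records t:

```latex
\[
s(X \Rightarrow Y) = \frac{\bigl|\{\, t : X \cup Y \subseteq t \,\}\bigr|}{n},
\qquad
c(X \Rightarrow Y) = \frac{s(X \cup Y)}{s(X)}
\]
```

Under these definitions, c = 1.0 means every patient matching the antecedent also matches the consequent.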
Algorithm overview
  Map attributes to items
  Mine association rules (A-priori)
    Phase 1: generate frequent associations above minimum support
    Phase 2: generate rules above minimum confidence
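The two phases can be sketched roughly as follows. This is a minimal, unoptimized A-priori-style illustration, not the authors' implementation. The function names (`frequent_itemsets`, `rules`), the parameters, and the toy records are assumptions made for the example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_count, max_size):
    """Phase 1: level-wise search for itemsets occurring in >= min_count records."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    counts = {}
    level = [frozenset([i]) for i in items]
    size = 1
    while level and size <= max_size:
        # Count how many records contain each candidate of the current size.
        level_counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        frequent = {c: k for c, k in level_counts.items() if k >= min_count}
        counts.update(frequent)
        # Join step: merge frequent sets into candidates one item larger,
        # keeping only those whose subsets of the current size are all frequent.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            u = a | b
            if len(u) == size + 1 and all(
                frozenset(s) in frequent for s in combinations(u, size)
            ):
                candidates.add(u)
        level = list(candidates)
        size += 1
    return counts


def rules(counts, n, min_conf):
    """Phase 2: split each frequent itemset into antecedent => consequent."""
    out = []
    for itemset, k in counts.items():
        for r in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), r):
                ante = frozenset(ante)
                conf = k / counts[ante]  # subsets of a frequent set are frequent
                if conf >= min_conf:
                    out.append((ante, itemset - ante, k / n, conf))
    return out


# Toy data (hypothetical, not from the study).
records = [
    {"Age>=70", "Smokes=Y", "RCA>=50"},
    {"Age>=70", "Smokes=Y", "RCA>=50"},
    {"Age>=70", "RCA>=50"},
    {"Smokes=Y"},
]
counts = frequent_itemsets(records, min_count=2, max_size=3)
found = rules(counts, n=len(records), min_conf=1.0)
```

Each entry of `found` is a tuple (antecedent, consequent, support, confidence).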
Mapping attributes to binary data
  Each attribute is treated uniformly as either categorical or numerical
  Manual: numerical ranges are determined by an MD
  Each categorical value becomes an item
  Each numerical range becomes an item
  Missing information handling is simplified
  Each value/range can be negated
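A mapping along these lines might look like the sketch below. The cut points, attribute domains, and helper names (`all_range_labels`, `to_items`) are hypothetical; in the study the ranges were chosen by a physician. Skipping missing values and negating every other value of the same attribute are one possible reading of the bullets above, not a confirmed detail.

```python
from bisect import bisect_right


def all_range_labels(attr, cuts):
    """Item labels for the ranges induced by sorted cut points."""
    labels = [f"{attr}<{cuts[0]}"]
    labels += [f"{lo}<={attr}<{hi}" for lo, hi in zip(cuts, cuts[1:])]
    labels.append(f"{attr}>={cuts[-1]}")
    return labels


def to_items(record, cuts, domains, negate=False):
    """Map one patient record (dict) to a set of item strings."""
    items = set()
    for attr, value in record.items():
        if value is None:
            continue  # missing value: emit no item for this attribute
        if attr in cuts:
            # Numerical attribute: pick the range containing the value.
            every = all_range_labels(attr, cuts[attr])
            own = every[bisect_right(cuts[attr], value)]
        else:
            # Categorical attribute: one item per value.
            every = [f"{attr}={v}" for v in domains[attr]]
            own = f"{attr}={value}"
        items.add(own)
        if negate:
            # Assumed reading: negated items for the values that do not hold.
            items.update(f"not({lab})" for lab in every if lab != own)
    return items


# Hypothetical cut points and domains, for illustration only.
cuts = {"Age": [40, 60], "LAD": [50, 70]}
domains = {"Sex": ["F", "M"], "Smokes": ["Y", "N"]}
patient = {"Age": 72, "Sex": "F", "Smokes": None, "LAD": 30}
plain = to_items(patient, cuts, domains)
negated = to_items(patient, cuts, domains, negate=True)
```

The negated variant grows the item set considerably, which illustrates why negation increases the number of combinations to search.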
Important constraints
  Max rule size: simplicity. Phase 1 faster.
  A: Antecedent, C: Consequent: medically meaningful. Phase 2 faster.
  G: Group constraint: eliminates trivial or irrelevant associations. Phase 1 and 2 faster.
  Negation: more combinations
  Support = 2/n
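The constraints above could be applied as simple filters, roughly as sketched here. The encoding is an assumption made for illustration: `group` maps an item to a group number (0 or absent meaning unconstrained), and `role` maps an item to "A", "C", or "AC". The slides do not give the actual representation, and the item assignments below are hypothetical.

```python
def itemset_allowed(itemset, group, max_size):
    """Phase 1 filter: bounded size and at most one item per constrained group."""
    if len(itemset) > max_size:
        return False
    groups = [group[i] for i in itemset if group.get(i, 0) != 0]
    return len(groups) == len(set(groups))


def rule_allowed(antecedent, consequent, role):
    """Phase 2 filter: each item may appear only on the side its role permits."""
    return all("A" in role[i] for i in antecedent) and all(
        "C" in role[i] for i in consequent
    )


# Hypothetical assignments: both Age items share a group, so pairing them
# would only yield a trivial association.
group = {"Age<40": 1, "not(Age>=70)": 1, "Smokes=Y": 0}
role = {"Age>=70": "A", "Smokes=Y": "A", "RCA>=50": "C", "LAD>=70": "C"}

trivial = itemset_allowed({"Age<40", "not(Age>=70)"}, group, max_size=4)
useful = itemset_allowed({"Age<40", "Smokes=Y"}, group, max_size=4)
bad_rule = rule_allowed({"Age>=70"}, {"Smokes=Y"}, role)
good_rule = rule_allowed({"Age>=70", "Smokes=Y"}, {"RCA>=50"}, role)
```

Applying the group and size checks during candidate generation, rather than afterwards, is what would make phase 1 faster.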
Medical attributes
Experimental results
  Minimum support frequency: 2
  Max rule size: 4
  Time: 12 minutes
  Associations: 36,982 (10% of time)
  Rules: 2,987 (90% of time)
Medical significance
  Specificity
  Sensitivity
  Gold standard: catheterization
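Sensitivity and specificity follow their usual definitions: the fraction of truly diseased patients flagged as diseased, and the fraction of truly healthy patients flagged as healthy. The sketch below assumes the catheterization result serves as the true label. How the rules were actually scored against it is not stated in the slides, and the sample labels are made up.

```python
def sensitivity_specificity(predicted, actual):
    """Return (sensitivity, specificity) for boolean predictions vs. true labels.

    Assumes at least one positive and one negative in `actual`.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # diseased, flagged
    fn = sum(1 for p, a in pairs if not p and a)      # diseased, missed
    tn = sum(1 for p, a in pairs if not p and not a)  # healthy, cleared
    fp = sum(1 for p, a in pairs if p and not a)      # healthy, flagged
    return tp / (tp + fn), tn / (tn + fp)


# Made-up labels: True means "disease present".
rule_says = [True, True, False, False, True]
catheterization = [True, False, False, False, True]
sens, spec = sensitivity_specificity(rule_says, catheterization)
```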
Usage of rules
  Confirming knowledge: used to validate expert system IF-THEN rules
  Discovering knowledge: surprising to the domain expert
  Distinguishing healthy and sick patients
Rules predicting no heart disease
  IF Sex=F THEN 0<=LCX<50, s=22% c=73%
  IF Smokes=N THEN not(70<=RCA<100.1), s=29% c=71%
  IF Age<40, Diab=N THEN 0<=LAD<50, s=2% c=82%
  IF 40<=Age<60, Sex=F, Diab=N THEN RCA<50, s=7% c=80%
Rules predicting heart disease
  IF 0.2<=AP<1.1, PCarSur=Y THEN not(LAD<50), not(RCA<50), s=1% c=80%
  IF 60<Age, 0.2<=AP<1.1, Smokes=Y THEN not(LAD<50), s=10% c=83%
  IF 60<Age, 0.2<=SA<1.1, FHCAD=Y THEN not(LAD<50), s=2% c=100%
  IF 60<Age, 0.2<=AP<1.1, Sex=F THEN not(LAD<50), s=5% c=94%
Conclusions
  Mapping attributes is required
  Constraining is essential
  Some of the findings were unexpected
  Future work: find more useful constraints, use finer ranges, improve missing information handling, validate with clustering and decision trees