ECML/PKDD 2003 Discovery Challenge
Attribute-Value and First Order Data Mining within the STULONG project
Anneleen Van Assche, Sofie Verbaeten, Darek Krzywania, Jan Struyf, Hendrik Blockeel
Department of Computer Science, Katholieke Universiteit Leuven
Outline
- Data Mining Effort
  - Data
  - Data preprocessing
  - Data mining
  - Evaluation criteria
- Discovered Knowledge
  - Initial exploration
  - Entry data
  - Control data
- Conclusions
Data
- we studied 2 of the 4 data matrices from the STULONG data set:
  - the Entry data matrix
  - the Control data matrix
- men in the Entry data are divided into 3 subgroups based on the occurrence of risk factors:
  - normal group (NG): none of the risk factors
  - risk group (RG): at least one of the risk factors
  - pathological group (PG): manifested serious disease
Data preprocessing
- missing values / empty entries / “not stated” / “no”
- data = propositionalisation of a relational database
  - many empty entries + redundancies (e.g. personal anamnesis)
  - 1-n relation from Entry to Control data set
- solution: relational representation (ILP)
  - background knowledge can be used
  - new features for trend analysis in control examinations
Data preprocessing
- Attribute-value: Entry data set
  - converted to Weka .arff format
  - introduction of new attributes (e.g. BMI, …)
- Relational: Entry + Control data set
  - converted to relational ILP format
  - introduction of background knowledge
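The attribute-value preprocessing step above can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the column names `height_cm` and `weight_kg`, the helper names, and the sample records are all hypothetical. It assumes the standard BMI definition (weight in kg divided by squared height in m) and writes numeric attributes only.

```python
from typing import Dict, List, Optional


def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index: weight (kg) divided by squared height (m)."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m * height_m)


def add_bmi(rows: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Return copies of the rows with a derived BMI attribute (1 decimal)."""
    return [dict(r, BMI=round(bmi(r["weight_kg"], r["height_cm"]), 1)) for r in rows]


def to_arff(relation: str, rows: List[Dict[str, Optional[float]]]) -> str:
    """Serialise numeric rows to ARFF text; '?' marks a missing value."""
    names = list(rows[0].keys())
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {n} numeric" for n in names]
    lines += ["", "@data"]
    for r in rows:
        lines.append(",".join("?" if r.get(n) is None else str(r[n]) for n in names))
    return "\n".join(lines) + "\n"


# Two made-up records, purely for illustration (not STULONG data).
entry = [
    {"height_cm": 180.0, "weight_kg": 81.0},
    {"height_cm": 170.0, "weight_kg": 70.0},
]
entry_with_bmi = add_bmi(entry)
arff_text = to_arff("entry", entry_with_bmi)
print(arff_text)
```

Writing `?` for missing values follows the ARFF convention, which keeps empty entries distinct from real measurements.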
Data mining
- Entry data in .arff format: Weka
  - classification (ZeroR, OneR, NB, Decision Stump, Decision Table, J48, …)
  - regression (Linear Regression, M5’)
  - association rules (Apriori)
- Entry + Control data in ILP format: ACE
  - classification (Tilde)
  - regression (Tilde)
- since the data distributions are skewed, it is better to use regression to predict the chance of being positive/negative than to use classification
Evaluation criteria
- 10-fold cross-validation
- classifiers
  - ROC analysis (Area Under Curve)
  - accuracy
- regression models
  - relative error (RE)
  - Pearson’s correlation coefficient (r)
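The criteria listed above can be sketched in a few lines of plain Python. The slides do not define the relative error, so the definition used here (squared error of the model divided by the squared error of always predicting the mean) is an assumption; the function names and the toy data are also illustrative only.

```python
import math
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def relative_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """ASSUMED definition: SSE of the model / SSE of predicting the mean."""
    mean = sum(y_true) / len(y_true)
    sse_model = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sse_mean = sum((t - mean) ** 2 for t in y_true)
    return sse_model / sse_mean


# Made-up skewed example: 3 negatives, 1 positive.
labels = [0, 0, 0, 1]
scores = [0.1, 0.2, 0.4, 0.3]
predicted = [1 if s >= 0.5 else 0 for s in scores]  # everything becomes class 0
accuracy = sum(p == y for p, y in zip(predicted, labels)) / len(labels)
print(accuracy, auc(scores, labels))
```

In this toy case the accuracy equals that of always predicting the majority class, while the AUC is above 0.5. This is why ranking-based criteria are useful on skewed data.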
Initial exploration of Entry
- comparison of mean values of attributes for the three subgroups:
  - reached education
  - responsibility in job
  - physical activity in job
  - physical activity after job
  - skinfold above musculus triceps
  - skinfold above musculus subscapularis
Initial exploration of Entry
- correlation between BMI and skinfold for the three subgroups
Results from the Entry data set
- relations between social factors and other characteristics
  - education level and physical activity in job
  - education level and smoking
  - pensioner and drinking
  - age and blood pressure
- relations between physical activities and other characteristics
  - activity after job and smoking
  - duration of way to work and drinking
  - ...
Results from the Entry data set
Correlation between skinfolds and BMI in particular risk groups
- regression task: predict BMI using SUBSC and TRIC
- classification task: predict OVERWEIGHT (OW) (1 if BMI > 25, else 0)

Experiment  Size  ACC  RAE  r  AUC
OW_T        6.0   71%
OW_NG       0.6   53%
OW_RG       3.9   74%
OW_PG       1.0   75%
BMI_T
BMI_NG
BMI_RG
BMI_PG
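The classification task above can be illustrated with a single-split learner (a decision stump) on SUBSC. This is a sketch under stated assumptions: the `best_stump` helper is hypothetical, the threshold search is a plain accuracy maximisation rather than the criterion used by Weka or Tilde, and the data are made up and carry no information about STULONG.

```python
from typing import List, Tuple


def overweight(bmi: float) -> int:
    """OVERWEIGHT label as defined on the slide: 1 if BMI > 25, else 0."""
    return 1 if bmi > 25 else 0


def best_stump(values: List[float], labels: List[int]) -> Tuple[float, float]:
    """Find the threshold t maximising accuracy of 'predict 1 if value > t'.

    Candidates are midpoints between consecutive distinct values.
    Returns (threshold, accuracy); -inf means 'always predict 1'.
    """
    xs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    best = (float("-inf"), sum(labels) / len(labels))
    for t in candidates:
        correct = sum((v > t) == bool(y) for v, y in zip(values, labels))
        acc = correct / len(labels)
        if acc > best[1]:
            best = (t, acc)
    return best


# Made-up (SUBSC, BMI) pairs, constructed for illustration only.
sample = [(8, 22.0), (11, 23.5), (14, 24.8), (16, 25.6), (19, 27.0), (24, 29.3)]
subsc = [float(s) for s, _ in sample]
ow = [overweight(b) for _, b in sample]
threshold, accuracy = best_stump(subsc, ow)
print(threshold, accuracy)
```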
Results from the Entry data set
Correlation between skinfolds and BMI in particular subgroups
- correlation is strongest in the risk group
- for all groups, SUBSC > approximately 15 is the most important split to distinguish between overweight and non-overweight
  - higher SUBSC goes with higher BMI
- influence of TRIC on BMI is smaller than the influence of SUBSC
Results from the Entry data set
Correlation between skinfolds and BMI in particular subgroups
[Figure: example tree for the risk group, with yes/no branches on the tests TRIC < 15, SUBSC < 10, SUBSC < 15, SUBSC < 20, SUBSC < 70, SUBSC < 35]
Results from the Entry data set
Staying healthy in the risk group (RG)
- task: predict whether a person of RG came down with cardio disease
- new attribute ILL introduced, based on the HODN0 attribute from Control
- no good performance (most correlation coefficients < 0.05)
- best correlation (0.15) for cholesterol level
  - if cholesterol < 250 then chance to stay healthy
Results from the Control data set
- relational Control data set, mined with Tilde
- task: predict whether a person from the risk group comes down with cardio disease (1) or not (0)
- use only control examinations (ce) before the patient’s cardio disease: ce.year ≤ ROK_i
- numeric attributes: extra features
  - compute trend over the different ce’s
  - slope of the least squares model of the attribute over the time interval [T - N, T]
    - T: start of the patient’s first disease
    - N: parameter chosen by Tilde
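The trend feature described above can be sketched as an ordinary least-squares slope over the control examinations inside the window [T - N, T]. This is a minimal illustration: the function names are hypothetical, the examinations are made up, and here N is passed in by hand, whereas on the slide it is a parameter chosen by Tilde.

```python
from typing import List, Optional, Tuple


def ls_slope(points: List[Tuple[float, float]]) -> Optional[float]:
    """Slope of the least-squares line through (year, value) points.

    Returns None when fewer than two distinct years are available.
    """
    n = len(points)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    if sxx == 0:
        return None
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in points)
    return sxy / sxx


def trend_feature(
    exams: List[Tuple[float, float]], disease_year: float, window: float
) -> Optional[float]:
    """Slope of an attribute over the examinations with T - N <= year <= T."""
    selected = [(y, v) for y, v in exams if disease_year - window <= y <= disease_year]
    return ls_slope(selected)


# Made-up (year, cholesterol) examinations for one patient; not STULONG data.
exams = [(1976, 230.0), (1980, 240.0), (1984, 250.0), (1988, 260.0), (1992, 300.0)]
slope_last_10 = trend_feature(exams, disease_year=1988, window=10)
print(slope_last_10)
```

Examinations after T are excluded by construction, which mirrors the restriction to control examinations before the patient's disease.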
Results from the Control data set
Statistics on the Control data experiments

Input attributes        Size  ACC  RAE  r  AUC  AUC (33%)
Job                     1.0   68%
Physical activity       0.1   68%
Smoking                 3.7   67%
Diet                    0.0   68%
BMI                     1.4   67%
Blood Pressure          3.3   63%
Cholesterol             9.1   64%
Glycaemia & Uric acid   3.3   66%
BMI & Cholesterol       10.6  63%
Smoking & Cholesterol   12.5  63%
All                     8.5   66%
Results from the Control data set
Some interesting subgroups from the decision trees:
- proportion of class 1 in whole group = 32%
- total population = 1417
- IF glycaemia > 7.2 and BMI > 23.5 in each examination and diastolic blood pressure slope during last 10 years < -77 THEN 64% (103)
- IF systolic blood pressure slope during last 20 years < THEN 53% (122)
- IF glycaemia > 7.2 in each examination THEN 48% (434)
- IF patient leaves to full retirement in some examination THEN 20% (233)
- IF reduced smoking in some examination and slope in number of cigarettes during last 20 years THEN 16% (116)
- IF glycaemia < 7.2 in some examination THEN 7% (285)
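The conditions "in each examination" and "in some examination" in the rules above quantify over the 1-n relation between a patient and his control examinations. A minimal sketch of the two quantifiers follows; the helper names and the glycaemia readings are made up, and this is not how Tilde represents such tests internally.

```python
from typing import Callable, Dict, List


def in_each(values: List[float], test: Callable[[float], bool]) -> bool:
    """True if the test holds in every examination (and there is at least one)."""
    return len(values) > 0 and all(test(v) for v in values)


def in_some(values: List[float], test: Callable[[float], bool]) -> bool:
    """True if the test holds in at least one examination."""
    return any(test(v) for v in values)


# Made-up glycaemia readings per patient, one value per control examination.
patients: Dict[str, List[float]] = {
    "p1": [7.5, 8.1, 7.9],
    "p2": [6.8, 7.4, 7.6],
}
high_in_each = {pid: in_each(vals, lambda v: v > 7.2) for pid, vals in patients.items()}
low_in_some = {pid: in_some(vals, lambda v: v < 7.2) for pid, vals in patients.items()}
print(high_in_each, low_in_some)
```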
Results from the Control data set
- glycaemia is the most important attribute
- also blood pressure, cholesterol and smoking …
- slope of numeric attributes is very useful
- statistics may be negatively biased due to cross-validation
Conclusions
- used a variety of data mining algorithms
  - propositional techniques
  - multi-relational techniques
- results are consistent over different algorithms
- much discovered knowledge, difficult to handle
  - interpretation of results by domain experts is necessary
- careful handling of results
  - if the accuracy of a classifier is not larger than that of predicting the average, the classifier can still be informative!
The End
Thanks for your attention!