ECML / PKDD 2004 Discovery Challenge: Mining Strong Associations and Exceptions in the STULONG Data Set. Eduardo Corrêa Gonçalves and Alexandre Plastino.

Presentation transcript:

Slide 1: ECML / PKDD 2004 Discovery Challenge
Mining Strong Associations and Exceptions in the STULONG Data Set
Eduardo Corrêa Gonçalves and Alexandre Plastino*
Universidade Federal Fluminense, Department of Computer Science, Niterói, Rio de Janeiro, Brazil
*Work sponsored by CNPq research grant /00-8

Slide 2: Outline of the talk
1. Atherosclerosis Data Set
2. Multidimensional Association Rules
3. Exceptions
4. Data Preparation
5. Results
6. Summary

Slide 3: Atherosclerosis Data Set
STULONG Data Set: risk factors of atherosclerosis in a population of 1417 middle-aged men from the Czech Republic. Four tables are included in this data set:
- Entry: data from the entry examinations performed on these men (the first step of the STULONG project).
- Control: data from long-term observations.
- Letter: additional information about the health status of 403 men.
- Death: data on the patients who died.

Slide 4: Basic Groups of Patients
The patients were classified into three basic groups according to the results of the entry examinations:
A. Normal Group: men without any risk factor.
B. Risk Group: men with one or more risk factors.
C. Pathologic Group: men with either an identified cardiovascular disease or another serious disease.

Slide 5: Contribution
The main contribution of this work is to present strong association rules and exceptions mined from the Entry table. The mining process was directed at discovering relations among the following characteristics of the patients in the basic groups:
- Social factors.
- Physical activities during free time.
- Alcohol consumption.
- Smoking.
- Results of the biochemical examinations and the physical check-up.

Slide 7: Multidimensional Association Rules
Multidimensional Association Rules (J. Han and M. Kamber, 2001) represent combinations of attribute values that often occur together in a database. They can be mined from relational databases or data warehouses.
Example: (DailyBeerCons = “>1l”) → (Smoking = “>20 cig/day”)
Meaning: “men who are heavy beer consumers tend to be also heavy smokers”.
This rule involves two attributes (or dimensions): DailyBeerCons and Smoking.

Slide 8: Multidimensional Association Rules: Formal Definition
$A_1 = a_1, \ldots, A_n = a_n \rightarrow B_1 = b_1, \ldots, B_m = b_m$
- $A_i$ ($1 \le i \le n$) and $B_j$ ($1 \le j \le m$): distinct attributes (dimensions) from a database relation.
- $a_i$ and $b_j$: values from the domains of $A_i$ and $B_j$, respectively.
- Generic representation: $A \rightarrow B$, where A is the antecedent and B is the consequent of the rule.
- Several attributes can be involved in both the antecedent and the consequent.

Slide 9: Interest Measures: Support and Confidence
- Support index (Sup): the probability that a tuple matches all conditions in $A \rightarrow B$.
- Confidence index (Conf): the probability that a tuple matches B, given that it matches A.
$\mathrm{Sup}(A \rightarrow B) = P(A, B)$ and $\mathrm{Conf}(A \rightarrow B) = P(B \mid A)$.
The support indicates the relevance and the confidence indicates the validity of an association rule.
Support / Confidence Framework (Agrawal et al, 1993): finding all rules that match user-provided minimum support and minimum confidence.
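The two definitions above can be illustrated with a minimal Python sketch. It assumes each tuple is a dictionary of attribute values; the function names, the value “no” for DailyBeerCons, and the four toy rows are hypothetical and are not taken from the STULONG data.

```python
from typing import Dict, List

Row = Dict[str, str]


def matches(row: Row, conditions: Row) -> bool:
    """True when the row satisfies every attribute = value condition."""
    return all(row.get(attr) == value for attr, value in conditions.items())


def support(rows: List[Row], antecedent: Row, consequent: Row) -> float:
    """Sup(A -> B): fraction of rows matching all conditions of A and B."""
    if not rows:
        return 0.0
    both = {**antecedent, **consequent}
    return sum(matches(r, both) for r in rows) / len(rows)


def confidence(rows: List[Row], antecedent: Row, consequent: Row) -> float:
    """Conf(A -> B): fraction of the rows matching A that also match B."""
    n_a = sum(matches(r, antecedent) for r in rows)
    if n_a == 0:
        return 0.0
    n_ab = sum(matches(r, {**antecedent, **consequent}) for r in rows)
    return n_ab / n_a


# Hypothetical toy relation (not real STULONG tuples).
rows = [
    {"DailyBeerCons": ">1l", "Smoking": ">20 cig/day"},
    {"DailyBeerCons": ">1l", "Smoking": "no"},
    {"DailyBeerCons": "no", "Smoking": "no"},
    {"DailyBeerCons": "no", "Smoking": ">20 cig/day"},
]
A = {"DailyBeerCons": ">1l"}
B = {"Smoking": ">20 cig/day"}

print("Sup :", support(rows, A, B))
print("Conf:", confidence(rows, A, B))
```

In this toy relation one row out of four matches both conditions, and one of the two heavy beer consumers is a heavy smoker.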

Slide 10: Interest Measures: Support and Confidence
Problems with the Support / Confidence Framework (Brin et al, 1997):
- A huge number of rules is generated, and most of these rules are obvious.
- In many cases, these rules express relations that are not true.

Slide 11: Interest Measures: Support and Confidence
Table columns: Id, Association Rule, Sup A, Sup B, Sup, Conf (numeric cells missing).
R1: (DailyBeerCons = “>1l”) → (Smoking = “>20 cig/day”)
R2: (DailyBeerCons = “>1l”) → (Married = “yes”)
The support and confidence values of R2 are higher than those of R1. Is R2, in fact, more interesting than R1?

Slide 12: Negative Dependence
Table columns: Id, Association Rule, Sup A, Sup B, Sup, Conf (numeric cells missing).
R2: (DailyBeerCons = “>1l”) → (Married = “yes”)
R2 should imply that men who are heavy beer consumers tend to be married. [value missing]% of men are married. However, the probability for a man to be married, given that he is a heavy beer consumer, is 75.84%. Heavy beer consumers are, in fact, less likely to be married. There is a negative dependence between being married and being a heavy beer consumer.

Slide 13: Positive Dependence
Table columns: Id, Association Rule, Sup A, Sup B, Sup, Conf (numeric cells missing).
R1: (DailyBeerCons = “>1l”) → (Smoking = “>20 cig/day”)
26.02% of men are heavy smokers. The probability for a man to be a heavy smoker, given that he is a heavy beer consumer, is 37.58%. Heavy beer consumers are more likely to smoke a lot. There is a positive dependence between being a heavy beer consumer and being a heavy smoker.

Slide 14: Strong Association Rule
Table columns: Id, Association Rule, Sup A, Sup B, Sup, Conf (numeric cells missing).
R1: (DailyBeerCons = “>1l”) → (Smoking = “>20 cig/day”)
R2: (DailyBeerCons = “>1l”) → (Married = “yes”)
Conclusions: R1 is a strong association rule, while R2 does not express a true association. In order to mine interesting information, we need to evaluate the type of dependence between the antecedent and the consequent of a rule.

Slide 15: Lift and RI
Lift: how much more frequent B is when A occurs.
$\mathrm{Lift}(A \rightarrow B) = \mathrm{Conf}(A \rightarrow B) / \mathrm{Sup}(B)$
RI, Rule Interest (G. Piatetsky-Shapiro, 1991): the percentage of tuples matched by an association rule in excess of the expected percentage.
$\mathrm{RI}(A \rightarrow B) = \mathrm{Sup}(A \rightarrow B) - \mathrm{Sup}(A) \times \mathrm{Sup}(B)$
We believe that the use of different interest measures (Sup, Conf, Lift and RI) provides alternative analyses of the same data, giving a better understanding of the associations.
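As a rough illustration, both measures can be computed directly from three supports. The sketch below is in Python; the function names are hypothetical, and the input values are the supports quoted later in the talk for the rule (DailyBeerCons = “>1l”) → (Smoking = “>20 cig/day”): Sup(A) = 11.93%, Sup(B) = 26.02%, Sup = 4.48%.

```python
def lift(sup_ab: float, sup_a: float, sup_b: float) -> float:
    """Lift(A -> B) = Conf(A -> B) / Sup(B), with Conf = Sup(A -> B) / Sup(A)."""
    conf = sup_ab / sup_a
    return conf / sup_b


def rule_interest(sup_ab: float, sup_a: float, sup_b: float) -> float:
    """RI(A -> B) = Sup(A -> B) - Sup(A) * Sup(B)."""
    return sup_ab - sup_a * sup_b


# Supports expressed as fractions rather than percentages.
sup_a, sup_b, sup_ab = 0.1193, 0.2602, 0.0448

rule_lift = lift(sup_ab, sup_a, sup_b)
rule_ri = rule_interest(sup_ab, sup_a, sup_b)

print(f"Lift = {rule_lift:.3f}")
print(f"RI   = {rule_ri:.4f}")
```

A lift above 1 and a positive RI both point to positive dependence; a lift below 1 and a negative RI point to negative dependence.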

Slide 17: Exceptions
In our approach, exceptions represent association rules that become much weaker in some specific subsets of the database.
Example: Does the rule (DailyBeerCons = “>1l”) → (Smoking = “>20 cig/day”) become weaker on any subset of the database?
Mined exception: (DailyBeerCons = “>1l”) & (Age = “≥50”) → (Smoking = “>20 cig/day”)
Meaning: “among the men who are 50 years old or above, the support value of the association between being a heavy beer consumer and being a heavy smoker is surprisingly smaller than what is expected”.

Slide 18: Exceptions
(DailyBeerCons = “>1l”) & (Age = “≥50”) → (Smoking = “>20 cig/day”)
- This exception was obtained because the conventional rule (DailyBeerCons = “>1l”) & (Age = “≥50”) → (Smoking = “>20 cig/day”) did not achieve an expected support.
- This expected support is evaluated from the support of the original rule (DailyBeerCons = “>1l”) → (Smoking = “>20 cig/day”) and the support of the condition (Age = “≥50”).

Slide 19: Exceptions: Formal Definition
- Let D be a database relation.
- Let R: $A \rightarrow B$ be a multidimensional association rule.
- Let $Z = \{Z_1 = z_1, \ldots, Z_k = z_k\}$ be a set of conditions defined over D, where $Z \cap A \cap B = \emptyset$. Z is called the probe set.
- An exception related to the positive rule R is an implication of the form: $A \wedge Z \rightarrow B$

Slide 20: Candidate Exceptions
Exceptions are extracted from candidate exceptions. A candidate exception is an expression of the form: $A \wedge Z \rightarrow B$
Exceptions are mined only if the candidates do not achieve an expected support. This expectation is evaluated based on the support of the original rule $A \rightarrow B$ and the support of the conditions that compose the probe set Z:
$\mathrm{ExpSup}(A \wedge Z \rightarrow B) = \mathrm{Sup}(A \rightarrow B) \times \mathrm{Sup}(Z)$

Slide 21: The Interest Measure (IM) Index
We developed two interest measures to evaluate the degree of interestingness of an exception. The IM (Interest Measure) index evaluates the strength (relevance) of an exception.
$\mathrm{IM}(E) = 1 - \frac{\mathrm{Sup}(A \wedge Z \rightarrow B)}{\mathrm{ExpSup}(A \wedge Z \rightarrow B)}$
An exception E is potentially interesting if the actual support of $A \wedge Z \rightarrow B$ is much lower than its expected support. This measure captures the type of dependence between Z and $A \rightarrow B$. The closer the value is to 1, the stronger the negative dependence.

Slide 22: Example of the IM Index
R: (DailyBeerCons = “>1l”) → (Smoking = “>20 cig/day”), Sup(R) = 4.48%
Z = {(Age = “≥50”)}, Sup(Z) = 22.82%
The expected support for $A \wedge Z \rightarrow B$ can be computed as 4.48% × 22.82% = 1.02%. The actual support of $A \wedge Z \rightarrow B$ is 0.48%.
The exception E1: $A \wedge Z \rightarrow B$ is potentially interesting because IM(E1) = 1 − (0.48 ÷ 1.02) = 0.53. The actual support value of E1 is 53% lower than what is expected.
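The arithmetic of this example can be replayed with a short Python sketch. The function names are hypothetical; the formulas are the ExpSup and IM definitions from the preceding slides, and the inputs are the percentages quoted in the example, written as fractions.

```python
def expected_support(sup_rule: float, sup_probe: float) -> float:
    """ExpSup(A ^ Z -> B) = Sup(A -> B) * Sup(Z)."""
    return sup_rule * sup_probe


def interest_measure(actual_sup: float, sup_rule: float, sup_probe: float) -> float:
    """IM(E) = 1 - Sup(A ^ Z -> B) / ExpSup(A ^ Z -> B)."""
    exp_sup = expected_support(sup_rule, sup_probe)
    if exp_sup == 0:
        raise ValueError("expected support is zero; IM is undefined")
    return 1 - actual_sup / exp_sup


sup_rule = 0.0448    # Sup(R) for heavy beer consumption -> heavy smoking
sup_probe = 0.2282   # Sup(Z) for Age >= 50
actual_sup = 0.0048  # observed support of the candidate exception

exp_sup = expected_support(sup_rule, sup_probe)
im = interest_measure(actual_sup, sup_rule, sup_probe)

print(f"ExpSup = {exp_sup:.4f}")
print(f"IM     = {im:.2f}")
```

Rounded to two decimals, the sketch reproduces the expected support of 1.02% and the IM value of 0.53.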

Slide 23: Degree of Unexpectedness
A high value for the IM measure is not a guarantee that we found interesting information.
R: (DailyBeerCons = “>1l”) → (Smoking = “>20 cig/day”), Sup(R) = 4.48%
Z = {(Alcohol = “no”)}, Sup(Z) = 9.47%
- The expected support for $A \wedge Z \rightarrow B$ can be computed as 4.48% × 9.47% = 0.42%.
- The actual support for this candidate rule is 0.00%.
- IM($A \wedge Z \rightarrow B$) = 1 − (0.00 ÷ 0.42) = 1.00
- However, this exception represents information that is obvious. The IM index could not detect the strong negative dependence between A and Z.

Slide 24: Degree of Unexpectedness
The DU (Degree of Unexpectedness) index is used to determine the validity of an exception. This measure captures how much stronger the negative dependence between a probe set Z and a rule $A \rightarrow B$ is than the negative dependence between Z and either A or B.
$\mathrm{DU}(E) = \mathrm{IM}(E) - \max\left(1 - \frac{\mathrm{Sup}(A \wedge Z)}{\mathrm{ExpSup}(A \wedge Z)},\; 1 - \frac{\mathrm{Sup}(B \wedge Z)}{\mathrm{ExpSup}(B \wedge Z)}\right)$
The further the value is above 0, the more interesting the exception will be. If $\mathrm{DU}(E) \le 0$, the exception is uninteresting.

Slide 25: Example of the DU Index
R: (DailyBeerCons = “>1l”) → (Smoking = “>20 cig/day”)
Sup(R) = 4.48%, Sup(A) = 11.93%, Sup(B) = 26.02%
Z = {(Age = “≥50”)}
Sup(Z) = 22.82%, Sup($A \wedge Z$) = 2.00%, Sup($B \wedge Z$) = 6.00%
1) Compute the negative dependence between A and Z: 1 − (2.00% ÷ (11.93% × 22.82%)) = 0.27
2) Compute the negative dependence between B and Z: 1 − (6.00% ÷ (26.02% × 22.82%)) = −0.01
The exception E1: $A \wedge Z \rightarrow B$ is, in fact, interesting because: DU(E1) = 0.53 − max(0.27, −0.01) = 0.26
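The same computation can be sketched in Python. The function names are hypothetical, and the expected supports of A with Z and of B with Z are taken as products of the individual supports, as in steps 1 and 2 of the example. The slide rounds each intermediate value to two decimals before subtracting; the sketch keeps full precision, so its result (about 0.265) differs slightly from the 0.26 reported on the slide.

```python
def negative_dependence(actual_sup: float, sup_x: float, sup_z: float) -> float:
    """1 - actual / expected, where expected = sup_x * sup_z."""
    return 1 - actual_sup / (sup_x * sup_z)


def degree_of_unexpectedness(
    sup_azb: float,  # actual support of the candidate exception
    sup_ab: float,   # Sup(A -> B), the original rule
    sup_a: float,
    sup_b: float,
    sup_z: float,
    sup_az: float,   # support of A together with Z
    sup_bz: float,   # support of B together with Z
) -> float:
    """DU(E) = IM(E) - max(dependence of A and Z, dependence of B and Z)."""
    im = negative_dependence(sup_azb, sup_ab, sup_z)
    dep_a = negative_dependence(sup_az, sup_a, sup_z)
    dep_b = negative_dependence(sup_bz, sup_b, sup_z)
    return im - max(dep_a, dep_b)


dep_a = negative_dependence(0.0200, 0.1193, 0.2282)
dep_b = negative_dependence(0.0600, 0.2602, 0.2282)
du = degree_of_unexpectedness(
    sup_azb=0.0048, sup_ab=0.0448, sup_a=0.1193, sup_b=0.2602,
    sup_z=0.2282, sup_az=0.0200, sup_bz=0.0600,
)

print(f"dependence(A, Z) = {dep_a:.2f}")
print(f"dependence(B, Z) = {dep_b:.2f}")
print(f"DU(E1)           = {du:.3f}")
```

Because the result is well above 0, the exception would still be judged interesting under either rounding convention.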

Slide 27: Data Preparation
The following relations in the ARFF format (Witten and Frank, 2000) were generated from the original Entry table:
- ENTRY_TOT: 1249 tuples (men from groups A, B and C).
- ENTRY_A: 276 tuples (only men from group A).
- ENTRY_B: 859 tuples (only men from group B).
- ENTRY_C: 114 tuples (only men from group C).

Slide 28: Data Preparation
Data was enriched with new fields and the continuous attributes were discretized.
Field | Possible Values
Cholesterol | “desirable” (<200), “bordering” (200–239), “high” (≥240)
Triglycerides | “desirable” (<150), “bordering” (150–200), “high” (range missing), “very high” (≥500)
BMI (body mass index) | “underweight” (bmi < 20), “normal” (20 ≤ bmi < 25), “overweight” (25 ≤ bmi < 30), “obese” (30 ≤ bmi < 40), “morbidly obese” (bmi ≥ 40)
Blood Pressure | “normal”, “normal / high”, “high”
Skin Folds | “8-20”, “21-30”, “31-40”, “>40”
Age | “38-39”, “40-44”, “45-49”, “≥50”
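A discretization step of this kind might look like the following Python sketch. It covers only the BMI and cholesterol fields, whose cut points are fully stated in the table. The function names and the use of `bisect` are assumptions, and treating “bordering” cholesterol as 200 ≤ value < 240 is an interpretation of the range “200–239” for non-integer readings.

```python
from bisect import bisect_right
from typing import Sequence

# Cut points and labels copied from the table; a value equal to a cut point
# falls into the higher category, matching the "<=" sides of the BMI ranges.
BMI_CUTS = [20, 25, 30, 40]
BMI_LABELS = ["underweight", "normal", "overweight", "obese", "morbidly obese"]

CHOLESTEROL_CUTS = [200, 240]
CHOLESTEROL_LABELS = ["desirable", "bordering", "high"]


def discretize(value: float, cuts: Sequence[float], labels: Sequence[str]) -> str:
    """Map a continuous value to the label of the interval that contains it."""
    return labels[bisect_right(cuts, value)]


def discretize_bmi(bmi: float) -> str:
    return discretize(bmi, BMI_CUTS, BMI_LABELS)


def discretize_cholesterol(cholesterol: float) -> str:
    return discretize(cholesterol, CHOLESTEROL_CUTS, CHOLESTEROL_LABELS)


for bmi in (19.5, 20, 27.3, 41):
    print(bmi, "->", discretize_bmi(bmi))
for chol in (180, 200, 239, 240):
    print(chol, "->", discretize_cholesterol(chol))
```

Keeping the cut points in plain lists makes it easy to add the remaining fields once their ranges are known.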

Slide 30: Results
We developed two programs in C++ (g++ compiler):
- MULTMINE: used to mine strong multidimensional association rules.
- EXCEPMINE: used to mine exceptions.
We used the following thresholds in the experiments:
- Minimum support = 1% (MULTMINE).
- Minimum IM = 0.30 and minimum DU = 0.05 (EXCEPMINE).
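The thresholding step for exceptions can be pictured with a small Python sketch. This is not the EXCEPMINE code, which was written in C++ and is not shown in the talk; the class, the function, and the inclusive comparison (>=) are assumptions. The first two candidates use the IM and DU values reported in the results of the talk, and the third is a hypothetical candidate added to show a rejection.

```python
from dataclasses import dataclass
from typing import List

MIN_IM = 0.30  # minimum IM used in the experiments
MIN_DU = 0.05  # minimum DU used in the experiments


@dataclass
class ScoredException:
    label: str
    im: float
    du: float


def keep_interesting(
    candidates: List[ScoredException],
    min_im: float = MIN_IM,
    min_du: float = MIN_DU,
) -> List[ScoredException]:
    """Keep candidates that reach both the IM and the DU threshold."""
    return [c for c in candidates if c.im >= min_im and c.du >= min_du]


candidates = [
    ScoredException("apprentice school & great activity -> 15-20 cig/day", 0.4755, 0.2069),
    ScoredException("university & Group C -> normal BMI", 0.7018, 0.3052),
    ScoredException("hypothetical weak candidate", 0.35, 0.01),
]

kept = keep_interesting(candidates)
for c in kept:
    print(f"{c.label}: IM = {c.im:.4f}, DU = {c.du:.4f}")
```

The hypothetical third candidate passes the IM threshold but is dropped because its DU is below 0.05, which is the case the DU index is meant to catch.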

Slide 31: Group A - Entry ALL
Table columns for both rules: Sup A, Sup B, Sup, Conf, Lift, RI (numeric cells missing).
Rule: (Group = “A”) → (Education = “university”)
- Group A is the only one where men with a university degree are in the majority (Conf = [value missing]).
Rule: (Group = “A”) → (PhysActAfterJob = “great activity”)
- There is a strong positive dependence between belonging to Group A and practicing physical activities intensely in free time (Lift = 1.692).

Slide 32: Alcohol Consumption x Smoking
Rule: (DailyBeerCons = “>1l”) → (SmokingDuration = “>20 years”)
Table columns: Group (A, B, C), Sup A, Sup B, Sup, Conf, Lift, RI (numeric cells missing).
- Drinking a lot and smoking for more than 20 years are positively dependent in groups A, B, and C (Lift and RI columns).
- However, there are much fewer smokers in Group A (Sup B column). In groups B and C, most of the heavy beer consumers smoked cigarettes for more than 20 years (Conf column).
- Men from group B tend to smoke and drink more (Sup A, Sup B and Sup columns).

Slide 33: Alcohol Consumption x Cholesterol
Rule: (Alcohol = “No”) → (Cholesterol = “desirable”)
Table columns: Group (A, B, C), Sup A, Sup B, Sup, Conf, Lift, RI (numeric cells missing).
- Not drinking alcohol and having cholesterol in the desirable range are positively dependent in groups A, B, and C (Lift and RI columns).
- There are fewer alcohol consumers in Group C (Sup A column).
- In group A, most of the men who do not drink alcohol have cholesterol in the desirable range (Conf column).

Slide 34: Education x Smoking
Rule: (Education = “university”) → (Smoking = “no”)
Table columns: Group (A, B, C), Sup A, Sup B, Sup, Conf, Lift, RI (numeric cells missing).
- People with the highest education degree are less likely to be smokers (Lift and RI columns).
- In groups A and C, the majority of men with a university degree do not smoke (Conf column). The support of this rule is very high in group A.
- In group B, most men with a university degree are smokers (Conf column). However, not smoking and having a university degree are still strongly positively dependent (Lift and RI columns).

Slide 35: Skin Folds x Body Mass Index
Rule: (Skin Folds = “≤20”) → (BMI = “normal”)
Table columns: Group (A, B, C), Sup A, Sup B, Sup, Conf, Lift, RI (numeric cells missing).
- Most of the men who have the body mass index in the normal range were classified into the lowest range of the attribute Skin Folds (Conf column).
- Both attributes are highly positively dependent (Lift and RI columns).
- There are much fewer people who have normal BMI in Group C (Sup B column).

Slide 36: Exceptions
(Education = “apprentice school”) & (PhysActAfterJob = “great act.”) → (Smoking = “15-20 cig/day”)
IM = 0.4755, DU = 0.2069
- Original rule: “people whose education degree is apprentice school tend to smoke a lot”.
- Exception: among the men who practice physical activities intensely in free time, the support value of the original rule is 47.55% smaller than what is expected.
- The degree of unexpectedness is equal to 20.69%.

Slide 37: Exceptions
(Education = “university”) & (Group = “C”) → (BMI = “normal”)
IM = 0.7018, DU = 0.3052
- Original rule: “people with the highest education degree tend to have the body mass index in the normal range”.
- Exception: among the men who belong to Group C, the support value of the original rule is 70.18% smaller than what is expected.
- The degree of unexpectedness is equal to 30.52%.

Slide 39: Summary
- We presented some strong association rules and exceptions mined from the STULONG Data Set, concerning the entry examinations.
- Strong association rules were used to evaluate how the correlations among patient characteristics differ across the three basic groups.
- Exceptions indicated negative patterns associated with previously known strong positive rules. These exceptions were mined from candidates that do not achieve an expected support value.

Slide 40: Future Work
- Apply the same approach to the relations Letter, Control and Death.
- Besides mining rules with a large deviation between the actual and the expected support, we intend to investigate the interestingness of rules with a large deviation between the actual and the expected confidence value.

Slide 41: Universidade Federal Fluminense, Niterói, Rio de Janeiro, Brazil. Thank you!