Finding Fuzzy Approximate Dependencies within STULONG Data
Discovery Challenge, ECML/PKDD 2003, September 22-27, 2003
Berzal F., Cubero J.C., Sanchez D., Serrano J.M., Vila M.A.
University of Granada (Spain)

Introduction
KDD allows us to discover relations within data that are:
  Non-trivial.
  Previously unknown.
  Potentially useful.
Fuzzy data: extensions of KDD tools and techniques.

Problem representation
Fuzzy relational database.
Values $a_{ij}$: numeric, scalar (nominal), or linguistic labels.
Membership degrees.
Fuzzy similarity relations, $S_{A_1}, \ldots, S_{A_m}$.
  t#      $A_1$                       $A_2$                       ...  $A_m$
  $t_1$   $a_{11}, \mu_{t_1}(A_1)$    $a_{12}, \mu_{t_1}(A_2)$    ...  $a_{1m}, \mu_{t_1}(A_m)$
  $t_2$   $a_{21}, \mu_{t_2}(A_1)$    $a_{22}, \mu_{t_2}(A_2)$    ...  $a_{2m}, \mu_{t_2}(A_m)$
  $t_3$   $a_{31}, \mu_{t_3}(A_1)$    $a_{32}, \mu_{t_3}(A_2)$    ...  $a_{3m}, \mu_{t_3}(A_m)$
  ...     ...                         ...                         ...  ...
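As a rough illustration of this representation, each cell can be modelled as a value paired with a membership degree. This is only a sketch: the class name `FuzzyCell`, the dictionary layout, and the sample values are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class FuzzyCell:
    """One cell of a fuzzy relation: a value plus its membership degree."""
    value: Union[float, str]   # numeric, scalar (nominal) or linguistic label
    degree: float = 1.0        # membership degree in [0, 1]; 1.0 means crisp

    def __post_init__(self):
        if not 0.0 <= self.degree <= 1.0:
            raise ValueError("membership degree must lie in [0, 1]")


# A tuple t maps attribute names to cells (sample values are made up).
t1 = {
    "BMI": FuzzyCell("overweight", 0.7),
    "DOPRAVA": FuzzyCell("by bike"),
}
```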

Fuzzy Approximate Dependencies
We define Fuzzy Approximate Dependencies by relaxing some properties of Functional Dependencies,
  $V \to W \iff \forall t, s:\ t[V] = s[V] \Rightarrow t[W] = s[W]$
Equality relaxation: considering linguistic labels and membership degrees.
Universal quantifier relaxation: allowing exceptions.
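The two relaxations can be illustrated with a small sketch. It assumes one plausible reading of the definition, not necessarily the authors' exact formulation: equality on an attribute is replaced by a similarity degree in [0, 1], degrees are combined with `min`, and exceptions are tolerated by averaging over all pairs of tuples instead of requiring the implication for every pair. The function name `fad_support_confidence` and its parameters are hypothetical.

```python
from itertools import combinations


def fad_support_confidence(rows, lhs, rhs, similarity):
    """Estimate support and confidence of lhs -> rhs over pairs of tuples.

    rows:       list of dicts mapping attribute name -> value
    lhs, rhs:   lists of attribute names (V and W)
    similarity: dict mapping attribute name -> function(a, b) -> degree in [0, 1]
    """
    sum_lhs = 0.0    # accumulated degree of "t[V] resembles s[V]"
    sum_both = 0.0   # accumulated degree of agreement on both V and W
    n_pairs = 0
    for t, s in combinations(rows, 2):
        eq_lhs = min(similarity[a](t[a], s[a]) for a in lhs)
        eq_rhs = min(similarity[a](t[a], s[a]) for a in rhs)
        sum_lhs += eq_lhs
        sum_both += min(eq_lhs, eq_rhs)
        n_pairs += 1
    supp = sum_both / n_pairs if n_pairs else 0.0
    conf = sum_both / sum_lhs if sum_lhs else 0.0
    return supp, conf


def crisp(a, b):
    """Crisp equality as a special case of a similarity relation."""
    return 1.0 if a == b else 0.0


rows = [
    {"V": 1, "W": "a"},
    {"V": 1, "W": "a"},
    {"V": 2, "W": "b"},
]
supp, conf = fad_support_confidence(rows, ["V"], ["W"], {"V": crisp, "W": crisp})
```

With crisp equality the sketch reduces to an ordinary approximate dependency; here the only pair that agrees on V also agrees on W, so the confidence is 1.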

FAD Measures
Relevance degree:
  Support, $\mathrm{supp}(V \to W)$
Fulfilment degrees:
  Confidence, $\mathrm{conf}(V \to W)$
  Certainty factor, $\mathrm{CF}(V \to W)$ [Shortliffe and Buchanan, 1975]
    Measures the variation in the degree of belief.
    $\mathrm{CF}(V \to W) = 1$: maximum increment (perfect positive).
    $\mathrm{CF}(V \to W) = -1$: maximum decrement.
    $\mathrm{CF}(V \to W) = 0$: statistical independence.
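The three boundary cases of the certainty factor follow from a formulation commonly used for rules, which compares the confidence of the dependency with the support of its consequent. The sketch below assumes that formulation; the slides do not spell the formula out, and the function and parameter names are hypothetical.

```python
def certainty_factor(conf: float, supp_consequent: float) -> float:
    """Certainty factor from confidence and consequent support.

    Positive when knowing the antecedent raises belief in the consequent,
    negative when it lowers it, and 0 when it leaves it unchanged.
    """
    if conf > supp_consequent:
        # Increment, scaled by the largest possible increment.
        return (conf - supp_consequent) / (1.0 - supp_consequent)
    if conf < supp_consequent:
        # Decrement, scaled by the largest possible decrement.
        return (conf - supp_consequent) / supp_consequent
    return 0.0
```

For example, a confidence of 0.75 against a consequent support of 0.5 covers half of the possible increment, giving a certainty factor of 0.5.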

Applications
Fuzzy Databases.
  Approximate Dependencies Discovery.
  Functional Dependencies Discovery.
Other applications:
  Low granularity data.
  Overlapping semantics.

STULONG Database
Entry Table.
  Normal Group (attribute KONSKUP having values 1 or 2).
  Risk Group (attribute KONSKUP having values 3 or 4).
  Pathologic Group (attribute KONSKUP having value 5).
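The grouping above amounts to a simple mapping from KONSKUP codes to group names. A minimal sketch follows; the function name `study_group` and the handling of any other code as `None` are assumptions.

```python
from typing import Optional


def study_group(konskup: int) -> Optional[str]:
    """Map a KONSKUP code to the group named on the slide."""
    if konskup in (1, 2):
        return "normal"
    if konskup in (3, 4):
        return "risk"
    if konskup == 5:
        return "pathologic"
    return None  # assumption: any other code falls outside the three groups
```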

Data Preprocessing (I)
Problem: semantic overlapping in symbolic or scalar attributes.
Fuzzy similarity relations (subjective).
E.g.: DOPRAVA (means of transport for getting to work).
[Table: similarity matrix over the values on foot, by bike, public means, car, not stated; legible entries: 0.4, 0.0]
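A fuzzy similarity relation over a nominal attribute can be stored as a symmetric lookup table. The sketch below assumes reflexivity (every value is fully similar to itself) and symmetry; the degrees are illustrative placeholders, not the values used in the study, and the names are hypothetical.

```python
# Illustrative degrees only; unlisted pairs default to 0.0.
DOPRAVA_SIMILARITY = {
    frozenset({"on foot", "by bike"}): 0.5,
    frozenset({"public means", "car"}): 0.25,
}


def similarity(a: str, b: str, table=DOPRAVA_SIMILARITY) -> float:
    """Degree to which two nominal values are considered alike."""
    if a == b:
        return 1.0  # reflexive: identical values are fully similar
    # frozenset keys make the lookup independent of argument order.
    return table.get(frozenset({a, b}), 0.0)
```

Keying the table on unordered pairs stores each degree once and guarantees that `similarity(a, b)` equals `similarity(b, a)`.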

Data Preprocessing (II)
Problem: high granularity in numeric attributes.
Definition of sets of linguistic labels starting from intervals.
E.g.: BMI (Body mass index).
[Figure: linguistic labels defined over the numeric value axis; legible labels: thin, overweight]
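One common way to derive linguistic labels from intervals is to give each label a trapezoidal membership function whose slopes overlap with the neighbouring labels. The sketch below assumes that shape; the breakpoints and the intermediate label "normal" are hypothetical and are not taken from the slides.

```python
INF = float("inf")


def trapezoid(x, a, b, c, d):
    """Membership that is 1 on [b, c], 0 outside (a, d), linear in between."""
    if b <= x <= c:
        return 1.0
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)   # rising edge
    return (d - x) / (d - c)       # falling edge


# Hypothetical breakpoints; infinite bounds give open-ended outer labels.
BMI_LABELS = {
    "thin": (-INF, -INF, 18.0, 20.0),
    "normal": (18.0, 20.0, 24.0, 26.0),
    "overweight": (24.0, 26.0, INF, INF),
}


def fuzzify_bmi(x):
    """Return the membership degree of a BMI value in every label."""
    return {label: trapezoid(x, *params) for label, params in BMI_LABELS.items()}
```

Under these assumed breakpoints a BMI of 25 belongs to "normal" and "overweight" with degree 0.5 each, which is how a fine-grained numeric attribute is reduced to a few overlapping labels.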

Analytical Questions (I)
Dependencies between social factors and physical activity.
Columns: ROKVSTUP, STAV, VZDELANI, ZODPOV
  TELAKTZA  0.67/ / /0.28
  AKTPOZAM  0.14/ / / /0.47
  DOPRAVA   0.20/ / / /0.32
  DOPRATRV  0.17/ / / /0.44

Analytical Questions (II)
Dependencies between social factors and smoking.
Columns: ROKVSTUP, STAV, VZDELANI, ZODPOV
  KOURENI   0.68/0.07
  DOBAKOUR  0.64/ /0.25
  BYVKURAK  0.10/ / / /0.64

Analytical Questions (III)
Dependencies between social factors and alcohol consumption.
Columns: ROKVSTUP, STAV, VZDELANI, ZODPOV
  ALKOHOL  0.21/ / / /0.31
  PIVO10   0.16/ / / /0.41
  PIVO12   0.10/ / / /0.61
  VINO     0.16/ / / /0.41
  LIHOV    0.16/ / / /0.41
  PIVOMN   0.21/ / / /0.29
  VINOMN   0.20/ / / /0.31
  LIHMN    0.20/ / / /0.29

Analytical Questions (IV)
Dependencies between social factors and physical features.
Columns: ROKVSTUP, STAV, VZDELANI, ZODPOV
  BMI     0.16/ / / /0.42
  SYST1   0.65/ /0.26
  DIAST1  0.19/ / / /0.30
  SYST2   0.65/ /0.25
  DIAST2  0.19/ / / /0.30

Analytical Questions (V)
Dependencies between physical activity and smoking.
Columns: TELAKTZA, AKTPOZAM, DOPRAVA, DOPRATRV
  KOURENI   0.50/ /0.13
  DOBAKOUR  0.27/ / / /0.19
  BYVKURAK  0.13/ / / /0.55

Analytical Questions (VI)
Dependencies between physical activity and alcohol consumption.
Columns: TELAKTZA, AKTPOZAM, DOPRAVA, DOPRATRV
  ALKOHOL  0.27/ / / /0.25
  PIVO10   0.22/ / / /0.33
  PIVO12   0.14/ / / /0.50
  VINO     0.22/ / / /0.33
  LIHOV    0.22/ / / /0.33
  PIVOMN   0.27/ / / /0.24
  VINOMN   0.27/ / / /0.24
  LIHMN    0.27/ / / /0.23

Analytical Questions (VII)
Dependencies between physical activity and physical features.
Columns: TELAKTZA, AKTPOZAM, DOPRAVA, DOPRATRV
  BMI     0.21/ / / /0.34
  SYST1   0.27/ / / /0.21
  DIAST1  0.25/ / / /0.23
  SYST2   0.27/ / / /0.20
  DIAST2  0.25/ / / /0.24

Analytical Questions (VIII)
Dependencies between physical activity and cholesterol degrees.
Columns: TELAKTZA, AKTPOZAM, DOPRAVA, DOPRATRV
  CHLST  0.28/ / / /0.19
  TRIGL  0.49/ /0.14

Analytical Questions (IX)
Dependencies between alcohol consumption and physical features.
Columns: BMI, SYST1, DIAST1, SYST2, DIAST2
  ALKOHOL  0.40/ / / / /0.29
  PIVO10   0.35/ / / / /0.38
  PIVO12   0.25/ / / / /0.58
  VINO     0.35/ / / / /0.38
  LIHOV    0.35/ / / / /0.38
  PIVOMN   0.41/ / / / /0.27
  VINOMN   0.40/ / / / /0.28
  LIHMN    0.41/ / / / /0.27

Analytical Questions (X)
Dependencies between alcohol consumption and smoking.
Columns: KOURENI, DOBAKOUR, BYVKURAK
  ALKOHOL  0.23/ /0.15
  PIVO10   0.13/ / /0.22
  PIVO12   0.08/ / /0.40
  VINO     0.13/ / /0.22
  LIHOV    0.13/ / /0.22
  PIVOMN   0.23/ /0.14
  VINOMN   0.23/ /0.15
  LIHMN    0.24/ /0.14

Analytical Questions (XI)
Dependencies between skin folds and BMI:
  $[\mathrm{TRIC}] \to [\mathrm{BMI}]$, supp 15.85%, CF 0.54
  $[\mathrm{SUBSC}] \to [\mathrm{BMI}]$, supp 17.28%, CF 0.58

Concluding Remarks
FADs allow us to discover relations within imprecise or uncertain data.
Expert help is desirable for:
  Data preprocessing.
  Results interpretation.