SDS-Rules and Classification Tomáš Karban ECML/PKDD 2003 – Dubrovnik (Cavtat) September 22, 2003.



2 Motivation for SDS-rules
STULONG project: middle-aged men studied with respect to atherosclerosis (heart disease) risk factors.
Studying differences between the normal group and the risk group of patients.
Comparing groups of patients with different physical, social, family and biochemical backgrounds.
Searching for pairs of sets that differ markedly in the selected property.

3 SDS-Rules
SDS-rules can be understood as an extension of association rules.
SDS-rules have the form ≈(α, β, γ).
α and β define two disjoint sets A and B.
γ defines some property.
The symbol ≈ stands for the SDS-quantifier, which defines the relation of the two sets in the property γ.

4 Extend the Four-Fold Table
The table of frequencies is extended to a six-fold table; the third row counts objects outside the sets A and B.

            γ    ¬γ
α           a    b
β           c    d
¬α ∧ ¬β     e    f
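
A minimal sketch of how such a six-fold table could be counted from patient records. The function name `six_fold_table`, the record layout and the toy records are assumptions made for illustration; they are not taken from the STULONG data or from the original tool.

```python
from typing import Callable, Dict, Iterable, Tuple

Record = Dict[str, str]
Predicate = Callable[[Record], bool]


def six_fold_table(
    records: Iterable[Record],
    in_a: Predicate,
    in_b: Predicate,
    has_property: Predicate,
) -> Tuple[int, int, int, int, int, int]:
    """Return the frequencies (a, b, c, d, e, f).

    a, b: objects in set A with / without the property
    c, d: objects in set B with / without the property
    e, f: objects outside both sets with / without the property
    """
    counts = [0] * 6
    for record in records:
        member_a, member_b = in_a(record), in_b(record)
        if member_a and member_b:
            # The sets are required to be disjoint.
            raise ValueError("record belongs to both A and B")
        row = 0 if member_a else 1 if member_b else 2
        col = 0 if has_property(record) else 1
        counts[2 * row + col] += 1
    return tuple(counts)


# Toy records, invented purely for illustration.
patients = [
    {"smoking": "no", "cvd": "yes"},
    {"smoking": "no", "cvd": "no"},
    {"smoking": "5-10", "cvd": "yes"},
    {"smoking": "5-10", "cvd": "yes"},
    {"smoking": "20+", "cvd": "no"},
]

table = six_fold_table(
    patients,
    in_a=lambda r: r["smoking"] == "no",
    in_b=lambda r: r["smoking"] == "5-10",
    has_property=lambda r: r["cvd"] == "yes",
)
print(table)
```

Raising an error on overlapping membership is a design choice of this sketch: it makes a violated disjointness assumption visible instead of silently counting the object in the first set.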

5 SDS-Quantifiers
Symmetric Additive Difference: the distribution of the property differs in absolute value between set A and set B by more than p, and both sets have reasonable size.
With this quantifier we are able to find those pairs of sets that differ significantly in the property.
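
The quantifier's formula did not survive in the transcript, so the following is only one plausible reading of the verbal definition. It assumes that a and b count the objects of A with and without the property, that c and d do the same for B, and that "reasonable size" means a minimum number of objects per set. The parameter name `min_size` is hypothetical.

```python
def symmetric_additive_difference(
    a: int, b: int, c: int, d: int, p: float, min_size: int
) -> bool:
    """Check one assumed reading of the Symmetric Additive Difference quantifier.

    True when both sets hold at least `min_size` objects and the relative
    frequency of the property differs between A and B by more than `p`.
    """
    size_a = a + b
    size_b = c + d
    if size_a == 0 or size_b == 0:
        return False  # relative frequency is undefined for an empty set
    if size_a < min_size or size_b < min_size:
        return False  # at least one set is too small to be "reasonable"
    return abs(a / size_a - c / size_b) > p


# Illustrative counts only: 30% of A versus 66% of B have the property.
print(symmetric_additive_difference(30, 70, 66, 34, p=0.3, min_size=50))
```

Because the absolute value is taken, the check is symmetric: swapping the roles of A and B gives the same answer.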

6 Definition of Disjoint Sets
Syntactical rule for α and β, the conditions that define sets A and B: one common attribute is forced.
Example 1: smoking(no) & beer(half liter a day) versus smoking(5-10 cigarettes) & coffee(2 cups a day)
Example 2: smoking(no) & beer(half liter a day) versus coffee(2 cups a day) & BMI(>25)
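
One possible reading of this rule is that two conjunctions are guaranteed to describe disjoint sets when they share an attribute but require different values for it. The sketch below checks that condition. It assumes each literal selects a single category and that the categories of one attribute are mutually exclusive; the dictionary representation and the function name are illustrative only.

```python
from typing import Dict

Conjunction = Dict[str, str]  # attribute name -> required category


def syntactically_disjoint(first: Conjunction, second: Conjunction) -> bool:
    """True if some shared attribute is required to take different values."""
    common = first.keys() & second.keys()
    return any(first[attr] != second[attr] for attr in common)


example_1 = (
    {"smoking": "no", "beer": "half liter a day"},
    {"smoking": "5-10 cigarettes", "coffee": "2 cups a day"},
)
example_2 = (
    {"smoking": "no", "beer": "half liter a day"},
    {"coffee": "2 cups a day", "BMI": ">25"},
)

print(syntactically_disjoint(*example_1))
print(syntactically_disjoint(*example_2))
```

Under this assumed reading, Example 1 passes because the two conjunctions disagree on smoking, whereas Example 2 shares no attribute, so nothing in its syntax rules out a patient who satisfies both conjunctions.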

7 Analytical Questions
Are there any strong relations between the entry examination and the cause of death?
Are there differences in the entry examination between men of the risk group who came down with an observed cardiovascular disease (during control examinations) and those who stayed healthy?

8 SDS Results (1)
If we compare the group of patients who are divorced, have reached apprentice school education and have other responsibility in their jobs with the second group of patients, who are already pensioners, there is a 53.8% difference in the presence of other cause of death.

9 SDS Results (2)
If we compare the group of patients who came down with some cardiovascular disease during the control checks with those who stayed healthy, we see that in the second group there were 3.97% more patients working in a managerial position.

10 SDS Results (3)
Comparing the group of patients who do not drink beer and have a BMI equal to or greater than 27 against the group of patients who drink more than 1 liter of beer a day and have a cholesterol level between 200 and 250 mg, we can see that 36.0% more patients came down with some cardiovascular disease in the first group.

11 Conclusion for SDS-rules
There are hundreds or even thousands of SDS-rules in every presented task.
SDS-rules of one task are often very similar.
How important is a particular attribute in a cedent conjunction?
"SDS-rule neighborhood browsing": semi-automatically generalize or refine acquired knowledge.
Attributes were divided into logical groups; inter-group relations were not studied. Consult an expert if some important problem is not covered.
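
The slides do not say how neighborhood browsing would work. As a hedged illustration of one step it could involve, the sketch below lists the generalizations of a cedent obtained by dropping a single literal. Re-evaluating the rule on each reduced cedent would indicate how much the dropped attribute matters. The function name and the dictionary representation are assumptions.

```python
from typing import Dict, Iterator, Tuple

Cedent = Dict[str, str]  # attribute name -> required category


def one_literal_generalizations(cedent: Cedent) -> Iterator[Tuple[str, Cedent]]:
    """Yield (dropped_attribute, reduced_cedent) for every literal in turn.

    A cedent with a single literal has no non-empty generalization,
    so nothing is yielded for it.
    """
    for dropped in cedent:
        reduced = {attr: val for attr, val in cedent.items() if attr != dropped}
        if reduced:
            yield dropped, reduced


# Cedent of the first group in SDS Results (3), written as a dictionary.
first_group = {"beer": "no", "BMI": ">=27"}
for dropped, reduced in one_literal_generalizations(first_group):
    print(f"without {dropped}: {reduced}")
```

Refinement would be the opposite move, adding one literal at a time; it is omitted here because it needs the list of available attributes and categories.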

12 Classification
Estimate the cause of death based on the attributes from the entry examination.
Estimate whether the patient stayed healthy during control examinations, based on the attributes from the entry examination.
Weka was used: J48 decision tree and rules, neural net, Bayes classifier, plus stacking.

13 Classification Results (1)
Poor results: all models tend to assign every patient to the biggest class of cause of death, tumorous disease (29.92%).
There is insufficient information to successfully estimate the cause of death.

14 Classification Results (2)
Estimating whether a patient stayed healthy during control examinations failed in the same way: the success rate was comparable to the size of the biggest class (those who stayed healthy), approx. 66%.
There is insufficient information to successfully estimate whether the patient stayed healthy.
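
To make the comparison with the biggest class concrete, here is a small sketch of a majority-class baseline, the kind of predictor that always answers with the most frequent label. The label list is toy data chosen only to mirror the approximate 66% share mentioned above; it is not the STULONG data.

```python
from collections import Counter
from typing import Sequence, Tuple


def majority_baseline(labels: Sequence[str]) -> Tuple[str, float]:
    """Return the most frequent label and the accuracy of always predicting it."""
    if not labels:
        raise ValueError("labels must not be empty")
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Toy labels only: 66 patients who stayed healthy, 34 who did not.
toy_labels = ["healthy"] * 66 + ["diseased"] * 34
label, accuracy = majority_baseline(toy_labels)
print(label, accuracy)
```

A trained model whose accuracy stays close to this baseline has learned little beyond the class distribution, which is the situation the slide describes.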

15 References
Hájek, Havránek: Mechanizing Hypothesis Formation – Mathematical Foundations for a General Theory (Springer-Verlag, 1978)
Rauch, Šimůnek: Alternative Approach to Mining Association Rules (in proceedings of the workshop ICDM02, Japan, 2002)
The STULONG project:
