Trend Analysis in Stulong Data The Gerstner laboratory for intelligent decision making and control Jiří Kléma, Lenka Nováková, Filip Karel, Olga Štěpánková.


Trend Analysis in STULONG Data
Jiří Kléma, Lenka Nováková, Filip Karel, Olga Štěpánková
The Gerstner Laboratory for Intelligent Decision Making and Control, Department of Cybernetics, Czech Technical University, Prague
PKDD 2004, Discovery Challenge

Outline
Previous CTU entry
– subgroup discovery (ENTRY), general CVD model
– trend analysis: global approach vs. windowing
Role of windowing in mining trends
– KM, Cox models in medicine
– (symbolic) temporal trends in data mining
Development of the windowing approach
– temporal CVD definition
– role of the window length
– multi-feature interactions
Ordinal association rules
– processing of the windowed features

STULONG Data
Four tables: Entry, Control, Letter, Death
Dependent variable: (static) CVD
– CardioVascular Diseases
– Boolean attribute derived from the A 2 questionnaire (Control table)
– CVD = false: the patient has no coronary disease
– CVD = true: at least one of the attributes Hodn1, Hodn2, Hodn3, Hodn11, Hodn13, Hodn14 is true
– diagnoses covered: positive angina pectoris, (silent) myocardial infarction, cerebrovascular accident, ischemic heart disease
Patients who have only diabetes (Hodn4) or cancer (Hodn15) are removed.
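The labelling rule above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the record layout (a dict of Boolean Hodn flags) and the function name `derive_cvd` are assumptions, and "diabetes or cancer only" is read as "no CVD attribute is set, but Hodn4 or Hodn15 is".

```python
# Attribute names come from the slide; the dict-based record layout is hypothetical.
CVD_FLAGS = ("Hodn1", "Hodn2", "Hodn3", "Hodn11", "Hodn13", "Hodn14")
EXCLUDE_FLAGS = ("Hodn4", "Hodn15")  # diabetes, cancer


def derive_cvd(record):
    """Return True/False for the static CVD label, or None if the patient is removed."""
    if any(record.get(flag, False) for flag in CVD_FLAGS):
        return True
    if any(record.get(flag, False) for flag in EXCLUDE_FLAGS):
        # Assumed reading: diabetes or cancer without any CVD attribute -> removed.
        return None
    return False


patients = {
    "ICO_1": {"Hodn2": True},
    "ICO_2": {"Hodn4": True},
    "ICO_3": {},
}
labels = {ico: derive_cvd(rec) for ico, rec in patients.items()}
kept = {ico: label for ico, label in labels.items() if label is not None}
print(kept)
```

Returning None for removed patients keeps the exclusion step explicit, so it can be counted and reported separately from the two CVD classes.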

ENTRY - subgroup discovery
AQ no.6: Are there any differences in the ENTRY examination for different CVD groups?
Statistica 6.0
– module for interactive decision tree induction
– two-tailed t-test or chi-square test to assess significance of subgroups
Dependencies are relatively weak
Interesting dependencies found
– social characteristics: derived attribute AGE_of_ENTRY
– alcohol: “positive effect” of beer, no effect of wine
– sugar consumption increases CVD risk
– well-known dependencies are not mentioned (smoking, BMI, cholesterol)

ENTRY - general model
General CVD model (in WEKA)
– feature selection + modeling (e.g., decision trees)
– tends to generate trivial models (always predicting false)
– asymmetric error-cost matrix does not help
Predict CVD risk
– identify principal variables (chi-squared test)
– Naïve Bayes + ROC evaluation
– three independent variables: discretized AGE_of_ENTRY, discretized BMI, Cholrisk (derived from CHLST)
– AUC = 0.66

CONTROL - trend analysis
AQ no.7: Are there any differences in development of risk factors for different CVD groups?
– increasing BMI contributes to CVD appearance
Diagram: ENTRY table and CONTR table, linked by ICO (primary key). Fields shown: year of birth, year of entry, smoking, alcohol, cholesterol, body mass index, blood pressure. Risk factors are followed during 20 years.

Motivation
Focus on development – trend gradients
Possibilities
– contemporary statistical methods used in medicine
  KM, Cox models – analyze something other than what we want
  ANOVA etc. – features have to be developed anyway, lack of data
– complex sequential data mining
  introduction of structural patterns and then, e.g., association rules
  interesting, but again needs more data
Our approach
– introduction of simple aggregates
– application of windowing
– statistical evaluation for simple dependencies
– ordinal association rules for more complex relations

Survival curves
Kaplan-Meier or Cox method
– typical example of temporal analysis in medicine
– regards the survival period, BUT disregards the development of RFs
– typical scenario: distinguish groups of patients (ENTRY table), then follow their “survival” periods (DEATH or CONTROL table)

Derived trend attributes
– intercept, gradient, correlation coefficient, standard deviation, mean
– x: decimal time (~ year + month/12), y: observed variable
– referential time: 1975
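A minimal sketch of how such aggregates could be computed for one patient and one risk factor follows. It assumes an ordinary least-squares line fitted to (decimal time, value) pairs, with time measured from the referential year so that the intercept is the fitted value in 1975; the function names and the use of the sample standard deviation are illustrative choices, not taken from the slides.

```python
import math


def decimal_time(year, month, reference_year=1975):
    """Decimal time (year + month/12), measured from the referential year."""
    return (year + month / 12.0) - reference_year


def trend_aggregates(xs, ys):
    """Least-squares trend aggregates of one series (assumed OLS fit)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all examinations share the same time stamp")
    gradient = sxy / sxx
    return {
        "intercept": mean_y - gradient * mean_x,
        "gradient": gradient,
        # A constant series has no defined correlation; 0.0 is used as a placeholder.
        "correlation": sxy / math.sqrt(sxx * syy) if syy > 0 else 0.0,
        "std": math.sqrt(syy / (n - 1)),  # sample standard deviation of y
        "mean": mean_y,
    }


# Made-up systolic pressure readings for one hypothetical patient.
times = [decimal_time(1976, 3), decimal_time(1978, 6), decimal_time(1981, 9)]
print(trend_aggregates(times, [130.0, 135.0, 145.0]))
```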

Global Approach
Risk factors to be observed are selected
– SYST, DIAST, TRIGL, BMI, CHLSTMG
Selected control examinations are transformed
– pivoting
Patients with no control entries are removed
– about 60 patients
Trend aggregates are calculated
Resulting table, one row per patient (ICO_1, ICO_2, ...): ICO, Entry, Contr1, Contr2, ..., ContrM, Aggr1, ..., AggrN
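The pivoting step can be illustrated as follows for a single risk factor. The input layout (a dict of entry values plus a list of `(ICO, time, value)` control rows) and the function name `pivot_controls` are assumptions made for this sketch.

```python
from collections import defaultdict


def pivot_controls(entry, controls):
    """Turn long-format control rows into one series per patient.

    entry:    {ico: value measured at the entry examination}
    controls: iterable of (ico, decimal_time, value) rows
    Patients without any control row are dropped, as in the global approach.
    """
    by_patient = defaultdict(list)
    for ico, t, value in controls:
        by_patient[ico].append((t, value))
    wide = {}
    for ico, entry_value in entry.items():
        if ico not in by_patient:
            continue  # no control entries: patient removed
        ordered = [value for _, value in sorted(by_patient[ico])]
        wide[ico] = [entry_value] + ordered  # Entry, Contr1, Contr2, ...
    return wide


# Made-up BMI values for two hypothetical patients.
entry_bmi = {"ICO_1": 25.0, "ICO_2": 30.0}
control_bmi = [("ICO_1", 1980.5, 26.0), ("ICO_1", 1978.0, 25.5)]
print(pivot_controls(entry_bmi, control_bmi))
```

Each resulting row has a different length, which is one reason the aggregates computed on it depend on the number of examinations.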

Windowing Approach
Constant number of examinations for all individuals
Issues:
– window length
  time period vs. number of checkups
  how many checkups to select? 5, 8, 10 tested
– single distinct window or sliding window?
  entry is used as the first examination
  more records per patient, so records are not independent
– temporal CVD definition
  CVDi: time from the last examination to CVD
  yes/no (yes = CVD in the next year or CVD in future)
– missing values treatment

Windowing – missing values
– approach 1: shift the series
– approach 2: introduce a new value

Window length selection
– 3 different lengths tested, 5 risk factors considered
– compared with the global approach
– test of independence; null hypothesis: trends and CVD are independent
– p-values are shown
– windowing: CVD1 vs. nonCVD group
– global: CVD vs. nonCVD group
Window length effects
– the global approach is completely misleading
– shorter windows preferred: down-up effect
– longer windows preferred: only long-term changes may have an effect

ControlCount vs. CVD
ControlCount
– number of examinations
– strong relation with CVD
– AUC = 0.35
– higher ControlCount goes with lower CVD risk
– anachronistic attribute
– introduced by the design of the study
ControlCount influences the trend aggregates (e.g., how steep the gradients tend to be)
Conclusion: the global approach cannot be applied (at least with the selected aggregates)
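An AUC below 0.5 for a single attribute means that higher values of the attribute tend to go with the negative class. A minimal sketch of the rank-based (Mann-Whitney) AUC, using made-up examination counts rather than STULONG values, illustrates the reading; the function name is hypothetical.

```python
def single_attribute_auc(positive_scores, negative_scores):
    """Probability that a random positive case scores higher than a random
    negative case, counting ties as one half (rank-based AUC)."""
    if not positive_scores or not negative_scores:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Made-up numbers of examinations: CVD patients vs. nonCVD patients.
cvd_counts = [3, 5, 4]
non_cvd_counts = [4, 8, 9, 6]
print(single_attribute_auc(cvd_counts, non_cvd_counts))  # 0.125
```

Here the CVD group has fewer examinations, so the attribute "predicts" CVD in the wrong direction and the AUC falls well below 0.5.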

Influence of SYSTGrad (W5)
– 122 individual CVD1 observations in total
– SYSTGrad (W5) equi-depth binned into 5 groups
– representation of the CVD1 group increases significantly with increasing SYSTGrad group number
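Equi-depth binning assigns roughly the same number of observations to every group. The sketch below shows one way to do it and to read off the share of CVD1 observations per group; the function names and the tie handling (ties broken by input order) are assumptions, and the data are made up.

```python
def equi_depth_bins(values, n_bins=5):
    """Assign each value a bin index 0..n_bins-1 so that bins hold
    (almost) equal numbers of observations."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(values)
    return bins


def positive_share_per_bin(bins, labels, n_bins=5):
    """Fraction of positive (CVD1) observations in each bin; None for empty bins."""
    totals = [0] * n_bins
    positives = [0] * n_bins
    for b, label in zip(bins, labels):
        totals[b] += 1
        positives[b] += 1 if label else 0
    return [p / t if t else None for p, t in zip(positives, totals)]


# Made-up gradients and CVD1 labels for ten hypothetical windows.
gradients = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
cvd1 = [False, False, False, False, False, True, False, True, True, True]
groups = equi_depth_bins(gradients)
print(groups)
print(positive_share_per_bin(groups, cvd1))
```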

Averaged blood pressure
Striking difference between the CVD1 and nonCVD groups
– linear vs. down-up development
– can also be observed for the individuals (see the next slide)
– cannot be distinguished by longer windows

Averaged body mass index
Difference between the CVD1 and nonCVD groups
– steady BMI in the nonCVD group
– increasing BMI in the CVD1 group
– longer windows express this trend better
– this graph shows that W10 may benefit from the increase between examinations 9 and 8

Trend factors – hypothesis testing
Influence of trend aggregates on CVD
– 9 gradients considered: SYST, DIAST, CHLSTMG, TRIGLMG, BMI, HDL, LDL, POCCIG and MOC
Identified relations
– statement 1: decreasing HDL cholesterol level relates to the increasing risk of CVD (p=0.001)
– statement 2: decreasing POCCIG (the average number of cigarettes smoked per day) relates to the increasing risk of CVD (p=0.0001)
Again: correlation vs. causality
– statement 1 makes sense: HDL is a “good” cholesterol
– statement 2 suggests a spurious dependency
Diagram: patient state (cause) leads to both smoking habits (effect 1) and CVD onset (effect 2)

Overview of AR found
Group a – relations among trend factors
– a great prevalence of rules joining together either blood pressures (DIASTGrad and SYSTGrad) or cholesterol attributes (HDLGrad, LDLGrad and CHLSTGrad)
Group b – hypotheses to be verified by experts
– insufficient target groups: 6% of transactions makes 26 individuals, i.e., instead of 10 prospective diseased patients we actually observe 19

Conclusions
The main scope
– AQ no.7: Are there any differences in development of risk factors for different CVD groups?
Contributions
– pitfalls of the global approach revealed
– windowing that enables multivariate temporal analysis proposed, effects of various window lengths studied
– development of the following risk factors may influence future CVD occurrence: DIAST, SYST, BMI, (HDL) cholesterol, (POCCIG)
– other trends may have or intensify their influence under specific conditions (BMI trend and overweight, etc.); we lack data to prove it