Helping Healthy Hearts: Analyzing Factors that Contribute to Heart Disease
Group 7 • Shing • Gueye • Thakur

Welcome to the Group 7 data presentation. Our project focuses on factors that affect heart health.

Business Problem

A healthcare provider has a data store of patients' health variables and information. They would like to leverage data mining to find patterns within the dataset and identify which factors are associated with heart disease.

Our business problem concerns a healthcare provider that has collected patient heart-health data, including age, sex, whether the patient has experienced chest pain, and the results of several heart-related tests. So far, the provider has only examined the data ad hoc. The provider would like to start using data mining strategically, helping its doctors and patients identify the factors that contribute to heart disease. This information could then feed predictive analytics and reporting tools that estimate the likelihood of a patient being at risk of heart disease. For our project, we analyze the data the provider has collected to find patterns that point to the factors associated with heart disease.

Statistical Goals

Find evidence to support or reject our null hypothesis:
H0: heart-health variables do not affect the likelihood of developing heart disease.

Investigate the results:
- Which variables are associated with one another in ways that ultimately lead to developing heart disease?
- Are there single variables statistically significant enough that they alone indicate heart disease, or are groups of these variables more reliable?

The statistical goals are the questions we are trying to answer for this project. First, the provider would like to know whether any of the variables in the dataset are correlated with having heart disease. This is our null hypothesis: heart-health variables do not affect the likelihood of developing heart disease. If the null hypothesis is rejected, the provider will want to know which factors in particular are leading indicators of heart disease, which motivates the two questions above.
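
As one concrete illustration of how such a hypothesis might be examined for a single factor, the sketch below runs a chi-square test of independence between chest-pain type and the diagnosis. This is not the project's actual procedure; the file name and column names (Heart.csv, cp, AHD) are assumptions taken from the variable dictionary later in the presentation.

```python
# Hedged sketch: chi-square test of independence between one
# categorical factor (chest-pain type, cp) and the AHD diagnosis.
# "Heart.csv" and the column names are assumptions, not confirmed
# details of the project's setup.
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("Heart.csv").dropna()
table = pd.crosstab(df["cp"], df["AHD"])  # 4 pain types x 2 outcomes
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p-value = {p:.4g}")
# A small p-value is evidence against H0 for this one factor.
```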

Our Analysis Process

Data Acquisition: Cleveland Clinic Foundation data, via the UCI repository
Data Partition: created training and validation datasets
Data Transformation: removed records with missing values; transformed continuous variables into categorical ones
Model Analysis: started with the whole model test (logistic regression), refined with decision tree analysis, improved with a neural network
Results: found statistically significant evidence to reject H0

Here we show our data mining process. First, we gathered the data from Professor Gareth James's educational page at the University of Southern California; the original source of the data is the Cleveland Clinic Foundation. After acquiring the data, we partitioned it into training and validation sets and cleaned out any records with missing values. We also transformed certain continuous variables into categorical ones to ease our analysis. For the analysis itself, we applied three models: the first was the whole model test, i.e. logistic regression; the second used decision tree analysis for classification; and the third used an artificial neural network for refinement. The results would then either support or reject our null hypothesis.
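
A minimal sketch of the partition and transformation steps is below, assuming the data is saved locally as Heart.csv with the column names from the variable dictionary on the next slide; the 70/30 split ratio and the age bins are illustrative assumptions, since the slides do not state them.

```python
# Minimal sketch of the data-partition and transformation steps.
# "Heart.csv", the 70/30 split, and the age bins are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("Heart.csv")
df = df.dropna()  # remove records with missing values (e.g. in ca/thal)

# Example of transforming a continuous variable into a categorical one;
# the project's actual cut points are not stated in the slides.
df["age_band"] = pd.cut(df["age"], bins=[20, 40, 55, 70, 100],
                        labels=["20-40", "40-55", "55-70", "70+"])

train, valid = train_test_split(df, test_size=0.3, random_state=7,
                                stratify=df["AHD"])
print(len(train), "training rows /", len(valid), "validation rows")
```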

Data Characteristics

- Donated by the Cleveland Clinic Foundation to the UCI data repository
- 2013 heart-patient data
- 303 male and female healthcare patients
- At least 14 attributes per patient
- Numerical, categorical, and text data types

Our dataset comes from UCI's public data repository, which hosts it for educational purposes. The data was recorded for the year 2013 and covers 303 patients, both male and female. There are at least 14 attributes for each patient, spanning numerical, categorical, and text data types.

Understanding the Data

age (Numeric): age in years
sex (Categorical, nominal): sex (1 = male; 0 = female)
cp (Categorical, nominal): chest pain type (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic)
trestbps (Numeric): resting blood pressure (in mm Hg on admission to the hospital)
chol (Numeric): serum cholesterol in mg/dl
fbs (Categorical, binary): fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
restecg (Categorical, nominal): resting electrocardiographic results (0 = normal; 1 = ST-T wave abnormality; 2 = hypertrophy)
thalach (Numeric): maximum heart rate achieved
exang (Categorical, binary): exercise-induced angina (1 = yes; 0 = no)
oldpeak (Numeric): ST depression induced by exercise relative to rest
slope (Categorical, ordinal): the slope of the peak exercise ST segment (1 = upsloping; 2 = flat; 3 = downsloping)
ca (Numeric): number of major vessels (0-3) colored by fluoroscopy
thal (Categorical, nominal): 3 = normal; 6 = fixed defect; 7 = reversible defect
AHD (Categorical, binary): the predicted attribute - diagnosis of heart disease (angiographic disease status; 1 = yes; 0 = no)

Here we have the variable dictionary for the dataset; the data type and description of each variable can be found in this table. As you can see, we have standard variables such as age and sex, as well as health attributes such as the type of chest pain and the measurements of the various tests performed while the patient was being monitored for heart disease. The most important variable is the last one, AHD, the diagnosis of heart disease: a categorical variable represented as a binary value, 1 for yes and 0 for no. This is the response, or dependent, variable, while all the other variables are factors, or independent variables. Throughout our analysis, we try to find either single factors or compound factors built from multiple independent variables that influence AHD.
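
To make the coded columns easier to read during exploration, one might decode the integer codes above into labels, along the lines of this sketch (the column names and the 0/1 target coding are assumptions taken from the dictionary):

```python
# Sketch: decode the integer-coded categorical columns using the
# mappings from the variable dictionary. Column names are assumed.
import pandas as pd

df = pd.read_csv("Heart.csv")
df["sex"] = df["sex"].map({1: "male", 0: "female"})
df["cp"] = df["cp"].map({1: "typical angina", 2: "atypical angina",
                         3: "non-anginal pain", 4: "asymptomatic"})
df["AHD"] = df["AHD"].map({1: "Yes", 0: "No"})

print(df.dtypes)
print(df["AHD"].value_counts())  # class balance of the target
```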

Data Analysis: Logistic Regression Model

Purpose
- Our team started with logistic regression to find which (if any) variables could have a statistically significant relationship with our target.
- This is a simple analysis that can guide how we proceed with the rest of the data analysis.

Results
- This model showed an accuracy rate of 84.2%.
- The relationships between our target variable (AHD) and the variables Ca, Thal, cp, and Oldpeak were statistically significant for most values of alpha.

First, we ran the whole model test using logistic regression. Our main purpose was to gather information about our null hypothesis: could we show conclusively that these independent variables did in fact affect our dependent variable? The answer is yes. The model showed an accuracy rate of 84.2%. Through logistic regression, four variables showed statistically significant relationships with the dependent variable: Ca (number of major vessels colored by fluoroscopy), Thal (a heart defect classified as normal, fixed, or reversible), cp (the type of chest pain), and Oldpeak (ST depression induced by exercise relative to rest).
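
A rough scikit-learn equivalent of this step is sketched below. The project ran its whole model test in its own tool, so this only approximates the analysis; the file name, column names, and 70/30 split follow the earlier assumptions.

```python
# Approximate sketch of the logistic-regression step in scikit-learn.
# File name, column names, and the 70/30 split are assumptions.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("Heart.csv").dropna()
X = pd.get_dummies(df.drop(columns="AHD"), drop_first=True)  # one-hot encode
y = df["AHD"]

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, random_state=7, stratify=y)

logit = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("validation accuracy:", accuracy_score(y_va, logit.predict(X_va)))

# The largest-magnitude coefficients hint at the influential variables
coefs = pd.Series(logit.coef_[0], index=X.columns)
print(coefs.abs().sort_values(ascending=False).head(5))
```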

Data Analysis: Decision Tree

Purpose
- This model classifies data into groups, which is very useful for identifying factors that affect the presence of heart disease.

Results
- This model had an accuracy of 86.2% on the validation data.
- The tree on the right provides a good visual tool for classifying individual data points based on chest pain, Ca, and Thal.

Bolstered by the findings of our first model, we moved to the decision tree model. We chose it because we hoped to target very specific patient behaviors. The model classifies records into groups, which is useful for identifying factors that affect the presence of heart disease. The decision tree also performed very well, reaching an accuracy of 86.2% on the validation data with four splits. As the diagram shows, not having chest pain while having a Ca value equal to 1 can be a very strong predictor of the disease; similarly, having chest pain but a Thal value of 0 appears to be a good predictor of not having the disease.
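
Continuing the sketch above (reusing X, X_tr, X_va, y_tr, y_va), a shallow scikit-learn tree reproduces the flavor of this step; the depth limit is an assumption chosen to mirror the small number of splits described.

```python
# Sketch: a shallow decision tree; max_depth=3 is an assumption made
# to mirror the handful of splits described on the slide.
from sklearn.tree import DecisionTreeClassifier, export_text

tree = DecisionTreeClassifier(max_depth=3, random_state=7).fit(X_tr, y_tr)
print("validation accuracy:", tree.score(X_va, y_va))

# Text rendering of the splits; cp, ca, and thal should appear near the root
print(export_text(tree, feature_names=list(X.columns)))
```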

Data Analysis: Artificial Neural Network

Purpose
- We used a neural network to see whether we could further refine our analysis.

Results
- This model had an accuracy of 80% on our validation data and 86.8% on the training data.
- This model does not let us see the effects of individual variables, so those must be gathered from the previous models.

Our final model was the artificial neural network. This kind of model has gained a lot of attention for its ability to model very complex relationships, and we used it to see whether we could further refine our analysis. The model performed similarly to the others, with an accuracy of 80% on the validation data and 86.8% on the training data; it did slightly better on the training data than on the validation data. Because artificial neural networks act like black boxes, it is difficult to assess the effects of individual variables as we did for the previous models. However, neural networks can "learn" through iterations and increasing amounts of data, which makes this model very attractive for our purpose of predicting heart disease.
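
A comparable scikit-learn sketch, again reusing the split from earlier; the single hidden layer of 10 units is an assumption, since the slides do not state the network architecture.

```python
# Sketch: small multilayer perceptron. The architecture (one hidden
# layer of 10 units) is an assumption; the slides do not specify it.
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

nn = make_pipeline(StandardScaler(),
                   MLPClassifier(hidden_layer_sizes=(10,),
                                 max_iter=2000, random_state=7))
nn.fit(X_tr, y_tr)
print("training accuracy:  ", nn.score(X_tr, y_tr))
print("validation accuracy:", nn.score(X_va, y_va))
```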

Model Comparison

Logistic Regression: Accuracy 84.2%, Area Under Curve 0.913
Decision Tree:       Accuracy 86.2%, Area Under Curve 0.864
Neural Network:      Accuracy 86.8%, Area Under Curve 0.855

In comparing models, we would like to maximize accuracy and the area under the curve on the validation dataset. Logistic regression provides the best numbers for our limited dataset, but as the data continues to grow, a neural network may become more effective.

Here we compare the results across all three of our models using two metrics: the overall accuracy of each model and the area under the curve (AUC) of its ROC graph. We want to maximize accuracy and AUC on the validation dataset. While the neural network performed best in overall accuracy, the other models were not far behind, and all three performed well on the AUC metric; our choice of final model therefore comes down to the kind of analysis that best fits the business use case. Although logistic regression provided the best numbers for our limited dataset, as the data continues to grow, a neural network may become more effective.
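
The comparison step could be scripted along these lines, reusing the three fitted models from the previous sketches (logit, tree, and nn are the names introduced there):

```python
# Sketch: compare validation accuracy and ROC AUC across the three
# models fitted in the previous sketches.
from sklearn.metrics import roc_auc_score

for name, m in [("logistic regression", logit),
                ("decision tree", tree),
                ("neural network", nn)]:
    proba = m.predict_proba(X_va)[:, 1]  # probability of the positive class
    print(f"{name}: accuracy = {m.score(X_va, y_va):.3f}, "
          f"AUC = {roc_auc_score(y_va, proba):.3f}")
```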

Data Findings

- We reject our H0: there is a relationship between the independent variables and the presence of heart disease.
- All three models performed quite well. We chose the decision tree and the neural network as our final choices to accommodate larger datasets.

Restating our null hypothesis, that health-related variables do not affect the likelihood of getting heart disease: we reject it, because there is a relationship between the independent (factor) variables and the presence of heart disease. Looking back, all three models performed well, with high accuracy and AUC values. Our choice for the final model was the neural network, with assistance from the classification tree. The tree helps identify factors that work together to increase heart risk, giving doctors and patients alike access to better information. With much more data than the 303 patients in our dataset, the neural network can be honed to predict effectively at scales where logistic regression cannot.

Analytical Recommendations

- Patients with a fluoroscopy (Ca) count of at least 1 are more likely to be diagnosed with Atherosclerotic Heart Disease (AHD) than those without.
- Measurements such as Thal and chest pain (cp) were also helpful in identifying patients with AHD.
- Doctors can use this information to monitor their patients more closely with respect to these attributes.

Finally, a few interesting relationships emerged from the dataset. The fluoroscopy count and the Thal measurement seem to play key roles in predicting heart disease: patients with a fluoroscopy count of at least 1 are more likely to be diagnosed with Atherosclerotic Heart Disease than those without, and measurements such as Thal and chest pain were also indicative of a patient suffering from heart disease.