Group 7 • Shing • Gueye • Thakur

Helping Healthy Hearts: Analyzing Factors that Contribute to Heart Disease

Welcome to the Group 7 data presentation. Our project focuses on factors that affect heart health.

Business Problem
A healthcare provider has a datastore with patients’ health variables and information. They would like to leverage data mining to find patterns within the dataset to identify which factors cause heart disease.

Our business problem concerns a healthcare provider that has collected patient heart health data, including age, gender, whether the patient has experienced chest pain, and the results of various heart-related tests. So far, the provider has only looked at the data on an ad hoc basis. The provider would like to start using data mining strategically to help its doctors and patients identify which factors cause heart disease. This information could then be used to develop predictive analytics and reporting tools that estimate the likelihood that a patient is at risk of heart disease. For our project, we analyze the data the provider has collected to find patterns within the dataset that identify these factors.

Statistical Goals
Find data to support or reject our H0:
H0 = heart health variables do not affect the likelihood of getting heart disease
Investigate the results:
- Which variables are associated with one another in ways that ultimately lead to developing heart disease?
- Are there single variables that are statistically significant enough that they alone can lead to developing heart disease, or are groups of these variables more reliable?

The statistical goals are the questions we are trying to answer in this project. First, the healthcare provider would like to know whether any of the variables in the dataset are correlated with having heart disease. This gives us our null hypothesis: heart health variables do not affect the likelihood of getting heart disease. If the null hypothesis is rejected, the provider will want to know which factors in particular are leading indicators of heart disease, so we developed the two follow-up questions listed above.
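
For a single categorical factor, one common way to test a null hypothesis like the one above is a chi-square test of independence against AHD. The sketch below is only an illustration of that idea, not the analysis run for this project: the counts are invented, the helper name chi_square_2x2 is hypothetical, and no continuity correction is applied.

```python
import math

def chi_square_2x2(table):
    """Chi-square test of independence for a 2x2 table of counts.

    table[i][j] = number of patients with factor level i and AHD outcome j.
    Returns (statistic, p_value) for 1 degree of freedom.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected in this cell if factor and outcome were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts (NOT taken from the project's dataset):
# rows = exercise-induced angina (no, yes), columns = AHD (no, yes)
counts = [[80, 40], [20, 60]]
stat, p = chi_square_2x2(counts)
print(f"chi-square = {stat:.2f}, p = {p:.4g}")
```

A small p-value for a factor would count as evidence against H0 for that factor; repeating the test across many variables would call for a multiple-comparison adjustment.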

Our Analysis Process
- Data Acquisition: Used data donated by the Cleveland Clinic Foundation to the UCI data repository
- Data Partition: Created validation and training data sets
- Data Transformation: Removed missing records; transformed continuous variables to categorical ones
- Model Analysis: Started with Whole Model Test; refined using Decision Tree Analysis; improved with Neural Network
- Results: Found statistically significant data to reject H0

Here we show our data mining process. First, we gathered the data from the educational page of Professor Gareth James at the University of Southern California. The original source of the data is the Cleveland Clinic Foundation. After acquiring the data, we partitioned it into validation and training data sets and removed records with missing values. We also transformed certain continuous variables into categorical ones to ease our analysis. For the analysis itself, we applied three models: the first was the whole model test, or logistic regression; the second used decision tree analysis for classification; and the third used an artificial neural network for refinement. Our results would then lead us either to reject or to fail to reject our null hypothesis.
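
The cleaning, transformation, and partition steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name prepare, the age cut points, the 30% validation fraction, and the sample records are all hypothetical, since the presentation does not state which bins or split ratio were used.

```python
import random

def prepare(records, seed=0, validation_fraction=0.3):
    """Drop incomplete records, bin age, and split into train/validation."""
    # 1. Remove records with any missing value (None marks "missing" here).
    complete = [r for r in records if all(v is not None for v in r.values())]

    # 2. Transform a continuous variable into a categorical one.
    #    The cut points are illustrative assumptions.
    def age_band(age):
        if age < 45:
            return "under 45"
        if age < 60:
            return "45-59"
        return "60+"

    transformed = [dict(r, age_band=age_band(r["age"])) for r in complete]

    # 3. Shuffle reproducibly, then partition.
    rng = random.Random(seed)
    rng.shuffle(transformed)
    n_valid = int(len(transformed) * validation_fraction)
    return transformed[n_valid:], transformed[:n_valid]

# Hypothetical records, not rows from the real dataset.
records = [
    {"age": 63, "chol": 233, "AHD": 0},
    {"age": 41, "chol": None, "AHD": 0},  # dropped: missing cholesterol
    {"age": 67, "chol": 286, "AHD": 1},
    {"age": 52, "chol": 229, "AHD": 1},
    {"age": 37, "chol": 250, "AHD": 0},
]
train, valid = prepare(records)
print(len(train), "training rows,", len(valid), "validation rows")
```

Fixing the seed keeps the partition reproducible, so accuracy figures reported on the validation set can be regenerated later.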

Data Characteristics
- Donated by the Cleveland Clinic Foundation to the UCI data repository
- 2013 heart patient data
- 303 male and female healthcare patients
- At least 14 attributes of each patient
- Numerical, categorical, and text data types

Our dataset was found in UCI’s public data repository for educational purposes. This data was recorded for the year 2013 and includes 303 patients, both male and female. There are at least 14 attributes for each patient, including numerical, categorical, and text data types.

Understanding the data
Variable | Data Type | Description
age | Date | age in years
sex | Categorical (nominal) | sex (1 = male; 0 = female)
cp | | chest pain type (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic)
trestbps | Numeric | resting blood pressure (in mm Hg on admission to the hospital)
chol | | serum cholesterol in mg/dl
fbs | | fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
restecg | | resting electrocardiographic results (0 = normal; 1 = having ST-T; 2 = hypertrophy)
thalach | | maximum heart rate achieved
exang | | exercise induced angina (1 = yes; 0 = no)
oldpeak | | ST depression induced by exercise relative to rest
slope | | the slope of the peak exercise ST segment (1 = upsloping; 2 = flat; 3 = downsloping)
ca | | number of major vessels (0-3) colored by fluoroscopy
thal | | 3 = normal; 6 = fixed defect; 7 = reversible defect
AHD | | the predicted attribute: diagnosis of heart disease (angiographic disease status; Yes = 1; No = 0)

Here we have the variable dictionary for the dataset. The table gives the data type and a description of the coding for each variable. As you can see, we have standard variables such as age and sex, but we also have health attributes such as the type of chest pain and the measurements from different tests that were performed while the patient was being monitored for heart disease. The most important variable here is the last one, AHD. AHD is the diagnosis of heart disease; it is a categorical variable coded as a binary value, 1 for yes and 0 for no. It is the response, or dependent, variable, while all the other variables are the factors, or independent variables. Throughout our analysis, we are trying to find either single factors or compound factors made up of multiple independent variables that influence the AHD variable.

Data Analysis: Logistic Regression Model
Purpose
- Our team started with logistic regression to find which (if any) variables could have a statistically significant relationship with our target
- This is a simple analysis that could provide guidance about proceeding with the data analysis
Results
- This model showed an accuracy rate of 84.2%
- The relationship between our target variable (AHD) and the variables Ca, Thal, cp, and Oldpeak was shown to be statistically significant at most commonly used significance levels (alpha)

First, we started with the whole model test using logistic regression. Our main purpose was to gather information about our null hypothesis: could we find conclusively that these independent variables do in fact affect our dependent variable? Our answer is yes. This model showed an accuracy rate of 84.2%. Through logistic regression, we found that four variables showed statistically significant relationships with the dependent variable: Ca, the number of major vessels colored by fluoroscopy; Thal, a heart defect classified as normal, fixed, or reversible; cp, the type of chest pain; and Oldpeak, the ST depression induced by exercise relative to rest.
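
To make the logistic regression step concrete, here is a minimal sketch that fits a one-variable model, P(AHD = 1 | ca) = sigmoid(b0 + b1 * ca), by batch gradient descent. It is an illustration only: the data points are invented, the function names are hypothetical, and the project's own model used many variables and was presumably fitted with a statistics package rather than hand-written code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit p(y=1|x) = sigmoid(b0 + b1*x) by batch gradient descent."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the average log-loss with respect to b0 and b1.
        errors = [sigmoid(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        g0 = sum(errors) / n
        g1 = sum(e * x for e, x in zip(errors, xs)) / n
        b0 -= lr * g0
        b1 -= lr * g1
    return b0, b1

# Hypothetical (ca, AHD) pairs, NOT rows from the project's dataset.
ca = [0, 0, 0, 0, 1, 1, 2, 2, 3, 3]
ahd = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

b0, b1 = fit_logistic(ca, ahd)
preds = [1 if sigmoid(b0 + b1 * x) >= 0.5 else 0 for x in ca]
acc = sum(p == y for p, y in zip(preds, ahd)) / len(ahd)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, training accuracy = {acc:.0%}")
```

A positive b1 means the estimated odds of AHD rise with each additional vessel; judging whether such a coefficient is statistically significant requires its standard error, which this sketch does not compute.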

Data Analysis: Decision Tree
Purpose
- This model classifies patients into groups, which is very useful for identifying factors that affect the presence of heart disease
Results
- This model had an accuracy of 86.2% on the validation data
- The model on the right provides a good visual tool for classifying individual data points based on Chest Pain, Ca, and Thal

Bolstered by the findings from our first model, we moved to the decision tree model. We chose it because we hoped to target very specific patient characteristics. The model classifies patients into groups, which is useful for identifying factors that affect the presence of heart disease. The decision tree also performed very well, with an accuracy of 86.2% on the validation data, and it made 4 different splits. As you can see in the diagram, not having chest pain and having a Ca value equal to 1 can be a very strong predictor of the disease. Similarly, having chest pain but having a thal value of 0 seems to be a good predictor of not having the disease.
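
A decision tree picks each split by searching for the variable and value that best separate the two outcome classes. The sketch below shows one such search using Gini impurity. This is an assumption for illustration: the presentation does not state which splitting criterion its tool used, the function names are hypothetical, and the rows are invented.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, features, target="AHD"):
    """Return (feature, value, weighted_impurity) of the best equality split."""
    best = None
    for f in features:
        # Sort the candidate values so ties are broken deterministically.
        for v in sorted({r[f] for r in rows}):
            left = [r[target] for r in rows if r[f] == v]
            right = [r[target] for r in rows if r[f] != v]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[2]:
                best = (f, v, score)
    return best

# Hypothetical rows, NOT taken from the project's dataset.
rows = [
    {"cp": "asymptomatic", "thal": "reversible", "AHD": 1},
    {"cp": "asymptomatic", "thal": "normal", "AHD": 0},
    {"cp": "typical", "thal": "normal", "AHD": 0},
    {"cp": "nonanginal", "thal": "reversible", "AHD": 1},
    {"cp": "asymptomatic", "thal": "reversible", "AHD": 1},
    {"cp": "typical", "thal": "normal", "AHD": 0},
]
print(best_split(rows, ["cp", "thal"]))
```

Applying the same search recursively to each resulting group yields a tree like the one on the slide, with each path from root to leaf describing a combination of factors.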

Data Analysis: Artificial Neural Network
Purpose
- We used a neural network to see if we could further refine our analysis
Results
- This model had an accuracy of 80% on our validation data and 86.8% on the training data
- This model doesn’t allow us to understand variable effects, so those must be gathered from the previous models

Our final model was the artificial neural network. This kind of model has gained a lot of attention in the media due to its ability to handle very complex prediction problems. We used it to see if we could further refine our analysis. The model performed similarly to our other models, with an accuracy of 80% on the validation data and 86.8% on the training data, so it did somewhat better on the training data than on the validation data. Because artificial neural networks act like black boxes, it is difficult to assess the effects of individual variables as we did with the previous models. However, neural networks “learn” through iterations and benefit from increased amounts of data, which makes this model very attractive for our purpose of predicting heart disease.
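
The black-box behavior is easier to see with a concrete forward pass. The sketch below assumes a network with one hidden tanh layer and a sigmoid output; the presentation does not describe the architecture that was actually used, and the weights and inputs here are made up rather than trained.

```python
import math

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One hidden tanh layer followed by a single sigmoid output unit."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(ws, x)) + b)
        for ws, b in zip(w_hidden, b_hidden)
    ]
    z = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical, untrained weights for 3 inputs and 2 hidden units.
w_hidden = [[0.8, -0.4, 0.3], [-0.5, 0.9, 0.1]]
b_hidden = [0.0, -0.2]
w_out = [1.2, -0.7]
b_out = 0.1

# Hypothetical scaled inputs, e.g. ca, oldpeak, thalach.
patient = [1.0, 0.5, -0.3]
print(f"predicted P(AHD = 1) = {forward(patient, w_hidden, b_hidden, w_out, b_out):.3f}")
```

Each input feeds every hidden unit and passes through two nonlinearities before reaching the output, so no single weight reads as the effect of one variable the way a logistic regression coefficient does.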

Model Comparison
Model | Accuracy | Area Under Curve
Logistic Regression | 84.1% | 0.913
Decision Tree Analysis | 86.2% | 0.864
Neural Network | 86.8% | 0.855

In comparing models, we would like to maximize accuracy and the area under the curve for the validation dataset. Logistic regression provides the best AUC for our limited dataset, but as the data continues to grow, a neural network may be more efficient.

Here we compare the results across all three of our models using two metrics: the overall accuracy of each model and the area under its ROC curve (AUC). We want to maximize both for the validation dataset. While the neural network performed best in overall accuracy, the other models were not far behind, and all three models also achieved a high AUC. Because accuracy and AUC are high across the board, our choice of final model comes down to which kind of analysis best fits the business use case. Although logistic regression provided the best AUC for our limited dataset, a neural network may be more efficient as the data continues to grow.
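
Both comparison metrics are simple to compute from a model's predicted probabilities. The sketch below is a minimal illustration with invented labels and scores. It computes AUC from its pairwise definition, the probability that a randomly chosen positive case is scored higher than a randomly chosen negative one, which equals the area under the ROC curve.

```python
def accuracy(labels, scores, threshold=0.5):
    """Share of cases classified correctly at the given probability cutoff."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def auc(labels, scores):
    """Probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A tie between a positive and a negative counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical validation labels and predicted probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print("accuracy:", accuracy(labels, scores))
print("AUC:", auc(labels, scores))
```

Accuracy depends on the chosen cutoff while AUC summarizes ranking quality across all cutoffs, which is one reason the two metrics can order the models differently.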

Data Findings
- We reject our H0. There is a relationship between independent variables and the presence of heart disease
- All three models performed quite well. We chose decision tree and neural network as the final choices to accommodate larger data sets.

Our null hypothesis states that health-related variables do not affect the likelihood of getting heart disease. We reject it because there is a relationship between the independent, or factor, variables and the presence of heart disease. Looking back, all three models performed well, with high accuracy and AUC values. Our choice for the final model was the neural network with assistance from the classification tree. The tree can help identify factors that work together to increase heart risk, giving doctors and patients alike access to better information. With far more data than the 303 patients in our dataset, the neural network can be tuned to predict more effectively on larger datasets in ways that logistic regression cannot.

Analytical Recommendations
- Patients with a fluoroscopy (Ca) count of at least 1 are more likely to be diagnosed with Atherosclerotic Heart Disease than those without
- Measurements such as Thal and chest pain (cp) were also helpful in identifying patients with AHD
- Doctors can use this information to monitor their patients more closely in terms of these attributes

Finally, a few interesting relationships emerged from the dataset. The fluoroscopy count and Thal measurement seem to play key roles in predicting heart disease. Patients with a fluoroscopy count of at least 1 are more likely to be diagnosed with Atherosclerotic Heart Disease than those without. Measurements such as Thal and chest pain were also indications of a patient suffering from heart disease.
