1
Final Project ED Modeling & Prediction
Zhongtian Qiu Hi everyone, my name is Fred, from the CS department. I'm glad to share my project idea with you.
2
Big Picture of The Project
Know Your Ultimate Goal -> Know Your Data -> Select a Proper Model -> Select the Best Parameters -> Prediction. Then we have reason to believe we have found "the best" model and can make a confident prediction. Before getting started, I believe it is reasonable to have a big picture of the whole project, so I summed it up in a flow chart. We need to be clear about our ultimate goal and know the characteristics of our data, then select a proper model and tune its parameters in order to avoid overfitting. If we work out each of these parts, I have reason to believe I have found "the best" model to my knowledge, and thus can make a confident prediction.
3
Know Your Data Many categorical variables
The Admit column contains 3.125% "1" vs. 96.875% "0". How do we deal with missing data? ... I guess everyone is clear about the goal, so I will start from the data. Our data is not that perfect: many of the variables are categorical, and the Admit column contains an overwhelming proportion of "0" labels, so I suppose data preprocessing will be one of the most important parts of our experiment. Most importantly, we have many missing entries in the spreadsheet.
4
Data Preprocessing Missing Value Replacement
Where is the missing data? What does the data mean? What value should replace it? So, how do we deal with the missing data? We need to clarify these three things... (read PPT)
5
Missing value proportion
Basically, I summed up a few variables with a large proportion of missing data and divided them into two groups. First, for variables like nausea_score, we can interpret a missing value as "the patient didn't have a nausea issue, so that field was left empty," or as "we know nothing about it, so this feature contributes nothing to the result." Either way, I believe we can replace the missing value with 0: nothing means "0", so "0" can represent nothing.
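As a minimal sketch of this first strategy (assuming pandas and hypothetical column names, which may differ from the actual dataset headers):

```python
import pandas as pd

# Hypothetical score-like columns where "missing" means "symptom not reported".
SCORE_LIKE_COLS = ["Nausea_Score", "Pain_Score"]

def fill_absent_scores(df: pd.DataFrame) -> pd.DataFrame:
    """Encode an absent score as 0, i.e. treat the empty field as 'no issue'."""
    out = df.copy()
    for col in SCORE_LIKE_COLS:
        if col in out.columns:
            out[col] = out[col].fillna(0)
    return out
```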
6
Missing value proportion
However, for variables like distance and average income, I think such numeric data means a lot to the whole model, and we cannot arbitrarily replace it with 0. Instead, I would use the mean or median for the missing entries. Numeric data like this should not be collapsed to "0"; the mean or median is a better choice.
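A sketch of the second strategy, again with hypothetical column names; the median is often preferred over the mean when the distribution is skewed:

```python
import pandas as pd

# Hypothetical numeric columns that should be imputed rather than zero-filled.
NUMERIC_COLS = ["Distance", "Average_Income"]

def impute_numeric(df: pd.DataFrame, strategy: str = "median") -> pd.DataFrame:
    """Fill missing numeric entries with the column mean or median."""
    out = df.copy()
    for col in NUMERIC_COLS:
        if col in out.columns:
            fill = out[col].median() if strategy == "median" else out[col].mean()
            out[col] = out[col].fillna(fill)
    return out
```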
7
Data Preprocessing Exclude features
Filter-based feature selection methods: Fisher Score, Chi-Square, Mutual Information, Spearman Correlation, Kendall Correlation, Pearson Correlation. Features examined: Admission_Date, Referral_Diagnosis_1, Referral_Diagnosis_2, GP Code, Test_A, Test_F, Test_G, Specialist_Visits, Pain_Score, Symptom2Visit_Days, Hospital_Admisisons, Test_E, Gender. Besides handling missing data, I tried several filter-based feature selection methods and did some research on our variables. As a result, Test_F is the one feature I am confident enough to drop. You can see it clearly through Mutual Information and Chi-Square, which handle categorical data well: Test_F contains only a single value, which means it contributes nothing to the final model. As for the others, interestingly, even though some of them contain only partially valid data, the AUC drops noticeably if I remove them.
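A rough equivalent of two of these filter methods in scikit-learn, assuming the features have already been numerically encoded (X) and Admit is the label (y); this is a sketch, not the exact scoring used in the experiment:

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif, chi2

def rank_features(X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    """Score each numeric-encoded feature against the Admit label."""
    mi = mutual_info_classif(X, y, discrete_features="auto", random_state=0)
    chi_scores, _ = chi2(X, y)  # chi2 requires non-negative feature values
    scores = pd.DataFrame({"mutual_info": mi, "chi2": chi_scores}, index=X.columns)
    return scores.sort_values("mutual_info", ascending=False)

# A constant column such as Test_F (one value everywhere) scores 0 mutual
# information and can safely be dropped:
# X = X.drop(columns=["Test_F"])
```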
8
Model Performance. I summed up the performance of each model, and it turned out that the Two-Class Boosted Decision Tree is the best, reaching about 98.5% AUC, a significant advantage over the rest of the models. I then tried to find the best parameters in order to avoid overfitting.
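As an illustration only, here is how a comparable boosted-tree model could be trained and scored by AUC using scikit-learn's GradientBoostingClassifier as a stand-in (the parameter names mirror the ones discussed in the slides, but the values are assumptions, not the experiment's exact settings):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# X, y: the preprocessed features and the Admit label from the earlier steps.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = GradientBoostingClassifier(
    n_estimators=500,   # "number of iterations"
    learning_rate=0.01,
    max_leaf_nodes=20,  # "number of leaves per tree"
    random_state=0,
)
model.fit(X_train, y_train)
val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(f"Validation AUC: {val_auc:.3f}")
```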
9
Avoid Overfitting. I will skip the parameter choices for the number of leaves per tree and the number of samples per leaf node, and focus on the combination of the number of iterations and the learning rate. You can see from this graph that the larger the learning rate, the earlier the model starts to overfit. Especially for LR = 1, the curve drops significantly once overfitting sets in. LR = 0.01 with around 500 iterations is the best combination among all, so I tried more iteration options, and it ended up like this (show picture): the peak is around 500 to 800 iterations, and beyond that the model clearly starts to overfit.
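A small sweep over the same two hyperparameters could look like the sketch below (reusing the train/validation split from the previous snippet; the grid values are assumptions chosen to echo the slide):

```python
import itertools
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Sweep learning rate x iterations and watch where validation AUC peaks
# before overfitting sets in.
learning_rates = [1.0, 0.1, 0.01]
n_iterations = [100, 300, 500, 800, 1200]

results = {}
for lr, n in itertools.product(learning_rates, n_iterations):
    gbt = GradientBoostingClassifier(
        learning_rate=lr, n_estimators=n, random_state=0
    ).fit(X_train, y_train)
    results[(lr, n)] = roc_auc_score(y_val, gbt.predict_proba(X_val)[:, 1])

best = max(results, key=results.get)
print("Best (learning_rate, n_estimators):", best, "AUC:", round(results[best], 4))
```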
10
Intro to Boosted DT. Since I use a boosted decision tree, I will talk about its algorithm intuitively. Say we have a square region; because we are using decision trees, we can only split it with axis-aligned boundaries. The first split goes here, and we find we misclassified three "plus" points, so we increase the weight of those three to make sure they are classified correctly next time. In the next round we do the same thing: the three "plus" points are now handled correctly, but other points are misclassified, so we re-weight those and repeat. After three rounds, we combine the decision boundaries by giving each one an "alpha" weight according to its accuracy in that round. Then you can see we end up with a very good combined decision boundary.
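The reweighting-and-combining idea described here is essentially AdaBoost with decision stumps; a toy sketch of that intuition (not the exact learner used in the experiment, which is a gradient-boosted tree) is:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_stumps(X, y, n_rounds=3):
    """Toy AdaBoost with axis-aligned stumps; y must be labelled -1 / +1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # start with uniform weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)   # one axis-aligned split
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w[pred != y]) / np.sum(w)        # weighted error rate
        alpha = 0.5 * np.log((1 - err) / (err + 1e-10))
        w *= np.exp(-alpha * y * pred)                # up-weight the mistakes
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)

    def combined(X_new):
        votes = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
        return np.sign(votes)                         # alpha-weighted vote
    return combined
```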
11
Summary Recall: around 0.55 Precision: around 0.65
F1 Score: around 0.597 Accuracy: 97.7% AUC: 98.7% Two-Class Boosted Decision Tree. Then we can make our prediction! That's pretty much everything about my experiment. Thank you all!
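(As a sanity check of my own, not from the slides: F1 = 2 x Precision x Recall / (Precision + Recall) = 2 x 0.65 x 0.55 / (0.65 + 0.55) ~ 0.596, which is consistent with the reported 0.597.)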