Final Project: ED Modeling & Prediction

Presentation transcript:

Final Project: ED Modeling & Prediction (Zhongtian Qiu). Hi everyone, my name is Fred, from the CS department. I'm glad to share my project with you.

Big Picture of the Project: Know the Ultimate Goal -> Know Your Data -> Select a Proper Model -> Select the Best Parameters -> Prediction. Before getting started, I believe it's rational to get a big picture of the whole project, so I summed it up as a flow chart. We need to be clear about our ultimate goal and know the characteristics of our data, then select a proper model and set the best parameters to avoid overfitting. If we work out each of these steps, I have reason to believe I have found "the best" model to my knowledge and can make a confident prediction.

Know Your Data: many categorical variables; the Admit column contains 3.125% "1" vs. 96.875% "0"; how do we deal with missing data? I guess everyone is very clear about the goal, so I will start from the data. Our data is not that clean: many variables are categorical, and the Admit column is overwhelmingly "0", so I suppose data preprocessing will be one of the most important parts of our experiment. Most importantly, we have many missing cells in the spreadsheet.
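Before choosing a model, it helps to quantify these issues directly. Below is a minimal inspection sketch in Python with pandas; the file name "ed_data.csv" and the exact column name "Admit" are assumptions, since the slides don't give them precisely.

```python
import pandas as pd

# Hypothetical CSV export of the spreadsheet described above.
df = pd.read_csv("ed_data.csv")

# Class balance: the slides report roughly 3.125% "1" vs. 96.875% "0".
print(df["Admit"].value_counts(normalize=True))

# Share of missing values per column, worst first.
print(df.isna().mean().sort_values(ascending=False).head(10))

# How many columns are categorical (object-typed) vs. numeric.
print(df.dtypes.value_counts())
```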

Data Preprocessing: Missing Value Replacement. Where is the missing data? What does the data mean? What value should replace it? So, how do we deal with the missing data? We need to clarify three things... (read PPT)

Missing value proportion. Basically, I summed up the variables that are missing a large share of their data and divided them into two groups. First, for a variable like nausea_score, we can interpret a missing value either as "the patient didn't have a nausea issue, so the cell was left empty" or as "we know nothing here, so this feature contributes nothing to the result". Either way, I believe we can replace the missing value with 0: nothing means "0", so "0" can stand for nothing.

Missing value proportion. However, for variables like distance and average income, such numeric data means a lot to the whole model, and we cannot arbitrarily replace it with 0. Instead, I'd rather fill the missing cells with the mean or median. Numeric data like this should not be zeroed out; the mean or median is a better stand-in.
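As a concrete sketch of both replacement strategies (assuming pandas, and guessing the column spellings "nausea_score", "distance", and "average_income" from the slides):

```python
import pandas as pd

df = pd.read_csv("ed_data.csv")  # hypothetical file name, as above

# Strategy 1: "nothing means 0". An empty symptom score is read as
# "the patient did not have that issue", so 0 stands in for absence.
df["nausea_score"] = df["nausea_score"].fillna(0)

# Strategy 2: numeric covariates carry real signal, so fill with a
# central value instead; the median is less sensitive to outliers
# than the mean.
for col in ["distance", "average_income"]:
    df[col] = df[col].fillna(df[col].median())
```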

Data Preprocessing: Exclude Features. Filter methods tried: Fisher Score, Chi-Square, Mutual Information, Spearman correlation, Kendall correlation, Pearson correlation. Fisher Score candidates: Admission_Date, Referral_Diagnosis_1, Referral_Diagnosis_2, GP Code, Test_A, Test_F, Test_G. Chi-Square and Mutual Information scores:

Feature                Chi-Square    Mutual Information
Specialist_Visits      258.884583    0.001685
Pain_Score             245.396422    0.001636
Symptom2Visit_Days     242.203222    0.001587
Hospital_Admisisons    170.371268    0.000886
Test_E                  19.626987    0.000138
Gender                   0.185827    0.000001

Besides handling missing data, I tried several filter-based feature selection methods and did some research on our variables. As a result, Test_F is the one feature I am confident enough to drop. You can see it clearly through MI and Chi-Square, which are well suited to categorical data: Test_F contains only a single value, which means it contributes nothing to the final model. As for the others, interestingly, even though some of them contain only partially valid data, the AUC drops substantially if I remove them.
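As an illustration of the filter-based scoring behind the table, here is a sketch with scikit-learn's chi2 and mutual_info_classif. It assumes the features have already been imputed and encoded as non-negative numbers (chi2 requires non-negative inputs); the file name is hypothetical.

```python
import pandas as pd
from sklearn.feature_selection import chi2, mutual_info_classif

df = pd.read_csv("ed_data_encoded.csv")  # hypothetical preprocessed table
X = df.drop(columns=["Admit"])
y = df["Admit"]

chi2_scores, _ = chi2(X, y)  # requires non-negative feature values
mi_scores = mutual_info_classif(X, y, random_state=0)

# Rank features by chi-square, mirroring the table above. A feature with
# a single constant value (like Test_F) scores zero on both measures.
for name, c, m in sorted(zip(X.columns, chi2_scores, mi_scores),
                         key=lambda t: -t[1]):
    print(f"{name:25s} chi2={c:12.6f} MI={m:.6f}")
```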

Model Performance. I summed up the performance of each model, and it turned out that the two-class boosted decision tree is the best, reaching about 98.5% AUC, a significant advantage over the rest of the models. I then tried to find the best parameters in order to avoid overfitting.
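The module names ("Two-Class Boosted Decision Tree", "Filter Based Feature Selection") suggest the experiment was built in a drag-and-drop tool such as Azure ML Studio, so the exact configuration isn't reproducible here. Below is a rough scikit-learn analogue of the comparison; the choice of stand-in models and settings is an assumption.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("ed_data_encoded.csv")  # hypothetical preprocessed table
X, y = df.drop(columns=["Admit"]), df["Admit"]

models = {
    "boosted decision tree": GradientBoostingClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    # AUC is a sensible yardstick given the heavy class imbalance noted earlier.
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name:22s} mean AUC = {auc.mean():.4f}")
```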

Avoid Overfitting. I will skip the parameter choices for the number of leaves per tree and the number of samples per leaf node, and focus on the combination of iteration count and learning rate. You can see from this graph that the larger the learning rate, the earlier the model overfits. Especially for LR = 1, the curve drops sharply once overfitting sets in. LR = 0.01 with around 500 iterations is the best of all, so I tried more iteration options, and it ends up like this. (Show picture.) You can see the peak is around 500 to 800 iterations, after which the model clearly starts to overfit.
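A sketch of that sweep, using GradientBoostingClassifier's staged predictions to trace validation AUC as trees are added. The train/validation split, the learning-rate grid, and the cap of 800 iterations are assumptions based on the narration.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("ed_data_encoded.csv")  # hypothetical preprocessed table
X, y = df.drop(columns=["Admit"]), df["Admit"]
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

for lr in [1.0, 0.1, 0.01]:
    model = GradientBoostingClassifier(
        learning_rate=lr, n_estimators=800, random_state=0).fit(X_tr, y_tr)
    # Validation AUC after each boosting iteration; a curve that peaks
    # and then declines is the overfitting pattern described above.
    aucs = [roc_auc_score(y_va, p[:, 1])
            for p in model.staged_predict_proba(X_va)]
    best = max(range(len(aucs)), key=aucs.__getitem__)
    print(f"lr={lr}: best AUC {aucs[best]:.4f} at iteration {best + 1}")
```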

Intro to Boosted DT. Since I use a boosted decision tree, I will explain the algorithm intuitively. Say we have a square, and because we are using decision trees we can only divide it with axis-aligned splits. The first split goes here, and we find we have misclassified three "plus" points, so we enlarge the weights of those three to ensure they are classified correctly next time. Next round we do the same thing: the three "plus" points are now handled well, but other points get misclassified, so we re-weight those and go again. After three rounds, we combine the decision boundaries by giving each one a weight "alpha" according to its accuracy that round (in AdaBoost, alpha = 0.5 * ln((1 - err) / err), where err is that round's weighted error, so more accurate rounds get a larger vote). Then you will see we end up with an excellent combined decision boundary.
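The square walkthrough is exactly the AdaBoost re-weighting loop. Here is a minimal sketch with scikit-learn over depth-1 trees ("stumps", i.e. single axis-aligned splits); the toy data is invented purely for illustration.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy 2-D points standing in for the square example above.
rng = np.random.default_rng(0)
X_toy = rng.uniform(size=(200, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 1).astype(int)

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # one axis-aligned split
    n_estimators=3,  # three rounds, as in the walkthrough above
    random_state=0,
).fit(X_toy, y_toy)  # note: sklearn < 1.2 calls the kwarg base_estimator

# Each round's alpha weights that stump's vote in the final boundary;
# more accurate rounds receive larger alphas.
print(ada.estimator_weights_)
```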

Summary. Two-Class Boosted Decision Tree:
Recall: around 0.55
Precision: around 0.65
F1 Score: around 0.597
Accuracy: 97.7%
AUC: 98.7%
Then we can make our prediction! That's pretty much it for my experiment. Thank you all!