[Process diagram: Goal; Get Data; Hypotheses; Data Management; Statistics & Analysis; Submit Predictions]
Goal
Predict who survived the Titanic disaster.
Score = number of passengers in the test dataset whose fate is predicted correctly.
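A minimal sketch of the score defined on this slide, assuming survival is coded 1 and death 0. The two vectors are toy values made up for illustration, not Titanic data, and the names `actual` and `predicted` are hypothetical.

```r
# Toy vectors for illustration only; not real Titanic data.
actual    <- c(1, 0, 0, 1, 0, 1, 0, 0)
predicted <- c(1, 0, 1, 1, 0, 0, 0, 0)

# Score = number of test passengers whose fate was predicted correctly.
score <- sum(predicted == actual)

# Dividing by the number of passengers turns the score into an accuracy.
accuracy <- score / length(actual)
print(c(score = score, accuracy = accuracy))
```
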
Training and Test Data
All Titanic Passengers: N = 2,223
Training Data: N = 891, 39% survived
Test Data: N = 418
Develop Model
How similar is the Test Data to the Training Data? If similar, then the model should do well. If different, then the model could perform poorly.
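One way to check how similar the two datasets are is to compare predictor distributions side by side. The sketch below uses synthetic stand-ins for the training and test sets (the values are randomly generated, not the real data) and assumes columns named `sex`, `pclass`, and `age`.

```r
set.seed(1)

# Synthetic stand-ins for the training and test sets (hypothetical values).
train_data <- data.frame(
  sex    = factor(sample(c("female", "male"), 200, replace = TRUE)),
  pclass = sample(1:3, 200, replace = TRUE),
  age    = runif(200, 1, 70)
)
test_data <- data.frame(
  sex    = factor(sample(c("female", "male"), 100, replace = TRUE)),
  pclass = sample(1:3, 100, replace = TRUE),
  age    = runif(100, 1, 70)
)

# Categorical predictor: share of each level in train vs. test.
sex_share <- sapply(list(train = train_data, test = test_data),
                    function(d) prop.table(table(d$sex)))
print(round(sex_share, 2))

# Numeric predictor: compare summaries, then run a two-sample KS test.
print(summary(train_data$age))
print(summary(test_data$age))
age_ks <- ks.test(train_data$age, test_data$age)
print(age_ks$p.value)
```

Large gaps in the shares, or a very small KS p-value, would suggest that the test data differ from the training data on that predictor.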
Kitchen Sink Over-Fitting?
Decision Tree Pruning
library(rpart)
# Limit the tree to a depth of 2; maxdepth is passed through to rpart.control()
model.6 <- rpart(survived ~ sex + age + pclass + sibsp + parch + fare + embarked, data = train_data, maxdepth = 2)
Hold Out and Cross-Validation
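A rough sketch of both ideas using `rpart`. The data are synthetic, and the survival probabilities by sex are an assumption chosen only so the example has some signal. The 50/50 hold-out split and `k = 5` folds are also illustrative choices, not settings taken from the slides.

```r
library(rpart)
set.seed(42)

# Synthetic data for illustration; survival probabilities are assumptions.
n <- 400
toy <- data.frame(
  sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  pclass = factor(sample(1:3, n, replace = TRUE)),
  age    = runif(n, 1, 70)
)
toy$survived <- factor(rbinom(n, 1, ifelse(toy$sex == "female", 0.75, 0.20)),
                       levels = c(0, 1))

# Hold-out: fit on half of the rows, score on the other half.
idx  <- sample(n, n / 2)
fit  <- rpart(survived ~ sex + pclass + age, data = toy[idx, ], method = "class")
pred <- predict(fit, toy[-idx, ], type = "class")
holdout_acc <- mean(pred == toy$survived[-idx])

# k-fold cross-validation: every row is held out exactly once.
k    <- 5
fold <- sample(rep(1:k, length.out = n))
cv_acc <- sapply(1:k, function(i) {
  f <- rpart(survived ~ sex + pclass + age, data = toy[fold != i, ],
             method = "class")
  p <- predict(f, toy[fold == i, ], type = "class")
  mean(p == toy$survived[fold == i])
})
mean_cv_acc <- mean(cv_acc)
print(c(holdout = holdout_acc, cv = mean_cv_acc))
```

Accuracy measured on held-out rows is a fairer guide to test performance than accuracy on the rows used to fit the model.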
Random Forest: Multiple Trees
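To illustrate the "multiple trees" idea, here is a sketch of bagging with `rpart`: many trees, each grown on a bootstrap resample, combined by majority vote. This is a simplified stand-in, not a full random forest, since it does not sample a random subset of predictors at each split. The data are synthetic, and `n_trees = 25` is an arbitrary choice.

```r
library(rpart)
set.seed(7)

# Synthetic data for illustration; survival probabilities are assumptions.
n <- 300
toy <- data.frame(
  sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  pclass = factor(sample(1:3, n, replace = TRUE)),
  age    = runif(n, 1, 70)
)
toy$survived <- factor(rbinom(n, 1, ifelse(toy$sex == "female", 0.75, 0.20)),
                       levels = c(0, 1))
train <- toy[1:200, ]
test  <- toy[201:300, ]

# Grow each tree on a bootstrap resample of the training rows.
n_trees <- 25
votes <- sapply(seq_len(n_trees), function(b) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  tree <- rpart(survived ~ sex + pclass + age, data = boot, method = "class")
  as.character(predict(tree, test, type = "class"))
})

# Majority vote: predict 1 when more than half of the trees say 1.
ensemble_pred <- ifelse(rowMeans(votes == "1") > 0.5, "1", "0")
ensemble_acc  <- mean(ensemble_pred == as.character(test$survived))
print(ensemble_acc)
```
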
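Confusion Matrix
[Figure: confusion matrices for the Random Forest, Gender, and Decision Tree models, each with columns 0, 1, and % Err; bottom rows read 446 / 18%, 446 / 20%, and 446 / 21%. False Positives and False Negatives are marked.]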
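A small sketch of how a confusion matrix with a 0 / 1 / % Err layout can be built in base R. The labels are toy values made up for illustration (1 = survived, 0 = died), and the variable names are hypothetical.

```r
# Toy labels for illustration only (1 = survived, 0 = died).
actual    <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 0)
predicted <- c(0, 0, 0, 1, 0, 1, 0, 1, 1, 0)

# Rows are actual outcomes, columns are predicted outcomes.
cm <- table(actual    = factor(actual,    levels = c(0, 1)),
            predicted = factor(predicted, levels = c(0, 1)))
print(cm)

false_pos <- cm["0", "1"]  # predicted to survive, but died
false_neg <- cm["1", "0"]  # predicted to die, but survived

# Error rate within each actual class, and overall.
err_by_class <- 1 - diag(cm) / rowSums(cm)
overall_err  <- 1 - sum(diag(cm)) / sum(cm)
print(round(err_by_class, 2))
print(overall_err)
```
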
Model Ceiling Gender Model Seems Realistic
Why a Model Ceiling?
Below are 4 pairs of passengers with very similar predictor variables; yet, within each pair, one survived and the other did not. At some point the data simply do not contain a variable that would allow an accurate prediction.

survived | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked
1 | 2 | Louch, Mrs. Charles Alexander (Alice Adelaide Slow) | female | 42 | 1 | 0 | SC/AH | | | S
0 | 2 | Carter, Mrs. Ernest Courtenay (Lilian Hughes) | female | | | | | | | S
1 | 3 | Asplund, Miss. Lillian Gertrud | female | | | | | | | S
0 | 3 | Andersson, Miss. Ebba Iris Alfrida | female | | | | | | | S
1 | 1 | Bjornstrom-Steffansson, Mr. Mauritz Hakan | male | | | | | | C52 | S
0 | 1 | Long, Mr. Milton Clyde | male | | | | | | D6 | S
1 | 1 | Simonius-Blumer, Col. Oberst Alfons | male | | | | | | A26 | C
0 | 1 | Smith, Mr. James Clinch | male | | | | | | A7 | C
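One way to make the ceiling concrete: when passengers share the same predictor values but have different outcomes, the best any model can do for that group is predict its majority outcome. The sketch below computes that upper bound on a toy data frame. The rows are hypothetical, not taken from the table above, and only `sex` and `pclass` are used as predictors.

```r
# Toy data (hypothetical): identical predictors, mixed outcomes within groups.
toy <- data.frame(
  sex      = c("female", "female", "female", "male", "male", "male", "male", "male"),
  pclass   = c(2, 2, 2, 1, 1, 1, 3, 3),
  survived = c(1, 0, 1, 1, 0, 0, 0, 0)
)

# For each distinct predictor combination, count the rows in the majority
# outcome; no model using only these predictors can get more of them right.
best_correct <- aggregate(survived ~ sex + pclass, data = toy,
                          FUN = function(s) max(sum(s == 1), sum(s == 0)))

# Ceiling = best achievable share of correct predictions on this data.
ceiling_acc <- sum(best_correct$survived) / nrow(toy)
print(ceiling_acc)
```

Here each of the three groups contributes two correctly predictable rows, so the ceiling on this toy data is 6 of 8.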