Predict Who Survived the Titanic Disaster

Presentation transcript:

Goal: Predict who survived the Titanic disaster.
Score = number of passengers in the test dataset whose fate is correctly predicted.
Workflow stages: Goal, Get Data, Hypotheses, Data Management, Statistics & Analysis, Submit Predictions.
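As a small illustration of the scoring rule, the sketch below counts correct predictions. The two vectors are made-up values for illustration only (1 = survived, 0 = did not survive); they are not taken from the competition data.

```r
# Hypothetical example vectors: 1 = survived, 0 = did not survive.
actual    <- c(1, 0, 0, 1, 1, 0, 0, 0)
predicted <- c(1, 0, 1, 1, 0, 0, 0, 0)

# Score as described on the slide: the number of passengers whose fate
# was predicted correctly.
n_correct <- sum(predicted == actual)

# The same score expressed as a share of the test set.
accuracy <- n_correct / length(actual)

print(n_correct)
print(accuracy)
```

Reporting the share alongside the raw count makes scores comparable across test sets of different sizes.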

Training and Test Data
All Titanic Passengers: N = 2,223
Training Data: N = 891, 39% survived (used to develop the model)
Test Data: N = 418
How similar is the test data to the training data? If similar, the model should do well. If different, the model could perform poorly.
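One simple way to check similarity is to compare the share of each category of a predictor in the two sets. The sketch below does this for tiny made-up data frames; `train_data` and `test_data` here are hypothetical stand-ins, and `category_shares` and `compare_shares` are helper names introduced only for this example.

```r
# Hypothetical miniature stand-ins for the real training and test sets.
train_data <- data.frame(
  sex    = c("male", "female", "male", "male", "female", "male"),
  pclass = c(3, 1, 3, 2, 3, 1)
)
test_data <- data.frame(
  sex    = c("male", "male", "female", "male"),
  pclass = c(3, 2, 3, 1)
)

# Share of each category in a vector, over a fixed set of categories so
# that both data sets produce the same columns.
category_shares <- function(x, categories) {
  counts <- table(factor(x, levels = categories))
  setNames(as.numeric(counts) / length(x), categories)
}

# One row per data set, one column per category.
compare_shares <- function(column) {
  categories <- sort(unique(c(train_data[[column]], test_data[[column]])))
  rbind(
    train = category_shares(train_data[[column]], categories),
    test  = category_shares(test_data[[column]], categories)
  )
}

sex_shares    <- compare_shares("sex")
pclass_shares <- compare_shares("pclass")

print(round(sex_shares, 2))
print(round(pclass_shares, 2))
```

Large gaps between the two rows for an important predictor would suggest that the test data differs from the training data in a way that could hurt the model.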

Kitchen Sink Over-Fitting?

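Decision Tree Pruning

library(rpart)
# maxdepth = 2 caps the tree at two levels of splits; method = "class" requests a classification tree.
model.6 <- rpart(survived ~ sex + age + pclass + sibsp + parch + fare + embarked,
                 data = train_data, method = "class", maxdepth = 2)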

Hold Out and Cross-Validation
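A minimal sketch of both ideas follows. It assumes R's built-in `Titanic` table (which has only Class, Sex, Age and Survived) as a stand-in for the competition files, and it assumes the rpart package is installed. The 50/50 split, the choice of 5 folds, the seed, and the helper name `fit_and_score` are illustrative choices, not values from the slides.

```r
library(rpart)

# Assumption: R's built-in Titanic table stands in for the slides' data.
# Expand the table of counts into one row per passenger.
titanic_counts <- as.data.frame(Titanic)
passengers <- titanic_counts[
  rep(seq_len(nrow(titanic_counts)), titanic_counts$Freq),
  c("Class", "Sex", "Age", "Survived")
]

# Fit a shallow tree on one set of rows and return accuracy on another.
fit_and_score <- function(fit_rows, eval_rows) {
  tree <- rpart(Survived ~ Class + Sex + Age,
                data = passengers[fit_rows, ],
                method = "class", maxdepth = 2)
  predicted <- predict(tree, newdata = passengers[eval_rows, ], type = "class")
  mean(predicted == passengers$Survived[eval_rows])
}

set.seed(42)
n <- nrow(passengers)

# Hold-out: fit on a random half, evaluate on the other half.
fit_rows <- sample(n, size = floor(n / 2))
holdout_accuracy <- fit_and_score(fit_rows, setdiff(seq_len(n), fit_rows))

# k-fold cross-validation: each fold is held out once while the model
# is fit on the remaining folds.
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))
cv_accuracy <- sapply(seq_len(k), function(fold) {
  fit_and_score(which(fold_id != fold), which(fold_id == fold))
})
mean_cv_accuracy <- mean(cv_accuracy)

print(holdout_accuracy)
print(cv_accuracy)
print(mean_cv_accuracy)
```

Hold-out gives a single estimate from one split, while cross-validation averages over several splits and so depends less on which rows happened to be held out.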

Random Forest: Multiple Trees

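Confusion Matrix
[Three confusion matrices, labeled Random Forest, Gender, and Decision Tree. Columns: 0, 1, %Err. Cell values are not recoverable; the surviving totals read 446 with 18%, 446 with 20%, and 446 with 21%.]
False Positives, False Negatives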
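For readers who want to build such a matrix themselves, the sketch below uses base R's `table()`. The two vectors are made-up values, not the slide's results, and reading the %Err column as a per-class error rate is an assumption.

```r
# Hypothetical example vectors: 1 = survived, 0 = did not survive.
actual    <- c(0, 0, 0, 0, 1, 1, 1, 0, 1, 0)
predicted <- c(0, 1, 0, 0, 1, 0, 1, 0, 1, 0)

# Rows are the true outcome, columns are the model's prediction.
confusion <- table(
  actual    = factor(actual, levels = c(0, 1)),
  predicted = factor(predicted, levels = c(0, 1))
)

# False positive: predicted to survive but did not.
false_positives <- confusion["0", "1"]
# False negative: predicted not to survive but did.
false_negatives <- confusion["1", "0"]

# Overall error rate across all passengers.
error_rate <- (false_positives + false_negatives) / sum(confusion)

# Per-class error rate (assumed to correspond to the %Err column).
class_error <- 1 - diag(confusion) / rowSums(confusion)

print(confusion)
print(error_rate)
print(class_error)
```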

Model Ceiling Gender Model Seems Realistic
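The slides do not spell out the gender model. A common reading, assumed here, is the rule "predict that women survived and men did not". The sketch applies that rule to R's built-in `Titanic` table, which is a stand-in for the competition files, so the resulting accuracy is not the slide's figure.

```r
# Assumption: R's built-in Titanic table stands in for the slides' data.
# Expand the table of counts into one row per passenger.
titanic_counts <- as.data.frame(Titanic)
passengers <- titanic_counts[
  rep(seq_len(nrow(titanic_counts)), titanic_counts$Freq),
  c("Sex", "Survived")
]

# Assumed gender rule: women are predicted to survive, men are not.
predicted <- ifelse(passengers$Sex == "Female", "Yes", "No")

# Share of passengers whose fate the rule gets right.
gender_accuracy <- mean(predicted == passengers$Survived)

print(gender_accuracy)
```

A one-variable rule like this is a useful baseline: a more complex model is only worth keeping if it beats it on held-out data.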

Why a Model Ceiling?

Below are 4 pairs of passengers with very similar predictor variables; yet, within each pair, one survived and the other did not. At some point there just isn't a variable in the data that can help make an accurate prediction.

survived | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked
1 | 2 | Louch, Mrs. Charles Alexander (Alice Adelaide Slow) | female | 42 | 1 | 0 | SC/AH | | | S
0 | 2 | Carter, Mrs. Ernest Courtenay (Lilian Hughes) | female | | | | | | | S
1 | 3 | Asplund, Miss. Lillian Gertrud | female | | | | | | | S
0 | 3 | Andersson, Miss. Ebba Iris Alfrida | female | | | | | | | S
1 | 1 | Bjornstrom-Steffansson, Mr. Mauritz Hakan | male | | | | | | C52 | S
0 | 1 | Long, Mr. Milton Clyde | male | | | | | | D6 | S
1 | 1 | Simonius-Blumer, Col. Oberst Alfons | male | | | | | | A26 | C
0 | 1 | Smith, Mr. James Clinch | male | | | | | | A7 | C