Titanic Analytic model to predict survival in Titanic Disaster. By,


Titanic Analytic model to predict survival in Titanic Disaster. By, Varun Kadekar vjkadekar@gmail.com

Contents: Problem Description; Data Exploration; Dependent Variable Dependency; Solution Approach; Final Logit Equation; Validation

Problem Description We need to predict the probability of survival from the available data. The train.csv dataset holds the training data and test.csv holds the validation data. The analysis developed on train.csv must then be applied to the validation data to check the correctness of the solution.

Data Exploration/Preparation Check for outliers in the data (the data looked good apart from a few missing values in the Age column). Treat missing values in the Age column by substituting the mean Age of passengers of the same Sex: if the row with the missing Age is 'female', substitute the mean Age of female passengers on the ship, which is approximately 28; for male passengers, the mean is approximately 31.
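The imputation step described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the tooling used for the original analysis; the function name `impute_age_by_sex` and the toy rows are hypothetical, and the sketch assumes every Sex has at least one known Age.

```python
from statistics import mean


def impute_age_by_sex(rows):
    """Return copies of rows with a missing Age replaced by the mean Age of the same Sex."""
    # Collect the known ages for each sex.
    known = {}
    for row in rows:
        if row["Age"] is not None:
            known.setdefault(row["Sex"], []).append(row["Age"])
    # Mean age per sex, computed only from rows where Age is present.
    sex_mean = {sex: mean(ages) for sex, ages in known.items()}
    # Build new rows so the input list is left unchanged.
    return [
        {**row, "Age": sex_mean[row["Sex"]] if row["Age"] is None else row["Age"]}
        for row in rows
    ]


# Toy rows with hypothetical values (not taken from train.csv).
sample = [
    {"Sex": "female", "Age": 20.0},
    {"Sex": "female", "Age": 36.0},
    {"Sex": "female", "Age": None},
    {"Sex": "male", "Age": 30.0},
    {"Sex": "male", "Age": None},
]
filled = impute_age_by_sex(sample)
```

Computing the means from the known ages only, before filling, keeps the imputed values from influencing the averages.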

Dependent Variable Dependency The correlation between Pclass and Survival shows that more people from the higher class survived and more people from the lower class died. Stats below and charts in the next slide. We can observe that 372 out of 491 passengers in Pclass 3 did not survive the accident, while more than 50% of first-class passengers (136 of 216) survived.
Survived   Pclass 1   Pclass 2   Pclass 3   Sum
0                80         97        372   549
1               136         87        119   342
Total           216        184        491   891
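As a quick check of the figures above, the survival rate per class can be computed directly from the counts in the table. This is a small sketch; the dictionary layout and the function name `survival_rate` are illustrative choices, not part of the original analysis.

```python
# Counts per Pclass, copied from the table above.
counts = {
    1: {"died": 80, "survived": 136},
    2: {"died": 97, "survived": 87},
    3: {"died": 372, "survived": 119},
}


def survival_rate(pclass):
    """Fraction of passengers in the given class who survived."""
    c = counts[pclass]
    return c["survived"] / (c["died"] + c["survived"])


# Survival rate for each class, rounded to three decimals.
rates = {pclass: round(survival_rate(pclass), 3) for pclass in counts}
total_passengers = sum(c["died"] + c["survived"] for c in counts.values())
```

The rate falls from first class to third class, which is the pattern the slide describes.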

Dependent Variable Contd… The chart below gives a clear picture of the number of survivors against those who died for each Pclass.

Dependent Variable Contd… Correlation between Age and survival shows that more people below the age of 10 survived, and the percentage of survival decreases as age increases.

Dependent Variable Contd… Correlation between Sex and survival shows that more men died.

Solution Approach The correlation analysis showed that only the following variables have a significant impact: Pclass, SibSp, Age and Sex. Age is a continuous variable, so it is converted to a categorical variable. Here is the approach I took: Age Bucket is '0' if Age is between 0 and 10; Age Bucket is '1' if Age is between 10 and 30; Age Bucket is '2' if Age is greater than 30. Sex is changed from a character variable to a numeric one: Sex_Num is '0' if Sex is 'female', else Sex_Num is '1'.
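The two derived variables can be sketched as small Python functions. The bucket boundaries follow the rules stated in the deck (0 < age <= 10, 10 < age <= 30, age > 30); the function names, the error raised for non-positive ages, and the case-insensitive comparison of Sex are assumptions made for this illustration.

```python
def age_bucket(age):
    """Map a continuous age to bucket 0, 1 or 2."""
    if age <= 0:
        # Assumption: non-positive ages are treated as invalid input.
        raise ValueError("age must be positive")
    if age <= 10:
        return 0
    if age <= 30:
        return 1
    return 2


def sex_num(sex):
    """Encode Sex as 0 for 'female' and 1 otherwise."""
    # Assumption: compare case-insensitively and ignore surrounding spaces.
    return 0 if sex.strip().lower() == "female" else 1


# Hypothetical passengers as (Age, Sex) pairs.
examples = [(4, "female"), (22, "male"), (45, "female")]
encoded = [(age_bucket(age), sex_num(sex)) for age, sex in examples]
```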

Final Logit Equation The logit model, run on the dependent variable with the independent variables explained in the previous slide, gives the logit equation below for the probability of survival.
$$P(\text{survival}) = \frac{e^{M}}{1 + e^{M}}$$
where
$$M = 4.7905 - 1.1010\,\text{PClass} - 0.7365\,\text{age\_buck} - 0.3584\,\text{SibSp} - 2.6210\,\text{Sex\_numeric}$$
PClass --> Value of PClass in the input file.
age_buck --> Age_buck is '0' if 0 < age <= 10, '1' if 10 < age <= 30, and '2' if 30 < age < 100.
SibSp --> Value from the input file.
Sex_numeric --> This is a derived variable. Sex_numeric is '1' if sex in the input is 'Male', else Sex_numeric is '0'.
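To make the equation concrete, here is a minimal sketch that evaluates it for individual passengers. The coefficients are the ones stated above; the function name `survival_probability` and the two example passengers are hypothetical. The code uses the form 1/(1 + e^(-M)), which is algebraically the same as e^M/(1 + e^M).

```python
import math

# Coefficients taken from the logit equation in the deck.
INTERCEPT = 4.7905
COEF_PCLASS = -1.1010
COEF_AGE_BUCK = -0.7365
COEF_SIBSP = -0.3584
COEF_SEX_NUMERIC = -2.6210


def survival_probability(pclass, age_buck, sibsp, sex_numeric):
    """Evaluate the logistic function at the linear predictor M."""
    m = (
        INTERCEPT
        + COEF_PCLASS * pclass
        + COEF_AGE_BUCK * age_buck
        + COEF_SIBSP * sibsp
        + COEF_SEX_NUMERIC * sex_numeric
    )
    # 1 / (1 + e^-M) equals e^M / (1 + e^M).
    return 1.0 / (1.0 + math.exp(-m))


# Hypothetical passengers: a first-class woman aged 11 to 30 travelling alone,
# and a third-class man over 30 travelling alone.
p_first_class_woman = survival_probability(pclass=1, age_buck=1, sibsp=0, sex_numeric=0)
p_third_class_man = survival_probability(pclass=3, age_buck=2, sibsp=0, sex_numeric=1)
```

Because every coefficient is negative, a higher class number, an older age bucket, more siblings or spouses aboard, and being male each lower the predicted probability.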

Validation Applied the logit equation to the validation dataset, test.csv. The chart below shows the predicted probability of survival. The model seems to have predicted the probability of survival correctly: where the model predicted a probability of survival above 90%, the passengers did in fact survive, and as the predicted probability decreases, more of the passengers actually died.

Validation Additional validation evidence is attached below. In the Excel sheet below, column P shows the probability of survival predicted by the model, and column O shows the actual Survival variable from the myfirstforest.csv dataset.

Thank you… Varun Kadekar vjkadekar@gmail.com