Logistic regression: Who survived the Titanic?

The sinking of the Titanic: The Titanic struck an iceberg on April 14th 1912 and sank in the early hours of April 15th with 2228 souls on board; 705 survived. A dataset covering 1309 passengers has survived. Who survived?

The data: The columns are pclass, survived, name, sex, age, sibsp, and parch. The first rows are: Allen, Miss. Elisabeth Walton (female, age 29); Allison, Master. Hudson Trevor (male, age 0.9167); Allison, Miss. Helen Loraine; Allison, Mr. Hudson Joshua Creighton (age 30); Allison, Mrs. Hudson J C (Bessie Waldo Daniels) (age 25); Anderson, Mr. Harry (age 48); Andrews, Miss. Kornelia Theodosia (age 63); Andrews, Mr. Thomas Jr (age 39); Appleton, Mrs. Edward Dale (Charlotte Lamson) (age 53). Sibsp is the number of siblings and/or spouses accompanying the passenger. Parch is the number of parents and/or children accompanying the passenger. Some values are missing. Can we predict who will survive Titanic II?

Analyzing the data in a (too) simple manner: associations between factors without considering interactions.

Could we use multiple linear regression to predict survival? Multiple linear regression: the response variable is defined between -inf and +inf and is normally distributed. Logistic regression: the response variable is defined between 0 and 1 and is Bernoulli distributed.

The logit transformation is modeled linearly. The logistic function.
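
A sketch of the two formulas named on this slide, in standard textbook notation. The symbols p, z, b_0, b_i and x_i are assumptions and are not taken from the slide: p is the probability of the event, x_i are the explanatory variables and b_i the fitted coefficients.

```latex
% Logit transformation, modeled as a linear function of the predictors
\operatorname{logit}(p) = \ln\frac{p}{1-p} = z = b_0 + b_1 x_1 + \dots + b_k x_k

% Logistic function, the inverse of the logit
p = \frac{1}{1 + e^{-z}} = \frac{e^{z}}{1 + e^{z}}
```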

The sigmoidal curve: The intercept basically just shifts the curve along the input variable. Large regression coefficient → the risk factor strongly influences the probability. Positive regression coefficient → the risk factor increases the probability.
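
The behaviour of the curve can be explored with a small sketch. The function names `logistic` and `midpoint` and the example intercepts and coefficients are illustrative assumptions, not values from the slides.

```python
import math

def logistic(x, intercept=0.0, coef=1.0):
    """Probability given by the logistic function for a single input x."""
    z = intercept + coef * x
    return 1.0 / (1.0 + math.exp(-z))

def midpoint(intercept, coef):
    """Input value at which the curve crosses p = 0.5 (where z = 0)."""
    return -intercept / coef

# Illustrative (intercept, coefficient) pairs: baseline, shifted curve,
# steeper curve, and a negative coefficient (decreasing curve).
for intercept, coef in [(0.0, 1.0), (2.0, 1.0), (0.0, 3.0), (0.0, -1.0)]:
    probs = [round(logistic(x, intercept, coef), 3) for x in (-2, 0, 2)]
    print(intercept, coef, midpoint(intercept, coef), probs)
```

Changing only the intercept moves the point where p = 0.5, while changing the size or sign of the coefficient changes how steeply, and in which direction, the probability responds to the input.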

Logistic regression of the Titanic data

Logistic regression of the Titanic data: Summary of data. Coding of the dependent variable. Coding of the categorical explanatory variable: first class: 1; second class: 2; third class: reference.

Logistic regression of the Titanic data: A fit of the null model, which is basically just the intercept, is usually not interesting. The overall probability of survival is 500/1309 = 0.382. The cutoff is 0.5, so all passengers are classified as non-survivors. The output also tests whether the null model is sufficient; it almost certainly is not. It shows that survival is related to pclass (which is not in the null model).
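
A minimal sketch of what the null model amounts to, using only the counts quoted above (500 survivors out of 1309 passengers). The variable names are illustrative assumptions.

```python
import math

survivors, total = 500, 1309

# With no predictors, the fitted probability is the overall survival rate
# and the intercept is its logit (log odds).
p_survive = survivors / total
intercept = math.log(p_survive / (1 - p_survive))

# With a cutoff of 0.5, every passenger gets the same predicted class.
predict_survivor = p_survive >= 0.5
correct = survivors if predict_survivor else total - survivors
accuracy = correct / total

print(round(p_survive, 3), round(intercept, 3), predict_survivor, round(accuracy, 3))
```

Because the overall survival rate is below the cutoff, the null model labels everyone a non-survivor and is right only for those who did not survive.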

Logistic regression of the Titanic data: Omnibus test: uses the likelihood ratio (LR) to test whether adding the pclass variable makes the model better. It did! But only better than the null model, so no surprise. Model summary: other measures of goodness of fit. Classification table: by including pclass, 67.7% of the passengers were correctly categorized. Variables in the equation: the first line repeats that pclass has a significant effect on survival. B is the fitted logistic parameter. Exp(B) is the odds ratio, so the odds of survival are 4.7 (3.6-6.3) times those of passengers in third class (the reference class).
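
The relation between B, Exp(B) and probabilities can be sketched as follows. The odds ratio 4.7 is the value quoted above; the baseline probability of 0.25 is a hypothetical number chosen only for illustration, and the helper names are assumptions.

```python
import math

def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

def prob(o):
    """Convert odds back to a probability."""
    return o / (1 + o)

exp_b = 4.7              # odds ratio reported on the slide
b = math.log(exp_b)      # the corresponding coefficient B on the logit scale

baseline_p = 0.25        # hypothetical reference-class survival probability
new_p = prob(odds(baseline_p) * exp_b)

print(round(b, 3), round(new_p, 3))
```

An odds ratio multiplies the odds, not the probability: in this hypothetical case the probability rises from 0.25 to roughly 0.61, which is far less than a factor of 4.7.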

Logistic regression of the Titanic data, now adding family relations: '3 or more' is set as the reference group by SPSS.

Logistic regression of the Titanic data, now adding family relations: The model correctly classifies 79.1% of the passengers.

Logistic regression of the Titanic data, now adding family relations: Basically all factors seem to affect the probability of survival.

How was it with age? Linear associations are easy to model, because the factor enters the predictive value directly. But the association does not really look linear; maybe a third-order polynomial? Three new factors for age are calculated: the first, second, and third order of the age divided by the standard deviation.
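
A minimal sketch of how such age factors could be computed. The function name `age_polynomial_terms` is hypothetical. Centering the age on its mean before dividing by the standard deviation is an assumption suggested by the worked example later in the deck; pass `center=False` to divide the raw age by the standard deviation instead. The ages are the handful shown in the data slide, plus one missing value.

```python
import statistics

def age_polynomial_terms(ages, center=True):
    """Return (s, s**2, s**3) per passenger, where s is the scaled age.

    Missing ages (None) are kept as None rather than imputed.
    """
    known = [a for a in ages if a is not None]
    sd = statistics.stdev(known)
    mean = statistics.mean(known) if center else 0.0
    rows = []
    for a in ages:
        if a is None:
            rows.append(None)
            continue
        s = (a - mean) / sd
        rows.append((s, s ** 2, s ** 3))
    return rows

ages = [29, 0.9167, 30, 25, 48, 63, 39, 53, None]
for age, terms in zip(ages, age_polynomial_terms(ages)):
    print(age, terms)
```

Scaling by the standard deviation keeps the second- and third-order terms on a comparable numeric scale, which makes the fitted coefficients easier to compare.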

How was it with age? The third-order age factor did not add significantly to the model. With the third-order polynomial added, the model correctly categorizes 79.4% of the passengers, compared with 79.1% before. ParChild is no longer a significant factor and can be omitted from the model.

Using the model to predict survival: Omitting the second- and third-order age factors and the ParChild factor. What is the probability that a 25-year-old woman, accompanied only by her husband and holding a second-class ticket, would survive the Titanic? z = -3.929 - 0.589*(-5)/14.41 + 1.718 + 2.552 + 0.926 = 1.4714, which gives a probability of 1/(1 + e^(-z)) ≈ 0.81.
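
The arithmetic can be checked with a short sketch. All numbers are taken from the expression for z on the slide; the variable names are assumptions, and the slide does not say which factor each of the last three terms belongs to, so they are kept as an unlabeled list.

```python
import math

# Terms of the linear predictor exactly as written on the slide.
intercept = -3.929
age_coef = -0.589
age_offset = -5    # slide's value; presumably age 25 minus a mean age near 30 (assumption)
age_sd = 14.41     # standard deviation of age used for scaling
other_terms = [1.718, 2.552, 0.926]  # remaining coefficients; factor labels not given

z = intercept + age_coef * age_offset / age_sd + sum(other_terms)

# The logistic function turns the linear predictor into a probability.
p = 1.0 / (1.0 + math.exp(-z))

print(round(z, 4), round(p, 3))
```

Since the probability is well above the 0.5 cutoff, the model would classify this passenger as a survivor.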

Analysing interaction of selected factors: pclass * sex, age * sex, pclass * Siblings/Parents. But the model does not converge…
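
As an illustration of what an interaction such as pclass * sex adds to the model, the sketch below builds the extra columns as products of indicator variables. Third class is the reference, as in the coding shown earlier; coding female as 1 and the function and column names are assumptions, not the SPSS setup from the slides.

```python
def interaction_columns(pclass, is_female):
    """Indicator and interaction columns for one passenger.

    pclass is 1, 2 or 3 (third class is the reference category);
    is_female is 1 for women and 0 for men (an assumed coding).
    """
    first = 1 if pclass == 1 else 0
    second = 1 if pclass == 2 else 0
    return {
        "first": first,
        "second": second,
        "female": is_female,
        # An interaction term is the product of the main-effect columns.
        "first*female": first * is_female,
        "second*female": second * is_female,
    }

print(interaction_columns(2, 1))
print(interaction_columns(3, 0))
```

Each interaction adds one coefficient per combination of non-reference categories, so the number of parameters grows quickly when several multi-level factors are crossed.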

Analysing interaction of selected factors: Collapsing the sibling/spouse number eradicated their mutual interaction.

Is it realistic that Leonardo survives and the chick dies?