Titanic and Decision Trees Supplement

Titanic Predictions and Decision Trees
Variable Selection Approaches
– Hypothesis Driven
– Data Driven
– Kitchen Sink
Algorithms
– Proportions / Pivot Tables
– Decision Trees
Model Evaluations

Predictor Variables
Many factors were brainstormed, several of which were beyond what is available in the data set.

Variable   Description           Type         Hyp Driven / Data Driven / Kitchen Sink
pclass     Passenger Class       Categorical  Yes
name       Name                  Text
Sex        Sex                   Categorical  Yes
age        Age                   Numeric      Yes
sibsp      Siblings/Spouses      Integer      Yes
parch      Parents/Children      Integer      Yes
ticket     Ticket Number         Text
fare       Passenger Fare        Numeric      Yes
cabin      Cabin                 Text
embarked   Port of Embarkation   Categorical  Yes

Analytic Toolbox
Diagram axes: Types of Analysis; Number of Variables Analyzed
Items shown: Pivot Tables, Predictive Modeling, Correlation Matrices, Regression, Factor Analysis, Histograms, Applied Stats, Cluster Analysis, Decision Trees

Predict who survived the Titanic Disaster
Kaggle Submission
Which variables have the highest correlation with survival?
Variables: pclass, name, Sex, age, sibsp, parch, embarked, fare
Variable selection: Hyp Driven, Data Driven, Kitchen Sink
Analyses: Pivot Tables, Correlation Matrices, Logistic Regression?, Factor Analysis?, Histograms, Cluster Analysis?, Decision Trees

Hypothesis Driven: Gender Only
Predict who survived the Titanic Disaster
Women and Children First
– Read dataset into Excel, R, etc.
– Analyze gender only
– We have two categorical variables, therefore a pivot table works well
– Kaggle Submission: 320 / 418 = 76.5% correct
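The gender-only pivot table can be sketched in Python as a two-way count of sex against survival. This is a minimal illustration using only the standard library. The eight rows are made-up placeholder records, not the Kaggle data, and the names `pivot_counts`, `survival_rate`, and `predict_by_gender` are hypothetical helpers introduced here.

```python
from collections import Counter

# Hypothetical placeholder rows (sex, survived); NOT the real Kaggle data.
rows = [
    ("female", 1), ("female", 1), ("female", 0), ("female", 1),
    ("male", 0), ("male", 0), ("male", 1), ("male", 0),
]

def pivot_counts(rows):
    """Count passengers for each (sex, survived) combination."""
    return Counter(rows)

def survival_rate(rows):
    """Return the proportion of survivors within each sex."""
    totals = Counter(sex for sex, _ in rows)
    survivors = Counter(sex for sex, survived in rows if survived == 1)
    return {sex: survivors[sex] / totals[sex] for sex in totals}

def predict_by_gender(sex, rates):
    """Predict survival (1) when that group's survival rate exceeds 50%."""
    return 1 if rates[sex] > 0.5 else 0

counts = pivot_counts(rows)
rates = survival_rate(rows)

# Share of rows where the gender-only rule matches the recorded outcome.
accuracy = sum(predict_by_gender(s, rates) == y for s, y in rows) / len(rows)

print(counts)
print(rates)
print(accuracy)
```

The accuracy line follows the same arithmetic as the Kaggle score on the slide: correct predictions divided by the number of passengers scored.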

Hypothesis Driven: Pclass Only
Predict who survived the Titanic Disaster
People on Lower Decks Less Likely to Survive
– Read dataset into Excel, R, etc.
– Pclass

Hypothesis Driven: Age
Predict who survived the Titanic Disaster
Women and Children First
– Read dataset into Excel, R, etc.
– Age has missing data

Hypothesis Driven: Age Only
– Survived: categorical variable
– Age: continuous variable
How do we visualize and analyze age vs. survived?
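One common way to compare a continuous variable such as age with a categorical outcome is to bin the ages and compute a survival proportion per bin, keeping missing ages in their own group. The sketch below assumes arbitrary bin edges (under 13, 13 to 59, 60 and over) chosen only for illustration. The rows are made-up placeholders, not the Kaggle data, and `age_group` and `survival_by_age_group` are hypothetical names.

```python
from collections import defaultdict

# Hypothetical placeholder rows (age, survived); None marks a missing age.
rows = [(4, 1), (8, 1), (30, 0), (45, 0), (22, 1), (None, 0), (70, 0), (None, 1)]

# Illustrative half-open bins [low, high); the edges are an assumption.
BINS = [
    ("child", 0, 13),
    ("adult", 13, 60),
    ("senior", 60, float("inf")),
]

def age_group(age):
    """Map an age to a bin label; missing or out-of-range ages go to 'missing'."""
    if age is None:
        return "missing"
    for label, low, high in BINS:
        if low <= age < high:
            return label
    return "missing"

def survival_by_age_group(rows):
    """Return the proportion of survivors in each age group."""
    tally = defaultdict(lambda: [0, 0])  # group -> [survivors, total]
    for age, survived in rows:
        group = age_group(age)
        tally[group][0] += survived
        tally[group][1] += 1
    return {group: s / n for group, (s, n) in tally.items()}

result = survival_by_age_group(rows)
print(result)
```

Once age is binned it behaves like any other categorical variable, so the same pivot-table reasoning used for gender applies. Reporting the "missing" group separately shows how many passengers the age analysis cannot cover.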

Hypothesis Driven: Univariate Summary

Model Summary

Variable
Sex              X
Age
Pclass
Name
sibsp
fare
parch
embarked

Analytics
Pivot Tables     X
Scatterplots
Decision Trees

Kaggle Score     76.5
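Decision trees are listed among the analytics but the slides do not show how a tree chooses its splits. As a rough sketch of the idea, a classification tree compares candidate variables by how much each one reduces an impurity measure. The example below uses Gini impurity, which is one common criterion; the slides do not say which criterion or software was used. The rows are made-up placeholders, not the Kaggle data, and `gini`, `split_impurity`, and `best_split` are hypothetical helpers.

```python
from collections import Counter

# Hypothetical placeholder rows; NOT the real Kaggle data.
rows = [
    {"sex": "female", "pclass": 1, "survived": 1},
    {"sex": "female", "pclass": 2, "survived": 1},
    {"sex": "female", "pclass": 3, "survived": 1},
    {"sex": "female", "pclass": 3, "survived": 1},
    {"sex": "male", "pclass": 1, "survived": 0},
    {"sex": "male", "pclass": 2, "survived": 0},
    {"sex": "male", "pclass": 3, "survived": 0},
    {"sex": "male", "pclass": 3, "survived": 1},
]

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(rows, feature, target="survived"):
    """Weighted Gini impurity after grouping rows by each value of `feature`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())

def best_split(rows, features):
    """Return the feature whose split leaves the lowest weighted impurity."""
    return min(features, key=lambda f: split_impurity(rows, f))

root = gini([row["survived"] for row in rows])
chosen = best_split(rows, ["sex", "pclass"])

print(root)
print(split_impurity(rows, "sex"), split_impurity(rows, "pclass"))
print(chosen)
```

This sketch makes one multiway split per categorical value and stops after the first level. Tree libraries typically use binary splits, handle numeric variables with thresholds, and recurse on each branch.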