Published by Gyles Lamb. Modified over 9 years ago.
1
[Chart: datasets placed by Number of Variables (A Few, Handful, Many) and Time Stamps (One Time Snapshot, Many Time Series). Datasets shown: Mobile Phone, Galton Height, Census, Titanic Survivors, Stock Market, Boston Housing, Old Faithful, Gene Array, Climate, MA Schools, Project Management.]
Our datasets have been relatively small, with no time variable.
2
Comparison of Datasets
Columns: Heights | Housing | Titanic
# of Variables: 8 | 15 | 11
# Variables Used: 4 | 10 | 11
# Numerical: 3 | 8 | 6
# Categorical: 1 | 2 | 5
# of Observations: 934 | 506 | 891
Univariate: Yes
Bivariate – Correlations: Yes
Missing Data: No | Yes - Age
Variable Transformations: Yes - Galton | Potentially | Yes - Age
Data Relationships: Linear | Non-Linear, Partitioned
Regression?: Yes | No
Factor Analysis Relevant?: No | Yes | No
Decision Trees Relevant?: No | Potentially | Yes
Cluster Analysis Relevant?: For 1 Variable | Yes | Potentially
Merge Analytic Models?: No | Potentially | Yes
We can apply similar basic stats to each of these datasets. Depending on the type of data relationships, the newer techniques may (or may not) be applicable.
This slide deck walks through the Heights dataset.
3
Research Question
What determines a person’s height?
4
Hypothesis Brainstorming
Genetics, Nutrition, Immigration / Origins, Disease
Hypotheses:
Sons will be similar to their Dad’s height
Daughters will be similar to their Mom’s height
5
Height Dataset Variables
heights <- read.csv("GaltonFamilies.csv")
Observations: 934
Variables: 8
6
Dataset Variables and Selection
We only need a subset of the data.
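Selecting a subset of columns might look like the sketch below. The slide does not list the column names or say which ones were kept, so the names here (`family`, `father`, `mother`, `childNum`, `gender`, `childHeight`) and the few rows of values are assumptions for illustration only, not the actual file contents.

```r
# Tiny placeholder data frame: column names are assumed and the values are
# illustrative, standing in for the result of read.csv("GaltonFamilies.csv").
heights <- data.frame(
  family      = c("001", "001", "002"),
  father      = c(78.5, 78.5, 75.5),
  mother      = c(67.0, 67.0, 66.5),
  childNum    = c(1, 2, 1),
  gender      = c("male", "female", "male"),
  childHeight = c(73.2, 69.2, 73.5)
)

# Keep only the columns needed for the analysis (an assumed selection).
keep <- c("father", "mother", "gender", "childNum", "childHeight")
heights.sub <- heights[, keep]

# Treat gender as a categorical variable.
heights.sub$gender <- factor(heights.sub$gender)

str(heights.sub)
```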
7
Histograms
Heights of Father, Mother, and Child appear approximately normally distributed.
8
Scatterplots
Child height is somewhat correlated with father and mother heights.
9
Correlation Matrices
With Categorical Variable (Gender):
library(car)
scatterplotMatrix(heights)
Only Numerical Variables:
library(PerformanceAnalytics)
chart.Correlation(heights.num)
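The numbers behind a correlation chart can also be computed with base R alone. The sketch below assumes a data frame shaped like the heights data; the values are simulated placeholders (not the Galton data), and the column names are assumptions. Only `heights.num` is a name taken from the slide.

```r
# Simulated stand-in data (NOT the real dataset); column names are assumed.
set.seed(1)
n <- 200
father <- rnorm(n, mean = 69, sd = 2.5)
mother <- rnorm(n, mean = 64, sd = 2.3)
gender <- factor(sample(c("female", "male"), n, replace = TRUE))
childHeight <- 16.5 + 0.39 * father + 0.29 * mother +
  5.21 * (gender == "male") + rnorm(n, sd = 2)
heights <- data.frame(father, mother, gender, childHeight)

# Keep only the numeric columns, as a correlation matrix requires.
heights.num <- heights[, sapply(heights, is.numeric)]

# Pairwise Pearson correlations, rounded for display.
cor.mat <- cor(heights.num)
round(cor.mat, 2)
```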
10
Children’s Height by Gender
Noticeable difference in heights between genders.
11
Categorical: Box Plot
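A box plot summarises a numerical variable within each level of a categorical one. The sketch below computes the per-gender means and the five-number summaries that a box plot would draw, without opening a graphics device. The data are simulated placeholders and the group means used to generate them (64 and 69 inches) are assumptions.

```r
# Simulated stand-in data (NOT the real dataset).
set.seed(1)
n <- 200
gender <- factor(sample(c("female", "male"), n, replace = TRUE))
childHeight <- ifelse(gender == "male", 69, 64) + rnorm(n, sd = 2.5)
heights <- data.frame(gender, childHeight)

# Mean child height within each gender.
group_means <- tapply(heights$childHeight, heights$gender, mean)
round(group_means, 1)

# Box plot statistics per group; plot = FALSE returns the numbers only.
# Rows of bp$stats: lower whisker, first quartile, median, third quartile,
# upper whisker. Columns follow bp$names.
bp <- boxplot(childHeight ~ gender, data = heights, plot = FALSE)
bp$names
round(bp$stats, 1)
```

Dropping `plot = FALSE` draws the box plot itself.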
12
Linear Regression Modeling
[Diagram: independent variables (the X’s: X1, X2, X3, X4) and the dependent variable Y, with a label "delta".]
13
Comparing Regression Models
Model: 1, 2, 3, 4, 5, 6, 7
Father: 0.39, 0.36, 0.39
Mom: 0.31, 0.29, 0.32
Gender: 5.13, 5.21, 5.20
ChildNum: -0.16, -0.04
Intercept: 40.1, 46.6, 64.1, 67.7, 22.6, 16.5, 17.4
R-squared: 0.07, 0.04, 0.51, 0.02, 0.10, 0.635, 0.636
With 4 variables there are 2^4 = 16 possible subsets of variables (15 with at least one variable). Fortunately there is an R library, LEAPS, that can help.
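A comparison like this can be produced by fitting each candidate model with `lm()` and collecting the R-squared values. The sketch below uses simulated placeholder data with assumed column names, so its numbers will not match the slide; it only illustrates the procedure.

```r
# Simulated stand-in data (NOT the real dataset); column names are assumed.
set.seed(1)
n <- 200
father <- rnorm(n, mean = 69, sd = 2.5)
mother <- rnorm(n, mean = 64, sd = 2.3)
gender <- factor(sample(c("female", "male"), n, replace = TRUE))
childHeight <- 16.5 + 0.39 * father + 0.29 * mother +
  5.21 * (gender == "male") + rnorm(n, sd = 2)
heights <- data.frame(father, mother, gender, childHeight)

# Fit several candidate models, from single predictors to a combined model.
fits <- list(
  father               = lm(childHeight ~ father, data = heights),
  mother               = lm(childHeight ~ mother, data = heights),
  gender               = lm(childHeight ~ gender, data = heights),
  father_mother        = lm(childHeight ~ father + mother, data = heights),
  father_mother_gender = lm(childHeight ~ father + mother + gender,
                            data = heights)
)

# Collect R-squared for each model into one named vector.
r2 <- sapply(fits, function(f) summary(f)$r.squared)
round(r2, 3)

# Coefficients of the largest model.
round(coef(fits$father_mother_gender), 2)
```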
14
LEAPS Package
Goes through different combinations of variables to find the best ones.
Finds the best combination of variables. Starts with 1 variable, then 2, and so on.
[Chart: R-Square by Variable. If highlighted, then in model. If not highlighted, then not in model.]
The three-variable model appears to be the best trade-off between explanation and simplicity.
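The idea of trying every combination of variables and keeping the best model of each size can be sketched in base R by brute force. This is a stand-in for illustration, not the LEAPS package's own interface or algorithm. The data are simulated placeholders with assumed column names, and `childNum` is generated with no effect on height.

```r
# Simulated stand-in data (NOT the real dataset); column names are assumed.
set.seed(1)
n <- 200
father <- rnorm(n, mean = 69, sd = 2.5)
mother <- rnorm(n, mean = 64, sd = 2.3)
gender <- factor(sample(c("female", "male"), n, replace = TRUE))
childNum <- sample(1:8, n, replace = TRUE)
childHeight <- 16.5 + 0.39 * father + 0.29 * mother +
  5.21 * (gender == "male") + rnorm(n, sd = 2)
heights <- data.frame(father, mother, gender, childNum, childHeight)

predictors <- c("father", "mother", "gender", "childNum")

# For each model size k, fit every k-variable model and keep the one
# with the highest R-squared.
results <- do.call(rbind, lapply(seq_along(predictors), function(k) {
  combos <- combn(predictors, k, simplify = FALSE)
  r2 <- sapply(combos, function(vars) {
    f <- reformulate(vars, response = "childHeight")
    summary(lm(f, data = heights))$r.squared
  })
  best <- which.max(r2)
  data.frame(size = k,
             variables = paste(combos[[best]], collapse = " + "),
             r.squared = r2[best])
}))

results

# Number of models fitted in total: 4 + 6 + 4 + 1.
n_models <- sum(choose(length(predictors), seq_along(predictors)))
n_models
```

Comparing how much R-squared rises from one row to the next shows where adding another variable stops paying off.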
15
Height Dataset Summary
What determines height?
[Diagram: Genetics, Nutrition, Gender, Heels, and Parent’s Height pointing to Child’s Height.]
Not able to get data for all our variables!
Gender has the biggest effect.
Parents’ heights also influence a child’s height.
Child Height = 16.5 + Father’s Height * 0.39 + Mother’s Height * 0.29, and if male then add 5.21 inches.
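The summary equation can be applied directly. The sketch below wraps the slide's coefficients in a small function; the function name and the example parent heights (70 and 64 inches) are hypothetical.

```r
# Predicted child height in inches, using the coefficients from the slide.
# `male` is TRUE for a son and FALSE for a daughter.
predict_child_height <- function(father, mother, male) {
  16.5 + 0.39 * father + 0.29 * mother + ifelse(male, 5.21, 0)
}

# Hypothetical parents: father 70 inches, mother 64 inches.
son      <- predict_child_height(70, 64, TRUE)   # 16.5 + 27.3 + 18.56 + 5.21 = 67.57
daughter <- predict_child_height(70, 64, FALSE)  # 16.5 + 27.3 + 18.56        = 62.36

c(son = son, daughter = daughter)
```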
16
Types of Analysis
[Chart: analysis techniques arranged by Number of Variables Analyzed (1, 2, 3, 4, 5, 6+). Techniques shown: Histograms, Pivot Tables, Correlation Matrices, Regression, Factor Analysis, Cluster Analysis, Decision Trees. Group labels: Applied Stats Class, Predictive Modeling Class, Additional Techniques.]
17
Factor Analysis on Height Dataset?
There is little correlation among the predictors.
In this case, FA would not be of any help.
18
For Decision Tree Analysis, what would be the first variable?
Linear or partitioned data?
19
Linear Models vs. Decision Trees
Height variable relationships appear linear.
Decision trees would not appear to help.
20
Decision Tree Output
This model appears less accurate than the regression model, as the output is in discrete values.
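Why discrete output costs accuracy can be seen with a hand-rolled one-split tree (a "stump") that predicts the mean height in each leaf. This is not the tree from the slide, whose tool and splits are not shown; the split on gender is an assumption, and the data are simulated placeholders.

```r
# Simulated stand-in data (NOT the real dataset); column names are assumed.
set.seed(1)
n <- 200
father <- rnorm(n, mean = 69, sd = 2.5)
mother <- rnorm(n, mean = 64, sd = 2.3)
gender <- factor(sample(c("female", "male"), n, replace = TRUE))
childHeight <- 16.5 + 0.39 * father + 0.29 * mother +
  5.21 * (gender == "male") + rnorm(n, sd = 2)
heights <- data.frame(father, mother, gender, childHeight)

# Stump: one split on gender (assumed), predicting the mean of each leaf.
leaf_means <- tapply(heights$childHeight, heights$gender, mean)
stump_pred <- as.numeric(leaf_means[as.character(heights$gender)])

# Linear regression on the same data gives a continuous prediction.
lm_pred <- fitted(lm(childHeight ~ father + mother + gender, data = heights))

# Root mean squared error on the training data.
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))
rmse_stump <- rmse(heights$childHeight, stump_pred)
rmse_lm <- rmse(heights$childHeight, lm_pred)

length(unique(stump_pred))  # one distinct prediction per leaf
length(unique(lm_pred))     # many distinct predictions
c(stump = rmse_stump, regression = rmse_lm)
```

A deeper tree adds more leaves and more distinct predicted values, but each prediction is still a leaf mean.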
21
Would Cluster Analysis Be Helpful?
22
Cluster Analysis on Three Variables?
Results are not surprising, though it is unclear how to leverage them.
23
Convert Continuous to Categorical
Continuous: Height
Categorical: S, M, L, XL
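One simple way to turn a continuous height into size categories is `cut()` with fixed break points. The break points below (63, 67, and 71 inches) are arbitrary assumptions, and the heights are simulated placeholders.

```r
# Simulated child heights in inches (NOT the real dataset).
set.seed(1)
childHeight <- rnorm(200, mean = 66.7, sd = 3.6)

# Assumed break points; each interval maps to one size label.
size <- cut(childHeight,
            breaks = c(-Inf, 63, 67, 71, Inf),
            labels = c("S", "M", "L", "XL"))

# Count of children per size category.
table(size)
```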
24
Cluster Analysis of Child Heights
5 Clusters: S, M, L, XL, XXL
[Chart: children’s heights for each of the cluster centers.]
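Letting the data choose the size boundaries can be sketched with `kmeans()` on the single height variable. The slide does not say which clustering method was used, so k-means is an assumption here, and the heights are simulated placeholders.

```r
# Simulated child heights in inches (NOT the real dataset).
set.seed(1)
childHeight <- rnorm(200, mean = 66.7, sd = 3.6)

# Five clusters on one variable; nstart repeats the random initialisation.
km <- kmeans(childHeight, centers = 5, nstart = 10)

# Cluster ids are arbitrary, so rank the centers from shortest to tallest
# and map each rank to a size label.
ord <- order(km$centers[, 1])
labels <- c("S", "M", "L", "XL", "XXL")
size <- factor(labels[match(km$cluster, ord)], levels = labels)

# Cluster centers (mean height of each cluster), shortest first.
centers <- sort(km$centers[, 1])
names(centers) <- labels
round(centers, 1)

# Number of children in each size.
table(size)
```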