A Few Handful Many Time Stamps One Time Snapshot Many Time Series Number of Variables Mobile Phone Galton Height Census Titanic Survivors Stock Market.


1 Chart of datasets. Axes: Number of Variables (A Few, Handful, Many) and Time Stamps (One Time Snapshot, Many Time Series). Datasets shown: Mobile Phone, Galton Height, Census, Titanic Survivors, Stock Market, Boston Housing, Old Faithful, Gene Array, Climate, MA Schools, Project Management. Our datasets have been relatively small, with no time variable.

2 Comparison of Datasets

                              Heights          Housing        Titanic
# of Variables                8                15             11
# Variables Used              4                10             11
# Numerical                   3                8              6
# Categorical                 1                2              5
# of Observations             934              506            891
Univariate                    Yes              Yes            Yes
Bivariate – Correlations      Yes              Yes            Yes
Missing Data                  No               No             Yes - Age
Variable Transformations      Yes - Galton     Potentially    Yes - Age
Data Relationships            Linear           Linear         Non-Linear, Partitioned
Regression?                   Yes              Yes            No
Factor Analysis Relevant?     No               Yes            No
Decision Trees Relevant?      No               Potentially    Yes
Cluster Analysis Relevant?    For 1 Variable   Yes            Potentially
Merge Analytic Models?        No               Potentially    Yes

We can apply similar basic stats to each of these datasets. Depending on the type of data relationships, the newer techniques may (or may not) be applicable. This slide deck walks through the Heights dataset.

3 Research Question What determines a person’s height?

4 Hypothesis Brainstorming. Factors: Genetics, Nutrition, Immigration / Origins, Disease. Hypotheses: Sons will be similar in height to their dads. Daughters will be similar in height to their moms.

5 Height Dataset Variables heights <- read.csv("GaltonFamilies.csv") Observations: 934 Variables: 8

6 Dataset Variables and Selection. We only need a subset of the data.
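The selection step might look like the sketch below. The column names (father, mother, gender, childNum, childHeight and the rest) are an assumption about the layout of GaltonFamilies.csv, and the three rows are made-up stand-ins so the code runs without the file.

```r
# Made-up stand-in rows; in the deck, `heights` comes from read.csv().
# The column names are an assumption about the CSV layout.
heights <- data.frame(
  family          = c("A", "A", "B"),
  father          = c(70.0, 70.0, 68.5),
  mother          = c(64.0, 64.0, 62.5),
  midparentHeight = c(69.6, 69.6, 68.0),
  children        = c(2, 2, 1),
  childNum        = c(1, 2, 1),
  gender          = c("male", "female", "male"),
  childHeight     = c(71.0, 65.5, 69.0)
)

# Keep only the columns used in the later slides.
keep <- c("father", "mother", "gender", "childNum", "childHeight")
heights.sub <- heights[, keep]

# Treat gender as categorical.
heights.sub$gender <- factor(heights.sub$gender)

# Numeric-only view, handy for correlation matrices.
heights.num <- heights.sub[sapply(heights.sub, is.numeric)]

str(heights.sub)
```

Keeping a separate numeric-only data frame makes it easy to pass the same data to functions that reject factors.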

7 Histograms. Heights of father, mother, and child appear approximately normally distributed.

8 Scatterplots. Child height is somewhat correlated with father's and mother's heights.

9 Correlation Matrices. With categorical variable (Gender): library(car) scatterplotMatrix(heights). Only numerical variables: library(PerformanceAnalytics) chart.Correlation(heights.num).
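For readers without the plotting packages installed, the numbers behind a correlation chart can be obtained with base R's cor(). This is only a sketch: the data below is simulated with arbitrary coefficients and is not the Galton data.

```r
# Simulated stand-in for heights.num; coefficients are arbitrary.
set.seed(1)
n <- 200
father <- rnorm(n, mean = 69, sd = 2.5)
mother <- rnorm(n, mean = 64, sd = 2.3)
childHeight <- 20 + 0.4 * father + 0.3 * mother + rnorm(n, sd = 2)
heights.num <- data.frame(father, mother, childHeight)

# Pairwise Pearson correlations, rounded for display.
cm <- round(cor(heights.num), 2)
print(cm)
```

The matrix is symmetric with ones on the diagonal, so only the values above or below the diagonal need to be read.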

10 Children's Height by Gender. There is a noticeable difference in heights between genders.

11 Categorical: Box Plot

12 Linear Regression Modeling. Independent variables (X's): X1, X2, X3, X4. Dependent variable: Y. Diagram label: delta.

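13 Comparing Regression Models (models 1 to 7)

Father coefficient:    0.39, 0.36, 0.39
Mom coefficient:       0.31, 0.29, 0.32
Gender coefficient:    5.13, 5.21, 5.20
ChildNum coefficient:  -0.16, -0.04
Intercept:             40.1, 46.6, 64.1, 67.7, 22.6, 16.5, 17.4
R-squared:             0.07, 0.04, 0.51, 0.02, 0.10, 0.635, 0.636

Each coefficient row lists only the models that include that variable. With 4 variables there are 2^4 - 1 = 15 possible combinations of predictors. Fortunately there is an R library, leaps, that can help.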
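To make the "every combination" idea concrete, here is a brute-force sketch in base R that fits one linear model per non-empty subset of four predictors (2^4 - 1 = 15 models) and ranks them by R-squared. It does not use the leaps package, and the data is simulated with arbitrary coefficients, so the printed values will not match the slide.

```r
# Simulated data; variable names mirror the slide, values are made up.
set.seed(42)
n <- 300
d <- data.frame(
  father   = rnorm(n, 69, 2.5),
  mother   = rnorm(n, 64, 2.3),
  gender   = factor(sample(c("female", "male"), n, replace = TRUE)),
  childNum = sample(1:8, n, replace = TRUE)
)
d$childHeight <- 16 + 0.4 * d$father + 0.3 * d$mother +
  5 * (d$gender == "male") + rnorm(n, sd = 2)

predictors <- c("father", "mother", "gender", "childNum")

# Every non-empty subset of the predictors, from 1 variable up to 4.
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit one model per subset and record its R-squared.
results <- do.call(rbind, lapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "childHeight"), data = d)
  data.frame(model     = paste(vars, collapse = " + "),
             n_vars    = length(vars),
             r_squared = summary(fit)$r.squared)
}))

# Best-fitting models first.
results[order(-results$r_squared), ]
```

R-squared never decreases when a variable is added, so the full model always ranks first; the useful comparison is how little is gained by each extra variable.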

14 LEAPS Package. Finds the best combination of variables: it goes through different combinations, starting with 1 variable, then 2, and so on. Chart: R-Square by Variable. If highlighted, the variable is in the model; if not highlighted, it is not in the model. The three-variable model appears to be the best trade-off between explanation and simplicity.

15 Height Dataset Summary. What determines height? We were not able to get data for all our variables! Gender has the biggest effect. Parents' heights also influence a child's height. Diagram labels: Genetics, Nutrition, Gender, Heels, Child's Height, Parent's Height. Child Height = 16.5 + Father's Height * 0.39 + Mother's Height * 0.29, and if male, add 5.21 inches.
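The summary equation can be written as a small R function. The function name and the example inputs (a 70-inch father and a 64-inch mother) are hypothetical; the coefficients are the ones stated on the slide.

```r
# Child height in inches from the slide's equation.
# `male` is TRUE for a son, FALSE for a daughter.
predict_child_height <- function(father, mother, male) {
  16.5 + 0.39 * father + 0.29 * mother + ifelse(male, 5.21, 0)
}

son      <- predict_child_height(70, 64, male = TRUE)   # 67.57
daughter <- predict_child_height(70, 64, male = FALSE)  # 62.36

print(c(son = son, daughter = daughter))
```

For the same parents, the predicted son is always 5.21 inches taller than the predicted daughter, which is the gender coefficient.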

16 Types of Analysis. Chart axis: Number of Variables Analyzed (1, 2, 3, 4, 5, 6+). Chart labels: Applied Stats Class, Predictive Modeling Class, Additional Techniques, Pivot Tables, Histograms, Correlation Matrices, Regression, Factor Analysis, Cluster Analysis, Decision Trees.

17 Factor Analysis on Height Dataset? There is little correlation among the predictors, so in this case FA would not be of any help.

18 For Decision Tree Analysis what would be the First Variable? Linear or Partitioned Data?

19 Linear Models vs. Decision Trees. The relationships among the height variables appear linear, so decision trees would not appear to help.

20 Decision Tree Output. This model appears less accurate than the regression model, as the output is in discrete values.
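The point about discrete output can be illustrated with a hand-built, single-split "tree" that predicts the mean height of each gender, compared with a linear regression on the same data. This is only a sketch on simulated data with arbitrary coefficients; it is not the tree shown on the slide.

```r
# Simulated data; values are made up for illustration.
set.seed(7)
n <- 300
d <- data.frame(
  father = rnorm(n, 69, 2.5),
  mother = rnorm(n, 64, 2.3),
  gender = factor(sample(c("female", "male"), n, replace = TRUE))
)
d$childHeight <- 16 + 0.4 * d$father + 0.3 * d$mother +
  5 * (d$gender == "male") + rnorm(n, sd = 2)

# Regression: a different prediction for every combination of inputs.
fit <- lm(childHeight ~ father + mother + gender, data = d)
lm_pred <- fitted(fit)

# One-split tree: every child gets the mean height of their gender.
stump_pred <- ave(d$childHeight, d$gender)

rmse <- function(pred) sqrt(mean((d$childHeight - pred)^2))

n_distinct_stump <- length(unique(stump_pred))
n_distinct_lm    <- length(unique(lm_pred))
rmse_stump <- rmse(stump_pred)
rmse_lm    <- rmse(lm_pred)

print(c(stump_values = n_distinct_stump, lm_values = n_distinct_lm))
print(c(stump_rmse = rmse_stump, lm_rmse = rmse_lm))
```

A tree can only output as many distinct values as it has leaves, so when the underlying relationship is linear it loses the within-leaf variation that regression captures.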

21 Would Cluster Analysis Be Helpful?

22 Cluster Analysis on Three Variables? The results are not surprising, though it is unclear how to leverage them.

23 Convert Continuous to Categorical. Continuous: Height. Categorical: S, M, L, XL.

24 Cluster Analysis of Child Heights. 5 clusters: S, M, L, XL, XXL. Chart: children's heights for each of the cluster centers.
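One way to produce five size categories from a single height variable is k-means, sketched below. The slides do not say which clustering algorithm was used, so k-means is an assumption, and the heights are simulated rather than taken from the Galton data.

```r
# Simulated child heights in inches; values are made up.
set.seed(123)
child <- c(rnorm(150, 64, 2.4), rnorm(150, 69, 2.6))

# Five clusters on one variable; nstart reruns from several random starts.
fit <- kmeans(child, centers = 5, nstart = 20)

# Cluster ids are arbitrary, so order the clusters by their center.
ord <- order(fit$centers[, 1])
centers <- fit$centers[ord, 1]

# Map each child to a size label according to the rank of its cluster.
size_labels <- c("S", "M", "L", "XL", "XXL")
size <- factor(size_labels[match(fit$cluster, ord)], levels = size_labels)

print(round(centers, 1))
print(table(size))
```

Sorting by center is what turns the unordered cluster ids into an ordered categorical variable.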

