Published by Daniella White. Modified over 9 years ago.
2
Statistics: Unlocking the Power of Data (Lock 5)
STAT 101, Dr. Kari Lock Morgan, 11/27/12
Multiple Regression
SECTION 10.3
Categorical variables
Variable selection
Confounding variables revisited
3
US States
We will build a model to predict the percentage of each state's two-party vote that went to Obama in the 2012 US presidential election, using the 50 states as cases.
Sample? Population?
This can help us understand how certain features of a state are associated with political beliefs.
4
Categorical Variables
For the regression equation to make any sense, each x value has to be a number.
How do we include categorical variables in a regression setting?
5
Categorical Variables
Take one categorical variable and replace it with several “dummy” variables.
A dummy variable is 1 if the case falls into the category represented by the dummy variable, and 0 otherwise.
Create one dummy variable for each category of the categorical variable.
6
Dummy Variables
State        Region     South  West  Northeast  Midwest
Alabama      South      1      0     0          0
Alaska       West       0      1     0          0
Arkansas     South      1      0     0          0
California   West       0      1     0          0
Colorado     West       0      1     0          0
Connecticut  Northeast  0      0     1          0
Delaware     Northeast  0      0     1          0
Florida      South      1      0     0          0
Georgia      South      1      0     0          0
Hawaii       West       0      1     0          0
…            …          …      …     …          …
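The coding in the table above can be sketched in a few lines of Python. This is only an illustration: the function name `dummy_code` is made up here, and the six regions are the first six states listed on the slide.

```python
def dummy_code(values, categories):
    """Return one row of 0/1 indicators per case, one column per category."""
    return [[1 if value == category else 0 for category in categories]
            for value in values]

# Column order follows the slide: South, West, Northeast, Midwest.
categories = ["South", "West", "Northeast", "Midwest"]

# Regions of Alabama, Alaska, Arkansas, California, Colorado, Connecticut.
regions = ["South", "West", "South", "West", "West", "Northeast"]

rows = dummy_code(regions, categories)
for region, row in zip(regions, rows):
    print(region, row)
```

Each row contains exactly one 1, because every state belongs to exactly one region.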
7
Dummy Variables
When using dummy variables, one has to be left out of the model.
The dummy variable left out is called the reference level.
When using region of the country (Northeast, South, Midwest, West) to predict % Obama vote, how many dummy variables will be included?
a) One b) Two c) Three d) Four
There are four categories, but one is left out as the reference level, so three dummy variables are included.
8
Dummy Variables
Predicting % vote for Obama with one categorical variable: region of the country.
If “Midwest” is the reference level: [regression output not preserved in this transcript]
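With a single categorical explanatory variable, the least-squares intercept is the mean response in the reference level, and each dummy coefficient is that category's mean minus the reference mean. A minimal sketch follows; the function name and the vote proportions are invented for illustration and are not the 2012 data.

```python
from statistics import mean

def dummy_coefficients(responses_by_group, reference):
    """Intercept and dummy coefficients for a one-categorical-variable model."""
    reference_mean = mean(responses_by_group[reference])
    coefficients = {
        group: mean(values) - reference_mean
        for group, values in responses_by_group.items()
        if group != reference
    }
    return reference_mean, coefficients

# Hypothetical toy proportions, NOT the real state results.
toy = {
    "Midwest": [0.45, 0.50, 0.52],
    "Northeast": [0.58, 0.62],
    "South": [0.40, 0.44, 0.48],
    "West": [0.38, 0.55, 0.60],
}

intercept, coefficients = dummy_coefficients(toy, reference="Midwest")
print(round(intercept, 4))
for group, value in coefficients.items():
    print(group, round(value, 4))
```

A positive coefficient means that region's mean is above the reference mean, and a negative coefficient means it is below.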
9
Voting by Region
Based on the output above, which region had the highest percent vote for Obama?
a) Midwest b) Northeast c) South d) West
10
Voting by Region
What is the predicted % Obama vote for a state in the Northeast?
a) 13% b) 47% c) 55% d) 60%
0.4697 + 0.1309(1) + 0 + 0 = 0.6006 ≈ 0.60
11
Voting by Region
What is the predicted % Obama vote for a state in the Midwest?
a) 50% b) 47% c) 0% d) 45%
0.4697 + 0 + 0 + 0 = 0.4697 ≈ 0.47
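The two predictions above are plain arithmetic on the fitted equation. The sketch below uses only the two numbers quoted in the transcript (intercept 0.4697 and Northeast coefficient 0.1309). The South and West coefficients do not survive in the transcript, so the hypothetical `predict` helper refuses those regions rather than guessing.

```python
INTERCEPT = 0.4697                   # reference level (Midwest), from the slide
COEFFICIENTS = {"Northeast": 0.1309}  # the only dummy coefficient quoted

def predict(region):
    """Predicted proportion voting for Obama for a state in `region`."""
    if region == "Midwest":
        # All dummy variables are 0 for the reference level.
        return INTERCEPT
    if region not in COEFFICIENTS:
        raise KeyError(f"coefficient for {region!r} is not given in the transcript")
    # Exactly one dummy variable equals 1; the others contribute 0.
    return INTERCEPT + COEFFICIENTS[region]

print(round(predict("Northeast"), 4))
print(round(predict("Midwest"), 4))
```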
12
Categorical Variables
The p-value for each dummy variable tests for a significant difference between that category and the reference level.
For an overall p-value for the significance of a categorical variable with multiple categories, use
a) z-test b) t-test c) chi-square test d) ANOVA
The response is quantitative and the explanatory variable is categorical with multiple categories, so use ANOVA.
13
Categorical Variables
ANOVA for Regression: [output not preserved in this transcript]
ANOVA for Difference in Means: [output not preserved in this transcript]
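When the model contains only the dummy variables for one categorical variable, the regression ANOVA and the difference-in-means ANOVA give the same F-statistic. As a rough sketch of that statistic, here is the standard one-way ANOVA computation. The function name and the three small groups are invented for illustration.

```python
from statistics import mean

def anova_f(groups):
    """One-way ANOVA F-statistic: (SSG / (k - 1)) / (SSE / (n - k))."""
    all_values = [y for values in groups.values() for y in values]
    grand_mean = mean(all_values)
    n, k = len(all_values), len(groups)

    ss_groups = 0.0  # variability between group means
    ss_error = 0.0   # variability within groups
    for values in groups.values():
        group_mean = mean(values)
        ss_groups += len(values) * (group_mean - grand_mean) ** 2
        ss_error += sum((y - group_mean) ** 2 for y in values)

    return (ss_groups / (k - 1)) / (ss_error / (n - k))

# Hypothetical toy data with three groups of three cases each.
toy_groups = {"A": [1, 2, 3], "B": [2, 3, 4], "C": [6, 7, 8]}
f_statistic = anova_f(toy_groups)
print(f_statistic)
```

The p-value would come from an F distribution with k − 1 and n − k degrees of freedom, which this sketch does not compute.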
14
Categorical Variables in R
R automatically creates dummy variables for you if you include a categorical explanatory variable.
The first level alphabetically is usually the reference level.
If you want to change the reference level, see me.
15
Categorical Variables
Either all of the dummy variables associated with a categorical variable have to be included in the model, or none of them.
RegionS and RegionW are not significant, but leaving them out would clump the South and the West with the reference level, Midwest, which does not make sense.
16
Regression Model [regression output not preserved in this transcript]
17
West Region
With only region as an explanatory variable, interpret the positive coefficient of RegionW.
In this data set, states in the West voted more for Obama than states in the Midwest.
With all the other explanatory variables included, interpret the negative coefficient of RegionW.
States in the West voted less for Obama than would be expected based on the other variables in the model, as compared to states in the Midwest.
18
Smoking
Given all the other variables in the model, states with a higher percentage of smokers are more likely to vote
(a) Republican (b) Democratic (c) Impossible to tell
The sign of the smoking coefficient answers this: a positive coefficient means the predicted % Obama vote is higher in states with a higher percentage of smokers, and a negative coefficient means it is lower, given the other variables in the model.
19
Smoking
The correlation between the percent of people smoking in a state and the percent of people voting for Obama in 2012 was
(a) Positive (b) Negative (c) Impossible to tell
The regression coefficient only tells you the relationship given the other variables in the model, so the sign of the correlation cannot be read from it.
20
Smokers
If smoking were banned in a state, the percentage of smokers would most likely decrease. In that case, the percentage voting Democratic would…
(a) increase (b) decrease (c) impossible to tell
We cannot make conclusions about causality from observational data.
21
Causation
A significant explanatory variable in a regression model indicates association, but not necessarily causation.
CAUSALITY CAN ONLY BE INFERRED FROM A RANDOMIZED EXPERIMENT!!!!
22
Goal of the Model?
If the goal of the model is to see which variables are associated with a state's voting patterns, and how, given all the other variables in the model, then we are done.
If the goal is to predict the % of the vote that will go to the Democrat, say in the 2016 election, we want to prune out insignificant variables to improve the model.
23
Variable Selection
The p-value for an explanatory variable can be taken as a rough measure of how helpful that explanatory variable is to the model.
Insignificant variables may be pruned from the model, as long as adjusted R² doesn't go down too much.
You can also look at relationships between explanatory variables; if two are strongly associated, perhaps both are not necessary.
24
Variable Selection
(Some) ways of deciding whether a variable should be included in the model or not:
1. Does it improve adjusted R²?
2. Does it have a low p-value?
3. Is it associated with the response by itself?
4. Is it strongly associated with another explanatory variable? (If yes, then including both may be redundant.)
5. Does common sense say it should contribute to the model?
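The first criterion relies on adjusted R², which penalizes R² for the number of explanatory variables. Below is a minimal sketch of the standard formula, 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the number of cases and k the number of explanatory variables. The function name and the example numbers are illustrative only.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for a model with k explanatory variables and n cases."""
    if n - k - 1 <= 0:
        raise ValueError("need more cases than explanatory variables plus one")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical numbers: the same R-squared with one extra (useless) variable.
with_four = adjusted_r2(0.75, n=21, k=4)
with_five = adjusted_r2(0.75, n=21, k=5)
print(with_four, with_five)
```

If adding a variable leaves R² unchanged, adjusted R² goes down, which is why it can be used to decide whether a variable earns its place in the model.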
25
Full Model (highest p-value highlighted) [regression output not preserved]
26
Pruned Model 1 (highest p-value highlighted) [regression output not preserved]
27
Pruned Model 2 (highest p-value highlighted) [regression output not preserved]
28
Pruned Model 3 (highest p-value highlighted) [regression output not preserved]
29
Pruned Model 4 (highest p-value highlighted) [regression output not preserved]
30
Pruned Model 5 (highest p-value highlighted) [regression output not preserved]
31
Pruned Model 6 [regression output not preserved]
32
Pruned Model 5 [regression output not preserved]
33
Pruned Model 7 [regression output not preserved]
34
Pruned Model 5: FINAL STEPWISE MODEL
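The pruning sequence on the preceding slides can be sketched as a loop: drop the variable with the highest p-value, refit, and stop once adjusted R² would go down. This is a simplified sketch, not the exact procedure used in class. The `fit` argument is a hypothetical callable that returns the adjusted R² and p-values for a given list of variables, and `toy_fit` just looks up invented numbers. A categorical variable should be treated as one unit, so that all of its dummy variables stay or go together.

```python
def backward_eliminate(variables, fit, tolerance=0.0):
    """Drop the highest-p-value variable while adjusted R-squared holds up.

    fit(variables) must return (adjusted_r2, {variable: p_value}).
    """
    current = list(variables)
    adj_r2, p_values = fit(current)
    while len(current) > 1:
        worst = max(current, key=lambda v: p_values[v])
        candidate = [v for v in current if v != worst]
        candidate_adj_r2, candidate_p_values = fit(candidate)
        if candidate_adj_r2 < adj_r2 - tolerance:
            break  # pruning hurt the model, so keep the current one
        current, adj_r2, p_values = candidate, candidate_adj_r2, candidate_p_values
    return current, adj_r2

def toy_fit(variables):
    """Stand-in for a real regression fit; all numbers are made up."""
    results = {
        ("a", "b", "c"): (0.70, {"a": 0.01, "b": 0.60, "c": 0.04}),
        ("a", "c"): (0.72, {"a": 0.01, "c": 0.03}),
        ("a",): (0.55, {"a": 0.02}),
    }
    return results[tuple(variables)]

selected, selected_adj_r2 = backward_eliminate(["a", "b", "c"], toy_fit)
print(selected, selected_adj_r2)
```

The slides also try removing other variables from Pruned Model 5 before settling on it; this sketch only ever tries the variable with the highest p-value.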
35
Full Model [regression output not preserved]
36
Variable Selection
There is no one “best” model.
Choosing a model is just as much an art as a science.
Adjusted R² is just one possible criterion.
To learn much more about choosing the best model, take STAT 210.
37
Electricity and Life Expectancy
Cases: countries of the world
Response variable: life expectancy
Explanatory variable: electricity use (kWh per capita)
Is a country's electricity use helpful in predicting life expectancy?
38
Electricity and Life Expectancy [plot not preserved]
39
Electricity and Life Expectancy (outlier: Iceland) [plot not preserved]
40
Electricity and Life Expectancy [plot not preserved]
41
Electricity and Life Expectancy [plot not preserved]
42
Electricity and Life Expectancy
Is this a good model for predicting life expectancy based on electricity use?
(a) Yes (b) No
The association is definitely not linear.
43
Electricity and Life Expectancy
Is a country's electricity use helpful in predicting life expectancy?
(a) Yes (b) No
The p-value for electricity is small, so electricity use is a significant predictor.
44
Electricity and Life Expectancy
If we increased electricity use in a country, would life expectancy increase?
(a) Yes (b) No (c) Impossible to tell
We cannot make any conclusions about causality, because these are observational data.
45
Project 2
Part 3: Two Variable Comparisons
Could have been done entirely before Exam 2
Relationships between explanatory variables
Can ignore beauty variables besides bty_avg
Only need to report significant relationships
This is a great part of the project to divvy up
Poster
Regular poster board; you can get it at the Duke bookstore
Include all components of the project, but it is up to you what to display from each and how to display it
46
To Do
Read 10.3
Do Homework 8 (due Thursday, 11/29)
Do Project 2 (poster due Monday, 12/3, paper due 12/6)