Statistics: Unlocking the Power of Data Lock 5 STAT 101 Dr. Kari Lock Morgan 11/27/12 Multiple Regression SECTION 10.3 Categorical variables Variable.

Presentation transcript:

Statistics: Unlocking the Power of Data Lock 5
STAT 101, Dr. Kari Lock Morgan, 11/27/12
Multiple Regression, Section 10.3: categorical variables, variable selection, confounding variables revisited

US States
We will build a model to predict the percentage of each state's two-party vote that went to Obama in the 2012 US presidential election, using the 50 states as cases.
Sample? Population?
This can help us understand how certain features of a state are associated with political beliefs.

Categorical Variables
For a regression model to make any sense, each x value has to be a number.
How do we include categorical variables in a regression setting?

Categorical Variables
Take one categorical variable and replace it with several “dummy” variables.
A dummy variable is 1 if the case falls into the category it represents, and 0 otherwise.
Create one dummy variable for each category of the categorical variable.

Dummy Variables
State        Region     South  West  Northeast  Midwest
Alabama      South      1      0     0          0
Alaska       West       0      1     0          0
Arkansas     South      1      0     0          0
California   West       0      1     0          0
Colorado     West       0      1     0          0
Connecticut  Northeast  0      0     1          0
Delaware     Northeast  0      0     1          0
Florida      South      1      0     0          0
Georgia      South      1      0     0          0
Hawaii       West       0      1     0          0
…            …          …      …     …          …

Dummy Variables
When using dummy variables, one has to be left out of the model.
The category whose dummy variable is left out is called the reference level.
When using region of the country (Northeast, South, Midwest, West) to predict % Obama vote, how many dummy variables will be included?
(a) One (b) Two (c) Three (d) Four
There are four categories, but one is left out as the reference level.
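The dummy coding described above can be sketched roughly as follows. This is an illustration in Python rather than R, and the helper name `make_dummies` is made up for this example. The four region names and the first few state regions come from the slides. Defaulting the reference to the first level alphabetically is an assumption that only mimics the behavior the lecture attributes to R.

```python
def make_dummies(values, levels, reference=None):
    """Return (kept_levels, rows): one 0/1 column per level except the reference."""
    ordered = sorted(levels)
    if reference is None:
        # Assumption: use the first level alphabetically as the reference level.
        reference = ordered[0]
    kept = [lvl for lvl in ordered if lvl != reference]
    # Each row has a 1 in the column for its own category and 0 elsewhere;
    # a case in the reference category gets all zeros.
    rows = [[1 if v == lvl else 0 for lvl in kept] for v in values]
    return kept, rows


levels = ["Northeast", "South", "Midwest", "West"]
# Regions of the first six states in the slide's table
# (Alabama, Alaska, Arkansas, California, Colorado, Connecticut).
regions = ["South", "West", "South", "West", "West", "Northeast"]

kept, rows = make_dummies(regions, levels)
print("dummy columns:", kept)
for region, row in zip(regions, rows):
    print(region, row)
```

With four categories the sketch produces three dummy columns, and passing `reference="West"` would instead drop the West column.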

Dummy Variables
Predicting % vote for Obama with one categorical variable: region of the country.
If “Midwest” is the reference level:

Voting by Region
Based on the output above, which region had the highest percent vote for Obama?
(a) Midwest (b) Northeast (c) South (d) West

Voting by Region
What is the predicted % Obama vote for a state in the Northeast?
(a) 13% (b) 47% (c) 55% (d) 60%
0.47 + 0.13(1) = 0.60

Voting by Region
What is the predicted % Obama vote for a state in the Midwest?
(a) 50% (b) 47% (c) 0% (d) 45%
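A small sketch may help show why these predictions work the way they do. With a single categorical explanatory variable, the least squares intercept is the mean of the reference group, and each dummy coefficient is that group's mean minus the reference mean. The vote fractions below are made up for illustration and are not the lecture's data. The orthogonality check at the end verifies that these coefficients satisfy the least squares conditions.

```python
from statistics import mean

# Hypothetical (region, fraction voting for the candidate) pairs; values are invented.
data = [
    ("Midwest", 0.45), ("Midwest", 0.49),
    ("Northeast", 0.58), ("Northeast", 0.62),
    ("South", 0.40), ("South", 0.44),
    ("West", 0.50), ("West", 0.54),
]
reference = "Midwest"
levels = sorted({region for region, _ in data})
group_mean = {lvl: mean(y for r, y in data if r == lvl) for lvl in levels}

# Intercept = mean of the reference group; each coefficient = difference from it.
intercept = group_mean[reference]
coef = {lvl: group_mean[lvl] - intercept for lvl in levels if lvl != reference}


def predict(region):
    """Intercept plus the coefficient of whichever dummy equals 1 (none for the reference)."""
    return intercept + sum(b * (1 if region == lvl else 0) for lvl, b in coef.items())


# Least squares check: residuals sum to zero overall (intercept column)
# and within each non-reference group (each dummy column).
residuals = [(r, y - predict(r)) for r, y in data]
overall = sum(e for _, e in residuals)
per_dummy = {lvl: sum(e for r, e in residuals if r == lvl) for lvl in coef}

print("intercept:", round(intercept, 2))
print("coefficients:", {lvl: round(b, 2) for lvl, b in coef.items()})
print("Northeast prediction:", round(predict("Northeast"), 2))
```

A state in the reference region has every dummy equal to 0, so its prediction is just the intercept.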

Categorical Variables
The p-value for each dummy variable tests for a significant difference between that category and the reference level.
For an overall p-value for the significance of a categorical variable with multiple categories, use
(a) z-test (b) t-test (c) chi-square test (d) ANOVA
Quantitative response, categorical explanatory variable with multiple categories.

Categorical Variables
ANOVA for Regression:
ANOVA for Difference in Means:
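The two ANOVA tables agree because a regression on dummy variables has the group means as its fitted values, so the model sum of squares equals the between-groups sum of squares. The sketch below computes the one-way ANOVA F-statistic from made-up vote fractions (not the lecture's data). Turning F into a p-value would need an F distribution, which is left out here.

```python
from statistics import mean

# Hypothetical vote fractions by region; values are invented for illustration.
groups = {
    "Midwest": [0.45, 0.49],
    "Northeast": [0.58, 0.62],
    "South": [0.40, 0.44],
    "West": [0.50, 0.54],
}

all_y = [y for ys in groups.values() for y in ys]
n, k = len(all_y), len(groups)
grand_mean = mean(all_y)

# Between-groups (model) sum of squares: group means versus the grand mean.
ssg = sum(len(ys) * (mean(ys) - grand_mean) ** 2 for ys in groups.values())
# Within-groups (error) sum of squares: observations versus their own group mean.
sse = sum((y - mean(ys)) ** 2 for ys in groups.values() for y in ys)
# Total sum of squares, which should equal ssg + sse.
sst = sum((y - grand_mean) ** 2 for y in all_y)

# k - 1 model degrees of freedom (the number of dummies), n - k error degrees of freedom.
f_stat = (ssg / (k - 1)) / (sse / (n - k))

print("SSG:", round(ssg, 5), "SSE:", round(sse, 5), "SST:", round(sst, 5))
print("F:", round(f_stat, 3))
```

The model degrees of freedom, k - 1, match the number of dummy variables included once the reference level is left out.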

Categorical Variables in R
R automatically creates dummy variables for you if you include a categorical explanatory variable.
The first level alphabetically is usually the reference level.
If you want to change the reference level, see me.

Categorical Variables
Either all of the dummy variables associated with a categorical variable have to be included in the model, or none of them.
RegionS and RegionW are not significant, but leaving them out would lump the South and the West in with the reference level, Midwest, which does not make sense.

Regression Model

West Region
With only region as an explanatory variable, interpret the positive coefficient of RegionW.
In this data set, states in the West voted more for Obama than states in the Midwest.
With all the other explanatory variables included, interpret the negative coefficient of RegionW.
States in the West voted less for Obama than would be expected based on the other variables in the model, as compared to states in the Midwest.

Smoking
Given all the other variables in the model, states with a higher percentage of smokers are more likely to vote
(a) Republican (b) Democratic (c) Impossible to tell
The coefficient is negative, so the % Obama vote is lower in states with a higher percentage of smokers.

Smoking
The correlation between the percent of people smoking in a state and the percent of people voting for Obama in 2012 was
(a) Positive (b) Negative (c) Impossible to tell
The regression coefficient only tells you the relationship given the other variables in the model.
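To see why the sign of a coefficient in a multiple regression says nothing certain about the sign of the simple correlation, consider the toy example below. The data are entirely made up: `x1` and `x2` are generic predictors, not the lecture's variables, and `y` is built so that its coefficient on `x1` is exactly -1 once `x2` is in the model. The two-predictor least squares fit is solved directly from the normal equations on centered data.

```python
def centered_sum(u, v):
    """Sum of products of deviations from the means (n times the covariance)."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


# Invented data: x1 tracks x2 closely, and y = 3*x2 - 1*x1 exactly.
x2 = [0, 0, 1, 1, 2, 2, 3, 3]
x1 = [-0.5, 0.5, 0.5, 1.5, 1.5, 2.5, 2.5, 3.5]
y = [3 * b - a for a, b in zip(x1, x2)]

s11, s22, s12 = centered_sum(x1, x1), centered_sum(x2, x2), centered_sum(x1, x2)
s1y, s2y = centered_sum(x1, y), centered_sum(x2, y)

# Simple regression of y on x1 alone: slope has the same sign as the correlation.
simple_slope = s1y / s11

# Multiple regression of y on x1 and x2: solve the 2x2 normal equations.
det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det

print("slope of y on x1 alone:", simple_slope)
print("coefficient of x1 given x2:", b1)
print("coefficient of x2 given x1:", b2)
```

Here `x1` is positively associated with `y` on its own but has a negative coefficient once `x2` is accounted for, which is the same pattern as the RegionW coefficient changing sign when other variables are added.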

Smokers
If smoking were banned in a state, the percentage of smokers would most likely decrease. In that case, the percentage voting Democratic would…
(a) increase (b) decrease (c) impossible to tell
We cannot make conclusions about causality from observational data.

Causation
A significant explanatory variable in a regression model indicates association, but not necessarily causation.
CAUSALITY CAN ONLY BE INFERRED FROM A RANDOMIZED EXPERIMENT!

Goal of the Model?
If the goal of the model is to see whether and how each variable is associated with a state’s voting patterns, given all the other variables in the model, then we are done.
If the goal is to predict the % of the vote that will be for the Democrat, say in the 2016 election, we want to prune out insignificant variables to improve the model.

Variable Selection
The p-value for an explanatory variable can be taken as a rough measure of how helpful that explanatory variable is to the model.
Insignificant variables may be pruned from the model, as long as adjusted R² doesn’t go down too much.
You can also look at relationships between explanatory variables; if two are strongly associated, perhaps both are not necessary.

Variable Selection
(Some) ways of deciding whether a variable should be included in the model or not:
1. Does it improve adjusted R²?
2. Does it have a low p-value?
3. Is it associated with the response by itself?
4. Is it strongly associated with another explanatory variable? (If yes, then including both may be redundant.)
5. Does common sense say it should contribute to the model?
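As a brief illustration of the first criterion, adjusted R² penalizes R² for the number of explanatory variables using the standard formula 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of cases and k the number of explanatory variables. The R² values below are hypothetical and chosen only to show that a smaller model can have a higher adjusted R² even though its plain R² is slightly lower.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a model with k explanatory variables fit to n cases."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


n = 50  # e.g. the 50 states as cases

# Hypothetical fits: a larger model and a pruned model with slightly lower R^2.
full = adjusted_r2(0.80, n, k=10)
pruned = adjusted_r2(0.79, n, k=5)

print("full model adjusted R^2:  ", round(full, 4))
print("pruned model adjusted R^2:", round(pruned, 4))
```

Plain R² can never decrease when a variable is added, which is why the adjusted version is the one used for comparing models of different sizes.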

Full Model
Highest p-value

Pruned Model 1
Highest p-value

Pruned Model 2
Highest p-value

Pruned Model 3
Highest p-value

Pruned Model 4
Highest p-value

Pruned Model 5
Highest p-value

Pruned Model 6

Pruned Model 5

Pruned Model 7

Pruned Model 5
FINAL STEPWISE MODEL

Full Model

Variable Selection
There is no one “best” model.
Choosing a model is just as much an art as a science.
Adjusted R² is just one possible criterion.
To learn much more about choosing the best model, take STAT 210.
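The pruning process walked through above can be sketched as a simple backward elimination loop. Note the swap: the lecture drops the variable with the highest p-value at each step, while this sketch uses adjusted R² alone as the criterion, which avoids needing a t-distribution. The data, the variable names `x1` to `x3`, and the helper names are all made up for illustration, and the least squares fit is done by solving the normal equations by hand.

```python
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def ols_r2(columns, y):
    """R^2 from a least squares fit of y on an intercept plus the given columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented normal equations [X'X | X'y].
    M = [[sum(X[i][r] * X[i][s] for i in range(n)) for s in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gaussian elimination with partial pivoting.
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, p):
            f = M[r][col] / M[col][col]
            for s in range(col, p + 1):
                M[r][s] -= f * M[col][s]
    # Back substitution for the coefficients.
    b = [0.0] * p
    for r in reversed(range(p)):
        b[r] = (M[r][p] - sum(M[r][s] * b[s] for s in range(r + 1, p))) / M[r][r]
    fitted = [sum(X[i][j] * b[j] for j in range(p)) for i in range(n)]
    ybar = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - ybar) ** 2 for yi in y)
    return 1 - sse / sst


def backward_eliminate(predictors, y):
    """Drop one variable at a time for as long as adjusted R^2 improves."""
    kept = dict(predictors)
    n = len(y)
    best = adjusted_r2(ols_r2(list(kept.values()), y), n, len(kept))
    while len(kept) > 1:
        # Score every model that has one fewer variable.
        trials = {}
        for name in kept:
            cols = [c for nm, c in kept.items() if nm != name]
            trials[name] = adjusted_r2(ols_r2(cols, y), n, len(cols))
        drop = max(trials, key=trials.get)
        if trials[drop] <= best:
            break  # no pruned model is better, so keep the current one
        best = trials[drop]
        del kept[drop]
    return list(kept), best


# Invented data: y depends on x1 only; x2 and x3 are unrelated columns.
predictors = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "x2": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3],
    "x3": [2, 7, 1, 8, 2, 8, 1, 8, 2, 8],
}
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.0]
y = [2 * a + e for a, e in zip(predictors["x1"], noise)]

full_adj = adjusted_r2(ols_r2(list(predictors.values()), y), len(y), len(predictors))
kept, final_adj = backward_eliminate(predictors, y)
print("kept:", kept)
print("adjusted R^2, full vs final:", round(full_adj, 5), round(final_adj, 5))
```

As the slides stress, a loop like this gives one reasonable model rather than the single best one, and a different criterion could stop at a different set of variables.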

Electricity and Life Expectancy
Cases: countries of the world
Response variable: life expectancy
Explanatory variable: electricity use (kWh per capita)
Is a country’s electricity use helpful in predicting life expectancy?

Electricity and Life Expectancy
Outlier: Iceland

Electricity and Life Expectancy
Is this a good model for predicting life expectancy based on electricity use?
(a) Yes (b) No
The association is definitely not linear.

Electricity and Life Expectancy
Is a country’s electricity use helpful in predicting life expectancy?
(a) Yes (b) No
The p-value for electricity is small, so electricity use is a significant predictor.

Electricity and Life Expectancy
If we increased electricity use in a country, would life expectancy increase?
(a) Yes (b) No (c) Impossible to tell
We cannot make any conclusions about causality, because this is observational data.

Project 2
Part 3: Two Variable Comparisons
- Could have been done entirely before Exam 2
Relationships between explanatory variables
- Can ignore beauty variables besides bty_avg
- Only need to report significant relationships
- This is a great part of the project to divvy up
Poster
- Regular poster board; can get it at the Duke bookstore
- Include all components of the project, but it is up to you what to display from each and how to display it

To Do
- Read 10.3
- Do Homework 8 (due Thursday, 11/29)
- Do Project 2 (poster due Monday, 12/3; paper due 12/6)