
Summary

ANOVA recap: What is it? How does it work? What is an F-ratio? What is a grand mean? What are the degrees of freedom for the F-ratio? They are k − 1 (between groups) and N − k (within groups), for k groups and N observations in total.

Post hoc tests The F-test in ANOVA is a so-called omnibus test: it tests the means globally and says nothing about which particular means differ. To find that out, use post hoc tests (multiple comparison tests), such as Tukey's Honestly Significant Differences: > TukeyHSD(fit) # where fit comes from aov()
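A minimal sketch of the whole sequence in R, with made-up prices for three hypothetical beer brands:

beer_brands <- data.frame(
  Price = c(4.2, 4.5, 4.1, 5.0, 5.3, 5.1, 3.8, 3.9, 4.0),  # hypothetical data
  Brand = factor(rep(c("A", "B", "C"), each = 3))
)
fit <- aov(Price ~ Brand, data = beer_brands)  # omnibus F-test: is there any difference at all?
summary(fit)
TukeyHSD(fit)                                  # which pairs of brands actually differ?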

New stuff

ANOVA assumptions:
normality – all samples come from normal distributions
homogeneity of variance (homoscedasticity) – the variances are equal
independence of observations – the results found in one sample don't affect the others
The independence assumption matters most; otherwise ANOVA is relatively robust. We can sometimes violate normality (with a large sample size) and variance homogeneity (with equal sample sizes, as long as the ratio of any two variances does not exceed four). The nonparametric equivalent is the Kruskal–Wallis test.
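A sketch of checking these assumptions in R, reusing the hypothetical beer_brands data and the aov fit from above (shapiro.test and bartlett.test are standard base-R checks, not the only options):

shapiro.test(residuals(fit))                      # normality of the residuals
bartlett.test(Price ~ Brand, data = beer_brands)  # homogeneity of variance
kruskal.test(Price ~ Brand, data = beer_brands)   # nonparametric alternative to one-way ANOVA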

Kinds of ANOVA:
one-way ANOVA (one factor): aov(beer_brands$Price ~ beer_brands$Brand)
two-way ANOVA (two factors): for example, measure the engagement ratio for two educational methods (with and without a song) for men and women independently: aov(engagement ~ method + sex)
In the formula, the dependent variable stands to the left of the tilde and the independent variables (the factors) to the right. Two-way ANOVA can also model interactions between the factors.
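A sketch of the two-way case in R with simulated data (the variable names method, sex, and engagement come from the example; the numbers are made up):

set.seed(1)
d <- data.frame(
  method     = rep(c("song", "no_song"), each = 20),
  sex        = rep(c("man", "woman"), times = 20),
  engagement = rnorm(40, mean = 0.5, sd = 0.1)
)
aov_main <- aov(engagement ~ method + sex, data = d)  # main effects only
aov_int  <- aov(engagement ~ method * sex, data = d)  # '*' adds the method:sex interaction
summary(aov_int)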

Report statistical results I Descriptive statistics: mean and standard deviation. Confidence intervals: report the confidence level (e.g., 95%), the lower limit, the upper limit, and what the interval is on (e.g., on a mean). Use APA style; see, for example, http://my.ilstu.edu/~jhkahn/apastats.html. Example: a confidence interval on the mean difference, 95% CI = (4, 6).

Report statistical results II Hypothesis tests: report the kind of test (e.g., one-sample t-test), the actual value of the test statistic (e.g., the value of t), the degrees of freedom, the p-value, the direction of the test if applicable (one-tailed or two-tailed), and the α level. APA style for reporting the result of a hypothesis test: t(df) = X.XX, p = X.XX, direction; e.g., t(24) = −2.50, p = 0.01, one-tailed.
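A sketch of pulling these numbers out of R for a one-sample t-test (the data vector is made up):

x  <- c(5.1, 4.9, 5.6, 5.3, 4.8, 5.4)  # hypothetical measurements
tt <- t.test(x, mu = 5)                # two-tailed one-sample t-test
tt$statistic                           # t
tt$parameter                           # degrees of freedom
tt$p.value                             # p-value for the APA-style report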

Correlation

Introduction Up to this point we have been working with only one variable. Now we are going to focus on two variables, two variables that are probably related. Can you think of some examples? Weight and height; time spent studying and your grade; outside temperature and ankle injuries.

Car data
Miles on a car    Value of the car
60,000            $12,000
80,000            $10,000
90,000            $9,000
100,000           $7,500
120,000           $6,000
x (miles) – predictor, explanatory, independent variable
y (value) – outcome, response, dependent variable

Car data How can we show that these variables have a relationship? Tell me some of your ideas. One answer: a scatterplot.

Scatterplot
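A sketch of that scatterplot in R, typing in the car data from the table above:

miles <- c(60000, 80000, 90000, 100000, 120000)
value <- c(12000, 10000, 9000, 7500, 6000)
plot(miles, value,
     xlab = "Miles on a car", ylab = "Value of the car ($)",
     main = "Car value vs. mileage")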

Stronger relationship?

Correlation A relation between two variables is called a correlation; a strong relationship means a strong (high) correlation. Exercise: match these labels to the scatterplots on the slide: strong positive, strong negative, weak positive, weak negative.

Correlation coefficient r (Pearson's r) – a number that quantifies the relationship:

$$ r = r_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} \frac{(x_i - \bar{x})(y_i - \bar{y})}{s_X \, s_Y} $$

(computed, for example, on the car data above).
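In R, Pearson's r for the car data is a single call (reusing the miles and value vectors from the scatterplot sketch); the second expression spells out the formula above and gives the same number:

cor(miles, value)   # Pearson's r
sum((miles - mean(miles)) * (value - mean(value))) /
  ((length(miles) - 1) * sd(miles) * sd(value))   # same value, by the formula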

[Example scatterplots with correlation coefficients +1, −1, +0.14, +0.93, −0.73]

Guessing Correlations Try to guess correlation coefficients at http://www.istics.net/Correlations/

Coefficient of determination The coefficient of determination $r^2$ is the percentage of variation in Y explained by the variation in X; in other words, the percentage of variance in one variable that is accounted for by the variance in the other variable. [Example scatterplots with r² = 0, r² = 0.25, r² = 0.81; from http://www.sagepub.com/upm-data/11894_Chapter_5.pdf]

Crickets Find a cricket, count the number of its chirps in 15 seconds, add 37, and you have just approximated the outside temperature in degrees Fahrenheit. National Weather Service Forecast Office: http://www.srh.noaa.gov/epz/?n=wxcalc_cricketconvert

chirps in 15 sec    temperature (°F)
18                  57
27                  68
20                  60
30                  71
21                  64
34                  74
23                  65
39                  77
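A quick check of the rule in R, typing in the table above:

chirps <- c(18, 27, 20, 30, 21, 34, 23, 39)
temp   <- c(57, 68, 60, 71, 64, 74, 65, 77)
cor(chirps, temp)   # very strong positive correlation
chirps + 37         # the rule-of-thumb estimates, close to temp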

Hypothesis testing Even when two variables describing a sample seem related, this could be just due to chance; the situation in the population may be different. Notation: r is the sample correlation coefficient, ρ the population correlation coefficient. What should the hypotheses look like?
A: $H_0: r = 0$; $H_A: r < 0$, $r > 0$, or $r \neq 0$
B: $H_0: \rho = 0$; $H_A: \rho < 0$, $\rho > 0$, or $\rho \neq 0$
C: $H_0: r < 0$, $r > 0$, or $r \neq 0$; $H_A: r = 0$
D: $H_0: \rho < 0$, $\rho > 0$, or $\rho \neq 0$; $H_A: \rho = 0$
The answer is B: hypotheses are statements about the population parameter ρ, and the null hypothesis is the one of no relationship.

Hypothesis testing The test statistic

$$ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}, \qquad df = n - 2 $$

has a t-distribution. Example: we measure the relationship between two variables on 25 participants and get t = 2.71. Is there a significant relationship between X and Y? With α = 0.05 and a non-directional (two-tailed) test, $t_{crit} = 2.069$; since 2.71 > 2.069, we reject $H_0$ and conclude the relationship is significant.
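In R, cor.test carries out exactly this t-test (a sketch reusing the cricket data above):

cor.test(chirps, temp)   # reports r, t, df = n - 2, and the two-sided p-value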

Correlation vs. causation Causation means one variable causes another to happen; for example, the fact that it is raining causes people to take their umbrellas. Correlation just means there is a relationship. For example, do happy people have more friends? Are they just happy because they have more friends? Or do they act in a certain way that causes them to have more friends?

Correlation vs. causation There is a strong relationship between ice cream consumption and the crime rate. However, if you stop selling ice cream, does the crime rate drop? What do you think? So how could the correlation be true? Outside temperature. (from causeweb.org)

Correlation vs. causation Outside temperature is a variable we did not think to control for. Such a variable is called a third variable, confounding variable, or lurking variable. The methodology of a scientific study therefore needs to control for such factors, to avoid the 'false positive' conclusion that the dependent variable is in a causal relationship with the independent variable.

Correlation vs. causation That's because correlation expresses the association between two or more variables; it has nothing to do with causality. In other words, just because ice cream consumption and the crime rate increase or decrease together does not mean that a change in one necessarily results in a change in the other. You can't interpret associations as being causal.

http://xkcd.com/552/

Correlation and regression analysis Correlation analysis investigates the relationships between variables using graphs or correlation coefficients. Regression analysis answers questions like: what kind of relationship exists between variables X and Y (linear, quadratic, …)? Is it possible to predict Y from X, and with what error?

Simple linear regression also called single linear regression: one y (the dependent variable) and one x (the independent variable):

$$ \hat{y} = a + bx $$

where a is the y-intercept (the constant) and b is the slope. $\hat{y}$ is the estimated value; statisticians write the hat to distinguish it from the actual value y corresponding to the given x.

Data set Students in higher grades carry more textbooks, and the weight of the textbooks depends on the weight of the student.

[Scatterplot of student weight vs. textbook weight: a strong positive correlation, r = 0.926, with one outlier; from Intermediate Statistics for Dummies]

Build a model Find a straight line $\hat{y} = a + bx$ that best fits the data.
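A sketch of fitting such a line in R with lm(); the textbook-weight data behind the figure are not listed, so the car data from earlier stand in:

model <- lm(value ~ miles)   # fits value = a + b * miles
coef(model)                  # a (intercept) and b (slope)
abline(model)                # draw the fitted line on the earlier scatterplot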

Interpretation The y-intercept (3.69 in our case) may or may not have a practical meaning. Does x = 0 fall within the actual values in the data set? Does the intercept fall in negative territory, where negative y-values are not possible (weights can't be negative)? Does x = 0 have a practical meaning at all (a student weighing 0)? However, even if the intercept has no practical meaning, it may still be necessary in the model (i.e., significantly different from zero). The slope is the change in y due to a one-unit increase in x: if a student's weight increases by 1 pound, the textbook weight increases by 0.113 pounds. Now you can use the regression line to estimate the y-value for a new x.

Regression model conditions After building a regression model you need to check whether the required conditions are met. What are these conditions? The y's have a normal distribution for each value of x, and the y's have a constant spread (standard deviation) for each value of x.

Normal y’s for every x For any value of x, the population of possible y-values must have a normal distribution. from Intermediate Statistics for Dummies

Homoscedasticity condition As you move from left to right along the x-axis, the spread of the y-values around the line remains the same. (source: wikipedia.org)

Residuals To check the normality of the y-values, you need to measure how far off your predictions were from the actual data and explore these errors. The residual is

$$ e = y - \hat{y} $$

[Figure: the residual is the vertical distance between the actual value y and the predicted value ŷ; from Intermediate Statistics for Dummies]

Residuals The residuals are data just like any other, so you can find their mean (which is always zero!) and their standard deviation. Residuals can be standardized, i.e., converted to Z-scores, so you can see where they fall on the standard normal distribution. Plotting the residuals on a graph gives residual plots.
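A sketch of these checks in R, reusing the car-data model fitted above:

res <- residuals(model)
mean(res)                  # essentially zero, up to rounding
rstandard(model)           # standardized residuals (Z-scores)
plot(fitted(model), res)   # residual plot: look for constant spread around zero
qqnorm(res); qqline(res)   # normality check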

[Residual plots illustrating normality of residuals, homoscedasticity, and independence of the residuals]

Using r² to measure model fit r² measures what percentage of the variability in y is explained by the model. The y-values in the data you collect have a great deal of variability. You look for another variable (x) that helps explain that variability. After you put x into the model and find it is highly correlated with y, you want to find out how well the model explains why the values of y differ.

Interpreting r²
High r² (80–90% is extremely high; 70% is fairly high): a high percentage of explained variability means the line fits well, because there is not much left to explain about the value of y other than x and its relationship to y.
Small r² (0–30%): the model containing x doesn't help much in explaining the differences in the y-values, so it would not fit well; you need a variable other than the one you already tried to explain y.
Middle r² (30–70%): x helps somewhat in explaining y, but it doesn't do the job well enough on its own; add one or more variables to the model to help explain y more fully as a group.
Textbook example: r = 0.93, r² = 0.8649. Approximately 86% of the variability in textbook weights is explained by student weight: a fairly good model.

Multiple regression Two (or more) predictor variables are better than one:

$$ y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k $$

Steps in the analysis (a sketch in R follows below):
1. Check the relationship between each x variable and y (using scatterplots and correlations) and use the results to eliminate the x variables that aren't strongly related to y.
2. Look at possible relationships between the x variables themselves to make sure you aren't being redundant (in statistical terms, to avoid multicollinearity). If two x variables relate to y in the same way, you don't need both in the model.
3. Use the selected x variables in a multiple regression analysis to find the best-fitting model for your data.
4. Use the best-fitting model to predict y for the given x-values.
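A sketch of the whole workflow in R with made-up numbers (the column names tv, newspaper, and sales echo the advertising example that follows; the values are hypothetical):

ads <- data.frame(
  tv        = c(5, 8, 12, 15, 20, 25),
  newspaper = c(2, 3, 5, 5, 8, 9),
  sales     = c(1.1, 1.6, 2.4, 2.9, 3.8, 4.6)
)
cor(ads)                                           # steps 1-2: all pairwise correlations
fit_ads <- lm(sales ~ tv + newspaper, data = ads)  # step 3: fit the model
summary(fit_ads)
predict(fit_ads, newdata = data.frame(tv = 18, newspaper = 6))  # step 4: predict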

Data set Relate plasma TV sales to two types of advertisement. (from Intermediate Statistics for Dummies)

Pinpoint possible relationships [Scatterplots of sales against each type of ad spending; from Intermediate Statistics for Dummies]

Correlations Compute all possible pairwise correlations: TV vs. sales, newspaper vs. sales, and newspaper vs. TV. (from Intermediate Statistics for Dummies)

Is the correlation coefficient ρ statistically significant? What are the null and alternative hypotheses? $H_0: \rho = 0$, $H_A: \rho \neq 0$. Investigate the p-values: if a p-value is smaller than α (typically 0.05), reject $H_0$. (from Intermediate Statistics for Dummies)

Checking for multicollinearity Look at the relationships between the x variables themselves and check for redundancy. Multicollinearity means two x variables are highly correlated. If two x variables are significantly correlated, include only one of them: if you include both, the computer won't know what numbers to give as coefficients for each of the two variables, because they share their contribution to determining the value of y. Multicollinearity can really mess up the model fitting. (from Intermediate Statistics for Dummies)

Find the best-fitting model (from Intermediate Statistics for Dummies)

The interpretation of the coefficients is a little more complicated than in simple linear regression. The coefficient of an x variable in a multiple regression model is the amount by which y changes if that x variable increases by one unit and the values of all the other x variables in the model don't change. Example: plasma TV sales increase by 0.162 million dollars when TV ad spending increases by $1,000 and spending on newspaper ads doesn't change.

Testing the coefficients Determine whether you have the right x variables in your model by testing, for each coefficient, $H_0$: coefficient = 0 against $H_A$: coefficient ≠ 0.
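In R, summary() reports exactly these per-coefficient tests (a sketch reusing fit_ads from the multiple-regression sketch above):

summary(fit_ads)$coefficients   # columns: Estimate, Std. Error, t value, Pr(>|t|)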

Extrapolation: no-no Do not estimate y for values of x outside their observed range! There is no guarantee that the relationship you found follows the same model for distant values of the predictors.

Checking the fit of the model The residuals have a normal distribution with mean zero, the residuals have the same variance for each fitted (predicted) value of y, and the residuals are independent (they don't affect each other).

[Residual plots illustrating normality of residuals, homoscedasticity, and independence of the residuals; from Intermediate Statistics for Dummies]

Adjusted R² How well the regression line approximates the real data points is measured by the coefficient of determination R² (equal to r² in simple regression). It tells you how much of the variability in the y-values is explained by the model. However, R² increases as we increase the number of variables in the model, so R² alone cannot be used as a meaningful comparison of models with different numbers of independent variables; use the adjusted R² for that.
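Both quantities are available from the lm summary (a sketch, again with the hypothetical fit_ads model):

summary(fit_ads)$r.squared       # never decreases when a predictor is added
summary(fit_ads)$adj.r.squared   # penalized for model size; use this to compare models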