GS/PPAL Section N Research Methods and Information Systems


GS/PPAL 6200 3.00 Section N Research Methods and Information Systems
A QUANTITATIVE RESEARCH PROJECT: (1) DATA COLLECTION (2) DATA DESCRIPTION (3) DATA ANALYSIS

Correlations
Is CGPA related in some systematic way to total hours studied (H)? Remember, we need to account for the fact that each variable tends to deviate randomly from its true mean. The "correlation coefficient" for a set of observations is a function of how much the observed values deviate from their sample means, adjusted for (i.e., not explained by) random deviation.

Correlations and Predictions
The presence of a (linear) correlation may offer predictive information that can be useful. It may (but need not) suggest causality to be examined further: "correlation does not imply causation" (especially when there is no control group). It may also suggest policy considerations (policy action, spillover effects, consequences).

Representing Linear Correlation
For a population, the typical notation is:
(1) ρ(H,C) = corr(H,C) = cov(H,C)/(σH σC) = E[(H - μH)(C - μC)]/(σH σC)
For a sample from that same population (changing the notation to indicate the calculations are for the sample):
(2) r(H,C) = [1/(n-1)] Σ [(Hi - avgH)(Ci - avgC)]/(sH sC)
Excel functions to calculate (2) above: = CORREL(data array (H), data array (CGPA)), or = PEARSON(data array (H), data array (CGPA))

Population Correlation Coefficient
The Pearson correlation coefficient (the number above each image) measures only the linear relationship between two variables.
Image credit: "Correlation examples2" by Denis Boigelot (original uploader Imagecreator). Licensed under CC0 via Wikimedia Commons: http://commons.wikimedia.org/wiki/File:Correlation_examples2.svg

Correlation Coefficient (= 0.816) versus Visual Inspection of Data
All four datasets in Anscombe's quartet have (nearly) the same correlation coefficient, r ≈ 0.816, yet look very different when plotted, so always inspect the data visually as well.
Image credit: "Anscombe's quartet 3" by Schutz (derivative work, labels by Avenue). Licensed under CC BY-SA 3.0 via Wikimedia Commons: http://commons.wikimedia.org/wiki/File:Anscombe%27s_quartet_3.svg

10-case Study Raw Data Scatter Plot with Linear Trend

Case  CGPA  Total Hours Studied
 1    7.67   35
 2    6.83   29
 3    4.17   23
 4           50
 5    5.00   32
 6           22
 7           17
 8    7.33   40
 9           44
10    6.33   38

Correlation for 10-case Study
= CORREL(CGPA, HOURS) = PEARSON(CGPA, HOURS) = 0.7944
R-squared = 0.7944 * 0.7944 = 0.63
If CGPA is a linear function of HOURS and CGPA is normally distributed, then R-squared gives the "explained variance": 63% of the variation in CGPA can be "explained" by variation in HOURS.

Strength versus Significance
A "strong" correlation may or may not be significant. A "weak" correlation may or may not be significant. The key is the size of the sample: for small samples, even a strong correlation may arise by chance; for large samples, it is easy to achieve significance for weak correlations.

Representing Linear Relationships
Since CGPA and HOURS appear to be strongly positively correlated (though that may be an artifact of the small sample size) and the correlation is statistically significant (despite the small sample), we examine the relationship more closely. General linear relationship: Y = mX + b, where Y is the dependent variable, X the independent (explanatory) variable, m the slope, and b the intercept (a constant).

Graphically
Locate the coordinates (2, 4), that is, X = 2, Y = 4. When X increases by +1 (from 2 to 3), how much does Y increase by? (= m) When X = 0, what does Y equal? (= b) Here Y increases by 1 and the line crosses the Y-axis at 2, therefore the model is Y = 1*X + 2.
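Reading the slope and intercept off the graph can be sketched in Python. The second point (3, 5) is inferred from the slide's model Y = 1*X + 2 rather than stated explicitly:

```python
# Two points on the line: (2, 4) from the slide, and (3, 5)
# inferred from the model Y = 1*X + 2
x1, y1 = 2, 4
x2, y2 = 3, 5

m = (y2 - y1) / (x2 - x1)  # slope: change in Y per +1 change in X
b = y1 - m * x1            # intercept: value of Y when X = 0

print(m, b)  # slope 1.0, intercept 2.0
```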

CGPA and HOURS
For the linear trend line, CGPA = intercept (b) + coefficient (m) * HOURS:
CGPA = 2.6 + 0.105*HOURS
For every +1 hour studied per month, by how much does CGPA increase? How did we obtain the linear trend line?

Regression Analysis - Intuition
The estimated linear trend line specifies the linear relationship that "best fits" the data. A "best fit" model is one that minimizes the amount by which the observations deviate from the hypothesized model. "Best fit" here means minimizing the sum of the squared deviations between the data points and the linear trend line (the model): the "Linear Least Squares Regression Model".

Regression Analysis - Mechanics
In Excel: "Data Analysis" → "Regression". Dependent variable: CGPA. The output reports the coefficients, i.e., the values of "b" (intercept) and "m" (the coefficient on the independent/explanatory variable), along with the standard error, t-stat, p-value, and 95% confidence interval for each estimate.

Data Interpretation (again)
From the regression output we know: CGPA = 2.6 + 0.1058*HOURS. For every +1 hour studied per month, CGPA on graduation increases by 0.1058. Equivalently, graduating students with a GPA one point higher than other graduating students studied on average about 9.43 more hours per month (9.43 = 1/0.106). And the 95% CI suggests the underlying (unobserved) population mean lies somewhere between 5.8 and 25 hours per month.

Significance
The linear correlation between hours studied (independent variable) and CGPA (dependent variable) suggests a possible (causal) relationship. But is the relationship statistically "significant"? Or did it occur by chance? Or is it an artifact of the small sample size, reflecting only sampling error? Our next question: what is the likelihood that the relationship we observe is due simply to sampling error or chance?

Significance Level and p-Values
Significance level (α): the probability of rejecting the null hypothesis when it is true (commonly α = 1%, 5%, or 10%).
P-value: the probability of obtaining a result equal to or more extreme than what is actually observed, given that the null hypothesis is true.
If p-value < α, the data are inconsistent with the null hypothesis → reject H0.
If p-value > α, the data are consistent with the null hypothesis → cannot reject H0.

P-value
If the null hypothesis is true, what is the probability of obtaining values equal to or more extreme (greater or less) than what is observed in our data? If the null hypothesis for our academic performance study is that there is no relationship between HOURS and CGPA (i.e., H0: m = 0), what is the probability that we would observe m = 0.106? The p-value = 0.0061, much less than the 0.05 = 5% (or 1% or 10%) level of significance, which is the rate of falsely rejecting H0, i.e., the rate of committing a Type I error → therefore reject H0.

t-statistic
An interval distance of +0.1 may or may not be "large," depending on the overall variation around the average (mean). The distance between an observed value and the mean (or a hypothesized mean) of the variable needs to be adjusted or standardized to account for that overall variation. The t-statistic for the sample is [estimated(m) - hypothesized(m)]/SE, which follows (approximately) a t-distribution with n-2 degrees of freedom.

Significance Level and t-tests
If the null hypothesis is that m = 0, we want to know whether the estimated value m = 0.106 is significantly different from 0.
t-stat = [estimated(m) - hypothesized(m)]/SE = (0.106 - 0)/0.0286 = 3.7
Is this standardized difference of 3.7 units significantly different from 0 at the 95% level for this sample size? The critical value for the t-stat is 2.306 (see next slide). Since t-stat = 3.7 > 2.306, the difference is statistically significant → reject H0: m = 0 → the data support HA.
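The same arithmetic, using the summary numbers reported on the slides (estimated slope 0.106, standard error 0.0286, critical value 2.306):

```python
# t-test on the slope, using the summary numbers from the regression output
m_hat  = 0.106    # estimated slope (coefficient on HOURS)
m_null = 0.0      # hypothesized slope under H0
se     = 0.0286   # standard error of the slope estimate

t_stat = (m_hat - m_null) / se   # standardized difference from 0
t_crit = 2.306                   # two-tailed 5% critical value, df = n-2 = 8
                                 # (from Excel: =T.INV.2T(0.05, 8))

reject_h0 = abs(t_stat) > t_crit
print(round(t_stat, 1), reject_h0)  # → 3.7 True
```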

t-stat Critical Values
Use Excel to calculate the critical value: = T.INV.2T(α, DF) = T.INV.2T(0.05, 8) = 2.306

Statistical Significance: Summary
P-value approach: p-value = 0.0061 < 0.05, i.e., the probability of obtaining this coefficient purely by chance is less than 5% → reject H0 → data support HA (H0: coefficient on HOURS = 0; HA: coefficient ≠ 0).
t-stat approach: t-stat = 3.7 > 2.306 → 0.106 is statistically significantly different from 0 → reject H0: m = 0 → data support HA: m ≠ 0.

Research Conclusion
It is highly unlikely that the observed correlation occurred by chance; the data support the hypothesis that hours studied is (positively) correlated with academic performance as measured by CGPA at graduation. The linear regression suggests that students with a GPA one point higher at graduation studied an estimated 9.5 more hours per month than did students with a lower GPA. But the small sample size means a wide confidence interval: the population mean lies somewhere between 5.8* and 25* hours per month (with 95% confidence). [*Take the bounds of the CI for m and convert them to hours/month.]