GS/PPAL Section N Research Methods and Information Systems


1 GS/PPAL 6200 3.00 Section N Research Methods and Information Systems
A QUANTITATIVE RESEARCH PROJECT: DATA COLLECTION, DATA DESCRIPTION, DATA ANALYSIS

2 Correlations Is CGPA related in some systematic way to total hours studied (H)? Remember, we need to account for the fact that each variable tends to deviate randomly from its true mean. The “correlation coefficient” for a set of observations is a function of how much each observed value deviates from its sample mean, adjusted for (i.e., not explained by) random deviation.

3 Correlations and Predictions
Presence of a (linear) correlation may offer useful predictive information. It may (but may not) suggest causality to be examined further: “correlation does not imply causation” (when there is no control group). It may suggest policy considerations (policy action, spillover effects, consequences).

4 Representing Linear Correlation
For a population of size N, the typical notation is:
(1) $\rho(H, C) = \mathrm{corr}(H, C) = \dfrac{\mathrm{cov}(H, C)}{\sigma_H \sigma_C} = \dfrac{1}{N} \sum_{i=1}^{N} \dfrac{(H_i - \mu_H)(C_i - \mu_C)}{\sigma_H \sigma_C}$
For a sample of size n from that same population (changing the notation to indicate the calculations are for the sample):
(2) $r(H, C) = \dfrac{1}{n-1} \sum_{i=1}^{n} \dfrac{(H_i - \bar{H})(C_i - \bar{C})}{s_H s_C}$
Excel functions to calculate (2) above: = CORREL (data array (H), data array (CGPA)), OR = PEARSON (data array (H), data array (CGPA))
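The sample formula can also be computed by hand outside Excel. The following is a minimal Python sketch, not part of the original slides; the function name `pearson_r` and the five data points are hypothetical illustration values, not the study's data.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation using the n - 1 divisor throughout."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample standard deviations s_H and s_C.
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    # Sample covariance.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    return cov / (s_x * s_y)

# Hypothetical illustration data (NOT the 10-case study data).
hours = [10, 20, 30, 40, 50]
cgpa = [4.0, 5.0, 5.5, 7.0, 8.0]
r = pearson_r(hours, cgpa)
print(round(r, 3))
```

On the same arrays, Excel's CORREL and PEARSON should return the same value.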

5 Population Correlation Coefficient
The Pearson correlation coefficient (the numbers above the images) measures only the linear relationship between two variables. Image: "Correlation examples2" by Denis Boigelot (own work); original uploader was Imagecreator. Licensed under CC0 via Wikimedia Commons.
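To illustrate the "linear only" point, the sketch below (an added example, not from the slides) builds a hypothetical data set in which Y is completely determined by X through Y = X squared, yet the Pearson coefficient is zero because the relationship is not linear.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data: a perfect but purely quadratic relationship.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]

r_quadratic = pearson_r(xs, ys)
print(r_quadratic)
```

The positive and negative halves of the parabola cancel, so a coefficient near zero here says nothing about whether the variables are related, only that they are not linearly related.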

6 Correlation Coefficient (= 0.816) versus Visual Inspection of Data
Image: "Anscombe's quartet 3" by Anscombe.svg: Schutz; derivative work (label using subscripts): Avenue (talk). Licensed under CC BY-SA 3.0 via Wikimedia Commons.

7 10-case Study Raw Data; Scatter Plot with Linear Trend
Case | CGPA | Total Hours Studied
1 | 7.67 | 35
2 | 6.83 | 29
3 | 4.17 | 23
4 | (missing) | 50
5 | 5.00 | 32
6 | (missing) | 22
7 | (missing) | 17
8 | 7.33 | 40
9 | (missing) | 44
10 | 6.33 | 38

8 Correlation for 10-case Study
r = CORREL (CGPA, HOURS) = PEARSON (CGPA, HOURS); R-squared = r * r = 0.63. If CGPA is a linear function of HOURS and CGPA is normally distributed, then R-squared gives the “explained variance”: 63% of the variation in CGPA can be “explained” by variation in HOURS.

9 Strength versus Significance
A “strong” correlation may or may not be significant. A “weak” correlation may or may not be significant. The key is the size of the sample: for small samples a strong correlation may still be by chance; for large samples it is easy to achieve significance for weak correlations.
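The interplay between strength and sample size can be sketched numerically. The example below is an addition, not from the slides: it uses the standard textbook t-statistic for testing a zero correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2), and two hypothetical scenarios. The critical values are hard-coded from standard two-tailed 5% t tables.

```python
import math

def t_from_r(r, n):
    """t-statistic for H0: no correlation, given sample correlation r and size n."""
    if n <= 2 or abs(r) >= 1:
        raise ValueError("need n > 2 and |r| < 1")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Two-tailed 5% critical values from standard t tables (df = n - 2).
CRIT_DF_2 = 4.303     # n = 4
CRIT_DF_998 = 1.962   # n = 1000

# Hypothetical scenario 1: strong correlation, tiny sample.
strong_small = t_from_r(0.8, 4)
# Hypothetical scenario 2: weak correlation, large sample.
weak_large = t_from_r(0.1, 1000)

print("r = 0.8, n = 4    significant:", abs(strong_small) > CRIT_DF_2)
print("r = 0.1, n = 1000 significant:", abs(weak_large) > CRIT_DF_998)
```

Under these assumptions the strong correlation fails the test while the weak one passes it, which is the point of the slide.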

10 Representing Linear Relationships
Since CGPA and HOURS appear to be strongly positively correlated (though this may only be an artifact of the small sample size) and the correlation appears statistically significant (despite the small sample), we examine the relationship more closely. General linear relationship: Y = mX + b, where Y is the dependent variable, X is the independent or explanatory variable, m is the slope, and b is a constant (the intercept).

11 Graphically
Locate coordinates (2, 4), that is, X = 2, Y = 4. When X increases by +1 (from 2 to 3), how much does Y increase by? (= m) When X = 0, what does Y equal? (= b) Therefore the model is Y = 1*X + 2.
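The graphical reading can be reproduced with a few lines of arithmetic. This is an added sketch; the helper name `line_through` is hypothetical, and the second point (3, 5) follows from the slide's statement that Y rises by 1 when X goes from 2 to 3.

```python
def line_through(p1, p2):
    """Return slope m and intercept b of the line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("vertical line: slope is undefined")
    m = (y2 - y1) / (x2 - x1)   # change in Y per +1 change in X
    b = y1 - m * x1             # value of Y when X = 0
    return m, b

m, b = line_through((2, 4), (3, 5))
print(m, b)  # 1.0 2.0
```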

12 CGPA and HOURS For the linear trend line, CGPA = Intercept (b) + coefficient (m) * HOURS, that is, CGPA = b + m*HOURS. For every +1 hour studied per month, by how much does CGPA increase? How did we obtain the linear trend line?

13 Regression Analysis - Intuition
The estimated linear trend line specifies the linear relationship that “best fits” the data. A “best fit” model is one that minimizes the amount by which the observations deviate from the hypothesized model. “Best fit” here means minimizing the sum of the squared deviations between the data points and the linear trend line (model). This is the “Linear Least Squares Regression Model”.
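As a rough illustration of "best fit", the sketch below (an addition, not from the slides) uses the standard closed-form least squares solution for one explanatory variable and checks that nudging the fitted line increases the sum of squared deviations. The data and function names are hypothetical.

```python
def least_squares_line(xs, ys):
    """Slope m and intercept b minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = s_xy / s_xx
    b = mean_y - m * mean_x
    return m, b

def sum_squared_residuals(xs, ys, m, b):
    """Sum of squared vertical distances between the data and the line."""
    return sum((y - (b + m * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical illustration data (NOT the 10-case study data).
hours = [10, 20, 30, 40, 50]
cgpa = [4.0, 5.0, 5.5, 7.0, 8.0]

m, b = least_squares_line(hours, cgpa)
best = sum_squared_residuals(hours, cgpa, m, b)
# Any other line does at least as badly as the fitted one.
tilted = sum_squared_residuals(hours, cgpa, m + 0.01, b)
shifted = sum_squared_residuals(hours, cgpa, m, b + 0.1)
print(round(m, 3), round(b, 3), best <= tilted, best <= shifted)
```

Excel's trend line and its Regression tool solve the same minimization.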

14 Regression Analysis - Mechanics
In Excel: “Data Analysis” → “Regression”. Dependent Variable: CGPA. Independent Variable: HOURS. Coefficients: values of “b” (intercept) and “m” (coefficient on the independent, or explanatory, variable). The output also gives the Standard Error, t-stat, P-value and CI (95%) for each estimate.

15 Data Interpretation (again)
From the Regression Output we know: CGPA = b + 0.106*HOURS, where b is the estimated intercept. For every +1 hour studied per month, CGPA on graduation increases by 0.106. Graduating students with a CGPA +1 grade point higher than other graduating students studied on average about 9.4 more hours per month (1 / 0.106 ≈ 9.43). The 95% CI suggests that the underlying (unobserved) population mean lies somewhere between 5.8 and 25 hours per month.
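The conversion from a slope (grade points per hour) to hours per grade point is just a reciprocal, and it flips the order of the interval bounds. The sketch below is an added illustration. The slope 0.106 comes from the slides, but the slope confidence bounds 0.040 and 0.172 are an assumption, back-calculated from the quoted 5.8 and 25 hours per month, since the regression output itself is not reproduced here.

```python
def hours_per_grade_point(slope):
    """Extra hours/month associated with +1 grade point, given grade points per hour."""
    if slope <= 0:
        raise ValueError("conversion only makes sense for a positive slope")
    return 1.0 / slope

slope = 0.106  # from the slides

# ASSUMPTION: bounds inferred from the quoted 5.8 and 25 hours/month,
# not read from the actual regression output.
slope_ci_low, slope_ci_high = 0.040, 0.172

point_estimate = hours_per_grade_point(slope)
# The largest slope gives the fewest hours, and vice versa.
hours_low = hours_per_grade_point(slope_ci_high)
hours_high = hours_per_grade_point(slope_ci_low)

print(round(point_estimate, 2), round(hours_low, 1), round(hours_high, 1))
```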

16 Significance The linear correlation between hours studied (independent variable) and CGPA (dependent variable) suggests a possible (causal) relationship. But is the relationship “significant” statistically? Or did it occur by chance? Or is it an artifact of the small sample size and related only to sampling error? Our next question: What is the likelihood that the relationship we observe is simply due to sampling error or chance?

17 Significance Level and p-Values
Significance Level (α): Probability of rejecting the null hypothesis when it is true (α = 1%, 5% or 10%). P-value: Probability of obtaining a result equal to or more extreme than what is actually observed, given that the null hypothesis is true. P-value < α: the data are inconsistent with the null hypothesis → reject H0. P-value > α: the data are consistent with the null hypothesis → cannot reject H0.

18 P-value If the null hypothesis is true, what is the probability of obtaining values equal to or more extreme (greater or less) than what is observed in our data? If the null hypothesis for our academic performance study is that there is no relationship between HOURS and CGPA (i.e., H0: m = 0), what is the probability that we would observe an estimate at least as extreme as m = 0.106? The P-value is much less than 0.05 = 5% (or 1% or 10%) level of significance = the rate of falsely rejecting H0 = rate of committing Type I error → therefore reject H0.
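A two-tailed P-value for a t-statistic can be approximated without statistical software by integrating the Student t density numerically. The sketch below is an addition, not from the slides: it uses Simpson's rule, takes the t-stat of 3.7 and the 8 degrees of freedom (n - 2 for 10 cases) reported elsewhere in the deck, and the function names are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t_stat, df, steps=10_000):
    """P(|T| >= |t_stat|) under H0, via Simpson's rule on [0, |t_stat|]."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * (h / 3) * total  # P(-|t| <= T <= |t|), by symmetry
    return 1 - central_area

p_value = two_tailed_p(3.7, 8)
print(p_value < 0.05, p_value < 0.01)
```

Excel's Regression output reports this P-value directly, so the integration is only meant to show where the number comes from.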

19 t-statistic An interval distance of +0.1 may or may not be “large” depending on the overall variation around the average (mean). The interval distance between an observed value and the mean (or a hypothesized mean) of the variable needs to be adjusted, or standardized, to account for the overall variation. t-statistic for the sample = [estimated(m) - hypothesized(m)]/SE, which follows a t distribution with n-2 degrees of freedom (approximately normal when n is large).

20 Significance Level and t-tests
If the null hypothesis is that m = 0, we want to know if the estimated value of m = 0.106 is significantly different from m = 0. t-stat = [estimated (m) - hypothesized (m)]/SE = (0.106 - 0)/SE = 3.7. Is this standardized difference of 3.7 units significantly different from 0 at the 95% confidence level for this sample size? Critical value for the t-stat = 2.306 (see next slide). t-stat = 3.7 > 2.306 → difference is significantly different → reject H0: m = 0 → data support HA.
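The standard error in the denominator comes from the scatter of the data around the fitted line. The sketch below is an added illustration using the standard simple-regression formula SE(m) = sqrt(SSE / (n - 2) / Sxx). The data are hypothetical, not the study's, so the resulting t-stat differs from the 3.7 on the slide.

```python
import math

def slope_t_stat(xs, ys, hypothesized_m=0.0):
    """Return (m, SE of m, t-stat) for a simple least squares regression."""
    n = len(xs)
    if n <= 2:
        raise ValueError("need more than 2 observations (df = n - 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = s_xy / s_xx
    b = mean_y - m * mean_x
    # Sum of squared residuals around the fitted line.
    sse = sum((y - (b + m * x)) ** 2 for x, y in zip(xs, ys))
    # A perfect fit (sse == 0) would make the t-stat undefined.
    se_m = math.sqrt(sse / (n - 2) / s_xx)
    return m, se_m, (m - hypothesized_m) / se_m

# Hypothetical illustration data (NOT the 10-case study data).
hours = [10, 20, 30, 40, 50]
cgpa = [4.0, 5.0, 5.5, 7.0, 8.0]

m, se_m, t_stat = slope_t_stat(hours, cgpa)
print(round(m, 3), round(se_m, 4), round(t_stat, 2))
```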

21 t-stat critical values
Use Excel to calculate the two-tailed critical value for the t-stat, with DF = n - 2 = 8: = T.INV.2T(α, DF) = T.INV.2T(0.05, 8) = 2.306
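For readers without Excel, the same critical value can be approximated from first principles. The sketch below is an addition and is not how T.INV.2T is implemented: it integrates the t density with Simpson's rule and uses bisection to find the point that leaves a total tail area of α. The function names and the search bound of 50 are assumptions that suit common values of α and DF.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def tail_area_two_sided(t, df, steps=4000):
    """P(|T| >= t) for t >= 0, using Simpson's rule on [0, t]."""
    if t == 0:
        return 1.0
    h = t / steps  # steps is even, as Simpson's rule requires
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (h / 3) * total

def t_crit_two_tailed(alpha, df, upper=50.0):
    """Bisection for the t value whose two-sided tail area equals alpha."""
    lo, hi = 0.0, upper
    for _ in range(60):
        mid = (lo + hi) / 2
        if tail_area_two_sided(mid, df) > alpha:
            lo = mid   # too much area in the tails: move outward
        else:
            hi = mid
    return (lo + hi) / 2

crit = t_crit_two_tailed(0.05, 8)
print(round(crit, 3))
```

The result should agree with T.INV.2T(0.05, 8) to about three decimal places.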

22 Statistical Significance: Summary
P-value approach: P-value < .05, i.e., if H0 were true, the probability of obtaining a coefficient at least this far from 0 purely by chance is less than 5% → reject H0 → data support HA (H0: coefficient on HOURS = 0; HA: ≠ 0). t-stat approach: t-stat = 3.7 > 2.306 → the coefficient is statistically significantly different from 0 → reject H0: m = 0 → data support HA: m ≠ 0.

23 Research Conclusion It is highly unlikely that the observed correlation occurred by chance; the data support the hypothesis that hours studying is (positively) correlated with academic performance as measured by CGPA at graduation. Linear regression suggests that students with a CGPA +1 grade point higher at graduation studied roughly 9.4 more hours per month (1 / 0.106) than did students with the lower CGPA. But the small sample size means a large Confidence Interval → the population mean lies somewhere between 5.8* hours/month and 25* hours/month (95% of the time). [*Take the bounds on the CI for m and convert them to hours/month.]

