Sit in your permanent seat. QM222 Class 4 Section D1: Reviewing descriptive statistics and distributions, making scatter diagrams, and correlation coefficients. QM222 Fall 2016 Section D1
Today we will: review descriptive statistics (with Excel); make scatter diagrams in Excel and Stata; compute correlations in Excel and Stata.
Assignment 1: What is the data set you plan to use? What is the main variable (or variables) in this data set that you plan to predict or explain? What specific question or questions will your project address? What company, governmental body or other organization would be interested in knowing the answer to this question?
Review
Descriptive statistics (review): We discussed means and medians, and when they give different results. We discussed measures of dispersion (spread), like the standard deviation, and the values at different percentiles (10%, 25%, 50%, 75%, 90%).
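The review above can be tried outside Excel too. Here is a minimal Python sketch using only the standard library; the `earnings` numbers are made up for illustration and are not the class data. One very large value pulls the mean well above the median, which is the situation in which the two give different results.

```python
import statistics

# Hypothetical, right-skewed earnings sample (illustration only, not the class data).
earnings = [20_000, 35_000, 40_000, 45_000, 50_000,
            55_000, 60_000, 70_000, 90_000, 400_000]

mean = statistics.mean(earnings)        # pulled upward by the single very large value
median = statistics.median(earnings)    # middle of the sorted data, unaffected by it
std_dev = statistics.stdev(earnings)    # sample standard deviation (a measure of spread)

# Quartiles: the values at the 25th, 50th and 75th percentiles.
q1, q2, q3 = statistics.quantiles(earnings, n=4)

print("mean:", mean, "median:", median, "std dev:", round(std_dev))
print("25th, 50th, 75th percentiles:", q1, q2, q3)
```

Percentile conventions differ between tools, so the quartiles here may not match Excel's =PERCENTILE() to the last digit.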
Distributions: A distribution graphs the likelihood of each X value (on the Y-axis) against the X variable itself. Distributions are similar to histograms, except that in a distribution the intervals are tiny and the Y-axis is the % of cases, not the # of cases. Therefore the area beneath a distribution adds to 1 (100%).
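A small Python sketch of the difference between counts and shares, assuming a tiny made-up data set (`values` is hypothetical). It is the discrete version of the idea: once the Y-axis is the share of cases rather than the number of cases, everything adds to 1.

```python
from collections import Counter

# Hypothetical data, illustration only.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]

# Histogram view: the number of cases at each value.
counts = Counter(values)

# Distribution view: the share (fraction) of cases at each value.
total = sum(counts.values())
shares = {x: c / total for x, c in counts.items()}

# The shares add up to 1 (100%), whatever the sample size.
total_share = sum(shares.values())
print(shares, total_share)
```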
Normal distribution: A “Normal distribution” looks like a symmetric bell curve. Symmetric means that the right side of the mean is a mirror image of the left side. Bell curves look like a bell. Notation here: μ is the mean, and σ is the standard deviation. Approximately 68% (around two thirds) of the observations are within one standard deviation of the mean. Approximately 95% of the observations are within two standard deviations of the mean.
Problem sets: Do problem sets on your own; it is the best way to learn the material. Mistakes on problem sets are not excessively penalized. There may be a pop quiz on the problem set in section when it is due (with p=.5).
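The 68% and 95% figures for the normal distribution can be checked numerically. This sketch uses the standard result that the share of a normal distribution within k standard deviations of the mean equals erf(k/√2); the helper name `share_within` is just for illustration.

```python
import math

def share_within(k: float) -> float:
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

within_one = share_within(1)   # roughly 0.68
within_two = share_within(2)   # roughly 0.95

print(round(within_one, 4), round(within_two, 4))
```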
Excel team practice in descriptive statistics: Open the file on sites.bu.edu/qm222projectcourse/other materials/data and other materials used in class: Class 2 ACS Business Major Earnings 2012
Hints: =AVERAGE(), =MEDIAN(), =STDEV(), =MIN(), =MAX(), =PERCENTILE(range, 0.20) (for example). Or, on Excel's Data tab, under Data Analysis, choose Descriptive Statistics to get all of these statistics.
Answer this Q: Is this distribution “normal”? List several ways you know.
Table to complete (columns: Excel Formula, Value; rows: Mean, Median, Standard deviation, Range, 5th percentile, 95th percentile).
Descriptive statistics in Stata

. sum Earnings

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
    Earnings |      2200    78376.55    67653.98          0     382000

. sum Earnings, detail

                          Earnings
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%      13913.5              0
10%        23044              0       Obs                2200
25%        40000              0       Sum of Wgt.        2200

50%        60984                      Mean           78376.55
                        Largest       Std. Dev.      67653.98
75%      93754.5         382000
90%       147483         382000       Variance       4.58e+09
95%       201000         382000       Skewness       2.424376
99%       382000         382000       Kurtosis       10.11437
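One way to approach the "is this distribution normal?" question is to compare the Stata numbers for Earnings against what a normal distribution would imply. The Python sketch below copies the summary statistics from the output above; the particular checks and variable names are one possible choice, not a formal test of normality.

```python
# Summary statistics copied from the Stata output for Earnings.
mean, median, std_dev = 78376.55, 60984, 67653.98
minimum = 0
skewness = 2.424376
p5, p95 = 13913.5, 201000

# Check 1: in a normal distribution the mean and the median coincide.
mean_exceeds_median = mean > median

# Check 2: a normal distribution is symmetric (skewness 0);
# positive skewness indicates a long right tail.
right_skewed = skewness > 0

# Check 3: about 95% of normal data lie within two standard deviations of the mean,
# so mean - 2*SD should be a plausible value. Here it is far below the minimum of 0.
lower_two_sd = mean - 2 * std_dev
lower_bound_impossible = lower_two_sd < minimum

# Check 4: in a symmetric distribution the 5th and 95th percentiles
# are equally far from the median.
gap_below = median - p5
gap_above = p95 - median

print(mean_exceeds_median, right_skewed, round(lower_two_sd, 2), gap_below, gap_above)
```

Every check points the same way: the earnings data have a long right tail rather than a symmetric bell shape.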
Relationship between 2 variables
Scatterplots can tell us: the direction (sign) of the relationship between two variables (is the slope positive or negative?); the form of the relationship (linear vs. curved); the strength of the relationship; whether there are outliers.
Example: The Midwest seems to have the best SAT math scores. But is this because fewer high schoolers in the Midwest take the SAT?
Use a scatter plot! Each dot represents one “observation”, one data point. What is an observation in this data set? A state. If I made a line, would the slope be positive or negative? Negative. Would a line or a curve fit better? Probably a curve. Is the relationship strong? Hmmm… kind of. Are there outliers? Not really far-out ones.
Making scatter diagrams in Excel. In-class exercise (Class 4): Open UniversityAdmissions_SAT.xlsx (a data set from NYC) on sites.bu.edu/qm222projectcourse/other materials/data and other materials used in class.
Place the two columns you want in your graph side-by-side. The variable you want on the x-axis should be on the left.
Make sure the top row of each column has a descriptive label for the variable.
On the Insert tab, click the picture of a scatter diagram and then click on the first scatter, the one with only markers and no connecting lines.
What does each observation represent?
Make a scatter diagram with the school’s math mean score on the Y-axis and the school’s reading score on the X-axis.
Your scatter diagram from Excel…
Making a scatter diagram in Stata:
graph twoway scatter MathematicsMean ReadingMean
We’d also like a numerical measure of how closely two variables move together: the correlation coefficient. The correlation (coefficient) tells us two things. The direction of association: when X goes up, does Y go up or down? The strength of the association: how closely related are Y and X, or, how strong is the link? It doesn’t tell us if the relationship is linear or curved; in fact, it assumes that the relationship is linear.
Correlation coefficient (notation: r or ρ): A positive correlation coefficient means that when we see a higher value for one variable, we also tend to see a higher value for the other variable. A negative correlation coefficient means that when we see a higher value for one variable, we tend to see a lower value for the other variable.
Correlation coefficient: A correlation coefficient of zero means that there is no correlation (no linear association). If you did a scatter of X and Y, the dots would seem to have no relationship.
The correlation coefficient is between -1 and 1. A value of |r| closer to 1 means a stronger association. When r = 1 there is perfect positive correlation; if you did a scatter of X and Y, the dots would all lie exactly on an upward sloping line. When r = -1 there is perfect negative correlation; if you did a scatter of X and Y, the dots would all lie exactly on a downward sloping line. When r = 0 there is no correlation; if you did a scatter of X and Y, the dots would seem to have no relationship with each other. If you were to fit a line to the dots, it would be flat (since Y doesn’t change as X changes).
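The three reference cases (r = 1, r = -1, r = 0) can be reproduced with a short Python sketch. The helper `pearson_r` implements the usual Pearson formula by hand, and all of the data are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dev_x = [x - mean_x for x in xs]
    dev_y = [y - mean_y for y in ys]
    co_movement = sum(a * b for a, b in zip(dev_x, dev_y))
    return co_movement / math.sqrt(sum(a * a for a in dev_x) * sum(b * b for b in dev_y))

x = [1, 2, 3, 4, 5]

r_up = pearson_r(x, [2 * v + 1 for v in x])     # dots exactly on an upward sloping line
r_down = pearson_r(x, [10 - 3 * v for v in x])  # dots exactly on a downward sloping line
r_flat = pearson_r(x, [3, 1, 4, 1, 3])          # hypothetical Y with no upward or downward trend

print(r_up, r_down, r_flat)
```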
How do you think the correlation coefficients compare in Figure A and Figure B below? Both are positive. Figure B fits more tightly around the line, so its correlation coefficient is closer to 1. The fact that one is steeper doesn’t affect the correlation.
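Both points in the answer can be illustrated with made-up numbers (the figures themselves are not reproduced here, so the data below are purely hypothetical). Multiplying Y by 10 makes the fitted line ten times steeper but leaves r unchanged, while scattering the dots more loosely around the line lowers r.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dev_x = [x - mean_x for x in xs]
    dev_y = [y - mean_y for y in ys]
    co_movement = sum(a * b for a, b in zip(dev_x, dev_y))
    return co_movement / math.sqrt(sum(a * a for a in dev_x) * sum(b * b for b in dev_y))

x = [1, 2, 3, 4, 5, 6]
y_tight = [2.1, 2.9, 4.2, 4.8, 6.3, 6.9]   # hypothetical: dots close to a line
y_steep = [10 * v for v in y_tight]        # same dots, ten times steeper
y_loose = [1, 5, 2, 6, 3, 8]               # hypothetical: dots scattered widely around a line

r_tight = pearson_r(x, y_tight)
r_steep = pearson_r(x, y_steep)   # same as r_tight: steepness does not matter
r_loose = pearson_r(x, y_loose)   # lower: the fit around the line is looser

print(r_tight, r_steep, r_loose)
```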
Correlation in Excel: To get the correlation between 2 variables in Excel, use =CORREL(range X, range Y). (Or, on Excel's Data tab, under Data Analysis, choose Correlation to get the correlations between all variables in a range.) In-class exercise using UniversityAdmissions_SAT.xlsx: 1. Get the correlation between the math and reading school mean scores. 2. Get the correlation between the number of test takers and the reading mean scores.
In Stata

correlate MathematicsMean ReadingMean NumberofTestTakers
(obs=78)

             | Mathem~n Readin~n Number~s
-------------+---------------------------
Mathematic~n |   1.0000
 ReadingMean |   0.8831   1.0000
NumberofTe~s |   0.0712  -0.0033   1.0000
Interpreting the values of correlation: Measured correlations are almost never exactly 0, 1, or -1. A claim that two variables are uncorrelated typically means that the correlation is “near” 0. There is no absolute standard for what is a strong correlation, what is a weak correlation, and what is no correlation.
Correlation v. relation: The correlation coefficient measures the strength of the linear relationship. A low value is not enough to conclude that there is no strong link between the two variables. This picture has a near zero correlation … The two variables are very related, but the relationship is not a line with a single slope.
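A made-up example of a strong relationship with zero correlation: below, Y is determined exactly by X (Y = X²), yet the downward-sloping left half and the upward-sloping right half cancel out. The data are hypothetical and are not the data behind the picture on the slide.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dev_x = [x - mean_x for x in xs]
    dev_y = [y - mean_y for y in ys]
    co_movement = sum(a * b for a, b in zip(dev_x, dev_y))
    return co_movement / math.sqrt(sum(a * a for a in dev_x) * sum(b * b for b in dev_y))

x = [-3, -2, -1, 0, 1, 2, 3]
y = [v * v for v in x]   # Y depends perfectly on X, but along a U-shaped curve

r_curve = pearson_r(x, y)
print(r_curve)
```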
Correlation does not mean causation (i.e. one thing causes another): https://www.youtube.com/watch?v=8B271L3NtAw
Why correlation does not imply causation. Possible explanations for a correlation between X and Y: X causes Y (a change in X will change Y). Y causes X (a change in Y will change X). X causes Y AND Y causes X (this is known as simultaneity). Another variable (or variables) causes both X and Y (this is called a confounding factor).
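The confounding case can be simulated. In this Python sketch everything is hypothetical: a third variable Z is drawn at random, and X and Y are each built from Z plus their own independent noise. Neither X nor Y enters the formula for the other, yet the two come out strongly correlated.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dev_x = [x - mean_x for x in xs]
    dev_y = [y - mean_y for y in ys]
    co_movement = sum(a * b for a, b in zip(dev_x, dev_y))
    return co_movement / math.sqrt(sum(a * a for a in dev_x) * sum(b * b for b in dev_y))

random.seed(0)  # fixed seed so the simulation is repeatable
n = 1000

z = [random.gauss(0, 1) for _ in range(n)]     # hypothetical confounding factor
x = [zi + random.gauss(0, 0.5) for zi in z]    # X is driven by Z only
y = [zi + random.gauss(0, 0.5) for zi in z]    # Y is driven by Z only

r_xy = pearson_r(x, y)   # clearly positive, although X does not cause Y and Y does not cause X
print(round(r_xy, 2))
```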
Let’s go through the examples in the video… which is it? A. X causes Y. B. Y causes X. C. X causes Y AND Y causes X (simultaneity). D. Another variable(s) cause both X and Y (confounding factor). Examples: Ice cream (X) causes drownings (Y). Married men live longer than single men. Infants who sleep with the lights on tend to grow up short-sighted. Self esteem causes good grades.
Assignment 2, paraphrased (from sites.bu.edu/qm222projectcourse): What specific question or questions will your project address? What company, governmental body or other organization would be interested in knowing the answer to this question? What data source(s) are you using? In your data, what does each observation represent? What is the dependent variable(s) you plan to focus on? (Need the name from the dataset, or how you are going to make it from other variables in the data set.) What is the main explanatory variable(s) that you will focus on? (Need the name from the dataset or how you are making it, as above.) What additional, possibly confounding variables can you measure that you are planning to include in your analysis? (Again, use the specific variable name in the dataset.)