S010Y: Answering Questions with Quantitative Data
Class 8/III.1: Displaying and Summarizing Continuous Data
© Willett, Harvard University Graduate School of Education, 1/31/2016, S010Y/C08

Slide 1

Research Is A Partnership Of Questions And Data

What types of data are collected? “Continuous” data and “categorical” data.

What kinds of question can be asked of those data?

Questions that require us to describe single features of the participants:
    “Continuous” data: How tall are class members, on average? How many hours a week do class members report that they study? …
    “Categorical” data: How many members of the class are women? What proportion of the class is full-time? …

Questions that require us to examine relationships between features of the participants:
    “Continuous” data: Do people who say they study for more hours also think they’ll finish their doctorate earlier? Are computer-literate people less anxious about statistics? …
    “Categorical” data: Are men more likely to study part-time? Are women more likely to enroll in CCE? …

Slide 2

It is more difficult to summarize the sample distribution of a continuous variable, like MAT score, than it is to summarize the sample distribution of a categorical variable, because the sample distributions of continuous variables have so many interesting properties, including:
    The “center” or “location” of the batch.
    The “spread” of the batch.
    The “one-sidedness” of the batch.
    The “peakiness” of the batch.

We have distinguished two broad approaches for creating statistical summaries of these properties:
    Approach #1, based on the ordering of data values: median, quartiles, percentiles, inter-quartile range, …
    Approach #2, based on the arithmetic manipulation of data values: mean, standard deviation, skewness, kurtosis, …

Last time, I focused on generating such summaries using the ordering principle. Today, I’ll focus on generating summaries using the arithmetic manipulation principle.

Slide 3

Let’s use the arithmetic principle to develop a statistic for describing the center of the distribution of the values of a continuous variable like MAT score … for the “Early”, “Elsewhere” batch, for instance …

A good summary statistic for describing the center of a distribution of the values of a continuous variable is the place where the distribution would need to be supported so that it could “balance.”

Slide 4

A good summary statistic for describing the center of the distribution of the values of a continuous variable, like MAT score, is the place where the distribution must be supported for it to balance. This is known as the sample mean, or average.
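
The balance-point idea can be written out as a short formula. The notation here is introduced only for illustration and does not appear on the slides: x_i is the i-th score, n is the number of cases, and x-bar is the sample mean. The second equation is the "balance" property: the signed distances above the mean exactly offset those below it.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i ,
\qquad
\sum_{i=1}^{n} \left( x_i - \bar{x} \right) = 0 .
```

For the full MAT sample reported later in the deck (N = 90, sum of observations 5705), the first formula gives 5705/90 ≈ 63.4.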

Slide 5

Let’s use the arithmetic principle to create a summary statistic for describing the spread of the distribution of values of a continuous variable … how about the “average distance from the center”?

Why don’t we just find the average distance of all the “blocks” from the center?

Slide 6

When you sum the distances from the center, the positive and negative distances cancel and everything goes to zero, so what do we do now …?

Let’s do what we’ve done before, and square all the distances before averaging.

Now I guess we should take the square root, to reverse the squaring that we did to begin with?

Let’s call this the standard deviation.
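
The steps on this slide can be traced in a few lines of code. This is a minimal sketch in Python rather than SAS, and the five scores are invented for illustration; they are not the MAT data. One assumption is worth flagging: the slide speaks of "averaging" the squared distances, whereas the sketch divides by n - 1, which is the divisor PROC UNIVARIATE uses by default when it reports the sample standard deviation.

```python
import math

# Hypothetical batch of scores, invented for illustration (not the MAT data).
scores = [50, 60, 65, 70, 80]

n = len(scores)
mean = sum(scores) / n                       # the balance point

deviations = [x - mean for x in scores]      # signed distances from the mean
sum_of_deviations = sum(deviations)          # positives and negatives cancel

squared = [d ** 2 for d in deviations]       # squaring removes the signs
variance = sum(squared) / (n - 1)            # "average" squared distance (n - 1 divisor)
std_dev = math.sqrt(variance)                # square root undoes the squaring

print(mean, sum_of_deviations, variance, round(std_dev, 2))
```

Dividing by n instead of n - 1 would give a slightly smaller value; with 90 cases the difference is small.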

Slide 7

And so, creating summary statistics based on the arithmetic principle, here’s the story so far…

[Figure: distribution annotated with the mean (63.4) and the standard deviation]

Slide 8

You don’t have to do all these computations by hand; SAS can do them for you. Here are the MAT data you worked with, supplemented by data from the 1987 cohort, all in the MAT.txt dataset.

[Data listing: 74 cases omitted]

Variables in the dataset:
    Entering cohort: 1 = 1987, 2 = 1989
    ID label
    When the test was received in the Admissions Office: 1 = Early, 2 = Late
    Raw MAT score
    Location of test site: 1 = Harvard, 2 = Elsewhere

Slide 9

Here’s a PC-SAS program to provide descriptive univariate statistics on these data … (Handout C08_1)

    OPTIONS Nodate Pageno=1;
    TITLE1 'S010Y: Answering Questions with Quantitative Data';
    TITLE2 'Class 8/Handout 1: Displaying and Summarizing Continuous Data, Part I';
    TITLE3 'MAT Scores from 2 Years of Doctoral Applicants';
    TITLE4 'Data in MAT.txt';

    * Input data, name and label variables in dataset *;
    DATA MAT;
        INFILE 'C:\DATA\S010Y\MAT.txt';
        INPUT YEARTEST ID WHENRECD MATSCOR TESTSITE;
        LABEL ID       = 'Case identification number'
              YEARTEST = 'Year test taken'
              WHENRECD = 'When application received'
              MATSCOR  = 'Millers Analogies Test Score'
              TESTSITE = 'Test site';

    * Format labels for values of categorical variables *;
    PROC FORMAT;
        VALUE YEARFMT 1='1987' 2='1989';
        VALUE WHENFMT 1='Early' 2='Late';
        VALUE SITEFMT 1='Harvard' 2='Elsewhere';

Notes on the program:
    The DATA step contains the standard data input statements; notice that there are several other variables in the dataset.
    PROC FORMAT carries out the usual process of formatting the categorical variables.

Slide 10

And here’s the rest of the PC-SAS program … this part provides the requested univariate descriptive statistics...

    * Data Listing *;
    PROC PRINT LABEL DATA=MAT;
        TITLE5 'Listing of MAT Scores & Background Variables for all Applicants';
        VAR ID YEARTEST WHENRECD TESTSITE MATSCOR;
        FORMAT YEARTEST YEARFMT. WHENRECD WHENFMT. TESTSITE SITEFMT.;

    * Displaying and summarizing the MAT scores for the whole sample *;
    PROC UNIVARIATE PLOT DATA=MAT;
        TITLE5 'Univariate Descriptive Summaries of MAT Score for all Applicants';
        VAR MATSCOR;
        ID ID;
    RUN;

Notes on the program:
    PROC PRINT prints, titles and formats a few cases for inspection.
    PROC UNIVARIATE provides all kinds of univariate (“single variable”) descriptive statistics for continuous variables.
    The PLOT option requests various data plots, including the stem.leaf plot.
    The VAR command specifies the continuous variable to be summarized.
    The ID command identifies a variable that contains respondent identifying information.

Slide 11

Here’s a listing of a few cases from the dataset … Each row is a case, as usual.

[SAS output: “Listing of MAT Scores and Background Variables for all Applicants”, under the titles “S010Y: Answering Questions with Quantitative Data”, “Class 8/Handout 1: Displaying and Summarizing Continuous Data, Part I”, “MAT Scores from 2 Years of Doctoral Applicants” and “Data in MAT.txt”. Columns: Obs, Case identification number, Year test taken, When application received, Test site, Millers Analogies Test Score. The visible rows are mostly Early/Elsewhere and Late/Elsewhere cases, with one Late/Harvard case.]

[Image: Harvard graduation, 1890. The six class day speakers, with W.E.B. Du Bois on the far right.]

Slide 12

And the “ordering” and “arithmetic manipulation” summary statistics for MATSCOR are …

[SAS output: PROC UNIVARIATE tables (Moments, Basic Statistical Measures, Quantiles) for Variable: MATSCOR (Millers Analogies Test Score). Recoverable entries:]

    N                   90
    Sum Weights         90
    Sum Observations    5705
    100% Max            96
    75% Q3              77
    50% Median          65
    25% Q1              53
    0% Min              18.0

Reading the output:
    The sample mean of MATSCOR is 63.4 (5705/90).
    The sample standard deviation of MATSCOR is …
    The median (or 50th percentile) of MATSCOR is 65.
    The inter-quartile range is the difference between the upper and lower quartiles: lower quartile = 53, upper quartile = 77, inter-quartile range = (77 - 53) = 24.
    The range is the difference between the minimum and the maximum: minimum = 18, maximum = 96, range = (96 - 18) = 78.
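
The ordering-based summaries can be reproduced on a small batch to see how they fit together. This is a sketch in Python rather than SAS. The seven values below are hypothetical: they were invented so that their quartiles, minimum and maximum match the figures reported for MATSCOR, and they are not the real 90-case dataset. Packages differ slightly in how they interpolate quartiles, so results on other data may not match SAS to the last digit.

```python
import statistics

# Hypothetical 7-value batch, invented so that its five-number summary matches
# the values reported for MATSCOR; it is NOT the real 90-case dataset.
batch = [18, 53, 60, 65, 70, 77, 96]

# quantiles(..., n=4) returns the three cut points that split the ordered
# batch into quarters: lower quartile, median, upper quartile.
q1, median, q3 = statistics.quantiles(batch, n=4)

iqr = q3 - q1                          # inter-quartile range
data_range = max(batch) - min(batch)   # range

print(q1, median, q3, iqr, data_range)
```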

Slide 13

Here’s SAS’s version of the stem.leaf plot for the values of MATSCOR …

[SAS output: stem.leaf plot of Millers Analogies Test Score (columns: Stem, Leaf, #), with the footnote “Multiply Stem.Leaf by 10**+1”]

This is scientific notation: 1.8 x 10^1 = 18, etc. And don’t forget the inverses …
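
To make the "Multiply Stem.Leaf by 10**+1" footnote concrete, here is a rough sketch of how a stem.leaf display can be built. It is written in Python, not SAS, and makes no attempt to reproduce SAS's exact layout. The function name `stem_leaf` and the scores are hypothetical, and the sketch assumes non-negative whole-number scores.

```python
from collections import defaultdict

def stem_leaf(values):
    """Return stem.leaf rows (largest stem first) for non-negative integers."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)    # 18 -> stem 1, leaf 8, i.e. 1.8 x 10**1
        leaves[stem].append(str(leaf))
    top, bottom = max(leaves), min(leaves)
    rows = []
    for stem in range(top, bottom - 1, -1):   # keep empty stems so gaps show
        rows.append(f"{stem} | {''.join(leaves[stem])}".rstrip())
    return rows

# Hypothetical scores, invented for illustration (not the MAT data).
scores = [18, 53, 55, 60, 65, 65, 70, 77, 96]
rows = stem_leaf(scores)
print("\n".join(rows))
```

Each row reads as stem.leaf times ten, so the row `7 | 07` stands for the scores 70 and 77.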

Slide 14

We can bring several of these univariate descriptive statistics, both the “ordering” and “arithmetic manipulation” versions, together in a useful single summary figure called the “box and whisker” plot, or boxplot…

Recall that, for the full sample (n=90) …
    Minimum, maximum and range: Min = 18, Max = 96, Range = 78
    Quartiles, median and inter-quartile range: 25th %ile (Q1) = 53, Median = 65, 75th %ile (Q3) = 77, Interquartile range = 24
    Mean: Mean = 63.4

Slide 15

And here’s the PROC UNIVARIATE version of the box-plot from the previous handout…

[SAS output: The UNIVARIATE Procedure, Variable: MATSCOR (Millers Analogies Test Score); stem.leaf plot with a side-by-side boxplot (columns: Stem, Leaf, #, Boxplot), with the footnote “Multiply Stem.Leaf by 10**+1”]

Questions:
    What would the box-plot look like if the sample distribution of MATSCOR were perfectly symmetrical?
    What would the box-plot look like if there were very little variability in MATSCOR in the sample?
    What features of the sample distribution of MATSCOR account for the fact that the sample mean is smaller than the sample median?
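
One way to explore the last question is to compare the mean and median of two small batches that differ only in their lowest value. This is a sketch in Python with invented numbers, not the MAT data; the helper name `center` is hypothetical.

```python
import statistics

def center(batch):
    """Return (mean, median) for a batch of scores."""
    return statistics.mean(batch), statistics.median(batch)

# Hypothetical batches, invented for illustration (not the MAT data).
symmetric = [50, 60, 65, 70, 80]
lower_tail = [18, 60, 65, 70, 80]   # same batch, lowest value pulled far down

sym_mean, sym_median = center(symmetric)
tail_mean, tail_median = center(lower_tail)

print("symmetric: ", sym_mean, sym_median)
print("lower tail:", tail_mean, tail_median)
```

In the symmetric batch the two statistics coincide. Pulling one value far down leaves the median where it was but drags the mean below it.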

Slide 16

An interesting aside on the normal distribution…

A considerable number of continuous variables that occur “naturally” turn out to be “normally distributed”: height, weight, test scores, opinions, etc.

If you were to plot a vertical histogram of the values of variables like these, you would get the familiar “bell-shaped curve”…

There is a special relationship between percentiles and standard deviation in a normal distribution.

[Figure: bell-shaped curve marked at Mean - 2sd, Mean - 1sd, Mean + 1sd and Mean + 2sd]
[Links: Normal distribution simulation; Ball-drop simulation]
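
The relationship between standard deviations and percentiles can be checked numerically. The sketch below is in Python, not SAS, and uses the standard library's `NormalDist` to find the share of a normal distribution lying within one and two standard deviations of the mean. The `ball_drop` function is a hypothetical stand-in for the ball-drop simulation linked on the slide: each ball bounces left or right at every peg with equal chance, and its final bin is the number of rightward bounces. The peg and ball counts are arbitrary choices.

```python
import random
from statistics import NormalDist, mean

z = NormalDist()                      # standard normal: mean 0, sd 1
within_1sd = z.cdf(1) - z.cdf(-1)     # share within mean +/- 1sd, about 0.68
within_2sd = z.cdf(2) - z.cdf(-2)     # share within mean +/- 2sd, about 0.95

def ball_drop(n_balls, n_pegs, seed=0):
    """Simulate balls bouncing right (1) or left (0) at each peg; return bins."""
    rng = random.Random(seed)
    return [sum(rng.random() < 0.5 for _ in range(n_pegs))
            for _ in range(n_balls)]

bins = ball_drop(n_balls=10_000, n_pegs=12)

print(round(within_1sd, 3), round(within_2sd, 3))
print("average bin:", mean(bins))     # should sit near n_pegs / 2
```

A tally of `bins` piles up around the middle bin and thins out toward the edges, which is the bell shape the slide describes.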

Slide 17

The boxplot is very useful if you want to compare sample distributions of a continuous variable like MATSCOR across different groups, as in Activity #1 (see Handout C08_2) …

    OPTIONS Nodate Pageno=1;
    TITLE1 'S010Y: Answering Questions with Quantitative Data';
    TITLE2 'Class 8/Handout 2: Displaying and Summarizing Continuous Data, Part II';
    TITLE3 'Using Boxplots To Compare MAT Scores of Doctoral Applicants to APSP';
    TITLE4 'Data in MAT.txt';

    * Input data, name and label variables in dataset *;
    DATA MAT;
        INFILE 'C:\DATA\S010Y\MAT.txt';
        INPUT YEARTEST ID WHENRECD MATSCOR TESTSITE;
        IF YEARTEST = 2;   * Pick out 1989 Cohort for comparison with Activity #1;
        LABEL ID       = 'Case identification number'
              YEARTEST = 'Year test taken'
              WHENRECD = 'When application received'
              MATSCOR  = 'Millers Analogies Test Score'
              TESTSITE = 'Test site';

    * Format labels for the values of the categorical variables *;
    PROC FORMAT;
        VALUE WHENFMT 1='Early' 2='Late';
        VALUE SITEFMT 1='Harvard' 2='Elsewhere';

Notes on the program:
    Let’s use categorical variables WHENRECD and TESTSITE to sub-divide the sample, so that we can compare sub-sample distributions of MATSCOR using boxplots, as in the original Activity #1.
    Here, I’ve picked out only applicants in the 1989 (YEARTEST = 2) cohort, so that the new analyses will match the analyses that you conducted in the original Activity #1.

Slide 18

And here’s the rest of the PC-SAS program…

    * Comparing Distributions of MAT scores across groups of testees *;
    PROC SORT DATA=MAT;
        BY TESTSITE WHENRECD;

    PROC UNIVARIATE PLOT DATA=MAT;
        TITLE5 'Sample Distributions of MAT Scores, by Test Site and Week Received';
        VAR MATSCOR;
        BY TESTSITE WHENRECD;
        FORMAT TESTSITE SITEFMT. WHENRECD WHENFMT.;

Notes on the program:
    To split the sample, first you need to sort it by the categorical variables of interest. Here, I have sorted first by TESTSITE and then by WHENRECD, so the data will be ordered by “Early” and “Late” within an ordering by “Harvard” and “Elsewhere”. The new analyses should therefore have an ordering that matches the ordering in Activity #1.
    Here’s the usual use of PROC UNIVARIATE to generate “single variable” summary statistics for MATSCOR, with the PLOT option exercised.
    To obtain standard PROC UNIVARIATE analyses for the separate subgroups defined by TESTSITE and WHENRECD, use the “BY” command (you’ve seen this command used before in the categorical data-analysis part of the module). When the “BY” command is implemented along with the “PLOT” option, an interesting “stacking” of the boxplots occurs (see later).
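
The sort-then-summarize pattern is not specific to SAS. As a rough analogue, the Python sketch below sorts a handful of records by test site and then by time received, and only then splits them into subgroups, much as PROC SORT must precede the BY command. The records and the helper name `by_key` are hypothetical and invented for illustration; the codes follow the slide's formats (1 = Harvard, 2 = Elsewhere; 1 = Early, 2 = Late).

```python
from itertools import groupby
from statistics import mean, median

# Hypothetical records (TESTSITE, WHENRECD, MATSCOR), invented for illustration.
records = [
    (2, 1, 70), (1, 2, 80), (2, 2, 50), (1, 1, 85),
    (2, 1, 60), (1, 1, 75), (2, 2, 64), (1, 2, 72),
]
SITE = {1: "Harvard", 2: "Elsewhere"}
WHEN = {1: "Early", 2: "Late"}

def by_key(rec):
    """Grouping key: test site first, then when received."""
    return rec[0], rec[1]

# groupby only merges adjacent records, so the data must be sorted first.
summaries = {}
for (site, when), group in groupby(sorted(records, key=by_key), key=by_key):
    scores = [rec[2] for rec in group]
    summaries[(SITE[site], WHEN[when])] = (len(scores), mean(scores), median(scores))

for label, (n, avg, med) in summaries.items():
    print(label, "n =", n, "mean =", avg, "median =", med)
```

The subgroups come out as Early and Late within Harvard, followed by Early and Late within Elsewhere, which is the ordering the slide describes.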

Slide 19

Conclusions?

Mean scores of those who took the MAT test at Harvard are generally higher than the mean scores of applicants who took the test elsewhere. Why?
    Perhaps applicants who took the test at Harvard were already Master’s students here, and were therefore already a highly selected sample.
    Perhaps the mean scores of those taking the test elsewhere were lower because the sample of folk taking the test was much more inclusive of all members of the general population?

The sample distribution of MAT scores is less spread out for those who took the test at Harvard:
    Perhaps this further indicates that Harvard test takers were a selected group, maybe the top tail of the general population.

The scores of applicants who took the test elsewhere are more spread out, in general, than those who took the test at Harvard:
    Interestingly, the sample distribution of the “early, elsewhere” group looks a little similar to that of those who took the test at Harvard, but the distribution has a long lower tail.
    Perhaps there is still some self-selection going on here, with more highly motivated (and therefore “self-selected”) folk tending to apply early.
    Perhaps the long lower tail is a few folk, like foreign students, who found the test difficult because it was in English?

Those who took the test elsewhere and applied late had a lower mean, a larger spread, and a very symmetric distribution:
    Most like a sample drawn from the general population?
    Perhaps those who took the test elsewhere and submitted a late application were busy with work, like everyone else in the general population, and they just found it hard to get to the post office on time?