What Types Of Data Are Collected? What Kinds Of Question Can Be Asked Of Those Data?

© Willett, Harvard University Graduate School of Education, 1/31/2016, S010Y/C08 – Slide 1
S010Y: Answering Questions with Quantitative Data. Class 8/III.1: Displaying and Summarizing Continuous Data
Research Is A Partnership Of Questions And Data
Questions That Require Us To Describe Single Features of the Participants:
- “Continuous” Data: How tall are class members, on average? How many hours a week do class members report that they study? …?
- “Categorical” Data: How many members of the class are women? What proportion of the class is full-time? …?
Questions That Require Us To Examine Relationships Between Features of the Participants:
- “Continuous” Data: Do people who say they study for more hours also think they’ll finish their doctorate earlier? Are computer literates less anxious about statistics? …?
- “Categorical” Data: Are men more likely to study part-time? Are women more likely to enroll in CCE? …?

Slide 2
It is more difficult to summarize the sample distribution of a continuous variable, like MAT score, than it is to summarize the sample distribution of a categorical variable, because the sample distributions of continuous variables like MAT scores have so many interesting properties, including:
- The “center” or “location” of the batch.
- The “spread” of the batch.
- The “one-sidedness” of the batch.
- The “peakiness” of the batch.
We have distinguished two broad approaches for creating statistical summaries of these properties:
- Approach #1, based on the ordering of data values: median, quartiles, percentiles, inter-quartile range, …
- Approach #2, based on the arithmetic manipulation of data values: mean, standard deviation, skewness, kurtosis, …
Last time, I focused on generating such summaries using the ordering principle. Today, I’ll focus on generating summaries using the arithmetic manipulation principle.

Slide 3
Let’s use the arithmetic principle to develop a statistic for describing the center of the distribution of the values of a continuous variable like MAT score … for the “Early, Elsewhere” batch, for instance …
A good summary statistic for describing the center of a distribution of the values of a continuous variable is the place where the distribution would need to be supported so that it could “balance.”

Slide 4
A good summary statistic for describing the center of the distribution of the values of a continuous variable, like MAT score, is the place where the distribution must be supported for it to balance. This balance point is known as the sample mean, or average.

Slide 5
Let’s use the arithmetic principle to create a summary statistic for describing the spread of the distribution of values of a continuous variable … how about the “average distance from the center”?
Why don’t we just find the average distance of all the “blocks” from the center?

Slide 6
When you sum the signed distances from the mean, everything goes to zero, so what do we do now …?
Let’s do what we’ve done before: square all the distances before averaging.
Now I guess we should take the square root, to reverse the squaring that we did to begin with.
Let’s call this the standard deviation.
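
As a minimal sketch of the steps above, take a hypothetical batch of five scores (50, 60, 70, 80, 90; invented for illustration, not taken from MAT.txt). The mean is 70, the signed distances from it are -20, -10, 0, 10 and 20, and they sum to zero. The squared distances are 400, 100, 0, 100 and 400, which sum to 1000. Dividing by n = 5 and taking the square root gives about 14.14. Dividing by n - 1 = 4, which is the divisor SAS uses unless told otherwise, gives about 15.81. The dataset name DEMO and the variable name SCORE are assumptions made for this example.

```sas
* Hypothetical five-score batch, invented for illustration only;
DATA DEMO;
   INPUT SCORE @@;
   DATALINES;
50 60 70 80 90
;
RUN;

* Default divisor is n - 1, so STD should be about 15.81;
PROC MEANS DATA=DEMO N MEAN STD;
   VAR SCORE;
RUN;

* VARDEF=N divides by n instead, so STD should be about 14.14;
PROC MEANS DATA=DEMO VARDEF=N N MEAN STD;
   VAR SCORE;
RUN;
```

The two runs differ only in the divisor. The n - 1 version is the one that appears as "Std Deviation" in default PROC UNIVARIATE output, so it is the one to compare against the handouts.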

Slide 7
And so, creating summary statistics based on the arithmetic principle, here’s the story so far…
[Figure: the distribution marked with the mean (Mean = 63.4) and with the standard deviation around the mean]

Slide 8
You don’t have to do all these computations by hand; SAS can do them for you:
- Here are the MAT data you worked with, supplemented by data from the 1987 cohort.
- All in the MAT.txt dataset.
[Data listing, 74 cases omitted. The columns are annotated as follows:]
- Entering cohort: 1 = 1987, 2 = 1989
- ID label
- Location of test site: 1 = Harvard, 2 = Elsewhere
- When the test was received in the Admissions Office: 1 = Early, 2 = Late
- Raw MAT score

Slide 9
Here’s a PC-SAS program to provide descriptive univariate statistics on these data … (Handout C08_1)
    OPTIONS Nodate Pageno=1;
    TITLE1 'S010Y: Answering Questions with Quantitative Data';
    TITLE2 'Class 8/Handout 1: Displaying and Summarizing Continuous Data, Part I';
    TITLE3 'MAT Scores from 2 Years of Doctoral Applicants';
    TITLE4 'Data in MAT.txt';
    * Input data, name and label variables in dataset;
    DATA MAT;
        INFILE 'C:\DATA\S010Y\MAT.txt';
        INPUT YEARTEST ID WHENRECD MATSCOR TESTSITE;
        LABEL ID       = 'Case identification number'
              YEARTEST = 'Year test taken'
              WHENRECD = 'When application received'
              MATSCOR  = 'Millers Analogies Test Score'
              TESTSITE = 'Test site';
    * Format labels for values of categorical variables;
    PROC FORMAT;
        VALUE YEARFMT 1='1987' 2='1989';
        VALUE WHENFMT 1='Early' 2='Late';
        VALUE SITEFMT 1='Harvard' 2='Elsewhere';
Notes on the program:
- DATA step: standard data input statements; notice that there are several other variables in the dataset.
- PROC FORMAT: the usual process of formatting the categorical variables.

Slide 10
And here’s the rest of the PC-SAS program … this part provides the requested univariate descriptive statistics...
    * Data Listing;
    PROC PRINT LABEL DATA=MAT;
        TITLE5 'Listing of MAT Scores & Background Variables for all Applicants';
        VAR ID YEARTEST WHENRECD TESTSITE MATSCOR;
        FORMAT YEARTEST YEARFMT. WHENRECD WHENFMT. TESTSITE SITEFMT.;
    * Displaying and summarizing the MAT scores for the whole sample;
    PROC UNIVARIATE PLOT DATA=MAT;
        TITLE5 'Univariate Descriptive Summaries of MAT Score for all Applicants';
        VAR MATSCOR;
        ID ID;
    RUN;
Notes on the program:
- PROC PRINT: printing, titling and formatting a few cases for inspection.
- PROC UNIVARIATE provides all kinds of univariate (“single variable”) descriptive statistics for continuous variables.
- The PLOT option requests various data plots, including the stem-and-leaf plot.
- The VAR command specifies the continuous variable to be summarized.
- The ID command identifies a variable that contains respondent identifying information.

Slide 11
Here’s a listing of a few cases from the dataset … Each row is a case, as usual.
    S010Y: Answering Questions with Quantitative Data
    Class 8/Handout 1: Displaying and Summarizing Continuous Data, Part I
    MAT Scores from 2 Years of Doctoral Applicants
    Data in MAT.txt
    Listing of MAT Scores and Background Variables for all Applicants
Columns: Obs | Case identification number | Year test taken | When application received | Test site | Millers Analogies Test Score
[Listing rows: only the “When application received” and “Test site” entries are legible. Ten rows read Early, Elsewhere; the next eight read Late, of which the second is Harvard and the other seven are Elsewhere. The only numeric entry that survives is a trailing 54.]
[Image: Harvard graduation, 1890. The six class day speakers, with W.E.B. Du Bois on the far right.]

Slide 12
And the “ordering” and “arithmetic manipulation” summary statistics for MATSCOR are …
Variable: MATSCOR (Millers Analogies Test Score)
[PROC UNIVARIATE output with Moments, Basic Statistical Measures and Quantiles tables; only the entries below are legible]
    Moments:   N = 90, Sum Weights = 90, Sum Observations = 5705
    Quantiles: 100% Max = 96, 75% Q3 = 77, 50% Median = 65, 25% Q1 = 53, 0% Min = 18
Reading the output:
- The sample mean of MATSCOR is 63.4 (Sum Observations divided by N, 5705/90).
- The sample standard deviation of MATSCOR is the Std Deviation entry in the Moments table (the value is not legible here).
- The median (or 50th percentile) of MATSCOR is 65.
- The inter-quartile range is the difference between the upper and lower quartiles: Lower quartile = 53, Upper quartile = 77, Inter-quartile range = (77-53) = 24.
- The range is the difference between the minimum and the maximum: Minimum = 18, Maximum = 96, Range = (96-18) = 78.
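
The quoted summaries can be re-derived with a few lines of arithmetic. This sketch uses only numbers printed on the slide (Sum Observations 5705, N 90, quartiles 53 and 77, extremes 18 and 96). The variable names AVG, IQR and SPAN are assumptions made for this example, and nothing here reads MAT.txt.

```sas
* Arithmetic check using only values printed in the PROC UNIVARIATE output;
DATA _NULL_;
   AVG  = 5705 / 90;   * Sum Observations divided by N;
   IQR  = 77 - 53;     * Upper quartile minus lower quartile;
   SPAN = 96 - 18;     * Maximum minus minimum;
   PUT AVG= 6.2 IQR= SPAN=;
RUN;
```

The PUT statement writes the three results to the SAS log. The mean comes out at 63.39, which rounds to the 63.4 quoted elsewhere in the class, and the inter-quartile range and range come out at 24 and 78.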

Slide 13
Here’s SAS’s version of the stem-and-leaf plot for the values of MATSCOR …
[Stem-and-leaf plot of Millers Analogies Test Score, with columns Stem, Leaf and #. Footer: Multiply Stem.Leaf by 10**+1]
This is scientific notation: 1.8 x 10^1 = 18, etc. And don’t forget the inverses …

Slide 14
We can bring several of these univariate descriptive statistics, both the “ordering” and “arithmetic manipulation” versions, together in a useful single summary figure called the “box and whisker” plot, or boxplot…
Recall that, for the full sample (n=90):
- Minimum, Maximum & Range: Min = 18, Max = 96, Range = 78
- Quartiles, Median & Inter-Quartile Range: 25th %ile (Q1) = 53, Median = 65, 75th %ile (Q3) = 77, Interquartile Range = 24
- Mean: Mean = 63.4

Slide 15
And here’s the PROC UNIVARIATE version of the box-plot from the previous handout…
[The UNIVARIATE Procedure, Variable: MATSCOR (Millers Analogies Test Score). Stem-and-leaf plot with a boxplot drawn alongside it; in this text boxplot the “*-----*” line marks the median and the “+” inside the box marks the mean. Footer: Multiply Stem.Leaf by 10**+1]
- What would the box-plot look like if the sample distribution of MATSCOR were perfectly symmetrical?
- What would the box-plot look like if there were very little variability in MATSCOR in the sample?
- What features of the sample distribution of MATSCOR account for the fact that the sample mean is smaller than the sample median?
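
One way to explore the last question is to compare two small hypothetical batches that differ in a single value. In the sketch below, SYMMETRIC holds 55, 60, 65, 70, 75 and LOWTAIL holds 20, 60, 65, 70, 75, so the only change is that the smallest score has been pulled far down. Both batches are invented for illustration, and the dataset and variable names are assumptions, not part of the course materials.

```sas
* Two hypothetical five-score batches, invented for illustration only;
DATA TAILDEMO;
   INPUT SYMMETRIC LOWTAIL;
   DATALINES;
55 20
60 60
65 65
70 70
75 75
;
RUN;

* Compare the mean and the median of each batch;
PROC MEANS DATA=TAILDEMO N MEAN MEDIAN MIN MAX;
   VAR SYMMETRIC LOWTAIL;
RUN;
```

For SYMMETRIC the mean and median are both 65. For LOWTAIL the median is still 65, because the ordering of the middle value has not changed, but the mean falls to 58, because the arithmetic uses the size of every value. A single low score is enough to pull the mean below the median.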

Slide 16
An interesting aside on the normal distribution …
A considerable number of continuous variables that occur “naturally” turn out to be approximately “normally distributed”:
- Height
- Weight
- Test Scores
- Opinions, etc.
If you were to plot a vertical histogram of the values of variables like these, you would get the familiar “bell-shaped curve”…
There is a special relationship between percentiles and standard deviation in a normal distribution.
[Figure: normal curve marked at Mean -2sd, Mean -1sd, Mean +1sd and Mean +2sd]
[Links on the slide: Normal distribution simulation; Ball-drop simulation]
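
The relationship between percentiles and standard deviations can be sketched numerically. The example below uses the SAS function PROBNORM, which returns the proportion of a standard normal distribution lying below a given number of standard deviations from the mean. The variable names are assumptions made for this example, and the figures describe an exactly normal distribution, not the MAT sample.

```sas
* Proportions of an exactly normal distribution, written to the SAS log;
DATA _NULL_;
   WITHIN1 = PROBNORM(1) - PROBNORM(-1);   * Between mean - 1sd and mean + 1sd;
   WITHIN2 = PROBNORM(2) - PROBNORM(-2);   * Between mean - 2sd and mean + 2sd;
   PCTILE1 = 100 * PROBNORM(1);            * Percentile located at mean + 1sd;
   PUT 'Proportion within 1 sd of the mean: ' WITHIN1 6.4;
   PUT 'Proportion within 2 sd of the mean: ' WITHIN2 6.4;
   PUT 'Percentile at mean + 1sd: ' PCTILE1 6.2;
RUN;
```

About 68% of a normal distribution lies within one standard deviation of the mean and about 95% within two, so the point one standard deviation above the mean sits at roughly the 84th percentile.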

Slide 17
The boxplot is very useful if you want to compare sample distributions of a continuous variable like MATSCOR across different groups, as in Activity #1 (see Handout C08_2) …
    OPTIONS Nodate Pageno=1;
    TITLE1 'S010Y: Answering Questions with Quantitative Data';
    TITLE2 'Class 8/Handout 2: Displaying and Summarizing Continuous Data, Part II';
    TITLE3 'Using Boxplots To Compare MAT Scores of Doctoral Applicants to APSP';
    TITLE4 'Data in MAT.txt';
    * Input data, name and label variables in dataset;
    DATA MAT;
        INFILE 'C:\DATA\S010Y\MAT.txt';
        INPUT YEARTEST ID WHENRECD MATSCOR TESTSITE;
        IF YEARTEST = 2;  * Pick out 1989 Cohort for comparison with Activity #1;
        LABEL ID       = 'Case identification number'
              YEARTEST = 'Year test taken'
              WHENRECD = 'When application received'
              MATSCOR  = 'Millers Analogies Test Score'
              TESTSITE = 'Test site';
    * Format labels for the values of the categorical variables;
    PROC FORMAT;
        VALUE WHENFMT 1='Early' 2='Late';
        VALUE SITEFMT 1='Harvard' 2='Elsewhere';
Notes on the program:
- Let’s use categorical variables WHENRECD and TESTSITE to sub-divide the sample, so that we can compare sub-sample distributions of MATSCOR using boxplots … like original Activity #1.
- Here, I’ve picked out only applicants in the 1989 (YEARTEST = 2) cohort, so that the new analyses will match the analyses that you conducted in original Activity #1.

Slide 18
And here’s the rest of the PC-SAS program…
    * Comparing Distributions of MAT scores across groups of testees;
    PROC SORT DATA=MAT;
        BY TESTSITE WHENRECD;
    PROC UNIVARIATE PLOT DATA=MAT;
        TITLE5 'Sample Distributions of MAT Scores, by Test Site and Week Received';
        VAR MATSCOR;
        BY TESTSITE WHENRECD;
        FORMAT TESTSITE SITEFMT. WHENRECD WHENFMT.;
    RUN;
To split the sample, first you need to sort it by the categorical variables of interest:
- Here, I have sorted first by TESTSITE and then by WHENRECD.
- So, the data will be ordered by “Early” and “Late” within an ordering by “Harvard” and “Elsewhere”.
- The new analyses should therefore have an ordering that matches the ordering in Activity #1.
Here’s the usual use of PROC UNIVARIATE to generate “single variable” summary statistics for MATSCOR, with the PLOT option exercised.
To obtain standard PROC UNIVARIATE analyses for the separate subgroups defined by TESTSITE and WHENRECD, use the “BY” command (you’ve seen this command used before in the categorical data-analysis part of the module):
- When the “BY” command is implemented along with the “PLOT” option, an interesting “stacking” of the boxplots occurs (see later).
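
If only a compact table of subgroup summaries is needed, without the plots, one possible alternative is PROC MEANS with a CLASS statement. This is a sketch and not part of the original handout. It assumes the MAT dataset and the SITEFMT and WHENFMT formats created by the program above, and unlike the BY command it does not require the data to be sorted first.

```sas
* Subgroup summaries of MATSCOR in a single table;
* Assumes dataset MAT and formats SITEFMT and WHENFMT already exist;
PROC MEANS DATA=MAT N MEAN STD MEDIAN QRANGE;
   CLASS TESTSITE WHENRECD;
   VAR MATSCOR;
   FORMAT TESTSITE SITEFMT. WHENRECD WHENFMT.;
RUN;
```

Each row of the resulting table is one combination of test site and time received, with the mean and standard deviation (arithmetic manipulation) next to the median and inter-quartile range (ordering). PROC UNIVARIATE with BY and PLOT is still needed for the stacked boxplots.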

Slide 19
Conclusions?
- Mean scores of those who took the MAT test at Harvard are generally higher than the mean scores of applicants who took the test elsewhere. Why?
  - Perhaps applicants who took the test at Harvard were already Master’s students here, and were therefore already a highly selected sample.
  - Perhaps the mean scores of those taking the test elsewhere were lower because the sample of folk taking the test was much more inclusive of all members of the general population?
- The sample distribution of MAT scores is less spread out for those who took the test at Harvard:
  - Perhaps this further indicates that Harvard test takers were a selected group, maybe the top tail of the general population.
- The scores of applicants who took the test elsewhere are more spread out, in general, than those who took the test at Harvard:
  - Interestingly, the sample distribution of the “early, elsewhere” group looks a little similar to that of those who took the test at Harvard, but the distribution has a long lower tail.
  - Perhaps there is still some self-selection going on here, with more highly motivated (and therefore “self-selected”) folk tending to apply early.
  - Perhaps the long lower tail is a few folk, like foreign students, who found the test difficult because it was in English?
- Those who took the test elsewhere and applied late had a lower mean, a larger spread, and the distribution was very symmetric:
  - Most like a sample drawn from the general population?
  - Perhaps those who took the test elsewhere and submitted a late application were busy with work, like everyone else in the general population, and they just found it hard to get to the post office on time?