Plan and Data
Are you aware of concepts such as sample, population, sample distribution, population distribution, sampling variability?
PPDAC
The plan The goal in a sampling process is to obtain a sample to represent the population of interest. For us, this means choosing an appropriate sample size.
What makes a good sample In common language usage, a sample is representative of the population if characteristics in the sample are a reflection of those in the parent population.
Under this meaning, a truly representative sample almost never exists. In statistical jargon a representative sample means that the sampling process produces samples in which there is no tendency for certain characteristics to differ from those in the population in some systematic manner, e.g., all random samples could be viewed as representative samples.
Sample size The aim of statistical testing is to uncover a significant difference when it actually exists. In its simplest form this involves comparing samples.
Sample size is important because larger samples increase the chance of finding a significant difference (if it exists), but larger samples cost more money. Sometimes a larger sample is not a possibility.
Sample size In general, the larger the sample size, the better the sample reflects the characteristics of the population.
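One way to see this is with a small simulation, sketched here in Python. The population is invented (5000 foot lengths drawn from a normal distribution with mean 24 cm and standard deviation 2 cm) and the function name median_spread is hypothetical; none of it is real CensusAtSchool data. The sketch takes many random samples of each size and measures how much the sample median varies from sample to sample.

```python
import random
import statistics

def median_spread(population, sample_size, repeats=500, seed=1):
    """Standard deviation of the sample median across repeated random samples."""
    rng = random.Random(seed)
    medians = [
        statistics.median(rng.sample(population, sample_size))
        for _ in range(repeats)
    ]
    return statistics.stdev(medians)

# Hypothetical population: 5000 foot lengths (cm), invented for illustration only.
pop_rng = random.Random(0)
population = [pop_rng.gauss(24.0, 2.0) for _ in range(5000)]

# Compare how much the sample median moves around for small and large samples.
for n in (10, 25, 100):
    print(n, round(median_spread(population, n), 2))
```

Under these assumptions the spread of the sample medians should shrink as the sample size grows, which is the sampling variability idea in action.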
Sample size Larger sample sizes also help to give a better idea of the shape of the distribution.
Sample size So the sample size is chosen to maximise the chance of uncovering a specific difference in means or medians that is also statistically significant.
Note: A specific difference and statistical significance are two quite different ideas.
Remember box plots: this is what we want to produce so we can comment on what we notice and then what we infer (more about this later).
Here the medians have a specific difference, but the difference is not statistically significant. More about this later.
The specific difference (difference in means and/or medians) is found by the researcher in terms of the outcome measure of an experiment or investigation. In this Achievement Standard, we are going to perform an investigation that compares two sets of data.
Examples of specific difference For instance, the difference in mean right foot length between Year 11 boys and Year 11 girls from the 2012 Census at Schools database; a 3 kg mean weight change in a diet experiment; a 10% mean improvement in a teaching method experiment.
Statistical significance is a probability statement telling us how likely it is that a difference at least as large as the one observed would arise by chance alone.
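The idea of a difference arising by chance alone can be made concrete with a randomisation (permutation) test. This is a technique chosen for this sketch, not necessarily the method used in this Achievement Standard. The function name randomisation_p_value and both data lists are invented for illustration. The sketch mixes the two groups together, re-deals them at random many times, and counts how often the shuffled difference in medians is at least as big as the observed one.

```python
import random
import statistics

def randomisation_p_value(group_a, group_b, repeats=2000, seed=1):
    """Proportion of random re-deals whose median difference is at least the observed one."""
    rng = random.Random(seed)
    observed = abs(statistics.median(group_a) - statistics.median(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_big = 0
    for _ in range(repeats):
        rng.shuffle(pooled)  # group labels no longer matter after shuffling
        diff = abs(statistics.median(pooled[:n_a]) - statistics.median(pooled[n_a:]))
        if diff >= observed:
            at_least_as_big += 1
    return at_least_as_big / repeats

# Invented right foot lengths (cm), for illustration only.
boys = [25.0, 26.5, 24.0, 27.0, 25.5, 26.0, 24.5, 28.0, 25.0, 26.0]
girls = [23.0, 24.0, 22.5, 25.0, 23.5, 24.5, 22.0, 24.0, 23.0, 25.5]
print(randomisation_p_value(boys, girls))
```

A small proportion suggests that chance alone rarely produces a difference this large; a large proportion suggests the observed difference could easily be due to chance.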
Larger samples increase your chance of significance because they more reliably reflect the population mean/median.
PLAN Your plan must: define the variables you will investigate; decide how you will measure these variables; note what things might affect the measures you take (managing sources of variation); decide how many measures you need to collect (sample size); explain how your data will be obtained and recorded.
define the variables you will investigate; The variable we are investigating is the length of the right foot of Year 13 boys and Year 13 girls. The lengths are measured in cm.
decide how you will measure these variables; note what things might affect the measures you take (managing sources of variation); If you are taking these measurements yourself, you will standardise the method: E.g. To minimise measurement errors, the measurements will be taken with the shoe removed and from the longest toe to the back of the heel. To get consistent measurements, I will get each person to place their right foot against the wall and mark the position of the longest toe using a ruler. I will check that the foot is at right angles to the wall.
decide how you will measure these variables; note what things might affect the measures you take (managing sources of variation); If you download measurements, talk about your reservations about how the measurements were taken. E.g. As each person measured their own foot for the database, it is unlikely that the measurement method used was the same for everyone and hence the measurements will contain measurement errors.
decide how many measures you need to collect (sample size);
Note: You need to choose a discrete (grouping) variable and a continuous variable. Discrete: many of the data values are the same, e.g. gender, site. Continuous: the data values are generally all different, e.g. weight, height, length.
decide how many measures you need to collect (sample size); “I will get our two random samples using the 2011 CensusAtSchool random sampler. I will take a random sample of 25 boys from the population of 13 year-old NZ boys in the 2011 CensusAtSchool database. I will take a random sample of 25 girls from the population of 13 year-old NZ girls in the 2011 CensusAtSchool database.”
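The CensusAtSchool random sampler draws the samples for you. If you have a downloaded data file instead, the same idea can be sketched in Python as below. The function name sample_by_group, the field names "gender" and "foot_cm", and the records themselves are all hypothetical and invented for illustration.

```python
import random

def sample_by_group(records, group_key, size, seed=1):
    """Take a simple random sample of `size` records from each group."""
    rng = random.Random(seed)
    groups = {}
    for record in records:
        groups.setdefault(record[group_key], []).append(record)
    # Sample without replacement; a group smaller than `size` is returned
    # whole (in random order).
    return {
        name: rng.sample(members, min(size, len(members)))
        for name, members in groups.items()
    }

# Hypothetical records standing in for a downloaded data file.
data_rng = random.Random(0)
records = [
    {"gender": data_rng.choice(["boy", "girl"]),
     "foot_cm": round(data_rng.gauss(24.0, 2.0), 1)}
    for _ in range(400)
]

samples = sample_by_group(records, "gender", 25)
for name, members in samples.items():
    print(name, len(members))
```

Sampling each group separately means every boy has the same chance of selection as every other boy, and likewise for girls, which is what makes each sample a random sample of its own population.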
Talk about your sample Ask yourself “Is it reasonable that these samples are representative of the population?” “Is it reasonable to assume that samples taken from the Census at Schools database would represent all Year 13 students in New Zealand?”
Here is a ‘worry’ list I worry about the quality of the foot length data since students measured and recorded their own foot lengths. Were measurements made with shoes on or shoes off? Would all students have seen ‘cm’ to the right of the entry box? To what level of precision did the students make their measurement? Why were there missing values?
We need to mention that we are concerned about the accuracy of the data, and that this could be improved by having the data collected in exactly the same way with a more detailed requirement, e.g. the students place their foot against the wall and the measurement is taken from the wall to the end of the longest toe, rounded to the nearest cm.
Data
Census at Schools Data Viewer
What should I be concerned about? These are the questions that were asked.
What should I be concerned about? How do we know if the student was actually a Year 13 student?
What should I be concerned about? Can we be sure that the students took off their shoe to measure? How accurately did they measure?
Getting the data
We didn’t get exactly 25 in each category. Does this matter?
Sample size It is not necessary to have equal sample sizes as long as the samples are representative of the population.
24 in each sample
Data summary
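A box plot is drawn from five summary values: the minimum, lower quartile, median, upper quartile and maximum. The sketch below computes them in Python. The function name five_number_summary and the foot lengths are invented for illustration, and the "inclusive" quartile method is an assumption of this sketch: different software uses slightly different quartile rules, so values may differ a little from those shown by other tools.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    ordered = sorted(values)
    # quantiles(..., n=4) returns the three cut points that split the data into quarters.
    lower_q, _, upper_q = statistics.quantiles(ordered, n=4, method="inclusive")
    return (ordered[0], lower_q, statistics.median(ordered), upper_q, ordered[-1])

# Invented foot lengths (cm), for illustration only.
foot_lengths = [22.0, 23.5, 24.0, 24.0, 25.0, 25.5, 26.0, 27.0, 28.5]
print(five_number_summary(foot_lengths))
```

Computing the summary for each group side by side gives the numbers needed to compare medians and the overlap of the boxes.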
Were 24 data points enough to talk about the distribution, shape, etc.?
The shape is not obvious, but we should think about what we expect from foot length.
We could argue that the distribution is likely to be symmetrical as we would expect a cluster of data in the middle and some extreme values either side.
Because equal sample sizes don’t matter, I have just asked for a total of 100.
The shape is now becoming clearer.
This no longer says ‘Year 13’, so I wonder if there is enough data for this sample size.
It is better to collect data from Year 9 or 10 as there is likely to be more data available for these year levels.