Chapter 1 – Statistics I
Learning Outcomes
In this chapter you have learned about:
How to find, collect and organise data
How to generate data from other sources
Types of data
Populations and samples
How to select a sample
How to use stem-and-leaf plots and histograms to display data
Scatterplots, correlation and line of best fit
Types of Data
Definitions
Categorical data: Data that answers questions which cannot be answered with numbers. Examples: eye colour, exam grades. There are two types of categorical data:
Ordinal (the categories can be ordered). Example: exam grades A, B, C, D, E, F.
Nominal (the categories have no order). Example: eye colour blue, green, brown.
Numerical data: Data that answers questions which can be answered with numbers. Examples: height, shoe size. There are two types of numerical data:
Continuous (can take any value within a range). Example: height 1∙65 m, 1∙756 m.
Discrete (can take only certain separate values). Example: shoe size 4, 4½, 6, 7½.
[Diagram: types of data. Categorical data splits into ordinal and nominal; numerical data splits into continuous and discrete.]
Primary and Secondary Data
Definitions
Primary data: Data collected by or for the person who is going to use it.
Observational studies: The researcher collects information but does not influence events. Example: a study into the TV viewing habits of teenagers, in which data is collected by means of a questionnaire.
Designed experiments: Some treatment is applied to a group and the effects are observed. Example: pharmaceutical companies carry out designed experiments when they are testing new drugs.
NOTE: The drug is the explanatory variable. The effect of the drug is the response variable.
Secondary data: Data that was not collected by the person who is going to use it. Sources of secondary data include the Internet, newspapers, books and databases.
Sample Surveys
Definitions
Population: The entire group that is being studied.
Sample: A group selected from the population and studied in order to gather information about the population.
Census: A survey of the whole population.
Bias in sampling: Samples that are not representative of the population are called biased samples.
Simple random sample: A sample in which every member of the population has an equal chance of being selected.
Stratified random sample: The population is divided into at least two subgroups, then a simple random sample is drawn from each subgroup.
Systematic random sample: The members of the population are listed, a starting point is chosen at random, and then every k-th member on the list is selected (for example, every 10th name).
Cluster sample: The population is divided into clusters. Some of these clusters are then randomly selected, and all members of the selected clusters are included in the sample.
Quota sample: A non-probability sampling method. The sample is selected to fill prescribed percentages of people from various subgroups.
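The probability sampling methods defined above can be tried out in a few lines of Python. This is only a minimal sketch under stated assumptions: the population of 60 student ID numbers, the three class groups and all function names are made up for illustration, and systematic sampling is taken in its usual sense of a random starting point followed by every k-th member of the list.

```python
import random

def simple_random_sample(population, n, rng):
    """Every member of the population has an equal chance of selection."""
    return rng.sample(population, n)

def stratified_sample(strata, n_per_stratum, rng):
    """Draw a simple random sample from each subgroup (stratum)."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def systematic_sample(population, step, rng):
    """Pick a random starting point, then take every step-th member."""
    start = rng.randrange(step)
    return population[start::step]

def cluster_sample(clusters, n_clusters, rng):
    """Randomly select whole clusters and keep all of their members."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

# Hypothetical population: student ID numbers 1 to 60 in three classes.
rng = random.Random(1)  # fixed seed so the run is repeatable
students = list(range(1, 61))
classes = {"1A": students[:20], "1B": students[20:40], "1C": students[40:]}

srs = simple_random_sample(students, 10, rng)
strat = stratified_sample(classes, 5, rng)
syst = systematic_sample(students, 6, rng)
clus = cluster_sample(classes, 2, rng)

print("Simple random:", srs)
print("Stratified:", strat)
print("Systematic:", syst)
print("Cluster:", clus)
```

Quota sampling is left out because it is a non-probability method: the interviewer, not a random process, decides who fills each quota.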
Collecting Data
The most common way of collecting data is by survey. Most surveys use a questionnaire. The survey can be carried out by:
Face-to-face interview
Telephone interview
Sending a questionnaire by post
Making a questionnaire available online
Observation
Advantages and disadvantages of each type of survey
Survey | Advantages | Disadvantages
Face-to-face interview | Questions can be explained. | Not random. Expensive to carry out.
Telephone interview | Can sample nearly the entire population. | Expensive in comparison to postal and online surveys.
Postal questionnaire | Inexpensive. | People do not always reply to postal surveys.
Online questionnaire | Very low cost. Anonymity ensures more honest answers. | Not representative. Only those who go online are represented.
Observation | Low cost. Easy to administer. | Not suitable for all surveys. Questions can’t be explained.
Stem-and-Leaf Diagrams
Stem-and-leaf diagrams represent data in a similar way to bar charts.
Twenty people from the audience of a TV programme are randomly selected and each person is asked his/her age. Their ages are as follows:
15, 45, 19, 11, 14, 13, 57, 38, 25, 51, 47, 46, 23, 62, 56, 21, 33, 48, 44, 16.
Represent the data on a stem-and-leaf diagram.
Ordered data
11, 13, 14, 15, 16, 19, 21, 23, 25, 33, 38, 44, 45, 46, 47, 48, 51, 56, 57, 62.
STEM | LEAVES
1 | 1 3 4 5 6 9
2 | 1 3 5
3 | 3 8
4 | 4 5 6 7 8
5 | 1 6 7
6 | 2
Key: 1|4 = 14
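The same diagram can be built step by step in Python: order the data, then split each age into a stem (tens digit) and a leaf (units digit). This is a minimal sketch that assumes whole numbers below 100; the function name stem_and_leaf is illustrative only.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: [leaves]} with leaves in increasing order."""
    plot = defaultdict(list)
    for value in sorted(values):          # order the data first
        stem, leaf = divmod(value, 10)    # tens digit, units digit
        plot[stem].append(leaf)
    return dict(plot)

ages = [15, 45, 19, 11, 14, 13, 57, 38, 25, 51,
        47, 46, 23, 62, 56, 21, 33, 48, 44, 16]

plot = stem_and_leaf(ages)

# Print every stem from smallest to largest, even one with no leaves.
for stem in range(min(plot), max(plot) + 1):
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
print("Key: 1|4 = 14")
```

Printing stems that have no leaves keeps the shape of the distribution visible, which is the reason the diagram is compared to a bar chart.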
Back-to-Back Stem-and-Leaf Diagrams
These are a useful way of comparing data from two different groups.
Researchers for Consumer Reports analysed two types of hot dog: one with meat and one with poultry. The number of calories in each hot dog was recorded. The results are represented on a back-to-back stem-and-leaf diagram.
[Diagram: back-to-back stem-and-leaf diagram of the calorie counts, with stems 8 to 18 in the centre and one type of hot dog on each side.]
Key (left side): 2|18 = 182 calories
Key (right side): 15|2 = 152 calories
Histograms and Distribution of Data
The histogram shows the times (in minutes) taken by a group of 14 students to complete a maths problem:
[Histogram: horizontal axis marked 2, 4, 6, 8, 10]
We call this a symmetric distribution.
The histogram shows the times (in minutes) taken by a group of 21 students to complete the same maths problem:
[Histogram: horizontal axis marked 2, 4, 6, 8, 10]
We call this a negatively skewed distribution or a skewed left distribution. The tail of the distribution is on the left.
The histogram shows the times of a group of 25 students who also completed the maths problem:
[Histogram: horizontal axis marked 2, 4, 6, 8, 10]
We call this a positively skewed distribution or a skewed right distribution. The tail of the distribution is on the right.
Scatter Graphs and Correlation
Scatter graphs are used to investigate relationships between two sets of numerical data.
The table below shows the height (cm) and arm-span (cm) measurements of a group of students.
Height 160 170 165 159 161 163 166 167 169 171 177
Arm-span 168 162 164 175
[Figure: scatter graph of the data, showing the line of best fit and a possible outlier.]
An outlier is a data point that is distant from other data points.
The points lie reasonably close to a straight line (the line of best fit). Therefore, in general, the greater the height, the greater the arm-span.
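A line of best fit can be drawn by eye, or it can be calculated. One common calculation is the least squares line, sketched below in Python. This is an illustration rather than the method used for the graph above: least squares is a substituted technique, the function name is hypothetical, and the four height and arm-span pairs are made-up values chosen to keep the arithmetic simple, not the measurements in the table.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least squares line y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared x deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x   # the line passes through the mean point
    return slope, intercept

# Made-up illustrative measurements in cm (NOT the data from the table).
heights = [150, 160, 170, 180]
arm_spans = [152, 158, 172, 178]

slope, intercept = least_squares_line(heights, arm_spans)
print(f"arm-span = {slope:.2f} x height + {intercept:.2f}")

# Use the line to estimate the arm-span of a student who is 165 cm tall.
estimate = slope * 165 + intercept
print(f"Estimated arm-span at 165 cm: {estimate:.1f} cm")
```

A positive slope matches the conclusion drawn from the graph: in general, the greater the height, the greater the arm-span.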
Types of Correlation
It is important that you state both the direction (positive or negative) and the strength of a correlation when asked for the type of correlation.
[Figure: five scatter graphs of y against x, four of them with a line of best fit drawn by eye.]
Strong positive correlation: correlation coefficient ≈ 0∙98
Strong negative correlation: correlation coefficient ≈ −0∙98
Weak positive correlation: correlation coefficient ≈ 0∙5
Weak negative correlation: correlation coefficient ≈ −0∙5
No correlation: correlation coefficient ≈ −0∙12
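The correlation coefficient quoted for each graph can be calculated from the data. The Python sketch below uses the usual (Pearson) formula and then states the direction and strength. The cut-off values 0.2 and 0.7 in describe are assumptions made for this illustration only, chosen so that they agree with the examples above; they are not official boundaries, and the function names are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient of paired data, a value between -1 and 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def describe(r, weak=0.2, strong=0.7):
    """State direction and strength. The cut-offs are assumed, not standard."""
    if abs(r) < weak:
        return "no correlation"
    strength = "strong" if abs(r) >= strong else "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} correlation"

# Points lying exactly on a rising line and on a falling line.
rising = pearson_r([1, 2, 3], [2, 4, 6])
falling = pearson_r([1, 2, 3], [6, 4, 2])
print(rising, describe(rising))
print(falling, describe(falling))

# The coefficients quoted for the five graphs.
for r in (0.98, -0.98, 0.5, -0.5, -0.12):
    print(r, describe(r))
```

The sign of the coefficient gives the direction and its size gives the strength, which is why both must be stated when describing a correlation.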