How could data be used in an EPQ? Nicholas Martindale 4th November 2017
Aims To help students feel confident in using data. To help students use and present data effectively.
Questions What is data? What questions can we ask with data? How can we use data to answer these questions?
1. What is data? A collection of facts or values which can be processed to provide information. Data is organised into variables and observations.
1. What is data? Variables can be numerical (e.g. counts, percentages) or categorical (e.g. names, groups).
2. What questions can we ask with data?
Summarising data:
- Representative value: What is a typical value?
- Spread: How much variation is in the data?
Presenting data:
- Composition: What’s in the data?
- Distribution: The shape of the data.
- Comparison: Differences between groups.
- Trend: Change over time.
- Relationship: How one thing depends on another.
3. Answering questions: summaries Representative value: what is a typical value? A typical value can help us summarise a large amount of data with a single representative number, e.g. the mean number of students in primary schools is 271. There are different representative values you could choose from.
3. Answering questions: summaries Representative value: what is a typical value?
Measures of centre:
- Mean. Definition: sum of values / total number of values. Advantages: very familiar; uses all the data. Disadvantages: very large/small values can distort the answer. Examples: (1 + 2 + 3) / 3 = 2; (1 + 2 + 33) / 3 = 12.
- Median. Definition: middle value when the data is in order. Advantages: not affected by very large/small values. Disadvantages: only depends on the middle values so may not be representative. Examples: 1, 2, 3: median = 2; 1, 2, 33: median = 2; 0, 0, 0, 0, 0, 0, 2, 9, 9, 9, 9, 9, 9: median = 2.
- Mode. Definition: most common value. Advantages: the only average that can be used with non-numerical data. Disadvantages: there may be no mode; there may be more than one mode. Examples: 1, 2, 3: no mode; 1, 2, 2, 3: mode = 2.
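The three measures of centre can be checked with a few lines of Python. This is a minimal sketch using the standard `statistics` module; the helper name `summarise` is made up for illustration, and the "no mode" rule in it is an assumption taken from the slide's 1, 2, 3 example.

```python
from statistics import mean, median, multimode


def summarise(values):
    """Return (mean, median, modes) for a list of numbers."""
    modes = multimode(values)  # every value that ties for most common
    # Convention assumed from the slide: if all distinct values are equally
    # common (e.g. 1, 2, 3), report "no mode" as an empty list.
    distinct = set(values)
    if len(distinct) > 1 and len(modes) == len(distinct):
        modes = []
    return mean(values), median(values), modes


# The small data sets used as examples on the slide.
for data in ([1, 2, 3], [1, 2, 33], [1, 2, 2, 3]):
    print(data, summarise(data))
```

Comparing the results for 1, 2, 3 and 1, 2, 33 shows the point made above: one very large value moves the mean but leaves the median unchanged.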
3. Answering questions: summaries Spread: how much variation is in the data? A measure of spread can help us summarise how much the data varies or how unequal it is. e.g. The range of number of students in primary schools is 1455 – 5 = 1450. There are different measures of spread you could choose from.
3. Answering questions: summaries Spread: how much variation is in the data?
Measures of spread:
- Range. Definition: largest value – smallest value. Advantages: easy to calculate; familiar to students. Disadvantages: not very informative; distorted by very large/small values; does not use all the data.
- Interquartile range. Definition: upper quartile – lower quartile. Advantages: not distorted by extreme values.
- Standard deviation. Definition: a typical distance of each value from the overall mean (the square root of the mean squared distance). Advantages: statistically sophisticated; calculated in Excel/Google Sheets/R. Disadvantages: unfamiliar to students; inappropriate for skewed data.
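The three measures of spread can be compared on a small example. This is a sketch under stated assumptions: the data values are invented, the helper name `spread` is hypothetical, quartiles use the `inclusive` method of `statistics.quantiles` (other tools may use slightly different quartile rules), and the standard deviation is the population version (`pstdev`).

```python
from statistics import pstdev, quantiles


def spread(values):
    """Return (range, interquartile range, standard deviation)."""
    data_range = max(values) - min(values)
    # quantiles(..., n=4) returns lower quartile, median, upper quartile.
    lower_q, _, upper_q = quantiles(values, n=4, method="inclusive")
    return data_range, upper_q - lower_q, pstdev(values)


# Invented data: the second list swaps the largest value for an extreme one.
typical = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_extreme = [1, 2, 3, 4, 5, 6, 7, 8, 90]

print(spread(typical))
print(spread(with_extreme))
```

The range and standard deviation change a lot when the extreme value is introduced, while the interquartile range stays the same.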
3. Answering questions: advice on figures What advice do students need on the use of figures?
3. Answering questions: advice on figures
How to present figures:
- Think first about what you want to show, then choose a graph.
- Text size and font easy to read.
- Appropriate, informative title.
- Clearly labelled axes including units.
- Clearly labelled data (groups).
- Keep them uncluttered.
Where to put figures:
- Soon after they are referred to in the text.
How to refer to figures:
- All figures should have a reference number e.g. “Figure 2”.
- All figures used should be referred to in the text e.g. “see Figure 2”.
3. Answering questions: presenting data Which type of figure we use depends on the type of question we are trying to answer:
- Composition (What’s in the data?): bar chart for counts; pie chart for proportions.
- Distribution (What’s the shape of the data?): histogram; boxplot.
- Comparison (Differences between groups): side-by-side bar chart; side-by-side boxplot.
- Trend (Changes over time): line graph or multiple bar chart for counts; multiple boxplots for distributions.
- Relationship (How one thing depends on another): scatterplot.
3. Answering questions: composition What’s in the data? We might want to show counts or proportions. Counts: bar charts. Proportions: pie charts.
3. Answering questions: composition Counts: Bar Chart Include 0 on the y-axis so that you don’t mislead the reader.
3. Answering questions: composition What’s wrong here?
3. Answering questions: composition Make sure to include 0 on the y-axis so that you don’t mislead the reader.
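One way to see why a y-axis that does not start at 0 misleads is to compare how tall the bars are drawn. The sketch below is illustrative only: the counts are invented and `apparent_ratio` is a hypothetical helper, not part of any charting tool.

```python
def apparent_ratio(smaller, larger, axis_start=0.0):
    """Ratio of the drawn bar heights when the y-axis starts at axis_start."""
    if axis_start >= smaller:
        raise ValueError("axis_start must be below the smaller value")
    return (larger - axis_start) / (smaller - axis_start)


# Invented counts for two groups.
group_a, group_b = 95, 100

print(apparent_ratio(group_a, group_b))      # y-axis starts at 0
print(apparent_ratio(group_a, group_b, 90))  # y-axis starts at 90
```

With the axis starting at 90, the second bar is drawn twice as tall as the first even though the underlying counts differ by only about 5%.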
3. Answering questions: composition Proportions: Pie Chart - Include percentage labels (the example chart is labelled 2.0%, 3.4%, 26.7% and 68.0%).
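Percentage labels for a pie chart can be worked out from the raw counts. This is a minimal sketch; the category names and counts are invented, and `percentage_labels` is a hypothetical helper.

```python
def percentage_labels(counts):
    """Turn a {category: count} mapping into {category: 'xx.x%'} labels."""
    total = sum(counts.values())
    return {name: f"{100 * n / total:.1f}%" for name, n in counts.items()}


# Invented counts for three categories.
counts = {"Type A": 30, "Type B": 15, "Type C": 5}
print(percentage_labels(counts))
```

Because each label is rounded to one decimal place, the labels on a real chart may add up to slightly more or less than 100%.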
3. Answering questions: composition What’s wrong here?
3. Answering questions: composition The perspective in 3D distorts our perception of the relative sizes of the sectors. Avoid using 3D plots because they are often misleading.
3. Answering questions: distribution What’s the shape of the data? Boxplots Histograms
3. Answering questions: distribution Boxplot
- 50% of schools have a PTR below 19 and 50% have a PTR above 19 (median = 19).
- 25% of schools have a PTR of 17 or less (lower quartile = 17).
- 25% of schools have a PTR of 22 or more (upper quartile = 22).
- All schools except outliers have a PTR between 8 and 31 (range of whiskers).
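The numbers a boxplot displays can be calculated directly. The sketch below makes several assumptions: PTR is taken to mean pupil-teacher ratio, the data values are invented (chosen so the quartiles match the slide), `boxplot_summary` is a hypothetical helper, and outliers are defined by the common "1.5 × interquartile range" rule, which may not be the rule used for the original figure.

```python
from statistics import quantiles


def boxplot_summary(values):
    """Return the quartiles, whisker ends and outliers for a boxplot."""
    lower_q, med, upper_q = quantiles(values, n=4, method="inclusive")
    iqr = upper_q - lower_q
    # Assumed convention: points beyond 1.5 * IQR from the box are outliers.
    low_fence = lower_q - 1.5 * iqr
    high_fence = upper_q + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {
        "lower_quartile": lower_q,
        "median": med,
        "upper_quartile": upper_q,
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }


# Invented PTR values for nine schools.
ptr = [12, 15, 17, 18, 19, 20, 22, 24, 40]
print(boxplot_summary(ptr))
```

Different software uses slightly different quartile and outlier rules, so the whiskers on a chart drawn in Excel or R may not match these values exactly.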
3. Answering questions: distribution Histogram
- Same data as boxplot.
- Data is roughly symmetrical around 20.
- The mode PTR is about 20.
- There are very few schools with PTR less than 10 or greater than 30.
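A histogram is built by counting how many values fall into each interval (bin). This is a minimal text-only sketch; the PTR values are invented, and `histogram_counts` is a hypothetical helper that assumes non-negative values and equal-width bins.

```python
from collections import Counter


def histogram_counts(values, bin_width):
    """Count how many values fall in each bin [start, start + bin_width)."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(bins.items()))


# Invented PTR values for thirteen schools.
ptr = [8, 14, 17, 18, 19, 20, 20, 21, 22, 23, 24, 27, 31]

# Print one row of '#' marks per bin as a rough text histogram.
for start, count in histogram_counts(ptr, 5).items():
    print(f"{start:>2}-{start + 5:<2} {'#' * count}")
```

The tallest row marks the modal bin, and short rows at either end correspond to the few schools with very low or very high values.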
3. Answering questions: distribution Skewed Data: when the data is not symmetrical Left Skew (long left tail) Right Skew (long right tail)
3. Answering questions: comparison How do groups in the data differ? Counts/proportions: side-by-side bar charts. Distributions: side-by-side boxplots.
3. Answering questions: comparison Side-by-Side Bar Charts Side-by-Side Boxplots
3. Answering questions: trend How does the data change over time? We might want to show how counts, proportions or distributions change over time. Counts: line graphs, multiple bar charts. Proportions: stacked bar graphs. Distributions: multiple boxplots.
3. Answering questions: trend Counts over time: Line Graph
3. Answering questions: trend Proportions over time: Stacked Proportion Bar Chart Each bar represents all schools in a given year. The proportion of each type of school is represented by its height within the bar.
3. Answering questions: trend Distributions over time: Multiple Boxplots The median is increasing, so the typical school is growing larger over time. The interquartile range is increasing, so the difference in size between smaller and larger schools is increasing.
3. Answering questions: relationship Does the value of one variable depend on another? We’ve already seen cases where the value of a numerical variable depends on the value of a categorical variable, e.g.
- % teachers over 50 depends on the type of school.
- Number of schools depends on the phase and the type of school.
3. Answering questions: relationship Does the value of one variable depend on another? When we want to check whether one numerical variable depends on another numerical variable, we check whether there is a correlation between them. We use scatterplots to visually assess the relationship between two variables.
3. Answering questions: relationship Does the value of one variable depend on another? The strength and direction of a correlation is measured by the “correlation coefficient”, which always lies between -1 and 1. It can be calculated easily in Excel or Google Sheets.
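For readers who want to see what the spreadsheet is doing, the correlation coefficient can also be calculated by hand. The sketch below assumes the usual Pearson (linear) correlation coefficient; the function name `correlation` and the data are invented for illustration.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables vary together, and how each varies on its own.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    if variation_x == 0 or variation_y == 0:
        raise ValueError("correlation is undefined if a variable is constant")
    return co_variation / sqrt(variation_x * variation_y)


# Invented data: y rises with x in the first pair and falls in the second.
print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))
print(correlation([1, 2, 3, 4], [8, 6, 4, 2]))
```

A value near 1 means the points lie close to an upward-sloping line, a value near -1 means a downward-sloping line, and a value near 0 means little or no linear relationship.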
3. Answering questions: relationship Positive correlation (correlation coefficient = 0.7): as the PTR increases, the % of pupils achieving 5 A*-C also increases.
3. Answering questions: relationship Negative correlation (correlation coefficient = -0.8): as the % FSM increases, the % of pupils achieving 5 A*-C decreases.
3. Answering questions: relationship No correlation, or very little (correlation coefficient = 0.06): there doesn’t seem to be a relationship between the % male teachers and the % 5 A*-C.
Conclusion How to use data depends on the question being asked.
- Representative value (What’s a typical value?): mean, median, mode.
- Spread (How much variation is in the data?): range, interquartile range, standard deviation.
- Composition (What’s in the data?): bar chart for counts; pie chart for proportions.
- Distribution (What’s the shape of the data?): histogram; boxplot.
- Comparison (Differences between groups): side-by-side bar chart; side-by-side boxplot.
- Trend (Changes over time): line graph or multiple bar chart for counts; multiple boxplots for distributions.
- Relationship (How one thing depends on another): scatterplot.