Class Data (Major)
Ungrouped data: A set of scores or categories distributed individually, where the frequency for each individual score or category is counted.
Data: Public Health, Other, Psychology, Cog Science, Biology, Biology, Psychology, Public Health, Other, Cog Science, Cog Science, Psychology, Other, Biology, Public Health, Other, Psychology, Public Health, Biology, Cog Science
Natural, distinct categories/groupings.
What kind of data is this? What is the scale of measurement?
Class Data (Major)
Major              Count of Major
Psychology         22
Biology            17
Cognitive Science  12
Public Health
Other
Total              80
A bar chart is a graphical display used to summarize the frequency of discrete and categorical data that are distributed in whole units or classes.
Notice: each category is represented by a different rectangle. Also notice: the rectangles do not touch one another along the x-axis. This shows that they are discrete, distinct categories.
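The counting step behind a frequency table like this one can be sketched in Python. This is a minimal illustration, assuming the sample of majors listed on the ungrouped-data slide (20 entries, not the full class of 80); the variable names are illustrative only.

```python
from collections import Counter

# Sample of majors from the ungrouped-data slide (20 entries, not all 80 students).
majors = [
    "Public Health", "Other", "Psychology", "Cog Science", "Biology",
    "Biology", "Psychology", "Public Health", "Other", "Cog Science",
    "Cog Science", "Psychology", "Other", "Biology", "Public Health",
    "Other", "Psychology", "Public Health", "Biology", "Cog Science",
]

# Count how often each category occurs (the frequency of each category).
counts = Counter(majors)

for major, count in counts.most_common():
    print(f"{major:<15}{count}")
print(f"{'Total':<15}{sum(counts.values())}")
```

Because the categories are nominal, the order of the rows carries no meaning; only the counts do.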
Class Data (Major)
Major              Count of Major
Psychology         22
Biology            17
Cognitive Science  12
Public Health
Other
Total              80
A pie chart is a graphical display in the shape of a circle that is used to summarize the relative percent of categorical data.
Class Data (Year)
Data: Sophomore, Senior, Freshman, Junior, Sophomore, Freshman, Junior, Sophomore, Junior, Freshman, Sophomore, Freshman, Junior, Senior
Natural, distinct categories/groupings. Ordered?
What kind of data is this? What is the scale of measurement?
Class Data (Year)
Year       Count of Year
Freshman   11
Sophomore  60
Junior     7
Senior     2
Total      80
Definitions
Grouped data: A set of scores distributed into intervals, where the frequency of each score can fall into only one interval.
Interval: A range of values within which the frequency of a subset of scores is contained.
Steps to Summarize Grouped Data
Step 1: Find the real range. The real range is one more than the difference between the largest and smallest value in a list of data.
Step 2: Find the interval width. The interval width is the range of scores in each interval. There should be between 5 and 20 intervals.
Step 3: Construct the frequency distribution.
Steps to Summarize Grouped Data
Step 1: Find the real range.
Scores: 94 84 95 65 90 62 92 58 88 86 97 96 73 93 91 64 71 78 100 87 85 47 35 81 69 53 77
The range is 100 - 35 = 65. The real range is 65 + 1 = 66.
Steps to Summarize Grouped Data (cont.)
Step 2: Find the interval width. The number of intervals is 6. The interval width is the real range divided by the number of intervals: 66/6 = 11.
Step 3: Construct the frequency distribution.
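Steps 1 and 2 can be sketched as a short calculation. This is a minimal sketch using the largest score (100), smallest score (35), and number of intervals (6) from the slides; rounding the width up with `math.ceil` is an assumption for cases where the division is not exact, and the variable names are illustrative.

```python
import math

largest, smallest = 100, 35   # extremes of the scores on the Step 1 slide
n_intervals = 6               # number of intervals chosen on the slide

# Step 1: the real range is one more than largest minus smallest.
real_range = (largest - smallest) + 1

# Step 2: interval width = real range / number of intervals.
# Rounding up is an assumption for uneven divisions; here 66 / 6 is exactly 11.
width = math.ceil(real_range / n_intervals)

# Lower and upper limits of each interval, starting at the smallest score.
intervals = [
    (smallest + i * width, smallest + (i + 1) * width - 1)
    for i in range(n_intervals)
]

print(real_range, width)
print(intervals)
```

The resulting limits run from 35-45 up to 90-100, so every score falls into exactly one interval.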
Steps to Summarize Grouped Data (cont.)
Intervals   f(x)
90-100      17
79-89       10
68-78       5
57-67       5
46-56       2
35-45       1
Total       40
Intervals    f(x)
89.5-100.5   17
78.5-89.5    10
67.5-78.5    5
56.5-67.5    5
45.5-56.5    2
34.5-45.5    1
Total        40
Rules for a simple frequency distribution:
Each interval is defined.
Each interval is equal length.
No interval overlaps.
Book version: slightly easier to construct and understand.
My version: used for creating the visualization.
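The relationship between the two tables can be sketched in code. The half-unit adjustment below is inferred from comparing the two sets of intervals on this slide (for example, 35-45 becomes 34.5-45.5); it is an illustration, not a rule stated on the slide.

```python
# Whole-number interval limits from the first table, lowest to highest.
limits = [(35, 45), (46, 56), (57, 67), (68, 78), (79, 89), (90, 100)]

# Assumed conversion: extend each limit by half a unit in both directions.
boundaries = [(lo - 0.5, hi + 0.5) for lo, hi in limits]

# Each interval now has the same width, and neighbours share one boundary.
widths = [hi - lo for lo, hi in boundaries]
shared = all(a[1] == b[0] for a, b in zip(boundaries, boundaries[1:]))

print(boundaries)
print(widths, shared)
```

With shared boundaries, the intervals leave no gaps between them, which is what lets the rectangles of a histogram touch.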
Cumulative Frequency
Distributes the sum of frequencies across a series of intervals.
Add from bottom up: discuss in terms of “less than”, “at or below” a certain value, or “at most”.
Add from top down: discuss in terms of “greater than”, “at or above” a certain value, or “at least”.
Intervals  f(x)  Cum. Freq. (bottom up)  Cum. Freq. (top down)
90-100     17    40                      17
79-89      10    23                      27
68-78      5     13                      32
57-67      5     8                       37
46-56      2     3                       39
35-45      1     1                       40
Total      40
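Both directions of accumulation can be sketched with a running sum. This is a minimal sketch using the grouped exam-score frequencies; the frequency of 5 for the 57-67 interval is inferred from the total of 40 and the cumulative values shown, and the variable names are illustrative.

```python
from itertools import accumulate

# Intervals and frequencies listed from the lowest interval to the highest.
# The 57-67 frequency (5) is inferred from the total of 40.
intervals = ["35-45", "46-56", "57-67", "68-78", "79-89", "90-100"]
freqs = [1, 2, 5, 5, 10, 17]

# Bottom up: number of scores "at or below" each interval.
bottom_up = list(accumulate(freqs))

# Top down: number of scores "at or above" each interval,
# re-reversed so it lines up with the interval order above.
top_down = list(accumulate(reversed(freqs)))[::-1]

for name, f, up, down in zip(intervals, freqs, bottom_up, top_down):
    print(f"{name:<8}{f:>4}{up:>6}{down:>6}")
```

For example, 23 scores are at or below the 79-89 interval, while 27 scores are at or above it.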
Relative Frequency
Distributes the proportion of scores in each interval.
Equals the frequency in an interval divided by the total frequency (the total number of scores).
Often used to summarize large data sets.
Intervals  f(x)  Relative Frequency  Cum. Rel. Frequency
90-100     17    0.425               1.000
79-89      10    0.250               0.575
68-78      5     0.125               0.325
57-67      5     0.125               0.200
46-56      2     0.050               0.075
35-45      1     0.025               0.025
Total      40
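The relative and cumulative relative frequencies follow directly from the counts. This is a minimal sketch using the same grouped exam-score frequencies (the 57-67 frequency of 5 is inferred from the total of 40); the names are illustrative.

```python
from itertools import accumulate

# Frequencies from the lowest interval (35-45) to the highest (90-100).
freqs = [1, 2, 5, 5, 10, 17]
total = sum(freqs)

# Relative frequency: each interval's share of all scores.
rel = [f / total for f in freqs]

# Cumulative relative frequency, added from the bottom up.
# Dividing the integer running totals avoids accumulating rounding error.
cum_rel = [c / total for c in accumulate(freqs)]

for r, c in zip(rel, cum_rel):
    print(f"{r:.3f}  {c:.3f}")
```

The relative frequencies sum to 1, so the last cumulative value is always 1.000.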
Class Data (Excitement)
Data: 3 4 2 5 1 4 5 3 2 4 3 1 5 3 5 2 4 1
Viewed as scores on a continuum from 1 to 5, the categories/groupings are arbitrary.
Creating a Histogram
Histogram: Summarizes the frequency of continuous, grouped (or ungrouped) data.
Rules for creating a histogram:
Rule 1: Vertical rectangles represent each interval, and the height of the rectangle equals the frequency recorded for each interval.
Rule 2: The base of each rectangle begins and ends at the upper and lower boundaries of each interval.
Rule 3: Each rectangle touches adjacent rectangles at the boundaries of each interval.
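The three rules can be sketched as data rather than as a drawing. This is a minimal sketch that describes each rectangle by its left edge, right edge, and height, assuming the grouped exam-score boundaries and frequencies from the earlier slides (with 5 inferred for the 56.5-67.5 interval); the dictionary keys are illustrative, and any plotting tool could draw the result.

```python
# Interval boundaries (lowest to highest) and the frequency of each interval.
boundaries = [34.5, 45.5, 56.5, 67.5, 78.5, 89.5, 100.5]
freqs = [1, 2, 5, 5, 10, 17]

# Rule 1: height = frequency. Rule 2: base runs from lower to upper boundary.
rects = [
    {"left": lo, "right": hi, "height": f}
    for lo, hi, f in zip(boundaries, boundaries[1:], freqs)
]

# Rule 3: each rectangle's right edge is the next rectangle's left edge.
touching = all(a["right"] == b["left"] for a, b in zip(rects, rects[1:]))

for r in rects:
    print(r)
print("adjacent rectangles touch:", touching)
```

Building the rectangles from one shared list of boundaries is what guarantees that neighbours touch, in contrast to the separated bars of a bar chart.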
Class Data (Excitement)
Intervals  f(x)
1-5        80
Total      80
Intervals  f(x)
0.5-5.5    80
Total      80
Frequency = number of observations: 80 observations between 0.5 and 5.5 (i.e., all data). Not very useful.
Point out why we set the boundaries at .5s even though those values are not actual response options; this is clearer on the next slide.
Class Data (Excitement)
Intervals  f(x)
5-6        10
3-4        61
1-2        9
Total      80
Intervals  f(x)
4.5-6.5    10
2.5-4.5    61
0.5-2.5    9
Total      80
More informative: e.g., 61 observations between 2.5 and 4.5 (i.e., 3’s and 4’s).
Class Data (Excitement)
Intervals  f(x)
5          10
4          17
3          44
2          6
1          3
Total      80
Intervals  f(x)
4.5-5.5    10
3.5-4.5    17
2.5-3.5    44
1.5-2.5    6
0.5-1.5    3
Total      80
Even more informative: e.g., 10 observations between 4.5 and 5.5 (i.e., 5’s).
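The effect of the interval choice across these three slides can be sketched by regrouping the same counts. This is a minimal sketch assuming the per-score frequencies from this slide (the count of 3 for a score of 1 comes from the 0.5-1.5 row); `regroup` is a hypothetical helper written for this illustration.

```python
# Frequency of each excitement score (1 to 5) for the 80 students.
counts = {1: 3, 2: 6, 3: 44, 4: 17, 5: 10}

def regroup(counts, limits):
    """Sum per-score counts into intervals given as (lower, upper) limits."""
    return {
        f"{lo}-{hi}": sum(n for score, n in counts.items() if lo <= score <= hi)
        for lo, hi in limits
    }

one_interval = regroup(counts, [(1, 5)])                     # everything together
three_intervals = regroup(counts, [(1, 2), (3, 4), (5, 6)])  # width of 2

print(one_interval)
print(three_intervals)
print(counts)
```

The total stays at 80 in every version; only the amount of detail about where the scores fall changes.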
Class Data (Excitement) Relative Frequency = Frequency/Total e.g. 10/80 = 0.125
Definitions
Frequency polygon: A figure that summarizes the frequency of continuous data at the midpoint of each interval.
Ogive: A figure that summarizes the cumulative frequency of continuous data at the upper boundary of each interval.
Class Data (Excitement) Connect midpoints of each class/group
Class Data (Excitement) Frequency Polygon by itself
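The points that a frequency polygon connects can be sketched as follows. This is a minimal sketch assuming the excitement boundaries (0.5 to 5.5) and frequencies from the earlier slide; the variable names are illustrative.

```python
# Interval boundaries and frequencies for the excitement scores 1 to 5.
boundaries = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]
freqs = [3, 6, 44, 17, 10]

# The midpoint of each interval is halfway between its two boundaries.
midpoints = [(lo + hi) / 2 for lo, hi in zip(boundaries, boundaries[1:])]

# A frequency polygon joins these (midpoint, frequency) points with lines.
points = list(zip(midpoints, freqs))
print(points)
```

Here the midpoints are simply the scores 1 through 5, which is why the polygon's dots sit at the center of each histogram rectangle.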
Class Data (Excitement)
Ogive: e.g., the cumulative frequency up to 3.5 is 3 + 6 + 44 = 53. An ogive always starts at 0 and ends at the total.
Class Data (Excitement)
Relative frequency ogive: e.g., the cumulative relative frequency up to 3.5 is 0.0375 + 0.075 + 0.55 = 0.6625. A relative frequency ogive always starts at 0 and ends at 1.
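The points of both ogives can be sketched with a running total. This is a minimal sketch assuming the excitement boundaries and frequencies from the earlier slides; starting the curve at the lowest boundary (0.5) with a value of 0 is how the "starts at 0" property is represented here, and the names are illustrative.

```python
from itertools import accumulate

boundaries = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]
freqs = [3, 6, 44, 17, 10]
total = sum(freqs)

# Cumulative frequency at each boundary, starting from 0 at the lowest one.
cum = [0] + list(accumulate(freqs))

# Ogive: cumulative frequency plotted at each upper boundary.
ogive = list(zip(boundaries, cum))

# Relative frequency ogive: the same points divided by the total.
rel_ogive = [(b, c / total) for b, c in ogive]

print(ogive)
print(rel_ogive)
```

The point at 3.5 reproduces the slide's examples: a cumulative frequency of 53 and a cumulative relative frequency of 0.6625.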
Scatterplot Scatterplot (called a scattergram in the book): A display of paired data points (x, y) that summarizes the relationship between two variables. Data points are plotted to see whether a pattern emerges.
Measures of Central Tendency
The “center” of a distribution.
Measures of central tendency are statistical measures that tend toward the center of a distribution. They are used to locate a single score that is most representative or descriptive of all the scores in a distribution. They can help us know whether the distribution tends to be composed of high or low scores.
The types: mean, median, mode.
Mode
Most frequent value in a data set.
53, 67, 75, 75, 84, 91, 91, 91, 94, 99: Mode = 91 (one mode)
53, 75, 75, 75, 84, 91, 91, 91, 94, 99: Mode = 75 and 91 (two modes)
53, 72, 73, 75, 84, 91, 92, 93, 94, 99: Mode = none (no mode)
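The three cases above can be sketched with a small function. `modes` is a hypothetical helper written for this illustration; treating a data set in which every value occurs only once as having no mode is an assumption that matches the third example.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s), or [] if no value repeats."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    # Assumption: if nothing occurs more than once, there is no mode.
    if highest == 1:
        return []
    return sorted(v for v, n in counts.items() if n == highest)

one_mode = modes([53, 67, 75, 75, 84, 91, 91, 91, 94, 99])
two_modes = modes([53, 75, 75, 75, 84, 91, 91, 91, 94, 99])
no_mode = modes([53, 72, 73, 75, 84, 91, 92, 93, 94, 99])

print(one_mode, two_modes, no_mode)
```

Returning a list keeps the one-mode, two-mode, and no-mode cases in a single form, so the caller can check its length.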