Lecture 2: Frequency Distributions
Exploratory Data Analysis
Emphasizes graphic representation
Addresses the following questions:
  1. What is going on here?
  2. Are there patterns in these data?
  3. What are the patterns in these data?
Allows for hypothesis generation and model building through iteration
EDA is an attitude toward data - Tukey
Distributions
Distribution: the arrangement of data; how the cases fall (are distributed).
Information about distributions is paramount in statistics.
Distributions can be displayed in tabular and graphic form.
Frequency distribution: a tabulation of the number of events/occurrences for each category on the scale of measurement.
Frequency: a count of the number of occurrences.
Frequency Table
Pet type   frequency   proportion   %
dog        30          0.39         39
cat        25          0.32         32
fish       20          0.26         26
turtle     2           0.03         3
Total      77          1.0          100
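The proportion and % columns follow directly from the counts. A minimal sketch in Python, using the pet counts from the table above; the function name `frequency_table` is made up for illustration:

```python
from collections import Counter

# Raw observations rebuilt from the counts in the pet table.
pets = ["dog"] * 30 + ["cat"] * 25 + ["fish"] * 20 + ["turtle"] * 2

def frequency_table(observations):
    """Return rows of (category, f, proportion, percent), most frequent first."""
    counts = Counter(observations)
    n = sum(counts.values())
    return [(cat, f, f / n, 100 * f / n) for cat, f in counts.most_common()]

table = frequency_table(pets)
for cat, f, p, pct in table:
    print(f"{cat:<8}{f:>4}{p:>8.2f}{pct:>6.0f}")
```

Rounded to two decimals, the proportions should agree with the table (0.39, 0.32, 0.26, 0.03).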
Grouping Data
Arrange data from high to low
Frequencies can be used to find the total number of scores: Σf = N
ΣX from a frequency distribution = ΣfX
Proportion = p = f/N
Percentage = p(100) = (f/N)(100)

Score X   frequency   p      %
15        20          0.43   43
10        15          0.33   33
4         11          0.24   24
Total     46          1.0    100
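A short sketch of these sums in Python. The (score, frequency) pairs below are made up for illustration and are not from the lecture:

```python
# Hypothetical (score X, frequency f) pairs, listed high to low.
dist = [(5, 1), (4, 2), (3, 3), (2, 3), (1, 1)]

n = sum(f for _, f in dist)             # sum of f = N
sum_x = sum(x * f for x, f in dist)     # sum of X = sum of f times X

proportions = {x: f / n for x, f in dist}                   # p = f / N
percentages = {x: 100 * f / n for x, f in dist}             # % = p(100)

print(n, sum_x)
```

Note that ΣX is not the sum of the X column: each score has to be weighted by how often it occurs.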
X    f    p    %
10   2
9    5
8    4
7    3
6
Finish filling the chart. Then find N and ΣX.
X    f    ΣX   Proportion   %
10   2    20   .11          11
9    5    45   .28          28
8    4    32   .22          22
7    3    21   .17          17
6         12
N = 18    ΣX = 140    1.0    100
Frequency Distributions
Ungrouped frequency distribution = a list of observations and their frequency of occurrence when observations are sorted by single values.
Grouped frequency distribution = a list of observations and their frequency of occurrence when observations are sorted into categories or intervals.
Grouped vs. ungrouped frequency distributions: how to decide
How many values?
  If there are fewer than 10 different scores, "ungrouped" is good
  If there are more than 10 different scores, use grouped
Reminder: to figure out how many values there are, use hi - lo + 1
Grouped Frequency Distributions
Crucial guidelines for constructing frequency distributions:
(1) Aim for 10-15 groups
(2) Mutually exclusive (each observation should be represented only once)
      Wrong: 0-5, 5-10, 10-15
      Right: 0-4, 5-9, 10-15
(3) All classes should have equal intervals (even if the frequency for a particular class is zero)
(4) Don’t omit any intervals
(5) Make widths convenient (e.g. not 2.2); the bottom score should be a multiple of the width
Formula for finding the width in a grouped frequency distribution:
interval = (hi - lo + 1) / # groups
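The width formula can be sketched in a few lines of Python. The example values (hi = 111, lo = 80, 10 groups) are illustrative, and rounding to the nearest whole number is only one assumed way of making the width "convenient":

```python
def interval_width(hi, lo, n_groups):
    """Raw width from the formula: (hi - lo + 1) / number of groups."""
    return (hi - lo + 1) / n_groups

raw = interval_width(111, 80, 10)   # 32 possible values spread over 10 groups
width = round(raw)                  # assumed rule: nearest whole number

print(raw, width)
```

The raw result is rarely a convenient number, so the final width (and therefore the final number of groups) is a judgment call within the guidelines above.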
Example
84 85 87 80 81 88 89 90 92 92 93 95 96 96
96 97 97 97 97 98 98 98 98 99 99 99 99 99
99 100 100 100 100 101 101 101 101 102 102 103 103 100
100 100 101 102 103 102 100 101 102 100 100 100 100 100
100 104 105 104 106 105 104 105 105 110 110 111 111 110

i = (111 - 80 + 1) / 10 = 3.2 ~ 3
midpoint = (hi real + lo real) / 2
START at the bottom with the low number

Interval   Real Limits     f    Midpoint
110-112    109.5 - 112.5   5    111
107-109    106.5 - 109.5   0    108
104-106    103.5 - 106.5   8    105
101-103    100.5 - 103.5   14   102
98-100     97.5 - 100.5    23   99
95-97      94.5 - 97.5     8    96
92-94      91.5 - 94.5     3    93
89-91      88.5 - 91.5     2    90
86-88      85.5 - 88.5     2    87
83-85      82.5 - 85.5     2    84
80-82      79.5 - 82.5     2    81

* Groups are called class intervals
* The limits written for a class interval are apparent limits: they appear to form its upper and lower boundaries, but the real limits must be taken into account
NOTE: I violated the rule of making the bottom score a multiple of the width
Example
52 52 53 53 53 54 57 57 57 59 59 62 62 61 61 63
63 63 63 64 64 64 65 66 67 67 68 68 69 69 70 71
71 71 72 74 74 74 76 77 75 79 79 79 79 79 79 79
80 85 85 85 85 85 90 90 90 87 91 91 95 95 95 93
* Include columns for interval, real limits, frequency, and midpoint
Interval   Real limits   f   Midpoint
92-95      91.5-95.5     4   93.5
88-91      87.5-91.5     5   89.5
84-87      83.5-87.5     6   85.5
80-83      79.5-83.5     1   81.5
76-79      75.5-79.5     9   77.5
72-75      71.5-75.5     5   73.5
68-71      67.5-71.5     8   69.5
64-67      63.5-67.5     7   65.5
60-63      59.5-63.5     8   61.5
56-59      55.5-59.5     5   57.5
52-55      51.5-55.5     6   53.5
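The table can be checked by tallying the 64 scores from the example. A minimal sketch in Python, assuming whole-number scores (so real limits sit 0.5 beyond the apparent limits), a width of 4, and a bottom interval starting at 52; the function name `grouped_distribution` is made up for illustration:

```python
scores = [
    52, 52, 53, 53, 53, 54, 57, 57, 57, 59, 59, 62, 62, 61, 61, 63,
    63, 63, 63, 64, 64, 64, 65, 66, 67, 67, 68, 68, 69, 69, 70, 71,
    71, 71, 72, 74, 74, 74, 76, 77, 75, 79, 79, 79, 79, 79, 79, 79,
    80, 85, 85, 85, 85, 85, 90, 90, 90, 87, 91, 91, 95, 95, 95, 93,
]

def grouped_distribution(data, width, start):
    """Rows of (lo, hi, real_lo, real_hi, f, midpoint), highest interval first."""
    rows = []
    lo = start
    while lo <= max(data):
        hi = lo + width - 1                          # apparent upper limit
        f = sum(1 for x in data if lo <= x <= hi)    # frequency in this interval
        real_lo, real_hi = lo - 0.5, hi + 0.5        # real limits
        rows.append((lo, hi, real_lo, real_hi, f, (real_lo + real_hi) / 2))
        lo += width
    return rows[::-1]

rows = grouped_distribution(scores, width=4, start=52)
for lo, hi, real_lo, real_hi, f, mid in rows:
    print(f"{lo}-{hi}  {real_lo}-{real_hi}  {f:>2}  {mid}")
```

Empty intervals are kept with f = 0 rather than skipped, in line with the guideline not to omit any intervals.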
Percentiles and Percentile Ranks: Get more out of your frequency distribution
Scores alone are meaningless. Percentiles let you compare a score to a standard.
Percentile: one of the numbers that divide the distribution into 100 equal parts
Percentile rank (PR): the number that represents the percentage of cases in a comparison group that achieved scores at or below the one cited
e.g. a PR of 95 on the SAT means that 95% of those taking the SAT performed equally or worse than you and only 5% did better
Example
Class grades   f    cf   cprop   cum %
91-100         6    32   1.0     100
81-90          4    26   0.81    81
71-80          9    22   0.69    69
61-70          11   13   0.41    41
51-60          2    2    0.06    6
Total          32

First step: find the number of people located at or below each point in the distribution.
Note that the cumulative percentage is associated with the upper real limit of its interval.
What’s the 81st percentile? (Careful: remember the X values are not points on a scale, but rather intervals.)
What is the percentile rank for 70.5?
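The cumulative columns are running totals taken from the bottom interval upward. A minimal sketch in Python using the class grades frequencies; the variable names are illustrative, and rounding the cumulative % to a whole number is an assumption that matches the table:

```python
from itertools import accumulate

# Class grades, listed from the lowest interval to the highest.
intervals = ["51-60", "61-70", "71-80", "81-90", "91-100"]
upper_real_limits = [60.5, 70.5, 80.5, 90.5, 100.5]
f = [2, 11, 9, 4, 6]

cf = list(accumulate(f))                       # cumulative frequency
n = cf[-1]                                     # N is the last running total
cprop = [c / n for c in cf]                    # cumulative proportion
cpct = [round(100 * c / n) for c in cf]        # cumulative %, whole numbers

# Each cumulative % belongs to the upper real limit of its interval.
pr_at_upper_limit = dict(zip(upper_real_limits, cpct))

for name, freq, c, pct in zip(intervals, f, cf, cpct):
    print(f"{name:<8}{freq:>4}{c:>4}{pct:>5}")
```

Looking up a score that is an upper real limit, or a percentage that appears in the cum % column, needs no further calculation.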
Add the following columns to your grouped table: cum f and cum %
52 52 53 53 53 54 57 57 57 59 59 62 62 61 61 63
63 63 63 64 64 64 65 66 67 67 68 68 69 69 70 71
71 71 72 74 74 74 76 77 75 79 79 79 79 79 79 79
80 85 85 85 85 85 90 90 90 87 91 91 95 95 95 93
Interval   Real limits   f   cf   c%    Midpoint
92-95      91.5-95.5     4   64   100   93.5
88-91      87.5-91.5     5   60   94    89.5
84-87      83.5-87.5     6   55   86    85.5
80-83      79.5-83.5     1   49   77    81.5
76-79      75.5-79.5     9   48   75    77.5
72-75      71.5-75.5     5   39   61    73.5
68-71      67.5-71.5     8   34   53    69.5
64-67      63.5-67.5     7   26   41    65.5
60-63      59.5-63.5     8   19   30    61.5
56-59      55.5-59.5     5   11   17    57.5
52-55      51.5-55.5     6   6    9     53.5

How about the percentile rank for X = 59.5?
What’s the percentile rank for X = 91.5?
What’s the 86th percentile?
How about the 61st percentile?
Obtaining PR or Interpolation
When values don’t appear in the table

Class grades   f    cum f   cum prop   cum %
91-100         6    32      1.0        100
81-90          4    26      0.81       81
71-80          9    22      0.69       69
61-70          11   13      0.41       41
51-60          2    2       0.06       6

Some important symbols:
* cumfll = cf at lower real limit of X
* c% = (cf / N)(100%)
* X = score
* Xll = score at lower real limit of X
* i = interval width
* fi = # of cases in X’s group
* N = total # of scores
Obtaining PR or Interpolation

Class grades   f    cum f   cum prop   cum %
91-100         6    32      1.0        100
81-90          4    26      0.81       81
71-80          9    22      0.69       69
61-70          11   13      0.41       41
51-60          2    2       0.06       6

What is the PR of 88? Getting PR from score (X):
PR = [cumfll + ((X - Xll) / i)(fi)] / N x 100
X = 88
i = (60 - 51 + 1) = 10
cumfll = 22
Xll = 80.5
fi = 4
N = 32
PR = [22 + ((88 - 80.5) / 10)(4)] / 32 x 100 = (25 / 32) x 100 = 78.13%
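The PR formula can be sketched as a small Python function. This is an illustration under stated assumptions, not a standard library routine: the table is given as (lower real limit, f) pairs from the lowest interval up, every interval has the same width, and the name `percentile_rank` is made up here:

```python
# Class grades as (lower real limit, frequency), lowest interval first.
grades = [(50.5, 2), (60.5, 11), (70.5, 9), (80.5, 4), (90.5, 6)]

def percentile_rank(x, table, width):
    """PR = [cumfll + ((X - Xll) / i)(fi)] / N x 100."""
    n = sum(f for _, f in table)
    cum_below = 0                                   # cumfll for the current interval
    for lower, f in table:
        if lower <= x <= lower + width:             # X falls inside this interval
            return 100 * (cum_below + (x - lower) / width * f) / n
        cum_below += f
    raise ValueError("score lies outside the table")

pr_88 = percentile_rank(88, grades, width=10)
print(round(pr_88, 2))
```

For X = 88 the function works through the same numbers as the hand calculation: cumfll = 22, Xll = 80.5, fi = 4, N = 32.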
Obtaining PR or Interpolation

Class grades   f    cum f   cum prop   cum %
91-100         6    32      1.0        100
81-90          4    26      0.81       81
71-80          9    22      0.69       69
61-70          11   13      0.41       41
51-60          2    2       0.06       6

What is the PR of 88? Getting PR from score (X) using interpolation:

Score    cum %
90.5     81
88       X
80.5     69

90.5 - 80.5 = 10 (width of the interval in score units)
88 is 2.5 units down the interval (90.5 - 88 = 2.5)
81 - 69 = 12 (width of the interval in cum % units)
2.5 / 10 = a / 12
a = 3
81 - 3 = 78%
Obtaining the score (X) from PR

Class grades   f    cum f   cum prop   cum %
91-100         6    32      1.0        100
81-90          4    26      0.81       81
71-80          9    22      0.69       69
61-70          11   13      0.41       41
51-60          2    2       0.06       6

What is the score that corresponds to a PR of 72?
cumf = (PR x N) / 100 = (72 x 32) / 100 = 23.04
X = Xll + [i (cumf - cumfll) / fi]
  = 80.5 + [10 (23.04 - 22) / 4]
  = 83.1
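Going from PR back to a score reverses the same steps. A minimal Python sketch, again assuming the table is given as (lower real limit, f) pairs from the lowest interval up with equal widths; the name `score_at_percentile` is made up for illustration:

```python
# Class grades as (lower real limit, frequency), lowest interval first.
grades = [(50.5, 2), (60.5, 11), (70.5, 9), (80.5, 4), (90.5, 6)]

def score_at_percentile(pr, table, width):
    """X = Xll + i (cumf - cumfll) / fi, where cumf = PR x N / 100."""
    n = sum(f for _, f in table)
    target = pr * n / 100                           # cumf that the PR corresponds to
    cum_below = 0                                   # cumfll for the current interval
    for lower, f in table:
        if f > 0 and target <= cum_below + f:       # target cumf falls in this interval
            return lower + width * (target - cum_below) / f
        cum_below += f
    raise ValueError("percentile rank lies outside 0-100")

x_72 = score_at_percentile(72, grades, width=10)
print(round(x_72, 1))
```

For a PR of 72 the target cumf is 23.04, which lands in the 81-90 interval (cumfll = 22, fi = 4), as in the hand calculation.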
Obtaining the score (X) from PR

Class grades   f    cum f   cum prop   cum %
91-100         6    32      1.0        100
81-90          4    26      0.81       81
71-80          9    22      0.69       69
61-70          11   13      0.41       41
51-60          2    2       0.06       6

What is the score that corresponds to a PR of 72? Using interpolation:

Score    cum %
90.5     81
X        72
80.5     69

81 - 69 = 12 (width of the interval in cum % units)
72 is 9 down from 81
90.5 - 80.5 = 10 (width of the interval in score units)
9 / 12 = a / 10
a = 7.5
90.5 - 7.5 = 83
Interval   Real limits   f   cf   c%    Midpoint
92-95      91.5-95.5     4   64   100   93.5
88-91      87.5-91.5     5   60   94    89.5
84-87      83.5-87.5     6   55   86    85.5
80-83      79.5-83.5     1   49   77    81.5
76-79      75.5-79.5     9   48   75    77.5
72-75      71.5-75.5     5   39   61    73.5
68-71      67.5-71.5     8   34   53    69.5
64-67      63.5-67.5     7   26   41    65.5
60-63      59.5-63.5     8   19   30    61.5
56-59      55.5-59.5     5   11   17    57.5
52-55      51.5-55.5     6   6    9     53.5

How about the percentile rank for X = 73?
What score is at the 88th percentile?
Graphs
Visual methods to display data:
  Figure: pictorial; photo; drawing
  Table: organized numerical info
  Graph: pictorial; axes, numbers, etc.
Basics:
  X-axis or abscissa: horizontal line in the graph; independent variable (IV)
  Y-axis or ordinate: vertical line in the graph; dependent variable (DV)
  Always label axes [graph’s height should be roughly 2/3 to 3/4 the length (see Box 2.1 pg. 49)]
  Y starts at 0; continuous, no breaks
  X can change start; break; can be discrete
Bar Graphs
Bar graph = nominal or ordinal data (usually nonnumerical values)
Each bar = category
Height = frequency
Bars do NOT touch
If ordinal data are used, the order must be preserved
Can be vertical or horizontal
Histogram
Histogram = interval and ratio data
Same rules as bar graph EXCEPT bars touch
Usually for discrete data
Width of the bar extends to the real limits of the score
Frequency Distribution Polygon or Line Graph
Line graph = interval, ratio data
Usually used for continuous data
A dot is centered above each score
Can also use this type of graph for relative frequencies (proportions) when there is a large amount of data. In this case each dot would be placed at the midpoint of the range
Cumulative Frequency Graph
Cumulative frequency graph = can be a bar graph, histogram, or line graph
Uses the cumulative proportion or percentage
Line graph version is typically S-shaped (an ogive)
Never decreases from left to right
Stem and Leaf Displays Data Set 54 81 82 61 97 83 74 67 86 80 68 87 54 81 82 61 97 83 74 67 86 80 68 87 97 76 88 100 77 98 63 79 99 75 81
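A stem and leaf display can be built by splitting each score into a stem (tens and above) and a leaf (units digit). A minimal Python sketch using the data set exactly as listed above; it assumes non-negative whole-number scores, and the function name `stem_and_leaf` is made up for illustration:

```python
from collections import defaultdict

data = [
    54, 81, 82, 61, 97, 83, 74, 67, 86, 80, 68, 87,
    54, 81, 82, 61, 97, 83, 74, 67, 86, 80, 68, 87,
    97, 76, 88, 100, 77, 98, 63, 79, 99, 75, 81,
]

def stem_and_leaf(values):
    """Map each stem (tens and above) to its sorted list of leaves (units digits)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(stems)

display = stem_and_leaf(data)
for stem in sorted(display):
    leaves = "".join(str(leaf) for leaf in display[stem])
    print(f"{stem:>2} | {leaves}")
```

Because every leaf is one original score, the individual values can be read back off the display, which a grouped frequency table does not allow.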
Advantages to Stem and Leaf
Easy to construct
Allows you to identify each and every individual score (a frequency distribution just tells you the frequency)
Both a picture and a listing of scores (if you turn the stem and leaf display on its side it looks like a histogram)
* Caveat - just seen as a preliminary means for organizing data
Distributions
3 characteristics that describe a distribution:
  Shape
  Central tendency: center of the distribution
  Variability: spread of scores
Shape: technically, shape is defined by an equation that prescribes the exact relationship between each X and Y value on the graph
Characteristics of Distributions defined by shape
Skewness (Sk): measure of balance in a distribution
  Evenly balanced distributions have no Sk; they are symmetrical (the normal distribution is one example)
  Positive Sk (Sk+): tail trails to the right (positive direction)
  Negative Sk (Sk-): tail trails to the left (negative direction)
Kurtosis (Ku): measure of how peaked a distribution is
  Platykurtic: relatively flat
  Leptokurtic: relatively peaked
  Mesokurtic: neither flat nor peaked
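The lecture describes Sk and Ku in words only. As a rough numerical illustration, the sketch below uses one common moment-based definition, which is an assumption and not necessarily the one the textbook uses: skewness as the third central moment divided by the 1.5 power of the second, and kurtosis as the fourth central moment divided by the square of the second, minus 3. The sample lists are made up.

```python
def central_moment(values, k):
    """Average of (value - mean) ** k over all values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)

def skewness(values):
    """Zero if balanced, positive if the tail trails right, negative if left."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Negative for relatively flat shapes, positive for relatively peaked ones."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]        # hypothetical, evenly balanced
right_tail = [1, 1, 1, 2, 10]      # hypothetical, tail trails to the right
left_tail = [1, 9, 10, 10, 10]     # hypothetical, tail trails to the left

print(skewness(symmetric), skewness(right_tail), skewness(left_tail))
print(excess_kurtosis(symmetric))
```

Under this definition the sign of the skewness value matches the Sk+ and Sk- labels above, and a value of excess kurtosis near zero corresponds to mesokurtic.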
Homework - Chapter 2
Problems: 1, 2, 4-6, 7, 9, 11, 13, 17, 20, 21, 23, 26
For problems 20, 21, and 23, use either the method of finding PR from X and X from PR that I taught today or the method of interpolation in the book. Your choice!