Initial Data Analysis Frequency
IDA Often overlooked or dismissed as unimportant, but it is at the beginning stages that much trouble can be avoided. If the data are glossed over, the result can be missed findings, or results that cannot be replicated because they reflect bad data. Bad data?
IDA includes: a healthy inspection of the individual variables’ behaviors; outlier analysis; descriptive and graphical output.
Describing and Exploring Data Once data have been collected, the raw numbers must be summarized in some fashion to make them more informative. Several options are available, including plotting the data and calculating descriptive statistics.
Plotting Data Often, the first thing one does with a set of raw data is to plot frequency distributions. Usually this is done by first creating a table of the frequencies broken down by values of the relevant variable; the frequencies in the table are then plotted in a histogram.
Frequency Data Example: Age as estimated by a questionnaire in a statistics class. Note: The frequencies in the adjacent table were calculated by simply counting the number of subjects having the specified value for the age variable.
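The counting step described above can be sketched in a few lines of Python. The ages below are made up for illustration (the class questionnaire data are not reproduced here), and `frequency_table` is a hypothetical helper name, not something from the original slides.

```python
from collections import Counter

# Hypothetical ages; NOT the questionnaire data from the slides.
ages = [19, 20, 20, 21, 19, 22, 20, 21, 20, 23]

def frequency_table(values):
    """Return (value, count) pairs sorted by value."""
    counts = Counter(values)
    return sorted(counts.items())

for age, freq in frequency_table(ages):
    # One '*' per subject gives a crude text histogram.
    print(f"{age:>3} {freq:>2} {'*' * freq}")
```

Each row pairs a value of the variable with the number of subjects who have that value, which is all a frequency table is; the row of asterisks is the same information drawn as a bar.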
Grouping data Plotting is easy when the variable of interest has a relatively small number of values (like our age variable did). However, the values of a variable are sometimes more continuous, resulting in uninformative frequency plots if done in the above manner.
Grouped Frequency Distribution Example: Binning our weight variable. With a variable like weight we might obtain a range from 100 lb. to 200 lb. If we used the previously described technique, we would end up with about 100 bars, most of them with a frequency less than 2 or 3 (and many with a frequency of zero). We can get around this problem by grouping our values into bins. Try for around 10 classes (or bins) with natural splits.
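As a rough sketch of binning, the following groups weights into 10 classes of width 10 lb. The weights are invented for illustration, `grouped_frequencies` is a hypothetical helper, and the "lower limit included, upper limit excluded" convention is an assumption.

```python
def grouped_frequencies(values, low, width, n_bins):
    """Count values in n_bins classes [low + i*width, low + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        i = int((v - low) // width)
        if not 0 <= i < n_bins:
            # Refuse to drop data silently; widen the bins instead.
            raise ValueError(f"{v} falls outside the bins")
        counts[i] += 1
    return [(low + i * width, low + (i + 1) * width, c)
            for i, c in enumerate(counts)]

# Hypothetical weights in pounds; NOT the class data from the slides.
weights = [104, 118, 121, 125, 133, 137, 139, 142, 146, 148,
           151, 155, 158, 163, 167, 172, 176, 185, 193, 199]

bins = grouped_frequencies(weights, low=100, width=10, n_bins=10)
for lower, upper, count in bins:
    print(f"{lower}-under {upper}: {'*' * count}")
```

Under this convention a weight of exactly 200 lb. would not fit in the last class, so a real data set with that value would need either an extra class or a closed upper limit on the final bin.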
Graphic Depiction of Frequency Histogram: similar to a bar chart, the only difference being that histograms represent non-nominal data. Age example
Weight example Check out this demo, which shows how the bin width you select can affect the “look” of the data. Here is another, similar demonstration of the effects of bin width.
Number of Classes and Class Width The number of classes should be between 5 and 15: fewer than 5 classes causes excessive summarization; more than 15 classes tends not to add much. Class Width: divide the range by the number of classes for an approximate class width, then round up to a convenient number.
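The class-width rule can be written out as a small calculation. This is only a sketch: `class_width` is a hypothetical helper, and treating "a convenient number" as the next multiple of 5 (the `round_to` parameter) is an assumption, not something the slides specify.

```python
import math

def class_width(minimum, maximum, n_classes, round_to=5):
    """Range / number of classes, rounded up to a multiple of round_to."""
    approx = (maximum - minimum) / n_classes
    return math.ceil(approx / round_to) * round_to

# Hypothetical scores running from 23 to 78, grouped into 6 classes:
# (78 - 23) / 6 is about 9.17, which rounds up to a width of 10.
print(class_width(23, 78, 6))
```

Rounding up rather than to the nearest value matters: a width that is too small would leave the largest observations outside the last class.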
Example of Ungrouped Data Scores on a social introversion inventory
Relative Frequency [Table: Class Interval, Frequency, Relative Frequency; six class intervals starting at 20; totals 50 and 1.00]
Cumulative Frequency [Table: Class Interval, Frequency, Cumulative Frequency; six class intervals starting at 20; total 50]
Class Midpoints, Relative Frequencies, and Cumulative Frequencies [Table: Class Interval, Frequency, Midpoint, Relative Frequency, Cumulative Frequency; six class intervals starting at 20; totals 50 and 1.00]
Cumulative Relative Frequencies [Table: Class Interval, Frequency, Relative Frequency, Cumulative Frequency, Cumulative Relative Frequency; six class intervals starting at 20; totals 50 and 1.00]
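The derived columns in these tables all follow from the class limits and the frequency column. The sketch below shows the arithmetic; the class width of 10 and the frequencies are assumptions made up for illustration (chosen only so that they total 50), not the values from the original tables.

```python
from itertools import accumulate

# ASSUMED classes: six intervals of width 10 starting at 20.
lower_limits = [20, 30, 40, 50, 60, 70]
width = 10
# Hypothetical frequencies; NOT the values from the slides.
frequencies = [5, 15, 14, 10, 5, 1]

total = sum(frequencies)
midpoints = [lo + width / 2 for lo in lower_limits]      # centre of each class
relative = [f / total for f in frequencies]              # share of all cases
cumulative = list(accumulate(frequencies))               # running total
cumulative_relative = [c / total for c in cumulative]    # running share

for lo, f, m, r, c, cr in zip(lower_limits, frequencies, midpoints,
                              relative, cumulative, cumulative_relative):
    print(f"{lo}-under {lo + width}  {f:>3} {m:>5.1f} {r:>5.2f} {c:>3} {cr:>5.2f}")
```

Two checks are worth making on any such table: the relative frequencies should sum to 1.00, and the last cumulative frequency should equal the total number of cases.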
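Histogram Construction [Figure: histogram of Frequency by Class Interval; six classes from 20 to under 80, with a frequency of 1 in the last class]
Frequency Polygon [Figure: frequency polygon of Frequency by Class Interval; six classes from 20 to under 80, with a frequency of 1 in the last class]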
Advantages/Disadvantages With the grouped frequency distribution we can take large data sets and make them much more manageable and easier to understand. However, we also lose information about individual data points.
Stem and Leaf Plots If values of a variable must be grouped prior to creating a frequency plot, then the information about the specific values is lost in the process (i.e., the resulting graph depicts only the frequencies associated with the grouped values). However, it is possible to obtain the graphical advantage of grouping and still keep all of the information if stem & leaf plots are used.
Stem and Leaf Plots These plots are created by splitting a data point into that part associated with the ‘group’ and that associated with the individual point. For example, the numbers 180, 180, 181, 182, 185, 186, 187, 187, 189 could be represented as:
Raw Data: 180, 180, 181, 182, 185, 186, 187, 187, 189. Stem: 18. Leaf: 0 0 1 2 5 6 7 7 9.
Construction of Stem and Leaf Plot [Table: Raw Data, Stem, Leaf]
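The splitting step can be sketched as follows, using the nine numbers from the earlier example. The helper name `stem_and_leaf` and the choice to split at the last digit (the `base` parameter) are assumptions for illustration; the sketch also assumes non-negative whole numbers.

```python
from collections import defaultdict

def stem_and_leaf(values, base=10):
    """Split each value into a stem (value // base) and a leaf (value % base)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, base)
        plot[stem].append(leaf)
    return dict(plot)

data = [180, 180, 181, 182, 185, 186, 187, 187, 189]

for stem, leaves in sorted(stem_and_leaf(data).items()):
    # Each leaf is one observation, so the row length works like a bar.
    print(f"{stem:>3} | {' '.join(str(leaf) for leaf in leaves)}")
```

Because every leaf is printed, the original values can be read back off the plot (stem 18 with leaf 5 is 185), which is exactly what a grouped frequency distribution cannot offer.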
Thus, we could represent our weight data in the following stem & leaf plot:
Stem & leaf plots are especially nice for comparing distributions.
Advantages Using a stem and leaf plot offers several advantages: it retains individual data points; it displays large amounts of data well (compared to a normal frequency distribution); it provides a ‘graphical’ display of the data. Disadvantage: kind of ugly.
Terminology Related to Distributions Frequency histograms often have a roughly symmetrical bell shape; such distributions are called normal or Gaussian.
Sometimes the bell shape is not symmetrical. Positive skew refers to the situation where the “tail” of the distribution is to the right; negative skew is when the “tail” is to the left.
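One common way to put a number on the direction of the tail is the moment coefficient of skewness, the third central moment divided by the 1.5 power of the second. It is not part of the original slides and is offered here only as an illustrative sketch; the data sets are made up.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

# Hypothetical data sets chosen to show each case.
right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]       # long tail to the right
left_tailed = [11 - x for x in right_tailed]   # mirror image, tail to the left
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right-tailed", right_tailed),
                   ("left-tailed", left_tailed),
                   ("symmetric", symmetric)]:
    print(f"{name}: {skewness(data):+.2f}")
```

The sign matches the terminology above: a positive value goes with a tail to the right, a negative value with a tail to the left, and a value near zero with a roughly symmetrical shape.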
Example: Pizza Data
Distribution Shapes [Figure panels: Normal, Positively Skewed, Negatively Skewed, Bimodal]