Initial Data Analysis Frequency

IDA  Often overlooked or sloughed off as being not all that important but…  It is at the beginning stages where much trouble can be avoided and if the data is glossed over this can lead to missed findings or results that will not be able to be replicated because they represent bad data.  Bad data?

IDA includes: a healthy inspection of the individual variables’ behaviors; outlier analysis; descriptive and graphical output.

Describing and Exploring Data  Once data have been collected, the raw numbers must be manipulated in some fashion to make them more informative.  Several options are available, including plotting the data and calculating descriptive statistics.

Plotting Data  Often, the first thing one does with a set of raw data is to plot frequency distributions.  Usually this is done by first creating a table of the frequencies broken down by values of the relevant variable; the frequencies in the table are then plotted in a histogram.
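As a rough illustration of the table-then-plot idea, here is a minimal Python sketch. The age values are made up for the example (they are not the class questionnaire data), and the "plot" is only a crude text bar per value.

```python
from collections import Counter

# Hypothetical ages from a class questionnaire (illustrative values only).
ages = [18, 19, 19, 20, 20, 20, 21, 21, 22, 19, 20, 18]

# Frequency table: count how many subjects reported each age.
freq = Counter(ages)

# Print the table, plus one * per subject as a crude text histogram.
for age in sorted(freq):
    print(f"{age:>3}  {freq[age]:>2}  {'*' * freq[age]}")
```

Counting with `Counter` is the same "simply count the subjects with each value" step described for the frequency table.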

Frequency Data  Example: Age as estimated by a questionnaire in a statistics class.  Note: The frequencies in the adjacent table were calculated by simply counting the number of subjects having the specified value for the age variable.

Grouping data  Plotting is easy when the variable of interest has a relatively small number of values (like our age variable did).  However, some variables are closer to continuous and take many distinct values, so a frequency plot built in the above manner is uninformative.

Grouped Frequency Distribution  Example: Binning our weight variable.  With a variable like weight, we might obtain a range from 100 lb. to 200 lb. If we used the previously described technique, we would end up with about 100 bars, most of them with a frequency less than 2 or 3 (and many with a frequency of zero).  We can get around this problem by grouping our values into bins. Try for around 10 classes (or bins) with natural splits.
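A minimal sketch of binning, assuming a bin width of 10 lb. (so 100 to 200 lb. splits into 10 classes). The weights below are invented for illustration, and `bin_start` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical weights in pounds (illustrative values only).
weights = [104, 118, 121, 125, 133, 137, 139, 142, 148, 150,
           151, 156, 158, 163, 167, 172, 175, 181, 188, 196]

BIN_WIDTH = 10  # assumed: 100-200 lb. split into 10 classes


def bin_start(value, width=BIN_WIDTH):
    """Return the lower limit of the bin that contains value."""
    return (value // width) * width


# Count how many weights fall into each bin.
grouped = Counter(bin_start(w) for w in weights)

# Print every bin from 100 up to (but not including) 200, even empty ones.
for start in range(100, 200, BIN_WIDTH):
    print(f"{start}-under {start + BIN_WIDTH}: {grouped[start]}")
```

Each value is assigned to the bin whose lower limit is the nearest multiple of the width at or below it, which gives the "natural splits" at 100, 110, 120, and so on.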

Graphic Depiction of Frequency  Histogram: similar to a bar chart, the only difference being that histograms represent non-nominal data.  Age example

Weight example  Check out this demo, which shows how the bin width you select can affect the “look” of the data.  Here is another, similar demonstration of the effects of bin width.

Number of Classes and Class Width  The number of classes should be between 5 and 15. Fewer than 5 classes cause excessive summarization. More than 15 classes tend not to add much.  Class Width: divide the range by the number of classes for an approximate class width, then round up to a convenient number.
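The class-width rule can be sketched as a small function. The choice of "convenient number" is an assumption here: the hypothetical `step` parameter rounds up to the next multiple of 5. The example uses the lowest and highest introversion scores listed in these notes (23 and 74) with 6 classes.

```python
import math


def class_width(low, high, n_classes, step=5):
    """Approximate class width: range / n_classes, rounded up to a multiple of step."""
    raw = (high - low) / n_classes      # e.g. (74 - 23) / 6 = 8.5
    return math.ceil(raw / step) * step  # round up to a convenient number


print(class_width(23, 74, 6))     # introversion scores, 6 classes
print(class_width(100, 200, 10))  # weight example, 10 classes
```

Both calls give a width of 10, which matches the 10-unit classes used in the tables and the weight example.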

Example of Ungrouped Data  Scores on a social introversion inventory

42 30 53 50 52 30 55 49 61 74
26 58 40 28 36 30 33 31 37 32
37 30 32 23 32 58 43 30 29 34
50 47 31 35 26 64 46 40 43 57
30 49 40 25 50 52 32 60 54

Relative Frequency

Class Interval   Frequency   Relative Frequency
20-under 30           6           .12
30-under 40          18           .36
40-under 50          11           .22
50-under 60          11           .22
60-under 70           3           .06
70-under 80           1           .02
Total                50          1.00

Cumulative Frequency

Class Interval   Frequency   Cumulative Frequency
20-under 30           6            6
30-under 40          18           24
40-under 50          11           35
50-under 60          11           46
60-under 70           3           49
70-under 80           1           50
Total                50

Class Midpoints, Relative Frequencies, and Cumulative Frequencies

Class Interval   Frequency   Midpoint   Relative Frequency   Cumulative Frequency
20-under 30           6         25            .12                    6
30-under 40          18         35            .36                   24
40-under 50          11         45            .22                   35
50-under 60          11         55            .22                   46
60-under 70           3         65            .06                   49
70-under 80           1         75            .02                   50
Total                50                      1.00

Cumulative Relative Frequencies

Class Interval   Frequency   Relative Frequency   Cumulative Frequency   Cumulative Relative Frequency
20-under 30           6            .12                    6                      .12
30-under 40          18            .36                   24                      .48
40-under 50          11            .22                   35                      .70
50-under 60          11            .22                   46                      .92
60-under 70           3            .06                   49                      .98
70-under 80           1            .02                   50                     1.00
Total                50           1.00
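The derived columns in these tables can be recomputed from the class limits and frequencies alone. A minimal sketch, with the frequencies copied from the tables above (variable names are illustrative):

```python
from itertools import accumulate

# Class limits and frequencies copied from the tables above.
classes = [(20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80)]
freqs = [6, 18, 11, 11, 3, 1]

total = sum(freqs)
midpoints = [(lo + hi) / 2 for lo, hi in classes]   # centre of each class
relative = [f / total for f in freqs]               # share of all scores
cumulative = list(accumulate(freqs))                # running total of frequencies
cum_relative = [c / total for c in cumulative]      # running share of all scores

for (lo, hi), f, m, r, c, cr in zip(classes, freqs, midpoints,
                                    relative, cumulative, cum_relative):
    print(f"{lo}-under {hi}  {f:>2}  {m:>3.0f}  {r:.2f}  {c:>2}  {cr:.2f}")
```

The relative frequency is the class frequency divided by the total, and each cumulative column is a running sum of the column it is built from.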

Histogram Construction

Class Interval   Frequency
20-under 30           6
30-under 40          18
40-under 50          11
50-under 60          11
60-under 70           3
70-under 80           1
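A histogram draws one bar per class, with bar length equal to the class frequency. As a sketch only, the following prints a sideways text version from the table's frequencies; `histogram_rows` is a hypothetical helper, not part of any plotting library.

```python
# Class intervals and frequencies from the table above.
intervals = ["20-under 30", "30-under 40", "40-under 50",
             "50-under 60", "60-under 70", "70-under 80"]
freqs = [6, 18, 11, 11, 3, 1]


def histogram_rows(labels, counts, mark="#"):
    """Return one text row per class: label, a bar of marks, and the count."""
    return [f"{label}  {mark * count} {count}"
            for label, count in zip(labels, counts)]


for row in histogram_rows(intervals, freqs):
    print(row)
```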

Frequency Polygon

Class Interval   Frequency
20-under 30           6
30-under 40          18
40-under 50          11
50-under 60          11
60-under 70           3
70-under 80           1
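A frequency polygon plots each class frequency at the class midpoint and joins the points with line segments. The sketch below only computes the points. Closing the polygon at zero one class width beyond each end is a common convention assumed here, not something stated in these notes.

```python
classes = [(20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80)]
freqs = [6, 18, 11, 11, 3, 1]


def polygon_points(classes, freqs):
    """Return (midpoint, frequency) pairs, anchored at zero on both ends."""
    width = classes[0][1] - classes[0][0]           # assumes equal-width classes
    mids = [(lo + hi) / 2 for lo, hi in classes]
    points = list(zip(mids, freqs))
    # Assumed convention: close the polygon one class width beyond each end.
    return [(mids[0] - width, 0)] + points + [(mids[-1] + width, 0)]


for x, y in polygon_points(classes, freqs):
    print(x, y)
```

The resulting list of points can be handed to whatever plotting tool is available and drawn as a connected line.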

Advantages/Disadvantages  With the grouped frequency distribution we can take large data sets and make them much more manageable and easier to understand.  However, we also lose information about individual data points.

Stem and Leaf Plots  If values of a variable must be grouped prior to creating a frequency plot, then the information related to the specific values becomes lost in the process (i.e., the resulting graph depicts only the frequency values associated with the grouped values).  However, it is possible to obtain the graphical advantage of grouping and still keep all of the information if stem & leaf plots are used.

Stem and Leaf Plots  These plots are created by splitting a data point into that part associated with the ‘group’ and that associated with the individual point.  For example, the numbers 180, 180, 181, 182, 185, 186, 187, 187, 189 could be represented as:

18 | 001256779

Construction of Stem and Leaf Plot

Raw Data
86 76 23 77 81 79 68
77 92 59 68 75 83 49
91 47 72 82 74 70 56
60 88 75 97 39 78 94
55 67 83 89 67 91 81

Stem   Leaf
2      3
3      9
4      7 9
5      5 6 9
6      0 7 7 8 8
7      0 2 4 5 5 6 7 7 8 9
8      1 1 2 3 3 6 8 9
9      1 1 2 4 7
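The construction can be sketched in a few lines of Python, assuming two-digit data where the stem is the tens digit and the leaf is the units digit. `stem_and_leaf` is an illustrative helper name.

```python
from collections import defaultdict

# Raw data from the slide above.
scores = [86, 76, 23, 77, 81, 79, 68, 77, 92, 59, 68, 75, 83, 49, 91, 47, 72, 82,
          74, 70, 56, 60, 88, 75, 97, 39, 78, 94, 55, 67, 83, 89, 67, 91, 81]


def stem_and_leaf(values):
    """Map each stem (value // 10) to its sorted leaves (value % 10)."""
    plot = defaultdict(list)
    for v in sorted(values):          # sorting first keeps the leaves in order
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return dict(plot)


plot = stem_and_leaf(scores)
for stem in sorted(plot):
    print(f"{stem} | {' '.join(str(leaf) for leaf in plot[stem])}")
```

Because `divmod(v, 10)` keeps everything above the units digit as the stem, the same function turns 180, 180, 181, ... 189 into a single stem of 18 with leaves 0 0 1 2 5 6 7 7 9.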

 Thus, we could represent our weight data in the following stem & leaf plot:

 Stem & leaf plots are especially nice for comparing distributions.

Advantages  Using a stem and leaf offers several advantages It retains individual data points Displays large amounts of data well (compared to a normal frequency distribution) Provides a ‘graphical’ display of the data  Disadvantage Kind of ugly

Terminology Related to Distributions  Often, frequency histograms have a roughly symmetrical bell shape; such distributions are called normal or Gaussian.

Sometimes, the bell shape is not symmetrical.  The term positive skew refers to the situation where the “tail” of the distribution is to the right; negative skew is when the “tail” is to the left.
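One way to put a number on the direction of the tail is the moment coefficient of skewness, which is positive for a right tail and negative for a left tail. This is a sketch under assumptions: the population (divide-by-n) version of the formula is used, and the data sets are invented for illustration.

```python
def skewness(values):
    """Moment coefficient of skewness: mean cubed deviation / (std dev ** 3)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # population variance
    m3 = sum((x - mean) ** 3 for x in values) / n   # third central moment
    return m3 / m2 ** 1.5


# Hypothetical data sets (illustrative values only).
right_tail = [1, 2, 2, 3, 3, 3, 4, 10]     # one large value stretches the right tail
left_tail = [-x for x in right_tail]       # mirror image: tail on the left
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_tail) > 0)    # positive skew
print(skewness(left_tail) < 0)     # negative skew
print(abs(skewness(symmetric)) < 1e-12)
```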

Example: Pizza Data

Distribution Shapes  Normal; Positively Skewed; Negatively Skewed; Bimodal
