Chapter 3: EXPLORATORY DATA ANALYSIS
3.1 GRAPHICAL DISPLAY OF DATA
3.2 MEASURES OF CENTRAL TENDENCY
3.3 MEASURES OF DISPERSION
3.1 Graphical Display of Data
Most of the statistical information in newspapers, magazines, company reports, and other publications consists of data that are summarized and presented in a form that is easy for the reader to understand. In this chapter, we discuss and display several graphical tools for summarizing and presenting data, including the histogram, frequency polygon, ogive, dot plot, bar chart, pie chart, and the scatter plot for two-variable numerical data.
3.1 Graphical Display of Data: Ungrouped Versus Grouped Data
Ungrouped data
- have not been summarized in any way
- are also called raw data
Grouped data
- fall into logical groupings, e.g. age ranges (20-29, 30-39, etc.)
- have been organized into a frequency distribution
3.1 Graphical Display of Data: Example of Ungrouped Data
Table: Ages of a Sample of Managers from Urban Child Care Centers in the United States
3.1 Graphical Display of Data: Frequency Distribution
Frequency distribution: a summary of data presented in the form of class intervals and frequencies
- Frequency distributions vary in shape and design
- They are constructed according to the individual researcher's preferences
Frequency Distribution: Steps in Construction
Step 1: Determine the range of the frequency distribution
- The range is the difference between the highest and the lowest numbers
Step 2: Determine the number of classes
- Don't use too many or too few classes
Step 3: Determine the width of the class interval
- The approximate class width can be calculated by dividing the range by the number of classes
- Each value must fit into only one class
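The three steps above can be sketched in Python. This is a minimal illustration rather than the chapter's worked example: the `ages` list is hypothetical, the helper name `frequency_distribution` is made up for this sketch, and the choice of six classes of width 10 starting at 20 is an assumption.

```python
# Hypothetical ages (illustrative only; not the chapter's data set).
ages = [23, 27, 31, 34, 35, 38, 41, 42, 46, 49, 52, 55, 58, 63, 74]


def frequency_distribution(data, start, width, num_classes):
    """Count values in classes of the form 'lower-under upper'."""
    counts = [0] * num_classes
    for x in data:
        index = (x - start) // width  # each value lands in exactly one class
        if 0 <= index < num_classes:
            counts[index] += 1
    return [
        (start + i * width, start + (i + 1) * width, counts[i])
        for i in range(num_classes)
    ]


# Step 1: range = highest value - lowest value.
data_range = max(ages) - min(ages)

# Step 2: number of classes (assumed: 6).
# Step 3: approximate width = range / classes, rounded up to a convenient 10.
table = frequency_distribution(ages, start=20, width=10, num_classes=6)

for lower, upper, frequency in table:
    print(f"{lower}-under {upper}: {frequency}")
```

Using half-open classes (lower bound included, upper bound excluded) is what guarantees that a value such as 30 is counted once, in 30-under 40.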
Table: Frequency Distribution of Child Care Managers' Ages (columns: Class Interval, Frequency; class intervals 20-under 30 through 70-under 80; the 70-under 80 class has frequency 1)
3.1 Graphical Display of Data: Relative Frequency
The relative frequency is the proportion of the total frequency that falls in any given class interval of a frequency distribution.
Table: Class Interval, Frequency, Relative Frequency (class intervals 20-under 30 through 70-under 80, with a Total row)
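As a rough illustration of relative frequency, the sketch below divides each class frequency by the total. The frequencies are hypothetical placeholders, not the values from the managers' ages table.

```python
# Hypothetical class frequencies (illustrative only).
classes = ["20-under 30", "30-under 40", "40-under 50",
           "50-under 60", "60-under 70", "70-under 80"]
frequencies = [2, 4, 4, 3, 1, 1]

total = sum(frequencies)
# Relative frequency = class frequency / total frequency.
relative = [f / total for f in frequencies]

for label, f, r in zip(classes, frequencies, relative):
    print(f"{label:<12} {f:>3} {r:.3f}")
```

The relative frequencies always sum to 1 (up to floating-point rounding), which is a quick check on the table.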
3.1 Graphical Display of Data: Cumulative Frequency
The cumulative frequency is a running total of frequencies through the classes of a frequency distribution.
Table: Class Interval, Frequency, Cumulative Frequency (class intervals 20-under 30 through 70-under 80; total frequency 50)
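A running total is easy to sketch with `itertools.accumulate` from the Python standard library. The frequencies here are hypothetical and stand in for the table's values.

```python
from itertools import accumulate

# Hypothetical class frequencies (illustrative only).
frequencies = [2, 4, 4, 3, 1, 1]

# Each entry is the sum of all frequencies up to and including that class.
cumulative = list(accumulate(frequencies))

print(cumulative)
```

The last cumulative frequency equals the total number of observations, and these running totals are the values an ogive plots.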
Common Statistical Graphs: Quantitative Data
- Histogram: vertical bar chart of frequencies
- Frequency polygon: line graph of frequencies
- Ogive: line graph of cumulative frequencies
- Stem and leaf plot: like a histogram, but shows individual data values; useful for small data sets
- Pareto chart: a chart that contains both bars and a line graph. The bars display the values in descending order, and the line graph shows the cumulative total across categories, left to right. The purpose is to highlight the most important among a (typically large) set of factors.
3.1 Graphical Display of Data: Histogram
- A histogram is a graphical summary of a frequency distribution
- The number and location of bins (bars) should be determined based on the sample size and the range of the data
Data Range: the difference between the largest and the smallest values in the data
Number of Classes and Class Width
Number of classes
- The number of classes should be between 5 and 15
- Fewer than 5 classes cause excessive summarization
- More than 15 classes leave too much detail
- Or use a rule of thumb such as Sturges' rule: number of classes ≈ 1 + 3.3 log10(n), where n is the number of data values
Class width
- Divide the range by the number of classes for an approximate class width
- Round up to a convenient number
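The sketch below shows one way to turn these guidelines into numbers. It assumes Sturges' rule (number of classes ≈ 1 + 3.3 log10 n) as the logarithmic rule of thumb; the function names and the extremes 21 and 74 are hypothetical, chosen only for illustration.

```python
import math


def sturges_classes(n):
    """Suggested number of classes by Sturges' rule, rounded up (an assumed rule of thumb)."""
    return math.ceil(1 + 3.3 * math.log10(n))


def approximate_width(smallest, largest, num_classes):
    """Range divided by the number of classes, before rounding to a convenient number."""
    return (largest - smallest) / num_classes


k = sturges_classes(50)                # suggestion for n = 50 observations
width = approximate_width(21, 74, k)   # hypothetical smallest and largest values

print(k, round(width, 2))
```

The result is only a starting point: the approximate width would still be rounded up to a convenient number such as 10, and the class count adjusted to stay between 5 and 15.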
Class Midpoint
The midpoint of each class interval is called the class midpoint or the class mark. It is the average of the two class endpoints.
Table: Midpoints for Age Classes (columns: Class Interval, Frequency, Midpoint, Relative Frequency, Cumulative Frequency; class intervals 20-under 30 through 70-under 80, with a Total row)
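The midpoints for these age classes follow directly from the class endpoints. A minimal sketch, assuming classes of width 10 running from 20 to 80:

```python
# Class endpoints for 20-under 30 through 70-under 80.
classes = [(20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80)]

# Midpoint (class mark) = average of the lower and upper endpoints.
midpoints = [(lower + upper) / 2 for lower, upper in classes]

print(midpoints)  # [25.0, 35.0, 45.0, 55.0, 65.0, 75.0]
```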
Figure: Histogram of the frequency distribution (class intervals 20-under 30 through 70-under 80)
Figure: Frequency Polygon of the same frequency distribution
Figure: Ogive of the cumulative frequencies (total frequency 50)
Stem and Leaf Plot: Safety Examination Scores for Plant Trainees
Table: Raw Data alongside the Stem and Leaf display
Construction of Stem and Leaf Plot
Table: Raw Data alongside successive Stem and Leaf displays
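One way to construct a stem and leaf plot for two-digit scores is to split each value into a tens digit (stem) and a units digit (leaf). The scores below are hypothetical, not the trainees' actual examination scores, and `stem_and_leaf` is a name invented for this sketch.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group the units digits (leaves) of sorted values under their tens digits (stems)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)


# Hypothetical scores (illustrative only).
scores = [86, 77, 91, 60, 55, 76, 92, 47, 88, 67]
plot = stem_and_leaf(scores)

for stem in sorted(plot):
    print(stem, "|", " ".join(str(leaf) for leaf in plot[stem]))
```

Because every leaf is kept, the original values can be read back from the plot, which is what distinguishes it from a histogram.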
Common Statistical Graphs: Qualitative Data
- Pie chart: proportional representation for categories of a whole
- Bar chart: frequency or relative frequency of one or more categorical variables
Complaints by Amtrak Passengers
COMPLAINT            NUMBER    PROPORTION    DEGREES
Stations, etc.       28,
Train Performance    14,
Equipment            10,
Personnel            9,
Schedules, etc.      7,
Total                70,
Second Quarter Truck Production in the U.S. (Hypothetical values)
Table: Company (A, B, C, D, E, Totals) and 2d Quarter Truck Production
Figure: Pie chart of Second Quarter U.S. Truck Production
Table: Pie Chart Calculations for Company A (columns: Company, 2d Quarter Truck Production, Proportion, Degrees)
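The Proportion and Degrees columns of a pie chart table can be computed as in the sketch below: each category's proportion is its value divided by the total, and its slice angle is that proportion times 360 degrees. The production figures are made up for illustration and are not the table's values; `pie_slices` is a hypothetical helper name.

```python
def pie_slices(values):
    """Return {category: (proportion, degrees)} for a dict of category totals."""
    total = sum(values.values())
    return {
        category: (amount / total, amount / total * 360)
        for category, amount in values.items()
    }


# Hypothetical production figures (illustrative only).
production = {"A": 400, "B": 300, "C": 200, "D": 100}
slices = pie_slices(production)

for company, (proportion, degrees) in slices.items():
    print(f"{company}: proportion {proportion:.2f}, degrees {degrees:.1f}")
```

The proportions should sum to 1 and the degrees to 360, which gives a simple check on the calculation.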
3.2 Measures of Central Tendency: Ungrouped Data
Measures of central tendency yield information about “particular places or locations in a group of numbers.”
Common measures of location
- Mode
- Median
- Mean
- Percentiles
- Quartiles
Mode
- Mode: the most frequently occurring value in a data set
- Applicable to all levels of data measurement (nominal, ordinal, interval, and ratio)
- Can be used to determine which categories occur most frequently
- Sometimes no mode exists (no duplicates)
- Bimodal: in a tie for the most frequently occurring value, two modes are listed
- Multimodal: data sets that contain more than two modes
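These cases (one mode, no mode, several modes) can be sketched with a small counting function. The name `modes` and the sample lists are hypothetical; returning an empty list when nothing repeats is a design choice that mirrors the "no mode exists" convention above.

```python
from collections import Counter


def modes(data):
    """Return all most-frequent values, or an empty list when no value repeats."""
    counts = Counter(data)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:  # no duplicates: no mode
        return []
    return sorted(value for value, count in counts.items() if count == top)


print(modes([9, 9, 7, 6]))            # one mode
print(modes([1, 2, 3]))               # no mode
print(modes([1, 1, 2, 2, 3]))         # bimodal
print(modes(["red", "blue", "red"]))  # also works for nominal data
```

Note that the standard library's `statistics.multimode` behaves differently for data with no duplicates: it returns every value, since all are tied.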
Median
- Median: the middle value in an ordered array of numbers; half the data are above it, half are below it
- Mathematically, it is the ((n+1)/2)th ordered observation
- For an array with an odd number of terms, the median is the middle number
  n = 11 => (n+1)/2 = 12/2 = 6, so the median is the 6th ordered observation
- For an array with an even number of terms, the median is the average of the middle two numbers
  n = 10 => (n+1)/2 = 11/2 = 5.5, so the median is the average of the 5th and 6th ordered observations
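The (n+1)/2 position rule translates into code fairly directly. In this sketch (the function name is hypothetical) a whole-number position picks a single observation, while a position ending in .5 averages the two neighbouring observations.

```python
import math


def median_by_position(data):
    """Median located at the (n + 1) / 2 th ordered observation (1-based)."""
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    position = (n + 1) / 2
    lower = ordered[math.floor(position) - 1]  # convert 1-based position to 0-based index
    upper = ordered[math.ceil(position) - 1]
    return (lower + upper) / 2                 # lower == upper when n is odd


print(median_by_position(range(1, 12)))  # n = 11: 6th observation
print(median_by_position(range(1, 11)))  # n = 10: average of 5th and 6th
```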
Arithmetic Mean
- The mean is the average of a group of numbers
- Applicable for interval and ratio data
- Not applicable for nominal or ordinal data
- Affected by each value in the data set, including extreme values
- Computed by summing all values in the data set and dividing the sum by the number of values in the data set
Demonstration Problem 3.1
The number of U.S. cars in service by top car rental companies in a recent year, according to Auto Rental News, follows.

Company            Number of Cars in Service
Enterprise         643,000
Hertz              327,000
National/Alamo     233,000
Avis               204,000
Dollar/Thrifty     167,000
Budget             144,000
Advantage           20,000
U-Save              12,000
Payless             10,000
ACE                  9,000
Fox                  9,000
Rent-A-Wreck         7,000
Triangle             6,000

Compute the mode, the median, and the mean.
Solutions
- Mode: 9,000 (two companies have 9,000 cars in service)
- Median: With 13 different companies in this group, N = 13. The median is located at the (13 + 1)/2 = 7th position. Because the data are already ordered, the median is the 7th term, which is 20,000.
- Mean: μ = ∑x/N = 1,791,000/13 = 137,769.23
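These solutions can be cross-checked with Python's standard `statistics` module. The dictionary below simply restates the figures from Demonstration Problem 3.1; the variable names are chosen for this sketch.

```python
import statistics

cars_in_service = {
    "Enterprise": 643_000, "Hertz": 327_000, "National/Alamo": 233_000,
    "Avis": 204_000, "Dollar/Thrifty": 167_000, "Budget": 144_000,
    "Advantage": 20_000, "U-Save": 12_000, "Payless": 10_000,
    "ACE": 9_000, "Fox": 9_000, "Rent-A-Wreck": 7_000, "Triangle": 6_000,
}

values = list(cars_in_service.values())

mode = statistics.mode(values)      # most frequent value
median = statistics.median(values)  # middle of the 13 ordered values
mean = statistics.mean(values)      # sum of values / number of values

print(mode, median, round(mean, 2))  # 9000 20000 137769.23
```

The gap between the median (20,000) and the mean (about 137,769) reflects how strongly the few very large companies pull the mean upward.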
Which Measure Do I Use?
Which measure of central tendency is most appropriate?
- In general, the mean is preferred, since it has nice mathematical properties (in particular, see chapter 7)
- The median and quartiles are resistant to outliers
- Consider the following three data sets:
  1, 2, 3 (median = 2, mean = 2)
  1, 2, 6 (median = 2, mean = 3)
  1, 2, 30 (median = 2, mean = 11)
- All three have median = 2, but the mean is sensitive to the outliers
- In general, if there are outliers, the median is preferred to the mean
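The three data sets above make a convenient check of this behaviour. A short sketch using the standard `statistics` module:

```python
import statistics

datasets = [[1, 2, 3], [1, 2, 6], [1, 2, 30]]

medians = [statistics.median(data) for data in datasets]
means = [statistics.mean(data) for data in datasets]

for data, med, avg in zip(datasets, medians, means):
    print(data, "median =", med, "mean =", avg)
```

Only the largest value changes between the data sets, yet the mean moves from 2 to 11 while the median stays at 2.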
Calculation of Grouped Mean
Sometimes data are already grouped, and you are interested in calculating summary statistics.
Table: Interval, Frequency (f), Midpoint (M), f*M (class intervals 20-under 30 through 70-under 80)
Grouped mean = (sum of f*M) / (sum of f)
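A grouped mean treats every observation in a class as if it sat at the class midpoint, so it is a frequency-weighted average of the midpoints. The sketch below uses hypothetical frequencies, since the table's values are not reproduced here.

```python
# Class midpoints (M) for 20-under 30 through 70-under 80.
midpoints = [25, 35, 45, 55, 65, 75]
# Hypothetical class frequencies (f), illustrative only.
frequencies = [2, 4, 4, 3, 1, 1]

# The f*M column, then its sum divided by the sum of f.
f_times_m = [f * m for f, m in zip(frequencies, midpoints)]
grouped_mean = sum(f_times_m) / sum(frequencies)

print(f_times_m, grouped_mean)
```

Because the individual values are replaced by midpoints, the result is an approximation of the mean of the raw data.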
Median of Grouped Data: Example
Table: Class Interval, Frequency, Cumulative Frequency (class intervals 20-under 30 through 70-under 80; N = 50)
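The slide's formula is not reproduced here, so the sketch below assumes the commonly taught interpolation formula for a grouped median: median = L + ((N/2 - cf) / f) × W, where L is the lower limit of the median class, cf the cumulative frequency before it, f its frequency, and W the class width. The frequencies and the function name are hypothetical.

```python
def grouped_median(lower_limits, width, frequencies):
    """Interpolated median of grouped data (assumed formula: L + ((N/2 - cf) / f) * W)."""
    n = sum(frequencies)
    if n == 0:
        raise ValueError("no observations")
    half = n / 2
    cumulative_before = 0
    for lower, f in zip(lower_limits, frequencies):
        # The median class is the first class whose cumulative frequency reaches N/2.
        if f > 0 and cumulative_before + f >= half:
            return lower + (half - cumulative_before) / f * width
        cumulative_before += f


# Hypothetical frequencies for classes 20-under 30 through 70-under 80.
lower_limits = [20, 30, 40, 50, 60, 70]
frequencies = [2, 4, 4, 3, 1, 1]

median = grouped_median(lower_limits, 10, frequencies)
print(median)
```

With these numbers N/2 = 7.5 falls in the 40-under 50 class (cumulative frequency 6 before it), so the sketch interpolates 1.5/4 of the way through that class.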
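Mode of Grouped Data
- The modal class is the class with the greatest frequency
- The mode of grouped data is the midpoint of the modal class
Table: Class Interval, Frequency (class intervals 20-under 30 through 70-under 80; the 70-under 80 class has frequency 1)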
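Finding the grouped mode amounts to locating the class with the largest frequency and taking its midpoint. A minimal sketch with hypothetical frequencies:

```python
classes = [(20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80)]
# Hypothetical class frequencies (illustrative only).
frequencies = [2, 5, 4, 3, 1, 1]

# Index of the modal class; on a tie, max() keeps the first such class.
modal_index = max(range(len(frequencies)), key=frequencies.__getitem__)
lower, upper = classes[modal_index]
grouped_mode = (lower + upper) / 2

print(classes[modal_index], grouped_mode)
```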
3.3 Measures of Dispersion: Range
- The range is the difference between the largest and the smallest values in a set of data
- Advantage: easy to compute
- Disadvantage: affected by extreme values
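A short sketch of the range, using two small hypothetical data sets to show how a single extreme value changes it (`data_range` is a name chosen for this example):

```python
def data_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


print(data_range([1, 2, 3]))   # 2
print(data_range([1, 2, 30]))  # 29: one extreme value dominates the range
```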
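3.3 Measures of Dispersion: Sample Variance
- Sample variance: the average of the squared deviations from the arithmetic mean, computed with n - 1 in the denominator
- Denoted by s^2: s^2 = ∑(X - X̄)^2 / (n - 1)
Table: worked example with columns X, X - X̄, and (X - X̄)^2; for instance, the row X = 1,844 has deviation 71 and squared deviation 5,041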
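The computation can be sketched step by step: find the mean, square each deviation from it, and divide the sum of squares by n - 1 (the usual sample-variance denominator, assumed here). The data values are hypothetical and the function name is invented for this example.

```python
def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two values")
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)


# Hypothetical data (illustrative only): mean is 5, squared deviations sum to 28.
data = [2, 4, 4, 6, 9]
variance = sample_variance(data)

print(variance)  # 7.0
```

The standard library's `statistics.variance` uses the same n - 1 denominator and can serve as a cross-check.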
3.3 Measures of Dispersion: Sample Standard Deviation
- The sample standard deviation is the square root of the sample variance: s = sqrt(s^2)
- It is expressed in the same units as the original data
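As a brief illustration, the sketch below takes the square root of a sample variance and compares it with `statistics.stdev` from the standard library. The data values are hypothetical.

```python
import math
import statistics

# Hypothetical data (illustrative only).
data = [2, 4, 4, 6, 9]

variance = statistics.variance(data)  # sample variance (n - 1 denominator)
std_dev = math.sqrt(variance)         # back in the units of the original data

print(variance, round(std_dev, 4))
print(math.isclose(std_dev, statistics.stdev(data)))
```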