Statistics for Business and Economics Chapter 2 Methods for Describing Sets of Data
Learning Objectives: Describe Qualitative Data Graphically; Describe Quantitative Data Graphically; Explain Numerical Data Properties; Describe Summary Measures; Analyze Numerical Data Using Summary Measures
Thinking Challenge: [Figure: bar chart of market share for X, Y, and Us, with a vertical axis running from 30% to 36%.] "Our market share far exceeds all competitors!" - VP. Problem: no zero point on the vertical axis. Maybe a pie chart would be better.
Data Presentation: Qualitative Data (Summary Table, Bar Graph, Pie Chart, Pareto Diagram); Quantitative Data (Stem-&-Leaf Display, Frequency Distribution, Histogram)
Presenting Qualitative Data
Summary Table: Lists categories and the number of elements in each category. Obtained by tallying responses in each category. May show frequencies (counts), percentages, or both. Each row is a category:
Major        Count
Accounting   130
Economics    20
Management   50
Total        200
Bar Graph: Equal bar widths. Bar height shows frequency or percent. Vertical bars are shown here for qualitative variables, and the vertical axis has a zero point. Note: some texts use horizontal bars for categorical variables and vertical bars for numerical variables; conventions vary in the literature. There are also many variations on the bar graph (e.g., stacked bar).
Pie Chart: Shows the breakdown of a total quantity into categories. Useful for showing relative differences. Angle size = (360°)(percent). Majors: Acct. 65%, Mgmt. 25%, Econ. 10%. Example: the Econ. slice spans (360°)(10%) = 36°.
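Sketch (not part of the original slides): the angle rule can be checked with a few lines of Python. The dictionary name `major_percent` and the function `slice_angle` are illustrative names chosen here; the percentages are the ones on the pie chart slide.

```python
# Percent share of each major, taken from the pie chart slide.
major_percent = {"Acct.": 65, "Mgmt.": 25, "Econ.": 10}


def slice_angle(percent):
    """Return the central angle in degrees for a slice with the given percent."""
    return 360 * percent / 100


# Angle of every slice; the angles of a complete pie add up to 360 degrees.
angles = {major: slice_angle(pct) for major, pct in major_percent.items()}

for major, angle in angles.items():
    print(f"{major:<6} {angle:6.1f} degrees")
```

The Econ. slice comes out at 36 degrees, matching the slide.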
Pareto Diagram: Like a bar graph, but with the categories arranged by height in descending order from left to right. Equal bar widths. Bar height shows frequency or percent. Vertical bars for qualitative variables. The vertical axis has a zero point.
Thinking Challenge: You’re an analyst for IRI. You want to show the market shares held by Web browsers in 2006. Construct a bar graph, pie chart, and Pareto diagram to describe the data. (Allow students 10-15 minutes to complete this before revealing answers.)
Browser             Mkt. Share (%)
Firefox             14
Internet Explorer   81
Safari              4
Others              1
Bar Graph Solution*: [Figure: bar graph of Market Share (%) by Browser]
Pie Chart Solution*: [Figure: pie chart of Market Share]
Pareto Diagram Solution*: [Figure: Pareto diagram of Market Share (%) by Browser]
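Sketch (not part of the original slides): the only step that separates a Pareto diagram from a bar graph is the sort. The snippet below orders the browser categories by share, largest first, and prints a rough text bar for each. The names `market_share` and `pareto_order` and the scale of one `#` per 2 percentage points are assumptions made for illustration.

```python
# Market shares (%) from the Thinking Challenge table.
market_share = {
    "Firefox": 14,
    "Internet Explorer": 81,
    "Safari": 4,
    "Others": 1,
}

# Pareto ordering: categories sorted by bar height, descending.
pareto_order = sorted(market_share.items(), key=lambda item: item[1], reverse=True)

for browser, share in pareto_order:
    # One '#' per 2 percentage points, purely as a rough text rendering.
    print(f"{browser:<18}{share:>3}%  " + "#" * (share // 2))
```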
Presenting Quantitative Data
Stem-and-Leaf Display: 1. Divide each observation into a stem value and a leaf value. The stem value defines the class; the leaf values define the frequency (count). For example, 26 has stem 2 and leaf 6. 2. Data: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41
Stem | Leaves
2    | 144677
3    | 028
4    | 1
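Sketch (not part of the original slides): for two-digit data like this, the stem is the tens digit and the leaf is the units digit, so `divmod(value, 10)` does the split. The function name `stem_and_leaf` is an illustrative choice, and the approach assumes non-negative integers.

```python
from collections import defaultdict

# Data from the stem-and-leaf slide.
data = [21, 24, 24, 26, 27, 27, 30, 32, 38, 41]


def stem_and_leaf(values):
    """Group sorted values by stem (tens digit); leaves are the units digits."""
    stems = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        stems[stem].append(leaf)
    return dict(stems)


display = stem_and_leaf(data)
for stem, leaves in display.items():
    print(f"{stem} | {''.join(str(leaf) for leaf in leaves)}")
```

The number of leaves on each stem is the frequency of that class.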
Frequency Distribution Table Steps: 1. Determine range. 2. Select number of classes (usually between 5 and 15 inclusive). 3. Compute class intervals (width). 4. Determine class boundaries (limits). 5. Compute class midpoints. 6. Count observations and assign to classes.
Frequency Distribution Table Example: Raw Data: 24, 26, 24, 21, 27, 27, 30, 41, 32, 38. Class boundaries are 15.5, 25.5, 35.5, and 45.5 (class width 10). Midpoint = (Lower + Upper Boundaries) / 2.
Class         Midpoint   Frequency
15.5 – 25.5   20.5       3
25.5 – 35.5   30.5       5
35.5 – 45.5   40.5       2
Relative Frequency & % Distribution Tables: The number of classes is usually between 5 and 15. Only 3 are used here for illustration purposes.
Class         Prop.   %
15.5 – 25.5   .3      30.0
25.5 – 35.5   .5      50.0
35.5 – 45.5   .2      20.0
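Sketch (not part of the original slides): the tables can be rebuilt from the raw data and the class boundaries. The function name `frequency_table` and the row layout are illustrative. Treating the upper boundary as inclusive is an assumption; no observation falls exactly on a boundary here, so it does not affect the counts.

```python
# Raw data and class boundaries from the frequency distribution example.
data = [24, 26, 24, 21, 27, 27, 30, 41, 32, 38]
boundaries = [15.5, 25.5, 35.5, 45.5]


def frequency_table(values, bounds):
    """Return one row per class: boundaries, midpoint, frequency, proportion."""
    rows = []
    for lower, upper in zip(bounds, bounds[1:]):
        # Assumption: a value equal to the upper boundary belongs to this class.
        count = sum(lower < v <= upper for v in values)
        rows.append(
            {
                "class": (lower, upper),
                "midpoint": (lower + upper) / 2,
                "frequency": count,
                "proportion": count / len(values),
            }
        )
    return rows


table = frequency_table(data, boundaries)
for row in table:
    lower, upper = row["class"]
    print(
        f"{lower} - {upper}  midpoint={row['midpoint']}  "
        f"freq={row['frequency']}  prop={row['proportion']:.1f}  "
        f"pct={100 * row['proportion']:.1f}"
    )
```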
Histogram: [Figure: histogram of the frequency distribution below. The vertical axis shows Count (it may show Frequency, Relative Frequency, or Percent) from 0 to 5; the horizontal axis is labeled with the lower boundaries 15.5, 25.5, 35.5, 45.5, 55.5. Bars touch.]
Class         Freq.
15.5 – 25.5   3
25.5 – 35.5   5
35.5 – 45.5   2
Numerical Data Properties
Thinking Challenge: "... employees cite low pay -- most workers earn only $20,000." "... President claims average pay is $70,000!" [Figure: salary levels of $400,000, $70,000, $50,000, $30,000, and $20,000.] There are 11 total employees; total salaries are $770,000. The mode is $20,000 (Union argument). The median is $30,000. The mean is $70,000 (President’s argument). Different measures are used!
Standard Notation: Throughout this chapter, we will be using the following notation, which I will introduce now.
Measure              Sample      Population
Mean                 $\bar{X}$   $\mu$
Standard Deviation   $S$         $\sigma$
Variance             $S^2$       $\sigma^2$
Size                 $n$         $N$
Numerical Data Properties: Central Tendency (Location): concerned with where values are concentrated. Variation (Dispersion): concerned with the extent to which values vary. Shape: concerned with the extent to which values are symmetrically distributed.
Numerical Data Properties & Measures: Central Tendency (Mean, Median, Mode); Variation (Range, Interquartile Range, Variance, Standard Deviation); Relative Standing (Percentiles, Z–scores)
Central Tendency
Mean: Measure of central tendency. Most common measure. Acts as ‘balance point’. Affected by extreme values (‘outliers’). Formula (sample mean): $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$
Mean Example: Raw Data: 10.3, 4.9, 8.9, 11.7, 6.3, 7.7. $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + X_3 + X_4 + X_5 + X_6}{6} = \frac{10.3 + 4.9 + 8.9 + 11.7 + 6.3 + 7.7}{6} = 8.30$
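Sketch (not part of the original slides): the sample mean is the sum of the observations divided by their count. The function name `sample_mean` is an illustrative choice.

```python
# Raw data from the mean example.
data = [10.3, 4.9, 8.9, 11.7, 6.3, 7.7]


def sample_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


mean = sample_mean(data)
print(f"{mean:.2f}")
```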
Median: Measure of central tendency. Middle value in ordered sequence: if n is odd, the middle value of the sequence; if n is even, the average of the 2 middle values. Position of median in sequence: Positioning Point $= \frac{n + 1}{2}$. Not affected by extreme values.
Median Example (Odd-Sized Sample): Raw Data: 24.1, 22.6, 21.5, 23.7, 22.6. Ordered: 21.5, 22.6, 22.6, 23.7, 24.1 (positions 1 to 5). Positioning Point $= \frac{n + 1}{2} = \frac{5 + 1}{2} = 3$. Median $= 22.6$
Median Example (Even-Sized Sample): Raw Data: 10.3, 4.9, 8.9, 11.7, 6.3, 7.7. Ordered: 4.9, 6.3, 7.7, 8.9, 10.3, 11.7 (positions 1 to 6). Positioning Point $= \frac{n + 1}{2} = \frac{6 + 1}{2} = 3.5$. Median $= \frac{7.7 + 8.9}{2} = 8.30$
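Sketch (not part of the original slides): the positioning point (n + 1)/2 is a 1-based position, so it has to be shifted by one for Python's 0-based lists. The function name `median_by_position` is an illustrative choice.

```python
def median_by_position(values):
    """Median via the positioning point (n + 1) / 2 on the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2  # 1-based position in the ordered sequence
    if n % 2 == 1:
        # Odd n: the position is a whole number, take that value.
        return ordered[int(position) - 1]
    # Even n: the position ends in .5, average the two neighbouring values.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2


odd_sample = [24.1, 22.6, 21.5, 23.7, 22.6]
even_sample = [10.3, 4.9, 8.9, 11.7, 6.3, 7.7]

print(median_by_position(odd_sample))
print(f"{median_by_position(even_sample):.2f}")
```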
Mode: Measure of central tendency. Value that occurs most often. Not affected by extreme values. There may be no mode or several modes. May be used for quantitative or qualitative data.
Mode Example: No Mode: Raw Data: 10.3, 4.9, 8.9, 11.7, 6.3, 7.7. One Mode: Raw Data: 6.3, 4.9, 8.9, 6.3, 4.9, 4.9 (mode is 4.9). More Than 1 Mode: Raw Data: 21, 28, 28, 41, 43, 43 (modes are 28 and 43).
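Sketch (not part of the original slides): counting occurrences covers all three cases. The function name `modes` is illustrative, and returning an empty list when no value repeats is a convention chosen here to represent "no mode".

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); an empty list means no mode."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        # Every value occurs once, so no value occurs "most often".
        return []
    return sorted(value for value, count in counts.items() if count == highest)


print(modes([10.3, 4.9, 8.9, 11.7, 6.3, 7.7]))  # no mode
print(modes([6.3, 4.9, 8.9, 6.3, 4.9, 4.9]))    # one mode
print(modes([21, 28, 28, 41, 43, 43]))          # more than one mode
```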
Thinking Challenge You’re a financial analyst for Prudential-Bache Securities. You have collected the following closing stock prices of new stock issues: 17, 16, 21, 18, 13, 16, 12, 11. Describe the stock prices in terms of central tendency. This is the data from problem 3.54 in BL5ed. Give the class 10-15 minutes to compute before showing the answer.
Central Tendency Solution* (Mean): $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_8}{8} = \frac{17 + 16 + 21 + 18 + 13 + 16 + 12 + 11}{8} = 15.5$
Central Tendency Solution* (Median): Raw Data: 17, 16, 21, 18, 13, 16, 12, 11. Ordered: 11, 12, 13, 16, 16, 17, 18, 21 (positions 1 to 8). Positioning Point $= \frac{n + 1}{2} = \frac{8 + 1}{2} = 4.5$. Median $= \frac{16 + 16}{2} = 16$
Central Tendency Solution* (Mode): Raw Data: 17, 16, 21, 18, 13, 16, 12, 11. Mode = 16
Summary of Central Tendency Measures
Measure   Formula                Description
Mean      $\sum X_i / n$         Balance point
Median    Position $(n + 1)/2$   Middle value when ordered
Mode      none                   Most frequent
Shape
Shape: Describes how data are distributed; concerned with the extent to which values are symmetrically distributed. Measure of shape: skew (symmetry). Skew: the extent to which a distribution is symmetric or has a tail. Values are 0 for a normal distribution; negative values indicate a negative (left) skew. Kurtosis: the extent to which a distribution is peaked (flatter or taller). For example, a distribution could be more peaked than a normal distribution and still be ‘bell-shaped’; negative values indicate a distribution less peaked than a normal distribution. Left-Skewed: Mean < Median. Symmetric: Mean = Median. Right-Skewed: Median < Mean.
Variation
Range: Measure of dispersion. Difference between largest and smallest observations: Range = Xlargest – Xsmallest. Ignores how data are distributed. [Figure: two data sets on an axis from 7 to 10 with different distributions; in both, Range = 10 – 7 = 3.]
Variance & Standard Deviation: 1. Measures of dispersion. 2. Most common measures. 3. Consider how data are distributed. 4. Show variation about the mean ($\bar{X}$ or μ). [Figure: data plotted on an axis from 4 to 12, with $\bar{X} = 8.3$ marked.]
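Sample Variance Formula: $S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \cdots + (X_n - \bar{X})^2}{n - 1}$. Note the n – 1 in the denominator! (Use N if Population Variance.)
Sample Standard Deviation Formula: $S = \sqrt{S^2} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} = \sqrt{\frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \cdots + (X_n - \bar{X})^2}{n - 1}}$
Variance Example: Raw Data: 10.3, 4.9, 8.9, 11.7, 6.3, 7.7. $S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$ where $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = 8.3$. $S^2 = \frac{(10.3 - 8.3)^2 + (4.9 - 8.3)^2 + \cdots + (7.7 - 8.3)^2}{6 - 1} = 6.368$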
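Sketch (not part of the original slides): the sample variance formula written out step by step, with n - 1 in the denominator. The function name `sample_variance` is an illustrative choice.

```python
# Raw data from the variance example.
data = [10.3, 4.9, 8.9, 11.7, 6.3, 7.7]


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)


variance = sample_variance(data)
std_dev = variance ** 0.5  # sample standard deviation is the square root

print(f"S^2 = {variance:.3f}")
print(f"S   = {std_dev:.3f}")
```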
Thinking Challenge You’re a financial analyst for Prudential-Bache Securities. You have collected the following closing stock prices of new stock issues: 17, 16, 21, 18, 13, 16, 12, 11. What are the variance and standard deviation of the stock prices? This is the data from problem 3.54 in BL5ed. Give the class 10-15 minutes to compute before showing the answer.
Variation Solution* (Sample Variance): Raw Data: 17, 16, 21, 18, 13, 16, 12, 11. $S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$ where $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = 15.5$. $S^2 = \frac{(17 - 15.5)^2 + (16 - 15.5)^2 + \cdots + (11 - 15.5)^2}{8 - 1} = 11.14$
Variation Solution* (Sample Standard Deviation): $S = \sqrt{S^2} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} = \sqrt{11.14} = 3.34$
Summary of Variation Measures
Measure                           Formula                                                     Description
Range                             $X_{largest} - X_{smallest}$                                Total spread
Standard Deviation (Sample)       $\sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$     Dispersion about sample mean
Standard Deviation (Population)   $\sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$             Dispersion about population mean
Variance (Sample)                 $\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$            Squared dispersion about sample mean
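Sketch (not part of the original slides): Python's standard `statistics` module gives a quick cross-check of the hand calculation. `statistics.variance` and `statistics.stdev` use the n - 1 denominator; `statistics.pvariance` divides by N and is shown only to illustrate the difference. The variable names are illustrative.

```python
import statistics

# Closing stock prices from the Thinking Challenge.
prices = [17, 16, 21, 18, 13, 16, 12, 11]

mean = statistics.mean(prices)
sample_var = statistics.variance(prices)       # divides by n - 1
sample_std = statistics.stdev(prices)          # square root of the above
population_var = statistics.pvariance(prices)  # divides by N, for contrast

print(f"mean = {mean}")
print(f"S^2  = {sample_var:.2f}")
print(f"S    = {sample_std:.2f}")
print(f"variance with N in the denominator = {population_var:.2f}")
```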
Interpreting Standard Deviation
Interpreting Standard Deviation: Chebyshev’s Theorem: Applies to any shape data set. No useful information about the fraction of data in the interval $\bar{x} - s$ to $\bar{x} + s$. At least 3/4 of the data lie in the interval $\bar{x} - 2s$ to $\bar{x} + 2s$. At least 8/9 of the data lie in the interval $\bar{x} - 3s$ to $\bar{x} + 3s$. In general, for $k > 1$, at least $1 - 1/k^2$ of the data lie in the interval $\bar{x} - ks$ to $\bar{x} + ks$.
Interpreting Standard Deviation: Chebyshev’s Theorem: [Figure: distribution with three nested intervals about the mean, labeled "No useful information", "At least 3/4 of the data", and "At least 8/9 of the data".]
Chebyshev’s Theorem Example Previously we found the mean closing stock price of new stock issues is 15.5 and the standard deviation is 3.34. Use this information to form an interval that will contain at least 75% of the closing stock prices of new stock issues.
Chebyshev’s Theorem Example: At least 75% of the closing stock prices of new stock issues will lie within 2 standard deviations of the mean. $\bar{x} = 15.5$, $s = 3.34$. $(\bar{x} - 2s, \bar{x} + 2s) = (15.5 - 2 \cdot 3.34, 15.5 + 2 \cdot 3.34) = (8.82, 22.18)$
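Sketch (not part of the original slides): the bound 1 - 1/k^2 and the matching interval can be computed together, and the bound can then be compared with the share of the eight stock prices that actually falls inside. The function name `chebyshev_interval` is an illustrative choice.

```python
def chebyshev_interval(mean, std, k):
    """Return ((low, high), minimum fraction) for k standard deviations, k > 1."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem gives no useful bound for k <= 1")
    min_fraction = 1 - 1 / k ** 2
    return (mean - k * std, mean + k * std), min_fraction


# Mean and standard deviation found earlier for the stock prices.
interval, min_fraction = chebyshev_interval(15.5, 3.34, 2)

# Observed share of the sample inside the interval, for comparison.
prices = [17, 16, 21, 18, 13, 16, 12, 11]
low, high = interval
observed = sum(low <= p <= high for p in prices) / len(prices)

print(f"interval = ({low:.2f}, {high:.2f})")
print(f"guaranteed at least {min_fraction:.0%}, observed {observed:.0%}")
```

The theorem only guarantees a minimum, so the observed share can be higher than the bound.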
Interpreting Standard Deviation: Empirical Rule: Applies to data sets that are mound shaped and symmetric. Approximately 68% of the measurements lie in the interval μ – σ to μ + σ. Approximately 95% of the measurements lie in the interval μ – 2σ to μ + 2σ. Approximately 99.7% of the measurements lie in the interval μ – 3σ to μ + 3σ.
Interpreting Standard Deviation: Empirical Rule: [Figure: mound-shaped curve with the horizontal axis marked at μ – 3σ, μ – 2σ, μ – σ, μ, μ + σ, μ + 2σ, μ + 3σ, showing approximately 68%, 95%, and 99.7% of the measurements within 1, 2, and 3 standard deviations of the mean.]
Empirical Rule Example: Previously we found the mean closing stock price of new stock issues is 15.5 and the standard deviation is 3.34. If we can assume the data are symmetric and mound shaped, calculate the percentage of the data that lie within the intervals $\bar{x} \pm s$, $\bar{x} \pm 2s$, $\bar{x} \pm 3s$.
Empirical Rule Example: According to the Empirical Rule, approximately 68% of the data will lie in the interval $(\bar{x} - s, \bar{x} + s) = (15.5 - 3.34, 15.5 + 3.34) = (12.16, 18.84)$. Approximately 95% of the data will lie in the interval $(\bar{x} - 2s, \bar{x} + 2s) = (15.5 - 2 \cdot 3.34, 15.5 + 2 \cdot 3.34) = (8.82, 22.18)$. Approximately 99.7% of the data will lie in the interval $(\bar{x} - 3s, \bar{x} + 3s) = (15.5 - 3 \cdot 3.34, 15.5 + 3 \cdot 3.34) = (5.48, 25.52)$.
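Sketch (not part of the original slides): the three Empirical Rule intervals follow from one loop over k = 1, 2, 3. The table `EMPIRICAL_COVERAGE` holds the approximate percentages from the slide; the names are illustrative, and the rule is only appropriate when the data are mound shaped and symmetric.

```python
# Approximate coverage (%) within k standard deviations, per the Empirical Rule.
EMPIRICAL_COVERAGE = {1: 68.0, 2: 95.0, 3: 99.7}


def empirical_intervals(mean, std):
    """Map k to (low, high) for k = 1, 2, 3 standard deviations about the mean."""
    return {k: (mean - k * std, mean + k * std) for k in EMPIRICAL_COVERAGE}


# Mean and standard deviation found earlier for the stock prices.
intervals = empirical_intervals(15.5, 3.34)

for k, (low, high) in intervals.items():
    print(f"about {EMPIRICAL_COVERAGE[k]}% in ({low:.2f}, {high:.2f})")
```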
Numerical Measures of Relative Standing
Numerical Measures of Relative Standing: Percentiles: Describes the relative location of a measurement compared to the rest of the data. The pth percentile is a number such that p% of the data falls below it and (100 – p)% falls above it. Median = 50th percentile.
Percentile Example You scored 560 on the GMAT exam. This score puts you in the 58th percentile. What percentage of test takers scored lower than you did? What percentage of test takers scored higher than you did?
Percentile Example What percentage of test takers scored lower than you did? 58% of test takers scored lower than 560. What percentage of test takers scored higher than you did? (100 – 58)% = 42% of test takers scored higher than 560.
Numerical Measures of Relative Standing: Z–Scores: Describes the relative location of a measurement compared to the rest of the data. Measures the number of standard deviations away from the mean a data value is located. Sample z–score: $z = \frac{x - \bar{x}}{s}$. Population z–score: $z = \frac{x - \mu}{\sigma}$.
Z–Score Example The mean time to assemble a product is 22.5 minutes with a standard deviation of 2.5 minutes. Find the z–score for an item that took 20 minutes to assemble. Find the z–score for an item that took 27.5 minutes to assemble.
Z–Score Example: With μ = 22.5 and σ = 2.5: for x = 20, $z = \frac{x - \mu}{\sigma} = \frac{20 - 22.5}{2.5} = -1.0$; for x = 27.5, $z = \frac{27.5 - 22.5}{2.5} = 2.0$
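Sketch (not part of the original slides): a z-score is the distance from the mean measured in standard deviations. The function name `z_score` is an illustrative choice.

```python
def z_score(x, mean, std):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std


# Assembly-time example: mean 22.5 minutes, standard deviation 2.5 minutes.
z_fast = z_score(20, 22.5, 2.5)
z_slow = z_score(27.5, 22.5, 2.5)

print(z_fast, z_slow)
```

A negative z-score means the assembly time was below the mean; a positive one means it was above.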
Quartiles & Box Plots
Quartiles: 1. Measure of noncentral tendency. 2. Split ordered data into 4 quarters of 25% each, at Q1, Q2, and Q3. 3. Position of i-th quartile: Positioning Point of $Q_i = \frac{i(n + 1)}{4}$
Quartile (Q1) Example: Raw Data: 10.3, 4.9, 8.9, 11.7, 6.3, 7.7. Ordered: 4.9, 6.3, 7.7, 8.9, 10.3, 11.7 (positions 1 to 6). Q1 Position $= \frac{1(n + 1)}{4} = \frac{1(6 + 1)}{4} = 1.75 \approx 2$. $Q_1 = 6.3$
Quartile (Q2) Example: Raw Data: 10.3, 4.9, 8.9, 11.7, 6.3, 7.7. Ordered: 4.9, 6.3, 7.7, 8.9, 10.3, 11.7 (positions 1 to 6). Q2 Position $= \frac{2(n + 1)}{4} = \frac{2(6 + 1)}{4} = 3.5$. $Q_2 = \frac{7.7 + 8.9}{2} = 8.3$
Quartile (Q3) Example: Raw Data: 10.3, 4.9, 8.9, 11.7, 6.3, 7.7. Ordered: 4.9, 6.3, 7.7, 8.9, 10.3, 11.7 (positions 1 to 6). Q3 Position $= \frac{3(n + 1)}{4} = \frac{3(6 + 1)}{4} = 5.25 \approx 5$. $Q_3 = 10.3$
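Sketch (not part of the original slides): the three examples suggest a rule for turning the positioning point i(n + 1)/4 into a value: use the position directly if it is whole, average the two neighbouring values if it ends in .5, and otherwise round to the nearest whole position. That rule is inferred from the examples and is an assumption; statistical software often interpolates instead and can give slightly different quartiles. The function name `quartile` is illustrative.

```python
def quartile(values, i):
    """i-th quartile (i = 1, 2, 3) using the positioning point i * (n + 1) / 4."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least 2 observations")
    position = i * (n + 1) / 4      # 1-based position in the ordered data
    whole = int(position)
    fraction = position - whole     # always 0, 0.25, 0.5 or 0.75
    if fraction == 0:
        return ordered[whole - 1]
    if fraction == 0.5:
        # Halfway between two positions: average the neighbouring values.
        return (ordered[whole - 1] + ordered[whole]) / 2
    # Otherwise round to the nearest whole position (assumed rule).
    return ordered[round(position) - 1]


data = [10.3, 4.9, 8.9, 11.7, 6.3, 7.7]
q1, q2, q3 = (quartile(data, i) for i in (1, 2, 3))
print(q1, f"{q2:.1f}", q3)
```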
Numerical Data Properties & Measures: Central Tendency (Mean, Median, Mode); Variation (Range, Interquartile Range, Variance, Standard Deviation); Shape (Skew)
Interquartile Range: 1. Measure of dispersion. 2. Also called midspread. 3. Difference between third and first quartiles: Interquartile Range = Q3 – Q1. 4. Spread in middle 50%. 5. Not affected by extreme values.
Thinking Challenge You’re a financial analyst for Prudential-Bache Securities. You have collected the following closing stock prices of new stock issues: 17, 16, 21, 18, 13, 16, 12, 11. What are the quartiles, Q1 and Q3, and the interquartile range? This is the data from problem 3.54 in BL5ed. Give the class 10-15 minutes to compute before showing the answer.
Quartile Solution* (Q1): Raw Data: 17, 16, 21, 18, 13, 16, 12, 11. Ordered: 11, 12, 13, 16, 16, 17, 18, 21 (positions 1 to 8). Q1 Position $= \frac{1(n + 1)}{4} = \frac{1(8 + 1)}{4} = 2.25 \approx 2$. $Q_1 = 12$
Quartile Solution* (Q3): Raw Data: 17, 16, 21, 18, 13, 16, 12, 11. Ordered: 11, 12, 13, 16, 16, 17, 18, 21 (positions 1 to 8). Q3 Position $= \frac{3(n + 1)}{4} = \frac{3(8 + 1)}{4} = 6.75 \approx 7$. $Q_3 = 18$
Interquartile Range Solution*: Raw Data: 17, 16, 21, 18, 13, 16, 12, 11. Ordered: 11, 12, 13, 16, 16, 17, 18, 21 (positions 1 to 8). Interquartile Range $= Q_3 - Q_1 = 18 - 12 = 6$
Box Plot: Graphical display of data using the 5-number summary: Xsmallest, Q1, Median, Q3, Xlargest. [Figure: box plot drawn on an axis from 4 to 12.]
Shape & Box Plot: [Figure: three box plots labeled Left-Skewed, Symmetric, and Right-Skewed, each marking Q1, Median, and Q3.]
Graphing Bivariate Relationships
Graphing Bivariate Relationships: Describes a relationship between two quantitative variables. Plot the data in a scattergram. [Figure: three scattergrams of y against x, showing a positive relationship, a negative relationship, and no relationship.]
Scattergram Example: You’re a marketing analyst for Hasbro Toys. You gather the following data. Draw a scattergram of the data.
Ad $ (x)   Sales (Units) (y)
1          1
2          1
3          2
4          2
5          4
Scattergram Example: [Figure: scattergram of Sales (vertical axis, 1 to 4) against Advertising (horizontal axis, 1 to 5).]
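Sketch (not part of the original slides): a text-only scattergram of the advertising data, with `*` marking an observed (x, y) pair and `.` an empty grid cell. The function name `scattergram` and the character-grid approach are illustrative choices that assume small positive integer coordinates; a plotting library would normally be used instead.

```python
# Advertising dollars (x) and unit sales (y) from the example table.
ads = [1, 2, 3, 4, 5]
sales = [1, 1, 2, 2, 4]


def scattergram(xs, ys):
    """Render (x, y) pairs on a character grid; assumes positive integers."""
    points = set(zip(xs, ys))
    columns = range(1, max(xs) + 1)
    lines = []
    # Rows run from the largest y at the top down to 1.
    for y in range(max(ys), 0, -1):
        row = "".join(" *" if (x, y) in points else " ." for x in columns)
        lines.append(f"{y} |{row}")
    lines.append("  +" + "--" * max(xs))
    lines.append("   " + "".join(f" {x}" for x in columns))
    return "\n".join(lines)


plot = scattergram(ads, sales)
print(plot)
```

The points rise from lower left to upper right, which is the pattern of a positive relationship.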
Time Series Plot
Time Series Plot: Used to graphically display data produced over time. Shows trends and changes in the data over time. Time is recorded on the horizontal axis and measurements on the vertical axis. Points are connected by straight lines.
Time Series Plot Example: The following data show the average retail price of regular gasoline in New York City for 8 weeks in 2006. Draw a time series plot for these data.
Date           Average Price
Oct 16, 2006   $2.219
Oct 23, 2006   $2.173
Oct 30, 2006   $2.177
Nov 6, 2006    $2.158
Nov 13, 2006   $2.185
Nov 20, 2006   $2.208
Nov 27, 2006   $2.236
Dec 4, 2006    $2.298
Time Series Plot Example: [Figure: time series plot of Price (vertical axis) against Date (horizontal axis).]
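Sketch (not part of the original slides): before drawing the plot, the week-to-week changes already show the pattern the connected line segments would make. The variable names are illustrative; the prices are the eight weekly values from the example.

```python
# Weekly average price of regular gasoline, in time order.
prices = [
    ("Oct 16, 2006", 2.219),
    ("Oct 23, 2006", 2.173),
    ("Oct 30, 2006", 2.177),
    ("Nov 6, 2006", 2.158),
    ("Nov 13, 2006", 2.185),
    ("Nov 20, 2006", 2.208),
    ("Nov 27, 2006", 2.236),
    ("Dec 4, 2006", 2.298),
]

# Change from each week to the next; each pair corresponds to one line segment.
changes = [
    (later_date, round(later - earlier, 3))
    for (_, earlier), (later_date, later) in zip(prices, prices[1:])
]

lowest = min(prices, key=lambda item: item[1])
highest = max(prices, key=lambda item: item[1])

for date, change in changes:
    print(f"{date:<13}{change:+.3f}")
print("lowest:", lowest, "highest:", highest)
```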
Distorting the Truth with Descriptive Techniques
Errors in Presenting Data: Using ‘chart junk’. No relative basis in comparing data batches. Compressing the vertical axis. No zero point on the vertical axis.
‘Chart Junk’: [Figure: two panels, Bad Presentation and Good Presentation, both showing the minimum wage: 1960: $1.00, 1970: $1.60, 1980: $3.10, 1990: $3.80. Vertical axis in $, horizontal axis 1960 to 1990.]
No Relative Basis: [Figure: A’s by Class for FR, SO, JR, SR. The Bad Presentation panel plots frequency (axis 100 to 300); the Good Presentation panel plots percent (axis 0% to 30%).]
Compressing Vertical Axis: [Figure: Quarterly Sales ($) for Q1 to Q4. The Bad Presentation panel uses a vertical axis marked 100 and 200; the Good Presentation panel uses a vertical axis marked 25 and 50.]
No Zero Point on Vertical Axis: [Figure: Monthly Sales ($) for J, M, M, J, S, N. The Bad Presentation panel uses a vertical axis marked 36, 39, 42, 45; the Good Presentation panel uses a vertical axis marked 20, 40, 60.]
Conclusion: Described Qualitative Data Graphically; Described Numerical Data Graphically; Explained Numerical Data Properties; Described Summary Measures; Analyzed Numerical Data Using Summary Measures