Presentation is loading. Please wait.

Presentation is loading. Please wait.

Descriptive Statistics

Similar presentations


Presentation on theme: "Descriptive Statistics"— Presentation transcript:

1 Descriptive Statistics
Frequency Tables Visual Displays Measures of Center

2 Overview Descriptive Statistics Inferential Statistics
summarize or describe the important characteristics of a known set of population data Inferential Statistics use sample data to make inferences (or generalizations) about a population This chapter will focus on descriptive statistics. Using descriptive statistics and the information from Chapter 3, 4, and 5, we will begin the study of inferential statistics.

3 Important Characteristics of Data
1. Center: A representative or average value that indicates where the middle of the data set is located 2. Variation: A measure of the amount that the values vary among themselves 3. Distribution: The nature or shape of the distribution of data (such as bell-shaped, uniform, or skewed) 4. Outliers: Sample values that lie very far away from the vast majority of other sample values 5. Time: Changing characteristics of the data over time Most important characteristics necessary to describe, explore, and compare data sets. page 34 of text

4 Summarizing Data With Frequency Tables
lists classes (or categories) of values, along with frequencies (or counts) of the number of values that fall into each class page 35 of text

5 Qwerty Keyboard Word Ratings
Table 2-1 Qwerty Keyboard Word Ratings 1 7 Discussion of what these rating values represent is found on page 33 (the Chapter Problem for Chapter 2).

6 Frequency Table of Qwerty Word Ratings
Rating Frequency Final result of a frequency table on the Qwerty data given in previous slide. Next few slides will develop definitions and steps for developing the frequency table.

7 Frequency Table Definitions

8 Lower Class Limits Lower Class Limits
are the smallest numbers that can actually belong to different classes Rating Frequency Lower Class Limits

9 Upper Class Limits Upper Class Limits
are the largest numbers that can actually belong to different classes Rating Frequency Upper Class Limits

10 number separating classes
Class Boundaries number separating classes Rating Frequency - 0.5 2.5 5.5 8.5 11.5 14.5 Class Boundaries

11 midpoints of the classes
Class Midpoints midpoints of the classes Rating Frequency Class Midpoints Being able to identify the midpoints of each class will be important when determining the mean and standard deviation of a frequency table.

12 Class Width Class Width
is the difference between two consecutive lower class limits or two consecutive class boundaries Rating Frequency Class Width Class widths ideally should be the same between each class. However, open-ended first or last classes (e.g., 65 years or older) are sometimes necessary to keep from have a large number of classes with a frequency of 0.

13 Guidelines For Frequency Tables
1. Be sure that the classes are mutually exclusive. 2. Include all classes, even if the frequency is zero. 3. Try to use the same width for all classes. 4. Select convenient numbers for class limits. 5. Use between 5 and 20 classes. 6. The sum of the class frequencies must equal the number of original data values. page 36 of text #1: Classes should not overlap in order to determine which class a value belongs to. #2: A common mistake of students developing frequency tables. #3: Open-ended classes are sometimes necessary #4: Round up to use fewer decimal places or use relevant numbers #5: Less than 5, the table will be too concise; more than 20, the table will be too large or unwieldy. #6: One test to make sure all data has been included.

14 Constructing A Frequency Table class width  round up of
1. Decide on the number of classes . 2. Determine the class width by dividing the range by the number of classes (range = highest score - lowest score) and round up. range class width  round up of number of classes 3. Select for the first lower limit either the lowest score or a convenient value slightly less than the lowest score. 4. Add the class width to the starting point to get the second lower class limit, add the width to the second lower limit to get the third, and so on. 5. List the lower class limits in a vertical column and enter the upper class limits. 6. Represent each score by a tally mark in the appropriate class. Total tally marks to find the total frequency for each class. This is the same as the flow chart on the next slide.

15 Inputting a Frequency Table into the TI83 Calculator
Push Stat Select Edit Input Class Marks in List 1 Input Frequencies in List 2

16 Relative Frequency Table
class frequency sum of all frequencies

17 Relative Frequency Table
% % % % % Rating Relative Frequency Rating Frequency 20/52 = 38.5% 14/52 = 26.9% etc. page 38 of text Emphasis on the relative frequency table will assist in development the concept of probability distributions - the use of percentages with classes will relate to the use of probabilities with random variables. Total frequency = 52

18 Cumulative Frequency Table
Less than 3 20 Less than 6 34 Less than 9 49 Less than Less than Rating Cumulative Frequency Rating Frequency Cumulative Frequencies Some students will need a detailed explanation of how the classes “Less than 3”, “Less than 6”, etc. are identified and the frequency for each of the classes are determined.

19 Frequency Tables Relative Cumulative Frequency Frequency Rating Rating
Less than Less than Less than Less than Less than % % % % % A comparison of all three types of frequency tables developed from the same QWERTY data.

20 depict the nature or shape of the data distribution
Pictures of Data depict the nature or shape of the data distribution Histogram a bar graph in which the horizontal scale represents classes and the vertical scale represents frequencies The heights of the bars correspond to the frequency values, and the bars are drawn adjacent to each other (without gaps). page 42 of text

21 Histogram of Qwerty Word Ratings
Rating Frequency Notice the use of class boundaries to label the left and right sides of each rectangle, and the need for the -0.5 and 14.5 boundaries. It is often more practical to use class midpoint values instead of the class boundaries. A common rule of thumb is that the vertical height of the histogram should be about three-fourths of the total width. Both axes should be clearly labeled.

22 Relative Frequency Histogram
of Qwerty Word Ratings % % % % % Rating Relative Frequency Instead of frequency for the vertical axis, percentages are used.

23 Histogram and Relative Frequency Histogram
A histogram and relative frequency histogram will have the same shape.

24 Viewing a Histogram Using the TI83 Calculator
First input class marks and frequencies in L1 & L2 2nd y= (Statplot) Select 1 Select “on” Select histogram type (top row, 3rd graph) X list = 2nd 1 (L1) Freq = 2nd 2 (L2) Zoom 9 (ZoomStat)

25 Frequency Polygon page 44 of text
The points are at the same heights as the bars of the histogram and are connected. The line segments are extended to the right and left so that the graph begins and ends on the x-axis. Notice the use of midpoints on the horizontal axis.

26 Viewing a Frequency Polygon Using the TI83 Calculator
First input class marks and frequencies in L1 & L2 2nd y= (Statplot) Select 1 Select “on” Select polygon type (top row, 2nd graph) X list = 2nd 1 (L1) Freq = 2nd 2 (L2) Zoom 9 (ZoomStat)

27 Ogive The points are the frequency values of each class in a cumulative frequency table. Notice the use of boundaries on the horizontal scale and that the graph begins with the lower boundary of the first class and ends with the upper boundary of the last class. Ogives are useful for determining the number of values less than some particular class boundary.

28 Dot Plot Dots represent an actual data value.
Dots representing the same value are stacked. Similar to a histogram in that the distribution of the data is shown.

29 Stem-and Leaf Plot Raw Data (Test Grades) 67 72 85 75 89
Leaves Raw Data (Test Grades) 6 7 8 9 10 7 2 5 0 9 page 45 of text Data that has at least two digits better exemplifies the value of a stem-leaf plot. Cover up the actual data and ask students to read the data from the stem-leaf plot. Data should be put in increasing order. A stem-leaf plot builds ‘side-ways’ histogram. The most common mistake students make when developing stem-leaf plots is leaving out a ‘stem’ if there is no data in that class. The other mistake is, if there is no data in a class, putting a ‘0’ next to the stem - which, of course, would indicate a data value with a right digit of ‘0’. Students confuse the use of ‘0’ indicating a frequency in a frequency table with a ‘0’ in a stem-leaf plot indicating an actual data value. (See example on page 47 middle of page.)

30 Pareto Chart Accidental Deaths by Type Frequency page 46 of text
5,000 10,000 15,000 20,000 25,000 30,000 35,000 40,000 45,000 Accidental Deaths by Type Frequency page 46 of text A Pareto chart is used for qualitative data with the bars arranged in order according to decreasing frequencies. Falls Fire Poison Drowning Firearms Motor Vehicle Ingestion of food or object

31 Pie Chart Accidental Deaths by Type Firearms (1400. 1.9%)
( %) Ingestion of food or object ( % Fire ( %) Motor vehicle (43, %) Drowning ( %) Poison ( %) Same information as the previous Pareto chart. A piece of the pie chart is a percentage of the 360 degrees in the circle. Falls (12, %) Accidental Deaths by Type

32 Scatter Diagram TAR NICOTINE
20 TAR 10 A plot of paired (x,y) data with the horizontal x-axis and the vertical y-axis. Chapter 9 will discuss scatter plots again with the topic of correlation. Point out the relationship that exists between the nicotine and tar - as the nicotine value increases, so does the value of tar. page 47 of text 0.0 0.5 1.0 1.5 NICOTINE

33 Other Graphs Boxplots Pictographs Pattern of data over time

34 Measures of Center center or middle a value at the of a data set
page 55 of text

35 Definitions Mean (Arithmetic Mean) AVERAGE
the number obtained by adding the values and dividing the total by the number of values In this text, the arithmetic mean will be referred to as the MEAN.

36 Notation  denotes the addition of a set of values
x is the variable usually used to represent the individual data values n represents the number of data values in a sample N represents the number of data values in a population is pronounced ‘sigma’ (Greek upper case ‘S’) indicates ‘summation’ of values

37 Calculators can calculate the mean of data
Notation x is pronounced ‘x-bar’ and denotes the mean of a set of sample values x = n  x µ is pronounced ‘mu’ and denotes the mean of all values in a population N µ =  x The mean is sensitive to every value, so one exceptional value can affect the mean dramatically. The median (next slide) overcomes that disadvantage. Calculators can calculate the mean of data

38 Finding a Mean Using the TI83 Calculator
Push Stat Select Edit Enter data into L1 2nd mode (quit) >Calc Select 1 (1-VarStats) 2nd 1 (L1) The first output item is the mean

39 Definitions Median often denoted by x (pronounced ‘x-tilde’)
the middle value when the original data values are arranged in order of increasing (or decreasing) magnitude often denoted by x (pronounced ‘x-tilde’) is not affected by an extreme value ~

40 (in order - odd number of values) exact middle MEDIAN is 6.44
(in order odd number of values) exact middle MEDIAN is 6.44 Examples on page 57 of text

41 (even number of values)
no exact middle -- shared by two numbers (even number of values) MEDIAN is 5.02 2 Examples on page 57 of text

42 Finding a Median Using the TI83 Calculator
Push Stat Select Edit Enter data into L1 2nd mode (quit) >Calc Select 1 (1-VarStats) 2nd 1 (L1) Arrow down to “Med” to see the value for median

43 Definitions Mode denoted by M Bimodal Multimodal No Mode
the score that occurs most frequently Bimodal Multimodal No Mode denoted by M the only measure of central tendency that can be used with nominal data page 58 of text

44 Examples a b c Mode is 5 Bimodal 2 and 6 No Mode page 58 of text Name the two values if the set is bimodal

45 highest score + lowest score
Definitions Midrange the value midway between the highest and lowest values in the original data set Midrange is seldom used because it is too sensitive to extremes. (See Figure 2-12, page 59) highest score + lowest score Midrange = 2

46 Finding a Midrange Using the TI83 Calculator
Push Stat Select Edit Enter data into L1 2nd mode (quit) >Calc Select 1 (1-VarStats) 2nd 1 (L1) Arrow down to see the “min” and “max” values for the midrange calculation

47 Round-off Rule for Measures of Center
Carry one more decimal place than is present in the original set of values Round only the final answer, not the intermediate values

48  (f • x) x =  f x = class midpoint  f = n
Mean from a Frequency Table use class midpoint of classes for variable x x = f  (f • x) x = class midpoint f = frequency Calculators can easily find the mean of frequency tables, using the class midpoints and the frequencies. Mean values found from frequency tables will be an approximation of the mean value found using the actual data.  f = n

49 Finding a Grouped Mean Using the TI83 Calculator
Push Stat Select Edit Enter class marks into L1 and frequencies into L2 2nd mode (quit) >Calc Select 1 (1-VarStats) 2nd 1, 2nd 2 (L1,L2) The first output item is the mean

50 Descriptive Statistics
Measures of Variability Measures of Position

51 Definitions Symmetric Skewed
Data is symmetric if the left half of its histogram is roughly a mirror of its right half. Skewed Data is skewed if it is not symmetric and if it extends more to one side than the other. Data skewed to the left is said to be ‘negatively skewed’ with the mean and median to the left of the mode. Data skewed to the right is said to be ‘positively skewed’ with the mean and media to the right of the mode.

52 Skewness SYMMETRIC SKEWED LEFT SKEWED RIGHT (negatively) (positively)
Mode = Mean = Median SYMMETRIC Data lopsided to the right (or slants down to the right) Mean Mode Mode Mean Median Median SKEWED LEFT (negatively) SKEWED RIGHT (positively)

53 Waiting Times of Bank Customers
at Different Banks in minutes Jefferson Valley Bank Bank of Providence 6.5 4.2 6.6 5.4 6.7 5.8 6.8 6.2 7.1 6.7 7.3 7.7 7.4 7.7 7.7 8.5 7.7 9.3 7.7 10.0 Jefferson Valley Bank Bank of Providence Mean Median Mode Midrange 7.15 7.20 7.7 7.10 7.15 7.20 7.7 7.10

54 Dotplots of Waiting Times
Even though the measures of center are all the same, it is obvious from the dotplots of each group of data that there are some differences in the ‘spread’ (or variation) of the data. page 69 of text

55 Range Measures of Variation value highest lowest value
The range is not a very useful measure of variation as it only uses two values of the data. Other measures of variation (to follow) will more more useful as they are computed by using every data value.

56 Measures of Variation Standard Deviation
a measure of variation of the scores about the mean (average deviation from the mean) Ask students to explain what this definition indicates.

57 Sample Standard Deviation Formula
 (x - x)2 S = n - 1 The definition indicates that one should find the average distance each score is from the mean. The use of n-1 in the denominator is necessary because there are only n-1 independent values - that is, only n-1 values can be assigned any number before the nth value is determined. page 70 of text calculators can compute the sample standard deviation of data

58 Sample Standard Deviation Shortcut Formula
n (x2) - (x)2 s = n (n - 1) This formula is easier to use if computing the standard deviation ‘by hand’ as it does not rely on the use of the mean. Only three values are needed - n, x, and x2. calculators can compute the sample standard deviation of data

59 Population Standard Deviation
 (x - µ) 2  = N  is the lowercase Greek ‘sigma’. Note the division by N(population size), rather than n-1 Most data is a sample, rather than a population; therefore, this formula is not use very often. calculators can compute the population standard deviation of data

60 for Standard Deviation
Symbols for Standard Deviation Sample Population s Sx xn-1 Textbook x xn Textbook Some graphics calculators Some graphics calculators Some non-graphics calculators Some non-graphics calculators Articles in professional journals and reports often use SD for standard deviation and VAR for variance.

61 Finding a Standard Deviation Using the TI83 Calculator
Push Stat Select Edit Enter data into L1 2nd mode (quit) >Calc Select 1 (1-VarStats) 2nd 1 (L1) The 4th output item is the sample s.d. and the 5th output item is the population s.d.

62 standard deviation squared
Measures of Variation Variance standard deviation squared s  } 2 After computing the standard deviation, square the resulting number using a calculator. use square key on calculator Notation 2

63 Variance  (x - x )2  (x - µ)2 n - 1  2 = N s2 = Sample Variance
The formulas for variance are the same as for standard deviation without the square root - remind student, squaring a square root will result in the radicand.  (x - µ)2 N  2 = Population Variance

64 Round-off Rule for measures of variation
Carry one more decimal place than is present in the original set of values. Round only the final answer, never in the middle of a calculation. Same rounding rule as for measures of center. page 75 of text

65 Standard Deviation from a Frequency Table
n [(f • x 2)] -[(f • x)]2 S = n (n - 1) The formula is quite tedious. Encourage students to use their calculators. Remind that the class midpoint must be used as the ‘representative’ score of each class for this computation. Use the class midpoints as the x values Calculators can compute the standard deviation for frequency table

66 Finding a Grouped Standard Deviation Using the TI83 Calculator
Push Stat Select Edit Enter class marks into L1 and frequencies into L2 2nd mode (quit) >Calc Select 1 (1-VarStats) 2nd 1, 2nd 2 (L1, L2) The 4th output item is the sample s.d and the 5th output item is the population s.d

67 Estimation of Standard Deviation highest value - lowest value
Range Rule of Thumb x s x - 2s x (maximum usual value) (minimum usual value) Range  4s or Reminder: range is the highest score minus the lowest score Range 4 highest value - lowest value s  = 4

68 Usual Sample Values minimum  x - 2(s) maximum  x + 2(s)
minimum ‘usual’ value  (mean) (standard deviation) minimum  x - 2(s) maximum ‘usual’ value  (mean) (standard deviation) maximum  x + 2(s) These ideas will be used repeatedly throughout the course.

69 (applies to bell-shaped distributions)
The Empirical Rule (applies to bell-shaped distributions) page 79 of text x

70 (applies to bell-shaped distributions)
The Empirical Rule (applies to bell-shaped distributions) 68% within 1 standard deviation Some student have difficulty understand the idea of ‘within one standard deviation of the mean’. Emphasize that this means the interval from one standard deviation below the mean to one standard deviation above the mean. 34% 34% x - s x x + s

71 (applies to bell-shaped distributions)
The Empirical Rule (applies to bell-shaped distributions) 95% within 2 standard deviations 68% within 1 standard deviation 34% 34% 13.5% 13.5% x - 2s x - s x x + s x + 2s

72 The Empirical Rule x - 3s x - 2s x - s x x + s x + 2s x + 3s
(applies to bell-shaped distributions) 99.7% of data are within 3 standard deviations of the mean 95% within 2 standard deviations 68% within 1 standard deviation These percentages will be verified by the concepts learned in Chapter 5. Emphasize the Empirical Rule is appropriate for data that is in a BELL-SHAPED distribution. 34% 34% 2.4% 2.4% 0.1% 0.1% 13.5% 13.5% x - 3s x - 2s x - s x x + s x + 2s x + 3s

73 Chebyshev’s Theorem applies to distributions of any shape.
the proportion (or fraction) of any set of data   lying within K standard deviations of the mean   is always at least 1 - 1/K2 , where K is any   positive number greater than 1. at least 3/4 (75%) of all values lie within 2  standard deviations of the mean. at least 8/9 (89%) of all values lie within 3  standard deviations of the mean. Emphasize Chebyshev’s Theorem applies to data that is in a distribution of any shape - that is, it is less prescriptive than the Empirical Rule. page 80 of text

74 Measures of Variation Summary
For typical data sets, it is unusual for a score to differ from the mean by more than 2 or 3 standard deviations. This idea will be revisited throughout the study of Elementary Statistics.

75 Measures of Position z Score (or standard score)
the number of standard deviations that a given value x is above or below the mean page 85 of text

76 Round to 2 decimal places
Measures of Position z score Sample Population x - µ x - x z = z = s Round to 2 decimal places

77 Interpreting Z Scores Z Unusual Values Ordinary Values Unusual Values
- 3 This concept of ‘unusual’ values will be revisited several times during the course, especially in Chapter 7 - Hypothesis Testing. page 86 of text - 2 - 1 1 2 3 Z

78 Quartiles, Deciles, Percentiles
Measures of Position Quartiles, Deciles, Percentiles

79 divides ranked scores into four equal parts
Quartiles Q1, Q2, Q3 divides ranked scores into four equal parts 25% 25% 25% 25% Q1 Q2 Q3 (minimum) These 5 numbers will be used in this text’s method for depicting boxplots. (maximum) (median)

80 divides ranked data into ten equal parts
Deciles D1, D2, D3, D4, D5, D6, D7, D8, D9 divides ranked data into ten equal parts 10% D D D D D D D D D9

81 Percentiles 99 Percentiles

82 Fractiles (Quantiles)
Quartiles, Deciles, Percentiles Fractiles (Quantiles) partitions data into approximately equal parts

83 Deciles Quartiles Q1 = P25 D1 = P10 Q2 = P50 D2 = P20 D3 = P30
D9 = P90 Consequently, to find a particular quartile or decile, determine the corresponding percentile and proceed using the percentile procedure. in margin of page 89 of text

84 Viewing Quartiles Using the TI83 Calculator
Push Stat Select Edit Enter data into L1 2nd mode (quit) >Calc Select 1 (1-VarStats) 2nd 1 (L1) Arrow down to view the values for Q2 and Q3 (the median “med” is Q2)

85 Exploratory Data Analysis
the process of using statistical tools (such as graphs, measures of center, and measures of variation) to investigate the data sets in order to understand their important characteristics page 94 of text

86 Outliers a value located very far away from    almost all of the other values an extreme value can have a dramatic effect on the    mean, standard deviation, and on the    scale of the histogram so that the true    nature of the distribution is totally    obscured Outliers can have a dramatic effect on the certain descriptive statistics (mean and standard deviation). It is important to be aware of such data values.

87 Boxplots (Box-and-Whisker Diagram)
Reveals the: center of the data spread of the data distribution of the data presence of outliers Excellent for comparing two or more data sets page 96 of text

88 Boxplots 5 - number summary Minimum first quartile Q1 Median (Q2)
third quartile Q3 Maximum Medians and quartiles are not very sensitive to extreme values. Refer to Section 2-6 for finding the Q1 and Q3 values, and Section 2-4 for finding the median.

89 Boxplots 2 4 6 14 2 4 6 8 10 12 14 Boxplot of Qwerty Word Ratings
It appears that the Qwerty data set has a skewed right distribution page 97 of text 2 4 6 8 10 12 14 Boxplot of Qwerty Word Ratings

90 Viewing Boxplots Using the TI83 Calculator
First input class marks and frequencies in L1 & L2 2nd y= (Statplot) Select 1 Select “on” Select boxplot type (bottom row, 2nd graph) X list = 2nd 1 (L1) Freq = 2nd 2 (L2) Zoom 9 (ZoomStat)

91 Boxplots Bell-Shaped Uniform Skewed

92 Exploring Measures of center: mean, median, and mode
Measures of variation: standard deviation and range Measures of spread & relative location:     minimum values, maximum value, and quartiles Unusual values: outliers Distribution: histograms, stem-leaf plots, and boxplots page 97 of text


Download ppt "Descriptive Statistics"

Similar presentations


Ads by Google