Descriptive Statistics


1 Descriptive Statistics
Lecture 02: Tabular and Graphical Presentation of Data and Measures of Location 2/17/2019

2 Presentation of Qualitative Variables
The simplest way of presenting/summarizing a qualitative variable is by using a frequency table, which shows the frequency of occurrence of each of the different categories. Such a table could also include the relative frequency, which indicates the proportion or percentage of occurrence of each of the categories. The frequency table could then be pictorially represented by a bar graph or a pie diagram.

3 An Example
A manufacturer of jeans has plants in California (CA), Arizona (AZ), and Texas (TX). A sample of 25 pairs of jeans was randomly selected from a computerized database, and the state in which each was produced was recorded. The data are as follows:
CA AZ AZ TX CA CA CA TX TX TX AZ AZ CA AZ TX CA AZ TX TX TX CA AZ AZ CA CA
The raw data are quite uninformative at this stage; they need to be summarized to reveal information.

4 The Frequency Table
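
The frequency table for the jeans sample can be tallied directly from the 25 recorded states. The following is a minimal Python sketch, not the original slide's table; the variable names (`states`, `counts`, `freq_table`) are chosen here for illustration only.

```python
from collections import Counter

# State of manufacture for the 25 sampled pairs of jeans (from the example).
states = ("CA AZ AZ TX CA CA CA TX TX TX AZ AZ CA "
          "AZ TX CA AZ TX TX TX CA AZ AZ CA CA").split()

counts = Counter(states)
n = len(states)

# Each row: (category, frequency, relative frequency).
freq_table = [(s, counts[s], counts[s] / n) for s in ("CA", "AZ", "TX")]

print(f"{'State':<6}{'Freq':>5}{'Rel. freq':>11}")
for state, freq, rel in freq_table:
    print(f"{state:<6}{freq:>5}{rel:>11.2f}")
print(f"{'Total':<6}{n:>5}{1.0:>11.2f}")
```

The relative frequencies are simply the frequencies divided by the sample size, so they sum to 1.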

5 The Bar Chart
[Figure: bar chart of Frequency (vertical axis) by state (CA, AZ, TX)]

6 Example … continued
By looking at this frequency table and bar graph, one can see that roughly equal proportions of pairs of jeans are manufactured in the three states. The frequency table and bar graph are certainly more informative than the raw presentation of the sample data. Another method of pictorial presentation of qualitative data is the pie diagram. In this case a pie is divided into the categories, with a given category’s angle being equal to 360 degrees times the relative frequency of occurrence of that category.

7 Pie Diagram
Angles (in degrees):
CA = (360)(0.36) = 129.6°
AZ = (360)(0.32) = 115.2°
TX = (360)(0.32) = 115.2°
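
The angle computation is the same one-line rule for every category. A short Python sketch, assuming the counts 9, 8, and 8 out of 25 tallied from the jeans sample (the dictionary names are illustrative):

```python
# Relative frequencies for the jeans sample (9, 8 and 8 out of 25).
rel_freq = {"CA": 9 / 25, "AZ": 8 / 25, "TX": 8 / 25}

# Angle of each slice = 360 degrees times the relative frequency.
angles = {state: 360 * p for state, p in rel_freq.items()}

for state, angle in angles.items():
    print(f"{state}: {angle:.1f} degrees")
```

Because the relative frequencies sum to 1, the angles sum to 360 degrees.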

8 Pie Chart from Minitab

9 Presentation of Quantitative Variables
When the quantitative variable is discrete (such as counts), a frequency table and a bar graph could also be used for summarizing it. The only difference is that the values of the variable cannot be reordered in the graph, in contrast to when the variable is categorical or qualitative. For example, suppose that we asked a sample of 20 students about the number of siblings in their family. The sample data might be:
4, 1, 6, 2, 2, 3, 4, 1, 2, 2, 3, 7, 2, 1, 1, 5, 3, 4, 6, 3

10 Its Bar Graph is
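
A text version of the bar graph for the siblings data can be produced as below. This is only a sketch in Python, not the original figure; the bars are drawn with `#` characters, and the values are kept in numerical order.

```python
from collections import Counter

# Number of siblings reported by the 20 students in the example.
siblings = [4, 1, 6, 2, 2, 3, 4, 1, 2, 2, 3, 7, 2, 1, 1, 5, 3, 4, 6, 3]
counts = Counter(siblings)

# Walk the values in numerical order; unlike categories, they cannot be reshuffled.
for value in range(min(siblings), max(siblings) + 1):
    print(f"{value} | {'#' * counts[value]} ({counts[value]})")
```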

11 An Example of a Real Data Set: Poverty versus PACT in SC
Variables: Lunch, ActualLang, ActualMath

12 Frequency Tables and Histograms
Consider the variable “Lunch,” which represents the percentage of students in the school district whose lunches are not free. The higher the value of this variable, the richer the district.
n = Number of Observations = 86
LV = Lowest Value = 15
HV = Highest Value = 96
Let us construct a frequency table with classes: [10,20), [20,30), [30,40), …, [90,100)
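
Assigning observations to half-open classes such as [10,20) can be sketched as follows. The function name `class_frequencies` and the `demo` values are hypothetical and used only to show the mechanics; they are not the 86 district observations, which are not reproduced in this transcript.

```python
def class_frequencies(values, low=10, high=100, width=10):
    """Count how many values fall in each class [a, a + width)."""
    edges = list(range(low, high, width))          # left endpoints 10, 20, ..., 90
    counts = {(a, a + width): 0 for a in edges}
    for v in values:
        if not low <= v < high:
            raise ValueError(f"{v} lies outside [{low}, {high})")
        a = low + int((v - low) // width) * width  # left endpoint of v's class
        counts[(a, a + width)] += 1
    return counts

# HYPOTHETICAL values for illustration only; NOT the actual Lunch data.
demo = [15, 20, 29.9, 30, 47, 52, 58, 61, 75, 96]
for (a, b), freq in class_frequencies(demo).items():
    print(f"[{a},{b}): {freq}")
```

Note that a value equal to a boundary, such as 20 or 30, goes into the class that starts at that boundary, which is what the notation [a, b) means.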

13 Frequency Table for Variable “Lunch”

14 Frequency Histogram

15 Stem-and-Leaf Plots
An important tool for presenting quantitative data when the sample size is not too large is the stem-and-leaf plot. With this method there is usually no loss of information, in that the exact values of the observations can be recovered (in contrast to a frequency table for continuous data). Basic idea: divide each observation into a stem and a leaf. The stems serve as the ‘body of the plant’ while the leaves serve as its ‘branches or leaves’. An illustration makes the idea transparent.

16 An Example
A random sample of 30 subjects from the 1910 subjects in the blood pressure data set was selected. We present here the systolic blood pressures of these 30 subjects.
30 Systolic Blood Pressures: Lowest Value = 92, Highest Value = 135
Stems: 9, 10, 11, 12, 13
Leaves: Ones Digit

17 Stem-and-Leaf Plot
 9 | 224
 9 | 8
10 | 00024
10 | 88
11 | 00000000
11 | 88
12 | 00024
12 | 666
13 |
13 | 5

18 Stem-and-Leaf … continued
In this stem-and-leaf plot, because there would only be 5 stems if we used 9, 10, 11, 12, 13, we decided to subdivide each stem into two parts, one for leaf values <= 4 and one for leaf values >= 5. Such a procedure usually produces better-looking distributions. Looking at this stem-and-leaf plot, notice that many of the observations are in the range of […]. The exact values can be recovered from this plot. Arranging the leaves in ascending order also makes the plot more informative.
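
The split-stem construction can be sketched in Python as below. The function name `split_stem_and_leaf` is illustrative, and the sketch assumes non-negative integer data with the ones digit as the leaf. The data are the 30 systolic blood pressures of the sample (lowest value 92, highest value 135).

```python
def split_stem_and_leaf(values):
    """Return rows (stem, leaves) with each stem split into leaves 0-4 and 5-9."""
    values = sorted(values)
    rows = []
    for stem in range(values[0] // 10, values[-1] // 10 + 1):
        leaves = [v % 10 for v in values if v // 10 == stem]
        low = "".join(str(d) for d in leaves if d <= 4)
        high = "".join(str(d) for d in leaves if d >= 5)
        rows.append((stem, low))    # first half of the stem: leaves 0-4
        rows.append((stem, high))   # second half of the stem: leaves 5-9
    return rows

# The 30 systolic blood pressures from the sample.
bp = [122, 135, 110, 126, 100, 110, 110, 126, 94, 124,
      108, 110, 92, 98, 118, 110, 102, 108, 126, 104,
      110, 120, 110, 118, 100, 110, 120, 100, 120, 92]

for stem, leaves in split_stem_and_leaf(bp):
    print(f"{stem:>2} | {leaves}")
```

Sorting first puts the leaves in ascending order within each row, and an empty row (such as the first half of stem 13) is kept so that gaps in the distribution remain visible.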

19 Comparative Stem-and-Leaf Plots
When comparing the distributions of two groups (e.g., when classified according to GENDER), side-by-side stem-and-leaf plots (also side-by-side histograms) could be used. To illustrate, consider 30 observations from the blood pressure data set with Gender and Systolic Blood Pressure being the observed variables.
For the males (Sex = 0): 122, 120, 130, 110, 134, 136, 142, 100, 120, 162, 126, 132, 124, 130
For the females (Sex = 1): 132, 94, 104, 100, 130, 110, 102, 110, 130, 92, 125, 108, 100, 130, 100, 100
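
One possible layout for such a comparison is a back-to-back plot with the stems in the middle. The Python sketch below is one way to draw it and is not necessarily the layout of the original figure; the helper name `leaves_by_stem` is illustrative, and stems are not split here to keep the example short.

```python
def leaves_by_stem(values):
    """Map each stem (tens and above) to its sorted string of ones-digit leaves."""
    table = {}
    for v in sorted(values):
        table[v // 10] = table.get(v // 10, "") + str(v % 10)
    return table

males = [122, 120, 130, 110, 134, 136, 142, 100, 120, 162, 126, 132, 124, 130]
females = [132, 94, 104, 100, 130, 110, 102, 110, 130, 92,
           125, 108, 100, 130, 100, 100]

male_leaves = leaves_by_stem(males)
female_leaves = leaves_by_stem(females)
stems = range(min(min(male_leaves), min(female_leaves)),
              max(max(male_leaves), max(female_leaves)) + 1)
width = max(len(s) for s in male_leaves.values())

print(f"{'Males':>{width}} |    | Females")
for stem in stems:
    left = male_leaves.get(stem, "")[::-1]  # reversed so leaves grow outward from the stem
    print(f"{left:>{width}} | {stem:>2} | {female_leaves.get(stem, '')}")
```

Sharing one column of stems makes it easy to see that the female values sit lower on the scale than the male values in this sample.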

20 Comparing Male/Female Systolic Blood Pressures

21 Scatterplots: Studying Relationship Between Poverty and Math
Question: What kind of relationship is there between Lunch and PACT Math Scores?

22 Numerical Summary Measures
Overview: Why do we need numerical summary measures? Measures of Location; Measures of Variation; Measures of Position; Box Plots.

23 Why Do We Need Summary Measures?
“A picture is worth a thousand words, but beauty is always in the eyes of the beholder!” Graphs or pictures are sometimes unwieldy. One usually wants a small set of numbers that provides the important features of the data set. When making decisions, objectivity is enhanced when they are based on numbers! Numerical summaries and tabular/graphical presentations complement each other.

24 The Setting
In defining and illustrating our summary measures, assume that we have sample data.
Sample Data: X1, X2, X3, …, Xn
Sample Size: n
These summary measures are thus (sample) statistics. If instead they are based on the population values, they will be (population) parameters.

25 Measures of Location or Center
These are summary measures that provide information on the “center” of the data set. Usually, these measures of location are where the observations cluster, but not always. In layman’s terms, these measures are what we associate with “averages.” We will discuss two measures: the sample mean and the sample median.

26 Sample Mean or Arithmetic Average
The sample mean equals the sum of the observations divided by the number of observations. It is defined symbolically via
$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$

27 Properties of the Sample Mean
“Center of Gravity.” The sum of the deviations of the observations from the mean is always zero (barring rounding errors). The sample mean could, however, be affected drastically by extreme values or outliers. The sample mean is very conducive to mathematical analysis compared to other measures of location.

28 Illustration
Consider the systolic blood pressure data set from Lecture 01.
Sample Size = n = 30
Data: 122, 135, 110, 126, 100, 110, 110, 126, 94, 124, 108, 110, 92, 98, 118, 110, 102, 108, 126, 104, 110, 120, 110, 118, 100, 110, 120, 100, 120, 92

29 Sample Mean Computation
$\bar{X} = \frac{122 + 135 + \cdots + 92}{30} = \frac{3333}{30} = 111.1$
This value of $\bar{X}$ could be interpreted as the balancing point of the 30 systolic blood pressure observations. Locating this in the histogram we have:
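
The sample mean of the 30 systolic blood pressures, and the property that the deviations from the mean sum to zero, can be checked with a few lines of Python. This is only a sketch; the variable names are illustrative.

```python
# The 30 systolic blood pressures from the illustration.
bp = [122, 135, 110, 126, 100, 110, 110, 126, 94, 124,
      108, 110, 92, 98, 118, 110, 102, 108, 126, 104,
      110, 120, 110, 118, 100, 110, 120, 100, 120, 92]

n = len(bp)
total = sum(bp)
mean = total / n                         # sum of observations divided by n

# Deviations of each observation from the mean.
deviations = [x - mean for x in bp]

print(f"n = {n}, sum = {total}, mean = {mean:.1f}")
print(f"sum of deviations is zero up to rounding: {abs(sum(deviations)) < 1e-9}")
```

The check on the deviations uses a small tolerance rather than exact equality because 111.1 cannot be represented exactly in floating-point arithmetic, which is the "barring rounding errors" caveat in practice.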

30 Sample Mean in Histogram

31 Sample Median
Sample median (M) = value that divides the arranged/ordered data set into two equal parts. At least 50% of the observations are <= M and at least 50% are >= M. It is not sensitive to outliers but is harder to deal with mathematically. It is appropriate when the histogram is left- or right-skewed. In practice it is better to present both the mean and the median.

32 Illustration of Computation of Median
Consider again the blood pressure data from earlier. n = 30, an even number, so the median will be the average of the 15th and 16th observations in the arranged data.
Arranged data: 92, 92, 94, 98, 100, 100, 100, 102, 104, 108, 108, 110, 110, 110, 110, 110, 110, 110, 110, 118, 118, 120, 120, 120, 122, 124, 126, 126, 126, 135

33 Continued ...
The sample median is the average of 110 and 110, which are the 15th and 16th observations in the arranged data. The median equals 110. Note that it is very close to the sample mean value of 111.1. This closeness is because of the near symmetry of the distribution.
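
The median computation, and the contrast between mean and median when an outlier is present, can be sketched in Python as follows. The altered data set `bp_outlier`, in which 135 is replaced by 235, is a hypothetical modification made here purely to illustrate sensitivity to outliers; it is not part of the original data.

```python
from statistics import mean, median

bp = [122, 135, 110, 126, 100, 110, 110, 126, 94, 124,
      108, 110, 92, 98, 118, 110, 102, 108, 126, 104,
      110, 120, 110, 118, 100, 110, 120, 100, 120, 92]

ordered = sorted(bp)
n = len(ordered)

# n is even, so average the 15th and 16th ordered values (indices 14 and 15).
m = (ordered[n // 2 - 1] + ordered[n // 2]) / 2
print(f"median by hand = {m}, statistics.median = {median(bp)}")

# HYPOTHETICAL outlier: replace the largest value 135 by 235.
bp_outlier = [235 if x == 135 else x for x in bp]
print(f"original:     mean = {mean(bp):.2f}, median = {median(bp)}")
print(f"with outlier: mean = {mean(bp_outlier):.2f}, median = {median(bp_outlier)}")
```

The single altered value moves the mean but leaves the median where it was, since the middle of the ordered data is unchanged.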

34 Relative Positions of Mean and Median
For symmetric distributions, the mean and the median coincide. For right-skewed distributions, the mean tends to be larger than the median (the mean is pulled up by the large extreme values). For left-skewed distributions, the mean tends to be smaller than the median (the mean is pulled down by the small extreme values).
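
These relative positions can be seen on a small scale with the siblings data from the earlier example, which has a mild right tail. The `mirrored` data set below is a construction introduced here only for illustration: reflecting each value as 8 minus the value flips the direction of the skew. A Python sketch:

```python
from statistics import mean, median

# Number of siblings for the 20 students (mildly right-skewed).
siblings = [4, 1, 6, 2, 2, 3, 4, 1, 2, 2, 3, 7, 2, 1, 1, 5, 3, 4, 6, 3]

# CONSTRUCTED for illustration: reflecting the data produces a left-skewed set.
mirrored = [8 - x for x in siblings]

print(f"right-skewed: mean = {mean(siblings):.1f}, median = {median(siblings)}")
print(f"left-skewed:  mean = {mean(mirrored):.1f}, median = {median(mirrored)}")
```

In the first case the mean lies above the median and in the second it lies below, in line with the statements above, although the gaps are small because the skew in these data is mild.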

