Published by Cornelia Wheeler. Modified over 6 years ago.
1
Laugh, and the world laughs with you. Weep and you weep alone. ~Ella Wheeler Wilcox~
2
Chapter 3: Data Description
Types of data
Graphical/Numerical summaries
3
What are Data? Any set of data contains information about some group of individuals. The information is organized in variables.
4
Terms
A population is a collection of all individuals about which information is desired.
A sample is a subset of a population.
A variable is a characteristic of an individual.
The distribution of a variable tells us what values/categories it takes and how often it takes those values/categories in the population.
5
Data Analysis
Goal: to study how variables relate to one another in a population.
Method: estimating the distributions of variables (in the whole population) by summarizing the distributions of data on those variables.
6
Example: A College’s Student Dataset
The data set includes data about all currently enrolled students such as their ages, genders, heights, grades, and choices of major.
Who? What individuals do the data describe? Population/sample of study?
What? How many variables do the data describe? Give an example of a variable.
7
Types of Variables
A categorical variable places an individual into one of several groups or categories.
A quantitative variable takes numerical values for which arithmetic operations such as adding and averaging make sense.
Q. Which variables are categorical? Which are quantitative?
8
Q: Does “average” make sense?
No: Q: Is there any natural ordering among categories? (No / Yes)
Yes: Q: Can all possible values be listed down? (Yes / No)
9
Two Basic Strategies to Explore Data
Begin by examining each variable by itself. Then move on to study the relationship among the variables. Begin with a graph or graphs. Then add numerical summaries of specific aspects of the data.
10
O/L Table 3.1 Third Grade Data
11
Summarizing Data
We will move from summarizing data on one variable to summarizing data on several variables by:
Displaying the distribution of data with graphs
Describing the distribution of data with numbers
12
Terms Frequency = the # of individuals in a category or at a value.
Relative frequency = the % of individuals in a category or at a value. They both can be used to display the distribution of data.
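The two definitions above can be sketched in a few lines of Python. This is only an illustration: the `majors` list and the `frequency_table` helper are made up for the example and do not come from the slides.

```python
from collections import Counter

# Hypothetical data (not from the slides): one categorical value per individual.
majors = ["Math", "Biology", "Math", "History", "Biology", "Math"]

def frequency_table(values):
    """Return {category: (frequency, relative frequency in %)}."""
    counts = Counter(values)
    n = len(values)
    return {cat: (count, 100 * count / n) for cat, count in counts.items()}

table = frequency_table(majors)
for category, (freq, rel) in sorted(table.items()):
    print(f"{category:<8} frequency={freq}  relative frequency={rel:.1f}%")
```

The frequencies add up to the number of individuals, and the relative frequencies add up to 100%.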
13
Graphical Tools for One Variable
For a categorical variable: pie charts, bar graphs
For a quantitative variable: histograms, stem-and-leaf plots/dotplots, boxplots
14
How to Make a Pie Chart Calculate the % for each category
Draw a pie and slice it accordingly.
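One way to read "slice it accordingly" is that each category gets a central angle equal to its share of 360 degrees. A minimal sketch, assuming made-up attendance counts (the `absences` dictionary and the `pie_slices` helper are hypothetical, not taken from the slides):

```python
def pie_slices(counts):
    """Map each category to (percent of total, central angle in degrees)."""
    total = sum(counts.values())
    return {cat: (100 * c / total, 360 * c / total) for cat, c in counts.items()}

# Hypothetical counts for illustration only.
absences = {"Present": 18, "Absent": 6}
slices = pie_slices(absences)

for category, (percent, angle) in slices.items():
    print(f"{category}: {percent:.1f}% of the pie, {angle:.1f} degrees")
```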
15
Class Absence on First Day
Pie Chart
16
How to Make a Bar Chart
Label frequencies on one axis and categories of the variable on the other axis.
Construct a rectangle at each category of the variable with a height equal to the frequency in the category.
Leave a space between categories.
17
Class Absence on First Day
Bar Graph
18
Displaying Distributions of Quantitative Variables
Stem-and-leaf plots: good for small to medium datasets
Histograms: similar to bar charts; good for medium to large datasets
Dot plots: good for data with many repeated values
19
How to Make a Histogram
Divide the range of data by the approximate # of intervals desired (usually 5-20).
Round the resulting number to a convenient number (the common width for the intervals).
Construct intervals with the common width so that the first interval contains the smallest data value and the last interval contains the largest data value.
Draw the histogram: the variable on the horizontal axis and the count (or %) on the vertical axis.
20
Histograms: Class Intervals
How many intervals? One rule is to calculate the square root of the sample size, and round up.
Size of intervals? Divide the range of the data (max − min) by the number of intervals desired, and round to a convenient number.
Pick intervals so that each observation falls in exactly one interval (no overlap).
BPS - 5th Ed. Chapter 1
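The interval rules above can be sketched as follows. The data values and the `class_intervals` helper are hypothetical. The sketch also assumes that "convenient" means rounding the raw width up to the next whole number, which is one choice among many. Half-open intervals [start, start + width) ensure that no observation can fall in two intervals.

```python
import math

def class_intervals(data):
    """Return (interval edges, counts) using the square-root rule."""
    n = len(data)
    k = math.ceil(math.sqrt(n))              # number of intervals, rounded up
    lo, hi = min(data), max(data)
    raw_width = (hi - lo) / k                # range divided by number of intervals
    # Assumption: a "convenient" width is the next whole number (at least 1).
    width = max(1, math.ceil(raw_width))
    edges = [lo]
    while edges[-1] <= hi:                   # last edge must lie beyond the maximum
        edges.append(edges[-1] + width)
    counts = [0] * (len(edges) - 1)
    for x in data:
        counts[int((x - lo) // width)] += 1  # each value lands in exactly one interval
    return edges, counts

# Hypothetical heights in inches (n = 16, so the rule suggests 4 intervals).
heights = [61, 64, 65, 66, 67, 67, 68, 68, 69, 70, 70, 71, 72, 73, 75, 78]
edges, counts = class_intervals(heights)
print(edges)   # interval boundaries
print(counts)  # number of observations in each interval
```

Because the width is rounded, the final number of intervals can differ slightly from the square-root rule's suggestion.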
21
What do We See from Histograms?
Important features we should look for:
Overall pattern
Shape
Center
Spread
Outliers, the values that fall far outside the overall pattern
22
How to Make a Stemplot
Separate each observation into a stem consisting of all but the final (rightmost) digit and a leaf, the final digit.
Stems may have as many digits as needed, but each leaf contains only a single digit.
Example: for a height of 68.5, the leaf is “5” and the remaining digits, “68”, form the stem.
23
How to Make a Stemplot Write the stems in a vertical column with the smallest at the top, and draw a vertical line at the right of this column. Write each leaf in the row to the right of its stem, in increasing order out from the stem.
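The stemplot steps can be sketched as below. The `stemplot` helper and the sample values are hypothetical, and the sketch assumes non-negative whole numbers. A value such as 68.5 would first need to be rescaled, for example to 685.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot rows: stem = all but the last digit, leaf = last digit."""
    leaves = defaultdict(list)
    for v in sorted(values):                 # sorting keeps leaves in increasing order
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    rows = []
    # Walk every stem from smallest to largest so that empty stems still appear.
    for stem in range(min(leaves), max(leaves) + 1):
        row_leaves = "".join(str(d) for d in leaves.get(stem, []))
        rows.append(f"{stem:>3} | {row_leaves}")
    return rows

# Hypothetical weights in pounds, for illustration only.
weights = [125, 100, 142, 110, 123, 101, 139, 125]
rows = stemplot(weights)
print("\n".join(rows))
```

Keeping the empty stems matters: gaps in the column are what reveal the shape of the distribution and any outliers.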
24
Weight Data: Stemplot (Stem & Leaf Plot)
Basic Practice of Statistics - 5th Edition, Chapter 1
11 | 009
14 | 08
16 | 555
19 | 245
20 | 3
21 | 025
22 | 0
23 |
24 |
25 |
26 | 0
Key: 20|3 means 203 pounds. Stems = 10’s, leaves = 1’s.
The student should construct a stem & leaf plot here using the first two digits as the stem and the last digit as the leaf. The shape of the stem & leaf plot should look similar to the bar graph shown on an upcoming slide.
25
Overall Pattern—Shape
How many peaks, called modes? A distribution with one peak is called unimodal.
Symmetric or skewed?
Symmetric if the large values are mirror images of small values
Skewed to the right if the right tail (large values) is much longer than the left tail (small values)
Skewed to the left if the left tail (small values) is much longer than the right tail (large values)
26
Describing Data on a Quantitative Variable
(Sec 3.4) To measure center: Mode, Mean and Median
(Sec 3.5) To measure variability: Range, Interquartile Range (IQR) and Standard Deviation (SD)
Outliers
(Sec 3.6) Five-number summary and boxplot
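As a quick illustration of the measures listed above, Python's standard `statistics` module computes most of them directly. The data values are hypothetical. The slides do not say whether SD means the sample or the population version, so the sketch assumes the sample SD (dividing by n − 1).

```python
import statistics

# Hypothetical data for illustration only.
data = [2, 3, 3, 5, 7, 10]

# Measures of center
mean = statistics.mean(data)
median = statistics.median(data)      # average of the two middle values when n is even
mode = statistics.mode(data)          # most frequent value

# Measures of variability
data_range = max(data) - min(data)
sd = statistics.stdev(data)           # assumption: sample SD (divides by n - 1)

print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={data_range}, SD={sd:.3f}")
```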
27
Quartiles
Three numbers which divide the ordered data into four equal sized groups.
Q1 has 25% of the data below it.
Q2 has 50% of the data below it. (Median)
Q3 has 75% of the data below it.
BPS - 5th Ed. Chapter 2
28
Obtaining the Quartiles
Order the data.
For Q2, just find the median.
For Q1, look at the lower half of the data values, those to the left of the median location; find the median of this lower half.
For Q3, look at the upper half of the data values, those to the right of the median location; find the median of this upper half.
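The procedure above (median of the lower half, median, median of the upper half) can be sketched as follows. The helper names and the sample data are hypothetical. When the number of observations is odd, the sketch leaves the median itself out of both halves.

```python
def median(sorted_vals):
    """Median of an already sorted, non-empty list."""
    n = len(sorted_vals)
    mid = n // 2
    if n % 2 == 1:
        return sorted_vals[mid]
    return (sorted_vals[mid - 1] + sorted_vals[mid]) / 2

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    xs = sorted(data)                  # step 1: order the data
    n = len(xs)
    half = n // 2
    lower = xs[:half]                  # values to the left of the median location
    upper = xs[half + 1:] if n % 2 == 1 else xs[half:]  # values to the right
    return median(lower), median(xs), median(upper)

# Hypothetical data for illustration only.
q1, q2, q3 = quartiles([7, 1, 9, 4, 12, 3, 6])
print(q1, q2, q3)
```

Other textbooks and software use slightly different quartile conventions, so results on small data sets may not match this method exactly.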
29
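Weight Data: Sorted
With n = 53 observations, the median location is L(M) = (53+1)/2 = 27.
The 26 observations below the median give L(Q1) = (26+1)/2 = 13.5, so Q1 is the average of the 13th and 14th sorted values.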
30
Weight Data: Quartiles
Q1 = 127.5
Q2 = 165 (Median)
Q3 = 185
31
Five-Number Summary
minimum = 100
Q1 = 127.5
M = 165
Q3 = 185
maximum = 260
Interquartile Range (IQR) = Q3 − Q1 = 185 − 127.5 = 57.5
IQR gives the spread of the middle 50% of the data.
32
Weight Data: Boxplot
[Figure: boxplot of Weight, marking min, Q1, M, Q3 and max.]
33
Identifying Outliers
The central box of a boxplot spans Q1 and Q3; recall that this distance is the Interquartile Range (IQR).
We call an observation a mild (or extreme) outlier if it falls more than 1.5 (or 3.0) × IQR above the third quartile or below the first quartile.
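Applying this rule to the weight data (Q1 = 127.5, Q3 = 185, minimum = 100, maximum = 260) gives the cutoffs computed below. The helper names `outlier_fences` and `classify` are made up for this sketch, and the values 300 and 400 are hypothetical test points that do not appear in the data.

```python
def outlier_fences(q1, q3):
    """Return the (low, high) cutoffs for mild and extreme outliers."""
    iqr = q3 - q1
    return {
        "mild": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "extreme": (q1 - 3.0 * iqr, q3 + 3.0 * iqr),
    }

def classify(x, fences):
    """Label one observation using the 1.5 and 3.0 IQR rule."""
    low_extreme, high_extreme = fences["extreme"]
    low_mild, high_mild = fences["mild"]
    if x < low_extreme or x > high_extreme:
        return "extreme outlier"
    if x < low_mild or x > high_mild:
        return "mild outlier"
    return "not an outlier"

fences = outlier_fences(127.5, 185)    # quartiles from the weight data
print(fences)
print(classify(260, fences))           # the maximum weight
print(classify(100, fences))           # the minimum weight
```

Since both the minimum and the maximum fall inside the mild cutoffs (41.25 and 271.25), this rule flags no outliers in the weight data.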
34
Summarizing Data from 2 Variables
2 categorical variables: contingency table; (clustered or stacked) bar chart
2 quantitative variables: regression equation; scatterplot
1 categorical + 1 quantitative variable: side-by-side boxplot
35
Time Plots A time plot shows behavior over time.
Time is always on the horizontal axis, and the variable being measured is on the vertical axis.
Look for an overall pattern (trend), and deviations from this trend. Connecting the data points by lines may emphasize this trend.
Look for patterns that repeat at known regular intervals (seasonal variations).
36
Average Tuition (Public vs. Private)
37
Empirical Rule (68-95-99.7 rule)
If a variable X follows a normal distribution, that is, the histogram of all X values (the whole population) is bell-shaped, then:
Mean(X) ± 1*SD(X) covers about 68% of possible X values
Mean(X) ± 2*SD(X) covers about 95% of possible X values
Mean(X) ± 3*SD(X) covers about 99.7% of possible X values
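The three percentages can be checked numerically. For a normal variable, the probability of falling within k standard deviations of the mean equals erf(k/√2), a standard identity. The function name `normal_coverage` is just a label chosen for this sketch.

```python
import math

def normal_coverage(k):
    """P(Mean - k*SD < X < Mean + k*SD) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {100 * normal_coverage(k):.2f}%")
```

The exact values are slightly above 68% and 95%, which is why the rule is stated as an approximation.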
38
z-Scores & The Empirical Rule
Since the z-score is the number of standard deviations from the mean, we can easily interpret the z-score for bell-shaped populations using the Empirical Rule. When a population has a histogram that is approximately bell-shaped, then:
Approximately 68% of the data will have z-scores between –1 and 1.
Approximately 95% of the data will have z-scores between –2 and 2.
All, or almost all, of the data will have z-scores between –3 and 3.
[Figure: bell-shaped curve with the horizontal axis marked at z = –3, –2, –1, 1, 2, 3.]
Copyright ©2014 The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
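A z-score is computed as (value − mean) / SD. The sketch below pairs that formula with the Empirical Rule bands. The population mean of 100 and SD of 15 are hypothetical, and `z_score` and `empirical_band` are illustrative helper names.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

def empirical_band(z):
    """Describe where a z-score falls under the Empirical Rule."""
    distance = abs(z)
    if distance <= 1:
        return "within 1 SD of the mean (about 68% of values)"
    if distance <= 2:
        return "within 2 SD of the mean (about 95% of values)"
    if distance <= 3:
        return "within 3 SD of the mean (almost all values)"
    return "more than 3 SD from the mean (very unusual)"

# Hypothetical bell-shaped population: mean 100, SD 15.
z = z_score(130, 100, 15)
print(z, empirical_band(z))
```

The bands only make sense when the population is approximately bell-shaped; for skewed data a z-score can still be computed, but the percentages no longer apply.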