Published by Sharlene Hill. Modified over 9 years ago.
1
Chapter 3 Summarizing Data
2
Graphical Methods - 1 Variable
After data are collected, they are sorted into categories or ranges of values so that each individual observation falls in exactly one category/range:
–Numeric Responses: Break the “range” of values into non-overlapping bins and count the number of units in each bin
–Categorical Responses: List all possible categories (with “Other” if needed), and count the number of units in each
Pie Chart: Displays percent in each category/range
Bar Chart: Displays frequency/percent per category
Histogram: Displays frequency/percent per “range”
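The counting step for categorical responses can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_table`, the `keep` parameter, and the responses are hypothetical and do not come from the slides.

```python
from collections import Counter

def frequency_table(responses, keep=None):
    """Count units per category and convert counts to percents.

    Categories not listed in `keep` are pooled into "Other", so every
    observation still falls in exactly one category.
    """
    if keep is not None:
        responses = [r if r in keep else "Other" for r in responses]
    counts = Counter(responses)
    n = sum(counts.values())
    return {cat: (cnt, 100.0 * cnt / n) for cat, cnt in counts.items()}

# Hypothetical categorical responses (not from the slides)
responses = ["A", "B", "A", "C", "D", "A", "B", "E"]
table = frequency_table(responses, keep={"A", "B"})
for cat, (cnt, pct) in sorted(table.items()):
    print(f"{cat:>5}: count={cnt}, percent={pct:.1f}")
```

The counts are what a bar chart would display, and the percents are the slice sizes of a pie chart.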
3
Constructing Pie Charts
Select a small number of categories (say 5 or 6 at most) to avoid many narrow “slivers”
If possible, arrange categories in ascending or descending order for categorical variables
4
Monthly Philly Rainfall 1825-1869 (1/100 in)
5
Constructing Bar Charts
Put frequencies on one axis (typically vertical, unless many categories) and categories on the other
Draw rectangles over categories with height = frequency
Leave spaces between categories
6
Constructing Histograms
Used for numeric variables, so need Class Intervals
–Let Range = Largest - Smallest Measurement
–Break range into (say) 5-20 intervals depending on sample size
–Make the width of the subintervals a convenient unit, and make “break points” so that no observations fall on them
–Obtain Class Frequencies, the number in each subinterval
–Obtain Relative Frequencies, proportion in each subinterval
Construct Histogram
–Draw bars over each subinterval with height representing class frequency or relative frequency (shape will be the same)
–Leave no space between bars to imply adjacency of class intervals
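The class-interval steps above can be sketched as follows. The function name, its parameters, and the measurements are hypothetical; the starting break point is set at a half unit so that no integer observation lands on a break point.

```python
import math

def class_frequencies(data, n_classes, width, start):
    """Return (lower, upper, frequency, relative frequency) per class interval.

    `start` should be chosen (for example at a half unit) so that no
    observation falls exactly on a break point.
    """
    counts = [0] * n_classes
    for y in data:
        k = math.floor((y - start) / width)  # index of the interval holding y
        if not 0 <= k < n_classes:
            raise ValueError(f"{y} falls outside the class intervals")
        counts[k] += 1
    n = len(data)
    return [(start + k * width, start + (k + 1) * width, c, c / n)
            for k, c in enumerate(counts)]

# Hypothetical measurements (not from the slides)
data = [12, 15, 21, 22, 24, 30, 31, 38, 41, 47]
rows = class_frequencies(data, n_classes=4, width=10, start=9.5)

# Crude text histogram: one '#' per observation, no gaps between classes
for lo, hi, freq, rel in rows:
    print(f"{lo:5.1f}-{hi:5.1f} | {'#' * freq} ({freq}, {rel:.2f})")
```

Plotting the frequencies or the relative frequencies gives bars of the same shape, since the two differ only by the constant factor n.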
8
Interpreting Histograms
Probability: Heights of bars over the class intervals are proportional to the “chances” an individual chosen at random would fall in the interval
Unimodal: A histogram with a single major peak
Bimodal: Histogram with two distinct peaks (often evidence of two distinct groups of units)
Uniform: Interval heights are approximately equal
Symmetric: Right and left portions are the same shape
Right-Skewed: Right-hand side extends further
Left-Skewed: Left-hand side extends further
9
Stem-and-Leaf Plots
Simple, crude approach to obtaining the shape of a distribution without losing individual measurements to class intervals.
Procedure:
–Split each measurement into 2 sets of digits (stem and leaf)
–List stems from smallest to largest
–List the corresponding leaves beside each stem, from smallest to largest
–If too cramped/narrow, break stems into two groups: low with leaves 0-4 and high with leaves 5-9
–When numbers have many digits, trim off the right-most (less significant) digits. Leaves should always be a single digit.
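A minimal sketch of the procedure, assuming non-negative measurements. The function name, the `leaf_unit` parameter, and the data are hypothetical; splitting stems into low and high halves is not implemented here.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=1):
    """Return {stem: [leaves]} with single-digit leaves.

    Digits to the right of the leaf position are trimmed, not rounded.
    Assumes non-negative measurements.
    """
    plot = defaultdict(list)
    for v in sorted(values):
        trimmed = int(v // leaf_unit)      # drop less significant digits
        stem, leaf = divmod(trimmed, 10)   # last remaining digit is the leaf
        plot[stem].append(leaf)
    # Include empty stems so gaps in the distribution stay visible
    lo, hi = min(plot), max(plot)
    return {s: plot.get(s, []) for s in range(lo, hi + 1)}

# Hypothetical measurements (not from the slides)
values = [23, 27, 31, 35, 35, 48, 52, 59, 61]
plot = stem_and_leaf(values)
for stem, leaves in plot.items():
    print(f"{stem:>3} | {''.join(map(str, leaves))}")

# Three-digit values: trim the units digit so leaves are the tens digit
trimmed_plot = stem_and_leaf([234, 251, 418], leaf_unit=10)
```

Because the values are sorted before being split, the leaves on each stem come out in ascending order.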
10
Time Series Plots
Many datasets represent a single variable measured on a single unit at different time points
When measurements are made at equally spaced time points, the goal is often to describe temporal variation
Annual measurements can reveal long-term trends
Sub-annual (weekly, monthly, quarterly) measurements can reveal long-term trends as well as seasonal fluctuations
Plots generally have the measurement on the vertical axis and the time period on the horizontal axis. Some plots include bars around points to represent fluctuations within that time period
12
Numerical Descriptive Measures
Numeric summaries of a set of measurements
Measures of Central Tendency describe the “location” or center of a set of measurements
Measures of Variability describe the “spread” or dispersion of a set of measurements
Parameters: Numeric descriptive measures based on Populations of measurements
Statistics: Numeric descriptive measures based on Samples of measurements
13
Measures of Central Tendency - I
Mode: Most often occurring outcome (typically only of interest for variables taking on only “discrete” values)
Median: Middle value when measurements are ordered from smallest to largest
Mean: Sum of all measurements, divided by total number of measurements (equal distribution of total)
In practice, we observe only a sample, and use the sample mean to estimate the population mean
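As a quick illustration, the three measures can be computed with Python's standard `statistics` module. The data are hypothetical and include one large value to show how it pulls the mean above the median.

```python
from statistics import mean, median, multimode

# Hypothetical discrete measurements (not from the slides)
data = [2, 3, 3, 5, 7, 40]

modes = multimode(data)   # list of the most frequent value(s)
center = median(data)     # average of the two middle values when n is even
average = mean(data)      # sum of measurements divided by their number

print("mode(s):", modes)
print("median: ", center)
print("mean:   ", average)
```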
14
Example - Philadelphia Rainfall
Note: The mean is higher than the median because a few very large amounts were observed.
15
Measures of Central Tendency - II
Outlier: Individual measurement(s) falling far away from others. Can have a large effect on the mean, but not on the median
Trimmed Mean (TM): Mean that is based on the center measurements (deleting extreme measurements).
Mode: For continuous (smooth) distributions, the mode is the value corresponding to the peak of the frequency curve
Skewness: Shape of the distribution:
–Mound-Shaped Distributions: Mode ≈ Median ≈ Mean ≈ TM
–Right-Skewed Distributions: Mode < Median < TM < Mean
–Left-Skewed Distributions: Mean < TM < Median < Mode
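A trimmed mean can be sketched as below. The slides do not say how many measurements to delete, so the `proportion` parameter (share dropped from each end) and the data are assumptions made for illustration.

```python
from statistics import mean, median

def trimmed_mean(values, proportion=0.1):
    """Mean after dropping the lowest and highest `proportion` of ordered values."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)   # number dropped from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical right-skewed data with one outlier (not from the slides)
data = [2, 3, 3, 4, 5, 5, 6, 7, 8, 57]

med = median(data)
tm = trimmed_mean(data, 0.1)   # drops the single smallest and largest value
avg = mean(data)
print(med, tm, avg)
```

For these values the median is below the trimmed mean, which is below the mean, matching the right-skewed ordering listed above.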
16
Measures of Variability - I
Variability: Magnitude of dispersion in data.
Range: Difference between largest and smallest measurements in a set.
pth Percentile: Value that has at most p% of measurements below it and at most (100-p)% above it (0<p<100)
–Lower Quartile = 25th Percentile (Q1)
–Median = 50th Percentile (Q2)
–Upper Quartile = 75th Percentile (Q3)
Interquartile Range: Difference between the upper and lower quartiles (measures the amount of spread in the middle 50% of ordered measurements). IQR = Q3 - Q1
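Several conventions exist for computing quartiles from a finite sample, and software packages can give slightly different answers. The sketch below uses one common convention (the median of each half of the ordered data, leaving out the middle value when n is odd); that choice, the function name, and the data are assumptions, not something the slides specify.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + (n % 2):]   # skip the middle value when n is odd
    return median(lower), median(ordered), median(upper)

# Hypothetical measurements (not from the slides)
data = [1, 3, 4, 6, 7, 8, 10, 12, 15]

q1, q2, q3 = quartiles(data)
data_range = max(data) - min(data)
iqr = q3 - q1
print(f"Q1={q1}, median={q2}, Q3={q3}, range={data_range}, IQR={iqr}")
```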
17
Measures of Variability - II
Deviation: Distance between an individual measurement and the group mean
Variance: “Average” squared deviation (sum of squared deviations divided by N for a population, or by n-1 for a sample)
Standard Deviation: Square root of the variance (in the data’s units)
Empirical rule (measurements with mound-shaped histogram):
–Approximately 68% of measurements lie within 1 SD of the mean
–Approximately 95% of measurements lie within 2 SD of the mean
–Virtually all measurements lie within 3 SD of the mean
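The definitions above translate directly into code. This sketch uses the sample form of the variance (divisor n-1); the helper names and the data are hypothetical, and with so few observations the fractions are not expected to match the empirical rule closely.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((y - mean) ** 2 for y in values) / (n - 1)

def fraction_within(values, k):
    """Fraction of measurements lying within k sample SDs of the mean."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sample_variance(values))
    inside = [y for y in values if abs(y - mean) <= k * sd]
    return len(inside) / len(values)

# Hypothetical roughly mound-shaped measurements (not from the slides)
data = [4, 8, 6, 5, 3, 7, 6, 9, 6, 6]

var = sample_variance(data)
sd = math.sqrt(var)
within_1 = fraction_within(data, 1)
within_2 = fraction_within(data, 2)
print(f"variance={var:.3f}, SD={sd:.3f}, within 1 SD={within_1}, within 2 SD={within_2}")
```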
18
Example - Philadelphia Rainfall (Population)
Note: 383 (71%) months lie within 1 standard deviation of the mean, and 518 (96%) lie within 2 standard deviations
19
Boxplots
Graph displaying the spread of a set of measurements, highlighting quartiles and outliers.
Constructing a boxplot:
–Draw box with top at Q3, bottom at Q1, and line crossing at median (Q2). Height of box is IQR = Q3 - Q1
–Compute “lower inner fence” = Q1 - 1.5(IQR) = LIF
–Compute “upper inner fence” = Q3 + 1.5(IQR) = UIF
–Compute “lower outer fence” = Q1 - 3.0(IQR) = LOF
–Compute “upper outer fence” = Q3 + 3.0(IQR) = UOF
–Draw line from Q3 to min(UIF, largest y value). Place ‘*’ for any y values between UIF and UOF, ‘o’ for any above UOF
–Draw line from Q1 to max(LIF, smallest y value). Place ‘*’ for any y values between LIF and LOF, ‘o’ for any below LOF
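The fence calculations and outlier markers can be sketched as follows. The function names are hypothetical. The quartile values reuse Q3 = 468 and IQR = 232.25 from the rainfall example on the next slide, with Q1 inferred as Q3 - IQR = 235.75; the observations being classified are made up for illustration.

```python
def fences(q1, q3):
    """Return inner and outer fences computed from the quartiles."""
    iqr = q3 - q1
    return {
        "LIF": q1 - 1.5 * iqr,
        "UIF": q3 + 1.5 * iqr,
        "LOF": q1 - 3.0 * iqr,
        "UOF": q3 + 3.0 * iqr,
    }

def marker(y, f):
    """Return 'o' beyond an outer fence, '*' between inner and outer, else ''."""
    if y < f["LOF"] or y > f["UOF"]:
        return "o"
    if y < f["LIF"] or y > f["UIF"]:
        return "*"
    return ""

# Q3 and IQR as in the rainfall example; Q1 inferred as Q3 - IQR
f = fences(q1=235.75, q3=468)

# Hypothetical observations to classify (not from the slides)
observations = [120, 300, 500, 900, 1200]
markers = {y: marker(y, f) for y in observations}
print(f)
print(markers)
```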
20
UIF = 468 + 1.5(232.25) = 816.375
UOF = 468 + 3(232.25) = 1164.75
21
Summarizing Data of More than One Variable
Contingency Table: Cross-tabulation of units based on measurements of two qualitative variables simultaneously
Stacked Bar Graph: Bar chart with one variable represented on the horizontal axis, second variable as subcategories within bars
Cluster Bar Graph: Bar chart with one variable forming “major groupings” on the horizontal axis, second variable used to make side-by-side comparisons within major groupings (displays all combinations in a factorial experiment)
Scatterplot: Plot with quantitative variables y and x plotted against each other for each unit
Side-by-Side Boxplot: Compares distributions by groups
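A contingency table is a count of units for each combination of two qualitative variables. The sketch below builds one, along with row percents, from a list of (row category, column category) pairs. The function names, labels, and counts are hypothetical and are not the data from any example in these slides.

```python
from collections import Counter

def contingency_table(pairs):
    """Cross-tabulate (row_category, column_category) pairs into counts."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in counts})
    cols = sorted({c for _, c in counts})
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

def row_percents(table):
    """Convert each row of counts to percents of that row's total."""
    return [[100.0 * cell / sum(row) for cell in row] for row in table]

# Hypothetical units: (treatment group, outcome)
pairs = ([("T1", "yes")] * 3 + [("T1", "no")] * 7 +
         [("T2", "yes")] * 6 + [("T2", "no")] * 4)

rows, cols, table = contingency_table(pairs)
percents = row_percents(table)
for r, counts_row, pct_row in zip(rows, table, percents):
    print(r, dict(zip(cols, counts_row)), dict(zip(cols, pct_row)))
```

The row percents are what a stacked or cluster bar graph of outcome by group would display.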
22
Example - Ginkgo and Acetazolamide for Acute Mountain Syndrome Among Himalayan Trekkers
Contingency Table (Counts)
Percent Outcome by Treatment
26
Scatterplots
Identify the explanatory and response variables of interest, and label them as x and y
Obtain a set of individuals and observe the pair (x_i, y_i) for each individual. There will be n pairs.
Statistical convention places the response variable (y) on the vertical (up/down) axis and the explanatory variable (x) on the horizontal (left/right) axis. (Note: economists reverse the axes in price/quantity demand plots)
Plot the n points (x_i, y_i) on the graph
27
France August, 2003 Heat Wave Deaths
Individuals: 13 cities in France
Response: Excess Deaths (%) Aug 1-19, 2003 vs 1999-2002
Explanatory Variable: Change in Mean Temp in period (C)
Data:
28
France August, 2003 Heat Wave Deaths
Possible Outlier
29
Example - Pharmacodynamics of LSD
Response (y) - Math score (mean among 5 volunteers)
Explanatory (x) - LSD tissue concentration (mean of 5 volunteers)
Raw Data and scatterplot of Score vs LSD concentration:
Source: Wagner, et al (1968)
30
Manufacturer Production/Cost Relation
X = Amount Produced
Y = Total Cost
n = 48 months (not in order)
31
Manufacturer Production/Cost Relation