Published by Dayna McLaughlin. Modified over 9 years ago.
2
STATISTICS for MANAGERS Fellowship Course on Health System Management A Keshtkar MD, MPH, PhD Assistant Professor of Epidemiology
3
Why a Manager Needs to Know about Statistics To know how to: properly present information draw conclusions about populations based on sample information improve processes obtain reliable forecasts
4
Key Definitions A population (universe) is the collection of all items or things under consideration A sample is a portion of the population selected for analysis A parameter is a summary measure that describes a characteristic of the population
5
Population vs. Sample
Population: a b c d e f g h i j k l m n o p q r s t u v w x y z
Sample: b c g i n o r u y
Measures used to describe the population are called parameters (for example, the population mean)
Measures computed from sample data are called statistics (for example, the sample mean)
6
Two Branches of Statistics Descriptive statistics Collecting, summarizing, and describing data Inferential statistics Drawing conclusions and/or making decisions concerning a population based only on sample data
7
Descriptive Statistics: 3 major functions
Collect data (e.g., survey)
Present data (e.g., tables and graphs)
Characterize data (e.g., sample mean = $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$)
8
Inferential Statistics: drawing conclusions and/or making decisions concerning a population based on sample results. 2 major functions:
Estimation (e.g., estimate the population mean weight using the sample mean weight)
Hypothesis testing (e.g., test the claim that the population mean weight is 120 pounds)
9
Data Sources
Primary (data collection): observation, experimentation, survey
Secondary (data compilation): print or electronic
10
Reasons for Drawing a Sample Less time consuming than a census Less costly to administer than a census Less cumbersome and more practical to administer than a census of the targeted population
11
Types of Sampling Methods
Non-probability sampling: items included are chosen without regard to their probability of occurrence
Probability sampling: items in the sample are chosen on the basis of known probabilities
12
Types of Samples Used (continued)
Non-probability samples: judgement, quota, chunk, convenience
Probability samples: simple random, systematic, stratified, cluster
13
Probability Sampling
Items in the sample are chosen based on known probabilities
Probability samples: simple random, systematic, stratified, cluster
14
Simple Random Samples Every individual or item from the frame has an equal chance of being selected Selection may be with replacement or without replacement Samples obtained from table of random numbers or computer random number generators
15
Systematic Samples
Decide on sample size: n
Divide the frame of N individuals into groups of k individuals: k = N/n
Randomly select one individual from the 1st group
Select every k-th individual thereafter
Example: N = 64, n = 8, k = 8
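The systematic procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `systematic_sample`, the labelling of the frame as 1 to 64, and the fixed seed are not from the slides, and the sketch assumes N is an exact multiple of n.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Pick one random item from the first group of k, then every k-th item."""
    N = len(frame)
    k = N // n                      # group size; assumes N is a multiple of n
    rng = random.Random(seed)
    start = rng.randrange(k)        # random position inside the first group
    return [frame[start + i * k] for i in range(n)]

frame = list(range(1, 65))          # hypothetical frame: N = 64 individuals
sample = systematic_sample(frame, n=8, seed=1)
print(sample)                       # 8 individuals, each 8 positions apart
```

Only the starting point is random; every later selection is fixed by k, which is why a periodic pattern in the frame can bias a systematic sample.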
16
Stratified Samples Divide population into two or more subgroups (called strata) according to some common characteristic A simple random sample is selected from each subgroup, with sample sizes proportional to strata sizes Samples from subgroups are combined into one Population Divided into 4 strata Sample
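A minimal sketch of proportional stratified sampling follows. The strata names, their sizes, and the function name `stratified_sample` are hypothetical, chosen only so that the proportional shares come out as whole numbers.

```python
import random

def stratified_sample(strata, n, seed=None):
    """Draw a simple random sample from each stratum, sized in proportion
    to the stratum's share of the population, and combine the results."""
    rng = random.Random(seed)
    total = sum(len(items) for items in strata.values())
    sample = []
    for items in strata.values():
        # Proportional allocation; rounding can make the combined size
        # differ slightly from n when the shares are not whole numbers.
        n_stratum = round(n * len(items) / total)
        sample.extend(rng.sample(items, n_stratum))   # without replacement
    return sample

# Hypothetical population of 100 items divided into 4 strata.
strata = {
    "A": [f"A{i}" for i in range(40)],
    "B": [f"B{i}" for i in range(30)],
    "C": [f"C{i}" for i in range(20)],
    "D": [f"D{i}" for i in range(10)],
}
sample = stratified_sample(strata, n=10, seed=1)
print(sample)   # 4 items from A, 3 from B, 2 from C, 1 from D
```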
17
Cluster Samples Population is divided into several “clusters,” each representative of the population A simple random sample of clusters is selected All items in the selected clusters can be used, or items can be chosen from a cluster using another probability sampling technique Population divided into 16 clusters. Randomly selected clusters for sample
18
Advantages and Disadvantages Simple random sample and systematic sample Simple to use May not be a good representation of the population’s underlying characteristics Stratified sample Ensures representation of individuals across the entire population Cluster sample More cost effective Less efficient (need larger sample to acquire the same level of precision)
19
Types of Data
Categorical (defined categories). Examples: marital status, political party, eye color
Numerical, discrete (counted items). Examples: number of children, defects per hour
Numerical, continuous (measured characteristics). Examples: weight, voltage
20
Levels of Measurement and Measurement Scales
Ratio data (highest level): differences between measurements, true zero exists
Interval data (highest level): differences between measurements but no true zero
Ordinal data (higher level): ordered categories (rankings, order, or scaling)
Nominal data (lowest level, weakest form of measurement): categories (no ordering or direction)
Ratio and interval data are the strongest forms of measurement
21
Definition of SURVEY A “survey” is a study type that usually has two characteristics: 1. Representativeness is an important goal 2. Data collection tool & method is questionnaire and interview/ QA-ing (Questioning & Answering) respectively.
22
Evaluating Survey Worthiness What is the purpose of the survey? Is the survey based on a probability sample? Coverage error – appropriate frame? Non-response error – follow up Measurement error – good questions elicit good responses Sampling error – always exists
23
Types of Survey Errors Coverage error or selection bias Exists if some groups are excluded from the frame and have no chance of being selected Non response error or bias People who do not respond may be different from those who do respond Sampling error Variation from sample to sample will always exist Measurement error Due to weaknesses in question design, respondent error, and interviewer’s effects on the respondent
24
Types of Survey Errors (continued)
Coverage error: excluded from frame
Non-response error: follow up on nonresponses
Sampling error: random differences from sample to sample
Measurement error: bad or leading question
25
Organizing and Presenting Data Graphically Data in raw form are usually not easy to use for decision making Some type of organization is needed Table Graph Techniques reviewed here: Frequency Distributions and Histograms Bar charts and pie charts Contingency tables
26
Tables and Charts for Numerical Data Numerical Data Discrete Data Line or Polygon Histogram Polygon Box plot Continuous Data Frequency Distributions and Cumulative Distributions
27
What is a Frequency Distribution? A frequency distribution is a list or a table … containing class groupings (categories or ranges within which the data falls)... and the corresponding frequencies with which data falls within each grouping or category Tabulating Numerical Data: Frequency Distributions
28
Why Use Frequency Distributions? A frequency distribution is a way to summarize data The distribution condenses the raw data into a more useful form... and allows for a quick visual interpretation of the data
29
Class Intervals and Class Boundaries
Each class grouping has the same width
Determine the width of each interval by: width of interval ≈ range / number of desired class groupings
Use at least 5 but no more than 15 groupings
Class boundaries never overlap
Round up the interval width to get desirable endpoints
30
Frequency Distribution Example Example: A manufacturer of insulation randomly selects 20 winter days and records the daily high temperature 24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41, 43, 44, 27, 53, 27
31
Frequency Distribution Example (continued)
Sort raw data in ascending order: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Find range: 58 - 12 = 46
Select number of classes: 5 (usually between 5 and 15)
Compute class interval (width): 10 (46/5 = 9.2, then round up)
Determine class boundaries (limits): 10, 20, 30, 40, 50, 60
Compute class midpoints: 15, 25, 35, 45, 55
Count observations and assign to classes
32
Frequency Distribution Example (continued)
Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Class                 Frequency  Relative Frequency  Percentage
10 but less than 20       3            .15               15
20 but less than 30       6            .30               30
30 but less than 40       5            .25               25
40 but less than 50       4            .20               20
50 but less than 60       2            .10               10
Total                    20           1.00              100
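The tabulation above can be reproduced with a short Python sketch. The function name `frequency_distribution` is illustrative; the data and class boundaries are the ones given in the example, with each class including its lower boundary and excluding its upper boundary.

```python
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]
boundaries = [10, 20, 30, 40, 50, 60]

def frequency_distribution(data, boundaries):
    """Count the values falling in each class [lower, upper)."""
    rows = []
    for lower, upper in zip(boundaries, boundaries[1:]):
        count = sum(1 for x in data if lower <= x < upper)
        rows.append((lower, upper, count, count / len(data)))
    return rows

rows = frequency_distribution(temps, boundaries)
for lower, upper, count, rel in rows:
    print(f"{lower} but less than {upper}: {count:2d}  {rel:.2f}  {rel:.0%}")
```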
33
Graphing Numerical Data: The Histogram A graph of the data in a frequency distribution is called a histogram The class boundaries (or class midpoints) are shown on the horizontal axis the vertical axis is either frequency, relative frequency, or percentage Bars of the appropriate heights are used to represent the number of observations within each class
34
Histogram Example (no gaps between bars; class midpoints on the horizontal axis)
Class                 Class Midpoint  Frequency
10 but less than 20        15             3
20 but less than 30        25             6
30 but less than 40        35             5
40 but less than 50        45             4
50 but less than 60        55             2
35
Histograms in Excel
1. Select Tools/Data Analysis
36
Histograms in Excel (continued)
2. Choose Histogram
3. Input data range and bin range (bin range is a cell range containing the upper class boundaries for each class grouping)
Select Chart Output and click “OK”
37
Questions for Grouping Data into Classes
1. How wide should each interval be? (How many classes should be used?)
2. How should the endpoints of the intervals be determined?
Often answered by trial and error, subject to user judgment
The goal is to create a distribution that is neither too "jagged" nor too "blocky"
The goal is to appropriately show the pattern of variation in the data
38
How Many Class Intervals? Many (Narrow class intervals) may yield a very jagged distribution with gaps from empty classes Can give a poor indication of how frequency varies across classes Few (Wide class intervals) may compress variation too much and yield a blocky distribution can obscure important patterns of variation. (X axis labels are upper class endpoints)
39
Graphing Numerical Data: The Frequency Polygon (class midpoints on the horizontal axis)
Class                 Class Midpoint  Frequency
10 but less than 20        15             3
20 but less than 30        25             6
30 but less than 40        35             5
40 but less than 50        45             4
50 but less than 60        55             2
(In a percentage polygon the vertical axis would be defined to show the percentage of observations per class)
40
Tabulating Numerical Data: Cumulative Frequency
Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Class                 Frequency  Percentage  Cumulative Frequency  Cumulative Percentage
10 but less than 20       3         15               3                     15
20 but less than 30       6         30               9                     45
30 but less than 40       5         25              14                     70
40 but less than 50       4         20              18                     90
50 but less than 60       2         10              20                    100
Total                    20        100
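The cumulative columns are running totals of the class frequencies. A small sketch, using the frequencies from the table (the variable names are illustrative):

```python
from itertools import accumulate

frequencies = [3, 6, 5, 4, 2]           # class frequencies from the table
n = sum(frequencies)                    # 20 observations

cumulative_freq = list(accumulate(frequencies))
cumulative_pct = [100 * c / n for c in cumulative_freq]

print(cumulative_freq)   # [3, 9, 14, 18, 20]
print(cumulative_pct)    # [15.0, 45.0, 70.0, 90.0, 100.0]
```

These cumulative percentages are the heights plotted in the ogive.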
41
Graphing Cumulative Frequencies: The Ogive (Cumulative % Polygon)
The horizontal axis shows class boundaries (not midpoints); each cumulative percentage is the share of observations below the class's upper boundary
Class                 Upper class boundary  Cumulative Percentage
Less than 10                  10                     0
10 but less than 20           20                    15
20 but less than 30           30                    45
30 but less than 40           40                    70
40 but less than 50           50                    90
50 but less than 60           60                   100
42
Scatter Diagrams are used for bivariate numerical data Bivariate data consists of paired observations taken from two numerical variables The Scatter Diagram: one variable is measured on the vertical axis and the other variable is measured on the horizontal axis Scatter Diagrams
43
Scatter Diagram Example
Volume per day  Cost per day
     23             125
     26             140
     29             146
     33             160
     38             167
     42             170
     50             188
     55             195
     60             200
44
Scatter Diagrams in Excel Select the chart wizard 1 2 Select XY(Scatter) option, then click “Next” When prompted, enter the data range, desired legend, and desired destination to complete the scatter diagram 3
45
Tables and Charts for Categorical Data
Tabulating data: summary table
Graphing data: bar charts, pie charts, Pareto diagram
46
The Summary Table: summarize data by category (variables are categorical)
Example: Current Investment Portfolio
Investment Type  Amount (in thousands $)  Percentage (%)
Stocks                  46.5                  42.27
Bonds                   32.0                  29.09
CD                      15.5                  14.09
Savings                 16.0                  14.55
Total                  110.0                 100.0
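The percentage column is each amount divided by the total. A short sketch with the portfolio amounts from the table (the dictionary name `portfolio` is illustrative):

```python
# Amounts in thousands of dollars, as listed in the summary table.
portfolio = {"Stocks": 46.5, "Bonds": 32.0, "CD": 15.5, "Savings": 16.0}

total = sum(portfolio.values())
percentages = {name: round(100 * amount / total, 2)
               for name, amount in portfolio.items()}

for name, pct in percentages.items():
    print(f"{name:8s} {portfolio[name]:6.1f} {pct:6.2f}%")
```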
47
Bar and Pie Charts Bar charts and Pie charts are often used for qualitative (category) data Height of bar or size of pie slice shows the frequency or percentage for each category
48
Bar Chart Example: Current Investment Portfolio
Investment Type  Amount (in thousands $)  Percentage (%)
Stocks                  46.5                  42.27
Bonds                   32.0                  29.09
CD                      15.5                  14.09
Savings                 16.0                  14.55
Total                  110.0                 100.0
49
Pie Chart Example: Current Investment Portfolio
Percentages are rounded to the nearest percent: Stocks 42%, Bonds 29%, Savings 15%, CD 14%
Investment Type  Amount (in thousands $)  Percentage (%)
Stocks                  46.5                  42.27
Bonds                   32.0                  29.09
CD                      15.5                  14.09
Savings                 16.0                  14.55
Total                  110.0                 100.0
50
Pareto Diagram Used to portray categorical data A bar chart, where categories are shown in descending order of frequency A cumulative polygon is often shown in the same graph Used to separate the “vital few” from the “trivial many”
51
Pareto Diagram Example cumulative % invested (line graph) % invested in each category (bar graph) Current Investment Portfolio
52
Tabulating and Graphing Multivariate Categorical Data
Contingency Table for Investment Choices ($1000's)
Investment Category  Investor A  Investor B  Investor C  Total
Stocks                  46.5        55          27.5      129
Bonds                   32.0        44          19.0       95
CD                      15.5        20          13.5       49
Savings                 16.0        28           7.0       51
Total                  110.0       147          67.0      324
(Individual values could also be expressed as percentages of the overall total, percentages of the row totals, or percentages of the column totals)
53
Side by side bar charts (continued) Tabulating and Graphing Multivariate Categorical Data
54
Side-by-Side Chart Example Sales by quarter for three sales territories:
55
Principles of Graphical Excellence Present data in a way that provides substance, statistics and design Communicate complex ideas with clarity, precision and efficiency Give the largest number of ideas in the most efficient manner Excellence almost always involves several dimensions Tell the truth about the data
56
Errors in Presenting Data
Using “chart junk”
Failing to provide a relative basis in comparing data between groups
Compressing or distorting the vertical axis
Providing no zero point on the vertical axis
57
Chart Junk
[Figure: the minimum wage shown as a bad presentation (with chart junk) and as a good presentation (plain chart, $ axis from 0 to 4, years 1960 to 1990)]
Minimum Wage: 1960: $1.00, 1970: $1.60, 1980: $3.10, 1990: $3.80
58
No Relative Basis
[Figure: A's received by students, by class, shown once as frequencies (axis 0 to 300) and once as percentages (axis 0% to 30%), labelled as bad and good presentations]
FR = Freshmen, SO = Sophomore, JR = Junior, SR = Senior
59
Compressing Vertical Axis
[Figure: Quarterly Sales ($) for Q1 to Q4, shown as a good presentation and a bad presentation, with vertical axes of 0 to 50 and 0 to 200]
60
No Zero Point On Vertical Axis
[Figure: Monthly Sales ($), graphing the first six months of sales (J to J). Good presentations: a vertical axis from 0 to 60, or an axis marked 0, 36, 39, 42, 45. Bad presentation: a vertical axis from 36 to 45 with no zero point]
61
Different Measures for Describing Data Measures of central tendency, variation, and shape Mean, median, mode, geometric mean Quartiles Range, interquartile range (IQR), variance and standard deviation, coefficient of variation (CV) Symmetric and skewed distributions Population summary measures Mean, variance, and standard deviation Normal Distribution versus Non-normal Distribution The empirical ND rule and Chebyshev rule
62
Summary Measures: Describing Data Numerically
Central tendency: arithmetic mean, median, mode, geometric mean
Quartiles
Variation: range, interquartile range, variance, standard deviation, coefficient of variation
Shape: skewness
63
Measures of Central Tendency: Overview
Arithmetic mean
Median: midpoint of ranked values
Mode: most frequently observed value
Geometric mean
64
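Arithmetic Mean
The arithmetic mean (mean) is the most common measure of central tendency
For a sample of size n: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$
where n is the sample size and $X_1, \ldots, X_n$ are the observed values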
65
Arithmetic Mean The most common measure of central tendency Mean = sum of values divided by the number of values Affected by extreme values (outliers) (continued) 0 1 2 3 4 5 6 7 8 9 10 Mean = 3 0 1 2 3 4 5 6 7 8 9 10 Mean = 4
66
Median In an ordered array, the median is the “middle” number (50% above, 50% below) Not affected by extreme values 0 1 2 3 4 5 6 7 8 9 10 Median = 3 0 1 2 3 4 5 6 7 8 9 10 Median = 3
67
Finding the Median
The location of the median: median position = $\frac{n+1}{2}$ position in the ordered data
If the number of values is odd, the median is the middle number
If the number of values is even, the median is the average of the two middle numbers
Note that $\frac{n+1}{2}$ is not the value of the median, only the position of the median in the ranked data
68
Mode
A measure of central tendency
Value that occurs most often
Not affected by extreme values
Used for either numerical or categorical data
There may be no mode
There may be several modes
[Figure: two dot plots, one on a scale from 0 to 14 with Mode = 9, one on a scale from 0 to 6 with no mode]
69
Five houses on a hill by the beach Review Example House Prices: $2,000,000 500,000 300,000 100,000 100,000
70
Review Example: Summary Statistics Mean: ($3,000,000/5) = $600,000 Median: middle value of ranked data = $300,000 Mode: most frequent value = $100,000 House Prices: $2,000,000 500,000 300,000 100,000 100,000 Sum 3,000,000
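The three summary statistics for the house prices can be checked with Python's standard `statistics` module. This is just a verification sketch of the numbers on the slide.

```python
import statistics

house_prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

mean_price = statistics.mean(house_prices)      # sum / count
median_price = statistics.median(house_prices)  # middle value of ranked data
mode_price = statistics.mode(house_prices)      # most frequent value

print(mean_price, median_price, mode_price)
```

The single $2,000,000 house pulls the mean to twice the median, which is why the median is often preferred when outliers are present.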
71
Mean is generally used, unless extreme values (outliers) exist Then median is often used, since the median is not sensitive to extreme values. Example: Median home prices may be reported for a region – less sensitive to outliers Which measure of location is the “best”?
72
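Geometric Mean
Geometric mean: used to measure the rate of change of a variable over time
$\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$
Geometric mean rate of return: measures the status of an investment over time
$\bar{R}_G = [(1 + R_1) \times (1 + R_2) \times \cdots \times (1 + R_n)]^{1/n} - 1$
where $R_i$ is the rate of return in time period i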
73
Example An investment of $100,000 declined to $50,000 at the end of year one and rebounded to $100,000 at end of year two: 50% decrease 100% increase The overall two-year return is zero, since it started and ended at the same level.
74
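Example (continued)
Use the 1-year returns to compute the arithmetic mean and the geometric mean:
Arithmetic mean rate of return: $\bar{X} = \frac{(-50\%) + (100\%)}{2} = 25\%$ (misleading result)
Geometric mean rate of return: $\bar{R}_G = [(1 + (-0.5)) \times (1 + 1)]^{1/2} - 1 = [0.5 \times 2]^{1/2} - 1 = 0\%$ (more accurate result)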
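The same comparison can be sketched in Python, using the two yearly returns from the example (a 50% decrease followed by a 100% increase). The function name `geometric_mean_return` is illustrative.

```python
import math

returns = [-0.5, 1.0]   # year 1: 50% decrease, year 2: 100% increase

def geometric_mean_return(rates):
    """n-th root of the product of the growth factors (1 + R_i), minus 1."""
    growth = math.prod(1 + r for r in rates)
    return growth ** (1 / len(rates)) - 1

arithmetic = sum(returns) / len(returns)
geometric = geometric_mean_return(returns)

print(f"arithmetic mean return: {arithmetic:.0%}")   # 25%, misleading
print(f"geometric mean return:  {geometric:.0%}")    # 0%, matches the actual outcome
```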
75
Quartiles
Quartiles split the ranked data into 4 segments with an equal number of values per segment (25% each), separated by Q1, Q2, and Q3
The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger
Q2 is the same as the median (50% are smaller, 50% are larger)
Only 25% of the observations are greater than the third quartile, Q3
76
Quartile Formulas
Find a quartile by determining the value in the appropriate position in the ranked data, where
First quartile position: Q1 = (n+1)/4
Second quartile position: Q2 = (n+1)/2 (the median position)
Third quartile position: Q3 = 3(n+1)/4
where n is the number of observed values
77
Quartiles
Example: find the first quartile
Sample data in ordered array (n = 9): 11 12 13 16 16 17 18 21 22
Q1 is in the (9+1)/4 = 2.5 position of the ranked data, so use the value halfway between the 2nd and 3rd values: Q1 = 12.5
Q1 and Q3 are measures of noncentral location
Q2 = median, a measure of central tendency
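The positional rule can be sketched as a small function. One assumption is labelled here: for a fractional position the sketch interpolates linearly between the two neighbouring values, which matches the slide's "halfway" rule for a .5 position; the slides do not say how other fractions should be handled. The sketch also assumes at least three observations.

```python
def quartile(sorted_data, q):
    """Value at position q*(n+1)/4 (1-based) in the ranked data, q in {1, 2, 3}.
    Fractional positions are linearly interpolated (an assumption)."""
    n = len(sorted_data)
    pos = q * (n + 1) / 4
    lower = int(pos)                 # whole part of the 1-based position
    frac = pos - lower
    if frac == 0:
        return sorted_data[lower - 1]
    below, above = sorted_data[lower - 1], sorted_data[lower]
    return below + frac * (above - below)

data = [11, 12, 13, 16, 16, 17, 18, 21, 22]
print(quartile(data, 1))   # position 2.5, halfway between 12 and 13
print(quartile(data, 2))   # position 5, the median
print(quartile(data, 3))   # position 7.5, halfway between 18 and 21
```

Statistical packages use several different quartile conventions, so their results may differ slightly from this positional rule.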
78
Same center, different variation Measures of Variation Variation Variance Standard Deviation Coefficient of Variation RangeInterquartile Range Measures of variation give information on the spread or variability of the data values.
79
Range Simplest measure of variation Difference between the largest and the smallest observations: Range = X largest – X smallest 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Range = 14 - 1 = 13 Example:
80
Ignores the way in which data are distributed Sensitive to outliers 7 8 9 10 11 12 Range = 12 - 7 = 5 7 8 9 10 11 12 Range = 12 - 7 = 5 Disadvantages of the Range 1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5 1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120 Range = 5 - 1 = 4 Range = 120 - 1 = 119
81
Interquartile Range
Can eliminate some outlier problems by using the interquartile range
Eliminate some high- and low-valued observations and calculate the range from the remaining values
Interquartile range = 3rd quartile – 1st quartile = Q3 – Q1
82
Interquartile Range
Example: X minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, X maximum = 70
Interquartile range = 57 – 30 = 27
83
Variance
Average (approximately) of squared deviations of values from the mean
Sample variance: $S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
where $\bar{X}$ = arithmetic mean, n = sample size, $X_i$ = i-th value of the variable X
84
Standard Deviation
Most commonly used measure of variation
Shows variation about the mean
Has the same units as the original data
Sample standard deviation: $S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$
85
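Calculation Example: Sample Standard Deviation
Sample data ($X_i$): 10 12 14 15 17 18 18 24
n = 8, mean = $\bar{X}$ = 16
$S = \sqrt{\frac{(10-16)^2 + (12-16)^2 + (14-16)^2 + \cdots + (24-16)^2}{8 - 1}} = \sqrt{\frac{130}{7}} = 4.3095$
A measure of the “average” scatter around the mean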
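As a check on the calculation, the sample variance and standard deviation (n - 1 in the denominator) can be computed with Python's standard `statistics` module:

```python
import statistics

data = [10, 12, 14, 15, 17, 18, 18, 24]

mean = statistics.mean(data)
variance = statistics.variance(data)   # sample variance, divides by n - 1
std_dev = statistics.stdev(data)       # square root of the sample variance

print(mean, round(variance, 4), round(std_dev, 4))
```

`statistics.pvariance` and `statistics.pstdev` divide by N instead and correspond to the population formulas.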
86
Measuring variation Small standard deviation Large standard deviation
87
Comparing Standard Deviations
[Figure: three dot plots on a common scale from 11 to 21]
Data A: Mean = 15.5, S = 3.338
Data B: Mean = 15.5, S = 0.926
Data C: Mean = 15.5, S = 4.570
88
Advantages of Variance and Standard Deviation Each value in the data set is used in the calculation Values far from the mean are given extra weight (because deviations from the mean are squared)
89
Coefficient of Variation
Measures relative variation
Always in percentage (%)
Shows variation relative to mean
Can be used to compare two or more sets of data measured in different units
$CV = \left(\frac{S}{\bar{X}}\right) \cdot 100\%$
90
Comparing Coefficient of Variation
Stock A: average price last year = $50, standard deviation = $5, so CV = (5/50) · 100% = 10%
Stock B: average price last year = $100, standard deviation = $5, so CV = (5/100) · 100% = 5%
Both stocks have the same standard deviation, but stock B is less variable relative to its price
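The stock comparison can be sketched as follows. The function name `coefficient_of_variation` is illustrative; the prices and standard deviations are the ones on the slide.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return 100 * std_dev / mean

cv_a = coefficient_of_variation(std_dev=5, mean=50)    # Stock A
cv_b = coefficient_of_variation(std_dev=5, mean=100)   # Stock B

print(f"Stock A: CV = {cv_a:.0f}%")
print(f"Stock B: CV = {cv_b:.0f}%")
```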
91
Shape of a Distribution
Describes how data is distributed
Measures of shape: symmetric or skewed
Left-skewed: Mean < Median
Symmetric: Mean = Median
Right-skewed: Median < Mean
92
Using Microsoft Excel Descriptive Statistics can be obtained from Microsoft ® Excel Use menu choice: tools / data analysis / descriptive statistics Enter details in dialog box
93
Using Excel Use menu choice: tools / data analysis / descriptive statistics
94
Enter dialog box details Check box for summary statistics Click OK Using Excel (continued)
95
Excel output Microsoft Excel descriptive statistics output, using the house price data: House Prices: $2,000,000 500,000 300,000 100,000 100,000
96
Population Summary Measures
Population summary measures are called parameters
The population mean is the sum of the values in the population divided by the population size, N
$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$
where μ = population mean, N = population size, $X_i$ = i-th value of the variable X
97
Population Variance
Average of squared deviations of values from the mean
Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
where μ = population mean, N = population size, $X_i$ = i-th value of the variable X
98
Population Standard Deviation
Most commonly used measure of variation
Shows variation about the mean
Has the same units as the original data
Population standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$
99
The Empirical Rule
If the data distribution is bell-shaped, then the interval $\mu \pm 1\sigma$ contains about 68% of the values in the population or the sample
100
The Empirical Rule
$\mu \pm 2\sigma$ contains about 95% of the values in the population or the sample
$\mu \pm 3\sigma$ contains about 99.7% of the values in the population or the sample
101
Chebyshev Rule
Regardless of how the data are distributed, at least $(1 - 1/k^2)$ of the values will fall within k standard deviations of the mean (for k > 1)
Examples:
At least $(1 - 1/1^2)$ = 0% within k = 1 (μ ± 1σ)
At least $(1 - 1/2^2)$ = 75% within k = 2 (μ ± 2σ)
At least $(1 - 1/3^2)$ = 89% within k = 3 (μ ± 3σ)
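A short sketch of the Chebyshev bound, compared with the share of values actually observed within k standard deviations. The function names are illustrative, and the data set is the right-skewed one used in the box-and-whisker example of this deck, chosen here only as a convenient non-bell-shaped case.

```python
import statistics

def chebyshev_lower_bound(k):
    """Minimum share of values within k standard deviations of the mean."""
    return 1 - 1 / k**2

def share_within(data, k):
    """Observed share of values within k population standard deviations."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)
    inside = sum(1 for x in data if abs(x - mu) <= k * sigma)
    return inside / len(data)

data = [0, 2, 2, 2, 3, 3, 4, 5, 5, 10, 27]   # right-skewed data
for k in (2, 3):
    print(k, round(chebyshev_lower_bound(k), 3), round(share_within(data, k), 3))
```

The observed share is never below the bound, whatever the shape of the data; for bell-shaped data the empirical rule gives much tighter figures.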
102
Exploratory Data Analysis Box-and-Whisker Plot: A Graphical display of data using 5-number summary: Minimum -- Q1 -- Median -- Q3 -- Maximum Example: 25% 25%
103
Shape of Box-and-Whisker Plots The Box and central line are centered between the endpoints if data are symmetric around the median A Box-and-Whisker plot can be shown in either vertical or horizontal format Min Q 1 Median Q 3 Max
104
Distribution Shape and Box-and-Whisker Plot
[Figure: box-and-whisker plots marking Q1, Q2, and Q3 for left-skewed, symmetric, and right-skewed distributions]
105
Box-and-Whisker Plot Example
Below is a Box-and-Whisker plot for the following data: 0 2 2 2 3 3 4 5 5 10 27
Min = 0, Q1 = 2, Q2 = 3, Q3 = 5, Max = 27
These data are right-skewed, as the plot depicts
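The five-number summary behind this plot can be reproduced with the positional quartile rule from the quartile slides. The helper names are illustrative, and linear interpolation for fractional positions is an assumption (it is not needed here, since all three positions are whole numbers for n = 11).

```python
def quartile(sorted_data, q):
    """Value at 1-based position q*(n+1)/4; fractional positions are
    linearly interpolated (an assumption, unused for this data set)."""
    n = len(sorted_data)
    pos = q * (n + 1) / 4
    lower = int(pos)
    frac = pos - lower
    if frac == 0:
        return sorted_data[lower - 1]
    below, above = sorted_data[lower - 1], sorted_data[lower]
    return below + frac * (above - below)

def five_number_summary(data):
    ranked = sorted(data)
    return (ranked[0], quartile(ranked, 1), quartile(ranked, 2),
            quartile(ranked, 3), ranked[-1])

data = [0, 2, 2, 2, 3, 3, 4, 5, 5, 10, 27]
summary = five_number_summary(data)
print(summary)   # (0, 2, 3, 5, 27)
```

The distance from Q3 to the maximum (22) is far larger than from the minimum to Q1 (2), which is the long right whisker that signals right skew.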
106
The Sample Covariance
The sample covariance measures the strength of the linear relationship between two variables (called bivariate data)
The sample covariance: $cov(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$
Only concerned with the strength of the relationship
No causal effect is implied
107
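Interpreting Covariance
Covariance between two random variables:
cov(X,Y) > 0: X and Y tend to move in the same direction
cov(X,Y) < 0: X and Y tend to move in opposite directions
cov(X,Y) = 0: X and Y have no linear relationship (independent variables have zero covariance, but zero covariance alone does not prove independence)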
108
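Coefficient of Correlation
Measures the relative strength of the linear relationship between two variables
Sample coefficient of correlation: $r = \frac{cov(X, Y)}{S_X S_Y}$
where $S_X$ and $S_Y$ are the sample standard deviations of X and Y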
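A minimal sketch of the covariance and correlation calculations, written out from the definitions. The data are the volume and cost pairs as read from the scatter diagram example earlier in this deck; the function names are illustrative.

```python
from math import sqrt

volume = [23, 26, 29, 33, 38, 42, 50, 55, 60]          # volume per day
cost = [125, 140, 146, 160, 167, 170, 188, 195, 200]   # cost per day

def sample_covariance(x, y):
    """Sum of cross-products of deviations from the means, over n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Covariance divided by the product of the sample standard deviations."""
    s_x = sqrt(sample_covariance(x, x))   # cov(X, X) is the sample variance
    s_y = sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)

r = correlation(volume, cost)
print(f"cov = {sample_covariance(volume, cost):.2f}, r = {r:.3f}")
```

The covariance depends on the units of both variables, while r is unit free, which is why r is the measure used to compare strength across data sets.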
109
Features of Correlation Coefficient, r
Unit free
Ranges between –1 and 1
The closer to –1, the stronger the negative linear relationship
The closer to 1, the stronger the positive linear relationship
The closer to 0, the weaker the linear relationship
110
Scatter Plots of Data with Various Correlation Coefficients
[Figure: scatter plots of Y against X illustrating r = -1, r = -.6, r = 0 (two panels), r = +.3, and r = +1]
111
Using Excel to Find the Correlation Coefficient Select Tools/Data Analysis Choose Correlation from the selection menu Click OK...
112
Using Excel to Find the Correlation Coefficient Input data range and select appropriate options Click OK to get output (continued)
113
Interpreting the Result
r = .733
There is a relatively strong positive linear relationship between test score #1 and test score #2
Students who scored high on the first test tended to score high on the second test
114
Pitfalls in Numerical Descriptive Measures Data analysis is objective Should report the summary measures that best meet the assumptions about the data set Data interpretation is subjective Should be done in fair, neutral and clear manner
115
Ethical Considerations Numerical descriptive measures: Should document both good and bad results Should be presented in a fair, objective and neutral manner Should not use inappropriate summary measures to distort facts