

Example: Sample exam scores, n = 20 (“sample size”)

{60, 60, 70, 70, 70, 70, 70, 70, 70, 70, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90}

Because there are many duplicate values, we may construct a table of (absolute) frequencies and the corresponding dotplot.

R code:

    x = c(60, 70, 80, 90)     # distinct data values
    freq = c(2, 8, 4, 6)      # absolute frequency of each value
    sample = rep(x, freq)     # rebuild the 20 raw scores
    stripchart(sample, method = "stack", pch = 19, offset = 1, ylim = range(1, 8))

Often, though, it is preferable to work with proportions, i.e., relative frequencies: divide each frequency by n = 20.

    Data value xi    Frequency fi    Relative frequency fi/n
    60               2               2/20 = 0.10
    70               8               8/20 = 0.40
    80               4               4/20 = 0.20
    90               6               6/20 = 0.30
    Total            n = 20          20/20 = 1.00

All relative frequencies are positive, and they sum to 1.
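The frequency table above can be reproduced directly from the raw scores. The following is a minimal sketch in base R; the variable names (`scores`, `abs_freq`, `rel_freq`) are illustrative choices, not names used in the slides.

```r
# Rebuild the 20 raw exam scores from the distinct values and their counts
x <- c(60, 70, 80, 90)
f <- c(2, 8, 4, 6)
scores <- rep(x, f)

n <- length(scores)           # sample size
abs_freq <- table(scores)     # absolute frequencies f_i
rel_freq <- abs_freq / n      # relative frequencies f_i / n

# Side-by-side view of both columns of the table
print(cbind(frequency = abs_freq, relative = rel_freq))
```

Dividing a `table` object by `n` keeps the data values as names, so the result can be read row by row against the slide's table.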

In general…

    Data value    Frequency    Relative frequency
    xi            fi           p(xi) = fi / n
    Total         n            1

“Density” = relative frequency / width

Example: Sample exam scores, n = 20 (“sample size”)

{60, 60, 70, 70, 70, 70, 70, 70, 70, 70, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90}

    Data value xi    Frequency fi    Relative frequency
    60               2               2/20 = 0.10
    70               8               8/20 = 0.40
    80               4               4/20 = 0.20
    90               6               6/20 = 0.30
    Total            n = 20          20/20 = 1.00

R code:

    x = c(60, 70, 80, 90)
    f = c(2, 8, 4, 6)
    sample = rep(x, f)
    # freq = FALSE puts density, not counts, on the vertical axis
    hist(sample, freq = FALSE, breaks = c(50, 55, 65, 75, 85, 95, 100), labels = TRUE, col = "lightblue")

Rectangle areas (= relative frequencies): 0.10, 0.40, 0.20, 0.30. Total Area = 1!
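The "Total Area = 1" claim can be checked numerically rather than read off the plot. This is a sketch under the assumption that the same breaks as in the slide's `hist` call are used; `plot = FALSE` only suppresses the drawing, and the names `h`, `widths`, and `areas` are illustrative.

```r
# Same data and breaks as the slide's histogram, computed without plotting
scores <- rep(c(60, 70, 80, 90), c(2, 8, 4, 6))
breaks <- c(50, 55, 65, 75, 85, 95, 100)
h <- hist(scores, breaks = breaks, plot = FALSE)

widths <- diff(h$breaks)        # class widths: 5, 10, 10, 10, 10, 5
areas <- h$density * widths     # area of each rectangle = height x width
total_area <- sum(areas)

print(data.frame(left = head(breaks, -1), width = widths,
                 density = h$density, area = areas))
print(total_area)
```

Each rectangle's area equals the relative frequency of its class, which is why the areas add up to 1 regardless of how the breaks are chosen.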

“Endpoint convention”

Example: Suppose the random variable is X = Age (years) in a certain population of individuals, and we select the following random sample of n = 20 ages.

{10, 15, 15, 18, 20, 21, 21, 23, 24, 26, 26, 27, 31, 35, 35, 37, 38, 42, 46, 59}

Grouped by decade, the sample contains 4 values, 8 values, 5 values, 2 values, and 1 value.

From these values, we can construct a frequency table, i.e., a table of the frequency of each age interval in the dataset.

    Class interval    Frequency
    [10, 20)          4
    [20, 30)          8
    [30, 40)          5
    [40, 50)          2
    [50, 60)          1
    Total             n = 20

Frequency histogram: bar heights 4, 8, 5, 2, 1.

“Endpoint convention”: here, the left endpoint is included, but not the right. Note! Stay away from “10-20,” “20-30,” “30-40,” etc., which leave it unclear which class a boundary value such as 20 belongs to.

The histogram suggests that the population may be skewed to the right (i.e., positively skewed).

In published journal articles, the original data are almost never shown; instead they are displayed in tabular form as above. This summary is called “grouped data.”
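The endpoint convention matters in software too. As a sketch in base R, `cut()` with `right = FALSE` builds left-closed classes like those in the table; the comparison with the default `right = TRUE` is added here only for illustration and is not part of the slides.

```r
ages <- c(10, 15, 15, 18, 20, 21, 21, 23, 24, 26,
          26, 27, 31, 35, 35, 37, 38, 42, 46, 59)
breaks <- seq(10, 60, by = 10)

# Left endpoint included, right endpoint excluded: [10,20), [20,30), ...
left_closed <- table(cut(ages, breaks = breaks, right = FALSE))

# R's default is the opposite convention: (10,20], (20,30], ...
# Age 20 moves into the first class, and age 10 falls outside every class.
right_closed <- table(cut(ages, breaks = breaks))

print(left_closed)
print(right_closed)
```

Under the default convention only 19 of the 20 ages are counted, which shows why the convention should be stated explicitly with grouped data.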

Relative Frequency Histogram

Example: Suppose the random variable is X = Age (years) in a certain population of individuals, and we select the following random sample of n = 20 ages.

{10, 15, 15, 18, 20, 21, 21, 23, 24, 26, 26, 27, 31, 35, 35, 37, 38, 42, 46, 59}

As before, it is often preferable to work with proportions, i.e., relative frequencies: divide the frequencies by n = 20.

    Class interval    Frequency    Relative frequency    Cumulative
    (below 10)                                           (0.00)
    [10, 20)          4            4/20 = 0.20           0.20
    [20, 30)          8            8/20 = 0.40           0.60
    [30, 40)          5            5/20 = 0.25           0.85
    [40, 50)          2            2/20 = 0.10           0.95
    [50, 60)          1            1/20 = 0.05           1.00
    Total             n = 20       20/20 = 1.00

Relative frequency histogram: bar heights 0.20, 0.40, 0.25, 0.10, 0.05 on a vertical axis running from 0.0 to 0.4.

Reading the cumulative column:
“0.00 of the sample is under 10 yrs old”
“0.20 of the sample is under 20 yrs old”
“0.60 of the sample is under 30 yrs old”
“0.85 of the sample is under 40 yrs old”
“0.95 of the sample is under 50 yrs old”
“1.00 of the sample is under 60 yrs old”

The plot of the cumulative relative frequencies is a “staircase graph” from 0 to 1. (Not a histogram!)

Relative frequencies are always between 0 and 1, and sum to 1. Cumulative relative frequencies never decrease, rising from 0 to 1.

But alas, there is a major problem….
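The cumulative column is just a running sum of the relative frequencies. A minimal sketch in base R, starting from the grouped frequencies in the table (the names `upper`, `rel_freq`, and `cum_freq` are illustrative):

```r
# Grouped frequencies for [10,20), [20,30), [30,40), [40,50), [50,60)
freq <- c(4, 8, 5, 2, 1)
upper <- c(20, 30, 40, 50, 60)     # right endpoint of each class

rel_freq <- freq / sum(freq)       # 0.20 0.40 0.25 0.10 0.05
cum_freq <- cumsum(rel_freq)       # proportion of the sample under each endpoint

print(data.frame(under_age = upper, relative = rel_freq, cumulative = cum_freq))
```

A call such as `plot(c(10, upper), c(0, cum_freq), type = "s")` would draw one version of the staircase graph.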

Relative Frequency Histogram

Suppose that, for the purpose of the study, we are not primarily concerned with those 30 or older, and wish to “lump” them into a single class interval. What effect will this have on the histogram?

{10, 15, 15, 18, 20, 21, 21, 23, 24, 26, 26, 27, 31, 35, 35, 37, 38, 42, 46, 59}

The classes [30, 40), [40, 50), [50, 60), with frequencies 5, 2, 1, merge into the single class [30, 60) with frequency 8:

    Class interval    Frequency    Relative frequency
    [10, 20)          4            4/20 = 0.20
    [20, 30)          8            8/20 = 0.40
    [30, 60)          8            8/20 = 0.40
    Total             n = 20       20/20 = 1.00

With bar height = relative frequency, the bars now have heights 0.20, 0.40, 0.40 instead of 0.20, 0.40, 0.25, 0.10, 0.05. The skew no longer appears. The histogram is distorted because of the presence of an outlier (59) in the data, creating the need for unequal class widths.

Outliers (A Pain in the Tuches)

What are they?
Informally, an outlier is a sample data value that is either “much” smaller or “much” larger than the other values.

How do they arise?
  experimental error
  measurement error
  recording error
  not an error; genuine

What can we do about them?
  double-check them if possible
  delete them?
  include them… somehow
  perform analysis both ways
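"Perform analysis both ways" can be as simple as computing each summary with and without the suspect value. The sketch below applies this to the age sample from the earlier slides, treating 59 as the outlier as those slides do; the choice of mean and median as the summaries is an assumption made for illustration.

```r
ages <- c(10, 15, 15, 18, 20, 21, 21, 23, 24, 26,
          26, 27, 31, 35, 35, 37, 38, 42, 46, 59)
without_outlier <- ages[ages != 59]    # drop the single suspect value

both_ways <- data.frame(
  analysis = c("with 59", "without 59"),
  n        = c(length(ages), length(without_outlier)),
  mean     = c(mean(ages), mean(without_outlier)),
  median   = c(median(ages), median(without_outlier))
)
print(both_ways)
```

Here the mean shifts while the median does not, so reporting both analyses shows how much the conclusion depends on that one value.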

Density Histogram

IDEA: Instead of having height of each class rectangle = relative frequency, make area of each class rectangle = relative frequency.

    height × width = relative frequency, so height = “Density” = relative frequency / width

    Class interval    Width    Relative frequency    Density (= height)
    [10, 20)          10       0.20                  0.20/10 = 0.020
    [20, 30)          10       0.40                  0.40/10 = 0.040
    [30, 60)          30       0.40                  0.40/30 = 0.0133…
    Total                      1.00

Total Area = 1! The outlier is included, and the overall skewed appearance is restored.

Exercise: What if the outlier were 99 instead of 59?
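The density column can be computed by hand and compared with what `hist()` reports for unequal class widths. A minimal sketch, assuming the breaks 10, 20, 30, 60 and the left-closed convention from the slides:

```r
ages <- c(10, 15, 15, 18, 20, 21, 21, 23, 24, 26,
          26, 27, 31, 35, 35, 37, 38, 42, 46, 59)
breaks <- c(10, 20, 30, 60)

# right = FALSE gives the left-closed classes [10,20), [20,30), [30,60]
h <- hist(ages, breaks = breaks, right = FALSE, plot = FALSE)

width <- diff(breaks)                    # 10, 10, 30
rel_freq <- h$counts / sum(h$counts)     # 0.20, 0.40, 0.40
density <- rel_freq / width              # height of each rectangle

print(data.frame(width, rel_freq, density, hist_density = h$density))
print(sum(density * width))              # total area
```

With unequal breaks, `hist()` draws densities rather than counts by default, which matches the approach described on the slide.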

Density Histogram

    Class interval    Absolute frequency    Relative frequency    Density
    [10, 20)          4                     0.20                  0.020
    [20, 30)          8                     0.40                  0.040
    [30, 60)          8                     0.40                  0.01333

Question: Approximately what proportion of the sample is between 18 and 24 yrs old (inclusive)?

Method 1: split the rectangles.
Step 1. Identify the intervals & rectangles.
Step 2. Split the FIRST rectangle at 18.
Step 3. Observe that the interval [18, 20) has width = 2 years and the interval [10, 20) has width = 10 years. The ratio = 2/10 = 1/5.
Step 4. Therefore, the area to the right of 18 = 1/5 of .20 = .04.
Step 5. Repeat Steps 2-4 for the SECOND rectangle at 24. The area to the left of 24 = 2/5 of .40 = .16.
Step 6. ADD: .04 + .16 = .20, i.e., 20%.

- OR -

Method 2: use “Density = Area / Width” (see page 2.3-5 of the posted Lecture Notes).
Step 1. Identify the intervals & rectangles.
Step 2. FIRST area = Width × Density = (20 - 18)(.02) = .04
        SECOND area = Width × Density = (24 - 20)(.04) = .16
Step 3. ADD: .04 + .16 = .20, i.e., 20%.

Exercise: Confirm that the actual proportion = 30%.
Exercise: What if ages 23, 24 were both changed to 25?
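Both methods amount to summing width × density over the part of each rectangle that lies inside the age range. A sketch of that calculation, followed by the first exercise's check against the raw data; `area_between` is a hypothetical helper written for this example, not a base R function.

```r
breaks <- c(10, 20, 30, 60)
density <- c(0.02, 0.04, 0.4 / 30)

# Hypothetical helper: area under the density histogram between a and b
area_between <- function(a, b, breaks, density) {
  lower <- pmax(a, head(breaks, -1))     # clip each class to [a, b]
  upper <- pmin(b, tail(breaks, -1))
  overlap <- pmax(upper - lower, 0)      # width of each class inside [a, b]
  sum(overlap * density)
}

approx_prop <- area_between(18, 24, breaks, density)   # (2)(.02) + (4)(.04)

# Actual proportion from the raw sample, endpoints included
ages <- c(10, 15, 15, 18, 20, 21, 21, 23, 24, 26,
          26, 27, 31, 35, 35, 37, 38, 42, 46, 59)
actual_prop <- mean(ages >= 18 & ages <= 24)

print(c(approximate = approx_prop, actual = actual_prop))
```

The grouped-data answer is only an approximation because it treats ages as spread evenly within each class, which is not quite true of this sample.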

“Measures of Center”

Example: Sample exam scores

{60, 60, 70, 70, 70, 70, 70, 70, 70, 70, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90}

    Data value xi    Frequency fi    Relative frequency p(xi) = fi/n
    60               2               2/20 = 0.1
    70               8               8/20 = 0.4
    80               4               4/20 = 0.2
    90               6               6/20 = 0.3
    Total            n = 20          20/20 = 1.0

sample mode: most frequent value = 70

sample median: “middle” value = (70 + 80) / 2 = 75
Useful when outliers are present, e.g., employee salaries + CEO.
Quartiles are found similarly: Q1 = 70, Q2 = 75, Q3 = 90. Quintiles, deciles, and other percentiles (= quantiles) are similar.

sample mean: average value

    $\bar{x} = \frac{1}{n} \sum x_i f_i = \frac{1}{20}\,[(60)(2) + (70)(8) + (80)(4) + (90)(6)] = 77$

Equivalently, a “weighted” sample mean (with weights = rel freqs):

    $\bar{x} = \sum x_i \, p(x_i) = (60)\frac{2}{20} + (70)\frac{8}{20} + (80)\frac{4}{20} + (90)\frac{6}{20} = 77$

“Notation, notation, notation.”
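The three measures of center, and the two equivalent forms of the mean, can be computed from the frequency table as follows. This is a sketch in base R with illustrative variable names; `which.max` picks the first most frequent value, which is adequate here because the mode is unique.

```r
x <- c(60, 70, 80, 90)      # data values x_i
f <- c(2, 8, 4, 6)          # frequencies f_i
n <- sum(f)
p <- f / n                  # relative frequencies p(x_i)
scores <- rep(x, f)

mode_value    <- x[which.max(f)]     # most frequent value
median_value  <- median(scores)      # average of the 10th and 11th sorted values
mean_by_freq  <- sum(x * f) / n      # (1/n) * sum of x_i f_i
mean_weighted <- sum(x * p)          # sum of x_i p(x_i)

print(c(mode = mode_value, median = median_value,
        mean_by_freq = mean_by_freq, mean_weighted = mean_weighted))
```

Both mean formulas agree with `mean(scores)` on the raw data, which is the point of the "weighted mean" notation.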

“Measures of Spread”

Example: Sample exam scores

{60, 60, 70, 70, 70, 70, 70, 70, 70, 70, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90}

    Data value xi    Frequency fi
    60               2
    70               8
    80               4
    90               6
    Total            n = 20

The sample mean locates the center … but how do we measure the “spread” of a set of values?

First attempt: sample range = x_n - x_1 (largest value minus smallest) = 90 - 60 = 30.
Simple, but it ignores all of the data except the extreme points, and is thus far too sensitive to outliers to be of any practical value. Example: company employee salaries, including the CEO.

Can modify with… sample interquartile range (IQR) = Q3 - Q1 = 90 - 70 = 20.

We would still prefer a measure that uses all of the data.
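The sensitivity of the range to a single extreme value is easy to see numerically. In the sketch below, the extra score of 200 is a made-up outlier added only for illustration; note also that R's `quantile()` uses its own default interpolation rule, which happens to give the same quartiles as the slide for this sample.

```r
scores <- rep(c(60, 70, 80, 90), c(2, 8, 4, 6))

sample_range <- diff(range(scores))          # largest minus smallest
quartiles <- quantile(scores, c(0.25, 0.75)) # Q1 and Q3
sample_iqr <- IQR(scores)                    # Q3 - Q1

# Hypothetical outlier (not in the slides' data)
with_outlier <- c(scores, 200)
range_with_outlier <- diff(range(with_outlier))
iqr_with_outlier <- IQR(with_outlier)

print(c(range = sample_range, IQR = sample_iqr))
print(c(range = range_with_outlier, IQR = iqr_with_outlier))
```

The range jumps when the outlier is added, while the IQR is unchanged, in line with the slide's motivation for the IQR.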

“Measures of Spread”

Example: Sample exam scores

{60, 60, 70, 70, 70, 70, 70, 70, 70, 70, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90}

… but how do we measure the “spread” of a set of values?

Better attempt: Calculate the average of the “deviations from the mean.”

    Data value xi    Frequency fi    Relative frequency    Deviation from mean xi - x̄
    60               2               0.10                  60 - 77 = -17
    70               8               0.40                  70 - 77 = -7
    80               4               0.20                  80 - 77 = +3
    90               6               0.30                  90 - 77 = +13
    Total            n = 20

    $\frac{1}{n} \sum (x_i - \bar{x}) f_i = \frac{1}{20}\,[(-17)(2) + (-7)(8) + (3)(4) + (13)(6)] = 0$ ????????

This is not a coincidence: the deviations always sum to 0*, so their average is not a good measure of variability.

* The sample mean is a “balance point” for the data.

Question: Why wouldn’t the median 75 be the balance point? See Prob 2.5 / 11 in Lec Notes for a more obvious example.
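The zero sum can be verified directly, and the same calculation around the median gives a hint toward the slide's question. This is a small sketch; the comparison with the median is an illustration added here, not the worked solution from the lecture notes.

```r
x <- c(60, 70, 80, 90)
f <- c(2, 8, 4, 6)
n <- sum(f)

xbar <- sum(x * f) / n                   # sample mean, 77
dev_from_mean <- x - xbar                # -17, -7, 3, 13
avg_dev <- sum(dev_from_mean * f) / n    # negative and positive parts cancel

# The same sum taken around the median (75) does not cancel
sum_dev_from_median <- sum((x - 75) * f)

print(c(avg_dev_from_mean = avg_dev, sum_dev_from_median = sum_dev_from_median))
```

Deviations below and above the mean cancel exactly, whereas around the median the positive deviations outweigh the negative ones.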

“Measures of Spread”

Example: Sample exam scores

{60, 60, 70, 70, 70, 70, 70, 70, 70, 70, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90}

    Data value xi    Frequency fi    Deviation from mean xi - x̄
    60               2               60 - 77 = -17
    70               8               70 - 77 = -7
    80               4               80 - 77 = +3
    90               6               90 - 77 = +13
    Total            n = 20

The sample mean is a “typical” sample value. For spread, calculate a modified average (dividing by n - 1 rather than n) of the “squared deviations from the mean.”

sample variance:

    $s^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2 f_i = \frac{1}{19}\,[(-17)^2 (2) + (-7)^2 (8) + (3)^2 (4) + (13)^2 (6)] = 106.316$

sample standard deviation:

    $s = \sqrt{s^2} = 10.311$, the “typical” distance from the mean
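The variance calculation can be reproduced from the frequency table and compared with R's built-in `var()` and `sd()`, which also divide by n - 1. A minimal sketch with illustrative names:

```r
x <- c(60, 70, 80, 90)
f <- c(2, 8, 4, 6)
n <- sum(f)
scores <- rep(x, f)

xbar <- sum(x * f) / n
sum_sq <- sum((x - xbar)^2 * f)   # sum of squared deviations
s2 <- sum_sq / (n - 1)            # sample variance
s <- sqrt(s2)                     # sample standard deviation

print(c(sum_sq = sum_sq, variance = s2, sd = s))
print(c(var_builtin = var(scores), sd_builtin = sd(scores)))
```

Because the square root returns to the original units (exam points), `s` is easier to interpret than `s2` as a typical distance from the mean.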

Grouped Data - revisited

    Class interval    Midpoint    Absolute frequency    Relative frequency    Density
    [10, 20)          15          4                     0.20                  0.020
    [20, 30)          25          8                     0.40                  0.040
    [30, 60)          45          8                     0.40                  0.01333

Use the interval midpoints (15, 25, 45) for the data values $x_i$. Compare this “grouped mean” with the actual sample mean.

median Q2 = ?

Step 1. Identify the interval & rectangle. (The cumulative relative frequency reaches 0.20 at age 20 and 0.60 at age 30, so the median lies in [20, 30).)
Step 2. Split the rectangle so that 0.5 of the total area lies on each side: the [20, 30) rectangle, of area 0.4, is split into 0.3 on the left and 0.1 on the right.

…OR…

Step 3. Observe that this rectangle can be split into 4 strips of area 0.1 each.
Step 4. Thus, split the interval into 4 equal parts, each of width (30 - 20)/4 = 2.5 years, with boundaries at 22.5, 25, 27.5. Three strips lie to the left of 27.5, so Q2 = 27.5.

…OR…

Step 3. Set up a proportion and solve for Q.

…OR…

Label as shown, and use the formula …

Other percentiles are done similarly. To solve using the cumulative distribution, without the histogram, see the posted Lecture Notes!
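The grouped mean and grouped median can be sketched in R as below. The interpolation step is written out as the "set up a proportion" approach, assuming values are spread evenly within the median class; the slide's own formula is not reproduced in the transcript, so this is one reading of it rather than the author's exact method.

```r
breaks <- c(10, 20, 30, 60)
freq <- c(4, 8, 8)
rel_freq <- freq / sum(freq)

# Grouped mean: interval midpoints stand in for the data values
midpoints <- (head(breaks, -1) + tail(breaks, -1)) / 2    # 15, 25, 45
grouped_mean <- sum(midpoints * rel_freq)

# Grouped median: find the class where the cumulative proportion passes 0.5,
# then move into that class in proportion to the area still needed
cum_freq <- cumsum(rel_freq)
k <- which(cum_freq >= 0.5)[1]
area_before <- if (k > 1) cum_freq[k - 1] else 0
grouped_median <- breaks[k] +
  (0.5 - area_before) / rel_freq[k] * (breaks[k + 1] - breaks[k])

# Comparison with the raw sample
ages <- c(10, 15, 15, 18, 20, 21, 21, 23, 24, 26,
          26, 27, 31, 35, 35, 37, 38, 42, 46, 59)
print(c(grouped_mean = grouped_mean, actual_mean = mean(ages)))
print(c(grouped_median = grouped_median, actual_median = median(ages)))
```

The grouped values differ from the raw-data values because the midpoint of the wide [30, 60) class overstates where most of its ages actually lie.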

Comments

$\bar{x}$ is an unbiased estimator of the population mean $\mu$, and $s^2$ is an unbiased estimator of the population variance $\sigma^2$. (Their “expected values” are $\mu$ and $\sigma^2$, respectively.)

Beware of roundoff error!!!

There is an alternate, more computationally stable formula for the sample variance $s^2$.

The numerator of $s^2$ is called a sum of squares (SS); the denominator “n - 1” is the number of degrees of freedom (df) of the n deviations $x_i - \bar{x}$, because they must satisfy a constraint (sum = 0), hence 1 degree of freedom is “lost.”

A natural setting for these formulas and concepts is geometric, specifically, the Pythagorean Theorem: $a^2 + b^2 = c^2$ (a right triangle with legs a, b and hypotenuse c). See lecture notes appendix…
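The slide does not say which alternate formula it means. As an assumption, the sketch below uses the commonly taught computational form, sum of squares = Σx² - n·x̄², and checks that it matches the definition on the exam-score sample; it makes no claim about which form is better behaved numerically.

```r
scores <- rep(c(60, 70, 80, 90), c(2, 8, 4, 6))
n <- length(scores)
xbar <- mean(scores)

# Definition: sum of squared deviations over n - 1 degrees of freedom
ss_definition <- sum((scores - xbar)^2)
s2_definition <- ss_definition / (n - 1)

# Assumed alternate form: sum of squares minus n times the squared mean
ss_alternate <- sum(scores^2) - n * xbar^2
s2_alternate <- ss_alternate / (n - 1)

print(c(definition = s2_definition, alternate = s2_alternate))

# The constraint behind "n - 1": the n deviations always sum to zero
print(sum(scores - xbar))
```

Rounding $\bar{x}$ before squaring it in the alternate form is one way the roundoff error mentioned above can creep in, so intermediate values are best kept at full precision.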