
1 Example: Sample exam scores, n = 20 (“sample size”)
{60, 60, 70, 70, 70, 70, 70, 70, 70, 70, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90}
Because there are many duplicate values, we may construct a table of (absolute) frequencies and a corresponding dotplot…

Data values xi   Frequencies fi
60               2
70               8
80               4
90               6
Total            n = 20

R code:
x = c(60, 70, 80, 90)
freq = c(2, 8, 4, 6)
sample = rep(x, freq)
stripchart(sample, method = "stack", pch = 19, offset = 1, ylim = range(1, 8))

2 Example: Sample exam scores, n = 20 (“sample size”)
{60, 60, 70, 70, 70, 70, 70, 70, 70, 70, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90}
Because there are many duplicate values, we may construct a table of (absolute) frequencies and corresponding dotplot… Often though, it is preferable to work with proportions, i.e., relative frequencies… Divide frequencies by n = 20 (labeled “Density” on the plot):

Data values xi   Frequencies fi   Relative Frequencies
60               2                2/20 = 0.10
70               8                8/20 = 0.40
80               4                4/20 = 0.20
90               6                6/20 = 0.30
Total            n = 20           20/20 = 1.00

All relative frequencies are positive, and their sum = 1.

3 In general…

Data value xi   Frequency fi   Relative Frequency p(xi) = fi / n
Total           n              1

“Density” = Relative frequency / width

4 In general… (“Density”)

Data value xi   Frequency fi   Relative Frequency p(xi) = fi / n
Total           n              1

5 Example: Sample exam scores, n = 20 (“sample size”)
{60, 60, 70, 70, 70, 70, 70, 70, 70, 70, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90}

Data values xi   Frequency fi   Relative Frequency
60               2              2/20 = 0.10
70               8              8/20 = 0.40
80               4              4/20 = 0.20
90               6              6/20 = 0.30
Total            n = 20         20/20 = 1.00

R code:
x = c(60, 70, 80, 90)
f = c(2, 8, 4, 6)
sample = rep(x, f)
hist(sample, freq = F, breaks = c(50, 55, 65, 75, 85, 95, 100), labels = T, col = "lightblue")

Total Area = 1!

6 “Endpoint convention”
Example: Suppose the random variable is X = Age (years) in a certain population of individuals, and we select the following random sample of n = 20 ages.
{10, 15, 15, 18, 20, 21, 21, 23, 24, 26, 26, 27, 31, 35, 35, 37, 38, 42, 46, 59}
From these values, we can construct a table which consists of the frequencies of each age-interval in the dataset, i.e., a frequency table.

Frequency Histogram

Class Interval   Frequency
[10, 20)         4
[20, 30)         8
[30, 40)         5
[40, 50)         2
[50, 60)         1
Total            n = 20

“Endpoint convention”: Here, the left endpoint is included, but not the right. Note!... Stay away from “10-20,” “20-30,” “30-40,” etc.

Suggests the population may be skewed to the right (i.e., positively skewed).

In published journal articles, the original data are almost never shown, but displayed in tabular form as above. This summary is called “grouped data.”
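For readers following along in R, here is a minimal sketch of how such a frequency table could be produced; the vector name ages and the use of cut()/table() are ours, not from the slides (the deck itself only shows rep(), stripchart(), and hist()):

ages = c(10, 15, 15, 18, 20, 21, 21, 23, 24, 26, 26, 27,
         31, 35, 35, 37, 38, 42, 46, 59)
# right = FALSE matches the endpoint convention: [10, 20), [20, 30), etc.
table(cut(ages, breaks = seq(10, 60, by = 10), right = FALSE))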

7 Relative Frequency Histogram
Example: Suppose the random variable is X = Age (years) in a certain population of individuals, and we select the following random sample of n = 20 ages.
{10, 15, 15, 18, 20, 21, 21, 23, 24, 26, 26, 27, 31, 35, 35, 37, 38, 42, 46, 59}
As before, it is often preferable to work with proportions, i.e., relative frequencies… Divide frequencies by n = 20.

Class Interval   Frequency   Relative Frequency
[10, 20)         4           4/20 = 0.20
[20, 30)         8           8/20 = 0.40
[30, 40)         5           5/20 = 0.25
[40, 50)         2           2/20 = 0.10
[50, 60)         1           1/20 = 0.05
Total            n = 20      20/20 = 1.00

Relative frequencies are always between 0 and 1, and sum to 1.

8 Relative Frequency Histogram
Example (continued): X = Age (years), same random sample of n = 20 ages. Divide the frequencies by n = 20, and accumulate them.

Class Interval   Frequency   Relative Frequency   Cumulative
[10, 20)         4           4/20 = 0.20          0.20
[20, 30)         8           8/20 = 0.40          0.60
[30, 40)         5           5/20 = 0.25          0.85
[40, 50)         2           2/20 = 0.10          0.95
[50, 60)         1           1/20 = 0.05          1.00
Total            n = 20      20/20 = 1.00
(Below age 10, the cumulative value is 0.00.)

“0.00 of the sample is under 10 yrs old”
“0.20 of the sample is under 20 yrs old”
“0.60 of the sample is under 30 yrs old”
“0.85 of the sample is under 40 yrs old”
“0.95 of the sample is under 50 yrs old”
“1.00 of the sample is under 60 yrs old”

Relative frequencies are always between 0 and 1, and sum to 1.
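A short R sketch, assuming the same reconstructed ages vector as above, that reproduces the relative and cumulative columns:

ages = c(10, 15, 15, 18, 20, 21, 21, 23, 24, 26, 26, 27,
         31, 35, 35, 37, 38, 42, 46, 59)
counts = table(cut(ages, breaks = seq(10, 60, by = 10), right = FALSE))
rel.freq = counts / length(ages)   # 0.20 0.40 0.25 0.10 0.05
cumsum(rel.freq)                   # 0.20 0.60 0.85 0.95 1.00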

9 Relative Frequency Histogram
Example (continued): same table of relative and cumulative relative frequencies as above. Plotted against age, the cumulative values form a “staircase graph” from 0 to 1. (Not a histogram!)

Relative frequencies are always between 0 and 1, and sum to 1.
Cumulative relative frequencies always increase from 0 to 1.

10 Relative Frequency Histogram
Example (continued): same relative frequency histogram and “staircase graph” as above.

Relative frequencies are always between 0 and 1, and sum to 1.
Cumulative relative frequencies always increase from 0 to 1.

But alas, there is a major problem….

11 Relative Frequency Histogram
Suppose that, for the purpose of the study, we are not primarily concerned with those 30 or older, and wish to “lump” them into a single class interval. What effect will this have on the histogram?
{10, 15, 15, 18, 20, 21, 21, 23, 24, 26, 26, 27, 31, 35, 35, 37, 38, 42, 46, 59}
As before, divide the frequencies by n = 20.

Class Interval   Frequency   Relative Frequency
[10, 20)         4           4/20 = 0.20
[20, 30)         8           8/20 = 0.40
[30, 60)         8           8/20 = 0.40
Total            n = 20      20/20 = 1.00

The skew no longer appears. The histogram is distorted because of the presence of an outlier (59) in the data, creating the need for unequal class widths.

12 Outliers (A Pain in the Tuches)

What are they? Informally, an outlier is a sample data value that is either “much” smaller or larger than the other values.

How do they arise?
- experimental error
- measurement error
- recording error
- not an error; genuine

What can we do about them?
- double-check them if possible
- delete them?
- include them… somehow
- perform analysis both ways

13 Density Histogram
IDEA: Instead of having the height of each class rectangle = relative frequency, make the AREA of each class rectangle = relative frequency. That is, height × width = relative frequency, so “Density” (= height) = relative frequency / width.

Class Interval   Relative Frequency   Width   Density (= height)
[10, 20)         0.20                 10      0.20/10 = 0.020
[20, 30)         0.40                 10      0.40/10 = 0.040
[30, 60)         0.40                 30      0.40/30 = 0.0133…
Total            1.00

Total Area = 1!

The outlier is included, and the overall skewed appearance is restored.

Exercise: What if the outlier were 99 instead of 59?
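In R, hist() produces this density scaling when freq = FALSE is used; a sketch, with ages being our reconstruction of the sample above:

ages = c(10, 15, 15, 18, 20, 21, 21, 23, 24, 26, 26, 27,
         31, 35, 35, 37, 38, 42, 46, 59)
# With freq = FALSE, each bar's AREA equals its relative frequency,
# so the bar heights are the densities 0.020, 0.040, 0.0133...
hist(ages, breaks = c(10, 20, 30, 60), freq = FALSE, right = FALSE, col = "lightblue")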

14 Density Histogram

Class Interval   Absolute Frequency   Relative Frequency   Density
[10, 20)         4                    0.20                 0.020
[20, 30)         8                    0.40                 0.040
[30, 60)         8                    0.40                 0.0133…

Question: Approximately what proportion of the sample is between 18 and 24 yrs old (inclusive)?

Step 1. Identify the intervals & rectangles.
Step 2. Split the FIRST rectangle at 18 as shown.
Step 3. Observe that… the interval [18, 20) has width = 2 years, and the interval [10, 20) has width = 10 years. The ratio = 2/10 = 1/5.
Step 4. Therefore, the red area = 1/5 of .20 = .04.
Step 5. Repeat Steps 2-4 for the SECOND rectangle at 24. The red area = 2/5 of .40 = .16.
Step 6. ADD: .04 + .16 = .20, i.e., 20%

15 Density Histogram

Class Interval   Absolute Frequency   Relative Frequency   Density
[10, 20)         4                    0.20                 0.020
[20, 30)         8                    0.40                 0.040
[30, 60)         8                    0.40                 0.0133…

Question: Approximately what proportion of the sample is between 18 and 24 yrs old (inclusive)?

Step 1. Identify the intervals & rectangles.
- OR -
Step 2. Use “Density = Area / Width” (see the posted Lecture Notes):
    FIRST area = Width × Density = (20 – 18)(.02) = .04
    SECOND area = Width × Density = (24 – 20)(.04) = .16
Step 3. ADD: .04 + .16 = .20, i.e., 20%

Exercise: Confirm that the actual proportion = 30%.
Exercise: What if ages 23, 24 were both changed to 25?
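The first exercise can be checked directly in R, again using our reconstructed ages vector:

ages = c(10, 15, 15, 18, 20, 21, 21, 23, 24, 26, 26, 27,
         31, 35, 35, 37, 38, 42, 46, 59)
mean(ages >= 18 & ages <= 24)   # actual proportion aged 18-24 inclusive = 0.30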

16 “Measures of Center”
Example: Sample exam scores {60, 60, 70, 70, 70, 70, 70, 70, 70, 70, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90}

Data values xi   Frequencies fi
60               2
70               8
80               4
90               6
Total            n = 20

sample mode = most frequent value = 70

sample median = “middle” value = (70 + 80) / 2 = 75
    Useful when outliers are present, e.g., employee salaries + CEO.
    Quartiles are found similarly: Q1 = 70, Q2 = 75, Q3 = 90.
    Quintiles, deciles, other percentiles (= quantiles) are similar.

sample mean = average value = x̄ = (1/n) Σ xi fi = 1/20 [(60)(2) + (70)(8) + (80)(4) + (90)(6)] = 77
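These values can be checked in R using the sample vector built in the earlier slides; note that quantile()'s default method happens to reproduce the quartiles quoted here, though other quantile definitions exist:

sample = rep(c(60, 70, 80, 90), c(2, 8, 4, 6))
mean(sample)                        # 77
median(sample)                      # 75
quantile(sample, c(.25, .50, .75))  # 70, 75, 90 with the default method
names(which.max(table(sample)))     # mode: "70"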

17 “Measures of Center”
Example: Sample exam scores {60, 60, 70, 70, 70, 70, 70, 70, 70, 70, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90}
(Frequency table as above.)

sample mode = most frequent value = 70
sample median = “middle” value = (70 + 80) / 2 = 75
sample mean = average value = x̄ = (1/n) Σ xi fi = 1/20 [(60)(2) + (70)(8) + (80)(4) + (90)(6)] = 77

18 “Measures of Center”: “weighted” sample mean (with weights = rel freqs)
Example: Sample exam scores {60, 60, 70, 70, 70, 70, 70, 70, 70, 70, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90}

Data values xi   Frequencies fi   Relative Frequencies p(xi) = fi / n
60               2                2/20 = 0.1
70               8                8/20 = 0.4
80               4                4/20 = 0.2
90               6                6/20 = 0.3
Total            n = 20           20/20 = 1.0

sample mean:
x̄ = (1/n) Σ xi fi = 1/20 [(60)(2) + (70)(8) + (80)(4) + (90)(6)] = 77
  = (60)(2/20) + (70)(8/20) + (80)(4/20) + (90)(6/20)
  = Σ xi p(xi)   …the “weighted” sample mean (with weights = rel freqs)

“Notation, notation, notation.”
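A quick R check of the weighted form, with x and the relative frequencies p taken from the table above:

x = c(60, 70, 80, 90)
p = c(0.1, 0.4, 0.2, 0.3)
sum(x * p)           # 77
weighted.mean(x, p)  # same result, using the built-in weighted mean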

19 “Measures of Spread”
Example: Sample exam scores {60, 60, 70, 70, 70, 70, 70, 70, 70, 70, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90}; sample mean x̄ = 77

… but how do we measure the “spread” of a set of values?

Data values xi   Frequencies fi
60               2
70               8
80               4
90               6
Total            n = 20

First attempt: sample range = xn – x1 = 90 – 60 = 30.
    Simple, but… it ignores all of the data except the extreme points, thus far too sensitive to outliers to be of any practical value. Example: company employee salaries, including the CEO.
    Can modify with… sample interquartile range (IQR) = Q3 – Q1 = 90 – 70 = 20.
    We would still prefer a measure that uses all of the data.
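Both measures can be computed in R from the slides' sample vector; note that IQR() relies on quantile()'s default method, which here matches Q1 = 70 and Q3 = 90:

sample = rep(c(60, 70, 80, 90), c(2, 8, 4, 6))
diff(range(sample))   # sample range = 30
IQR(sample)           # interquartile range = 20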

20 “Measures of Spread”
Example: Sample exam scores {60, 60, 70, 70, 70, 70, 70, 70, 70, 70, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90}; sample mean x̄ = 77

Better attempt: Calculate the average of the “deviations from the mean.”

Data values xi   Frequencies fi   Deviations from mean xi – x̄
60               2                60 – 77 = –17
70               8                70 – 77 = –7
80               4                80 – 77 = +3
90               6                90 – 77 = +13
Total            n = 20

(1/n) Σ (xi – x̄) fi = 1/20 [(–17)(2) + (–7)(8) + (3)(4) + (13)(6)] = 0.   ????????

This is not a coincidence – the deviations always sum to 0* – so it is not a good measure of variability.

* The sample mean is a “balance point” for the data (with relative frequencies 0.10, 0.40, 0.20, 0.30 as the weights).
Question: Why wouldn’t the median 75 be the balance point? See Prob 2.5 / 11 in Lec Notes for a more obvious example.
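A one-line R check that the deviations from the mean sum to zero (up to floating-point rounding):

sample = rep(c(60, 70, 80, 90), c(2, 8, 4, 6))
sum(sample - mean(sample))   # essentially 0: the mean is the balance point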

21 “Measures of Spread”
Example: Sample exam scores {60, 60, 70, 70, 70, 70, 70, 70, 70, 70, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90}; sample mean x̄ = 77 (a “typical” sample value)

A modified attempt: Calculate the average of the “squared deviations from the mean.”

Data values xi   Frequencies fi   Deviations from mean xi – x̄
60               2                60 – 77 = –17
70               8                70 – 77 = –7
80               4                80 – 77 = +3
90               6                90 – 77 = +13
Total            n = 20

sample variance
s² = (1/(n – 1)) Σ (xi – x̄)² fi = 1/19 [(–17)²(2) + (–7)²(8) + (3)²(4) + (13)²(6)] = 2020/19 ≈ 106.32

sample standard deviation
s = √s² ≈ 10.31   …a “typical” distance from the mean
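R's var() and sd() use the same n – 1 divisor, so they reproduce these values:

sample = rep(c(60, 70, 80, 90), c(2, 8, 4, 6))
var(sample)   # 2020/19 = 106.3158...
sd(sample)    # sqrt of the variance = 10.31...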

22 Grouped Data - revisited

Class Interval   Absolute Frequency
[10, 20)         4
[20, 30)         8
[30, 60)         8

Use the interval midpoints for the data values xi.

23 Grouped Data - revisited

Class Interval   Absolute Frequency   Midpoint
[10, 20)         4                    15
[20, 30)         8                    25
[30, 60)         8                    45

Use the interval midpoints for the data values xi.
Compare this “grouped mean” with the actual sample mean.
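A sketch of that comparison in R; the midpoints and grouped frequencies come from the table above, and ages is our reconstruction of the raw sample:

midpts = c(15, 25, 45)
f      = c(4, 8, 8)
sum(midpts * f) / sum(f)   # grouped mean = 31
ages = c(10, 15, 15, 18, 20, 21, 21, 23, 24, 26, 26, 27,
         31, 35, 35, 37, 38, 42, 46, 59)
mean(ages)                 # actual sample mean = 28.45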

24 Grouped Data - revisited

Class Interval   Absolute Frequency   Relative Frequency   Density
[10, 20)         4                    0.20                 0.020
[20, 30)         8                    0.40                 0.040
[30, 60)         8                    0.40                 0.0133…

Use the interval midpoints for the data values xi.
Compare this “grouped mean” with the actual sample mean.

median Q2 = ?
Step 1. Identify the interval & rectangle (here, the rectangle over [20, 30)).
Step 2. Split the rectangle so that 0.5 area lies above and below.

25 Grouped Data - revisited

median Q2 = ?
Step 1. Identify the interval & rectangle: the cumulative relative frequency reaches 0.20 at age 20, so the median lies in [20, 30), whose rectangle has area 0.40.
Step 2. Split the rectangle so that 0.5 area lies above and below.
…OR…
Step 3. Observe that this rectangle can be split into 4 strips of 0.1 each.
Step 4. Thus, split the interval into 4 equal parts, each of width (30 – 20)/4 = 2.5 years, at 22.5, 25, and 27.5.

Since 0.20 + 3(0.10) = 0.50, the median is Q2 = 20 + 3(2.5) = 27.5 years.

26 Grouped Data - revisited

median Q2 = ?
Step 1. Identify the interval & rectangle.
Step 2. Split the rectangle so that 0.5 area lies above and below.
…OR…
Step 3. Set up a proportion and solve for Q2: the area between 20 and Q2 must supply the remaining 0.50 – 0.20 = 0.30, and the rectangle has density 0.04 per year, so (Q2 – 20)(0.04) = 0.30, giving Q2 = 27.5.
…OR… Label as shown, and use the corresponding grouped-median formula.

Other percentiles are done similarly.
Solve using cumul dist, w/o histogram …see posted Lecture Notes!
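The same interpolation can be written as a couple of lines of R arithmetic; the values are taken from the grouped table above (this is the standard grouped-median calculation, not code from the deck):

# cumulative relative frequency is 0.20 below age 20; the class [20, 30)
# carries 0.40 of the sample, i.e. density 0.40/10 = 0.04 per year
Q2 = 20 + (0.50 - 0.20) / 0.04
Q2   # 27.5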

27 Comments
- The sample mean x̄ is an unbiased estimator of the population mean μ, and s² is an unbiased estimator of the population variance σ². (Their “expected values” are μ and σ², respectively.)
- Beware of roundoff error!!! There is an alternate, more computationally stable formula for the sample variance s².
- The numerator of s² is called a sum of squares (SS); the denominator “n – 1” is the number of degrees of freedom (df) of the n deviations xi – x̄, because they must satisfy a constraint (sum = 0), hence 1 degree of freedom is “lost.”
- A natural setting for these formulas and concepts is geometric, specifically, the Pythagorean Theorem: a² + b² = c². See lecture notes appendix…
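The slide does not show the alternate formula; one common algebraically equivalent form, which may or may not be the one the Lecture Notes intend, is s² = (Σ xi² – n x̄²) / (n – 1), illustrated here on the exam scores:

sample = rep(c(60, 70, 80, 90), c(2, 8, 4, 6))
n = length(sample)
(sum(sample^2) - n * mean(sample)^2) / (n - 1)   # 106.3158..., same value as var(sample)
var(sample)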

