Chapter 1 Overview and Descriptive Statistics


1.1 - Populations, Samples and Processes
1.2 - Pictorial and Tabular Methods in Descriptive Statistics
1.3 - Measures of Location
1.4 - Measures of Variability

Example: Sample exam scores, n = 20 (“sample size”)

{60, 60, 70, 70, 70, 70, 70, 70, 70, 70, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90}

Because there are many duplicate values, we may construct a table of (absolute) frequencies and the corresponding dotplot.

Data values x_i    Frequencies f_i
60                 2
70                 8
80                 4
90                 6
Total              n = 20

R code:

    # Distinct data values and their absolute frequencies
    x = c(60, 70, 80, 90)
    freq = c(2, 8, 4, 6)
    # Expand the table back into the 20 raw scores (this name masks base R's sample())
    sample = rep(x, freq)
    # Stacked dotplot: one dot per score
    stripchart(sample, method = "stack", pch = 19, offset = 1, ylim = range(1, 8))

Example (continued): Sample exam scores, n = 20

Often, though, it is preferable to work with proportions, i.e., relative frequencies. Divide the frequencies by n = 20:

Data values x_i    Frequencies f_i    Relative Frequencies (“Density”)
60                 2                  2/20 = 0.10
70                 8                  8/20 = 0.40
80                 4                  4/20 = 0.20
90                 6                  6/20 = 0.30
Total              n = 20             20/20 = 1.00

All relative frequencies are positive, and they sum to 1.
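The division by n can be checked directly in R. This is a minimal sketch that reuses the frequency table above; the variable names `rel_freq` and `n` are illustrative choices, not part of the original slides.

```r
# Exam-score frequency table from the example above
x    <- c(60, 70, 80, 90)   # distinct data values x_i
freq <- c(2, 8, 4, 6)       # absolute frequencies f_i
n    <- sum(freq)           # sample size

rel_freq <- freq / n        # relative frequencies p(x_i) = f_i / n
print(data.frame(x, freq, rel_freq))

# Every relative frequency is nonnegative, and together they sum to 1
print(all(rel_freq >= 0))
print(sum(rel_freq))
```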

In general…

Data values    Frequency    Relative Frequency
x_i            f_i          p(x_i) = f_i / n
Total          n            1

“Density” = relative frequency / width

Example: Sample exam scores, n = 20 (“sample size”), drawn as a density histogram

{60, 60, 70, 70, 70, 70, 70, 70, 70, 70, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90}

Data values x_i    Frequency f_i    Relative Frequency
60                 2                2/20 = 0.10
70                 8                8/20 = 0.40
80                 4                4/20 = 0.20
90                 6                6/20 = 0.30
Total              n = 20           20/20 = 1.00

R code:

    x = c(60, 70, 80, 90)
    f = c(2, 8, 4, 6)
    sample = rep(x, f)
    # freq = FALSE plots densities instead of counts, so the bar areas sum to 1
    hist(sample, freq = FALSE, breaks = c(50, 55, 65, 75, 85, 95, 100), labels = TRUE, col = "lightblue")

The four bars have areas 0.10, 0.40, 0.20 and 0.30. Total Area = 1!

Frequency Histogram

Example: Suppose the random variable is X = Age (years) in a certain population of individuals, and we select the following random sample of n = 20 ages.

{10, 15, 15, 18, 20, 21, 21, 23, 24, 26, 26, 27, 31, 35, 35, 37, 38, 42, 46, 59}

Grouped by decade, the sample contains 4 values, 8 values, 5 values, 2 values and 1 value, respectively. From these values, we can construct a table which consists of the frequencies of each age-interval in the dataset, i.e., a frequency table.

Class Interval    Frequency
[10, 20)          4
[20, 30)          8
[30, 40)          5
[40, 50)          2
[50, 60)          1
Total             n = 20

“Endpoint convention”: Here, the left endpoint is included, but not the right. Note!... Stay away from “10-20,” “20-30,” “30-40,” etc., which leave it unclear which class a boundary value such as 20 belongs to.

The histogram suggests that the population may be skewed to the right (i.e., positively skewed).

In published journal articles, the original data are almost never shown; instead they are displayed in tabular form as above. This summary is called “grouped data.”
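The frequency table and its endpoint convention can be reproduced in R. The sketch below assumes base R's `cut()` with `right = FALSE`, which closes each interval on the left and opens it on the right; the names `ages`, `classes` and `freq_tab` are illustrative.

```r
ages <- c(10, 15, 15, 18, 20, 21, 21, 23, 24, 26,
          26, 27, 31, 35, 35, 37, 38, 42, 46, 59)

# right = FALSE gives classes [10,20), [20,30), ..., [50,60):
# the left endpoint is included, the right endpoint is not.
classes  <- cut(ages, breaks = seq(10, 60, by = 10), right = FALSE)
freq_tab <- table(classes)
print(freq_tab)
```

Note that the age 20 is counted in [20, 30), not in [10, 20), which is exactly the ambiguity that labels such as “10-20” and “20-30” fail to resolve.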

Relative Frequency Histogram

Example (continued): X = Age (years), random sample of n = 20 ages.

{10, 15, 15, 18, 20, 21, 21, 23, 24, 26, 26, 27, 31, 35, 35, 37, 38, 42, 46, 59}

As before, it is often preferable to work with proportions, i.e., relative frequencies. Divide the frequencies by n = 20.

Class Interval    Frequency    Relative Frequency    Cumulative
                                                     (0.00)
[10, 20)          4            4/20 = 0.20           0.20
[20, 30)          8            8/20 = 0.40           0.60
[30, 40)          5            5/20 = 0.25           0.85
[40, 50)          2            2/20 = 0.10           0.95
[50, 60)          1            1/20 = 0.05           1.00
Total             n = 20       20/20 = 1.00

Reading the cumulative column:
“0.00 of the sample is under 10 yrs old”
“0.20 of the sample is under 20 yrs old”
“0.60 of the sample is under 30 yrs old”
“0.85 of the sample is under 40 yrs old”
“0.95 of the sample is under 50 yrs old”
“1.00 of the sample is under 60 yrs old”

Relative frequencies are always between 0 and 1, and sum to 1. Cumulative relative frequencies never decrease, rising from 0 to 1; plotted against age they form a “staircase graph” from 0 to 1 (not a histogram!).
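The relative and cumulative columns follow from the class frequencies by one division and one running sum. A minimal R sketch, with illustrative variable names:

```r
# Class frequencies for [10,20), [20,30), [30,40), [40,50), [50,60)
freq <- c(4, 8, 5, 2, 1)
names(freq) <- c("[10,20)", "[20,30)", "[30,40)", "[40,50)", "[50,60)")

rel_freq <- freq / sum(freq)   # proportion of the sample in each class
cum_rel  <- cumsum(rel_freq)   # proportion of the sample below each right endpoint

print(round(cbind(freq, rel_freq, cum_rel), 2))
```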

But alas, there is a major problem….

Relative Frequency Histogram: unequal class widths

Suppose that, for the purpose of the study, we are not primarily concerned with those 30 or older, and wish to “lump” them into a single class interval. What effect will this have on the histogram?

{10, 15, 15, 18, 20, 21, 21, 23, 24, 26, 26, 27, 31, 35, 35, 37, 38, 42, 46, 59}

Class Interval    Relative Frequency
[10, 20)          4/20 = 0.20
[20, 30)          8/20 = 0.40
[30, 60)          8/20 = 0.40
Total             20/20 = 1.00

The skew no longer appears. The histogram is distorted because of the presence of an outlier (59) in the data, creating the need for unequal class widths.

Outliers (A Pain in the Tuches)

What are they?
    Informally, an outlier is a sample data value that is either “much” smaller or “much” larger than the other values.

How do they arise?
    experimental error
    measurement error
    recording error
    not an error; genuine

What can we do about them?
    double-check them if possible
    delete them?
    include them… somehow
    perform analysis both ways

Density Histogram

IDEA: Instead of having height of each class rectangle = relative frequency, make area of each class rectangle = relative frequency:

    height × width = relative frequency, so height = “Density” = relative frequency / width.

Class Interval    Width    Relative Frequency    Density (= height)
[10, 20)          10       0.20                  0.20/10 = 0.020
[20, 30)          10       0.40                  0.40/10 = 0.040
[30, 60)          30       0.40                  0.40/30 = 0.0133…
Total                      1.00

Total Area = 1! The outlier is included, and the overall skewed appearance is restored.

Exercise: What if the outlier were 99 instead of 59?
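The density column can be reproduced in R. This sketch assumes base R's `hist()` with `plot = FALSE`, which returns the class counts and densities without drawing anything; `right = FALSE` matches the left-closed endpoint convention used in these slides.

```r
ages <- c(10, 15, 15, 18, 20, 21, 21, 23, 24, 26,
          26, 27, 31, 35, 35, 37, 38, 42, 46, 59)

breaks <- c(10, 20, 30, 60)   # unequal class widths: 10, 10, 30
widths <- diff(breaks)

# Compute the histogram without plotting it
h <- hist(ages, breaks = breaks, right = FALSE, plot = FALSE)

rel_freq <- h$counts / sum(h$counts)   # 0.20, 0.40, 0.40
density  <- rel_freq / widths          # height = relative frequency / width

print(cbind(widths, rel_freq, density))
print(sum(density * widths))           # total area under the histogram
```

Dropping `plot = FALSE` draws the histogram; with unequal class widths, `hist()` plots densities by default.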

Density Histogram: reading off a proportion

Class Interval    Absolute Frequency    Relative Frequency    Density
[10, 20)          4                     0.20                  0.020
[20, 30)          8                     0.40                  0.040
[30, 60)          8                     0.40                  0.01333

Question: Approximately what proportion of the sample is between 18 and 24 yrs old (inclusive)?

Step 1. Identify the intervals & rectangles.
Step 2. Split the FIRST rectangle at 18.
Step 3. Observe that the interval [18, 20) has width = 2 years and the interval [10, 20) has width = 10 years. The ratio = 2/10 = 1/5.
Step 4. Therefore, the red (shaded) area = 1/5 of .20 = .04.
Step 5. Repeat Steps 2-4 for the SECOND rectangle, split at 24. The red (shaded) area = 2/5 of .40 = .16.
Step 6. ADD: .04 + .16 = .20, i.e., 20%.

Density Histogram: reading off a proportion (alternative method)

Question: Approximately what proportion of the sample is between 18 and 24 yrs old (inclusive)?

Step 1. Identify the intervals & rectangles.
- OR -
Step 2. Use “Density = Area / Width,” i.e., Area = Width × Density (see page 2.3-5 of the posted Lecture Notes):
    FIRST area = Width × Density = (20 – 18)(.02) = .04
    SECOND area = Width × Density = (24 – 20)(.04) = .16
Step 3. ADD: .04 + .16 = .20, i.e., 20%.

Exercise: Confirm that the actual proportion = 30%.
Exercise: What if ages 23, 24 were both changed to 25?
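The grouped approximation and the first exercise can both be checked in R. A minimal sketch with illustrative names; the densities 0.02 and 0.04 are taken from the table above.

```r
ages <- c(10, 15, 15, 18, 20, 21, 21, 23, 24, 26,
          26, 27, 31, 35, 35, 37, 38, 42, 46, 59)

# Area under the density histogram between 18 and 24:
# width x density within [10,20), plus width x density within [20,30)
approx_prop <- (20 - 18) * 0.02 + (24 - 20) * 0.04

# Actual proportion of the raw sample with 18 <= age <= 24
actual_prop <- mean(ages >= 18 & ages <= 24)

print(c(approximate = approx_prop, actual = actual_prop))
```

The two answers differ because the histogram treats ages as spread evenly within each class, which the raw data need not be.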


“Measures of Center”

Example: Sample exam scores

{60, 60, 70, 70, 70, 70, 70, 70, 70, 70, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90}

Data values x_i    Frequencies f_i
60                 2
70                 8
80                 4
90                 6
Total              n = 20

sample mode: most frequent value = 70

sample median: “middle” value = (70 + 80) / 2 = 75. Useful when outliers are present, e.g., employee salaries + CEO.

sample mean: average value

    $\bar{x} = \frac{1}{n} \sum x_i f_i = \frac{1}{20} \left[ (60)(2) + (70)(8) + (80)(4) + (90)(6) \right] = 77$

Quartiles are found similarly: Q1 = 70, Q2 = 75, Q3 = 90. Quintiles, deciles, and other percentiles (= quantiles) are similar.
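These measures of center can be computed in R from the frequency table. This is a sketch with illustrative names; note that textbooks and software use several slightly different quantile rules, and R's default rule (type 7) happens to agree with the slide's quartiles for this data set.

```r
x <- c(60, 70, 80, 90)   # distinct data values x_i
f <- c(2, 8, 4, 6)       # frequencies f_i
scores <- rep(x, f)      # the 20 raw exam scores

sample_mode   <- x[which.max(f)]       # most frequent value
sample_median <- median(scores)        # average of the 10th and 11th ordered values
sample_mean   <- sum(x * f) / sum(f)   # (1/n) * sum of x_i * f_i

quartiles <- quantile(scores, probs = c(0.25, 0.50, 0.75))

print(c(mode = sample_mode, median = sample_median, mean = sample_mean))
print(quartiles)
```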

“Measures of Center”: the mean as a weighted average

Example: Sample exam scores

{60, 60, 70, 70, 70, 70, 70, 70, 70, 70, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90}

Data values x_i    Frequencies f_i    Relative Frequencies p(x_i) = f_i / n
60                 2                  2/20 = 0.1
70                 8                  8/20 = 0.4
80                 4                  4/20 = 0.2
90                 6                  6/20 = 0.3
Total              n = 20             20/20 = 1.0

sample mean:

    $\bar{x} = \frac{1}{n} \sum x_i f_i = \frac{1}{20} \left[ (60)(2) + (70)(8) + (80)(4) + (90)(6) \right] = 77$

Distributing the factor 1/20 over the sum:

    $\bar{x} = (60)\frac{2}{20} + (70)\frac{8}{20} + (80)\frac{4}{20} + (90)\frac{6}{20} = \sum x_i \, p(x_i) = 77$

This is a “weighted” sample mean (with weights = rel freqs).

“Notation, notation, notation.”
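The two forms of the sample mean can be compared numerically. A minimal R sketch; `weighted.mean()` is base R's weighted average, and the other names are illustrative.

```r
x <- c(60, 70, 80, 90)   # distinct data values x_i
f <- c(2, 8, 4, 6)       # frequencies f_i
p <- f / sum(f)          # relative frequencies p(x_i)

mean_from_freqs <- sum(x * f) / sum(f)    # (1/n) * sum of x_i * f_i
mean_from_props <- sum(x * p)             # sum of x_i * p(x_i)
mean_builtin    <- weighted.mean(x, w = f)

print(c(mean_from_freqs, mean_from_props, mean_builtin))
```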

“Measures of Spread”

Example: Sample exam scores

{60, 60, 70, 70, 70, 70, 70, 70, 70, 70, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90}

Data values x_i    Frequencies f_i
60                 2
70                 8
80                 4
90                 6
Total              n = 20

The sample mean measures the center… but how do we measure the “spread” of a set of values?

First attempt: sample range = x_n – x_1 (largest value minus smallest value) = 90 – 60 = 30. Simple, but it ignores all of the data except the extreme points, and is thus far too sensitive to outliers to be of any practical value. Example: Company employee salaries, including CEO.

Can modify with the sample interquartile range (IQR) = Q3 – Q1 = 90 – 70 = 20. We would still prefer a measure that uses all of the data.
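Both measures are one-liners in R. In the sketch below, the extra value 200 is a hypothetical outlier added purely for illustration (it is not in the slides), and `IQR()` uses R's default quantile rule, which matches the quartiles quoted above for this data set.

```r
scores <- rep(c(60, 70, 80, 90), c(2, 8, 4, 6))

sample_range <- diff(range(scores))   # largest value minus smallest value
sample_iqr   <- IQR(scores)           # Q3 - Q1

# Hypothetical illustration: append one extreme value and recompute
with_outlier  <- c(scores, 200)
range_outlier <- diff(range(with_outlier))
iqr_outlier   <- IQR(with_outlier)

print(c(range = sample_range, IQR = sample_iqr))
print(c(range = range_outlier, IQR = iqr_outlier))
```

The single added value changes the range substantially while leaving the IQR unchanged here, which is the sensitivity the slide warns about.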

“Measures of Spread”

Example: Sample exam scores

{60, 60, 70, 70, 70, 70, 70, 70, 70, 70, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90}

… but how do we measure the “spread” of a set of values?

Better attempt: Calculate the average of the “deviations from the mean.”

Data values x_i    Frequencies f_i    Relative Frequency    Deviations from mean x_i – x̄
60                 2                  0.10                  60 – 77 = –17
70                 8                  0.40                  70 – 77 = –7
80                 4                  0.20                  80 – 77 = +3
90                 6                  0.30                  90 – 77 = +13
Total              n = 20

    $\frac{1}{n} \sum (x_i - \bar{x}) f_i = \frac{1}{20} \left[ (-17)(2) + (-7)(8) + (3)(4) + (13)(6) \right] = 0$ ????????

This is not a coincidence: the deviations always sum to 0*, so their average is not a good measure of variability.

* The sample mean is a “balance point” for the data.

Question: Why wouldn’t the median 75 be the balance point? See Prob 2.5 / 11 in Lec Notes for a more obvious example.
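The balance-point property is easy to verify numerically, and the same calculation around the median gives a hint toward the question above. A minimal R sketch with illustrative names:

```r
x <- c(60, 70, 80, 90)   # distinct data values x_i
f <- c(2, 8, 4, 6)       # frequencies f_i
n <- sum(f)

xbar <- sum(x * f) / n   # sample mean
med  <- 75               # sample median, from the earlier slide

# Frequency-weighted deviations about the mean cancel exactly
sum_dev_mean <- sum((x - xbar) * f)

# The same sum taken about the median does not cancel
sum_dev_median <- sum((x - med) * f)

print(c(about_mean = sum_dev_mean, about_median = sum_dev_median))
```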

“Measures of Spread”

Example: Sample exam scores

{60, 60, 70, 70, 70, 70, 70, 70, 70, 70, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90}

Data values x_i    Frequencies f_i    Deviations from mean x_i – x̄
60                 2                  60 – 77 = –17
70                 8                  70 – 77 = –7
80                 4                  80 – 77 = +3
90                 6                  90 – 77 = +13
Total              n = 20

The sample mean is a “typical” sample value. For spread, calculate a modified average of the “squared deviations from the mean.”

sample variance:

    $s^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2 f_i = \frac{1}{19} \left[ (-17)^2 (2) + (-7)^2 (8) + (3)^2 (4) + (13)^2 (6) \right] = 106.316$

sample standard deviation:

    $s = \sqrt{s^2} = 10.311$, the “typical” distance from the mean.
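The variance and standard deviation can be computed from the frequency table and checked against R's built-in `var()` and `sd()`, which also divide by n – 1. A minimal sketch with illustrative names:

```r
x <- c(60, 70, 80, 90)   # distinct data values x_i
f <- c(2, 8, 4, 6)       # frequencies f_i
n <- sum(f)
xbar <- sum(x * f) / n   # sample mean

# "Modified average" of squared deviations: divide by n - 1, not n
s2 <- sum((x - xbar)^2 * f) / (n - 1)
s  <- sqrt(s2)

# Cross-check on the expanded raw data
scores <- rep(x, f)
print(c(s2 = s2, var = var(scores)))
print(c(s = s, sd = sd(scores)))
```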

Grouped Data - revisited

Class Interval    Absolute Frequency    Midpoint
[10, 20)          4                     15
[20, 30)          8                     25
[30, 60)          8                     45

Use the interval midpoints (15, 25, 45) in place of the individual data values when computing the mean. Compare this “grouped mean” with the actual sample mean.
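The comparison suggested above can be carried out in R. This sketch assumes the grouped mean is the frequency-weighted average of the class midpoints, which is the usual reading of “use the interval midpoints”; the variable names are illustrative.

```r
ages <- c(10, 15, 15, 18, 20, 21, 21, 23, 24, 26,
          26, 27, 31, 35, 35, 37, 38, 42, 46, 59)

mid <- c(15, 25, 45)   # midpoints of [10,20), [20,30), [30,60)
f   <- c(4, 8, 8)      # class frequencies

grouped_mean <- sum(mid * f) / sum(f)   # treats every value as its class midpoint
actual_mean  <- mean(ages)              # uses the raw data

print(c(grouped = grouped_mean, actual = actual_mean))
```

The grouped mean is only an approximation; here the wide [30, 60) class, whose midpoint 45 sits well above most of the ages it contains, pulls it above the actual mean.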

Grouped Data - revisited: the median

Class Interval    Absolute Frequency    Relative Frequency    Density
[10, 20)          4                     0.20                  0.020
[20, 30)          8                     0.40                  0.040
[30, 60)          8                     0.40                  0.01333

median Q2 = ?

Step 1. Identify the interval & rectangle. The cumulative relative frequency is 0.20 at age 20 and 0.60 at age 30, so the median lies in [20, 30).
Step 2. Split the rectangle so that 0.5 area lies above and below: 0.3 of this rectangle's area lies to the left of Q2 and 0.1 to the right.
…OR…
Step 3. Observe that this rectangle can be split into 4 strips of 0.1 each.
Step 4. Thus, split the interval into 4 equal parts, each of width (30 – 20)/4 = 2.5 years, with cut points at 22.5, 25 and 27.5. Three strips are needed, so Q2 = 20 + 3(2.5) = 27.5.
…OR…
Step 3. Set up a proportion and solve for Q.
…OR…
Label the rectangle as shown on the slide, and use the formula given in the posted Lecture Notes.

Other percentiles are done similarly. They can also be solved using the cumulative distribution, without the histogram… see posted Lecture Notes!
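The area-splitting steps amount to linear interpolation within the class that contains the desired percentile. The function below is a sketch of that idea, not the formula from the Lecture Notes (which this transcript does not reproduce); `grouped_quantile` and its arguments are hypothetical names, and it assumes 0 < p < 1 and values spread evenly within each class.

```r
# Interpolated p-th quantile from grouped data (assumes 0 < p < 1)
grouped_quantile <- function(p, breaks, rel_freq) {
  cum <- c(0, cumsum(rel_freq))   # cumulative proportion at each break
  k   <- max(which(cum < p))      # index of the class containing the quantile
  width <- breaks[k + 1] - breaks[k]
  # Move into class k by the fraction of its area still needed to reach p
  breaks[k] + (p - cum[k]) / rel_freq[k] * width
}

breaks   <- c(10, 20, 30, 60)
rel_freq <- c(0.20, 0.40, 0.40)

grouped_median <- grouped_quantile(0.5, breaks, rel_freq)

ages <- c(10, 15, 15, 18, 20, 21, 21, 23, 24, 26,
          26, 27, 31, 35, 35, 37, 38, 42, 46, 59)
print(c(grouped = grouped_median, actual = median(ages)))
```

As with the grouped mean, the result is an approximation to the median of the raw data.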

Comments

$\bar{x}$ is an unbiased estimator of the population mean $\mu$, and $s^2$ is an unbiased estimator of the population variance $\sigma^2$. (Their “expected values” are $\mu$ and $\sigma^2$, respectively.)

Beware of roundoff error!!! There is an alternate, more computationally stable formula for sample variance $s^2$.

The numerator of $s^2$ is called a sum of squares (SS); the denominator “n – 1” is the number of degrees of freedom (df) of the n deviations $x_i - \bar{x}$, because they must satisfy a constraint (sum = 0), hence 1 degree of freedom is “lost.”

A natural setting for these formulas and concepts is geometric, specifically, the Pythagorean Theorem: $a^2 + b^2 = c^2$. See lecture notes appendix…
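The transcript does not reproduce the alternate variance formula. One widely taught shortcut, assumed here to be the kind of formula meant, computes the sum of squares as the sum of x_i² f_i minus n x̄², without forming the individual deviations. The R sketch below checks that it agrees with the definition on the exam-score data and illustrates the sum-to-zero constraint behind the n – 1 degrees of freedom; all variable names are illustrative.

```r
x <- c(60, 70, 80, 90)   # distinct data values x_i
f <- c(2, 8, 4, 6)       # frequencies f_i
n <- sum(f)
xbar <- sum(x * f) / n

# Sum of squares from the definition
ss_definition <- sum((x - xbar)^2 * f)

# Shortcut form (an assumption about which formula the slide means)
ss_shortcut <- sum(x^2 * f) - n * xbar^2

s2 <- ss_shortcut / (n - 1)   # df = n - 1

# The n deviations satisfy one constraint: they sum to zero
deviations <- rep(x - xbar, f)
print(c(ss_definition = ss_definition, ss_shortcut = ss_shortcut, s2 = s2))
print(sum(deviations))
```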