1.1 - Populations, Samples and Processes 1.2 - Pictorial and Tabular Methods in Descriptive Statistics 1.3 - Measures of Location 1.4 - Measures of Variability.

Presentation transcript:

Chapter 1 Overview and Descriptive Statistics
1.1 - Populations, Samples and Processes
1.2 - Pictorial and Tabular Methods in Descriptive Statistics
1.3 - Measures of Location
1.4 - Measures of Variability

Example: Suppose the random variable is X = Age (years) in a certain population of individuals, and we select the following random sample of n = 20 ages.

{18, 19, 19, 19, 20, 21, 21, 23, 24, 24, 26, 27, 31, 35, 35, 37, 38, 42, 46, 59}

From these values, we can construct a table which consists of the frequencies of each age-interval in the dataset, i.e., a frequency table.

Class Interval    Frequency
[10, 20)          4
[20, 30)          8
[30, 40)          5
[40, 50)          2
[50, 60)          1
Total             n = 20

In published journal articles, the original data are almost never shown, but displayed in tabular form as above. This summary is called “grouped data.”

“Endpoint convention”: Here, the left endpoint is included, but not the right. Note! Stay away from “10-20,” “20-30,” “30-40,” etc., which leave it unclear which class a boundary value such as 20 belongs to.

Frequency Histogram: Suggests population may be skewed to the right (i.e., positively skewed).
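The frequency table can be rebuilt directly from the raw ages. The following is a minimal sketch, not part of the original slides; the helper name `frequency_table` and its parameters are illustrative assumptions. Integer division by the class width implements the left-closed, right-open endpoint convention.

```python
from collections import Counter

ages = [18, 19, 19, 19, 20, 21, 21, 23, 24, 24,
        26, 27, 31, 35, 35, 37, 38, 42, 46, 59]

def frequency_table(values, start=10, width=10, n_classes=5):
    """Count values in left-closed, right-open classes [a, b).

    Hypothetical helper: class k covers
    [start + k*width, start + (k+1)*width).
    """
    # Floor division maps 20 to class 1 ([20, 30)), never to class 0.
    counts = Counter((v - start) // width for v in values)
    return [((start + k * width, start + (k + 1) * width), counts.get(k, 0))
            for k in range(n_classes)]

table = frequency_table(ages)
for (lo, hi), f in table:
    print(f"[{lo}, {hi})  {f}")
print("Total  n =", sum(f for _, f in table))
```

Because an age of exactly 20 is assigned to [20, 30), every value lands in exactly one class, which is the point of the endpoint convention.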

Often though, it is preferable to work with proportions, i.e., relative frequencies… Divide frequencies by n = 20.

Class Interval    Frequency    Relative Frequency
[10, 20)          4            4/20 = 0.20
[20, 30)          8            8/20 = 0.40
[30, 40)          5            5/20 = 0.25
[40, 50)          2            2/20 = 0.10
[50, 60)          1            1/20 = 0.05
Total             n = 20       20/20 = 1.00

Relative frequencies are always between 0 and 1, and sum to 1.

[Figure: Relative Frequency Histogram]

Accumulating the relative frequencies from left to right, starting from (0.00), gives the cumulative relative frequencies.

Class Interval    Relative Frequency    Cumulative
                                        (0.00)
[10, 20)          0.20                  0.20
[20, 30)          0.40                  0.60
[30, 40)          0.25                  0.85
[40, 50)          0.10                  0.95
[50, 60)          0.05                  1.00

“0.00 of the sample is under 10 yrs old”
“0.20 of the sample is under 20 yrs old”
“0.60 of the sample is under 30 yrs old”
“0.85 of the sample is under 40 yrs old”
“0.95 of the sample is under 50 yrs old”
“1.00 of the sample is under 60 yrs old”
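As a small illustration (a sketch added here, not taken from the slides), relative and cumulative relative frequencies can be computed from the class frequencies. Exact fractions are used so that the cumulative total is exactly 1 rather than a rounded float.

```python
from fractions import Fraction
from itertools import accumulate

classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
freqs = [4, 8, 5, 2, 1]
n = sum(freqs)  # 20

# Relative frequency = frequency / n, kept as an exact fraction.
rel = [Fraction(f, n) for f in freqs]

# Cumulative relative frequency = running total of the relative frequencies.
cum = list(accumulate(rel))

for (lo, hi), r, c in zip(classes, rel, cum):
    print(f"[{lo}, {hi})  rel = {float(r):.2f}  "
          f"cum = {float(c):.2f}  (proportion under {hi})")
```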

Cumulative relative frequencies always increase from 0 to 1.

Example: Approximately what proportion of the sample is under 34 years old?

Solution: [30, 34) covers 4/10 of the class [30, 40), so it contains about 4/10 of 0.25 = 0.1. [0, 30) contains 0.6. The sum is 0.6 + 0.1 = 0.7.

The answer is approximate rather than exact: the grouped table does not show where the values fall inside each class, so the calculation assumes they are spread evenly across the interval.
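The interpolation in the solution can be sketched as follows. This is an illustrative helper (the name `approx_proportion_below` is an assumption), and it relies on the same assumption as the hand calculation: values are spread evenly within each class. The raw sample from earlier in the example is included only to compare against the exact proportion.

```python
classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
rel_freqs = [0.20, 0.40, 0.25, 0.10, 0.05]

def approx_proportion_below(x, classes, rel_freqs):
    """Approximate proportion of the sample below x from grouped data."""
    total = 0.0
    for (lo, hi), rf in zip(classes, rel_freqs):
        if x >= hi:
            total += rf                         # whole class lies below x
        elif x > lo:
            total += rf * (x - lo) / (hi - lo)  # partial class, spread evenly
    return total

approx = approx_proportion_below(34, classes, rel_freqs)

# Exact proportion, available only when the raw data are at hand.
ages = [18, 19, 19, 19, 20, 21, 21, 23, 24, 24,
        26, 27, 31, 35, 35, 37, 38, 42, 46, 59]
exact = sum(a < 34 for a in ages) / len(ages)

print(f"approximate: {approx:.2f}, exact: {exact:.2f}")
```

For this sample the grouped approximation (about 0.70) and the exact proportion (13 of 20 values) differ slightly, because only one of the five values in [30, 40) actually falls below 34.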

But alas, there is a major problem….

Suppose that, for the purpose of the study, we are not primarily concerned with those 30 or older, and wish to “lump” them into a single class interval.

{18, 19, 19, 19, 20, 21, 21, 23, 24, 24, 26, 27, 31, 35, 35, 37, 38, 42, 46, 59}

Class Interval    Frequency    Relative Frequency
[10, 20)          4            4/20 = 0.20
[20, 30)          8            8/20 = 0.40
[30, 60)          8            8/20 = 0.40
Total             n = 20       20/20 = 1.00

What effect will this have on the histogram? The skew no longer appears. The histogram is distorted because of the presence of an outlier (59) in the data, creating the need for unequal class widths.

Outliers (A Pain in the Tuches)

What are they? Informally, an outlier is a sample data value that is either “much” smaller or larger than the other values.

How do they arise?
o experimental error
o measurement error
o recording error
o not an error; genuine

What can we do about them?
o double-check them if possible
o delete them?
o include them… somehow
o perform analysis both ways

IDEA: Instead of having height of each class rectangle = relative frequency, make area of each class rectangle = relative frequency. Since area = height × width, the height of each rectangle, called the “density,” is

“Density” = relative frequency / width

Class Interval    Relative Frequency    Width    Density (= height)
[10, 20)          0.20                  10       0.20/10 = 0.02
[20, 30)          0.40                  10       0.40/10 = 0.04
[30, 60)          0.40                  30       0.40/30 = 0.0133…
Total             20/20 = 1.00

Density Histogram: Total Area = 1! The outlier is included, and the overall skewed appearance is restored.

Exercise: What if the outlier was 99 instead of 59?
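A short sketch of the density calculation, added for illustration (the function name `densities` is an assumption, not from the slides). It divides each relative frequency by its class width and then checks that the rectangle areas add up to 1.

```python
classes = [(10, 20), (20, 30), (30, 60)]
rel_freqs = [0.20, 0.40, 0.40]

def densities(classes, rel_freqs):
    """Height of each rectangle so that its area equals the relative frequency."""
    return [rf / (hi - lo) for (lo, hi), rf in zip(classes, rel_freqs)]

heights = densities(classes, rel_freqs)

# Area of each rectangle = height * width; the areas should sum to 1.
total_area = sum(h * (hi - lo) for h, (lo, hi) in zip(heights, classes))

for (lo, hi), h in zip(classes, heights):
    print(f"[{lo}, {hi})  density = {h:.4f}")
print(f"total area = {total_area:.2f}")
```

The same function can be applied to other class boundaries, for example when trying the exercise with a wider last class.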


“Measures of Center”

Example: Sample exam scores = {70, 80, 80, 80, 80, 90, 90, 100, 100, 100}

         Data values x_i    Frequencies f_i
i = 1    70                 1
i = 2    80                 4
i = 3    90                 2
i = 4    100                3
Total                       n = 10

sample mode: most frequent value = 80
sample median: “middle” value = (80 + 90) / 2 = 85
sample mean: average value,

$\bar{x} = \frac{1}{n} \sum x_i f_i = \frac{1}{10} [(70)(1) + (80)(4) + (90)(2) + (100)(3)] = 87$

(Quartiles are found similarly: $Q_1 = 80$, $Q_2 = 85$, $Q_3 = 100$.)


“Notation, notation, notation.”

Data values x_i    Frequencies f_i    Relative Frequencies f(x_i) = f_i/n
70                 1                  1/10 = 0.1
80                 4                  4/10 = 0.4
90                 2                  2/10 = 0.2
100                3                  3/10 = 0.3
Total              n = 10             10/10 = 1.0

Moving the factor 1/10 inside the sum turns each frequency into a relative frequency:

$\bar{x} = \frac{1}{n} \sum x_i f_i = \frac{1}{10} [(70)(1) + (80)(4) + (90)(2) + (100)(3)] = 87$

$\bar{x} = \sum x_i f(x_i) = (70)(0.1) + (80)(0.4) + (90)(0.2) + (100)(0.3) = 87$

In this form the sample mean is a “weighted” sample mean, with the relative frequencies as weights.
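The three measures of center, and the two equivalent ways of writing the mean, can be checked with Python's standard `statistics` module. This is a minimal sketch for illustration; the variable names are assumptions.

```python
from collections import Counter
from statistics import mean, median, mode

scores = [70, 80, 80, 80, 80, 90, 90, 100, 100, 100]
n = len(scores)
freq = Counter(scores)  # data values x_i mapped to frequencies f_i

sample_mode = mode(scores)      # most frequent value
sample_median = median(scores)  # average of the two middle values
sample_mean = mean(scores)      # ordinary average

# Same mean from the frequency table: (1/n) * sum of x_i * f_i
mean_from_freqs = sum(x * f for x, f in freq.items()) / n

# Same mean as a weighted sum with relative frequencies f(x_i) = f_i / n
mean_weighted = sum(x * (f / n) for x, f in freq.items())

print(sample_mode, sample_median, sample_mean)
print(mean_from_freqs, round(mean_weighted, 10))
```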

“Measures of Spread”

Example: Sample exam scores = {70, 80, 80, 80, 80, 90, 90, 100, 100, 100}

… but how do we measure the “spread” of a set of values?

First attempt: sample range = $x_n - x_1$ = 100 - 70 = 30 (largest value minus smallest value). Simple, but… it ignores all of the data except the extreme points, thus far too sensitive to outliers to be of any practical value. Example: Company employee salaries, including CEO.

Can modify with… sample interquartile range (IQR) = $Q_3 - Q_1$ = 100 - 80 = 20.

We would still prefer a measure that uses all of the data.
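A brief sketch of the range and IQR for the exam scores. Quartile conventions vary between textbooks and software; the one assumed here (quartiles as medians of the lower and upper halves of the sorted data) is chosen because it reproduces the values 80 and 100 used above, and it is not necessarily the convention the course prescribes.

```python
from statistics import median

scores = sorted([70, 80, 80, 80, 80, 90, 90, 100, 100, 100])

# Range uses only the two extreme values.
sample_range = scores[-1] - scores[0]

# Assumed convention: Q1 and Q3 are the medians of the lower and upper halves.
half = len(scores) // 2
q1 = median(scores[:half])
q3 = median(scores[-half:])
iqr = q3 - q1

print(f"range = {sample_range}, Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```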

… but how do we measure the “spread” of a set of values?

Better attempt: Calculate the average of the “deviations from the mean.”

Data values x_i    Frequencies f_i    Deviations from mean x_i - x̄
70                 1                  70 - 87 = -17
80                 4                  80 - 87 = -7
90                 2                  90 - 87 = +3
100                3                  100 - 87 = +13
Total              n = 10

$\frac{1}{n} \sum (x_i - \bar{x}) f_i = \frac{1}{10} [(-17)(1) + (-7)(4) + (3)(2) + (13)(3)] = 0$ ????????

This is not a coincidence: the deviations always sum to 0*, so their average is not a good measure of variability.

* Physically, the sample mean is a “balance point” for the data.
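Why the deviations always sum to zero can be seen in one line of algebra. The derivation below is added for illustration and uses only the definition of the sample mean, $\bar{x} = \frac{1}{n}\sum_i x_i f_i$, together with $\sum_i f_i = n$.

```latex
\begin{align*}
\sum_i (x_i - \bar{x})\, f_i
  &= \sum_i x_i f_i \;-\; \bar{x} \sum_i f_i \\
  &= n\bar{x} \;-\; \bar{x}\, n \\
  &= 0 .
\end{align*}
```

For the exam scores this reads 870 - (87)(10) = 0, matching the calculation above.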

Calculate the sample variance $s^2$, a modified average of the “squared deviations from the mean” (modified because the divisor is n - 1 rather than n):

$s^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2 f_i = \frac{1}{9} [(-17)^2 (1) + (-7)^2 (4) + (3)^2 (2) + (13)^2 (3)] = \frac{1010}{9} \approx 112.2$

sample standard deviation: $s = \sqrt{s^2} \approx 10.6$

Interpretation: $\bar{x}$ = “typical” sample value; $s$ = “typical” distance from mean.
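The variance calculation can be verified numerically. The sketch below is illustrative (variable names are assumptions); it computes the sum of squared deviations from the frequency table and compares the result with `statistics.variance` and `statistics.stdev`, which use the same n - 1 divisor.

```python
import math
from statistics import stdev, variance

scores = [70, 80, 80, 80, 80, 90, 90, 100, 100, 100]
table = {70: 1, 80: 4, 90: 2, 100: 3}  # data values x_i -> frequencies f_i

n = sum(table.values())
xbar = sum(x * f for x, f in table.items()) / n

# Numerator of s^2: sum of squared deviations, weighted by frequency.
ss = sum(f * (x - xbar) ** 2 for x, f in table.items())

s2 = ss / (n - 1)  # sample variance, divisor n - 1
s = math.sqrt(s2)  # sample standard deviation

print(f"SS = {ss:.0f}, s^2 = {s2:.2f}, s = {s:.2f}")
print(f"statistics module: s^2 = {variance(scores):.2f}, s = {stdev(scores):.2f}")
```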

Grouped Data - revisited

Class Interval    Absolute Frequency
[10, 20)          4
[20, 30)          8
[30, 60)          8

Use the interval midpoints for the data values $x_i$. Compare this “grouped mean” with the actual mean.
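The comparison can be sketched as follows, assuming (as the slide suggests) that each class is represented by its midpoint. The raw ages are the sample from the earlier example; variable names are illustrative.

```python
classes = [(10, 20), (20, 30), (30, 60)]
freqs = [4, 8, 8]
n = sum(freqs)

# Represent every value in a class by the class midpoint.
midpoints = [(lo + hi) / 2 for lo, hi in classes]
grouped_mean = sum(m * f for m, f in zip(midpoints, freqs)) / n

# Actual mean from the raw data, for comparison.
ages = [18, 19, 19, 19, 20, 21, 21, 23, 24, 24,
        26, 27, 31, 35, 35, 37, 38, 42, 46, 59]
actual_mean = sum(ages) / len(ages)

print(f"midpoints    = {midpoints}")
print(f"grouped mean = {grouped_mean}")
print(f"actual mean  = {actual_mean}")
```

The two means differ because the midpoint 45 overstates most of the values in the wide class [30, 60).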

Grouped Data - revisited: median $Q_2$ = ?

Class Interval    Absolute Frequency    Relative Frequency    Density
[10, 20)          4                     0.20                  0.02
[20, 30)          8                     0.40                  0.04
[30, 60)          8                     0.40                  0.0133…

Step 1. Identify the interval & rectangle containing the median. Here it is [20, 30), where the cumulative relative frequency passes 0.5.
Step 2. Split the rectangle so that 0.5 of the total area lies on each side of the split.
Step 3. Observe that this rectangle can be split into 4 strips of 0.1 each.
Step 4. Thus, split the interval into 4 equal parts, each of width (30 - 20)/4 = 2.5. The area 0.2 below 20 plus three strips (0.3) reaches 0.5, so $Q_2$ = 20 + 3(2.5) = 27.5.

…OR…

Step 3. Set up a proportion and solve for Q: Label as shown, and use the formula.

Other percentiles are done similarly. Solve using cumul dist, w/o histogram. Solve for areas, given Q. See posted Lecture Notes!
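The splitting steps amount to linear interpolation inside the class that contains the target cumulative proportion. The sketch below is one way to write that down; the function name `grouped_quantile` is an assumption, and it is not necessarily the formula given in the posted lecture notes. It assumes values are spread evenly within each class.

```python
from statistics import median

classes = [(10, 20), (20, 30), (30, 60)]
freqs = [4, 8, 8]

def grouped_quantile(p, classes, freqs):
    """Approximate the p-th quantile (0 <= p <= 1) from grouped data."""
    n = sum(freqs)
    target = p * n  # number of observations that should lie below the quantile
    below = 0       # observations in the classes already passed
    for (lo, hi), f in zip(classes, freqs):
        if f > 0 and below + f >= target:
            # Move into the class by the fraction of its count still needed.
            return lo + (target - below) / f * (hi - lo)
        below += f
    raise ValueError("p must be between 0 and 1")

grouped_median = grouped_quantile(0.5, classes, freqs)

# Actual median from the raw data, for comparison.
ages = [18, 19, 19, 19, 20, 21, 21, 23, 24, 24,
        26, 27, 31, 35, 35, 37, 38, 42, 46, 59]
actual_median = median(ages)

print(f"grouped median = {grouped_median}, actual median = {actual_median}")
```

Calling the function with other values of p, such as 0.25 or 0.75, handles other percentiles in the same way.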

Comments

o $\bar{x}$ is an unbiased estimator of the population mean $\mu$, and $s^2$ is an unbiased estimator of the population variance $\sigma^2$. (Their “expected values” are $\mu$ and $\sigma^2$, respectively.)
o Beware of roundoff error!!! There is an alternate, more computationally stable formula for sample variance $s^2$.
o The numerator of $s^2$ is called a sum of squares (SS); the denominator “n - 1” is the number of degrees of freedom (df) of the n deviations $x_i - \bar{x}$, because they must satisfy a constraint (sum = 0), hence 1 degree of freedom is “lost.”
o A natural setting for these formulas and concepts is geometric, specifically, the Pythagorean Theorem: $a^2 + b^2 = c^2$. See lecture notes appendix…