Topic 5: Exploring Quantitative data

Dot plot, mean, and standard deviation

Data matrix for emails
Rows 1, 2, 3, and 3921 of a data matrix are displayed below. It contains data collected on 3,921 emails that were received.
        Spam   Num_char   Line_breaks   Format   Number
1       no     21706      551           html     small
2              7011       183                    big
3       yes    631        28            text     none
...
3921           2225       65
Variable      Description
Spam          Specifies whether the email is spam
Num_char      Number of characters in the email
Line_breaks   Number of line breaks in the email
Format        Specifies whether the email was in html or text format
Number        Indicates whether the email contained no number, a small number (under 1,000,000), or a big number

Data matrix for emails: quantitative variables
In the data matrix for the 3,921 emails, Num_char and Line_breaks are the quantitative variables.

Sample of email data
Let’s consider a random sample of 50 emails from the data. Rows 1, 2, 3, and 50 of the data matrix for these 50 randomly selected emails are displayed below.
        Spam   Num_char   Line_breaks   Format   Number
1       no     2454       61            text     small
2              41623      1088          html
3              57         5
...
50             15829      242

Dot plot
A dot plot provides a case-by-case view of data for one quantitative variable.
        Num_char
1       2454
2       41623
3       57
...
50      15829

Dot plot and the mean The “placement” of data, as seen in a dot plot or some other representation, is called the distribution of the data. The mean (also called the average) is a common way to measure the center of the distribution. Mean for data below is 10.704

The mean
The sample mean, denoted by $\bar{x}$, can be calculated as
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
where $x_1, x_2, \ldots, x_n$ represent the n observed values.
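
As a rough illustration of the formula, the Python sketch below averages the four Num_char values that are visible in the sample rows (rows 1, 2, 3, and 50). This is not the mean of all 50 sampled emails, and the function name sample_mean is made up for this example.

```python
from statistics import fmean


def sample_mean(values):
    """Return the sample mean: the sum of the observations divided by n."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one observation")
    return sum(values) / n


# Num_char for the four rows visible in the sample (rows 1, 2, 3, and 50).
num_char = [2454, 41623, 57, 15829]

x_bar = sample_mean(num_char)
print(x_bar)  # 14990.75

# The standard library's fmean gives the same answer.
assert x_bar == fmean(num_char)
```

Note how the single long email (41,623 characters) pulls the mean well above three of the four values.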

Population mean and estimation The population mean is also computed the same way, but denoted by μ (the Greek letter mu). It is often not possible to compute μ because data on the entire population is not available. The sample mean is a sample statistic, and serves as a point estimate of the population mean. This estimate is probably not perfect, but if the sample is representative of the population, it is usually a good estimate.

Distributions with the same mean Each dot plot displays 124 observations and the distributions all have a mean of 6. What makes them different?

Distributions with the same mean Order these distributions from the least spread out to the most spread out. A. B. C.

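Standard Deviation
The standard deviation is the typical distance of an observation from the mean. The mean of the distribution is $\bar{x} = 6$ and the sample size is n = 124. The standard deviation is computed as follows:
$$s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}}$$
Here $\bar{x} = 6$ and n - 1 = 123.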

Standard deviation measures spread The standard deviations of the three distributions are given. A. Std. dev. = 1.361 B. Std. dev. = 2.550 C. Std. dev. = 1.482

The standard deviation The standard deviation of a sample is denoted by s and can be calculated using the formula given on the previous slide. The standard deviation of the population is computed in a similar way, except we divide by n instead of n-1. The standard deviation of the population is denoted by σ (the Greek letter sigma).
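
The difference between dividing by n - 1 and dividing by n can be sketched in a few lines of Python. The five data values and the function name std_dev are assumptions made for illustration; they are not taken from the slides.

```python
import math
import statistics


def std_dev(values, sample=True):
    """Standard deviation: divide by n - 1 for a sample, by n for a population."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    divisor = n - 1 if sample else n
    return math.sqrt(squared_deviations / divisor)


data = [4, 5, 6, 7, 8]  # hypothetical observations with mean 6

s = std_dev(data, sample=True)       # divide by n - 1 = 4
sigma = std_dev(data, sample=False)  # divide by n = 5
print(round(s, 3), round(sigma, 3))  # 1.581 1.414
```

Python's statistics.stdev and statistics.pstdev implement the same two conventions, so they can be used to check the hand-written version.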

Histograms and the shape of a distribution

Histogram
A histogram plots binned counts as bars.
Characters (in thousands)   Count
0-5                         19
5-10                        12
10-15                       6
15-20                       3
20-25
25-30                       5
30-35
35-40
40-45                       2

Histograms
A histogram is another way to display the distribution of a quantitative variable. It works better than a stem-and-leaf plot for larger data sets, but it does not retain the actual numerical values.
Basic Steps for Creating a Histogram
1. Divide the range of the data (smallest to largest) into classes of equal width. The classes should not overlap.
2. Count the number of observations that fall into each class. Recall that the counts are also called frequencies.
3. Draw a horizontal axis and mark off the classes along this axis. The vertical axis can be the count, the proportion, or the percentage.
4. Draw a rectangle (a vertical bar) above each class with the height equal to the count, the proportion, or the percentage.
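
The binning and counting steps can be sketched in Python as follows. The data values are hypothetical, the helper bin_counts is made up for this example, and the choice to put a value on a class boundary into the upper class (so 5.0 lands in 5-10) is an assumption; the slides do not say which convention they use.

```python
def bin_counts(values, start, width, n_bins):
    """Count how many values fall in each class [lo, lo + width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


# Hypothetical email lengths, in thousands of characters.
lengths = [0.1, 2.5, 3.0, 4.9, 5.0, 7.2, 11.8, 12.4, 16.0, 41.6]

counts = bin_counts(lengths, start=0, width=5, n_bins=9)

# Print a crude text histogram: one '#' per observation in the class.
for i, count in enumerate(counts):
    lo, hi = i * 5, (i + 1) * 5
    print(f"{lo:>2}-{hi:<2} {'#' * count} ({count})")
```

Dividing each count by len(lengths) would give proportions instead of counts for the vertical axis.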

Bin width: height of MAT 117 students
Bin width can alter the story we get from the histogram.
Panels: ½ in. bins, 1 in. bins, 6 in. bins, 33 in. bins

Shape of a Distribution: Modality Does the histogram have a single prominent peak (unimodal), several prominent peaks (bimodal/multimodal), or no apparent peaks (uniform)? Note: To determine modality, step back and imagine a smooth curve over the histogram. Picture the bars as wooden blocks and drop a limp spaghetti noodle over them; the shape the noodle takes can be viewed as the smooth curve.

Modality: height of MAT 117 students
Which bin width most accurately presents the modality?
Panels: ½ in. bins, 1 in. bins, 6 in. bins, 33 in. bins

Shape of a Distribution: Skewness Is the histogram right skewed, left skewed, or symmetric? Note: Histograms are said to be skewed to the side of the long tail.

Shape of a Distribution: Unusual Observations Are there any unusual observations or potential outliers?

Sample of email data How would you describe the shape of the distribution of the number of characters contained in the emails? Answer: unimodal and right skewed, with a potentially unusual observation at 40,000 characters.

Box plot and the five number summary

Percentiles, quartiles, and the median
The p-th percentile is a value such that p percent of observations fall at or below that value.
Data: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
0-th percentile: Minimum
25-th percentile: First quartile Q1
50-th percentile: Median, Second quartile Q2
75-th percentile: Third quartile Q3
100-th percentile: Maximum
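
For the eleven values above, the minimum, median, and maximum are unambiguous, but quartiles depend on the interpolation convention. The sketch below uses Python's statistics.quantiles with both of its built-in methods; which convention the slides intend is not stated, so treat the choice as an assumption.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

minimum, maximum = min(data), max(data)
median = statistics.median(data)

# Quartile conventions differ between textbooks and software packages.
q_exclusive = statistics.quantiles(data, n=4, method="exclusive")
q_inclusive = statistics.quantiles(data, n=4, method="inclusive")

print(minimum, median, maximum)  # 1 6 11
print(q_exclusive)               # [3.0, 6.0, 9.0]
print(q_inclusive)               # [3.5, 6.0, 8.5]
```

Both methods agree on the median (Q2 = 6) and differ only in where they place Q1 and Q3.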

Height of female MAT 117 students

Height of female MAT 117 students
Figure labels: Min., Q1, Median, Q3, Max.
We want to graphically represent these five numbers, called the five-number summary. This graph is called a box plot. As you can see, there is a bit more to it than just these five numbers.

Box plot: height of female MAT 117 students

Anatomy of the box plot
Figure labels: Median, Q1, Q3, Lower whisker, Upper whisker, Potential outliers

IQR, whiskers, and outliers
Between Q1 and Q3 is the middle 50% of the data. The range these data span is called the interquartile range (IQR).
IQR = Q3 - Q1
Whiskers of a box plot can extend up to 1.5 × IQR away from the quartiles:
Max upper whisker reach = Q3 + 1.5 × IQR
Max lower whisker reach = Q1 - 1.5 × IQR
A potential outlier is an observation beyond the maximum reach of the whiskers. It is an observation that appears to be extreme relative to the rest of the data.
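
The whisker-reach rule can be sketched in Python as shown below. The heights are hypothetical values chosen so that one observation is clearly extreme, the helper name box_plot_summary is made up, and the "inclusive" quartile method is just one of several conventions.

```python
import statistics


def box_plot_summary(values):
    """Return Q1, Q3, IQR, the maximum whisker reaches, and potential outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_reach = q1 - 1.5 * iqr
    upper_reach = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_reach or v > upper_reach]
    return q1, q3, iqr, lower_reach, upper_reach, outliers


# Hypothetical heights in inches; 80 is deliberately extreme.
heights = [60, 61, 62, 63, 64, 64, 65, 66, 67, 68, 80]

q1, q3, iqr, lower_reach, upper_reach, outliers = box_plot_summary(heights)
print(q1, q3, iqr)               # 62.5 66.5 4.0
print(lower_reach, upper_reach)  # 56.5 72.5
print(outliers)                  # [80]
```

In the drawn box plot the upper whisker would stop at 68, the largest observation inside the reach, and 80 would be plotted as a separate point.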

Outliers Why is it important to look for outliers? Identify extreme skew in the distribution. Identify data collection and entry errors. Provide insight into interesting features of the data.

Resistant statistics

Extreme Observations: 2006 US household income
                   Mean      Std. Dev.   Median    IQR
Actual Data        69,319    90,662      39,718    71,267
700K to 1,400K     72,760    122,603
3K to 1,400K       76,300    130,564     40,262    71,888
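
The same effect can be reproduced on a small scale. In the Python sketch below the incomes are hypothetical (they are not the 2006 household data), and the helper summarize is made up for illustration. Doubling the largest value moves the mean and standard deviation but leaves the median and IQR unchanged.

```python
import statistics


def summarize(values):
    """Return (mean, standard deviation, median, IQR) for a sample."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (statistics.fmean(values), statistics.stdev(values),
            statistics.median(values), q3 - q1)


# Hypothetical incomes in thousands of dollars.
incomes = [20, 30, 35, 40, 45, 50, 60, 75, 700]
modified = incomes[:-1] + [1400]  # same data, largest value doubled

original_summary = summarize(incomes)
modified_summary = summarize(modified)

for label, (mean, sd, median, iqr) in (("original", original_summary),
                                       ("modified", modified_summary)):
    print(f"{label}: mean={mean:.1f} sd={sd:.1f} median={median} IQR={iqr}")
```

Statistics such as the median and IQR, which barely react to extreme observations, are the ones called resistant.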

Quantitative data pairs: scatterplots

Scatterplot
A scatterplot provides a case-by-case view of data for two quantitative variables.
        Num_char   Line_breaks
1       2454       61
2       41623      1088
3       57         5
...
50      15829      242

Scatterplots: trends (panels: linear trend, nonlinear trend)

Scatterplots: trends, continued (panels: cluster trend, no apparent trend)

Categorical-quantitative data pairs: comparing groups

A categorical-quantitative data pair
Typically the categorical variable is the explanatory variable and the quantitative variable is the response variable.
Explanatory: categorical variable
Response: quantitative variable
We want to compare the quantitative variable (its mean, median, etc.) across the groups formed by the categorical variable.

Sample of email data
In the random sample of 50 emails, Num_char and Line_breaks are quantitative variables, while Spam, Format, and Number are categorical variables.

Number of characters and the format of emails
The table below shows the mean and standard deviation for the number of characters in emails formatted as text or html.
Number of Characters (in thousands)
        Mean      Standard Deviation
Text    2.308     3.626
HTML    14.862    13.711
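
A table like this can be produced by grouping the quantitative variable by the categorical one and summarizing each group. The Python sketch below uses six hypothetical emails, so its output will not match the table above; the variable names are made up for illustration.

```python
import statistics
from collections import defaultdict

# Hypothetical (format, characters in thousands) pairs.
emails = [("text", 0.6), ("text", 2.5), ("text", 3.8),
          ("html", 7.0), ("html", 15.8), ("html", 21.7)]

# Group the quantitative response by the categorical explanatory variable.
groups = defaultdict(list)
for fmt, num_char in emails:
    groups[fmt].append(num_char)

# Mean and sample standard deviation for each group.
summary = {fmt: (statistics.fmean(values), statistics.stdev(values))
           for fmt, values in groups.items()}

for fmt, (mean, sd) in summary.items():
    print(f"{fmt}: mean={mean:.3f} sd={sd:.3f}")
```

Comparing the per-group means and standard deviations side by side is the numerical counterpart of comparing box plots for the groups.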

Comparing box plots