Describing Data: One Variable


Describing Data: One Variable. STAT 101, Dr. Kari Lock Morgan. Sections 2.1, 2.2, 2.3, 2.4: one categorical variable (2.1), one quantitative variable (2.2, 2.3, 2.4).

Announcements: Homework 1 is due now; turn it in according to lab section. Clicker grading starts today!

Why not always randomize? Randomized experiments are ideal, but sometimes they are not ethical or possible. Often, you have to do the best you can with data from observational studies. Example: research for the Supreme Court case on whether preferences for minorities in university admissions help or hurt minority students.

Randomization in Data Collection:
  Was the sample randomly selected? Yes: possible to generalize to the population. No: should not generalize to the population.
  Was the explanatory variable randomly assigned? Yes: possible to make conclusions about causality. No: cannot make conclusions about causality.

Two Fundamental Questions in Data Collection: (1) From population to sample: was it a random sample? (2) From sample to data: was it a randomized experiment?

Randomization: Doing a randomized experiment on a random sample is ideal, but rarely achievable. If the focus of the study is using a sample to estimate a quantity for the entire population, you need a random sample but do not need a randomized experiment (example: election polling). If the focus of the study is establishing causality from one variable to another, you need a randomized experiment and can settle for a non-random sample (example: drug testing).
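The two kinds of randomization can be sketched in R. This is a minimal illustration, not part of the original slides: the population of 1000 unit IDs, the sample size of 20, and the labels "drug" and "placebo" are all assumptions chosen for the example.

```r
set.seed(1)

# Hypothetical population of 1000 unit IDs (an assumption for illustration).
population <- 1:1000

# Random sample: each unit is equally likely to be selected,
# which is what supports generalizing to the population.
sampled_units <- sample(population, size = 20)

# Random assignment: shuffle the treatment labels among the sampled units,
# which is what supports conclusions about causality.
treatment <- sample(rep(c("drug", "placebo"), each = 10))

study <- data.frame(unit = sampled_units, treatment = treatment)
table(study$treatment)
```

Here `sample()` is used twice for two different purposes: once to choose who is in the study, and once to decide who receives which treatment.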

Review from Last Class: Association does not imply causation! In observational studies, confounding variables almost always exist, so causation cannot be established. Randomized experiments involve randomly determining the level of the explanatory variable. Randomized experiments prevent confounding variables, so causality can be inferred. A control or comparison group is necessary. The placebo effect exists, so a placebo and blinding should be used.

Descriptive Statistics, The Big Picture: [Diagram with the labels Population, Sampling, Sample, Statistical Inference, and Descriptive Statistics.]

Descriptive Statistics: In order to make sense of data, we need ways to summarize and visualize it. Summarizing and visualizing variables and relationships between two variables is often known as descriptive statistics (also known as exploratory data analysis). The type of summary statistics and visualization methods depends on the type of variable(s) being analyzed (categorical or quantitative).

One Categorical Variable: A random sample of US adults in 2012 was surveyed regarding the type of cell phone owned: Android, iPhone, Blackberry, non-smartphone, or no cell phone.

Cell Phones: Which type of cell phone do you own? Android, iPhone, Blackberry, non-smartphone, or no cell phone.

Frequency Table: A frequency table shows the number of cases that fall in each category. US data:
  Android           458
  iPhone            437
  Blackberry        141
  Non Smartphone    924
  No cell phone     293
  Total            2253
R: table(x)

The proportion in a category is found by dividing the number of cases in that category by the total sample size. Proportion for a sample: $\hat{p}$ ("p-hat"). Proportion for a population: $p$.

Proportion: What proportion of adults sampled do not own a cell phone? From the frequency table, 293 of the 2253 adults have no cell phone, so $\hat{p} = \frac{293}{2253} = 0.13$, or 13%. Proportions and percentages can be used interchangeably.

Relative Frequency Table: A relative frequency table shows the proportion of cases that fall in each category. All the numbers in a relative frequency table sum to 1.
  Android           0.203
  iPhone            0.194
  Blackberry        0.063
  Non Smartphone    0.410
  No cell phone     0.130
R: table(x)/length(x)
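The frequency table, the sample proportion, and the relative frequency table can be reproduced from the counts on the slides. This is a sketch: the slides do not show the raw data, so the vector `x` is rebuilt here from the published counts, and the names `counts`, `freq`, `rel_freq`, and `p_hat` are chosen for the example.

```r
# Counts taken from the frequency table in the slides.
counts <- c(Android = 458, iPhone = 437, Blackberry = 141,
            "Non Smartphone" = 924, "No cell phone" = 293)

# Rebuild a categorical variable with one entry per respondent.
x <- rep(names(counts), times = counts)
n <- length(x)                      # 2253 respondents in total

freq <- table(x)                    # frequency table
rel_freq <- table(x) / length(x)    # relative frequency table

# Sample proportion of adults with no cell phone: 293 / 2253.
p_hat <- as.numeric(freq["No cell phone"]) / n
round(p_hat, 2)                     # 0.13

round(rel_freq, 3)
sum(rel_freq)                       # relative frequencies sum to 1
```

Note that `table()` orders categories alphabetically, so the rows may appear in a different order than on the slide.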

Bar Chart/Plot/Graph: In a bar chart, the height of each bar is the number of cases falling in that category. R: barchart(x)

Pie Chart: In a pie chart, the relative area of each slice of the pie corresponds to the proportion in each category. R: pie(table(x))
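Both plots can be drawn from a table of counts. The sketch below rebuilds the cell phone variable from the counts in the slides and uses base R's `barplot()`, which is a substitution for the `barchart(x)` call shown on the slide; the axis label and the `las` setting are assumptions made for readability.

```r
# Counts taken from the frequency table in the slides.
counts <- c(Android = 458, iPhone = 437, Blackberry = 141,
            "Non Smartphone" = 924, "No cell phone" = 293)
x <- rep(names(counts), times = counts)

# Bar chart: bar height is the number of cases in each category.
# barplot() returns the x positions of the bar midpoints.
bar_midpoints <- barplot(table(x), ylab = "Number of cases", las = 2)

# Pie chart: slice area is proportional to the share in each category.
pie(table(x))
```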

StatKey www.lock5stat.com/statkey

Summary: One Categorical Variable. Summary statistics: proportion, frequency table, relative frequency table. Visualization: bar chart, pie chart.

One Quantitative Variable: World gross for all 2011 Hollywood movies (dataset: HollywoodMovies2011).

HollywoodMovies2011

Dotplot: In a dotplot, each case is represented by a dot, and dots are stacked. It is an easy way to see each case. Units: millions of dollars. The highest is Harry Potter and the Deathly Hallows Part 2, the second is Transformers, and the third is Pirates of the Caribbean: On Stranger Tides.

Histogram: The height of each bar corresponds to the number of cases within that range of the variable. R: hist(x)

Histogram vs Bar Chart: This is a (a) histogram, (b) bar chart, (c) other, (d) I have no idea.

Histogram vs Bar Chart: A bar chart is for categorical data, and the x-axis has no numeric scale. A histogram is for quantitative data, and the x-axis is numeric. For a categorical variable, the number of bars equals the number of categories, and the number in each category is fixed. For a quantitative variable, the number of bars in a histogram is up to you (or your software), and the appearance can differ with different numbers of bars.
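The effect of the number of bars can be seen by asking `hist()` for different numbers of breaks. This is a sketch on simulated data: the right-skewed values from `rexp()` are a hypothetical stand-in for a quantitative variable, not the movie data from the slides.

```r
set.seed(2)

# Hypothetical right-skewed data standing in for a quantitative variable.
x <- rexp(200, rate = 1 / 100)

# breaks is only a suggestion: R picks "pretty" cut points near that number.
h_few  <- hist(x, breaks = 5,  plot = FALSE)
h_many <- hist(x, breaks = 30, plot = FALSE)

length(h_few$counts)    # number of bars with few bins
length(h_many$counts)   # number of bars with many bins

# Either way, every case lands in exactly one bar.
sum(h_few$counts) == length(x)
sum(h_many$counts) == length(x)
```

Dropping `plot = FALSE` draws the two histograms, which makes the difference in appearance visible.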

Shape: [Example histograms: symmetric, right-skewed (long right tail), and left-skewed.]

Notation: The sample size, the number of cases in the sample, is denoted by $n$. We often let $x$ or $y$ stand for any variable, and $x_1, x_2, \ldots, x_n$ represent the $n$ values of the variable $x$. Example: $x_1 = 97.009$, $x_2 = 201.897$, $x_3 = 216.196$, …

Mean: The mean or average of the data values is $\text{mean} = \frac{\text{sum of all data values}}{\text{number of data values}}$, that is, $\text{mean} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x}{n}$. Sample mean: $\bar{x}$. Population mean: $\mu$ ("mu"). R: mean(x)

Median: The median, m, is the middle value when the data are ordered. If there is an even number of values, the median is the average of the two middle values. The median splits the data in half. R: median(x)
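A small worked example of the mean and the median in R. The data values are hypothetical and chosen so the arithmetic is easy to follow by hand.

```r
x <- c(3, 7, 8, 12, 50)     # hypothetical data, odd number of values

sum(x) / length(x)          # mean by definition: 80 / 5 = 16
mean(x)                     # 16
median(x)                   # 8, the middle of the ordered values

y <- c(3, 7, 8, 12)         # hypothetical data, even number of values
median(y)                   # average of the two middle values: (7 + 8) / 2 = 7.5
```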

Measures of Center m = 76.66 Mean is “pulled” in the direction of skewness  =150.74 World Gross (in millions)

Skewness and Center: A distribution is left-skewed. Which measure of center would you expect to be higher, (a) the mean or (b) the median? The median: the mean will be pulled down towards the skewness (towards the long tail).

Outlier: An outlier is an observed value that is notably distinct from the other values in a dataset.

Outliers: [Plot of world gross (in millions), with Transformers, Harry Potter, and Pirates of the Caribbean marked as outliers.]

Resistance: A statistic is resistant if it is relatively unaffected by extreme values. The median is resistant while the mean is not.
                          Mean            Median
  With Harry Potter       $150,742,300    $76,658,500
  Without Harry Potter    $141,889,900    $75,009,000
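Resistance can be checked by recomputing both statistics with and without one extreme value. The numbers below are hypothetical (they are not the movie data, which the slides do not list); the point is only how differently the mean and median react.

```r
# Hypothetical grosses in millions of dollars.
gross <- c(20, 45, 60, 75, 90, 110, 150)
gross_with_outlier <- c(gross, 1300)    # add one extreme value

mean(gross)                   # 550 / 7, about 78.57
median(gross)                 # 75

mean(gross_with_outlier)      # 1850 / 8 = 231.25
median(gross_with_outlier)    # (75 + 90) / 2 = 82.5
```

In this example one extra value roughly triples the mean, while the median moves only from 75 to 82.5.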

Outliers: When using statistics that are not resistant to outliers, stop and think about whether the outlier is a mistake. If not, you have to decide whether the outlier is part of your population of interest or not. Usually, for outliers that are not a mistake, it's best to run the analysis twice, once with the outlier(s) and once without, to see how much the outlier(s) are affecting the results.

Standard Deviation: The standard deviation for a quantitative variable measures the spread of the data: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$. Sample standard deviation: $s$. Population standard deviation: $\sigma$ ("sigma"). R: sd(x)

Standard Deviation: The standard deviation gives a rough estimate of the typical distance of a data value from the mean. The larger the standard deviation, the more variability there is in the data and the more spread out the data are.
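The sample standard deviation can be computed step by step and compared with `sd(x)`. The data values are hypothetical, and the names `x_bar` and `s_manual` are chosen for this sketch.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)    # hypothetical data
n <- length(x)
x_bar <- mean(x)                  # 40 / 8 = 5

# Squared deviations from the mean sum to 32; divide by n - 1, then take the root.
s_manual <- sqrt(sum((x - x_bar)^2) / (n - 1))   # sqrt(32 / 7)

s_builtin <- sd(x)
all.equal(s_manual, s_builtin)    # TRUE: sd() uses the same n - 1 formula
```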

Standard Deviation: Both of these distributions are bell-shaped.

95% Rule: If a distribution of data is approximately symmetric and bell-shaped, about 95% of the data should fall within two standard deviations of the mean. For a population, about 95% of the data will be between $\mu - 2\sigma$ and $\mu + 2\sigma$.
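A quick simulation can illustrate the 95% rule. This is a sketch under assumptions: the data are simulated from a normal distribution with an arbitrarily chosen mean of 50 and standard deviation of 10, so the result shows the rule for bell-shaped data rather than for any dataset in the slides.

```r
set.seed(3)

# Simulated bell-shaped data; mean = 50 and sd = 10 are arbitrary choices.
x <- rnorm(10000, mean = 50, sd = 10)

lower <- mean(x) - 2 * sd(x)
upper <- mean(x) + 2 * sd(x)

# Proportion of values within two standard deviations of the mean.
prop_within <- mean(x >= lower & x <= upper)
prop_within    # close to 0.95
```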

The 95% Rule

The 95% Rule: The normal distribution app on StatKey is a good way to demonstrate the 95% rule. Ask students to pick any mean and standard deviation, and then have them guess the bounds for the middle 95% (which can be found by clicking on "two tail").

The 95% Rule: The standard deviation for hours of sleep per night is closest to (a) ½, (b) 1, (c) 2, (d) 4, (e) I have no idea.

The z-score for a data value, $x$, is $z = \frac{x - \bar{x}}{s}$. For a population, $\bar{x}$ is replaced with $\mu$ and $s$ is replaced with $\sigma$. Values farther from 0 are more extreme.

z-score: A z-score puts values on a common scale. A z-score is the number of standard deviations a value falls from the mean. 95% of all z-scores fall between what two values? Between -2 and 2, so z-scores beyond -2 or 2 can be considered extreme.

z-score: Which is better, an ACT score of 28 or a combined SAT score of 2100? ACT: $\mu = 21$, $\sigma = 5$. SAT: $\mu = 1500$, $\sigma = 325$. Assume ACT and SAT scores have approximately bell-shaped distributions. (a) ACT score of 28, (b) SAT score of 2100, (c) I don't know.
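The comparison can be worked through by computing both z-scores from the means and standard deviations given on the slide. The variable names are chosen for this sketch.

```r
# ACT: mean 21, standard deviation 5.
z_act <- (28 - 21) / 5          # 1.4

# SAT: mean 1500, standard deviation 325.
z_sat <- (2100 - 1500) / 325    # about 1.85

z_sat > z_act                   # TRUE
```

The SAT score of 2100 lies more standard deviations above its mean, so on this common scale it is the better score.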

Other Measures of Location: Maximum = largest data value. Minimum = smallest data value. Quartiles: Q1 = median of the values below m; Q3 = median of the values above m.

Five Number Summary Five Number Summary: Min Max Q1 Q3 m 25% R: summary(x)

Five Number Summary:
  > summary(study_hours)
     Min. 1st Qu.  Median 3rd Qu.    Max.
     2.00   10.00   15.00   20.00   69.00
The distribution of the number of hours spent studying each week is (a) symmetric, (b) right-skewed, (c) left-skewed, (d) impossible to tell.

The Pth percentile is the value which is greater than P% of the data. We already used z-scores to determine whether an SAT score of 2100 or an ACT score of 28 is better. We could also have used percentiles: an ACT score of 28 is the 91st percentile, and an SAT score of 2100 is the 97th percentile.
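If the scores are modeled as normal (an assumption made for this sketch; the slides only say the distributions are approximately bell-shaped), `pnorm()` gives approximate percentiles from the means and standard deviations quoted earlier for the ACT (21 and 5) and SAT (1500 and 325).

```r
# Proportion of scores below each value under a normal model (an assumption).
p_act <- pnorm(28, mean = 21, sd = 5)         # about 0.919
p_sat <- pnorm(2100, mean = 1500, sd = 325)   # about 0.968

round(100 * c(ACT = p_act, SAT = p_sat), 1)
```

These model-based values are close to the 91st and 97th percentiles quoted on the slide; small differences are expected because real score distributions are not exactly normal.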

Five Number Summary Five Number Summary: Min Max Q1 Q3 m 25% 0th percentile 25th percentile 50th percentile 75th percentile 100th percentile

Measures of Spread: Range = Max – Min. Interquartile Range (IQR) = Q3 – Q1. Is the range resistant to outliers? No: the range depends entirely on the most extreme values. Is the IQR resistant to outliers? Yes: the IQR is based on the middle 50% of the data, which will not contain outliers.
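The difference in resistance can be seen by adding one extreme value to a small dataset and recomputing both measures. The data are hypothetical, and `IQR()` uses R's default quartile interpolation, which can differ slightly from the "median of each half" definition.

```r
x <- c(2, 10, 12, 15, 18, 20, 22)   # hypothetical data
x_out <- c(x, 200)                  # same data plus one extreme value

range_change <- diff(range(x_out)) - diff(range(x))   # range: 20 becomes 198
iqr_change   <- IQR(x_out) - IQR(x)                   # IQR moves only slightly

c(range_change = range_change, iqr_change = iqr_change)
```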

Comparing Statistics: Measures of center: mean (not resistant), median (resistant). Measures of spread: standard deviation (not resistant), IQR (resistant), range (not resistant). Most often, we use the mean and the standard deviation, because they are calculated based on all the data values, so they use all the available information.

Outliers: Outliers can be informally identified by looking at a plot, but one rule of thumb for identifying outliers is data values more than 1.5 IQRs beyond the quartiles. A data value is an outlier if it is smaller than Q1 – 1.5(IQR) or larger than Q3 + 1.5(IQR).
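The rule of thumb can be applied to the `summary(study_hours)` output shown earlier in the slides (Min 2, Q1 10, Q3 20, Max 69). The variable names are chosen for this sketch; the raw data are not available, so only the minimum and maximum are checked.

```r
q1 <- 10                  # 1st quartile from summary(study_hours)
q3 <- 20                  # 3rd quartile from summary(study_hours)
iqr <- q3 - q1            # 10

lower_fence <- q1 - 1.5 * iqr    # 10 - 15 = -5
upper_fence <- q3 + 1.5 * iqr    # 20 + 15 = 35

low_outlier  <- 2 < lower_fence    # FALSE: the minimum is not an outlier
high_outlier <- 69 > upper_fence   # TRUE: the maximum is an outlier
```

So by this rule a student reporting 69 hours of studying per week is an outlier, while the student reporting 2 hours is not.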

Boxplot: [Boxplot with Q1, the median, Q3, and the outliers labeled.] Lines ("whiskers") extend from each quartile to the most extreme value that is not an outlier. R: boxplot(x)

Boxplot: Which boxplot, (a), (b), or (c), goes with the histogram of waiting times for the bus? The data do not show any low outliers.

Summary: One Quantitative Variable. Summary statistics: center (mean, median); spread (standard deviation, range, IQR); percentiles; five number summary. Visualization: dotplot, histogram, boxplot. Other concepts: shape (symmetric, skewed, bell-shaped); outliers and resistance; z-scores.

To Do: Read Sections 2.1, 2.2, 2.3, 2.4. Do HW 2 (due Wednesday, 1/29).