Transformations, Z-scores, and Sampling September 21, 2011.

Presentation transcript:


Changing the unit of measurement
Variables can be recorded in different units of measurement. Most often, one measurement unit is a linear transformation of another: x_new = a + bx.
Example: temperatures can be expressed in degrees Fahrenheit or degrees Celsius. Temperature_Fahrenheit = 32 + (9/5) × Temperature_Celsius, which has the form a + bx with a = 32 and b = 9/5.
Linear transformations do not change the basic shape of a distribution (skewness, symmetry, number of modes). But they do change the measures of center and spread:
– Multiplying each observation by a positive number b multiplies both measures of center (mean, median) and measures of spread (IQR, s) by b.
– Adding the same number a (positive or negative) to each observation adds a to measures of center and to quartiles, but does not change measures of spread (IQR, s).
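The effect of a linear transformation on center and spread can be checked numerically. The sketch below is only an illustration: the Celsius readings are made-up values and `linear_transform` is a hypothetical helper name, neither of which comes from the slides.

```python
import statistics

def linear_transform(values, a, b):
    """Apply x_new = a + b*x to every observation."""
    return [a + b * x for x in values]

# Hypothetical temperature readings in degrees Celsius (illustrative only).
celsius = [12.0, 15.5, 18.0, 21.0, 30.0]
fahrenheit = linear_transform(celsius, a=32, b=9 / 5)

mean_c, sd_c = statistics.mean(celsius), statistics.stdev(celsius)
mean_f, sd_f = statistics.mean(fahrenheit), statistics.stdev(fahrenheit)

# Center is multiplied by b and then shifted by a.
print(mean_f, 32 + 9 / 5 * mean_c)
# Spread is multiplied by b only; adding a does not change it.
print(sd_f, 9 / 5 * sd_c)
```

Each printed pair should agree up to floating-point rounding.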

Density curves
A density curve is a mathematical model of a distribution. The total area under the curve is, by definition, equal to 1, or 100%. The area under the curve for a range of values is the proportion of all observations that fall in that range.
[Figure: histogram of a sample with the smoothed density curve that describes the population theoretically.]

Density curves come in any imaginable shape. Some are well known mathematically and others aren’t.

Median and mean of a density curve
The median of a density curve is the equal-areas point: the point that divides the area under the curve in half. The mean of a density curve is the balance point, at which the curve would balance if it were made of solid material.
The median and mean are the same for a symmetric density curve. The mean of a skewed curve is pulled in the direction of the long tail.

Normal distributions
Normal (or Gaussian) distributions are a family of symmetrical, bell-shaped density curves defined by a mean µ (mu) and a standard deviation σ (sigma): N(µ, σ).
The Normal density curve is f(x) = (1 / (σ√(2π))) · e^(−(x − µ)² / (2σ²)), where e is the base of the natural logarithm and π is pi.

A family of density curves
In the first set of curves, means are different (µ = 10, 15, and 20) while standard deviations are the same (σ = 3).
In the second set, means are the same (µ = 15) while standard deviations are different (σ = 2, 4, and 6).

The 68-95-99.7% Rule for Normal Distributions
Example: mean µ = 64.5, standard deviation σ = 2.5, so N(µ, σ) = N(64.5, 2.5).
– About 68% of all observations are within 1 standard deviation (σ) of the mean (µ).
– About 95% of all observations are within 2σ of the mean µ.
– Almost all (99.7%) observations are within 3σ of the mean.
The inflection point of the curve marks 1σ from the mean.
Reminder: µ (mu) is the mean of the idealized curve, while x̄ (x bar) is the mean of a sample. σ (sigma) is the standard deviation of the idealized curve, while s is the standard deviation of a sample.
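The 68-95-99.7 percentages can be reproduced from the Normal curve itself. This is a minimal sketch using Python's standard-library `statistics.NormalDist` with the N(64.5, 2.5) example; `within` is a hypothetical helper name, and using software instead of a printed table is an assumption of this note, not something the slides do.

```python
from statistics import NormalDist

heights = NormalDist(mu=64.5, sigma=2.5)

def within(dist, k):
    """Proportion of a Normal distribution within k standard deviations of its mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower

for k in (1, 2, 3):
    print(k, round(within(heights, k), 4))
```

The same proportions come out for any choice of µ and σ, which is why the rule applies to every Normal distribution.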

The standard Normal distribution
Because all Normal distributions share the same properties, we can standardize our data to transform any Normal curve N(µ, σ) into the standard Normal curve N(0,1).
For each x we calculate a new value, z (called a z-score).
[Figure: N(64.5, 2.5) => N(0,1); the horizontal axis becomes standardized height (no units).]

Standardizing: calculating z-scores
A z-score measures the number of standard deviations that a data value x is from the mean µ: z = (x − µ) / σ.
When x is larger than the mean, z is positive. When x is smaller than the mean, z is negative.
When x is 1 standard deviation larger than the mean, then z = 1. When x is 2 standard deviations larger than the mean, then z = 2.

Ex. Women heights
Women's heights follow the N(64.5", 2.5") distribution. What percent of women are shorter than 67 inches tall (that's 5'7")?
mean µ = 64.5", standard deviation σ = 2.5", x (height) = 67"
We calculate z, the standardized value of x: z = (x − µ) / σ = (67 − 64.5) / 2.5 = 1, so x is 1 standard deviation above the mean.
Because of the 68-95-99.7 rule, we can conclude that the percent of women shorter than 67" should be approximately 0.68 + half of (1 − 0.68) = 0.84, or 84%.
[Figure: N(µ, σ) = N(64.5, 2.5) with µ = 64.5" at z = 0 and x = 67" at z = 1; the area to the left of x is the unknown.]

Using the standard Normal table (…)
Table A gives the area under the standard Normal curve to the left of any z value. Each table entry is the area under N(0,1) to the left of the corresponding z, for example z = -2.46.

Percent of women shorter than 67"
For z = 1.00, the area under the standard Normal curve to the left of z is 0.8413.
Conclusion: 84.13% of women are shorter than 67". By subtraction, 1 − 0.8413 = 0.1587, or 15.87% of women are taller than 67".
[Figure: N(µ, σ) = N(64.5", 2.5") with µ = 64.5" and x = 67" (z = 1); area ≈ 0.84 to the left and area ≈ 0.16 to the right.]
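The women's heights calculation can be repeated in software rather than with Table A. The sketch below assumes Python's standard-library `statistics.NormalDist` as a stand-in for the table; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma, x = 64.5, 2.5, 67.0

# Standardize: how many standard deviations is x from the mean?
z = (x - mu) / sigma

# Area under N(0,1) to the left of z, which is what Table A tabulates.
left = NormalDist().cdf(z)
right = 1 - left

print(z, round(left, 4), round(right, 4))
```

The left area is the proportion of women shorter than 67", and the right area is the proportion taller.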

Tips on using Table A
Because the Normal distribution is symmetrical, there are 2 ways that you can calculate the area under the standard Normal curve to the right of a z value:
– area right of z = 1 − area left of z
– area right of z = area left of −z

Tips on using Table A
To calculate the area between 2 z-values, first get the area under N(0,1) to the left for each z-value from Table A. Then subtract the smaller area from the larger area.
area between z1 and z2 = area left of z1 − area left of z2 (taking z1 as the larger z-value)
A common mistake made by students is to subtract the two z-values themselves. But the Normal curve is not uniform, so the area between two z-values is not proportional to the distance between them.
– The area under N(0,1) for a single value of z is zero. (Try calculating the area to the left of z minus that same area!)
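The Table A tips translate directly into a few small functions. This is a sketch under the assumption that `statistics.NormalDist().cdf` plays the role of the table; the helper names `area_left`, `area_right`, and `area_between` and the z-values used are illustrative, not from the slides.

```python
from statistics import NormalDist

std_normal = NormalDist()  # the standard Normal curve N(0, 1)

def area_left(z):
    """Area under N(0,1) to the left of z (the Table A entry)."""
    return std_normal.cdf(z)

def area_right(z):
    """Area to the right of z, by subtraction from the total area of 1."""
    return 1 - area_left(z)

def area_between(z_low, z_high):
    """Area between two z-values: subtract the smaller left area from the larger."""
    return area_left(z_high) - area_left(z_low)

# Two routes to the same right-tail area (symmetry of the curve).
print(round(area_right(1.0), 4), round(area_left(-1.0), 4))
# Area between z = -1 and z = 2.
print(round(area_between(-1.0, 2.0), 4))
# A single value of z has zero area.
print(area_between(1.0, 1.0))
```

Note that `area_between` subtracts areas, not z-values, which is exactly the mistake the slide warns about.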

The cool thing about working with normally distributed data is that we can manipulate it and then answer questions that involve comparing seemingly non-comparable distributions. We do this by "standardizing" the data. All this involves is changing the scale so that the mean is 0 and the standard deviation is 1, giving N(0,1). Doing this to different distributions makes them comparable.

Population versus sample
– Population: The entire group of individuals in which we are interested but can't usually assess directly. Example: all humans, all working-age people in California, all crickets.
– A parameter is a number describing a characteristic of the population.
– Sample: The part of the population we actually examine and for which we do have data. How well the sample represents the population depends on the sample design.
– A statistic is a number describing a characteristic of a sample.

Sampling methods
Convenience sampling: Just ask whoever is around.
– Example: "Man on the street" survey (cheap, convenient, often quite opinionated or emotional, and so now very popular with TV "journalism").
Which men, and on which street?
– Ask about gun control or legalizing marijuana "on the street" in Berkeley or in some small town in Idaho and you would probably get totally different answers.
– Even within an area, answers would probably differ if you did the survey outside a high school or a country western bar.
Bias: Opinions limited to individuals present.

Voluntary Response Sampling: Individuals choose to be involved. These samples are very susceptible to bias because different people are motivated to respond or not. Often called "public opinion polls," these are not considered valid or scientific.
Bias: Sample design systematically favors a particular outcome.
Example: Ann Landers summarizing responses of readers. 70% of the (10,000) parents who wrote in said that having kids was not worth it; if they had to do it over again, they wouldn't.
Bias: Most letters to newspapers are written by disgruntled people. A random sample showed that 91% of parents WOULD have kids again.

CNN on-line surveys: Bias: People have to care enough about an issue to bother replying. This sample is probably a combination of people who hate “wasting the taxpayers’ money” and “animal lovers.”

In contrast: Probability or random sampling: Individuals are randomly selected. No one group should be over-represented. Random samples rely on the objectivity of random numbers. There are tables and books of random digits available for random sampling. Statistical software can generate random numbers (e.g., Excel "=RAND()"). Sampling randomly removes the bias that comes from how individuals are chosen.

Simple random samples
A Simple Random Sample (SRS) is made of randomly selected individuals. Each individual in the population has the same probability of being in the sample. All possible samples of size n have the same chance of being drawn.
The simplest way to use chance to select a sample is to place names in a hat (the population) and draw out a handful (the sample).
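Drawing names from a hat has a direct software equivalent. The sketch below uses Python's standard-library `random.sample`, which draws without replacement; the population of 500 made-up names, the sample size of 20, and the seed are all assumptions for illustration.

```python
import random

# Hypothetical population: the names in the hat.
population = [f"person_{i}" for i in range(1, 501)]

# A seeded generator makes the draw repeatable; omit the seed for a fresh draw.
rng = random.Random(2011)

# Simple random sample of size n = 20, drawn without replacement.
sample = rng.sample(population, k=20)

print(sample[:5])
```

Because `random.sample` treats every subset of size k as equally likely, each individual has the same chance of being chosen.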

Stratified samples
There is a slightly more complex form of random sampling: a stratified random sample is essentially a series of SRSs performed on subgroups of a given population. The subgroups are chosen to contain all the individuals with a certain characteristic. For example:
– Divide the population of UCI students into males and females.
– Divide the population of California by major ethnic group.
– Divide the counties in America into urban and rural based on criteria of population density.
The SRSs taken within each group in a stratified random sample need not be of the same size. For example:
– A stratified random sample of 100 male and 150 female UCI students
– A stratified random sample of a total of 100 Californians, representing proportionately the major ethnic groups
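A stratified random sample is simply one SRS per subgroup, which can be sketched as follows. The function name `stratified_sample`, the stratum sizes (1,000 and 1,500 made-up student IDs), and the seed are hypothetical; only the sample sizes of 100 and 150 come from the example above.

```python
import random

def stratified_sample(strata, sizes, seed=None):
    """Draw an independent SRS of sizes[name] individuals from each stratum."""
    rng = random.Random(seed)
    return {
        name: rng.sample(members, sizes[name])
        for name, members in strata.items()
    }

# Hypothetical strata: made-up student IDs grouped by sex.
strata = {
    "male": [f"M{i}" for i in range(1, 1001)],
    "female": [f"F{i}" for i in range(1, 1501)],
}

# 100 males and 150 females, as in the UCI example.
sample = stratified_sample(strata, {"male": 100, "female": 150}, seed=2011)

print({name: len(chosen) for name, chosen in sample.items()})
```

The sample sizes per stratum are passed in separately, reflecting the point that the SRSs need not be the same size.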

What is a sampling distribution?
The sampling distribution of a statistic is the distribution of all possible values taken by the statistic when all possible samples of a fixed size n are taken from the population. It is a theoretical idea; we do not actually build it.
The sampling distribution of a statistic is the probability distribution of that statistic.

Sampling distribution of the sample mean
We take many random samples of a given size n from a population with mean µ and standard deviation σ. Some sample means will be above the population mean µ and some will be below, making up the sampling distribution.
[Figure: sampling distribution of x̄ ("x bar"), shown as a histogram of some sample averages.]

Sampling distribution of x bar
For any population with mean µ and standard deviation σ:
– The mean, or center, of the sampling distribution of x̄ is equal to the population mean µ.
– The standard deviation of the sampling distribution of x̄ is σ/√n, where n is the sample size.

Mean of a sampling distribution of x̄
There is no tendency for a sample mean to fall systematically above or below µ, even if the distribution of the raw data is skewed. Thus, the sample mean x̄ is an unbiased estimate of the population mean µ: it will be "correct on average" in many samples.
Standard deviation of a sampling distribution of x̄
The standard deviation of the sampling distribution measures how much the sample statistic varies from sample to sample. It is smaller than the standard deviation of the population by a factor of √n.
– Averages are less variable than individual observations.
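Although the sampling distribution is a theoretical idea, a simulation can approximate it. The sketch below assumes a skewed population (exponential with mean 1 and standard deviation 1), samples of size n = 25, and 4,000 repetitions; all of these choices, and the seed, are illustrative and not from the slides.

```python
import random
import statistics

rng = random.Random(2011)
n = 25               # size of each sample
num_samples = 4000   # number of repeated samples

# Skewed population: exponential with rate 1 has mean 1 and standard deviation 1.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)

# Theory: center = mu = 1, spread = sigma / sqrt(n) = 1 / 5 = 0.2.
print(round(mean_of_means, 3), round(sd_of_means, 3))
```

The simulated center should sit close to µ and the simulated spread close to σ/√n, even though the population itself is strongly skewed.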

For normally distributed populations
When a variable in a population is normally distributed, the sampling distribution of x̄ for all possible samples of size n is also normally distributed. If the population is N(µ, σ), then the distribution of sample means is N(µ, σ/√n).
[Figure: population distribution and sampling distribution.]

The central limit theorem
Central Limit Theorem: When randomly sampling from any population with mean µ and standard deviation σ, when n is large enough, the sampling distribution of x̄ is approximately normal: x̄ ~ N(µ, σ/√n).
[Figure: population with strongly skewed distribution, followed by the sampling distribution of x̄ for n = 2, n = 10, and n = 25 observations.]
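One way to see the central limit theorem at work is to track how lopsided the distribution of sample means is as n grows. The sketch below is an illustration under stated assumptions: the skewed population is exponential, skewness is measured with a simple hand-written `skewness` helper (0 for a symmetric distribution), and the sample sizes 2, 10, and 25 mirror the figure. None of the code or the seed comes from the slides.

```python
import random
import statistics

def skewness(values):
    """Average cubed z-score: 0 for symmetric data, positive for a long right tail."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])

def simulate_sample_means(n, num_samples, rng):
    """Means of num_samples random samples of size n from a skewed population."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(num_samples)
    ]

rng = random.Random(2011)
skew_by_n = {
    n: skewness(simulate_sample_means(n, 10000, rng))
    for n in (2, 10, 25)
}

for n, value in skew_by_n.items():
    print(n, round(value, 2))
```

The skewness should shrink toward 0 as n increases, meaning the sampling distribution of the mean looks more and more like a symmetric Normal curve.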

The National Collegiate Athletic Association (NCAA) requires Division I athletes to score at least 820 on the combined math and verbal SAT exam to compete in their first college year. The SAT scores of 2003 were approximately normal with mean 1026 and standard deviation 209. What proportion of all students would be NCAA qualifiers (SAT ≥ 820)?
z = (820 − 1026) / 209 ≈ −0.99, and Table A gives an area of 0.1611 to the left of z = −0.99.
area right of 820 = total area − area left of 820 = 1 − 0.1611 = 0.8389 ≈ 84%
Note: The actual data may contain students who scored exactly 820 on the SAT. However, the proportion of scores exactly equal to 820 is 0 for a normal distribution; this is a consequence of the idealized smoothing of density curves.

The NCAA defines a "partial qualifier," eligible to practice and receive an athletic scholarship but not to compete, as a student with a combined SAT score of at least 720. What proportion of all students who take the SAT would be partial qualifiers? That is, what proportion have scores between 720 and 820?
z = (720 − 1026) / 209 ≈ −1.46, and Table A gives an area of 0.0721 to the left of z = −1.46. The area left of 820 (z ≈ −0.99) is 0.1611.
area between 720 and 820 = area left of 820 − area left of 720 = 0.1611 − 0.0721 = 0.0890 ≈ 9%
About 9% of all students who take the SAT have scores between 720 and 820.
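Both NCAA proportions can be checked without rounding z to two decimals. The sketch assumes Python's `statistics.NormalDist` in place of Table A, so the fourth decimal may differ slightly from table-based values; the variable names are illustrative.

```python
from statistics import NormalDist

# SAT scores of 2003: approximately N(1026, 209).
sat = NormalDist(mu=1026, sigma=209)

# Qualifiers: area to the right of 820.
qualifiers = 1 - sat.cdf(820)

# Partial qualifiers: area between 720 and 820.
partial = sat.cdf(820) - sat.cdf(720)

print(round(qualifiers, 2), round(partial, 2))
```

Rounded to whole percents, these agree with the roughly 84% and 9% found from the table.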

IQ scores: population vs. sample
In a large population of adults, the mean IQ is 112 with standard deviation 20. Suppose 200 adults are randomly selected for a market research campaign. The distribution of the sample mean IQ is:
A) Exactly normal, mean 112, standard deviation 20
B) Approximately normal, mean 112, standard deviation 20
C) Approximately normal, mean 112, standard deviation 1.414
D) Approximately normal, mean 112, standard deviation 0.1
Answer: C) Approximately normal, mean 112, standard deviation 1.414
Population distribution: N(µ = 112; σ = 20)
Sampling distribution for n = 200 is N(µ = 112; σ/√n = 20/√200 = 1.414)

Application
Hypokalemia is diagnosed when blood potassium levels are below 3.5 mEq/dl. Let's assume that we know a patient whose measured potassium levels vary daily according to a normal distribution N(µ = 3.8, σ = 0.2).
If only one measurement is made, what is the probability that this patient will be misdiagnosed with hypokalemia?
z = (x − µ) / σ = (3.5 − 3.8) / 0.2 = −1.5, P(z < −1.5) = 0.0668 ≈ 7%
Instead, if measurements are taken on 4 separate days and averaged, what is the probability of a misdiagnosis?
z = (x̄ − µ) / (σ/√n) = (3.5 − 3.8) / (0.2/√4) = −3, P(z < −3) = 0.0013 ≈ 0.1%
Note: Make sure to standardize (z) using the standard deviation for the sampling distribution.
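The two misdiagnosis probabilities differ only in which standard deviation is used to standardize. A minimal sketch, assuming `statistics.NormalDist` in place of Table A and assuming the 4 daily measurements are averaged; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, cutoff = 3.8, 0.2, 3.5

# One measurement: individual values follow N(mu, sigma).
p_single = NormalDist(mu, sigma).cdf(cutoff)

# Mean of n = 4 measurements: x-bar follows N(mu, sigma / sqrt(n)).
n = 4
p_mean_of_4 = NormalDist(mu, sigma / sqrt(n)).cdf(cutoff)

print(round(p_single, 4), round(p_mean_of_4, 4))
```

Averaging four measurements halves the standard deviation, which moves the cutoff from 1.5 to 3 standard deviations below the mean and makes a misdiagnosis far less likely.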

Practical note
Large samples are not always attainable.
– Sometimes the cost, difficulty, or preciousness of what is studied drastically limits any possible sample size.
– Blood samples/biopsies: No more than a handful of repetitions are acceptable. Oftentimes, we even make do with just one.
– Opinion polls have a limited sample size due to time and cost of operation. During election times, though, sample sizes are increased for better accuracy.
Not all variables are normally distributed.
– Income, for example, is typically strongly skewed.
– Is x̄ still a good estimator of µ then?