GATHERING AND PRODUCING DATA.


How Data are Obtained
Census: Everyone is included.
Observational study: Observes individuals and measures variables but does not attempt to influence responses. Includes surveys and polls.
Experiment: Deliberately imposes some treatment on individuals in order to observe their responses. In medicine, this is called a clinical trial.

3 BIG ideas
Examine a part of the whole: take a sample from a population.
Randomization ensures that, on average, the sample is representative.
The size of the sample is what’s important, not the size of the population.

Big Idea #1: Examine Part of the Whole We are studying an entire population of individuals (or subjects), but looking at everyone is practically impossible. How many support the U.S. role in Iraq? What percent of the tomato shipment is bad? How many children are obese? What’s the price of gas at the pump across Minnesota? Settle for looking at a smaller group—a sample—selected from the population. Sampling is natural! Think about cooking. You taste (sample) a small part to get an idea about the dish as a whole.

Populations and parameters, samples and statistics (This stuff is important!) A parameter is a numerical quantity that describes a population. A statistic is a numerical quantity that describes the sample. We study a population by looking at a sample. We infer about a parameter by using statistics from the sample. Notation: use Greek letters for parameters and Latin letters for statistics

Example: Polling Minneapolis Star Tribune: “A Gallup Poll, conducted Aug. 16-18, 1999, asked, ‘Do you consider pro-wrestling to be a sport, or not?’ Of the people polled, 19% said, ‘Yes.’ (Results were based on telephone interviews with a randomly selected national sample of 1,028 adults, 18 years and older.)” What’s the population, parameter, sample, statistic?
Population: Americans, 18 years and older.
Sample: The 1,028 people who were polled.
Parameter: The proportion of American adults who believe pro-wrestling is a sport. (Called the population proportion.) p = ?
Statistic: The proportion of people in the sample who said they believe pro-wrestling is a sport. (Called the sample proportion.) p̂ = 0.19

Example: Surveying a lot shipment A carload of ball bearings has an average diameter of 2.502 centimeters. This is within the specifications for acceptance of the lot by the purchaser. An inspector happens to inspect 100 bearings from the lot and finds the average diameter of these to be 2.499 cm. This is within the specified limits, so the entire lot is accepted. What’s the population, parameter, sample, statistic?
Population: The carload of ball bearings.
Sample: The 100 ball bearings that were inspected.
Parameter: The average diameter of the ball bearings in the carload. µ = 2.502 cm (The population mean.)
Statistic: The average diameter of the 100 ball bearings in the sample. x̄ = 2.499 cm (The sample mean.)

Big Idea #2: Randomization Randomization makes sure that on average the sample looks like the rest of the population. Randomization makes it possible to use quantitative tools (probability) to draw inferences about the population when we see only a sample. Randomization protects against bias.

“Who will you vote for in 2008?” Some examples of biased samples:
100 people at the Mall of America
100 people in front of the Metrodome after a Twins game
100 friends, family and relatives
100 people who volunteered to answer a survey question on your web site
100 people who answered their phone during supper time
The first 100 people you see after you wake up in the morning

Bias – the bane of sampling Samples that systematically misrepresent individuals in the population are said to be biased. Bias is the systematic failure of a sample to represent its population. There is usually no way to fix a biased sample and no way to salvage useful information from it. The best way to avoid bias is to select individuals for the sample at random. The value of deliberately introducing randomness is one of the great insights of Statistics.

Simple Random Sample (SRS) Suppose we want to draw a sample of size n from some population For a simple random sample, every possible subset of size n has an equal chance to be selected and to become the sample. Such samples guarantee that each individual has an equal chance of being selected. Each combination of people also has an equal chance of being selected. The sampling frame is a list of the population from which the sample is drawn. From the sampling frame, we can choose a SRS using random numbers.
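A minimal Python sketch of choosing an SRS from a sampling frame using random numbers. The frame of 800 ID numbers and the sample size of 40 are made up for illustration. `random.sample` draws without replacement, so every subset of size n is equally likely.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Draw n distinct individuals from the frame; every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)


# Hypothetical sampling frame: ID numbers 1..800 (an assumption for this sketch).
frame = list(range(1, 801))
srs = simple_random_sample(frame, 40, seed=1)
print(sorted(srs))
```

Fixing the seed only makes the draw reproducible; leave it as `None` for a fresh sample each time.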

SRS and Sampling Variability Samples drawn at random generally differ from one another. These differences lead to different values for the variables we measure. Sample-to-sample differences are called sampling variability. This is different from bias! Example: Everyone pick 10 Skittles at random from “The Bowl” and count how many are red. The variability of the different sample counts is sampling variability. If half the class peeked and tried to get more reds, the differences would reflect bias.

Sources of sampling error In the context of using a sample to estimate a population parameter, sampling variability is sometimes called “sampling error.” Taking a SRS of 3 students to estimate the average height of all students will have a large sampling error, but it is not biased. Taking a sample of 300 basketball players to estimate the average height of all students will produce less variability but the sample is biased.
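The contrast between sampling variability and bias can be sketched with a small simulation. Everything here is an assumption for illustration: a made-up population of 5,000 student heights, with the tallest 1,000 standing in for "basketball players."

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population: heights (cm) of 5,000 students.
population = [rng.gauss(170, 10) for _ in range(5000)]
true_mean = statistics.mean(population)

# Stand-in for "basketball players": the tallest 1,000 students.
tall_group = sorted(population)[-1000:]


def repeated_estimates(draw, reps):
    """Repeat a sampling scheme and record the sample mean each time."""
    return [statistics.mean(draw()) for _ in range(reps)]


# SRS of 3 from everyone: unbiased, but estimates vary a lot.
srs_estimates = repeated_estimates(lambda: rng.sample(population, 3), reps=2000)
# 300 drawn only from the tall group: estimates vary little, but miss the target.
biased_estimates = repeated_estimates(lambda: rng.sample(tall_group, 300), reps=200)

print("true mean:", round(true_mean, 1))
print("SRS of 3: center", round(statistics.mean(srs_estimates), 1),
      "spread", round(statistics.stdev(srs_estimates), 1))
print("300 tall: center", round(statistics.mean(biased_estimates), 1),
      "spread", round(statistics.stdev(biased_estimates), 1))
```

The small SRS should center near the true mean with a wide spread, while the large biased sample should cluster tightly around the wrong value.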

More complex sampling designs Simple random sampling is not the only way to sample. More complicated designs may save time or money or help avoid sampling problems.
Stratified sampling
Cluster sampling
Systematic sampling
Multi-stage sampling
All statistical sampling designs have in common the idea that chance, rather than human choice, is used to select the sample.

Stratified sampling Suppose we want a sample of 240 Carleton students. We also want to ensure that each discipline is represented. The student body divides as:
Arts and Literature 20%
Humanities 15%
Social Sciences 30%
Mathematics and Natural Sciences 35%
For the sample, select:
240 x .20 = 48 Arts and Literature students
240 x .15 = 36 Humanities students
240 x .30 = 72 Social Sciences students
240 x .35 = 84 Mathematics and Natural Sciences students
Within each discipline, choose a SRS.
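The allocation above can be sketched in Python. The stratum head counts (a student body of 2,000) and the roster names are assumptions chosen only to match the stated percentages; the slide gives no actual enrollment figures.

```python
import random

rng = random.Random(2)

# Hypothetical stratum sizes matching the 20/15/30/35 percent split.
stratum_sizes = {
    "Arts and Literature": 400,
    "Humanities": 300,
    "Social Sciences": 600,
    "Mathematics and Natural Sciences": 700,
}
total = sum(stratum_sizes.values())
n = 240

stratified_sample = {}
for name, size in stratum_sizes.items():
    k = round(n * size / total)  # proportional allocation for this stratum
    roster = [f"{name} student {i}" for i in range(1, size + 1)]
    stratified_sample[name] = rng.sample(roster, k)  # SRS within the stratum

for name, chosen in stratified_sample.items():
    print(name, len(chosen))
```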

Stratified Sampling The population is divided into homogeneous groups, called strata, before the sample is selected. Then simple random sampling is used within each stratum before the results are combined.
Advantages: The sample will be representative for the strata. Reduces sampling variability.
Disadvantages: May be logistically difficult, if even possible, to implement. Must have information about the population.
Note: a stratified sample is not a SRS.

Cluster sampling Sometimes stratifying isn’t practical and simple random sampling is difficult. Splitting the population into clusters can make sampling more practical. Suppose you want to do a face-to-face survey of attitudes in Minnesota based on a sample of size 600. Choosing 600 people at random, finding their addresses, and meeting them in person is costly and time-consuming. Another idea: Choose some cities at random, then some streets at random, and then some blocks at random. Interview everyone on the selected blocks. The blocks are the clusters. If you know there are about 20 people per block, then choose a random sample of 30 blocks (30 x 20 = 600).
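A rough Python sketch of the block example, simplified to a single stage: blocks are drawn directly rather than by way of cities and streets. The frame of 1,000 blocks and the block sizes (15 to 25 residents) are invented for illustration.

```python
import random

rng = random.Random(3)

# Hypothetical frame: 1,000 blocks with roughly 20 residents each.
blocks = {
    block_id: [f"block {block_id}, resident {i}" for i in range(rng.randint(15, 25))]
    for block_id in range(1000)
}

# Chance selects the clusters; then everyone in a chosen cluster is interviewed.
chosen_blocks = rng.sample(sorted(blocks), 30)
interviewees = [person for block_id in chosen_blocks for person in blocks[block_id]]

print(len(chosen_blocks), "blocks,", len(interviewees), "people interviewed")
```

The number of people interviewed is only approximately 600, because block sizes vary.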

Cluster sampling in the news: The Lancet study on Iraq casualties In October 2006, The Lancet published “Iraq mortality after the 2003 invasion: a cross-sectional cluster sample survey.” The study was controversial because of its finding that hundreds of thousands of Iraqis (most likely about 650,000) had been killed since the U.S. invasion. Earlier reports, including those from the U.S. and British governments, had put the number at about 30,000. The study was based on cluster sampling, a common methodology in public health and human rights work. The clusters were groups of 40 houses in close proximity whose locations were chosen based on population demographics.

Cluster Sampling If each cluster fairly represents the population, cluster sampling will give an unbiased sample. Advantage Easier to implement depending on context Disadvantage Greater sampling variability, so less statistical accuracy

Multistage Sampling Most surveys conducted by the government or professional polling organizations use some combination of stratified and cluster sampling as well as simple random sampling. The Current Population Survey is how the government estimates the unemployment rate. Counties are divided into 2,007 Primary Sampling Units (PSUs). PSUs are divided into smaller census blocks, and the blocks are grouped into strata. Households in each block are grouped into clusters of about 4 households each. The final sample consists of these clusters, and interviewers go to all households in the chosen clusters.

Systematic Samples Sometimes we draw a sample by selecting individuals systematically. For example, you might survey every 10th person on an alphabetical list of students. To make it random, you must still start the systematic selection from a randomly selected individual. When there is no reason to believe that the order of the list could be associated in any way with the responses sought, systematic sampling can give a representative sample. Systematic sampling can be much less expensive than true random sampling.
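A short Python sketch of a systematic sample with a random start. The alphabetical list of 200 student names is hypothetical.

```python
import random


def systematic_sample(items, step, seed=None):
    """Take every step-th item, starting from a randomly chosen position."""
    rng = random.Random(seed)
    start = rng.randrange(step)  # random start among the first `step` items
    return items[start::step]


# Hypothetical alphabetical list of 200 students.
students = sorted(f"student_{i:03d}" for i in range(200))
every_tenth = systematic_sample(students, 10, seed=4)
print(every_tenth)
```

Only the starting point is random here, which is why the order of the list must be unrelated to the responses.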

Sampling Example Hospital administrators are concerned about the possibility of drug abuse among employees. They plan to pick a sample of 40 from 800 employees and administer a drug test. What’s the sampling strategy in each case?
Randomly select 10 doctors, 10 nurses, 10 office staff, and 10 support staff for the test.
Each employee has a 4-digit ID number. Randomly choose 40 numbers.
At the start of each shift, choose every 20th person who arrives for work.
There are 40 departments of 20 employees each. Randomly choose two departments (say radiology and ER) and test all the people who work in those departments.

Big Idea #3: Sample size is key, not population size How large a sample do we need for it to be reasonably representative of the population? In general, it’s the size of the sample, not the size of the population, that makes the difference in sampling. The fraction of the population that you’ve sampled doesn’t matter. It’s the sample size itself that’s important. Back to cooking: If the soup is mixed enough, a tablespoon will suffice, whether you’re “sampling” from a saucepan or from a barrel.

How big a sample? Most professional polls choose a sample size of about 1,000 people. These polls report a “margin of error” of about 3%. That means that with “high confidence” their estimates are within 3 percentage points of the true population parameter value. The margin of error for a sample of 1,000 people is the same for Minneapolis (pop. 400,000), Minnesota (pop. 5 million), and the U.S. (pop. 290 million). But the bad news is that if you want similar accuracy at Carleton, you need to poll over half the student body. Coming Attractions: Margin of Error ≈ 1/√n, and 1/√1000 ≈ 0.03. But you’ll have to wait until we get to Statistical Inference to learn why.
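As a preview, here is a Python sketch of why the population size barely matters. It uses the standard 95% margin of error for a sample proportion, with the worst case p = 0.5 and a finite population correction. These formulas are textbook results brought in for illustration; the slides do not derive them.

```python
import math


def margin_of_error(n, population=None, p=0.5, z=1.96):
    """Approximate 95% margin of error for a sample proportion from an SRS of size n."""
    se = math.sqrt(p * (1 - p) / n)
    if population is not None:
        # The finite population correction shrinks the error only when n is
        # a sizable fraction of the population.
        se *= math.sqrt((population - n) / (population - 1))
    return z * se


for place, size in [("Minneapolis", 400_000),
                    ("Minnesota", 5_000_000),
                    ("U.S.", 290_000_000)]:
    print(place, round(margin_of_error(1000, size), 4))
```

All three results should round to about 0.031, because 1,000 is a negligible fraction of each population.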

How to Sample Badly Advice columnist Ann Landers once asked parents, “If you had it to do over again, would you have children?” Do you think the responses were representative of public opinion? Nearly 10,000 people responded, and 70% answered “No”! A later survey, more carefully designed, showed that 90% of parents are happy with their decision to have children. In a voluntary response sample, a large group of individuals is invited to respond, and all who do respond are counted. But such samples are almost always biased toward those with strong opinions or those who are strongly motivated. Since the sample is not representative, the resulting voluntary response bias invalidates the survey.
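A small simulation shows how voluntary response can turn a 10% minority into a 70% majority. The population size and both response rates are invented for illustration. The assumption is only that unhappy parents are far more likely to write in.

```python
import random

rng = random.Random(5)

# Hypothetical population of 100,000 parents; 10% would answer "No".
N = 100_000
would_say_no = [i < N // 10 for i in range(N)]

# Assumed response rates: unhappy parents respond far more often.
RESPONSE_RATE = {True: 0.21, False: 0.01}

# Each parent decides independently whether to write in.
respondents = [answer for answer in would_say_no
               if rng.random() < RESPONSE_RATE[answer]]

share_no_population = sum(would_say_no) / N
share_no_respondents = sum(respondents) / len(respondents)

print("No in population: ", share_no_population)
print("No among responses:", round(share_no_respondents, 2))
```

With these made-up rates, roughly 70% of the responses come from the 10% who would answer "No."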

What Can Go Wrong?—or, How to Sample Badly In convenience sampling, we simply include the individuals who are convenient. But they may not be representative of the population. A psychology professor performs an experiment using his classroom. A company samples opinions by using its own customers. Sampling mice from a large cage to study how a drug affects physical activity: The lab assistant reaches into the cage to select the mice one at a time until 10 are chosen. But which mice will likely be chosen?

Other problems: under-coverage and non-response
Under-coverage: In some survey designs a portion of the population is not sampled or has a smaller representation in the sample than it has in the population. Using telephone directories for a phone survey: half the households in large cities are unlisted, and about 5% of households don’t have phones. Random digit dialing only partially addresses this problem. It misses students in dorms, inmates in prison, soldiers in the military, and homeless people. And it’s too expensive to call Hawaii or Alaska.
Non-response: No survey succeeds in getting responses from everyone. The problem is that those who don’t respond may differ from those who do. The Bureau of Labor Statistics gets a 6-7% non-response rate. But it’s common for opinion polls and market research studies to have a 75-80% non-response rate.

What Else Can Go Wrong? Response bias refers to anything in the survey design that influences the responses. In particular, the wording of a question can have a big impact on the responses.

Some classic statistical mistakes The Literary Digest Poll 1936 presidential election: Franklin Delano Roosevelt vs. Alf Landon. The Literary Digest had correctly called every presidential election since 1916. Sample size: 2.4 million! They predicted Roosevelt would lose, with only 43% of the vote. In fact it was a landslide for Roosevelt, at 62%.

Literary Digest poll
Context:
Midst of the Great Depression
9 million unemployed; real income down 1/3
Landon’s program: “Cut spending”
Roosevelt’s program: “Balance people’s budgets before the government’s budget”
How the polling was done:
Survey sent to 10 million people
2.4 million responded (that’s huge!)

A huge sample, but the Literary Digest poll was biased
Selection bias: The sampling frame was not representative of the electorate. It was based on magazine subscription lists, drivers’ registrations, country club memberships, and phone numbers (when telephones were a luxury). It was biased toward better-off groups (who were more Republican).
Voluntary response bias: The main issue was the economy. The anti-Roosevelt forces were angry, and had a higher response rate!

Year Sample size Winner Gallup prediction Election result Error
1936 ~50,000 Roosevelt 55.7% 62.5% -6.8%
1940 52.0% 55.0% -3.0%
1944 51.5% 53.8% -2.3%
1948 Truman 44.5% 49.5% -5.0%
1952 5,385 Eisenhower 51.0% 55.4% -4.4%
1956 8,144 59.5% 57.8% +1.7%
1960 8,015 Kennedy 50.1% +0.9%
1964 6,625 Johnson 64.0% 61.3% +2.7%
1968 4,414 Nixon 43.0% 43.5% -0.5%
1972 3,689 62.0% 61.8% +0.2%
1976 3,439 Carter 48.0% -2.1%
1980 3,500 Reagan 47.0% 50.8% -3.8%
1984 3,456 59.0% 59.2%
1988 4,089 Bush 56.0% 53.9% +2.1%
1992 2,019 Clinton 49% 43.3% +5.7%
1996 2,417 +1.9%
2000 3,129 47.9% +0.1%
2004 1,866 49.0% -2.0%

The Year the Polls Elected Dewey 1948 Election: Harry Truman versus Thomas Dewey. Every major poll (including Gallup) predicted Dewey would win by about 5 percentage points.

What went wrong? Pollsters chose their samples using quota sampling. Each interviewer was assigned a fixed quota of subjects in certain categories (race, sex, age). For instance, an interviewer in St. Louis was required to talk to 13 people:
6 living in the suburbs, 7 in the central city;
7 men and 6 women;
of the 7 men (similarly for the women): 3 under 40 years old, 4 over 40; 1 black, 6 white.
Within each category, interviewers were free to choose whom to interview. But this left room for human choice and inevitable bias. Republicans were easier to reach. They had telephones, permanent addresses, “nicer” neighborhoods. So interviewers ended up with too many Republicans. Quota sampling was abandoned for random sampling.

Do you believe the poll? What questions should you ask?
Who carried out the survey?
What is the population?
How was the sample selected?
How large was the sample?
What was the response rate?
How were subjects contacted?
When was the survey conducted?
What are the exact questions asked?

To summarize . . . We are often interested in a population and some parameter that describes the population. We select a sample from that population and use a statistic from the sample to estimate the unknown parameter. To obtain a good estimate, the sample must be as representative of the population as possible. And randomization, on average, ensures a representative sample. Possible sources of error are sampling variability and bias. To reduce sampling variability, take a bigger sample. To reduce bias, get a better sampling design. It’s the sample size, not the population size, that matters.