Published by Charlene Clark. Modified over 9 years ago.
2
Lecture Unit 3 Sample Surveys Producing Valid Data “If you don’t believe in random sampling, the next time you have a blood test tell the doctor to take it all.”
3
The election of 1948: the predictions and the results (percent of the vote)

Candidate   Crossley   Gallup   Roper   Result
Truman      45         44       38      50
Dewey       50         50       53      45
4
Lecture Unit 3 Objectives
1. Given a survey sample, determine whether the sample is a simple random sample, a stratified sample, a cluster sample, or a systematic sample.
2. Choose a simple random sample, stratified random sample, cluster sample, and systematic random sample in a variety of situations.
3. Explain the effect of sample size when determining whether a sample is representative of the population.
5
Beyond the Data at Hand to the World at Large
• We have learned ways to display, describe, and summarize data, but have been limited to examining the particular batch of data we have.
• We’d like (and often need) to stretch beyond the data at hand to the world at large.
• Let’s investigate three major ideas that will allow us to make this stretch…
6
3 Key Ideas That Enable Us to Make the Stretch
7
Idea 1: Examine a Part of the Whole
• The first idea is to draw a sample.
–We’d like to know about an entire population of individuals, but examining all of them is usually impractical, if not impossible.
–We settle for examining a smaller group of individuals (a sample) selected from the population.
8
Examples of Samples
1. Think about sampling something you are cooking: you taste (examine) a small part of what you’re cooking to get an idea about the dish as a whole.
2. Opinion polls are examples of sample surveys, designed to ask questions of a small group of people in the hope of learning something about the entire population.
9
Sampling methods
Convenience sampling: Just ask whoever is around.
–Example: “Man on the street” survey (cheap, convenient, often quite opinionated or emotional, and now very popular with TV “journalism”)
• Which men, and on which street?
–Ask about gun control or legalizing marijuana “on the street” in Berkeley or in some small town in Idaho and you would probably get totally different answers.
–Even within an area, answers would probably differ if you did the survey outside a high school or a country western bar.
Bias: Opinions limited to individuals present.
10
Voluntary Response Sampling:
• Individuals choose to be involved. These samples are very susceptible to bias because different people are motivated to respond or not. Often called “public opinion polls.” These are not considered valid or scientific.
• Bias: Sample design systematically favors a particular outcome.
Ann Landers summarizing responses of readers: 70% of (10,000) parents wrote in to say that having kids was not worth it; if they had to do it over again, they wouldn’t.
Bias: Most letters to newspapers are written by disgruntled people. A random sample showed that 91% of parents WOULD have kids again.
11
CNN on-line surveys: Bias: People have to care enough about an issue to bother replying. This sample is probably a combination of people who hate “wasting the taxpayers money” and “animal lovers.”
12
Bias
• The Challenge
–Obtain a sample that is perfectly representative of the population.
–Avoid bias: over- or under-emphasizing some characteristic of the population that is pertinent to the study.
–Samples with bias are inherently flawed.
13
Landon (R) Beats Roosevelt (D)??
The Survey: In 1936, Literary Digest received 2.4 million mail-in ballots. It used names from the phone book.
The Results: Landon leads 57% to 43%.
The Problem: Only high income earners could afford a phone. This was a very biased survey. Literary Digest soon went out of business.
14
Example: hospital employee drug use
Administrators at a hospital are concerned about the possibility of drug abuse by some employees of the hospital. They decide to check on the extent of the problem by having a random sample of the employees undergo a drug test. The administrators randomly select a department (say, radiology) and test all the people who work in that department: doctors, nurses, technicians, clerks, custodians, etc.
Why might this result in a biased sample?
The department might not represent the full range of employee types, experiences, stress levels, or the hospital’s drug supply.
15
Example (cont.) Name the kind of bias that might be present if the administration decides that instead of subjecting people to random testing they’ll just… a. interview employees about possible drug abuse. Response bias: people will feel threatened, won’t answer truthfully b. ask people to volunteer to be tested. Voluntary response bias; only those who are “clean” would volunteer
16
Bias Bias is the bane of sampling—the one thing above all to avoid. There is usually no way to fix a biased sample and no way to salvage useful information from it. The best way to avoid bias is to select individuals for the sample at random. The value of deliberately introducing randomness is one of the great insights of Statistics – Idea 2
17
Idea 2: Randomize Randomization can protect you against factors that you know are in the data. –It can also help protect against factors you are not even aware of. Randomizing protects us from the influences of all the features of our population, even ones that we may not have thought about. –Randomizing makes sure that on the average the sample looks like the rest of the population
18
Idea 2: Randomize (cont.) Individuals are randomly selected. No one group should be over-represented. Sampling randomly gets rid of bias. Random samples rely on the absolute objectivity of random numbers. There are tables and books of random digits available for random sampling. Statistical software can generate random digits (e.g., Excel “=RAND()”, the Ran# button on a calculator).
19
Idea 2: Randomize (cont.)
• Not only does randomizing protect us from bias, it actually makes it possible for us to draw inferences about the population when we see only a sample.
20
Hospital example (cont.)
• Listed in the table are the names of the 20 pharmacists on the hospital staff. Use the random numbers listed below to select three of them to be in the sample.
• 04905 83852 29350 91397 19994 65142 05087 11232
21
01 NCSU     07 Maryl.   13 OSU    19 Mich
02 UNC      08 Clem     14 ILL    20 PennS
03 Duke     09 UVA      15 IN     21 NorthW
04 Wake F   10 VaTech   16 PUR    22 MN
05 BC       11 GaTech   17 IOWA   23 WISC
06 UM       12 FSU      18 MSU
96927 19931 36089 74192 77567 88741 48409 41903
The first 3 schools in a random sample selected from the ACC and Big Ten using the above random numbers are:
1. UVA, UM, UNC
2. UVA, NCSU, Duke
3. UVA, UM, UVA
4. Clem, Mich, Duke
5. Mich, OSU, Maryl
22
Idea 3: It’s the Sample Size!! How large a random sample do we need for the sample to be reasonably representative of the population? It’s the size of the sample, not the size of the population, that makes the difference in sampling. –Exception: If the population is small enough and the sample is more than 10% of the whole population, the population size can matter. The fraction of the population that you’ve sampled doesn’t matter. It’s the sample size itself that’s important.
23
Example i) In the city of Chicago, Illinois, 1,000 likely voters are randomly selected and asked who they are going to vote for in the Chicago mayoral race. ii) In the state of Illinois, 1,000 likely voters are randomly selected and asked who they are going to vote for in the Illinois governor's race. iii) In the United States, 1,000 likely voters are randomly selected and asked who they are going to vote for in the presidential election. Which survey has more accuracy? All the surveys have the same accuracy
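The claim that all three surveys are equally accurate can be checked with the usual standard error of a sample proportion, including the finite population correction. This is a minimal sketch: the population sizes below are hypothetical round numbers chosen only for illustration, and `standard_error` is a name introduced here, not something from the lecture.

```python
import math

def standard_error(p, n, population_size):
    """Standard error of a sample proportion from an SRS of size n,
    including the finite population correction (FPC)."""
    fpc = math.sqrt((population_size - n) / (population_size - 1))
    return math.sqrt(p * (1 - p) / n) * fpc

# Hypothetical round population sizes, for illustration only.
populations = {"city": 1_500_000, "state": 8_000_000, "country": 150_000_000}

errors = {name: standard_error(0.5, 1000, size) for name, size in populations.items()}
for name, se in errors.items():
    print(f"{name:8s} n=1000  SE = {se:.5f}")

# The exception: when the sample is a large fraction of a small population,
# the population size does matter.
print(f"small    n=1000  SE = {standard_error(0.5, 1000, 2000):.5f}")
```

With n fixed at 1,000 the three standard errors agree to several decimal places, while the last line shows the exception for a sample that is more than 10% of a small population.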
24
Idea 3: It’s the Sample Size!!
• Chicken soup
• Blood samples
25
How Big a Sample Do You Need?
To find the proportion of a certain category: Several hundred are needed. A smaller sample size will result in such low precision that the results will have little use.
To find the average value: The sample size needed to get useful results will vary.
What do pollsters do? Generate random phone numbers (cell and landline). The sample is representative even though only 38% answer.
26
Does a Census Make Sense? Why bother worrying about the sample size? Wouldn’t it be better to just include everyone and “sample” the entire population? –Such a special sample is called a census.
27
Does a Census Make Sense? (cont.) There are problems with taking a census: –Practicality: It can be difficult to complete a census— there always seem to be some individuals who are hard to locate or hard to measure. –Timeliness: populations rarely stand still. Even if you could take a census, the population changes while you work, so it’s never possible to get a perfect measure. –Expense: taking a census may be more complex than sampling. –Accuracy: a census may not be as accurate as a good sample due to data entry error, inaccurate (made-up?) data, tedium.
28
Population versus sample
Population: The entire group of individuals in which we are interested but can’t usually assess directly. Example: All humans, all working-age people in California, all crickets. A parameter is a number describing a characteristic of the population.
Sample: The part of the population we actually examine and for which we do have data. How well the sample represents the population depends on the sample design. A statistic is a number describing a characteristic of a sample.
29
Sample Statistics Estimate Parameters
Values of population parameters are unknown; in addition, they are unknowable.
Example: The distribution of heights of adult females (at least 18 yrs of age) in the United States is approximately symmetric and mound-shaped with mean µ. µ is a population parameter whose value is unknown and unknowable.
The heights of 1500 females are obtained from a sample of government records. The sample mean x̄ of the 1500 heights is calculated to be 64.5 inches. The sample mean x̄ is a sample statistic that we use to estimate the unknown population parameter µ.
30
We typically use Greek letters to denote parameters and Latin letters to denote statistics.
31
Various claims are often made for surveys. Why is each of the following claims not correct?
Claim: It is always better to take a census than a sample.
Answer: Timeliness, expense, complexity, accuracy.
Claim: Stopping students on their way out of the cafeteria is a good way to sample if we want to know the quality of the food in the cafeteria.
Answer: Bias; they chose to eat at the cafeteria.
Claim: We drew a sample of 100 from the 3,000 students at a small college. To get the same level of precision for a town of 30,000 residents, we'll need a sample of 1,000 residents.
Answer: It’s the sample size, not the size of the population or the fraction of the population that we sample, that is important.
32
Survey claims (cont.)
Claim: An internet poll taken at the web site www.statsisfun.org garnered 12,357 responses. The majority said they enjoy doing statistics homework. With a sample size that large, we can be pretty sure that most Statistics students feel this way, too.
Answer: Voluntary response bias; size of sample does not remove the bias.
Claim: The true percentage of all Statistics students who enjoy the homework is called a “population statistic.”
Answer: The true percentage is a population parameter.
34
3.2, 3.3 Probability Sampling In a probability sample, every unit in the population of interest (for example, all registered voters in a political poll) has a known chance of being selected for inclusion in the sample. http://abcn.ws/1rtIOBR
35
3.2 Simple Random Samples
• Desire the sample to be representative of the population from which the sample is selected
• Each individual in the population should have an equal chance to be selected
• Is this good enough?
36
Example
Select a sample of high school students as follows:
1. Flip a fair coin
2. If heads, select all female students in the school as the sample
3. If tails, select all male students in the school as the sample
Each student has an equal chance to be in the sample, but every sample is a single gender and so is not representative.
Each individual in the population has an equal chance to be selected. Is this good enough? NO!!
37
Simple Random Samples
• A simple random sample (SRS) of size n consists of n items from the population chosen in such a way that every set of n units has an equal chance to be the sample actually selected.
38
Simple Random Samples (cont.)
Suppose a large History class of 500 students has 250 male and 250 female students. To select a random sample of 250 students from the class, I flip a fair coin one time. If the coin shows heads, I select the 250 males as my sample; if the coin shows tails I select the 250 females as my sample.
What is the chance any individual student from the class is included in the sample? ½
This is a random sample. Is it a simple random sample? NO! Not every possible group of 250 students has an equal chance to be selected. Every sample consists of only 1 gender, which is hardly representative.
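The difference between "equal chance for every individual" and "equal chance for every set of n units" is easier to see on a tiny version of the class. The sketch below assumes a toy class of 2 male and 2 female students (the labels are hypothetical) and compares an SRS of size 2 with the coin-flip design by listing every possible sample.

```python
from itertools import combinations
from fractions import Fraction

# Toy class: 2 male and 2 female students (hypothetical labels).
students = ["M1", "M2", "F1", "F2"]

# Design A: simple random sample of size 2, every pair equally likely.
srs_samples = [frozenset(pair) for pair in combinations(students, 2)]
srs_design = {sample: Fraction(1, len(srs_samples)) for sample in srs_samples}

# Design B: flip a fair coin; heads selects all males, tails all females.
coin_design = {
    frozenset(["M1", "M2"]): Fraction(1, 2),
    frozenset(["F1", "F2"]): Fraction(1, 2),
}

def inclusion_probability(design, student):
    """Chance that `student` ends up in the sample under `design`."""
    return sum(prob for sample, prob in design.items() if student in sample)

for s in students:
    print(s, inclusion_probability(srs_design, s), inclusion_probability(coin_design, s))
print("possible samples:", len(srs_design), "under SRS vs", len(coin_design), "under the coin flip")
```

Both designs give each student a 1/2 chance of inclusion, but only the SRS makes every pair of students a possible, equally likely sample.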
39
Sampling Frame To select a sample at random, we first need to define where the sample will come from. –The sampling frame is a list of individuals from which the sample is drawn. –E.g., To select a random sample of students from NCSU, we might obtain a list of all registered full-time students from Registration & Records. –When defining sampling frame, must deal with details defining the population; are part-time students included? How about current study-abroad students? Once we have our sampling frame, the easiest way to choose an SRS is with random numbers.
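Once a sampling frame exists as a list, software can draw the SRS directly. This is a minimal sketch using Python's standard `random.sample`, which selects without replacement so that every subset of the requested size is equally likely. The frame contents, its size, and the function name `simple_random_sample` are assumptions made for illustration.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw an SRS of size n, without replacement, from the sampling frame."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(frame, n)

# Hypothetical sampling frame: made-up student IDs standing in for a registrar's list.
frame = [f"student_{i:05d}" for i in range(1, 30001)]

sample = simple_random_sample(frame, 100, seed=1)
print(sample[:5])
```

Anyone left off the frame can never appear in `sample`, which is why the definition of the frame matters as much as the random draw.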
40
Warning! If some members of the population are not included in the sampling frame, they cannot be part of the sample!! (e.g., using a telephone book as the sampling frame)
Population: Wal Mart shoppers
Sampling frame?
41
Example: simple random sample
• Academic dept wishes to randomly choose a 3-member committee from the 28 members of the dept

01 Abbott       08 Goodwin      15 Pillotte     22 Theobald
02 Cicirelli    09 Haglund      16 Raman        23 Vader
03 Crane        10 Johnson      17 Reimann      24 Wang
04 Dunsmore     11 Keegan       18 Rodriguez    25 Wieczoreck
05 Engle        12 Lechtenb’g   19 Rowe         26 Williams
06 Fitzpat’k    13 Martinez     20 Sommers      27 Wilson
07 Garcia       14 Nguyen       21 Stone        28 Zink
42
Solution
Use a random number table; read 2-digit pairs until you have chosen 3 committee members.
For example, start in row 121: 71487 09984 29077 14863 61683 47052 62224 51025
Garcia (07) Theobald (22) Johnson (10)
Your calculator generates random numbers; you can also generate random numbers using Excel.
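The table-reading procedure in this solution can be sketched in a few lines: read the digits two at a time, keep a pair only if it is a valid label, and skip repeats. The function name `select_from_digit_table` is introduced here for illustration, and the sketch assumes two-digit labels (populations of at most 99).

```python
def select_from_digit_table(digits, population_size, sample_size):
    """Read consecutive two-digit labels from a row of random digits,
    keeping labels in 1..population_size and skipping repeats."""
    digits = "".join(ch for ch in digits if ch.isdigit())  # drop the spaces
    chosen = []
    for i in range(0, len(digits) - 1, 2):
        label = int(digits[i:i + 2])
        if 1 <= label <= population_size and label not in chosen:
            chosen.append(label)
            if len(chosen) == sample_size:
                break
    return chosen

# Department members as numbered on the slide (abbreviations kept as given).
members = ["Abbott", "Cicirelli", "Crane", "Dunsmore", "Engle", "Fitzpatrick",
           "Garcia", "Goodwin", "Haglund", "Johnson", "Keegan", "Lechtenb'g",
           "Martinez", "Nguyen", "Pillotte", "Raman", "Reimann", "Rodriguez",
           "Rowe", "Sommers", "Stone", "Theobald", "Vader", "Wang",
           "Wieczoreck", "Williams", "Wilson", "Zink"]

row_121 = "71487 09984 29077 14863 61683 47052 62224 51025"
labels = select_from_digit_table(row_121, len(members), 3)
print([(label, members[label - 1]) for label in labels])
```

The pairs 71, 48, 70, 99, 84, and 29 are skipped because no member carries those labels, which leaves 07, 22, and 10 as in the solution.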
43
Sampling Variability Suppose we had started in line 145? 19687 12633 57857 95806 09931 02150 43163 58636 Our sample would have been 19 Rowe, 26 Williams, 06 Fitzpatrick
44
Sampling Variability Samples selected at random generally differ from one another. Each selection of random numbers selects different people for our sample. These differences lead to different values for the variables we measure. We call these sample-to-sample differences sampling variability. Variability is OK; bias is bad!!
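Sampling variability can be seen by drawing several random samples from the same population and comparing a statistic across them. This is a small simulation sketch: the population values are made up (500 simulated measurements), and the seed, sample size, and number of samples are arbitrary choices.

```python
import random
from statistics import mean

rng = random.Random(2024)

# Hypothetical population: 500 made-up measurements.
population = [rng.gauss(64.5, 2.5) for _ in range(500)]

def sample_means(population, sample_size, n_samples, rng):
    """Mean of each of n_samples simple random samples of the given size."""
    return [mean(rng.sample(population, sample_size)) for _ in range(n_samples)]

means = sample_means(population, 50, 5, rng)
print("population mean:", round(mean(population), 2))
print("sample means:   ", [round(m, 2) for m in means])
```

The sample means differ from one draw to the next, yet they scatter around the population mean rather than drifting to one side, which is the distinction between variability and bias.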
46
3.3 Other Probability-Based Sampling Designs 1. Stratified random samples 2. Cluster random samples 3. Systematic random samples 4. Multi-stage random samples
47
3.3 Stratified Random Sampling
• This sampling procedure separates the population into mutually exclusive homogeneous groups called strata. Then select simple random samples from each stratum.
Examples of strata:
Sex: male, female
Age: under 20, 20-30, 31-40, 41-50
Occupation: professional, clerical, blue-collar
48
Benefits of Stratified Random Sampling
Reduces Bias: At least there will not be bias towards any of the strata.
Reduces Sampling Variability: Estimates will be more precise (if the different strata tend to have higher or lower values, then sampling variability is reduced).
With stratified sampling we can acquire information about
–the whole population
–each stratum
–the relationships among strata.
49
Stratified Random Sampling
There are several ways to build the stratified sample. For example, keep the proportion of each stratum in the population. A sample of size 1,000 is to be drawn.

Stratum   Income            Population proportion   Stratum size
1         under $15,000     25%                     250
2         15,000-29,999     40%                     400
3         30,000-50,000     30%                     300
4         over $50,000      5%                      50
Total                                               1,000
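Proportional allocation, followed by an SRS inside each stratum, can be sketched as follows. The stratum shares come from the income table; the household frames, their sizes, and the function names are hypothetical and exist only to make the example runnable.

```python
import random

def proportional_allocation(percent_by_stratum, total_sample_size):
    """Sample size per stratum, proportional to its share of the population.
    Shares are whole percents so the arithmetic stays exact."""
    return {s: total_sample_size * pct // 100 for s, pct in percent_by_stratum.items()}

def stratified_sample(frames, allocation, seed=None):
    """Draw an SRS of the allocated size from each stratum's frame."""
    rng = random.Random(seed)
    return {s: rng.sample(frames[s], allocation[s]) for s in allocation}

percents = {"under $15,000": 25, "15,000-29,999": 40,
            "30,000-50,000": 30, "over $50,000": 5}
allocation = proportional_allocation(percents, 1000)
print(allocation)

# Hypothetical frames: 20,000 made-up household IDs split by the same shares.
frames = {s: [f"{s}/household_{i}" for i in range(200 * pct)]
          for s, pct in percents.items()}
sample = stratified_sample(frames, allocation, seed=0)
print({s: len(units) for s, units in sample.items()})
```

The floor division is exact here because every share times 1,000 is a whole number; with other shares or sample sizes the allocated sizes may need adjusting so that they still add up to the total.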
50
3.3 Cluster Random Sampling Sometimes stratifying isn’t practical and simple random sampling is difficult. Splitting the population into similar parts or clusters can make sampling more practical. Then we could select one or a few clusters at random and select an SRS or perform a census within each selected cluster. This sampling design is called cluster sampling. If each cluster fairly represents the full population, cluster sampling will give us an unbiased sample.
51
Cluster Sampling Useful When…
–it is difficult and costly to develop a complete list of the population members (making it difficult to develop a simple random sampling procedure), e.g., all items sold in a grocery store
–the population members are widely dispersed geographically, e.g., all Toyota dealerships in North Carolina
52
Mean length of sentences in our course text
We would like to assess the reading level of our course text based on the length of the sentences.
Simple random sampling would be awkward: number each sentence in the book?
Better way:
–choose a few pages at random (the pages are the clusters, and it's reasonable to assume that each page is representative of the entire text).
–count the length of all sentences on the selected pages, or select an SRS of sentences from each of the selected pages.
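The page-based approach can be sketched as a one-stage cluster sample: pick a few pages at random and take a census of the sentences on them. The book below is entirely made up (300 pages of random sentence lengths), and `cluster_sample` is a name introduced for this sketch.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Pick n_clusters clusters at random and return every unit in them
    (a census within each chosen cluster)."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {key: clusters[key] for key in chosen}

# Hypothetical book: page number -> sentence lengths (in words) on that page.
book_rng = random.Random(0)
book = {page: [book_rng.randint(5, 40) for _ in range(book_rng.randint(10, 25))]
        for page in range(1, 301)}

pages = cluster_sample(book, 5, seed=1)
lengths = [length for sentences in pages.values() for length in sentences]
print("pages chosen:", sorted(pages))
print("estimated mean sentence length:", round(sum(lengths) / len(lengths), 1))
```

Only page numbers need to be listed in advance, which is what makes this cheaper than numbering every sentence in the book.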
53
Cluster sampling - not the same as stratified sampling!!
We stratify to ensure that our sample represents different groups in the population, and sample randomly within each stratum. Strata are homogeneous (e.g., stratum 1: males, stratum 2: females) but differ from one another.
Clusters are more or less alike, each heterogeneous and resembling the overall population. We randomly select a few clusters to make sampling more practical or affordable. We select an SRS or conduct a census on each selected cluster.
54
3.3 Systematic Random Sampling Sometimes we draw a sample by selecting individuals systematically. For example, you might survey every 10th person on an alphabetical list of students. To make it random, you must still start the systematic selection from a randomly selected individual. When there is no reason to believe that the order of the list could be associated in any way with the responses sought, systematic sampling can give a representative sample. Systematic sampling can be much less expensive than true random sampling. When you use a systematic sample, you need to justify the assumption that the systematic method is not associated with any of the measured variables.
55
Systematic Sampling-example
You want to select a sample of 50 students from a college dormitory that houses 500 students. On a list of all students living in the dorm, number the students from 001 to 500. Generate a random number between 001 and 010, and start with that student. Every 10th student in the list becomes part of your sample. For example: 3, 13, 23, 33, 43, 53, …, 493.
Questions:
1) Does each student have an equal chance to be in the sample? Yes
2) What is the chance that a student is included in the sample? 1/10
3) Is this an SRS? No
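The dormitory example can be sketched directly: choose a random start between 1 and the step, then take every 10th student. The function name `systematic_sample` and the seed are illustrative choices.

```python
import random

def systematic_sample(population_size, step, seed=None):
    """Every step-th label, starting from a random label in 1..step."""
    rng = random.Random(seed)
    start = rng.randint(1, step)  # inclusive on both ends
    return list(range(start, population_size + 1, step))

sample = systematic_sample(500, 10, seed=3)
print(len(sample), "students, starting at", sample[0], "and ending at", sample[-1])

# Only 10 different samples are possible, one per starting point,
# which is why this design is not an SRS.
all_possible = [list(range(start, 501, 10)) for start in range(1, 11)]
print(len(all_possible), "possible samples")
```

Each student appears in exactly one of the ten possible samples, so the inclusion chance is 1/10, yet two neighbours on the list can never be selected together.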
56
3.3 Multistage Sampling Sometimes we use a variety of sampling methods together. Sampling schemes that combine several methods are called multistage samples. Most surveys conducted by professional polling organizations and government agencies use some combination of stratified and cluster sampling as well as simple random sampling.
57
Mean length of sentences in our course text, cont. In attempting to assess the reading level of our course text: we might worry that it starts out easy and gets harder as the concepts become more difficult we want to avoid samples that select too heavily from early or from late chapters Suppose our course text has 5 sections, with several chapters in each section.
58
Mean length of sentences in our course text, cont.
We could:
i) randomly select 1 chapter from each section
ii) randomly select a few pages from each of the selected chapters
iii) if altogether this makes too many sentences, we could randomly select a few sentences from each page.
So what is our sampling strategy?
i) we stratify by section of the book
ii) we randomly choose 1 chapter from each section to represent each stratum (section)
iii) within each chapter we randomly choose pages as clusters
iv) finally, we choose an SRS of sentences from the randomly chosen pages.
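The four-step strategy can be sketched as nested random choices. Everything about the book here is hypothetical: 5 sections of 3 chapters, 20 pages per chapter, and 12 made-up sentence lengths per page. The function names and the choice of 2 pages and 4 sentences per page are also assumptions.

```python
import random

def build_hypothetical_book(seed=0):
    """Made-up text: 5 sections x 3 chapters x 20 pages x 12 sentence lengths."""
    rng = random.Random(seed)
    return {
        f"section {s}": {
            f"chapter {s}.{c}": {
                page: [rng.randint(5, 40) for _ in range(12)]
                for page in range(1, 21)
            }
            for c in range(1, 4)
        }
        for s in range(1, 6)
    }

def multistage_sample(book, pages_per_chapter, sentences_per_page, seed=None):
    """Stratify by section, pick 1 chapter per section, pick pages as
    clusters, then take an SRS of sentences from each chosen page."""
    rng = random.Random(seed)
    sample = []
    for section in sorted(book):                      # every stratum is used
        chapters = book[section]
        chapter = rng.choice(sorted(chapters))        # 1 chapter per section
        pages = chapters[chapter]
        for page in rng.sample(sorted(pages), pages_per_chapter):
            sentences = pages[page]
            k = min(sentences_per_page, len(sentences))
            sample.extend(rng.sample(sentences, k))   # SRS within the page
    return sample

book = build_hypothetical_book()
lengths = multistage_sample(book, pages_per_chapter=2, sentences_per_page=4, seed=1)
print(len(lengths), "sentences sampled; mean length",
      round(sum(lengths) / len(lengths), 1))
```

Because every section contributes the same number of sentences, the sample cannot lean too heavily on early or late chapters.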
59
Summary: What have we learned?
• A representative sample can offer us important insights about populations.
–It’s the size of the sample, not its fraction of the larger population, that determines the precision of the statistics it yields.
• There are several ways to draw probability samples, all based on the power of randomness to make them representative of the population of interest:
–Simple Random Sample, Stratified Random Sample, Cluster Random Sample, Systematic Random Sample, Multistage Random Sample
60
Summary: What have we learned? (cont.)
• Bias can destroy our ability to gain insights from our sample:
–Nonresponse bias can arise when sampled individuals will not or cannot respond.
–Response bias arises when respondents’ answers might be affected by external influences, such as question wording or interviewer behavior.
61
Summary: What have we learned? (cont.)
• Bias can also arise from poor sampling methods:
–Voluntary response samples are almost always biased and should be avoided and distrusted.
–Convenience samples are likely to be flawed for similar reasons.
–Even with a reasonable design, nonrepresentative sample frames create bias. Undercoverage occurs when individuals from a subgroup of the population are selected less often than they should be.
62
Opinion Polling: What’s Wrong Lately?
Prediction slippage: 2012 US presidential election (correct winner but not very accurate)
Predictions way off:
–2014 US midterms
–2014 Scottish independence referendum
–2015 UK election
–2015 Israeli general election
–2015 Greek bailout vote
63
Response Rates Declining
64
Contacting People-Extremely Difficult
65
Contacting People-Extremely Difficult - 2
1. Robo-calls (auto-dialed calls) to cellphones are NOT ALLOWED.
2. To obtain between 700 and 1,000 cellphone interviews when the response rate is 8%, approx. 10,000 cellphone numbers must be manually dialed: a budget buster!
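The "approx. 10,000" figure follows from dividing the target number of interviews by the response rate. A quick check, assuming each dialed number yields an interview at the stated 8% rate (the function name is introduced here for illustration):

```python
def numbers_to_dial(target_interviews, response_rate_percent):
    """Smallest whole number of dialed numbers expected to yield the target,
    given a response rate expressed in whole percent."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-target_interviews * 100 // response_rate_percent)

for target in (700, 1000):
    print(target, "interviews ->", numbers_to_dial(target, 8), "numbers dialed")
```

The two ends of the range come out to 8,750 and 12,500 numbers, which brackets the slide's rough figure of 10,000.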
66
Non-Probability Sampling
• The high cost of obtaining data has driven survey firms to the internet.
• Non-probability sampling: participants are chosen or choose themselves so that the chance of being selected is not known.
–Major problems with internet polls
–No one has figured out how to select a representative sample of internet users
67
85% of US Adults Use the Internet * Blogs (e.g. Blogger, Wordpress), Microblogs (e.g. Twitter), Social networking (e.g. Facebook), Content sharing/discussion (YouTube, Reddit)
68
Non-Probability Sampling: Opt-In Online Panels
• YouGov https://today.yougov.com/about/about-the-yougov-panel/
69
Non-Probability Sampling: Many Online Data-Gathering Services (Free, Pay)
• Google Consumer Surveys
• Google Trends
• Google Analytics
• Twitter Analytics
• Facebook Analytics
• Microsoft
• Yahoo
• Amazon
70
Example: ViralHeat (fee-based)
71
The Billion Prices Project @ MIT
• http://bpp.mit.edu/
Aggregates millions of daily e-commerce transactions into a real-time price index for US, China, and ten other countries
72
End of Lecture Unit 3