4.1 (Day 2) 10.10.2017
Other Sampling Methods
Last week we talked about simple random samples (SRS), the simplest way to eliminate bias.
Sometimes, however, other factors make more complex sampling methods more appealing:
1. Taking an SRS is too difficult/expensive.
2. We are not convinced that an SRS will actually give us a representative sample.
Stratified Random Sample
Probably the most commonly used sampling method.
Instead of sampling from the entire population, you divide the population into sub-groups (“strata”).
You then randomly sample from each of these strata.
Combine these “sub-samples” to form your overall sample.
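The steps above can be sketched in a few lines of Python. This is only an illustration, not something from the slides: the function name `stratified_sample`, the `fraction` parameter, and the made-up population of students are all assumptions. It also assumes we sample the same fraction from every stratum, which is one common way to keep the sample proportional to the population.

```python
import random

def stratified_sample(population, stratum_of, fraction, seed=None):
    """Split the population into strata, then take a random sample from each.

    stratum_of is a function that returns the stratum label of an individual.
    fraction is the share of each stratum to sample (an assumed design choice).
    """
    rng = random.Random(seed)

    # Step 1: divide the population into strata.
    strata = {}
    for individual in population:
        strata.setdefault(stratum_of(individual), []).append(individual)

    # Step 2: randomly sample within each stratum.
    # Step 3: combine the sub-samples into one overall sample.
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 60 ninth graders and 40 tenth graders.
students = [("grade 9", i) for i in range(60)] + [("grade 10", i) for i in range(40)]
sample = stratified_sample(students, stratum_of=lambda s: s[0], fraction=0.1, seed=1)
print(len(sample), "students sampled")
```

Because every stratum contributes the same fraction, both grades are guaranteed to appear in the sample, which a plain SRS of 10 students cannot promise.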
Stratified Random Sampling
1. Taking an SRS is too difficult/expensive.
2. We are not convinced that an SRS will actually give us a representative sample.
Which of these problems is stratified random sampling fixing? #2.
Sidenote: do NOT abbreviate stratified random sampling as SRS. SRS always means simple random sample.
Stratified Random Sampling
When is it most commonly used? To make sure that important groups are all represented: income, race/ethnicity, gender, age, etc.
Downsides to Stratified Random Sampling
Harder/more expensive than an SRS.
You have to know in advance who falls into which stratum.
You have to know how many to sample from each stratum to make the sample representative.
Cluster Sampling
Cluster sampling is a solution to the other problem: that an SRS can be expensive/difficult.
Particularly if your population is large and/or very spread out, it is difficult to use an SRS or stratified random sampling.
These methods are also very difficult to use if you cannot easily identify each individual in your population.
Think of the trees in a forest.
Cluster Sampling to the Rescue
The procedure for carrying out a cluster sample is similar to a stratified random sample.
First divide the population into smaller groups (or “clusters”).
So far this sounds exactly like stratified random sampling.
But now we are not dividing into groups to make the sample representative; instead we are dividing into groups that can be readily found together: homerooms, neighborhoods, 50x50 sections of the forest.
Then we randomly select which clusters to sample.
Typically, once a cluster is selected, ALL individuals in the cluster are included in the sample.
Occasionally you may choose to take an SRS within each selected cluster instead.
Cluster Sampling
As with stratified random sampling, we then combine the selected clusters to form our overall sample.
So cluster sampling is NOT necessarily making our sample more representative than an SRS.
But it is, in some cases, making it easier to perform and more cost/time effective.
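Here is a rough sketch of a cluster sample in Python, using homerooms as the clusters. The function name `cluster_sample` and the made-up school of 12 homerooms with 25 students each are assumptions for illustration. The sketch follows the typical version described above, where everyone in a selected cluster is included.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole clusters and include everyone in them.

    clusters maps a cluster name to the list of individuals in it.
    """
    rng = random.Random(seed)

    # Randomly select which clusters to sample (not which individuals).
    chosen = rng.sample(sorted(clusters), n_clusters)

    # Include ALL individuals from each chosen cluster, then combine.
    sample = [person for name in chosen for person in clusters[name]]
    return chosen, sample

# Hypothetical school: 12 homerooms with 25 students each.
homerooms = {
    f"room {r}": [f"student {r}-{i}" for i in range(25)]
    for r in range(1, 13)
}
chosen, sample = cluster_sample(homerooms, n_clusters=3, seed=1)
print("Homerooms visited:", chosen)
print("Students in the sample:", len(sample))
```

Notice that the randomness only decides which homerooms get visited. That is what makes the method cheap, and also why it does not guarantee a more representative sample than an SRS.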
The Stratified/Cluster Sample Hybrid
On a test, a given method will clearly be an SRS, a stratified random sample, or a cluster sample.
But in practice, there is some overlap between cluster samples and stratified random samples.
You may even use such a hybrid later this semester.
For example, if my population is citizens of the United States, it would be too difficult to take an SRS.
So I decide to use a cluster sample to make it easier.
But I want to make sure that people in both smaller cities/towns (under 40,000 people) and bigger cities (over 40,000 people) are represented.
So I decide to randomly sample individuals from 10 big cities and 5 smaller cities.
Notice that I have clustered by city but also stratified by size of the city.
So this is not a pure stratified random sample and not a pure cluster sample; it is a hybrid.
Your book (as far as I know) does not really talk about this, but I just wanted you to be aware of it.
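The city example could be sketched like this. Everything specific in the code is an assumption made for illustration: the function name `hybrid_sample`, the made-up list of cities, the choice of 4 people per city, and the decision to count a city of exactly 40,000 as small (the example above does not say which side it falls on).

```python
import random

def hybrid_sample(cities, n_big, n_small, per_city, threshold=40_000, seed=None):
    """Stratify cities by size, pick cities at random, then sample people in them."""
    rng = random.Random(seed)

    # Stratify: split the cities into two strata by population size.
    big = sorted(name for name, c in cities.items() if c["population"] > threshold)
    small = sorted(name for name, c in cities.items() if c["population"] <= threshold)

    # Cluster: randomly choose whole cities from each stratum.
    chosen = rng.sample(big, n_big) + rng.sample(small, n_small)

    # Take an SRS of residents within each chosen city and combine.
    sample = []
    for name in chosen:
        sample.extend(rng.sample(cities[name]["residents"], per_city))
    return chosen, sample

# Hypothetical data: 30 big cities and 20 small ones, 50 listed residents each.
cities = {}
for i in range(30):
    cities[f"big {i}"] = {
        "population": 100_000 + i,
        "residents": [f"big {i} resident {j}" for j in range(50)],
    }
for i in range(20):
    cities[f"small {i}"] = {
        "population": 5_000 + i,
        "residents": [f"small {i} resident {j}" for j in range(50)],
    }

chosen, sample = hybrid_sample(cities, n_big=10, n_small=5, per_city=4, seed=1)
print(len(chosen), "cities chosen,", len(sample), "people sampled")
```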
Inference
Why random sampling? Let’s do an activity.
I have 3 decks of cards.
One at a time, come shuffle a deck, then pick 5 cards from it.
Then put them back in the deck and record on the (left) dotplot what percentage were red.
Now pick 10 cards and record on the (middle) dotplot what percentage were red.
Now pick 20 cards and record on the (right) dotplot what percentage were red.
What does each dotplot look like?
Inference
What did this activity tell us? What does it imply for random sampling?
Random Sampling
Random sampling works so well because each individual is equally likely to be chosen.
Each card was equally likely to be chosen.
As the sample gets larger, it tends to come closer and closer to representing the entire population.
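If you want to repeat the card activity many more times than we can in class, a simulation along these lines should work. It assumes a standard 52-card deck with 26 red cards and draws without replacement; the function names, the 2000 repetitions, and the fixed seed are all choices made for this sketch.

```python
import random
import statistics

def percent_red(n_cards, rng):
    """Draw n_cards from a shuffled 52-card deck and return the percent that are red."""
    deck = ["red"] * 26 + ["black"] * 26
    hand = rng.sample(deck, n_cards)  # drawing without replacement
    return 100 * hand.count("red") / n_cards

def spread_of_percent_red(n_cards, trials=2000, seed=0):
    """Repeat the draw many times; return the mean and standard deviation."""
    rng = random.Random(seed)
    results = [percent_red(n_cards, rng) for _ in range(trials)]
    return statistics.mean(results), statistics.pstdev(results)

# One "dotplot" per sample size, summarized by its center and spread.
summary = {n: spread_of_percent_red(n) for n in (5, 10, 20)}
for n, (mean, sd) in summary.items():
    print(f"{n:>2} cards: mean {mean:.1f}% red, standard deviation {sd:.1f} points")
```

All three sample sizes should center near 50% red, but the spread should shrink as the number of cards grows, which is the pattern the three dotplots are meant to show.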
Margin of Error
Each sample has a margin of error.
It sets bounds on the likely range of the true population value, based on the sample.
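The slides do not give a formula, but one standard approximation for a sample proportion from an SRS is margin of error ≈ 1.96 × sqrt(p(1 − p)/n) at roughly 95% confidence. The sketch below assumes that formula and a reasonably large sample; the function name `margin_of_error` and the example sample sizes are made up for illustration.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate margin of error for a sample proportion.

    p_hat is the sample proportion, n is the sample size, and z = 1.96
    corresponds to roughly 95% confidence (an assumed choice).
    """
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Quadrupling the sample size cuts the margin of error in half.
for n in (100, 400, 1600):
    moe = margin_of_error(0.5, n)
    print(f"n = {n:>4}: about +/- {100 * moe:.1f} percentage points")
```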
Sample Surveys: What Can Go Wrong?
Undercoverage occurs when some groups in the population are left out of the process of choosing the sample.
Nonresponse occurs when an individual chosen for the sample can’t be contacted or refuses to participate.
A systematic pattern of incorrect responses in a sample survey leads to response bias.
The wording of questions is the most important influence on the answers given to a sample survey.
Wording of the Question
“Since they are breaking the law, should illegal immigrants be deported?”
“Should illegal immigrants be deported?”
“Despite holding jobs and helping the US economy, should illegal immigrants be deported?”
Order of the Questions
A series of two questions was asked of college students as part of a larger survey. The order of the two questions was randomly assigned.
Order #1: “How happy are you with your life in general? (on a scale of 1 to 5)” then “How many dates have you been on in the past month?”
This order found almost no correlation between happiness and number of dates.
Order #2: “How many dates have you been on in the past month?” then “How happy are you with your life in general? (on a scale of 1 to 5)”
This order found a moderately strong positive correlation between happiness and number of dates.
Sometimes the answer to one question changes the answer to another, so even the order can matter.
Surveys
Moral of the story: there are lots of ways to create bias in a survey.
You will explore these in the first semester project.
In general, if you want unbiased results:
Be careful with your wording; make sure that the questions are as neutral as possible.
Try not to put questions near each other when the answer to one would affect the answer to the other.
Make your sample as representative as possible.
It is HARD to know for sure that you’re not getting bias, but you can take precautions.
It is EASY to intentionally create bias, but this makes it more PROPAGANDA than statistics.
The End