STAT 111 Introductory Statistics Lecture 4: Collecting Data May 24, 2004.

Presentation transcript:


Today’s Topics
–Relationships between categorical variables
–Collecting data
  –Designing experiments
  –Choosing a sample
  –Sampling distributions

Categorical Variables Recall that categorical variables separate individuals into groups. We’ve seen that to see relationships between quantitative variables, we use scatterplots. Similarly, to see relationships between categorical explanatory variables and quantitative responses, side-by-side boxplots are quite useful. What do we use to see the relationship between two categorical variables, though?

Contingency Table The contingency table is a two-way table with one variable as the row variable and the other as the column variable. The row totals and column totals in a two-way table give the marginal distributions of the two variables separately. The conditional distribution of the response variable for each category of the explanatory variable can be used to describe the association between the two variables.

Contingency Table Example 1 Titanic data: 2201 passengers, counts only. Row variable: SURVIVED (no, yes). Column variable: SEX (female, male). [Table of counts not preserved.]

Joint and Marginal Distributions [Tables not preserved: joint distribution; marginal distribution of SURVIVED; marginal distribution of SEX.]

Conditional Distributions [Tables not preserved: conditional distribution of survival given gender; conditional distribution of gender given survival.]

Example from Contingency Table 1
Joint distribution:
–P(male and survived) = 16.67%
–P(female and survived) = 15.63%
Marginal distribution:
–P(survived) = 32.30%
–P(male) = 78.65%
Conditional distribution of survival:
–Given a female: yes 73.19%, no 26.81%
–Given a male: yes 21.20%, no 78.80%
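The table of counts did not survive in this transcript, but the percentages above can be reproduced from counts. The sketch below is only an illustration: the four cell counts are an assumption, back-computed from the slide's percentages and the stated total of 2201, and the names `counts`, `marginal`, and `conditional_survival` are made up for this example.

```python
# Cell counts keyed by (sex, survived).
# ASSUMPTION: these counts are back-computed from the slide's percentages
# and the stated total of 2201; the original table was not preserved.
counts = {
    ("female", "yes"): 344,
    ("female", "no"): 126,
    ("male", "yes"): 367,
    ("male", "no"): 1364,
}
total = sum(counts.values())

# Joint distribution: each cell divided by the grand total.
joint = {cell: n / total for cell, n in counts.items()}


def marginal(index):
    """Marginal distribution of one variable (0 = sex, 1 = survived)."""
    out = {}
    for cell, n in counts.items():
        out[cell[index]] = out.get(cell[index], 0) + n / total
    return out


def conditional_survival(sex):
    """Conditional distribution of survival given one sex."""
    group_total = sum(n for (s, _), n in counts.items() if s == sex)
    return {surv: n / group_total for (s, surv), n in counts.items() if s == sex}


sex_marginal = marginal(0)
survived_marginal = marginal(1)

print(f"P(male and survived)   = {joint[('male', 'yes')]:.2%}")
print(f"P(female and survived) = {joint[('female', 'yes')]:.2%}")
print(f"P(survived)            = {survived_marginal['yes']:.2%}")
print(f"P(male)                = {sex_marginal['male']:.2%}")
print(f"P(survived | female)   = {conditional_survival('female')['yes']:.2%}")
print(f"P(survived | male)     = {conditional_survival('male')['yes']:.2%}")
```

The joint and marginal figures divide by the grand total, while the conditional figures divide by the total for one sex only. That change of denominator is why the two joint percentages look similar but the conditional ones do not.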

We see that of the people on board the ship, female survivors and male survivors made up roughly the same percentage. But the number of females on board was substantially smaller than the number of males. Looking at each category, we see that the percentage of females that survived is higher than the percentage of males that survived. Survival and gender seem to be associated.

Lurking Variables We know that lurking variables can produce nonsensical relationships between two quantitative variables. Does the same hold true for relationships between categorical variables? Example – We have the number of delayed and on-time flights for two airlines, Alaska Airlines (AA) and America West (AW). Which one has more flights that leave on time?

Lurking Variables (cont.) [Contingency table not preserved: Airline (AA, AW) by Status (delay, on-time), with counts and row percentages.] Looking at the contingency table, it looks like America West has a larger percentage of on-time flights. But…

Lurking Variables (cont.) Let’s look at the data for the individual cities. [Tables not preserved: Los Angeles, Phoenix, San Diego, Seattle, San Francisco.]

Lurking Variables (cont.) For each individual city, the percentage of flights that are on-time is higher for Alaska Airlines than it is for America West. On the other hand, the percentage of flights that are on-time is higher for America West than for Alaska Airlines when we look at the aggregate. What’s going on here?

Lurking Variables (cont.) An association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group. This reversal is Simpson’s paradox. Simpson’s paradox is an extreme form of the fact that observed associations can be misleading in the presence of lurking variables. Our case is an example of Simpson’s paradox, so what is the lurking variable here?

Lurking Variables (cont.) The lurking variable here is the city, and in particular, the weather of that city. Of the five cities listed, Seattle has the worst weather, so flights at this airport tend to be delayed more often. Phoenix, on the other hand, is not plagued with bad weather, so flights there tend to be on time more often. Most of Alaska Airlines’ flights involve Seattle, whereas America West’s flights mostly involve Phoenix!
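The airline counts were not preserved in this transcript, so the sketch below uses small hypothetical numbers (not the real flight data) to show how the reversal can arise. "Airline A" flies mostly out of a bad-weather city and "Airline B" mostly out of a good-weather city; all names and counts here are assumptions chosen for illustration.

```python
# HYPOTHETICAL counts, stored as (on_time_flights, total_flights).
flights = {
    "Airline A": {"bad-weather city": (800, 1000), "good-weather city": (95, 100)},
    "Airline B": {"bad-weather city": (70, 100), "good-weather city": (900, 1000)},
}

# On-time rate within each city, per airline.
per_city = {
    airline: {city: on_time / n for city, (on_time, n) in cities.items()}
    for airline, cities in flights.items()
}

# On-time rate after combining the cities into one group, per airline.
aggregate = {
    airline: sum(on_time for on_time, _ in cities.values())
    / sum(n for _, n in cities.values())
    for airline, cities in flights.items()
}

for city in ("bad-weather city", "good-weather city"):
    a = per_city["Airline A"][city]
    b = per_city["Airline B"][city]
    print(f"{city}: A = {a:.1%}, B = {b:.1%}")
print(f"combined: A = {aggregate['Airline A']:.1%}, B = {aggregate['Airline B']:.1%}")
```

Airline A has the higher on-time rate in each city, yet Airline B has the higher combined rate, because most of A's flights come from the city where delays are common for everyone.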

Contingency Tables – Wrap-up Most often, the contingency tables you’ll see will be of categorical variables with two levels each. Naturally, we can extend this to categorical variables with more than two levels. Also, we can consider a contingency table involving three variables; what we do in this case is create a series of contingency tables involving only the first two variables, one table for each of the levels of the third variable.

Collecting Data We’ve discussed previously the idea of exploratory data analysis. –“What do we see in our data?” Formal statistical inference is another type of data analysis. –Here, we are more interested in answering specific questions with a known degree of confidence. Either way, successful statistical analysis requires our data to be both reliable and accurate.

Collecting Data (cont.) The reliability and accuracy of our data depend on the method we use to collect our data. This method is known as a design. Some popular sources of data are –Available data from libraries and the internet (Available data are data that were produced in the past for some other purpose but that may help answer a present question.) –Observational studies –Experimental studies

Observational vs Experimental Studies In an observational study, we observe individuals and measure variables of interest, but we do not attempt to influence the responses. In an experiment, we deliberately impose some treatment on individuals in order to observe their responses. An observational study is generally poor at gauging the effect of an intervention, but in many situations, we have to use an observational study.

Sample Surveys The sample survey is one specific type of observational study. Why is it preferred to a census? –Financial constraints –Time A sample survey can be conducted using –Personal interviews –Telephone interviews –Self-administered questionnaires

Experiments
–Experimental units: individuals on which our experiment is conducted
–Subjects: human experimental units
–Treatment: specific experimental condition applied to our units
In principle, experiments can give good evidence of causation.

Principles in Designing Experiments
–Control the effects of lurking variables on the response; the easiest way to do this is to compare two or more treatments. This can help reduce the bias in a study.
–Randomize: use chance to assign experimental units to treatments.
–Replicate each treatment on many units to reduce chance variation in the results.

More on Experiments In an experiment, we hope to see a difference in the responses so large that it is unlikely to happen because of chance variation alone. In other words, we are looking for a statistically significant effect. This term frequently appears in reports of studies and tells you that the investigators found good evidence for the effect they were seeking. The most serious weakness of experiments, though, is their lack of realism.

Types of Experimental Designs
–Completely randomized design: experimental units are allocated at random among treatments. This is the simplest design for experiments.
–Block design: blocks of experimental units are formed; random assignment of units to treatments is carried out separately within each block.
–Matched pairs design: a special type of block design that compares only two treatments by choosing blocks of two units that are as closely matched as possible.
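As a rough sketch of the first two designs, the code below shuffles the units and deals them out to treatments, either all at once (completely randomized) or separately inside each block (block design). The function names, the treatment labels, and the subject lists are all hypothetical, and Python's `random` module stands in for a random numbers table.

```python
import random


def completely_randomized(units, treatments, seed=None):
    """Shuffle all units, then deal them out to the treatments in turn."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    k = len(treatments)
    return {t: shuffled[i::k] for i, t in enumerate(treatments)}


def randomized_block(blocks, treatments, seed=None):
    """Carry out the random assignment separately within each block."""
    rng = random.Random(seed)
    k = len(treatments)
    assignment = {t: [] for t in treatments}
    for units in blocks.values():
        shuffled = list(units)
        rng.shuffle(shuffled)
        for i, t in enumerate(treatments):
            assignment[t].extend(shuffled[i::k])
    return assignment


# Hypothetical subjects and treatments, for illustration only.
treatments = ["drug", "placebo"]
subjects = [f"subject{i}" for i in range(12)]
crd = completely_randomized(subjects, treatments, seed=1)

blocks = {
    "men": [f"m{i}" for i in range(6)],
    "women": [f"w{i}" for i in range(6)],
}
rbd = randomized_block(blocks, treatments, seed=1)
print(crd)
print(rbd)
```

With blocking, every treatment group is guaranteed the same number of units from each block, so the blocking variable cannot be confounded with the treatment. A matched pairs design is the special case where each block holds exactly two units.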

Review: Population vs Sample
–Population: the entire group of individuals that we want information about.
–Sample: the part of the population we actually examine in order to gather information.
–Parameter: a value that describes the population. It is fixed, but generally unknown.
–Statistic: a value that describes the sample. It is observed once a sample is obtained and can be used to estimate an unknown parameter.
We generally require that the sample be representative of the population.

Sampling Designs
–Voluntary response sample (a biased sampling scheme)
–Simple random sample
–Stratified random sample
–Cluster sample (one-stage and two-stage)

Sampling Designs A voluntary response sample consists of people who choose themselves by responding to a general appeal. This type of sample is invariably biased (contains a systematic error) and is not usually representative of the general population. Why? The people who are willing to respond are the only ones included in this sample, and usually those are the ones with very strong opinions. So what we get are the extreme cases.

Sampling Designs (cont.) Better sampling designs choose individuals by random chance so that selection bias is eliminated. A simple random sample (SRS) of size n consists of n individuals from the population chosen in such a way that every set of n individuals has an equal chance to be the sample actually selected. How do we select an SRS? –Assign a number to each individual in the population. –Randomly select sample numbers by using a random numbers table or software package.
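The two steps for selecting an SRS can be sketched in a few lines. This is one possible way to do it in software, not the only one: Python's standard `random.sample` plays the role of the random numbers table, and the function name and the student roster are hypothetical.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw an SRS of size n without replacement."""
    rng = random.Random(seed)
    # Step 1: assign a number (label) to each individual.
    labels = range(len(population))
    # Step 2: randomly select n distinct labels; every set of n labels
    # is equally likely to be chosen.
    chosen = rng.sample(labels, n)
    return [population[i] for i in chosen]


# Hypothetical roster of 30 students.
roster = [f"student{i:02d}" for i in range(30)]
srs = simple_random_sample(roster, 5, seed=2004)
print(srs)
```

Fixing `seed` makes the draw reproducible, which is convenient for checking work; leaving it as `None` gives a fresh sample each time.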

Sampling Designs (cont.) A probability sample is a sample chosen by chance and is the general framework for designs that use chance to choose a sample. Possible samples and the probability of each possible sample occurring must be known. The SRS is the simplest type of probability sample; it gives each member of the population an equal chance of selection. More complex designs are better for sampling from large populations.

Sampling Designs (cont.) To select a stratified random sample, divide the population into groups of similar individuals, called strata. Then choose a separate SRS in each stratum and combine these SRSs to form the full sample. [Example strata from the slide: sex (male, female); age (categories not preserved); marital status (married, single).]

Sampling Designs (cont.) We typically choose the strata based on facts we know prior to taking the sample. Strata for sampling are similar to blocks in experiments. Overall, using a stratified random sample, we can acquire information about –The whole population –Each stratum –The relationships among the strata
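A stratified random sample can be sketched as "group by stratum, then take an SRS inside each group". The helper below is a minimal illustration under assumed inputs: the function name, the `stratum_of` callback, the per-stratum sample sizes, and the example population are all hypothetical.

```python
import random


def stratified_sample(population, stratum_of, n_per_stratum, seed=None):
    """Take a separate SRS in each stratum and combine the results."""
    rng = random.Random(seed)
    # Divide the population into strata using information known in advance.
    strata = {}
    for individual in population:
        strata.setdefault(stratum_of(individual), []).append(individual)
    # Draw an SRS of the requested size from each stratum.
    sample = []
    for name in sorted(strata):
        sample.extend(rng.sample(strata[name], n_per_stratum[name]))
    return sample


# Hypothetical population of (id, sex) records.
people = [(i, "male" if i % 2 == 0 else "female") for i in range(40)]
strat = stratified_sample(
    people, lambda person: person[1], {"male": 5, "female": 3}, seed=7
)
print(strat)
```

Because the sample size in each stratum is fixed by design, every stratum is guaranteed to appear in the sample, which an SRS of the whole population cannot promise.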

Sampling Designs (cont.) The SRS and stratified random sample both select individuals from the population. On the other hand, the cluster sample selects groups or clusters of individuals from the population. A cluster is also referred to as a primary sampling unit (PSU). In a one-stage cluster sample, all individuals within the selected clusters are selected. In a two-stage cluster sample, an SRS of the individuals within each selected cluster is drawn.
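The difference between one-stage and two-stage cluster sampling can be sketched as follows. The function name, its parameters, and the city-block example are hypothetical; `n_within=None` is simply this sketch's way of saying "take everyone in the selected clusters".

```python
import random


def cluster_sample(clusters, n_clusters, n_within=None, seed=None):
    """Select whole clusters first, then individuals inside them.

    n_within=None gives a one-stage sample (everyone in each chosen cluster);
    an integer gives a two-stage sample (an SRS of that size per cluster).
    """
    rng = random.Random(seed)
    # Stage 1: an SRS of clusters (the primary sampling units).
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for name in chosen:
        members = clusters[name]
        if n_within is None:
            sample.extend(members)
        else:
            # Stage 2: an SRS of individuals within the chosen cluster.
            sample.extend(rng.sample(members, n_within))
    return chosen, sample


# Hypothetical city blocks, each holding 10 households.
city_blocks = {
    f"block{b}": [f"block{b}-house{h}" for h in range(10)] for b in range(6)
}
picked_1, one_stage = cluster_sample(city_blocks, 2, seed=3)
picked_2, two_stage = cluster_sample(city_blocks, 2, n_within=4, seed=3)
print(picked_1, len(one_stage))
print(picked_2, len(two_stage))
```

In both versions only the selected clusters contribute any individuals, which is what separates a cluster sample from a stratified sample, where every stratum contributes.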

Sampling Designs (cont.) A two-stage cluster sample is an example of a multistage sampling design. This is a more complex design in which, as the name suggests, a sample is obtained by sampling in multiple stages. Basically, any sort of combination of an SRS, stratified random sample, and cluster sample can create a multistage sample.

Errors – Non-sampling vs Sampling Non-sampling errors occur due to mistakes made during the process of data acquisition. Increasing sample size will not reduce this type of error. There are three types of non-sampling errors: –Errors in data acquisition, e.g., response bias –Nonresponse errors –Selection bias, such as undercoverage

Error in Data Acquisition [Diagram: Population and Sample.] If an observation from the population is wrongly recorded in the sample, the total error is sampling error + data acquisition error.

Nonresponse Error [Diagram: Population and Sample.] No response from part of the population may lead to biased results in the sample.

Selection Bias [Diagram: Population and Sample.] When parts of the population cannot be selected, the sample cannot represent the whole population.

Sampling Error Sampling error refers to differences between the sample and the population, because of the specific observations that happen to be selected. Sampling error is expected to occur when making a statement about the population based on the sample taken.

[Diagram: Population and Sample. The sampling error is the difference between the sample mean and the population mean.]

Sampling Distributions The sampling distribution of a statistic is the distribution of values taken by the statistic in all possible samples of the same size from the same population. The bias of a statistic is the difference between the mean of its sampling distribution and the population parameter; a statistic with no bias is called unbiased. The variability of a statistic is described by the spread of its sampling distribution, which is determined by the design and size of the sample.
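Listing all possible samples is rarely feasible, so a common shortcut is to approximate the sampling distribution by drawing many samples at random. The sketch below does this for the sample mean under an SRS; the simulated population, the number of draws, and every name in the code are assumptions made for illustration.

```python
import random
import statistics


def sampling_distribution(population, n, statistic, draws=2000, seed=0):
    """Approximate a sampling distribution by repeated SRSs of size n."""
    rng = random.Random(seed)
    return [statistic(rng.sample(population, n)) for _ in range(draws)]


# HYPOTHETICAL population: 5000 values centered near 50 with spread near 10.
pop_rng = random.Random(1)
population = [pop_rng.gauss(50, 10) for _ in range(5000)]
mu = statistics.fmean(population)  # the population parameter

small = sampling_distribution(population, 10, statistics.fmean)
large = sampling_distribution(population, 100, statistics.fmean)

# Bias: mean of the (approximate) sampling distribution minus the parameter.
bias_small = statistics.fmean(small) - mu
bias_large = statistics.fmean(large) - mu

# Variability: spread of the (approximate) sampling distribution.
spread_small = statistics.pstdev(small)
spread_large = statistics.pstdev(large)

print(f"n = 10:  bias = {bias_small:+.3f}, spread = {spread_small:.3f}")
print(f"n = 100: bias = {bias_large:+.3f}, spread = {spread_large:.3f}")
```

Under these assumptions the estimated bias should sit close to zero for both sample sizes, while the spread should be clearly smaller for the larger samples.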

[Figure, four panels: High bias, low variability; Low bias, high variability; High bias, high variability; Low bias, low variability.]

More on Sampling Errors We are often concerned with how to manage the bias and variability of a statistic. To reduce the bias, we use random sampling. –Generally speaking, estimates drawn from an SRS are unbiased (which is why the SRS is so attractive). To reduce the variability of a statistic from an SRS, increase the sample size. There is a trade-off between bias and variability, however (i.e., we cannot make both very small).