Sampling Chapter 5.

Sample
A subset of the units in a population, expected to represent the whole population.
By measuring data on a sample, information on the entire population is gathered at a lower cost than with a census.
Some margin of error is necessarily accepted.

Probability vs. non-probability sampling
Probability sampling: each unit in the sampling frame is associated with a given probability of being included in the sample, which means that the probability of each potential sample is known.
Non-probability sampling: the extraction of sample units is not based on probability rules.

Inference and sampling error
Prior knowledge of the probability of all potential samples allows statistical inference.
Statistical inference is the generalization of sample statistics to the parameters of the target population, subject to a margin of uncertainty, or sampling error.
The probability laws ruling probability sampling allow one to ascertain how well the sample estimates reflect the true characteristics of the target population.
The sampling error can be estimated and used to assess the precision of sample estimates.
The sampling error is only a portion of the survey error (which also includes non-sampling error), but it has the advantage that it can be estimated and controlled using the information on the sampling method.

Characteristics and limits of non-probability sampling
The use of non-probability samples is common in marketing research, especially quota sampling (discussed later).
It can be argued that the sampling error is often much smaller than the error from non-sampling sources.
However, the problems with non-probability samples are that:
Selection of the sampling units is subjective, and it is impossible to assess scientifically the ability to avoid the potential biases of a non-probability sample.
Sampling error, precision and accuracy cannot be estimated.
Statistical methods to analyze sample data are based on probability assumptions (e.g. normality of the data distribution), which can only be ensured by probability extraction rules.

Sampling concepts
The sampling error depends on:
The sampling fraction (ratio between sample size and population size). Larger sampling fractions increase precision. However, the gain in precision decreases as the sampling fraction increases.
The data variability in the population. The less variable the population data, the more precise the sample estimates. However, population variability is rarely known and is generally estimated.
The precision of the sample estimator. The precision of an estimator (measured through the standard error of the estimator) is the variability of an estimate across multiple measurements (across different samples).
On the data variability in the population: if the target variable has a large dispersion around the mean, it is more likely that the computed sample statistics are distant from the true population mean, whereas if the variability is small even a small sample could return very precise statistics. Note that the concept of precision refers to the degree of variability and not to the distance from the true population value (which is accuracy). The population variability is measured by the population variance and standard deviation (see Appendix). These population parameters are usually unknown, as knowing them would require that the mean itself is known, which would make the sampling process irrelevant. Estimates of the population variability are obtained by computing variability statistics on the sample data, such as the sample standard deviation and sample variance.
Finally, the success of the sampling process depends on the precision of the sample estimators. This is appraised by variability measures for the sample statistics and should not be confused with the sample variance and standard deviation. Here the object of measurement is no longer the data variability, but rather the variability of the estimator (the sample statistic), regarded as a random variable distributed around the true population parameter across the sample space.
For example, if the researcher is interested in the population mean and a mean is computed on sample data, the precision of this estimate can be evaluated through the variance of the mean or its square root, the standard error of the mean. The distinction between standard deviation and standard error becomes apparent when we consider that a researcher could estimate the population standard deviation on a sample, and the precision of that sample standard deviation would be measured by a statistic called the standard error of the standard deviation.
Note that a precise sampling estimator is not necessarily an accurate one, although the two concepts are related. Accuracy measures closeness to the true population value, while precision refers to the variability of the estimator. For example, a sample mean estimator is more accurate than another when its estimated mean is closer to the true population mean, while it is more precise if its standard error is smaller. Accuracy is discussed in section

Standard deviation vs. standard error
The standard deviation measures the variability of a given variable (e.g. X) within the population or sample.
The standard error is a precision measure which refers to the variability of the sample estimator (e.g. the sample mean) across multiple estimates.
The standard error depends on the standard deviation, but they are not the same concept.

Accuracy and precision
Accuracy: the degree to which the sample estimate is close to the true population value. Maximum accuracy is obtained when the estimate equals the true population value.
Precision: the variability of the sample estimate in repeated measurements (across different samples). Maximum precision is obtained when the estimate is the same across all samples.
The standard error of an estimator is a measure of precision.

Samples and population: some terminology
Target population: the set of units which are the object of the research and from which the sample is extracted. The characteristics to be measured in the population are usually called parameters.
Sampling frame: a list of the population units.
Sample size: the number of units (n) in the sample.
Sample statistic: an estimate of the population parameter based on the sample observations.
Sampling distribution: the probability distribution of the sample statistics around the true population value.

The indirect problem and inference
If one extracted all potential samples from a population (the sampling space), then the sampling distribution would be exactly known.
However, this would be a rather pointless exercise, given that the true population parameter would already be known.
Thus, statisticians are interested in the indirect problem:
Only one sample is extracted.
Only the sample statistics are known.
The sampling distribution is not known exactly, but it can be ascertained from the probabilistic sampling method.
Given the sampling distribution and the sample statistics, one obtains estimates of the true population parameters through statistical inference.

Example – sample mean
Extract two elements from a population of four.
POPULATION: A=4; B=1; C=3; D=4 (population average = 3)
SAMPLING SPACE (sample means):
AB: 2.5
AC: 3.5
AD: 4
BC: 2
BD: 2.5
CD: 3.5

Example – the sampling distribution
The average of all sample means in the sampling space is three (equal to the true population mean).
None of the extracted samples exactly reflects the population (none has a mean of 3).
The mean absolute error which we commit by observing only two out of four population units is 4/6, or about 0.67 (the average of the absolute differences between each of the six sample means and 3). This is a direct measure of sampling error.
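The sampling space of this example is small enough to enumerate directly. The following is a minimal sketch in Python (the variable names are illustrative and not part of the original slides) that lists every sample of two units, computes each sample mean, and averages the absolute errors with respect to the population mean.

```python
from itertools import combinations
from statistics import mean

# Population from the example (unit label -> value)
population = {"A": 4, "B": 1, "C": 3, "D": 4}
population_mean = mean(population.values())

# Sampling space: every unordered sample of 2 units drawn without replacement
sample_means = {
    "".join(units): mean(population[u] for u in units)
    for units in combinations(sorted(population), 2)
}

# Average of the sample means across the whole sampling space
mean_of_sample_means = mean(sample_means.values())

# Mean absolute error committed by observing only two of the four units
mean_absolute_error = mean(
    abs(m - population_mean) for m in sample_means.values()
)

for label, m in sample_means.items():
    print(f"{label}: sample mean = {m}")
print("mean of sample means:", mean_of_sample_means)
print("mean absolute error:", round(mean_absolute_error, 2))
```

Enumerating the sampling space is only feasible for toy populations; it is shown here to make the notion of a sampling distribution concrete.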

Example – the indirect problem
Suppose we have observed only one sample, extracted randomly.
The probability extraction method and the sample observations allow us to:
Obtain an estimate of the population mean.
Obtain a precision estimate (the sampling error).
By combining the sample estimate with the sampling error, one can draw inference on the true population value, for example by defining a bracket which is likely to include the true value.

Example results
Suppose we extract the sample AB by simple random sampling.
The sample mean is 2.5.
The mean error within the sample is 0.5.
With a very rough (and inexact) assumption (that the mean error within the sample reflects the sampling error), we might claim that the true population value lies between [2.5 - 0.5] and [2.5 + 0.5], that is between two and three.
This is a rough example, but with large samples and probability theory, knowledge based on a single sample can lead to accurate conclusions on the whole population, accounting for sampling error.

In practice…
One sample is extracted.
The sample mean and the sample standard deviation are obtained.
An estimate of precision is obtained through an estimate of the standard error of the mean, which is a function of the sample standard deviation and the sample size.
Using the sample mean and the measure of precision, one draws conclusions on the population mean (see lecture 6).

Population parameters (in a population of N elements)
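Mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
Variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$
Standard deviation: $\sigma = \sqrt{\sigma^2}$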

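Sample statistics
Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
Sample standard deviation: $s = \sqrt{s^2}$
Unbiasedness: the sample variance uses the divisor n - 1 rather than n.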
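The difference between population parameters and sample statistics can be checked numerically. The sketch below uses Python's standard `statistics` module and reuses the four-unit population of the earlier example with the sample AB. It assumes the usual conventions: divisor N for the population variance and divisor n - 1 for the sample variance.

```python
import statistics

population = [4, 1, 3, 4]   # A, B, C, D from the earlier example (N = 4)
sample = [4, 1]             # the sample AB (n = 2)

# Population parameters: the variance uses the divisor N
pop_mean = statistics.mean(population)
pop_var = statistics.pvariance(population)
pop_sd = statistics.pstdev(population)

# Sample statistics: the variance uses the divisor n - 1
sample_mean = statistics.mean(sample)
sample_var = statistics.variance(sample)
sample_sd = statistics.stdev(sample)

print("population:", pop_mean, pop_var, round(pop_sd, 3))
print("sample:    ", sample_mean, sample_var, round(sample_sd, 3))
```

With only two observations the sample statistics are very unstable, which is why the slides stress that the example is a rough one.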

Simple random sampling
Each element of the population has a known and equal probability of selection.
Every element is selected independently from other elements.
The probability of selecting a given sample of n elements is computable (known).
The Central Limit Theorem implies that, for simple random samples with a sufficiently large sample size n (>40), the sampling distribution of the sample mean is approximately normal.

The normal distribution (again)
Recall the curve of measurement error.
With simple random sampling, the sample means follow the same probability distribution.
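A small simulation can illustrate why the sample means behave like draws from a normal curve. The sketch below is only an illustration under stated assumptions: the skewed population, the seed, the sample size of 50 and the number of replications are all hypothetical choices, not values from the slides.

```python
import random
from statistics import mean, pstdev

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical, strongly skewed population (exponentially distributed values)
population = [rng.expovariate(1.0) for _ in range(10_000)]
mu = mean(population)
sigma = pstdev(population)

n = 50                # sample size, above the rule-of-thumb threshold of 40
replications = 2_000  # number of simple random samples to draw

# Each replication is one simple random sample drawn without replacement
sample_means = [mean(rng.sample(population, n)) for _ in range(replications)]

# Standard error of the mean, ignoring the small finite population correction
standard_error = sigma / n ** 0.5

# Under normality, roughly 95% of sample means lie within 1.96 standard errors
inside = sum(abs(m - mu) <= 1.96 * standard_error for m in sample_means)
share_within = inside / replications

print("population mean:", round(mu, 3))
print("mean of sample means:", round(mean(sample_means), 3))
print("share within 1.96 standard errors:", round(share_within, 3))
```

Even though the individual values are far from normal, the share of sample means within 1.96 standard errors should be close to the 95% that the normal distribution predicts.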

Basic SRS sample statistics (unknown pop. variance)
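Two cases are considered: the mean case and the proportion case (p).
In each case, the sample standard deviation of X is used to obtain the standard error of the mean/proportion, which measures the precision of sample estimates.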
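The standard errors for the two cases can be sketched as follows. The formulas used here are the common textbook ones, s / sqrt(n) for the mean and sqrt(p(1 - p) / n) for the proportion. They are an assumption, since the slide's own formulas are not available, and they ignore any finite population correction. The function names and data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def standard_error_mean(values):
    """Estimated standard error of the sample mean: s / sqrt(n)."""
    return stdev(values) / sqrt(len(values))


def standard_error_proportion(p, n):
    """Estimated standard error of a sample proportion: sqrt(p(1-p)/n)."""
    return sqrt(p * (1 - p) / n)


# Hypothetical data: weekly spending reported by 8 respondents
spending = [12.0, 15.5, 9.0, 20.0, 14.0, 11.5, 18.0, 16.0]
print("sample mean:", mean(spending))
print("standard error of the mean:", round(standard_error_mean(spending), 3))

# Hypothetical proportion: 30% of 400 respondents bought the product
print("standard error of the proportion:",
      round(standard_error_proportion(0.30, 400), 4))
```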

Precision of estimators and sample size
The standard error increases with higher population variances and decreases with larger sample sizes.
However, the relative gain in precision decreases as sample size increases.
Very large sample sizes are not convenient, because the gain in precision is very small and the increase in costs is very large.
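The diminishing gain in precision follows from the square root of n in the standard error. As a minimal sketch, assuming a hypothetical sample standard deviation of 10 and the formula s / sqrt(n), each quadrupling of the sample size only halves the standard error:

```python
from math import sqrt

s = 10.0                         # hypothetical sample standard deviation
sizes = (100, 400, 1600, 6400)   # each sample size is four times the previous

# Standard error of the mean for each sample size: s / sqrt(n)
standard_errors = {n: s / sqrt(n) for n in sizes}

previous = None
for n, se in standard_errors.items():
    note = "" if previous is None else f" (reduced by {previous - se:.3f})"
    print(f"n = {n:5d}  standard error = {se:.3f}{note}")
    previous = se
```

The absolute reduction shrinks at every step while the number of interviews, and so the cost, grows fourfold.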

Accuracy and confidence level (1)
Suppose the sampling distribution is normal (as for simple random sampling).
The significance level α (further discussed in lecture 6) is the probability that the relative difference between the estimated sample mean and the true population mean is larger than a given relative accuracy level r:
$\Pr\left( \left| \frac{\bar{x} - \mu}{\mu} \right| > r \right) = \alpha$
where μ is the population mean and 1-α is the level of confidence.

Accuracy and confidence level (2)
The confidence level is chosen by the researcher.
For example, suppose we want a 95% confidence level: what is the value of the relative accuracy r?
In other words, if we extracted 100 different samples, we would expect to commit a relative error larger than r in only 5 out of 100.
Relative accuracy is expressed in percentage terms with respect to the population mean.

Example – relative accuracy in SRS
Suppose we set α=0.05 (confidence level of 95%).
The equation to compute relative accuracy with simple random sampling involves the following quantities. Accuracy depends on:
Sample size
Population size
Standard error of the mean
A constant value (t with subscript α/2) which depends on the confidence level and the sample size

Relative accuracy, sample size and population size
For larger population sizes it is not necessary to increase sample size.
A sample size of 500 guarantees an error below 5% for any population size.
Above a size of 500, it is better to consider spending money on reducing non-sampling errors.
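The weak dependence of relative accuracy on population size can be explored numerically. The sketch below is not the equation from the slides, which is not available here. It assumes a normal critical value in place of t, a finite population correction of sqrt(1 - n/N), and a hypothetical coefficient of variation (standard deviation divided by mean) of 0.5. The results change with the assumed variability.

```python
from math import sqrt
from statistics import NormalDist


def relative_accuracy(n, N, cv, alpha=0.05):
    """Relative half-width of the confidence interval for the mean.

    Assumptions (not taken from the slides): normal critical value instead
    of t, and finite population correction sqrt(1 - n/N).
    cv is the coefficient of variation (standard deviation / mean).
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    fpc = sqrt(1 - n / N)
    return z * cv / sqrt(n) * fpc


cv = 0.5  # hypothetical coefficient of variation
for N in (1_000, 10_000, 1_000_000):
    row = {n: round(relative_accuracy(n, N, cv), 3)
           for n in (100, 500, 2000) if n <= N}
    print(f"N = {N:9,d}: {row}")
```

Under these assumptions the relative error for a given n barely moves as N grows from ten thousand to one million, which is the point made above.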

Determining sample size
Factors influencing sample size (n):
Size of the population (N)
Variability of the population (s)
Desired level of accuracy (standard error $s_{\bar{x}}$ or relative accuracy r)
Level of confidence (1-α)
Budget constraint

Simple random sampling – determining sample size
To determine the sample size for a given relative error, the population variability needs to be estimated (or conservative assumptions can be made).
r is the relative level of precision.
t, as before, is a constant which depends on α and on the sample size.
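One common way to turn a target relative error into a sample size is sketched below. It is an illustration under explicit assumptions, not the formula from the slides: a normal critical value replaces t, which avoids the circular dependence of t on n, and the finite population correction is sqrt(1 - n/N). The function name and the coefficient of variation of 0.5 are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(N, cv, r, alpha=0.05):
    """Smallest n giving a relative error of at most r at confidence 1 - alpha.

    Assumptions (not taken from the slides): normal critical value instead
    of t, and finite population correction sqrt(1 - n/N).
    cv is the (estimated or assumed) coefficient of variation.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n0 = (z * cv / r) ** 2           # size needed for a very large population
    return ceil(n0 / (1 + n0 / N))   # adjusted for the population size N


# Hypothetical inputs: cv = 0.5, target relative error of 5%
for N in (1_000, 10_000, 1_000_000):
    print(f"N = {N:9,d}: n = {required_sample_size(N, 0.5, 0.05)}")
```

A larger assumed coefficient of variation, or a smaller r, increases the required size quadratically, which is why conservative assumptions about variability can be costly.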

The sampling design process
1. Define the target population, its elements and the sampling units.
2. Determine the sampling frame (list).
3. Select a sampling technique.
4. Determine the sample size.
5. Execute the sampling process.

Selection bias
Improper selection of sample units (ignoring a relevant control variable), so that the values observed in the sample are biased and the sample is not representative. Some units have a higher probability of being selected, without this being acknowledged in the sampling process. If the units with higher inclusion probabilities have specific characteristics that differ from the rest of the population, as is often the case, sample measurement will suffer from a significant bias.
Example: a survey is conducted to measure goat milk consumption, but the interviewers only select people in urban areas, who on average drink less goat milk.

The sampling techniques
Probability sampling: simple random sampling; systematic sampling; stratified sampling; cluster sampling; complex sampling techniques.
Non-probability sampling: convenience sampling; judgmental sampling; quota sampling; snowball sampling.

Simple random sampling
Each element of the population has a known and equal probability of selection.
Every element is selected independently from other elements.
The probability of selecting a given sample of n elements is computable (known).
Advantages: statistical inference is possible; it is easily understood.
Limitations: representative samples are large and expensive; standard errors are larger than in other probabilistic sampling techniques; sometimes it is difficult to carry out a truly random selection.

Systematic sampling
1. A list of the N elements in the population is compiled and ordered according to a specified variable, which may be unrelated to the target variable (similar to SRS) or related to the target variable (increased representativeness).
2. A sample size n is chosen.
3. A systematic step of k=N/n is set.
4. A random number s between 1 and k is extracted and represents the first element to be included.
5. The other elements selected are s+k, s+2k, s+3k…
Advantages: cheaper and easier than SRS; more representative if the order is related to the variable of interest (monotone); a sampling frame is not always necessary.
Limitation: less representative (biased) if the order is cyclical.
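The systematic selection steps can be sketched in a few lines of Python. This is a minimal illustration, assuming that the random start is drawn within the first k positions of the list, so that exactly n units are selected, and that k is taken as the integer part of N/n. The function and variable names are hypothetical.

```python
import random


def systematic_sample(frame, n, rng=random):
    """Select n units from an ordered frame using a fixed step k = N // n."""
    N = len(frame)
    k = N // n                 # systematic step (integer part of N / n)
    start = rng.randrange(k)   # random start among the first k positions (0-based)
    return [frame[start + i * k] for i in range(n)]


# Hypothetical ordered list of N = 100 units, labelled 1..100
frame = list(range(1, 101))
sample = systematic_sample(frame, 10, random.Random(1))
print(sample)
```

If the ordering of the list has a cycle with a period related to k, every selected unit falls at the same point of the cycle, which is the bias mentioned above.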

Stratified sampling
The population is partitioned into strata through control variables (stratification variables) closely related to the target variable, so that there is homogeneity within each stratum and heterogeneity between strata.
Simple random sampling is applied within each stratum of the population.
Proportionate sampling: the size of the sample from each stratum is proportional to the relative size of the stratum in the total population.
Disproportionate sampling: the size is also proportional to the standard deviation of the target variable in each stratum.
Advantages: gains in precision; all relevant subpopulations are included, even if small.
Limitations: stratification variables may not be easily identifiable; stratification can be expensive.
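The two allocation rules can be compared with a short sketch. The stratum names, sizes and standard deviations below are hypothetical, and the disproportionate rule is implemented as allocation proportional to stratum size times stratum standard deviation, which is one reading of the description above.

```python
def proportionate_allocation(n, stratum_sizes):
    """Allocate n in proportion to the stratum sizes N_h."""
    N = sum(stratum_sizes.values())
    return {h: round(n * N_h / N) for h, N_h in stratum_sizes.items()}


def disproportionate_allocation(n, stratum_sizes, stratum_sds):
    """Allocate n in proportion to N_h * S_h (size times standard deviation)."""
    weights = {h: stratum_sizes[h] * stratum_sds[h] for h in stratum_sizes}
    total = sum(weights.values())
    return {h: round(n * w / total) for h, w in weights.items()}


# Hypothetical strata: population sizes and standard deviations of the target
sizes = {"urban": 6000, "suburban": 3000, "rural": 1000}
sds = {"urban": 10.0, "suburban": 20.0, "rural": 40.0}

# Rounding may make the allocations sum to slightly more or less than n
print(proportionate_allocation(200, sizes))
print(disproportionate_allocation(200, sizes, sds))
```

In this hypothetical case the small but highly variable rural stratum receives a much larger share of the sample under the disproportionate rule.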

Post-stratification
Typical obstacle to stratified sampling: unavailability of a sampling frame for each of the strata.
It may be useful to proceed through simple random sampling and exploit the stratified estimator once the sample has been extracted, which increases efficiency.
All that is required is knowledge of the stratum sizes in the population, and that such post-stratum sizes are sufficiently large.
The advantage of post-stratification is two-fold:

Applying post-stratification (PS)
It allows one to correct the potential bias due to insufficient coverage of the survey (incomplete sampling frame).
PS allows one to correct the missing-response bias, provided that the post-stratification variable is related both to the target variable and to the cause of non-response.
It is carried out by extracting a Simple Random Sample (SRS) of size n and then classifying units into strata.
Instead of the usual SRS mean, a PS estimator is computed by weighting the means of the sub-groups by the size of each sub-group.
The procedure is identical to that of stratified sampling; the only difference is that the allocation into strata is made ex post.
The standard error for the PS mean estimator is larger than the stratified sampling one, because additional variability arises from the fact that the sample stratum sizes are themselves the outcome of a random process.
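The post-stratified mean can be sketched as a weighted average of the sub-group means. The example below assumes that the weights are the known stratum sizes in the population. The function name, strata and values are hypothetical.

```python
from statistics import mean


def post_stratified_mean(sample, population_sizes):
    """Weight each post-stratum's sample mean by its known population size.

    sample: list of (stratum, value) pairs from a simple random sample.
    population_sizes: known stratum sizes N_h in the population.
    """
    N = sum(population_sizes.values())
    estimate = 0.0
    for h, N_h in population_sizes.items():
        values = [y for stratum, y in sample if stratum == h]
        if not values:
            raise ValueError(f"no sampled units in stratum {h!r}")
        estimate += (N_h / N) * mean(values)
    return estimate


# Hypothetical SRS in which rural units happen to be under-represented
sample = [("urban", 2.0), ("urban", 4.0), ("urban", 3.0), ("rural", 8.0)]
population_sizes = {"urban": 500, "rural": 500}

srs_mean = mean(y for _, y in sample)
ps_mean = post_stratified_mean(sample, population_sizes)
print("SRS mean:", srs_mean)
print("post-stratified mean:", ps_mean)
```

The plain SRS mean is pulled toward the over-represented urban units, while the post-stratified mean restores the population shares of the two strata.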

Cluster sampling
1. The population is partitioned into clusters. Elements within the cluster should be as heterogeneous as possible with respect to the variable of interest (e.g. area sampling).
2. A random sample of clusters is extracted through SRS (with probability proportional to the cluster size).
2a. All the elements of the cluster are selected (one-stage cluster sampling).
2b. A probabilistic sample is extracted from the cluster (two-stage cluster sampling).
Limitations: less precision; inference can be difficult.
Advantages: reduced costs; higher feasibility.
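The one-stage and two-stage variants differ only in what happens inside the selected clusters. The sketch below is a simplified illustration: clusters are drawn with equal probability rather than with probability proportional to size, and all names and data are hypothetical.

```python
import random


def cluster_sample(clusters, n_clusters, units_per_cluster=None, rng=random):
    """One-stage (all units) or two-stage (SRS within cluster) cluster sample.

    clusters: dict mapping a cluster id to the list of its units.
    Simplifying assumption: clusters are drawn with equal probability.
    """
    chosen = rng.sample(sorted(clusters), n_clusters)  # stage 1: SRS of clusters
    selected = []
    for c in chosen:
        units = clusters[c]
        if units_per_cluster is None:
            selected.extend(units)                     # 2a: take every unit
        else:
            m = min(units_per_cluster, len(units))
            selected.extend(rng.sample(units, m))      # 2b: SRS within cluster
    return chosen, selected


# Hypothetical frame: 6 blocks with 5 households each
clusters = {f"block{i}": [f"block{i}-hh{j}" for j in range(5)] for i in range(6)}

rng = random.Random(7)
one_stage = cluster_sample(clusters, 2, rng=rng)
two_stage = cluster_sample(clusters, 2, units_per_cluster=2, rng=rng)
print(one_stage)
print(two_stage)
```

Only the selected blocks need to be listed in full, which is the source of the cost savings noted above.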

Complex sampling designs
Combination of different sampling methods to increase efficiency or reduce costs.
Two-stage sampling: two different sampling units, where the second-stage sampling units are a subset of the first-stage ones.
Typically, in household surveys a sample of cities or municipalities is extracted in the first stage, while in the second stage the actual sample of households is extracted out of the first-stage units.
Any probability design can be applied within each stage.
For example, municipalities can be stratified according to their populations in the first stage to ensure that the sample will include small and rural towns as well as large cities, while in the second stage one could apply area sampling, a particular type of cluster sampling where: each sampled municipality is subdivided into blocks on a map through geographical coordinates; blocks are extracted through simple random sampling; all households in a block are interviewed.

Probability sampling with SPSS (a)

Probability sampling with SPSS (b)

Sampling with SAS
SAS/STAT component procedures for the extraction of samples and statistical inference:
Proc SURVEYSELECT allows one to extract probability-based samples.
Proc SURVEYMEANS computes sample statistics taking into account the sample design.
Proc SURVEYREG estimates sample-based regression relationships.

Non-probability sampling
Non-probability sampling does not allow one to accompany sample estimates with evaluations of their precision and accuracy.
Still, non-probability sampling is a common practice in marketing research, especially quota sampling.
It is not necessarily biased or uninformative.
In some circumstances, for example when there is no sampling frame, it may be the only viable solution.
Key limit: in general, techniques for statistical inference cannot be used to generalize sample results to the population.

Convenience sampling
Only convenient elements enter the sample.
Advantages: cheapest method; quickest method.
Limitations: selection bias; non-representativeness; inference is not possible.

Judgmental sampling
Selection is based on the judgment of the researcher.
Advantages: low cost; quick.
Limitations: non-representativeness; inference is not possible; subjective (potential selection bias).

Quota sampling
Define control categories (quotas) for the population elements, such as sex, age…
Apply a restricted judgmental sampling so that the quotas in the sample are the same as those in the population.
Advantages: cheapest method; quickest method.
Limitations: there is no guarantee that the sample is representative (this depends on the relevance of the control characteristics chosen); many sources of selection bias; no assessment of sampling error.

Snowball sampling
A first small sample is selected randomly.
Respondents are asked to identify others who belong to the population of interest.
The referrals will have demographic and psychographic characteristics similar to those of the referrers.
Advantages: lower costs; low variability; useful for rare populations.
Limitation: inference is not possible.
