The sampling distribution of a statistic
The basic idea is that we can characterize the distribution of a statistic across all possible samples. We usually draw a single random sample and use a statistic computed from it (e.g. the mean) to draw inferences, based on what (theoretically) would happen if we drew lots of (infinitely many) random samples. The distribution of the values of a statistic over all possible samples is the sampling distribution of the statistic. Sometimes theoretical arguments lead to a mathematical representation of the sampling distribution but, in general, we can't enumerate the values of the statistic in all possible samples.
Figure: Sampling distribution of the mean of a sample.
Monte Carlo Estimates
Simulation allows us to actually draw multiple random samples and examine empirically what occurs in them. We can estimate p-values, standard errors, confidence intervals, etc. from these samples.
The process for simulating the sampling distribution for some statistic:
1. Generate multiple random samples.
2. Compute the statistic for each sample. The collection of these calculated statistics provides an approximate sampling distribution (ASD).
3. Analyze the ASD to draw conclusions.
Simulation using the Data Step
Now we have to draw multiple samples of a given size. The “classic” example is the sampling distribution of the mean.
Sampling Distribution of the Mean, uniform(0,1)
1. Generate random samples.
    %let obs = 10;      /* size of each sample */
    %let reps = 1000;   /* number of samples */
    %let seed = 54321;
    data SimUni;
        call streaminit(&seed);
        do rep = 1 to &reps;
            do i = 1 to &obs;
                x = rand("Uniform");
                output;
            end;
        end;
    run;
Note the order of the do loops. The data are written already ordered by rep, so the BY-group processing used in the next step needs no sort step. A good habit is to start with a small number of reps (I usually use 10) and check the code.
2. Compute mean for each sample
    proc means data=SimUni noprint;
        by rep;
        var x;
        output out=OutUni mean=MeanX;
    run;
    proc print data=outuni(obs=10);
    run;
This could also be done with PROC SQL.
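A minimal sketch of the SQL alternative mentioned above, assuming the SimUni data set from step 1. The output name OutUniSQL is a hypothetical name chosen here so that the PROC MEANS result is not overwritten.

```sas
/* Sketch: one row per sample (rep), holding that sample's mean. */
/* OutUniSQL is an illustrative data set name, not from the slides. */
proc sql;
    create table OutUniSQL as
    select rep, mean(x) as MeanX
    from SimUni
    group by rep;
quit;
```

Unlike BY-group processing, GROUP BY does not require the input to be sorted by rep.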
3. Analyze ASD: summarize and create histogram
    proc means data=OutUni N Mean Std P5 P95;
        var MeanX;
    run;
    proc univariate data=OutUni;
        var MeanX;
        label MeanX = "Sample Mean of U(0,1) Data";
        histogram MeanX / normal;
        ods select Histogram moments goodnessoffit;
    run;
These are our simulated estimates of the mean of a sample of 10 independent U(0,1) values.
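As a rough check on the simulated summary, the standard results for the uniform distribution give a theoretical benchmark. This is a sketch that is not part of the original slides; the symbols $\bar{X}$ and $n$ are notation introduced here.

```latex
\begin{aligned}
X_i \sim U(0,1):\quad & E[X_i] = \tfrac{1}{2}, \qquad \operatorname{Var}(X_i) = \tfrac{1}{12} \\
E[\bar{X}] &= \tfrac{1}{2} \\
\operatorname{SD}(\bar{X}) &= \sqrt{\frac{1}{12\,n}} = \sqrt{\frac{1}{120}} \approx 0.0913 \qquad (n = 10)
\end{aligned}
```

The Mean and Std reported for MeanX should land close to 0.5 and 0.091, up to simulation error.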
Examine Percentiles
    proc univariate data=OutUni noprint;
        var MeanX;
        output out=Pctl95 N=N mean=MeanX pctlpts=2.5 97.5 pctlpre=Pctl;
    run;
    proc print data=Pctl95 noobs;
    run;
PROC UNIVARIATE allows estimating custom percentiles.
Estimate Probabilities from ASD
E.g. what is the probability that the mean of a sample is > .7?
    proc sql;
        select sum(meanx>.7)/count(*) as prob from outuni;
    quit;
Things like this are often of interest, e.g., you got a mean of .9; what is the probability of this occurring if the mean is 0?
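The query above can be read as a Monte Carlo proportion. The following is a sketch in notation introduced here (not from the slides): $R$ is the number of simulated samples and $\bar{x}_r$ is the mean of sample $r$. The standard error is the usual binomial approximation.

```latex
\hat{p} = \frac{1}{R}\sum_{r=1}^{R} \mathbf{1}\{\bar{x}_r > 0.7\},
\qquad
\operatorname{SE}(\hat{p}) \approx \sqrt{\frac{\hat{p}\,(1-\hat{p})}{R}}
```

For a rough comparison, a normal approximation with mean $0.5$ and standard deviation $\sqrt{1/120} \approx 0.0913$ gives $z \approx 2.19$ and a probability of roughly $0.014$. With only 10 observations per sample, the simulated value need not match this exactly.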
Sampling Distribution of statistics from normal data
1. Simulate data
    %let obs = 31;       /* size of each sample */
    %let reps = 10000;   /* number of samples */
    %let seed = 54321;
    data Normals(drop=i);
        call streaminit(&seed);
        do rep = 1 to &reps;
            do i = 1 to &obs;
                x = rand("Normal");
                output;
            end;
        end;
    run;
2. Compute statistics for each sample
    proc means data=Normals noprint;
        by rep;
        var x;
        output out=StatsNorm mean=SampleMean median=SampleMedian var=SampleVar;
    run;
3. Analyze Approximate Sampling Distribution
Calculate variances of the sampling distributions of the mean and median.
    proc means data=StatsNorm Var;
        var SampleMean SampleMedian;
    run;
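For reference, the two simulated variances can be compared with standard results for N(0,1) data. This is a sketch not taken from the slides: the expression for the median is a large-sample approximation, so agreement at $n = 31$ is only approximate.

```latex
\begin{aligned}
\operatorname{Var}(\bar{X}) &= \frac{\sigma^2}{n} = \frac{1}{31} \approx 0.0323 \\
\operatorname{Var}(\tilde{X}) &\approx \frac{\pi\,\sigma^2}{2\,n} = \frac{\pi}{62} \approx 0.0507
\end{aligned}
```

The ratio of about $\pi/2 \approx 1.57$ shows that, for normal data, the sample median is a noisier estimator of the center than the sample mean.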
3. Analyze Approximate Sampling Distribution
Plot kernel density estimates.
    proc sgplot data=StatsNorm;
        title "Sampling Distributions of Mean and Median for N(0,1) Data";
        density SampleMean / type=kernel legendlabel="Mean";
        density SampleMedian / type=kernel legendlabel="Median";
        refline 0 / axis=x;
    run;
3. Analyze Approximate Sampling Distribution
Examine the sampling distribution of the variance and fit it to a chi-square distribution.
    /* scale the sample variances by (N-1)/sigma^2, with N = &obs and sigma^2 = 1 */
    data StatsNorm;
        set StatsNorm;
        ScaledVar = SampleVar * (&obs-1)/1;
    run;
    /* Fit chi-square distribution to data */
    proc univariate data=StatsNorm;
        label ScaledVar = "Variance of Normal Data (Scaled)";
        histogram ScaledVar / gamma(alpha=15 sigma=2);
        ods select Histogram;
    run;
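The gamma parameters in the histogram statement follow from the standard relationship between the chi-square and gamma distributions. This is a sketch in notation introduced here: $S^2$ is the sample variance, $n$ the sample size, and the gamma distribution is written as (shape, scale).

```latex
\begin{aligned}
\frac{(n-1)\,S^2}{\sigma^2} &\sim \chi^2_{\,n-1}
\qquad \text{for i.i.d. } N(\mu, \sigma^2) \text{ data} \\
\chi^2_{k} &\equiv \operatorname{Gamma}\!\left(\text{shape} = \tfrac{k}{2},\ \text{scale} = 2\right) \\
n = 31 \;\Rightarrow\; k = 30 &\;\Rightarrow\; \text{shape} = 15,\ \text{scale} = 2
\end{aligned}
```

If the sample size is changed, the alpha= value should be changed to $(n-1)/2$ to keep the overlay a chi-square fit.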
The effect of sample size
Generate samples
    %let reps = 1000;   /* number of samples per sample size */
    %let seed = 54321;
    data SimUniSize;
        call streaminit(&seed);
        do obs = 10, 30, 50, 100;    /* sample sizes */
            do rep = 1 to &reps;
                do i = 1 to obs;
                    x = rand("Uniform");
                    output;
                end;
            end;
        end;
    run;
Compute mean for each sample
    proc means data=SimUniSize noprint;
        by obs rep;
        var x;
        output out=OutStats mean=SampleMean;
    run;
    proc print data=outstats(obs=10);
    run;
Summarize approx. sampling distribution of statistic
    proc means data=OutStats Mean Std;
        class obs;
        var SampleMean;
    run;
    /* save the mean and std of SampleMean for each sample size */
    proc means data=OutStats noprint;
        class obs;
        var SampleMean;
        output out=out(where=(_TYPE_=1)) Mean=Mean Std=Std;
    run;
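The standard deviations in this summary can be compared with the theoretical values for the mean of $n$ independent U(0,1) observations. This is a sketch using the standard result $\operatorname{Var}(X_i) = 1/12$, which is not stated in the slides.

```latex
\operatorname{SD}(\bar{X}_n) = \frac{1}{\sqrt{12\,n}}:
\qquad
\begin{array}{c|cccc}
n & 10 & 30 & 50 & 100 \\ \hline
\operatorname{SD}(\bar{X}_n) & 0.0913 & 0.0527 & 0.0408 & 0.0289
\end{array}
```

The spread shrinks in proportion to $1/\sqrt{n}$, so going from 10 to 100 observations reduces it by a factor of about 3.2, not 10.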
Use IML to create data to graph
    proc iml;
        use out;                          /*output dataset from proc means*/
        read all var {obs Mean Std};      /*create vectors*/
        close out;                        /*close the dataset*/
        NN = obs;                         /*sample sizes*/
        x = T( do(0.1, 0.9, 0.0025) );
        create Convergence var {N x pdf}; /*create an empty data set*/
        do i = 1 to nrow(NN);
            N = j(nrow(x), 1, NN[i]);
            pdf = pdf("Normal", x, Mean[i], Std[i]);
            append;                       /*add these observations to data set*/
        end;
        close Convergence;                /*close the dataset*/
    quit;
Graph Created Data
    ods graphics / ANTIALIASMAX=1300;
    proc sgplot data=Convergence;
        title "Sampling Distribution of Sample Mean";
        label pdf = "Density" N = "Sample Size";
        series x=x y=pdf / group=N;
    run;