Planning a Simulation Study

Naomi Altman Department of Statistics

A small example
Comparing 2 regression methods:
1) Take the range of X, divide it in half, and use the line through the points (X~, Y~) of the two halves, where ~ denotes the median.
2) Use least squares regression.
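A minimal Python sketch of the two methods, reading method 1 as the line through the (median X, median Y) points of the lower and upper halves of the X range. The function names and the rule that a point exactly at the midpoint goes to the lower half are assumptions made for illustration, not part of the slides.

```python
import statistics


def least_squares_slope(x, y):
    """Ordinary least squares slope, Sxy / Sxx."""
    xbar = statistics.fmean(x)
    ybar = statistics.fmean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy / sxx


def median_slope(x, y):
    """Slope of the line through the median points of the two halves.

    The range of X is split at its midpoint. Assumption: a point exactly
    at the midpoint is placed in the lower half.
    """
    mid = (min(x) + max(x)) / 2
    lower = [(xi, yi) for xi, yi in zip(x, y) if xi <= mid]
    upper = [(xi, yi) for xi, yi in zip(x, y) if xi > mid]
    xl = statistics.median([p[0] for p in lower])
    yl = statistics.median([p[1] for p in lower])
    xu = statistics.median([p[0] for p in upper])
    yu = statistics.median([p[1] for p in upper])
    return (yu - yl) / (xu - xl)


# Sanity check on error-free data from the line 2 + 5x:
# both methods should recover a slope of 5.
x = list(range(1, 26))
y = [2 + 5 * xi for xi in x]
print(least_squares_slope(x, y), median_slope(x, y))
```

Running both functions on error-free data is a cheap first check that the code is right before any simulation is started.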

A small example
[Results table not recovered; columns: slope by least squares, slope by my method]

Where do we go from here?
What line should we use and does it matter?
Some lines used by the class: 2+5x; 3+7x (HW 6); 3x; 10x; slope=rnorm(1000,10,3)
What sample size should we use and does it matter?
Some choices by the class: 25 (HW 6), 30, 1000
What error SD should we use and does it matter?
Some choices by the class: 2, 3
What x-values should be used?
Some choices: 1:25; rnorm(25,0,10); regenerated for each iteration
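One combination of these choices can be sketched as a small Monte Carlo run. The sketch below assumes the line 3+7x, n=25, fixed x-values 1:25, and normal errors with SD 3; the pairing of these values, the seed, the number of simulated samples, and all function names are illustrative assumptions. Python's `random.gauss` plays the role of R's `rnorm` here.

```python
import random
import statistics


def ls_slope(x, y):
    """Least squares slope, Sxy / Sxx."""
    xbar, ybar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx


def median_slope(x, y):
    """Slope through the median points of the lower and upper halves of X."""
    mid = (min(x) + max(x)) / 2
    lo = [(xi, yi) for xi, yi in zip(x, y) if xi <= mid]
    hi = [(xi, yi) for xi, yi in zip(x, y) if xi > mid]
    xl, yl = (statistics.median([p[k] for p in lo]) for k in (0, 1))
    xu, yu = (statistics.median([p[k] for p in hi]) for k in (0, 1))
    return (yu - yl) / (xu - xl)


def simulate(n_sims=1000, b0=3.0, b1=7.0, sd=3.0, seed=1):
    """Return the slope estimates of both methods over n_sims samples."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    x = list(range(1, 26))             # x-values held fixed across samples
    ls, med = [], []
    for _ in range(n_sims):
        y = [b0 + b1 * xi + rng.gauss(0, sd) for xi in x]
        ls.append(ls_slope(x, y))
        med.append(median_slope(x, y))
    return ls, med


ls, med = simulate()
for name, est in (("least squares", ls), ("median slope", med)):
    print(name, statistics.fmean(est), statistics.variance(est))
```

Changing one argument of `simulate` at a time (slope, SD, or the x-values) is a simple way to see whether each choice matters.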

Working from what you know
Properties of LSE:
If $E(\varepsilon)=0$ then $E(\hat\beta_1)=\beta_1$.
If $\mathrm{Var}(\varepsilon)=\sigma^2$ then $\mathrm{Var}(\hat\beta_1)=\sigma^2/S_{xx}$.
If $\varepsilon \sim N(0,\sigma^2)$ then $\hat\beta_1$ has minimum MSE among unbiased estimators.
$SSR=\hat\beta_1^2 S_{xx}$
Signal to noise ratio ~ $\beta_1^2 S_{xx}/\sigma^2$
Properties of slope of medians:
If $X_L$ and $X_U$ are the median X values, the values of the true line at those points are $\beta_0+\beta_1 X_k$, $k=L,U$.
What is the median of the Y's? What is the variance of the median of the Y's? Could we show that this estimator is unbiased?
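The least squares formulas above can be evaluated before any simulation is run. The sketch below computes Sxx for x = 1:25 and then the theoretical slope variance and signal to noise ratio for two (slope, SD) pairs; the pairings and the function names are assumptions chosen for illustration.

```python
def sxx(x):
    """Sum of squared deviations of x about its mean."""
    xbar = sum(x) / len(x)
    return sum((xi - xbar) ** 2 for xi in x)


def slope_variance_and_snr(b1, sd, x):
    """Theoretical Var(slope) = sd^2 / Sxx and SNR = b1^2 * Sxx / sd^2."""
    s = sxx(x)
    return sd ** 2 / s, b1 ** 2 * s / sd ** 2


x = list(range(1, 26))
print("Sxx =", sxx(x))
for b1, sd in [(5.0, 2.0), (7.0, 3.0)]:   # hypothetical pairings
    var_slope, snr = slope_variance_and_snr(b1, sd, x)
    print(f"slope={b1}, sd={sd}: Var(slope)={var_slope:.5f}, SNR={snr:.1f}")
```

These values give a baseline against which the simulated variance of the least squares slope can be checked.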

What is interesting
Errors that are not normal: heavy tails, skewed
Errors that are not independent
Different sample sizes?

Picking Simulation Conditions
We also know that regression is defined as the mean of Y for each value of X. If Y does not have a mean (e.g. Cauchy) this might be interesting. We do not know anything about the median slope estimator. Statistical theory could be used to derive some properties. But there is one obvious property - since we split at the midpoint, it will not be resistant to leverage. X-distributions which are skewed or have a high leverage point might be interesting. It would be easy to show that the intercept does not matter.
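The leverage remark can be illustrated without any random numbers. The sketch below splits the X values at the midpoint of their range and counts the points in each half; the function name and the example x-values (including the outlying value 1000) are hypothetical.

```python
def split_at_midpoint(x):
    """Split x at the midpoint of its range (ties go to the lower half)."""
    mid = (min(x) + max(x)) / 2
    lower = [xi for xi in x if xi <= mid]
    upper = [xi for xi in x if xi > mid]
    return lower, upper


evenly_spaced = list(range(1, 26))
with_leverage_point = list(range(1, 25)) + [1000]

for label, x in (("1:25", evenly_spaced), ("1:24 plus 1000", with_leverage_point)):
    lower, upper = split_at_midpoint(x)
    print(label, "lower half:", len(lower), "upper half:", len(upper))
```

With a single extreme X value the upper half contains only that point, so the upper median point is that one observation and the slope estimate depends entirely on it.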

What is interesting
How do we compare N(0,1) with:
t(1): no moments
t(2): mean 0, variance ∞
t(d), d > 2: mean 0, variance d/(d-2)
What about Bin(n,p): mean=np, var=np(1-p)?
We probably want to separate issues regarding distribution from issues about variance - all distributions should be scaled to have the same variance. Errors need to have the same meaning - e.g. additive errors should have mean 0.
If the distributions we are comparing do not have moments, we need to be careful about using the variance and MSE of the estimator. Also, it would be better to match center and "variability" based on quantiles, which exist for all distributions.
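Matching on quantiles can be sketched for the t(1) (Cauchy) case. The example below centers both distributions at median 0 and rescales the Cauchy so that its interquartile range equals that of N(0,1). Using the IQR as the measure of "variability", and the function names, are assumptions; other quantile-based scales would work the same way.

```python
import math
import random
from statistics import NormalDist


def cauchy_quantile(p, scale=1.0):
    """Quantile function of a Cauchy (t with 1 df) centered at 0."""
    return scale * math.tan(math.pi * (p - 0.5))


# Interquartile ranges of N(0,1) and of the standard Cauchy.
normal_iqr = NormalDist().inv_cdf(0.75) - NormalDist().inv_cdf(0.25)
cauchy_iqr = cauchy_quantile(0.75) - cauchy_quantile(0.25)

# Scale factor that gives the Cauchy the same IQR as N(0,1).
scale = normal_iqr / cauchy_iqr


def matched_cauchy_errors(n, rng):
    """Draw n Cauchy errors with median 0 and the N(0,1) IQR (inverse CDF method)."""
    return [cauchy_quantile(rng.random(), scale) for _ in range(n)]


rng = random.Random(7)
errors = matched_cauchy_errors(25, rng)
print("scale factor:", scale)
```

Both error distributions then have median 0 and the same middle 50% spread, so differences in estimator performance can be attributed to tail behavior rather than to scale.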

What should vary?
Typically in a regression study, we vary the errors. But many regression methods that perform well for non-normal data are very sensitive to the choice of X.

Summary
Some things to think about: What should the data "look like"? You might use "realistic" data by doing a descriptive analysis of a real data set, and then simulating data to "look like" the real data. You might want to explore some property like skewness or long tails.

Skewness, etc.
There are lots of skewed distributions. Which one should you pick? Does it matter?

Sample Size
How big does "n" have to be to get "asymptotic" results? How many "n" values do you need? How many samples do you need to simulate? How does computing time depend on n?
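One common way to approach "how many samples do you need to simulate" is through the Monte Carlo standard error: the mean of B simulated estimates has standard error of about sd/sqrt(B), where sd is the standard deviation of the estimator across samples. The sketch below inverts that relation; the function name is hypothetical, and sd would normally come from a small pilot run.

```python
import math


def n_sims_needed(sd_of_estimator, target_se):
    """Smallest B with sd_of_estimator / sqrt(B) <= target_se."""
    return math.ceil((sd_of_estimator / target_se) ** 2)


# Example: an estimator with SD 2.0 across samples and a target
# Monte Carlo standard error of 0.25 for its simulated mean.
print(n_sims_needed(2.0, 0.25))  # 64
```

Halving the target standard error multiplies the required number of simulated samples by four, which is worth knowing before computing time is budgeted.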

Computing Time
How slow is "too slow"? Can you improve your computer code? Should you switch to a faster computer system and if so, how much of your time will that take?

Computing
Should you write your functions "from scratch" or use someone else's code that may not be 100% what you want? If you write your own code, how can you be sure it is right?

What do you "save"?
If you save everything, you may slow down the simulation too much. If you save only what you need now, you may need to rerun the simulation. E.g. many people computed both the slope and intercept, even though we needed only the slope.

How do you vary conditions?
E.g. if you are looking at different error variances, you could generate N(0,1), N(0,3) and N(0,20) errors independently (second parameter the SD) or, usually better, generate N(0,1) errors and then multiply them by 3 and by 20 to obtain the others. Because every condition then uses the same random draws, this reduces the variance in your comparisons.
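The "generate once, then rescale" idea can be sketched as follows. The seed, the sample size of 25, and the variable names are illustrative assumptions, and the second parameter of N(0,3) and N(0,20) is read as the SD, consistent with multiplying by 3 and by 20.

```python
import random
import statistics

rng = random.Random(2024)

# One set of N(0,1) draws, shared by every error-SD condition.
z = [rng.gauss(0, 1) for _ in range(25)]

# Rescale the same draws instead of drawing new errors per condition.
errors_by_sd = {sd: [sd * zi for zi in z] for sd in (1, 3, 20)}

for sd, errors in errors_by_sd.items():
    print(sd, statistics.stdev(errors))
```

Because the three error vectors differ only by a constant factor, any difference in results between conditions comes from the change in SD and not from a different random sample.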

Summarizing Results
What plots and tables will you use to summarize the results? How will you assess if something "interesting" has happened? I generally try to plan these early in the study, so I do not have to redo the simulations.
