

Jack Davis Andrew Henrey FROM N00B TO PRO

PURPOSE
Create a simulator from scratch that:
  Generates data from a variety of distributions
  Makes a response variable from a known function of the data (plus an error term)
  Constructs a linear model that estimates the coefficients of the function
  Repeats generation and modeling many times to compare the average estimates of the linear model to the known parameters
Package the whole thing nicely into a function that we can call in a single line in later work.
If you're experienced, the commands themselves may seem trivial.

OUTLINE
1) Learning how to learn
2) Randomly Generating Data
3) Data Frames and Manipulation
4) Linear Models
BREAK – Quality of presenter improves
5) Running loops
6) Function Definition
7) More advanced function topics
8) Using functions
9) A short simulation study

LEARNING HOW TO LEARN – JACK DAVIS
Google "CRAN Packages" to get the package list. From there you can get a description of every command in a package.
?? followed by a term searches for commands related to that term: ??plot will find commands related to plot.
? followed by a command name calls up the help file for that command: ?abline gives the help file for the abline() command.
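The help shortcuts above have function equivalents that can be stored and scripted. The sketch below assumes a standard R installation with help files available; the variable names `abline_help` and `related` and the search term "regression" are illustrative choices, not part of the slides.

```r
# At the console you would normally just type ?abline or ??regression.
# help() is the function behind ?, and help.search() is the function behind ??.
abline_help <- help("abline")          # same lookup as ?abline
related <- help.search("regression")   # same search as ??regression

# Printing either object displays the help page or the list of matches:
# print(abline_help)
# print(related)

# Many help pages end with runnable examples; example() runs them, e.g.
# example(abline)
```

Storing the result instead of printing it is only a convenience for this sketch; in interactive use the `?` and `??` forms are shorter.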

LEARNING HOW TO LEARN – JACK DAVIS
Exercises:
Name one function in the darts game package.
What is the email address of the author of the Texas Holdem simulation package?
(Bonus) Tell the author about your day via email; s/he likes hearing from fans.
Find a function to make a histogram.
Find some example code on the heatmap() command.

RANDOMLY GENERATING DATA – JACK DAVIS
The r* commands (r followed by a distribution name) randomly generate data from a distribution.
rnorm(n, mean, sd) generates from the normal distribution (default N(0,1)).
rexp(n, rate) generates from the exponential distribution.
rbinom(n, size, prob) generates from the binomial distribution.
rt(n, df) generates from Student's t. (Its mean is zero for df > 1, so setting a mean is up to you.)
set.seed() allows you to generate the same data every time, so you or others can verify work.
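A minimal sketch of these generators, showing that resetting the seed reproduces the same draws. The seed value 42, the sample sizes, and the variable names are arbitrary choices for illustration.

```r
# Setting the same seed before each draw gives identical results.
set.seed(42)
a <- rnorm(5, mean = 0, sd = 1)
set.seed(42)
b <- rnorm(5, mean = 0, sd = 1)
same_draws <- identical(a, b)   # TRUE: the two vectors match exactly

# Other generators follow the same r<distribution>(n, ...) pattern.
counts <- rbinom(5, size = 10, prob = 0.3)   # successes out of 10 trials
waits <- rexp(5, rate = 2)                   # positive waiting times
shifted_t <- rt(5, df = 5) + 3               # Student's t shifted to have mean 3
```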

RANDOMLY GENERATING DATA – JACK DAVIS
Set a random seed.
Generate a vector of 50 values from the Normal (mean=10, sd=4) distribution; name the vector x1.
Do the same with:
  Poisson (lambda = 5), named x2
  Exponential (rate = 1/7), named x3
  Student's t distribution (df = 5), with a mean of 5, named x4
  Normal (mean=0, sd=20), named err
Make a new variable y, and let it be x1 + 15x2 – 12x3 – 10x4 + err.
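One possible way to work this exercise is sketched below. The seed value 2024 is an arbitrary assumption, and rpois() is used for the Poisson draw even though the previous slide does not list it; it follows the same r<distribution> naming pattern.

```r
set.seed(2024)   # assumed seed; any fixed integer works
n <- 50

x1 <- rnorm(n, mean = 10, sd = 4)
x2 <- rpois(n, lambda = 5)
x3 <- rexp(n, rate = 1/7)
x4 <- rt(n, df = 5) + 5          # rt() is centred at 0, so add 5 to get mean 5
err <- rnorm(n, mean = 0, sd = 20)

# Response built from a known function of the data plus an error term.
y <- x1 + 15 * x2 - 12 * x3 - 10 * x4 + err
```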

DATA FRAMES – JACK DAVIS
data.frame() makes a data frame object from the vectors listed inside the ().
The advantage of having a data frame is that it can be treated as a single object.
Data frames, models, and even matrix decompositions can be objects in R.
You can call parts of objects by name using $.
model$coef or model$coefficients will bring up the estimated coefficients.
If no such part exists, you'll get a NULL response. Example: Andrew$height
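A small sketch of building a data frame and pulling parts out with $. The data frame `people` and its columns are made up for illustration.

```r
# Two vectors bundled into one object.
people <- data.frame(
  name = c("Andrew", "Jack"),
  shoe_size = c(10, 11)
)

sizes <- people$shoe_size    # an existing column comes back as a vector
missing_part <- people$height   # no such column, so the result is NULL
```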

DATA FRAMES – JACK DAVIS
Exercises:
Make a data.frame() of x1, x2, x3, x4, and y. Name it dat. (If you're stuck from the last part, run "Q3-dataframethis.txt" first.)
Use index indicators like dat[4,3], dat[2:7,3], dat[4,], and dat[4,-1] to get:
  The 3rd row, 5th entry of dat
  The 2nd – 7th values of the 5th column
  The entire 3rd row
  The 3rd row without the 1st entry
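The four index forms named in the exercise behave as follows on a small made-up data frame (`toy` is hypothetical and is not the exercise's `dat`). Indexing is always [row, column].

```r
toy <- data.frame(a = 1:8, b = (1:8) * 10, c = (1:8) * 100)

single <- toy[4, 3]        # row 4, column 3: one value
slice <- toy[2:7, 3]       # rows 2 to 7 of column 3: a vector of six values
whole_row <- toy[4, ]      # all of row 4: a one-row data frame
row_minus <- toy[4, -1]    # row 4 with column 1 dropped
```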

LINEAR MODELS – JACK DAVIS
The result of the lm() function is an object.
Example: mod = lm(y ~ x1 + I(x2^2) + x1:x2, data=dat)
Useful parts: mod$fitted, mod$residuals
Useful functions: summary(mod), predict(mod, newdata)
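A runnable sketch of the example formula. The data here are simulated only so the block is self-contained; the seed, the coefficients used to build y, and the `newdata` values are assumptions for illustration.

```r
set.seed(1)   # assumed seed
dat <- data.frame(x1 = rnorm(50, mean = 10, sd = 4), x2 = rpois(50, lambda = 5))
dat$y <- 2 + 3 * dat$x1 + 0.5 * dat$x2^2 + rnorm(50, mean = 0, sd = 2)

# I(x2^2) adds a squared term; x1:x2 adds the interaction only.
mod <- lm(y ~ x1 + I(x2^2) + x1:x2, data = dat)

fitted_head <- head(mod$fitted.values)   # mod$fitted also works via partial matching
resid_head <- head(mod$residuals)
model_summary <- summary(mod)            # coefficient table, R-squared, and more

# predict() needs a data frame with the same predictor names.
newdata <- data.frame(x1 = c(8, 12), x2 = c(4, 6))
pred <- predict(mod, newdata)
```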

LINEAR MODELS – JACK DAVIS
Use the lm command to create a linear model of y as a function of x1, x2, x3, and x4 additively, using the dat data; name it mod. (No interactions or transformations.)
Get the summary of mod.
Display the estimated coefficients with no other values.
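A sketch of one way to approach this exercise. The data are regenerated here so the block runs on its own, and the seed is an arbitrary assumption; coef() is used as one standard way to show only the estimates.

```r
set.seed(7)   # assumed seed
n <- 50
dat <- data.frame(
  x1 = rnorm(n, mean = 10, sd = 4),
  x2 = rpois(n, lambda = 5),
  x3 = rexp(n, rate = 1/7),
  x4 = rt(n, df = 5) + 5
)
dat$y <- dat$x1 + 15 * dat$x2 - 12 * dat$x3 - 10 * dat$x4 + rnorm(n, mean = 0, sd = 20)

# Additive model: main effects only.
mod <- lm(y ~ x1 + x2 + x3 + x4, data = dat)

mod_summary <- summary(mod)
estimates <- coef(mod)   # same values as mod$coefficients
```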

BREAK This slide unintentionally left 98% blank

OUTLINE
1) Learning how to learn
2) Randomly Generating Data
3) Data Frames and Manipulation
4) Linear Models
BREAK – Quality of presenter improves deteriorates
5) Running loops
6) Function Definition
7) More advanced function topics
8) Using functions
9) A short simulation study

RUNNING YOU FOR A LOOP – ANDREW HENREY
Similar to other programming languages, loops in R allow you to repeat the same block of code several times.
Unlike in many other programming languages, large loops in R can be exceedingly slow.
Any loop of fewer than about 100,000 total iterations is not going to give you much trouble in terms of time.

RUNNING YOU FOR A LOOP – ANDREW HENREY
An R loop that executes a million commands takes about a second. Conditions vary wildly.
Generating 100,000 data sets of size 50,000 and looping through each data set to calculate its mean would take longer to run than Jack Davis heading up Burnaby Mountain (ouch).
"Sup d00dz late for tutorial" (Jack)
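To get a feel for loop cost on your own machine, a rough timing sketch follows. It compares an explicit loop against the built-in mean() on the same vector; the vector size and seed are arbitrary, and the timings will differ between machines and R versions.

```r
set.seed(3)   # assumed seed
x <- rnorm(1e6, mean = 10)

# Explicit loop: add up a million values one at a time.
loop_time <- system.time({
  total <- 0
  for (value in x) {
    total <- total + value
  }
  loop_mean <- total / length(x)
})

# Vectorised equivalent.
vec_time <- system.time(vec_mean <- mean(x))

# Inspect loop_time["elapsed"] and vec_time["elapsed"] to compare.
```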

RUNNING YOU FOR A LOOP – ANDREW HENREY
Loop syntax:
    for (i in 1:n) {
      #TellVicEverything
    }

RUNNING YOU FOR A LOOP – ANDREW HENREY
No need to run from 1:K; you can use an arbitrary vector instead.
The loop runs for length(vect) iterations.
The loop variable takes on the i-th value of the vector each iteration.
e.g.
    V = c(1,5,3,-6)
    for (count in V) {print(count);}   ## 1, 5, 3, -6
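Both loop styles can be sketched with results stored rather than printed. The names `squares` and `running_total` are illustrative.

```r
# Looping over 1:n and writing into a pre-allocated vector.
n <- 5
squares <- numeric(n)
for (i in 1:n) {
  squares[i] <- i^2
}

# Looping directly over the values of an arbitrary vector.
V <- c(1, 5, 3, -6)
running_total <- 0
for (count in V) {
  running_total <- running_total + count
}
```

Pre-allocating `squares` before the loop avoids growing the vector on every iteration, which is one common source of slow loops in R.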

RUNNING YOU FOR A LOOP – ANDREW HENREY
Exercises:
A) Define a variable runs to be the number 10,000.
B) Define a matrix() called mat with 5 columns and runs rows (10,000 rows).
C) Put a for() loop around the code found in q5-loopthis.txt. Loop from 1:runs. Use index indicators like a[k,] to save the estimated coefficients of the model in a new row of mat.
OR, if you think you are a total coding BOSS, put the loop around your code in parts 2-4 that generates data and finds the linear model estimates of the betas.
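A sketch of the "coding BOSS" variant, with the data generation from parts 2-4 inside the loop. The contents of q5-loopthis.txt are not shown in the slides, so this is an assumption about what the loop body looks like. To keep the sketch quick, runs is set to 200 here rather than the exercise's 10,000, and the seed is arbitrary.

```r
set.seed(99)   # assumed seed
runs <- 200    # the exercise uses 10000; a smaller value keeps this sketch fast
mat <- matrix(NA_real_, nrow = runs, ncol = 5)

for (k in 1:runs) {
  # Fresh data set for every run.
  x1 <- rnorm(50, mean = 10, sd = 4)
  x2 <- rpois(50, lambda = 5)
  x3 <- rexp(50, rate = 1/7)
  x4 <- rt(50, df = 5) + 5
  err <- rnorm(50, mean = 0, sd = 20)
  y <- x1 + 15 * x2 - 12 * x3 - 10 * x4 + err

  mod <- lm(y ~ x1 + x2 + x3 + x4)
  mat[k, ] <- coef(mod)   # intercept plus four slopes go into row k
}

avg_estimates <- colMeans(mat)   # compare with the true values 0, 1, 15, -12, -10
```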

FUNCTION DEFINITION – ANDREW HENREY
Functions are a slightly abstract concept.
Mathematics: f(x) = x^2 + 4x - 16
Computing: mean(x) = sum(x)/length(x) – all 3 are functions!
Functions map INPUTS to OUTPUT.
Possibly no inputs.
In one way or another, always some form of output.
Example functions: SORT, MEDIAN, OPTIM / NLM, LM / GLM

FUNCTION DEFINITION – ANDREW HENREY
Function syntax
Simple function:
    F = function() {
      return (5)
    }
    >> F()
    5
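The same syntax extends to functions with inputs. The sketch below writes the mathematical example f(x) = x^2 + 4x - 16 and a hand-rolled mean as R functions; the names `f` and `my_mean` are illustrative.

```r
# A function with no inputs still returns an output.
five <- function() {
  return(5)
}

# The mathematical example as an R function.
f <- function(x) {
  return(x^2 + 4 * x - 16)
}

# mean() rebuilt from sum() and length().
my_mean <- function(x) {
  return(sum(x) / length(x))
}

f_at_2 <- f(2)                  # 4 + 8 - 16 = -4
avg <- my_mean(c(1, 2, 3))      # 2
```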

FUNCTION DEFINITION – ANDREW HENREY
Exercises:
Make a function out of the code you wrote in part 5. The syntax should be similar to the previous slide.
The function:
  Should be called simulate.lm
  Should include everything needed to generate the data several times, find a linear model, and extract the coefficients
  Does NOT take any inputs
  Should return the matrix of 10,000 runs of coefficients
Use the function and save the results to a matrix called test.
If nothing is working, you can use the example code in q6 – function this.txt

ADVANCED FUNCTIONS – ANDREW HENREY
A more complicated example:
    MSE = function(X=c(0,3,11),Y) {
      return (mean((X-Y)^2))
    }
Observe that X has default values.
    >> MSE(Y=c(4,5,6))
    15

ADVANCED FUNCTIONS – ANDREW HENREY
If an input argument to a function has default values, you don't have to specify them when calling the function.
If an input argument has no default values, running the function without specifying them gives you an error.
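Both cases can be checked with the MSE function from the previous slide. In this sketch tryCatch() is used only so the error is captured as text instead of stopping the script; the variable names are illustrative.

```r
MSE <- function(X = c(0, 3, 11), Y) {
  return(mean((X - Y)^2))
}

# X falls back to its default c(0, 3, 11): mean(c(16, 4, 25)) = 15.
with_default <- MSE(Y = c(4, 5, 6))

# Supplying X overrides the default.
overridden <- MSE(X = c(4, 5, 6), Y = c(4, 5, 6))

# Y has no default, so leaving it out raises an error when Y is first used.
msg <- tryCatch(MSE(), error = function(e) conditionMessage(e))
```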

ADVANCED FUNCTIONS – ANDREW HENREY
Exercises:
Modify simulate.lm() by adding input parameters. Include:
  nruns, the number of runs in the simulation, with no default
  seed, the random initial seed of the simulation, defaulting to 1337
  verbose, a Boolean true/false to report progress, defaulting to FALSE (caps matter)
Set runs to nruns at the beginning of your function.
Use set.seed(seed) in your function.
Add code that prints out how far along you are in the loop, but only when verbose is true.
Run this new function to overwrite the old one.
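A sketch of what the finished function might look like, under the assumption that the loop body is the data generation and model fit from the earlier parts. Reporting every 100 runs is an arbitrary choice. Note that R's stats package already has a simulate() generic with a method for lm objects, so the exercise's name simulate.lm coincides with it; for a workshop script this is harmless, but a different name may be safer in longer projects.

```r
simulate.lm <- function(nruns, seed = 1337, verbose = FALSE) {
  runs <- nruns
  set.seed(seed)
  mat <- matrix(NA_real_, nrow = runs, ncol = 5)

  for (k in seq_len(runs)) {
    x1 <- rnorm(50, mean = 10, sd = 4)
    x2 <- rpois(50, lambda = 5)
    x3 <- rexp(50, rate = 1/7)
    x4 <- rt(50, df = 5) + 5
    err <- rnorm(50, mean = 0, sd = 20)
    y <- x1 + 15 * x2 - 12 * x3 - 10 * x4 + err

    mat[k, ] <- coef(lm(y ~ x1 + x2 + x3 + x4))

    # Progress report, only when asked for.
    if (verbose && k %% 100 == 0) {
      cat("Finished run", k, "of", runs, "\n")
    }
  }
  return(mat)
}

betas <- simulate.lm(nruns = 300)   # seed and verbose take their defaults
beta_means <- colMeans(betas)       # compare with 0, 1, 15, -12, -10
beta_sds <- apply(betas, 2, sd)     # spread of each sampling distribution
```

Because the seed is set inside the function, two calls with the same arguments return the same matrix.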

USING FUNCTIONS – ANDREW HENREY
Exercises:
A) Run your simulate.lm() with runs. Store these results as a variable called betas.
B) Use hist() on the first column of betas to see the sampling distribution of the intercept.
C) Use summary(), mean(), and sd() on this column as well.
D) Use par(mfrow=c(2,2)) and then some hist() commands to display histograms of the other four sampling distributions in a 2x2 grid.
E) Compare your results to the known values (the means of the sampling distributions to the true values, and the standard deviations to the estimated values in Q4).

SIMULATION STUDY – ANDREW HENREY
Idea:
You have a binomial experiment with 9 successes and 3 failures.
You would like to construct a 95% CI for the true proportion of successes.
You DON'T know whether the normal approximation is appropriate.
How can we find out whether or not it's OK?

SIMULATION STUDY – ANDREW HENREY
Overall procedure:
Construct a LOT of samples from a binomial population with 12 trials and p=0.75.
For each sample, calculate the 95% CI using the normal approximation.
For each sample, check whether the CI contains 0.75.
Count the number of samples for which the CI contains 0.75.
The proportion of samples whose CI contains 0.75 estimates the "true coverage probability".
If the true coverage probability is close to 95%, the normal approximation to the sampling distribution of the estimate of p is a good one.

SIMULATION STUDY – ANDREW HENREY
Steps:
Generate 100,000 samples of binomial(12, 0.75) data using rbinom().
For each sample, calculate the usual estimate of p (successes divided by 12).
For each sample, calculate SE = sqrt(p*(1-p)/12), using the estimate of p.
For each sample, calculate the lower and upper bounds of the 95% CI.
Find out how many intervals actually contain 0.75.
Optional: Look at hist(x) to gain intuition of why the normal approximation isn't perfect.
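The steps above can be sketched without an explicit loop, since rbinom() and the arithmetic are vectorised. The seed, the variable names, and the use of qnorm(0.975) for the 95% multiplier are assumptions for illustration.

```r
set.seed(1337)   # assumed seed
n_samples <- 100000
n_trials <- 12
p_true <- 0.75

# One binomial count per simulated experiment.
x <- rbinom(n_samples, size = n_trials, prob = p_true)

p_hat <- x / n_trials                        # usual estimate of p
se <- sqrt(p_hat * (1 - p_hat) / n_trials)   # SE based on the estimate
z <- qnorm(0.975)                            # about 1.96 for a 95% interval

lower <- p_hat - z * se
upper <- p_hat + z * se

# TRUE where the interval contains the true proportion.
covered <- lower <= p_true & p_true <= upper
coverage <- mean(covered)   # estimated coverage probability; compare with 0.95

# Optional: hist(x) shows how discrete and skewed the counts are.
```

Samples with 12 successes out of 12 give SE = 0, so their interval collapses to a single point and cannot contain 0.75; this is one visible way the normal approximation struggles with so few trials.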

THE END Leave plz tyvm