Introduction to R and RStudio
Jeff Witmer 9 March 2016
R is
A software package for statistical computing and graphics
A collection of 6,700 packages (as of June 2015, so more now)
A (not ideal) programming language
A work environment
Widely used
Powerful
Free
R is an interpreted language, but much of it is written in compiled C.
Some history
S was developed at Bell Labs, starting in the 1970s
R was created in the 1990s by Ross Ihaka and Robert Gentleman
R was based on S, with code written in C
S largely was used to make good graphs, not an easy thing at the time
R, like S, is quite good for graphing. For lots of examples, see ggplot2-cheatsheet-2.0.pdf
A few simple graphs using the ggplot2 package
An example of graphing using the GGally package in R
Who uses R?
RStudio is
An Integrated Development Environment (IDE) for R
A gift, from J.J. Allaire (Macalester College, ’91), to the world
An easy (easier) way to use R
Available as a desktop product or, as used at OC, run off of a file server
Free, unless you want the newest version, with more bells and whistles, and you are not eligible for the educational discount (= free)
R supports RPubs
RStudio screen shot
R is object-oriented
e.g.,
MyModel <- lm(wt ~ ht, data = mydata)
then
hist(MyModel$residuals)
Note: lm(wt ~ ht*age + log(bp), data = mydata) regresses wt on ht, age, the ht-by-age interaction, and log(bp). There is no need to create the interaction or the log(bp) variable outside of the lm() command.
Comparing nested models:
mod1 <- lm(wt ~ ht*age + log(bp), data = mydata)
mod2 <- lm(wt ~ ht + log(bp), data = mydata)
anova(mod2, mod1)
gives a nested F-test
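The slide's mydata is not available here, so the following is only a sketch of the same pattern using the mtcars data frame that ships with R. The variable choices (mpg, wt, hp, disp) and the object names are stand-ins chosen for illustration, not the author's data.

```r
# Illustrative stand-in: mtcars replaces the slide's mydata.
# The formula expands wt*hp into wt, hp, and the wt-by-hp interaction,
# and log(disp) is computed inside lm(), with no new column needed.
full_model    <- lm(mpg ~ wt * hp + log(disp), data = mtcars)
reduced_model <- lm(mpg ~ wt + log(disp), data = mtcars)

# The fitted model is an object; its pieces can be pulled out by name.
coef_names <- names(coef(full_model))
print(coef_names)
print(summary(full_model$residuals))

# Nested F-test: the smaller model goes first, the larger model second.
nested_test <- anova(reduced_model, full_model)
print(nested_test)
```

The second row of the anova table reports the F-test for the terms that appear only in the larger model (here hp and the wt-by-hp interaction).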
R as a programming language
If you want R to be (relatively) fast, take advantage of vector operations; e.g., use the replicate command (rather than a loop) or the tapply function.
E.g.,
replicate(25, addingLines(n=10))
calls the addingLines function (something I wrote) 25 times.
> with(Dabbs, tapply(testosterone, occupation, mean))
Actor MD Minister Prof
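Neither addingLines nor the Dabbs data frame is available here, so this is a hedged sketch of the same two commands with stand-ins: sample_mean is a hypothetical function invented for the example, and mtcars (built into R) replaces Dabbs.

```r
set.seed(1)  # make the random draws reproducible

# Hypothetical stand-in for the author's addingLines function.
sample_mean <- function(n) mean(rnorm(n))

# replicate(n, expr) evaluates expr n times; no explicit loop is needed.
means <- replicate(25, sample_mean(n = 10))
print(length(means))

# tapply(values, groups, FUN) applies FUN to the values within each group.
group_means <- with(mtcars, tapply(mpg, cyl, mean))
print(group_means)
```

Note that the first argument of replicate is named n, so the repetition count can be given positionally or as n = 25.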
If you want to know how to do something in R
See the “Minimal R.pdf” handout
Go to the Quick-R.com page
Google “How do I do xxx in R?”
A standing joke among R users is that the answer is always “There are many ways to do that in R.”
Speaking of many ways to do something in R…
(1) mean(mydata$ht)
(2) with(mydata, mean(ht))
(3) mean(ht, data=mydata)
However
(1) plot(mydata$ht, mydata$wt) works
(2) with(mydata, plot(ht, wt)) works
(3) plot(ht, wt, data=mydata) does not work
(3a) plot(wt~ht, data=mydata) works
plot(wt~ht, data=mydata) feeds the plot command a formula, whereas plot(ht, wt, data=mydata) doesn’t
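A small base-R sketch of the plot() cases, with mtcars standing in for mydata (an assumption made only so the code runs). Graphics are sent to a null device so that nothing is written to disk.

```r
# (1) and (2): two equivalent ways to reach a column.
m1 <- mean(mtcars$mpg)
m2 <- with(mtcars, mean(mpg))
print(identical(m1, m2))

pdf(NULL)  # null graphics device: draw without creating a file

# Formula version: plot() looks up mpg and wt inside the data frame.
plot(mpg ~ wt, data = mtcars)

# Bare-name version: wt is evaluated before plot() ever sees data=,
# so R cannot find it (assuming no object called wt exists in the workspace).
res <- try(plot(wt, mpg, data = mtcars), silent = TRUE)
print(inherits(res, "try-error"))

invisible(dev.off())
```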
The mosaic package (Kaplan, Pruim, Horton) was created to make R easy to use for intro stats.
mosaic package syntax: goal(y ~ x|z, data=mydata)
E.g.: tally(~sex, data=HELPrct)
E.g.: t.test(age ~ sex, data=HELPrct)
E.g.: t.test(age ~ sex, data=HELPrct)$p.value
E.g.: favstats(age ~ substance|sex, data=HELPrct)
See MinimalR-2pages.pdf
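The goal(y ~ x, data=mydata) shape is also used by several base-R functions, which may help when mosaic is not installed. The sketch below uses only base R and the built-in mtcars data; these calls are rough analogues of the mosaic examples, not the mosaic functions themselves.

```r
# Counts of one categorical variable (roughly what tally(~x, data=) reports).
counts <- xtabs(~ am, data = mtcars)
print(counts)

# Two-sample t-test via a formula, then just the p-value.
p <- t.test(mpg ~ am, data = mtcars)$p.value
print(p)

# Group means of mpg for each cyl-by-am combination
# (a much smaller summary than favstats gives).
grp <- aggregate(mpg ~ cyl + am, data = mtcars, FUN = mean)
print(grp)
```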
The mosaic package mPlot() command makes graphing easy.
mPlot(SaratogaHouses)
The openintro package edaPlot() command makes exploring data graphically easy to do.
edaPlot(SaratogaHouses)
The mosaic, tidyr, and dplyr packages handle SQL-type work: merging files, extracting subsets, etc.
data(NCHS) #loads in the NCHS data frame
newNCHS <- NCHS %>% sample_n(size=5000) %>% filter(age > 18)
#takes a sample of size 5000, extracts only the rows for which age > 18, and saves the result in newNCHS
See data-wrangling-cheatsheet.pdf
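For readers without dplyr or the NCHS data at hand, the same sample-then-filter step can be sketched in base R. This is a swapped-in technique, not the dplyr pipeline itself; mtcars, the sample size of 20, and the hp > 100 condition are all stand-ins.

```r
set.seed(42)  # reproducible random sample

# Step 1: draw 20 rows at random (the role played by sample_n).
sampled <- mtcars[sample(nrow(mtcars), size = 20), ]

# Step 2: keep only rows meeting a condition (the role played by filter).
kept <- subset(sampled, hp > 100)

print(nrow(sampled))
print(nrow(kept))

# With dplyr loaded, the slide's style would read:
#   kept <- mtcars %>% sample_n(size = 20) %>% filter(hp > 100)
```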
I use R, and the do() command in the mosaic package, for simulations.
data(FirstYearGPA) #loads in the data frame
FY <- FirstYearGPA #rename the data frame
lm(GPA ~ SATM, data=FY) #gives the fitted slope
lm(GPA ~ SATM, data=FY)$coeff[2] #just look at the slope
do(3)*lm(GPA ~ shuffle(SATM), data=FY)$coeff[2] #break link b/w GPA and SATM
null.dist <- do(1000)*lm(GPA ~ shuffle(SATM), data=FY)$coeff[2] #1000 random slopes
histogram(null.dist$SATM, v=0.0012) #look at the 1000 slopes
with(null.dist, tally(abs(SATM.)>=0.0012)) #How many are far from zero?
with(null.dist, tally(abs(SATM.)>=0.0012, format='prop')) #What proportion are far from zero?
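The same shuffle-and-refit idea can be sketched without mosaic, using base R's sample() in place of shuffle() and replicate() in place of do(). This is a substitute for the mosaic commands, and mtcars (mpg regressed on wt) stands in for the FirstYearGPA data.

```r
set.seed(2016)  # reproducible shuffles

# Observed slope from the real pairing of response and predictor.
obs_slope <- unname(coef(lm(mpg ~ wt, data = mtcars))[2])

# sample(wt) permutes the predictor, breaking its link with mpg;
# refitting 1000 times gives a null distribution of slopes.
null_slopes <- unname(replicate(
  1000,
  coef(lm(mpg ~ sample(wt), data = mtcars))[2]
))

# Proportion of shuffled slopes at least as far from zero as the observed one.
p_value <- mean(abs(null_slopes) >= abs(obs_slope))
print(obs_slope)
print(p_value)
```

A histogram of null_slopes with a vertical line at obs_slope gives the same picture as the histogram() call on the slide.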
Using Predict.Plot to show Pr(win) as SaveDiff varies, for a fixed set of values for six other predictors.
plot(jitter(Win,amount=.05)~SaveDiff,data=LaXdata)
Predict.Plot(modelDiff,pred.var="SaveDiff",DrawDiff=-11, ShotDiff=6, TODiff=-3, ClearPctDiff=0.0952, ShotGoalDiff=1, GroundDiff=5, add=TRUE,plot.args=list(col='blue'))
#OCWLaX game vs BW
myx=data.frame(DrawDiff=-11, ShotDiff=6, TODiff=-3, SaveDiff = 0, ClearPctDiff=0.0952, ShotGoalDiff=1, GroundDiff=5)
predict.glm(modelDiff,myx,type="response") #gives 0.896
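The LaXdata data frame and modelDiff are not available here, so the idea of varying one predictor while holding the others fixed is sketched below with a small logistic model on mtcars. The model (am on wt and hp), the grid, and the fixed value hp = 120 are assumptions made for illustration only.

```r
# Stand-in logistic regression: Pr(manual transmission) from weight and horsepower.
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Vary wt over a grid while holding hp fixed at one value.
wt_grid <- seq(2, 4, by = 0.25)
new_rows <- data.frame(wt = wt_grid, hp = 120)

# type = "response" returns probabilities rather than log-odds.
prob <- predict(fit, newdata = new_rows, type = "response")
print(round(prob, 3))

# A single hypothetical case, analogous to the myx data frame on the slide.
one_case <- data.frame(wt = 3, hp = 120)
one_prob <- predict(fit, newdata = one_case, type = "response")
print(one_prob)
```

Calling lines(wt_grid, prob) after a scatterplot of the data would overlay the fitted probability curve, which is roughly what Predict.Plot draws with add=TRUE.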