R: A Statistics Program for Teaching & Research
Josué Guzmán
11 Nov

© J. Guzmán, 2007. R: Stat. Prog. for Teach. & Res.

Some Useful R Links
- R Home Page
- CRAN
- Precompiled Binary Distributions
- Windows (95 and later)
- R Manuals

R Installation
- R: Statistical Analysis & Graphics
- Freely Available Under GPL
- Binary Distributions
- Installation – Standard Steps

Running R

Statistical Programming with R
- Learn Language Basics
- Learn Documentation / Help System
- Learn Data Manipulation & Graphics
- Perform Basic Statistical Analysis

First Steps: Interacting with R
- Type a Command & Press Enter
- R Executes (printing the result if relevant)
- R waits for more input

Some Examples
    2 * 2
    [1] 4
    exp(-2)
    [1] 0.1353353
    rdmnorm = rnorm(1000)

R Functions
- exp, log and rnorm are functions
- Function calls are indicated by the presence of parentheses
- Example:
    hist(rdmnorm, col = "magenta")

Variables and Assignments
- The = operator assigns a value; the <- operator also works
    x = 2.2
    y = x
    sqrt(x)
    y
    x ^ y

Variables and Assignments
- Variable names cannot start with a digit
- Names are case-sensitive
- Some common names are already used by R and should be avoided
- Examples: c, q, t, C, D, F, I, T

Vectorized Arithmetic
- Elementary data types in R are all vectors
- The c(...) construct is used to create vectors
- Bolstad, 2004, exercise 13.2, page 253:
    fertilizer = c(1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5)
    fertilizer

Vectorized Arithmetic [cont.]
- Arithmetic operations (+, -, *, /, ^) and mathematical functions (sin, cos, log, …) work element-wise on vectors
    yield = c(25, 31, 27, 28, 36, 35, 32, 34)
    log(yield)

Vectorized Arithmetic [cont.]
    sum.yield = sum(yield)
    sum.yield
    n = length(yield)
    n
    avg.yield = sum.yield/n
    avg.yield

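As a quick check on the element-wise arithmetic above, the following sketch recomputes the average yield by hand and compares it with R's built-in mean(). The data are the fertilizer and yield vectors from the slides; the name yield.per.unit is introduced here only for illustration.

```r
fertilizer = c(1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5)
yield = c(25, 31, 27, 28, 36, 35, 32, 34)

# Element-wise division: one ratio per pair of values
yield.per.unit = yield / fertilizer
yield.per.unit

# Average computed by hand, as on the slide
avg.yield = sum(yield) / length(yield)
avg.yield

# The built-in mean() should agree with the hand computation
all.equal(avg.yield, mean(yield))
```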
Graphics
- The plot(x, y) function is a simple way to produce R graphics:
    plot(fertilizer, log(yield), main = "Fertilizer vs. Yield")

Getting Help
- help.start( ) – Starts a browser window with an HTML help interface. Links to the manual An Introduction to R, as well as topic-wise listings.
- help(topic) – Help page for a particular topic or function. Every R function has a help page.
- help.search("search string") – Subject/keyword search

Getting Help [cont.]
- Short-cut: question mark (?)
    help(plot)
    ? plot
- To learn about a specific subject, use the help.search function. Example:
    help.search("logarithm")

apropos( )
- The apropos function lists the topics that partially match its argument:
    apropos("plot")[1:10]
     [1] ".__C__recordedplot" "biplot"
     [3] "interaction.plot"   "lag.plot"
     [5] "monthplot"          "plot.TukeyHSD"
     [7] "plot.density"       "plot.ecdf"
     [9] "plot.lm"            "plot.mlm"

R Packages
- R makes use of a system of packages
- Each package is a collection of routines with a common theme
- The core of R itself is a package called base
- A collection of packages is called a library
- Some packages are already loaded when R starts up
- Other packages need to be loaded using the library function

R Packages [cont.]
- Several packages come pre-installed with R:
    installed.packages( )[, 1]
     [1] "ISwR"     "KernSmooth" "MASS"    "base"
     [5] "boot"     "class"      "cluster" "foreign"
     [9] "graphics" "grid"       "lattice" "methods"
    [13] "mgcv"     "nlme"       "nnet"    "rpart"
    [17] "spatial"  "splines"    "stats"   "stats4"
    [21] "survival" "tcltk"      "tools"   "utils"

Contributed Packages
- Many packages are available from CRAN
- Some packages are already loaded when R starts up
- To list the currently loaded packages, use search:
    search( )
    [1] ".GlobalEnv"    "package:tools"    "package:methods"
    [4] "package:stats" "package:graphics" "package:utils"
    [7] "Autoloads"     "package:base"

R Packages
- Packages can be loaded by the user. Example: the UsingR package
    library(UsingR)
- New packages are downloaded using the install.packages function:
    install.packages("UsingR")
    library(help = UsingR)

Data Types
- vector – Set of elements in a specified order
- matrix – Two-dimensional array of elements of the same mode
- factor – Vector of categorical data
- data frame – Two-dimensional array whose columns may represent data of different modes
- list – Set of components that can be any other object type

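To make the five data types concrete, here is a minimal sketch that builds one small object of each kind. All object names and values (num.vec, small.mat, and so on) are made up for illustration and do not come from the slides.

```r
# vector: ordered set of elements of one mode
num.vec = c(2.5, 3.1, 4.7)

# matrix: two-dimensional array, all elements of the same mode
small.mat = matrix(1:6, nrow = 2, ncol = 3)

# factor: categorical data with a fixed set of levels
grp.fac = factor(c("low", "high", "low"))

# data frame: columns may have different modes
demo.df = data.frame(id = 1:3, score = num.vec, group = grp.fac)

# list: components can be any other object type
demo.list = list(numbers = num.vec, table = demo.df, label = "demo")

levels(grp.fac)      # the distinct categories
dim(small.mat)       # rows and columns
names(demo.list)     # component names
```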
Editing Data Sets
- Data sets can be created and modified on the command line
    xx = seq(from = 1, to = 5)
    xx
    x2 = 1 : 5
    x2
    yy = scan( )
    yy
- A data set can be edited once it is created
    edit(mydata)
    data.entry(mydata)

Built-in Data
- Data from a library:
    library(UsingR)
    attach(cfb)   # Consumer-Finances Survey
    cfb$INCOME
    cfb$EDUC
    educ.fac = factor(EDUC)
    plot(INCOME ~ educ.fac, xlab = "EDUCATION", ylab = "INCOME")
    detach(cfb)

Data Modes
- logical – Binary mode, values represented as TRUE or FALSE
- numeric – Numeric mode [integer, single, & double precision]
- complex – Complex numeric values
- character – Character values represented as strings

Data Frames
- read.table( ) – Reads in data from an external file
    read.table("data.txt", header = T)
    read.table(file = file.choose( ), header = T)
- data.frame – Binds R objects of various kinds

read.table Function
- Reads an ASCII file and creates a data frame
- Expects data in tables of rows and columns
- If the first line contains column labels, use the argument header = T
- The field separator is white space
- Also read.csv and read.csv2, which assume , and ; as separators, respectively
- Treats character columns as factors (the default before R 4.0.0; later versions keep them as character unless stringsAsFactors = TRUE)

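The behaviour described above can be tried without an external file by passing the data through the text argument of read.table and read.csv. This is only a sketch: the column names and values are invented for illustration.

```r
# White-space separated data with a header line (made-up values)
raw.text = "fertilizer yield
1.0 25
1.5 31
2.0 27"

plots = read.table(text = raw.text, header = TRUE)
str(plots)        # a data frame with two numeric columns

# The same idea with comma-separated fields; read.csv assumes a header
csv.text = "fertilizer,yield\n1.0,25\n1.5,31"
plots.csv = read.csv(text = csv.text)
plots.csv
```

With a real file, the text argument would be replaced by a file name, as in read.table("data.txt", header = T) on the slide.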
save( ) and load( )
- Used for R functions and objects
- Files written by save( ) are readable only by load( )
    x = 23
    y = 44
    save(x, y, file = "xy.Rdata")
    load("xy.Rdata")

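A round trip makes the save( )/load( ) pair easier to see. The sketch below writes to a temporary file (an assumption made here so that nothing is left in the working directory), removes the objects, and restores them.

```r
x = 23
y = 44

# Temporary file instead of "xy.Rdata" (illustrative choice)
rdata.file = tempfile(fileext = ".Rdata")
save(x, y, file = rdata.file)

# Remove the objects, then bring them back from the file
rm(x, y)
restored = load(rdata.file)   # load() returns the names of the restored objects

restored
x + y
```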
Comparison Operators
    !=    Not Equal To
    <     Less Than
    <=    Less Than or Equal To
    ==    Exactly Equal To
    >     Greater Than
    >=    Greater Than or Equal To

Some Logical Operators
    !     Not
    |     Or (For Calculating Vectors and Arrays of Logicals)
    &     And (For Calculating Vectors and Arrays of Logicals)

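The comparison and logical operators work element-wise, so they are commonly used to select parts of a vector. A short sketch using the yield data from the earlier slides; the names high, mid.range and extreme are introduced here for illustration.

```r
yield = c(25, 31, 27, 28, 36, 35, 32, 34)

# Comparison returns one TRUE/FALSE per element
high = yield >= 32

# & and | combine logical vectors element by element
mid.range = yield > 26 & yield < 35
extreme = yield == min(yield) | yield == max(yield)

yield[high]      # values of at least 32
sum(!high)       # number of values below 32
yield[extreme]   # smallest and largest values
```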
Some Mathematical Functions
    abs             Absolute Value
    ceiling         Next Larger Integer
    floor           Next Smaller Integer
    cos, sin, tan   Trigonometric Functions
    exp(x)          e^x [e = …]
    log             Natural Logarithm
    log10           Logarithm Base 10
    sqrt            Square Root

Statistical Summary Functions
    length     Length of Object
    max        Maximum Value
    mean       Arithmetic Mean
    median     Median
    min        Minimum Value
    prod       Product of Values
    quantile   Empirical Quantiles
    sum        Sum
    var        Variance - Covariance
    sd         Standard Deviation
    cor        Correlation Between Vectors or Matrices

Sorting and Other Functions
    rev       Put Values of Vectors in Reverse Order
    sort      Sort Values of Vector
    order     Permutation of Elements to Produce Sorted Order
    rank      Ranks of Values in Vector
    match     Detect Occurrences in a Vector
    cumsum    Cumulative Sums of Values in Vector
    cumprod   Cumulative Products

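The difference between sort, order and rank is easiest to see on a small vector. The sketch below reuses the yield data from the earlier slides.

```r
yield = c(25, 31, 27, 28, 36, 35, 32, 34)

sort(yield)            # the values in increasing order
order(yield)           # positions that would sort the vector: 1 3 4 2 7 8 6 5
rank(yield)            # rank of each value in its original position

# Indexing by order() gives the same result as sort()
identical(yield[order(yield)], sort(yield))

rev(sort(yield))       # decreasing order
cumsum(yield)          # running total; the last entry equals sum(yield)
match(36, yield)       # position of the first occurrence of 36
```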
Plotting Functions Useful for One-Dimensional Data
    barplot    Bar plot
    boxplot    Box & Whisker plot
    hist       Histogram
    dotchart   Dot plot
    pie        Pie chart

Plotting Functions Useful for Two-Dimensional Data
- plot – Creates a scatter plot:
    plot(x, y)
- qqnorm – Quantile-quantile plot of a sample vs. N(0, 1):
    qqnorm(x)
- qqplot – Quantile-quantile plot for two samples:
    qqplot(x, y)
- pairs – Creates a pairs or scatter plot matrix:
    attach(babies)
    pairs(babies[, c("gestation", "wt", "age", "inc")])

Three-Dimensional Plotting Functions
    contour   Contour plot
    persp     Perspective plot
    image     Image plot

Probability Distributions Using R
- Pseudo-random sampling
    sample(0:20, 5)                # select 5 without replacement (WOR)
    sample(0:20, 5, replace = T)   # select 5 with replacement (WR)
- Coin toss simulation [0 = tail; 1 = head], 20 tosses:
    sample(c(0, 1), 20, replace = T)

For Any Probability Distribution
    ddist   density or probability
    pdist   cumulative probability
    qdist   quantiles [percentiles]
    rdist   pseudo-random selection

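The d/p/q/r naming pattern applies to every distribution built into R. As an illustration, the sketch below uses the exponential distribution, which is not covered on the slides; the rate value and the seed are arbitrary choices made here.

```r
rate.par = 2   # arbitrary rate parameter for the illustration

dexp(1, rate = rate.par)      # density at x = 1
pexp(1, rate = rate.par)      # P(X <= 1), which equals 1 - exp(-2)

# q is the inverse of p: the quantile of a cumulative probability
prob = pexp(1, rate = rate.par)
qexp(prob, rate = rate.par)   # recovers x = 1

# r draws pseudo-random values; set.seed makes the draw reproducible
set.seed(123)
draws = rexp(5, rate = rate.par)
draws
```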
Binomial Distribution
- X ~ Binomial(n, p); x = 0, 1, …, n
    dbinom(x, n, p)   Density or point probability
    pbinom(x, n, p)   Cumulative distribution
    qbinom(q, n, p)   Quantiles [0 < q < 1]
    rbinom(m, n, p)   Pseudo-random numbers

Binomial Distribution
- Coin toss simulation:
    x = 0:20                                   # num. of heads in 20 tosses
    px = dbinom(x, size = 20, prob = 0.5)
    plot(x, px, type = "h")                    # graph display
    curve(dnorm(x, 10, sqrt(20*.5*.5)), col = 2, add = T)

Normal Distribution
- X ~ Normal(µ, σ)
    dnorm(x, µ, σ)   Density
    pnorm(x, µ, σ)   Cumulative probability
    qnorm(q, µ, σ)   Quantiles
    rnorm(m, µ, σ)   Random numbers

Standard Normal
    x = seq(-3.5, 3.5, 0.1)    # x ~ N(0,1)
    prx = dnorm(x)             # M = 0, SD = 1
    plot(x, prx, type = "l")
- Or using:
    curve(dnorm(x), from = -3.5, to = 3.5)

Cumulative Normal & Quantiles
    curve(pnorm(x), from = -3.5, to = 3.5)
    qnorm(.25)                               # 25th percentile, x ~ N(0,1)
    qnorm(.75, mean = 50, sd = 2)            # M = 50, SD = 2
    qnorm(c(.1, .3, .7, .9), mean = 65, sd = 3)

Poisson Distribution
- X ~ Poisson(λ); X = 0, 1, 2, 3, …
    x = 0:20                          # Suppose λ = 3.5
    prx = dpois(x, lambda = 3.5)
    plot(x, prx, type = "h", main = "Poisson Distribution")
    text(10, .10, "Lambda = 3.5")

Sampling Distributions
    n = 25; curve(dnorm(x, 0, 1/sqrt(n)), -3, 3,
                  xlab = "Mean", ylab = "Densities of Sample Mean", bty = "l")
    n = 5; curve(dnorm(x, 0, 1/sqrt(n)), add = T)
    n = 1; curve(dnorm(x, 0, 1/sqrt(n)), add = T)

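The curves above use 1/sqrt(n) as the standard deviation of the mean of n standard normal observations. A small simulation can check that value; this is only a sketch, and the seed, the number of replicates and the object names are arbitrary choices made here.

```r
set.seed(2007)          # arbitrary seed, for reproducibility only
n = 25                  # sample size, as in the first curve
n.reps = 2000           # number of simulated samples (illustrative)

# Draw n.reps samples of size n from N(0, 1) and keep each sample mean
sample.means = replicate(n.reps, mean(rnorm(n)))

sd(sample.means)        # empirical standard deviation of the sample mean
1 / sqrt(n)             # theoretical value used in the curves: 0.2
```

The two printed values should be close; increasing n.reps tightens the agreement.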
t-Distribution as df Increases
    curve(dnorm(x), -4, 4, main = "Normal & t Distributions",
          ylab = "Densities")
    k = 3;   curve(dt(x, df = k), lty = k, add = T)
    k = 5;   curve(dt(x, df = k), lty = k, add = T)
    k = 15;  curve(dt(x, df = k), lty = k, add = T)
    k = 100; curve(dt(x, df = k), lty = k, add = T)

Binomial-Normal Approximation
- Coin toss example: n = 100, p = .5
- P(X ≤ 40)?
- Using Larget's prob.R file:
    source(file.choose( ))
    gbinom(100, .5, b = 40)
- Normal approximation: µ = 50, σ = 5
    gnorm(50, 5, b = 40.5)

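The same probability can be computed numerically with base R alone, without sourcing prob.R. The sketch below is not the gbinom/gnorm approach from the slide (those functions draw graphs); it uses pbinom and pnorm, with the continuity-corrected bound 40.5 taken from the slide. The object names are introduced here for illustration.

```r
n = 100
p = 0.5

# Exact binomial probability P(X <= 40)
exact.prob = pbinom(40, size = n, prob = p)

# Normal approximation with mean n*p and sd sqrt(n*p*(1-p))
mu = n * p                       # 50
sigma = sqrt(n * p * (1 - p))    # 5
approx.prob = pnorm(40.5, mean = mu, sd = sigma)

c(exact = exact.prob, normal.approx = approx.prob)
```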
One-Sample t-test
    Ho: µ = µ0    Null Hypothesis
    Ha: µ ≠ µ0    Two-sided
    Ha: µ > µ0    One-sided
    Ha: µ < µ0    One-sided

R One-Sample t.test
    x = c(x1, x2, …, xn)            # data set
    t.test(x, mu = Mo)              # two-sided
    t.test(x, mu = Mo, alt = "g")   # one-sided
    t.test(x, mu = Mo, alt = "l")   # one-sided

R One-Sample t.test [cont.]
- Example: Text, Problem 8.11, page 226
    library(UsingR)
    attach(stud.recs)
    x = sat.m            # Math SAT Scores
    hist(x)              # Visual display
    qqnorm(x)            # Normal quantile plot
    qqline(x, col = 2)   # Add reference line
    t.test(x, mu = 500)
    detach(stud.recs)

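For readers without the UsingR package, the following sketch runs a one-sided, one-sample t-test on simulated data and reproduces the test statistic by hand. The scores are hypothetical (generated with rnorm), and the seed, sample size and parameters are arbitrary choices made here.

```r
set.seed(42)                                  # arbitrary seed
scores = rnorm(30, mean = 520, sd = 80)       # hypothetical test scores

# Ho: mu = 500 against Ha: mu > 500
result = t.test(scores, mu = 500, alternative = "greater")
result$statistic     # t statistic
result$parameter     # degrees of freedom: n - 1
result$p.value

# The same statistic and p-value computed from the definition
n.obs = length(scores)
t.by.hand = (mean(scores) - 500) / (sd(scores) / sqrt(n.obs))
p.by.hand = pt(t.by.hand, df = n.obs - 1, lower.tail = FALSE)
c(t = t.by.hand, p = p.by.hand)
```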
Normality Test
- Shapiro-Wilk test:
    Ho: X ~ Normal
    Ha: X !~ Normal
- Command:
    shapiro.test(x)   # Examine p-value

Normality Test [cont.]
- Example: On Base %
    data(OBP)
    summary(OBP)
    boxplot(OBP)

Normality Test [cont.]
    qqnorm(OBP)
    qqline(OBP, col = 2)
    shapiro.test(OBP)
    wilcox.test(OBP, mu = .330)

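To see how the Shapiro-Wilk p-value behaves, it can help to compare a sample that really is normal with one that is clearly skewed. The sketch below uses simulated data rather than OBP; the seed, the sample sizes and the object names are arbitrary choices made here.

```r
set.seed(1)                        # arbitrary seed
normal.sample = rnorm(100)         # drawn from N(0, 1)
skewed.sample = rexp(100)          # drawn from a right-skewed exponential

p.normal = shapiro.test(normal.sample)$p.value
p.skewed = shapiro.test(skewed.sample)$p.value

# A small p-value is evidence against normality
c(normal = p.normal, skewed = p.skewed)
```

For the skewed sample the p-value should be very small, which is the situation in which a nonparametric alternative such as wilcox.test is worth considering.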
One-Sample Proportion Test
- x: total successes; n: sample size
    prop.test(x, n, p = Po)              # two-sided
    prop.test(x, n, p = Po, alt = "g")   # one-sided
    prop.test(x, n, p = Po, alt = "l")   # one-sided

Or Using Binomial "Exact" Test
    binom.test(x, n, p = Po)
    binom.test(x, n, p = Po, alt = "g")
    binom.test(x, n, p = Po, alt = "l")

Proportion Test
- Text, Example 8.3: Survey of US Poverty Rate
    Ho: P = 0.113   # Year 2000 Rate
    Ha: P > 0.113   # Year 2001 Rate Increased
    x = 5850        # Sample people UPL
    n =             # Sample size
    prop.test(x, n, p = 0.113, alt = "g")
    binom.test(x, n, p = 0.113, alt = "g")

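The two tests can be compared side by side on a small made-up example. The counts below (60 successes in 100 trials, null value 0.5) are hypothetical and are not the survey figures from the text; the exact p-value is also recomputed directly from the binomial distribution.

```r
x.succ = 60      # hypothetical number of successes
n.trials = 100   # hypothetical sample size
p.null = 0.5     # hypothesised proportion

# Normal-approximation test (with continuity correction by default)
approx.test = prop.test(x.succ, n.trials, p = p.null, alternative = "greater")

# Exact binomial test
exact.test = binom.test(x.succ, n.trials, p = p.null, alternative = "greater")

# Exact one-sided p-value by hand: P(X >= 60) = 1 - P(X <= 59)
p.by.hand = 1 - pbinom(x.succ - 1, n.trials, p.null)

c(prop.test = approx.test$p.value,
  binom.test = exact.test$p.value,
  by.hand = p.by.hand)
```

With a sample this size the approximate and exact p-values should agree closely.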
Some Modeling Functions/Packages
    Linear Models:   anova, car, lm, glm
    Graphics:        graphics, grid, lattice
    Multivariate:    mva, cluster
    Survey:          survey
    SQC:             qcc
    Time Series:     tseries
    Bayesian:        BRugs, MCMCpack, …
    Simulation:      boot, bootstrap, Zelig

"You perform an experiment in order to learn, not to prove."
W. Edwards Deming