Correlation coefficients and simple linear regression Chapter 7.1 ~ 7.6
Contents
1. Correlation coefficients (Pearson, Spearman)
2. Simple linear regression
3. Generation of correlated random numbers (visualization)
Example
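1.1. Pearson correlation coefficient
Pearson's correlation coefficient (r) between two variables X and Y is defined as the covariance of the two variables divided by the product of their standard deviations:

$$ r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} $$

where $X_i$ is the i-th value in X, $Y_i$ is the i-th value in Y, $\bar{X}$ is the mean of X, and $\bar{Y}$ is the mean of Y.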
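1.2. Pearson correlation coefficient

    my.cor <- function(x, y) {
      mx <- mean(x)
      my <- mean(y)
      Sxx <- sum((x - mx)^2)          # sum of squared deviations of x
      Syy <- sum((y - my)^2)          # sum of squared deviations of y
      Sxy <- sum((x - mx) * (y - my)) # sum of cross products
      r <- Sxy / sqrt(Sxx) / sqrt(Syy)
      return(r)
    }
    # Built-in equivalent: cor(x, y)

Example:

    x <- c(26, 25, 23, 27, 28, 25, 22, 26, 25, 23)
    y <- c(54, 62, 51, 58, 63, 65, 59, 63, 65, 60)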
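As a quick check, the example data can be run through the hand-written function and compared with R's built-in `cor()`. This is a minimal sketch that repeats the definition of `my.cor` so it runs on its own; the names `r_manual` and `r_builtin` are introduced here for illustration only.

```r
# Pearson correlation from sums of squares (same computation as my.cor)
my.cor <- function(x, y) {
  mx <- mean(x)
  my <- mean(y)
  Sxx <- sum((x - mx)^2)
  Syy <- sum((y - my)^2)
  Sxy <- sum((x - mx) * (y - my))
  Sxy / sqrt(Sxx) / sqrt(Syy)
}

x <- c(26, 25, 23, 27, 28, 25, 22, 26, 25, 23)
y <- c(54, 62, 51, 58, 63, 65, 59, 63, 65, 60)

# Here mean(x) = 25, mean(y) = 60, Sxx = 32, Syy = 194, Sxy = 23,
# so r = 23 / sqrt(32 * 194), approximately 0.292
r_manual  <- my.cor(x, y)
r_builtin <- cor(x, y)

print(c(manual = r_manual, builtin = r_builtin))
```

Both values should agree up to floating-point rounding, since `cor()` uses the Pearson method by default.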
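1.3. Spearman correlation coefficient
The Spearman correlation coefficient is defined as the Pearson correlation coefficient between the ranked variables. For a sample of size n, the n raw scores $X_i$, $Y_i$ are converted to ranks $x_i$, $y_i$, and ρ is computed from these:

$$ \rho = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$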
1.4. Spearman correlation coefficient

    my.cor2 <- function(x, y, method = c("pearson", "spearman")) {
      method <- match.arg(method)
      if (method == "spearman") {
        # Spearman: replace the raw values by their ranks
        x <- rank(x)
        y <- rank(y)
      }
      mx <- mean(x)
      my <- mean(y)
      Sxx <- sum((x - mx)^2)
      Syy <- sum((y - my)^2)
      Sxy <- sum((x - mx) * (y - my))
      r <- Sxy / sqrt(Sxx) / sqrt(Syy)
      return(r)
    }
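The rank-based version can be checked against `cor(..., method = "spearman")` in the same way. The sketch below repeats `my.cor2` so it is self-contained and reuses the example data from 1.2; `rho_manual` and `rho_builtin` are illustrative names. The data contain ties, which `rank()` resolves with average ranks by default.

```r
# Pearson or Spearman correlation; Spearman is Pearson applied to ranks
my.cor2 <- function(x, y, method = c("pearson", "spearman")) {
  method <- match.arg(method)
  if (method == "spearman") {
    x <- rank(x)
    y <- rank(y)
  }
  mx <- mean(x)
  my <- mean(y)
  Sxx <- sum((x - mx)^2)
  Syy <- sum((y - my)^2)
  Sxy <- sum((x - mx) * (y - my))
  Sxy / sqrt(Sxx) / sqrt(Syy)
}

x <- c(26, 25, 23, 27, 28, 25, 22, 26, 25, 23)
y <- c(54, 62, 51, 58, 63, 65, 59, 63, 65, 60)

rho_manual  <- my.cor2(x, y, method = "spearman")
rho_builtin <- cor(x, y, method = "spearman")

print(c(manual = rho_manual, builtin = rho_builtin))
```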
2.1. Simple linear regression
Simple linear regression fits a straight line through a set of n points so that the sum of squared residuals is as small as possible. The residuals are the vertical distances between the data points and the fitted line.
2.2. Simple linear regression
Suppose there are n data points $\{y_i, x_i\}$, where i = 1, 2, …, n. The goal is to find the equation of the straight line $y = a x + b$ that minimizes the sum of squared residuals of the linear regression model. In other words, the numbers a and b solve the following minimization problem:

$$ \min_{a,\,b} Q(a, b), \qquad Q(a, b) = \sum_{i=1}^{n} (y_i - a x_i - b)^2 $$
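2.3. Simple linear regression
Setting the partial derivatives of the sum of squared residuals, $\sum_{i=1}^{n} (y_i - a x_i - b)^2$, with respect to a and b to zero gives the equations for the a and b that minimize it:

$$ a \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} x_i y_i $$

$$ a \sum_{i=1}^{n} x_i + b\,n = \sum_{i=1}^{n} y_i $$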
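For reference, this pair of equations can also be solved in closed form. The expressions below assume the line is written $y = a x + b$ (a the slope, b the intercept), which matches the order of the values returned by `my.reg` in 2.4; $S_{xx}$ and $S_{xy}$ are the sums of the same name used in `my.cor`.

```latex
\begin{aligned}
a &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
   = \frac{S_{xy}}{S_{xx}}, \\
b &= \bar{y} - a\,\bar{x}.
\end{aligned}
```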
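2.4. Simple linear regression

    my.reg <- function(x, y) {
      sx2 <- sum(x^2)
      sx  <- sum(x)
      sxy <- sum(x * y)
      sy  <- sum(y)
      # Coefficient matrix and right-hand side of the two linear equations
      A <- matrix(c(sx2, sx, sx, length(x)), 2, 2)
      B <- matrix(c(sxy, sy))
      v <- solve(A, B)   # v[1] = slope, v[2] = intercept
      return(v)
    }
    # Built-in equivalent: lm(y ~ x)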
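A short comparison with `lm()` on the example data from 1.2 shows that both routes give the same line. This sketch repeats `my.reg` in compact form so it runs on its own; `fit` and `ref` are illustrative names. Note that `my.reg` returns the slope first, whereas `coef(lm(...))` returns the intercept first.

```r
# Least-squares line by solving the 2 x 2 system directly
my.reg <- function(x, y) {
  A <- matrix(c(sum(x^2), sum(x), sum(x), length(x)), 2, 2)
  B <- matrix(c(sum(x * y), sum(y)))
  solve(A, B)
}

x <- c(26, 25, 23, 27, 28, 25, 22, 26, 25, 23)
y <- c(54, 62, 51, 58, 63, 65, 59, 63, 65, 60)

fit <- my.reg(x, y)      # fit[1] = slope, fit[2] = intercept
ref <- coef(lm(y ~ x))   # ref[1] = intercept, ref[2] = slope

# For these data: slope = 23 / 32 = 0.71875, intercept = 60 - 0.71875 * 25 = 42.03125
print(as.numeric(fit))
print(ref)
```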
3.1. Generation of correlated random numbers
Generating two sequences of random numbers with a given correlation ρ is done in two simple steps:
1. Generate two sequences X and Y of uncorrelated, standard normally distributed random numbers.
2. Define a new sequence $Z = \rho X + \sqrt{1 - \rho^2}\, Y$.
This new sequence Z will have a correlation of ρ with the sequence X.
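Why this works can be checked in a few lines. The derivation assumes X and Y are uncorrelated with mean 0 and variance 1, and that the new sequence is $Z = \rho X + \sqrt{1-\rho^2}\,Y$, as in the `r2norm` code in 3.2.

```latex
\begin{aligned}
\operatorname{Var}(Z) &= \rho^2 \operatorname{Var}(X) + (1-\rho^2)\operatorname{Var}(Y)
  + 2\rho\sqrt{1-\rho^2}\,\operatorname{Cov}(X,Y) = \rho^2 + (1-\rho^2) = 1, \\
\operatorname{Cov}(X,Z) &= \rho\operatorname{Var}(X) + \sqrt{1-\rho^2}\,\operatorname{Cov}(X,Y) = \rho, \\
\operatorname{Corr}(X,Z) &= \frac{\operatorname{Cov}(X,Z)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Z)}} = \rho.
\end{aligned}
```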
3.2. Generation of correlated random numbers

    r2norm <- function(n = 100, rho = 0.5) {
      x <- rnorm(n)
      # Mix x with an independent normal sequence so that cor(x, y) is about rho
      y <- rho * x + sqrt(1 - rho^2) * rnorm(n)
      return(data.frame(x = x, y = y))
    }
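One way to see the effect is to draw a large sample and compare the sample correlation with the requested value. This is a minimal sketch: the seed, the sample size, and the name `r_hat` are arbitrary choices made here, not part of the original slides.

```r
# Two normal sequences with target correlation rho
r2norm <- function(n = 100, rho = 0.5) {
  x <- rnorm(n)
  y <- rho * x + sqrt(1 - rho^2) * rnorm(n)
  data.frame(x = x, y = y)
}

set.seed(1)                          # fixed seed so the run is reproducible
res <- r2norm(n = 100000, rho = 0.5)

r_hat <- cor(res$x, res$y)           # sample correlation, expected to be close to 0.5
print(r_hat)
```

With a small n such as the default 100, the sample correlation can differ noticeably from rho; the agreement tightens as n grows.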
3.3. Scatter plot and histograms

    scatterplot2 <- function(x, y, ...) {
      def.par <- par(no.readonly = TRUE)   # save graphics settings to restore at the end
      n <- length(x)
      xhist <- hist(x, sqrt(n), plot = FALSE)
      yhist <- hist(y, sqrt(n), plot = FALSE)
      top <- max(c(xhist$counts, yhist$counts))
      xrange <- c(min(x), max(x))
      yrange <- c(min(y), max(y))
      # Layout: scatter plot bottom left, x histogram on top, y histogram on the right
      nf <- layout(matrix(c(2, 0, 1, 3), 2, 2, TRUE), c(3, 1), c(1, 3), TRUE)
      par(mar = c(3, 3, 1, 1))
      plot(x, y, xlim = xrange, ylim = yrange, xlab = "x", ylab = "y", ...)
      # abline(lm(y ~ x))
      par(mar = c(0, 3, 1, 1))
      barplot(xhist$counts, axes = FALSE, ylim = c(0, top), space = 0,
              col = gray(0.95))
      par(mar = c(3, 0, 1, 1))
      barplot(yhist$counts, axes = FALSE, xlim = c(0, top), space = 0,
              col = gray(0.95), horiz = TRUE)
      par(def.par)
    }
3.4. Scatter plot and histograms
[Figure: scatter plot with marginal histograms, labeled by the correlation r]
3.5. 3D histogram

    res <- r2norm(100000)

    # hist2d() is provided by gplots (gregmisc is the older bundle that contained it)
    library(gplots)
    h2d <- hist2d(res$x, res$y, show = FALSE, same.scale = TRUE, nbins = c(50, 50))
    Z <- h2d$counts / max(h2d$counts)   # heights scaled to [0, 1]
    X <- h2d$x
    Y <- h2d$y

    library(rgl)
    zlim <- c(0, max(h2d$counts))
    zlen <- zlim[2] - zlim[1] + 1
    colorlut <- cm.colors(zlen)                 # height color lookup table
    # Index the lookup table by the raw counts; the scaled Z would map
    # almost every cell to the first color
    col <- colorlut[h2d$counts - zlim[1] + 1]

    open3d()
    surface3d(X, Y, 5 * Z, color = col, alpha = 0.65, back = "lines")
    axes3d(c("x", "y", "z-"))
    grid3d(c("x", "y+", "z"), at = NULL, col = "gray", n = 8, lwd = 1, lty = "solid")
    title3d(main = "Correlation", sub = "", xlab = "X", ylab = "Y", zlab = "")
    rgl.bringtotop()