1 Peter Fox
Data Analytics – ITWS-4963/ITWS-6965
Week 10b, April 10, 2015
Labs: Cross Validation, RandomForest, Multi-Dimensional Scaling, Dimension Reduction, Factor Analysis
Advertisements
Mike Schroepfer, Facebook CTO, will be lecturing and doing Q&A for Data and Society (CSCI 4967/6963) on Friday, April 24, in Walker 5113 from 9 to 11. Who would attend (must confirm)?
2
If you did not complete the svm labs: Lab9b_svm{1,11}_2015.R
3
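As a refresher, here is a minimal svm run in the same spirit (a hedged sketch on the built-in iris data, not the lab solution; the seed and the 100-row split are assumptions):

library(e1071)
data(iris)
set.seed(123)                            # assumed seed, for reproducibility
idx <- sample(nrow(iris), 100)           # ~2/3 train, rest held out
fit <- svm(Species ~ ., data = iris[idx, ], kernel = "radial")
pred <- predict(fit, iris[-idx, ])
table(pred, actual = iris$Species[-idx]) # confusion matrix on held-out rows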
Cross-validation - coleman
> head(coleman)
  salaryP fatherWc sstatus teacherSc motherLev Y
  (numeric rows omitted)
Lab11_11_2014.R
> call <- call("lmrob", formula = Y ~ .)
> # set up folds for cross-validation
> folds <- cvFolds(nrow(coleman), K = 5, R = 10)
> # perform cross-validation
> cvTool(call, data = coleman, y = coleman$Y, cost = rtmspe,
+        folds = folds, costArgs = list(trim = 0.1))
      CV
 [1,] through [10,]: (values omitted)
Warning messages:
1: In lmrob.S(x, y, control = control) :
  S refinements did not converge (to refine.tol=1e-07) in 200 (= k.max) steps
2: In lmrob.S(x, y, control = control) :
  S refinements did not converge (to refine.tol=1e-07) in 200 (= k.max) steps
3: In lmrob.S(x, y, control = control) :
  find_scale() did not converge in 'maxit.scale' (= 200) iterations
4: In lmrob.S(x, y, control = control) :
  find_scale() did not converge in 'maxit.scale' (= 200) iterations
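For reference, a self-contained version of the slide's script (assuming the robustbase and cvTools packages are installed; the seed is an assumption to make the folds reproducible):

library(robustbase)   # lmrob() and the coleman data
library(cvTools)      # cvFolds(), cvTool(), rtmspe()
data(coleman)
set.seed(1234)                                   # assumed seed
call <- call("lmrob", formula = Y ~ .)
folds <- cvFolds(nrow(coleman), K = 5, R = 10)   # 5-fold CV, 10 replications
cvTool(call, data = coleman, y = coleman$Y, cost = rtmspe,
       folds = folds, costArgs = list(trim = 0.1))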
Lab11b_12_2014.R
> cvFits
5-fold CV results:
  Fit  CV
1  LS  (value omitted)
2  MM  (value omitted)
3 LTS  (value omitted)
Best model:
  CV
"MM"
6
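The slide shows only the output; code along these lines, following the cvTools package examples, could produce it (a hedged sketch, not necessarily the exact lab script):

library(robustbase); library(cvTools)
data(coleman)
folds <- cvFolds(nrow(coleman), K = 5, R = 10)
fitLm      <- lm(Y ~ ., data = coleman)            # least squares
cvFitLm    <- cvLm(fitLm, cost = rtmspe, folds = folds, trim = 0.1)
fitLmrob   <- lmrob(Y ~ ., data = coleman)         # MM-estimator
cvFitLmrob <- cvLmrob(fitLmrob, cost = rtmspe, folds = folds, trim = 0.1)
fitLts     <- ltsReg(Y ~ ., data = coleman)        # least trimmed squares
cvFitLts   <- cvLts(fitLts, cost = rtmspe, folds = folds, trim = 0.1)
cvFits <- cvSelect(LS = cvFitLm, MM = cvFitLmrob, LTS = cvFitLts)
cvFits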
50% and 75% subsets
# 50% subsets
fitLts50 <- ltsReg(Y ~ ., data = coleman, alpha = 0.5)
cvFitLts50 <- cvLts(fitLts50, cost = rtmspe, folds = folds,
                    fit = "both", trim = 0.1)
# 75% subsets
fitLts75 <- ltsReg(Y ~ ., data = coleman, alpha = 0.75)
cvFitLts75 <- cvLts(fitLts75, cost = rtmspe, folds = folds,
                    fit = "both", trim = 0.1)
# combine and plot results
cvFitsLts <- cvSelect("0.5" = cvFitLts50, "0.75" = cvFitLts75)
7
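The closing comment mentions plotting, but no plot call appears on the slide; cvTools provides lattice-based plot methods for these objects (method names as in the cvTools documentation):

plot(cvFitsLts, method = "bwplot")        # spread of CV error across replications
plot(cvFitsLts, method = "densityplot")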
cvFitsLts (50/75)
> cvFitsLts
5-fold CV results:
   Fit reweighted  raw
1  0.5  (values omitted)
2 0.75  (values omitted)
Best model:
reweighted        raw
    "0.75"     "0.75"
8
Tuning
tuning <- list(tuning.psi = c(3.14, 3.44, 3.88, 4.68))
# perform cross-validation
cvFitsLmrob <- cvTuning(fitLmrob$call, data = coleman, y = coleman$Y,
                        tuning = tuning, cost = rtmspe, folds = folds,
                        costArgs = list(trim = 0.1))
9
cvFitsLmrob
5-fold CV results:
  tuning.psi  CV
  (values omitted)
Optimal tuning parameter:
  tuning.psi
  (value omitted)
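To actually use the winning value, refit with it; a hedged sketch, assuming (per the cvTuning documentation) that the best component holds the index of the optimal row of the tuning grid:

best <- cvFitsLmrob$best                  # index of the optimal tuning.psi
lmrob(Y ~ ., data = coleman,
      control = lmrob.control(tuning.psi = tuning$tuning.psi[best]))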
Lab11b_18
library(boot)                    # cv.glm(), glm.diag()
data(mammals, package = "MASS")
mammals.glm <- glm(log(brain) ~ log(body), data = mammals)
(cv.err <- cv.glm(mammals, mammals.glm)$delta)
[1] (values omitted)
(cv.err.6 <- cv.glm(mammals, mammals.glm, K = 6)$delta)
[1] (values omitted)
# As this is a linear model we could calculate the leave-one-out
# cross-validation estimate without any extra model-fitting.
muhat <- fitted(mammals.glm)
mammals.diag <- glm.diag(mammals.glm)
(cv.err <- mean((mammals.glm$y - muhat)^2/(1 - mammals.diag$h)^2))
[1] (value omitted)
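The shortcut in the last three lines is the standard leave-one-out identity for linear models, with h_i the leverage stored in mammals.diag$h:

\mathrm{CV} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - h_i} \right)^2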
Lab11b_18
# leave-one-out and 11-fold cross-validation prediction error for
# the nodal data set. Since the response is a binary variable an
# appropriate cost function is
> cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)
> nodal.glm <- glm(r ~ stage + xray + acid, binomial, data = nodal)
> (cv.err <- cv.glm(nodal, nodal.glm, cost, K = nrow(nodal))$delta)
[1] (values omitted)
> (cv.11.err <- cv.glm(nodal, nodal.glm, cost, K = 11)$delta)
[1] (values omitted)
randomForest
> library(e1071)
> library(rpart)
> library(mlbench) # etc.
> data(kyphosis)
> require(randomForest) # or library(randomForest)
> fitKF <- randomForest(Kyphosis ~ Age + Number + Start, data = kyphosis)
> print(fitKF)      # view results
> importance(fitKF) # importance of each predictor
# what else can you do?  (Lab10b_rf3_2015.R)
data(swiss)                      # fertility?
data(Glass, package = "mlbench") # Type ~ ?
data(Titanic)                    # Survived ~ .
Find: Mileage ~ Price + Country + Reliability + Type
13
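For the Glass exercise, one possible starting point (a hedged sketch; the seed and the use of all predictors are assumptions, not the lab answer):

library(randomForest)
data(Glass, package = "mlbench")
set.seed(1)                                   # assumed seed
fitGlass <- randomForest(Type ~ ., data = Glass, importance = TRUE)
print(fitGlass)                               # OOB error and confusion matrix
importance(fitGlass)                          # which measurements matter most?
varImpPlot(fitGlass)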
MDS
Lab8b_mds1_2015.R
Lab8b_mds2_2015.R
Lab8b_mds3_2015.R
http://gastonsanchez.com/blog/how-to/2013/01/23/MDS-in-R.html
14
R – many ways (of course)
library(igraph)
g <- graph.full(nrow(dist.au))                       # complete graph, one vertex per city
V(g)$label <- city.names
layout <- layout.mds(g, dist = as.matrix(dist.au))   # MDS-based vertex placement
plot(g, layout = layout, vertex.size = 3)
15
Distances between Australian cities
# dist.au <- read.csv(" ta/dist-Aus.csv")
# Lab8b_mds1_2015.R
row.names(dist.au) <- dist.au[, 1]
dist.au <- dist.au[, -1]
dist.au
## 8 x 8 distance matrix; rows/columns A AS B D H M P S (values omitted)
Distances between Australian cities
fit <- cmdscale(dist.au, eig = TRUE, k = 2)   # classical MDS, 2 dimensions
x <- fit$points[, 1]
y <- fit$points[, 2]
plot(x, y, pch = 19, xlim = range(x) + c(0, 600))
city.names <- c("Adelaide", "Alice Springs", "Brisbane", "Darwin",
                "Hobart", "Melbourne", "Perth", "Sydney")
text(x, y, pos = 4, labels = city.names)
Try the other MDS functions...
17
In R: function (library)
cmdscale()  (stats)
smacofSym() (smacof)
wcmdscale() (vegan)
pco()       (ecodist)
pco()       (labdsv)
pcoa()      (ape)
Only stats is loaded by default; the others are not installed by default.
18
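For example, the same city configuration via stress majorization (a hedged sketch, assuming the smacof package is installed and dist.au / city.names are defined as on the earlier slides):

library(smacof)
fit.sm <- smacofSym(as.dist(dist.au), ndim = 2)   # SMACOF metric MDS
plot(fit.sm$conf, pch = 19)
text(fit.sm$conf, pos = 4, labels = city.names)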
Do these dimension reductions:
Lab8b_dr1_2015.R
Lab8b_dr2_2015.R
Lab8b_dr3_2015.R
Lab8b_dr4_2015.R
19
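The lab scripts themselves are not reproduced here; as a minimal reminder of what a dimension reduction looks like, a PCA sketch on a built-in dataset (the dataset choice is an assumption, not the lab content):

data(swiss)
pca <- prcomp(swiss, scale. = TRUE)   # standardize variables before rotation
summary(pca)                          # proportion of variance per component
biplot(pca)                           # observations and loadings in 2-D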
Factor Analysis
library(psych)   # irt.fa(), score.irt(), and the iqitems/ability data
data(iqitems)
# data(ability)
ability.irt <- irt.fa(ability)
ability.scores <- score.irt(ability.irt, ability)
data(attitude)
cor(attitude)
# Compute eigenvalues and eigenvectors of the correlation matrix.
pfa.eigen <- eigen(cor(attitude))
pfa.eigen$values
# set a value for the number of factors (for clarity)
factors <- 2
# Extract and transform two components.
pfa.eigen$vectors[, 1:factors] %*%
  diag(sqrt(pfa.eigen$values[1:factors]), factors, factors)
20
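As a cross-check on the eigen-decomposition above, R's built-in maximum-likelihood factor analysis fits the same two factors in one call (not part of the lab script; loadings will differ somewhat because of the rotation and estimation method):

factanal(attitude, factors = 2, rotation = "varimax")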
Glass
index <- 1:nrow(Glass)
testindex <- sample(index, trunc(length(index)/3))  # hold out one third
testset <- Glass[testindex, ]
trainset <- Glass[-testindex, ]
cor(testset[, -10])   # cor(), lower-case; drop the Type factor column first
Factor Analysis?
21
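To answer the slide's question, the eigen-based approach from the previous slide carries over once the Type factor column is dropped (a hedged sketch; the two-factor choice is an assumption):

glass.cor <- cor(trainset[, -10])                # chemical measurements only
glass.eigen <- eigen(glass.cor)
glass.eigen$values                               # scree: how many factors?
factors <- 2
glass.eigen$vectors[, 1:factors] %*%
  diag(sqrt(glass.eigen$values[1:factors]), factors, factors)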
Try these
example_exploratoryFactorAnalysis.R on dataset_exploratoryFactorAnalysis.csv
http://rtutorialseries.blogspot.com/2011/10/r-tutorial-series-exploratory-factor.html
http://stats.stackexchange.com/questions/1576/what-are-the-differences-between-factor-analysis-and-principal-component-analysis
Lab10b_fa{1,2,4,5}_2015.R
22