
1 Peter Fox
Data Analytics – ITWS-4963/ITWS-6965
Week 10b, April 10, 2015
Labs: Cross Validation, RandomForest, Multi-Dimensional Scaling, Dimension Reduction, Factor Analysis

2 Advertisements
Mike Schroepfer, Facebook CTO, will be lecturing and doing Q&A for Data and Society (CSCI 4967/6963) on Friday, April 24, in Walker 5113 from 9 to 11. Who would attend (must confirm)?

3 If you did not complete svm: Lab9b_svm{1,11}_2015.R

4 Cross-validation - coleman
> head(coleman)
  salaryP fatherWc sstatus teacherSc motherLev     Y
1    3.83    28.87    7.20      26.6      6.19 37.01
2    2.89    20.10  -11.71      24.4      5.17 26.51
3    2.86    69.05   12.32      25.7      7.04 36.51
4    2.92    65.40   14.28      25.7      7.10 40.70
5    3.06    29.59    6.31      25.4      6.15 37.10
6    2.07    44.82    6.16      21.6      6.41 33.90
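The coleman school data set (20 schools, five predictors and the verbal test score Y) ships with the robustbase package; a minimal sketch of loading it, assuming robustbase is installed:

library(robustbase)   # provides the coleman data set (and lmrob)
data(coleman)
dim(coleman)          # 20 rows, 6 columns: 5 predictors plus the response Y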

5 Lab11_11_2014.R
> call <- call("lmrob", formula = Y ~ .)
> # set up folds for cross-validation
> folds <- cvFolds(nrow(coleman), K = 5, R = 10)
> # perform cross-validation
> cvTool(call, data = coleman, y = coleman$Y, cost = rtmspe,
+        folds = folds, costArgs = list(trim = 0.1))
             CV
 [1,] 0.9880672
 [2,] 0.9525881
 [3,] 0.8989264
 [4,] 1.0177694
 [5,] 0.9860661
 [6,] 1.8369717
 [7,] 0.9550428
 [8,] 1.0698466
 [9,] 1.3568537
[10,] 0.8313474
Warning messages:
1: In lmrob.S(x, y, control = control) :
   S refinements did not converge (to refine.tol=1e-07) in 200 (= k.max) steps
2: In lmrob.S(x, y, control = control) :
   S refinements did not converge (to refine.tol=1e-07) in 200 (= k.max) steps
3: In lmrob.S(x, y, control = control) :
   find_scale() did not converge in 'maxit.scale' (= 200) iterations
4: In lmrob.S(x, y, control = control) :
   find_scale() did not converge in 'maxit.scale' (= 200) iterations
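The warnings come from the robust S-estimation step hitting its iteration caps (both default to 200, as the messages show). A hedged fix, using robustbase's lmrob.control() parameters k.max and maxit.scale, is to raise the limits before building the call:

library(robustbase)
ctrl <- lmrob.control(k.max = 500, maxit.scale = 500)   # defaults are 200
call <- call("lmrob", formula = Y ~ ., control = ctrl)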

6 Lab11b_12_2014.R
> cvFits
5-fold CV results:
   Fit       CV
1   LS 1.674485
2   MM 1.147130
3  LTS 1.291797

Best model:
  CV
"MM"
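The slide shows only the printed comparison; one plausible way cvFits was assembled (a sketch following the cvTools examples, assuming robustbase supplies lmrob and ltsReg) is:

library(cvTools)
library(robustbase)
data(coleman)
folds <- cvFolds(nrow(coleman), K = 5, R = 10)
# least squares
fitLm <- lm(Y ~ ., data = coleman)
cvFitLm <- cvLm(fitLm, cost = rtmspe, folds = folds, trim = 0.1)
# MM-estimator
fitLmrob <- lmrob(Y ~ ., data = coleman)
cvFitLmrob <- cvLmrob(fitLmrob, cost = rtmspe, folds = folds, trim = 0.1)
# least trimmed squares
fitLts <- ltsReg(Y ~ ., data = coleman)
cvFitLts <- cvLts(fitLts, cost = rtmspe, folds = folds, trim = 0.1)
# compare the three fits on the same folds
cvFits <- cvSelect(LS = cvFitLm, MM = cvFitLmrob, LTS = cvFitLts)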

7 50 and 75% subsets
# 50% subsets
fitLts50 <- ltsReg(Y ~ ., data = coleman, alpha = 0.5)
cvFitLts50 <- cvLts(fitLts50, cost = rtmspe, folds = folds,
                    fit = "both", trim = 0.1)
# 75% subsets
fitLts75 <- ltsReg(Y ~ ., data = coleman, alpha = 0.75)
cvFitLts75 <- cvLts(fitLts75, cost = rtmspe, folds = folds,
                    fit = "both", trim = 0.1)
# combine and plot results
cvFitsLts <- cvSelect("0.5" = cvFitLts50, "0.75" = cvFitLts75)

8 cvFitsLts (50/75)
> cvFitsLts
5-fold CV results:
   Fit reweighted      raw
1  0.5   1.291797 1.640922
2 0.75   1.065495 1.232691

Best model:
reweighted        raw
    "0.75"     "0.75"

9 Tuning
tuning <- list(tuning.psi = c(3.14, 3.44, 3.88, 4.68))
# perform cross-validation
cvFitsLmrob <- cvTuning(fitLmrob$call, data = coleman,
                        y = coleman$Y, tuning = tuning, cost = rtmspe,
                        folds = folds, costArgs = list(trim = 0.1))
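fitLmrob is used above but never defined on the slide; presumably it was created along these lines (a sketch, assuming the same coleman data and robustbase):

library(robustbase)
fitLmrob <- lmrob(Y ~ ., data = coleman)   # robust MM-regression fit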

10 cvFitsLmrob
5-fold CV results:
  tuning.psi       CV
1       3.14 1.179620
2       3.44 1.156674
3       3.88 1.169436
4       4.68 1.133975

Optimal tuning parameter:
   tuning.psi
CV       4.68

11 Lab11b_18
library(boot)   # cv.glm(), glm.diag()
library(MASS)   # mammals data set
mammals.glm <- glm(log(brain) ~ log(body), data = mammals)
(cv.err <- cv.glm(mammals, mammals.glm)$delta)
[1] 0.4918650 0.4916571
> (cv.err.6 <- cv.glm(mammals, mammals.glm, K = 6)$delta)
[1] 0.4967271 0.4938003
# As this is a linear model we could calculate the leave-one-out
# cross-validation estimate without any extra model-fitting.
muhat <- fitted(mammals.glm)
mammals.diag <- glm.diag(mammals.glm)
(cv.err <- mean((mammals.glm$y - muhat)^2/(1 - mammals.diag$h)^2))
[1] 0.491865
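Why no extra fitting is needed: for a linear fit with hat values h_i, each leave-one-out residual follows from the full-sample residual, so

\[
\mathrm{CV}_{\mathrm{LOO}} \;=\; \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{\mu}_i}{1 - h_i}\right)^{2},
\]

which is exactly what the last line above computes, with glm.diag()$h supplying the h_i.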

12 Lab11b_18
# leave-one-out and 11-fold cross-validation prediction error for
# the nodal data set. Since the response is a binary variable
# an appropriate cost function is
> cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)
> nodal.glm <- glm(r ~ stage + xray + acid, binomial, data = nodal)
> (cv.err <- cv.glm(nodal, nodal.glm, cost, K = nrow(nodal))$delta)
[1] 0.1886792 0.1886792
> (cv.11.err <- cv.glm(nodal, nodal.glm, cost, K = 11)$delta)
[1] 0.2264151 0.2228551

13 randomForest
> library(e1071)
> library(rpart)
> library(mlbench)   # etc.
> data(kyphosis)
> require(randomForest)   # or library(randomForest)
> fitKF <- randomForest(Kyphosis ~ Age + Number + Start, data = kyphosis)
> print(fitKF)        # view results
> importance(fitKF)   # importance of each predictor
# what else can you do?
data(swiss)   # fertility? Lab10b_rf3_2015.R
data(Glass, package = "mlbench")   # Type ~ ?
data(Titanic)   # Survived ~ .
Find: Mileage ~ Price + Country + Reliability + Type (see the sketch below)
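For the Mileage exercise, a hedged sketch using the cu.summary data from rpart (its column names match the formula; rows with missing values are dropped first, since randomForest will not accept NAs by default):

library(rpart)           # provides the cu.summary data set
library(randomForest)
data(cu.summary)
cu <- na.omit(cu.summary[, c("Mileage", "Price", "Country",
                             "Reliability", "Type")])
fitCU <- randomForest(Mileage ~ Price + Country + Reliability + Type,
                      data = cu)
print(fitCU)
importance(fitCU)        # which predictors drive Mileage?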

14 MDS
Lab8b_mds1_2015.R
Lab8b_mds2_2015.R
Lab8b_mds3_2015.R
http://www.statmethods.net/advstats/mds.html
http://gastonsanchez.com/blog/how-to/2013/01/23/MDS-in-R.html

15 R – many ways (of course)
library(igraph)
g <- graph.full(nrow(dist.au))
V(g)$label <- city.names
layout <- layout.mds(g, dist = as.matrix(dist.au))
plot(g, layout = layout, vertex.size = 3)

16 Distances between Australian cities
# dist.au <- read.csv("http://rosetta.reltech.org/TC/v15/Mapping/data/dist-Aus.csv")
# Lab8b_mds1_2015.R
row.names(dist.au) <- dist.au[, 1]
dist.au <- dist.au[, -1]
dist.au
##       A   AS    B    D    H    M    P    S
## A     0 1328 1600 2616 1161  653 2130 1161
## AS 1328    0 1962 1289 2463 1889 1991 2026
## B  1600 1962    0 2846 1788 1374 3604  732
## D  2616 1289 2846    0 3734 3146 2652 3146
## H  1161 2463 1788 3734    0  598 3008 1057
## M   653 1889 1374 3146  598    0 2720  713
## P  2130 1991 3604 2652 3008 2720    0 3288
## S  1161 2026  732 3146 1057  713 3288    0

17 Distances between Australian cities
fit <- cmdscale(dist.au, eig = TRUE, k = 2)
x <- fit$points[, 1]
y <- fit$points[, 2]
plot(x, y, pch = 19, xlim = range(x) + c(0, 600))
city.names <- c("Adelaide", "Alice Springs", "Brisbane", "Darwin",
                "Hobart", "Melbourne", "Perth", "Sydney")
text(x, y, pos = 4, labels = city.names)
Try the other MDS functions...

18 In R: function (library)
cmdscale() (stats)
smacofSym() (smacof)
wcmdscale() (vegan)
pco() (ecodist)
pco() (labdsv)
pcoa() (ape)
Only stats is loaded by default; the rest are not installed by default. A sketch with one of the alternatives follows.
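As one "other MDS function" from the list above, a sketch with smacofSym() on the Australian cities data (an assumption that the smacof package is installed; it expects a dist object and returns the configuration in $conf):

library(smacof)
fit.sm <- smacofSym(as.dist(as.matrix(dist.au)), ndim = 2)
plot(fit.sm$conf, pch = 19, xlab = "", ylab = "")
text(fit.sm$conf, pos = 4, labels = city.names)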

19 Do these dimension reductions
Lab8b_dr1_2015.R
Lab8b_dr2_2015.R
Lab8b_dr3_2015.R
Lab8b_dr4_2015.R

20 Factor Analysis
library(psych)   # irt.fa(), score.irt(), and the iqitems/ability data
data(iqitems)    # data(ability)
ability.irt <- irt.fa(ability)
ability.scores <- score.irt(ability.irt, ability)

data(attitude)
cor(attitude)
# Compute eigenvalues and eigenvectors of the correlation matrix.
pfa.eigen <- eigen(cor(attitude))
pfa.eigen$values
# set a value for the number of factors (for clarity)
factors <- 2
# Extract and transform two components.
pfa.eigen$vectors[, 1:factors] %*%
  diag(sqrt(pfa.eigen$values[1:factors]), factors, factors)
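For comparison with the eigen-decomposition approach above, the built-in factanal() (stats) fits a maximum-likelihood factor model directly; a sketch on the same attitude data with two factors:

fa.ml <- factanal(attitude, factors = 2, rotation = "varimax")
print(fa.ml$loadings, cutoff = 0.3)   # suppress small loadings for readability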

21 Glass
index <- 1:nrow(Glass)
testindex <- sample(index, trunc(length(index)/3))
testset <- Glass[testindex, ]
trainset <- Glass[-testindex, ]
cor(testset[, -10])   # cor() needs numeric columns; column 10 (Type) is a factor
Factor Analysis?
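One way to start answering "Factor Analysis?", mirroring the eigen-decomposition approach of slide 20 (again dropping the Type factor in column 10):

pfa.glass <- eigen(cor(trainset[, -10]))
pfa.glass$values   # eigenvalue scree: how many factors look worthwhile?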

22 Try these
example_exploratoryFactorAnalysis.R on dataset_exploratoryFactorAnalysis.csv
– http://rtutorialseries.blogspot.com/2011/10/r-tutorial-series-exploratory-factor.html
http://www.statmethods.net/advstats/factor.html
http://stats.stackexchange.com/questions/1576/what-are-the-differences-between-factor-analysis-and-principal-component-analysi
Lab10b_fa{1,2,4,5}_2015.R

