Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 10b, April 10, 2015 Labs: Cross Validation, RandomForest, Multi-Dimensional Scaling, Dimension Reduction, Factor Analysis



Mike Schroepfer, Facebook CTO, will be lecturing and doing Q&A for Data and Society (CSCI 4967/6963) on Friday, April 24, in Walker 5113 from 9 to 11. Who would attend (must confirm)?

If you did not complete the SVM lab: Lab9b_svm{1,11}_2015.R
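For anyone still finishing that lab, here is a minimal SVM sketch with e1071 on the built-in iris data (a stand-in only; the actual lab scripts may use different data, kernels, and tuning):
library(e1071)                                 # svm()
set.seed(1)
idx  <- sample(nrow(iris), 0.7 * nrow(iris))   # simple train/test split
fit  <- svm(Species ~ ., data = iris[idx, ], kernel = "radial")
pred <- predict(fit, iris[-idx, ])
table(predicted = pred, actual = iris$Species[-idx])   # hold-out confusion matrix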

Cross-validation - coleman
> head(coleman)
  salaryP fatherWc sstatus teacherSc motherLev Y
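If coleman is not already in your workspace, it ships with the robustbase package (a minimal sketch, assuming that is where your lab loads it from):
library(robustbase)   # provides the coleman school data and lmrob()
data(coleman)
dim(coleman)          # 20 schools, 6 variables
str(coleman)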

Lab11_11_2014.R
> library(robustbase)   # lmrob()
> library(cvTools)      # cvFolds(), cvTool(), rtmspe
> call <- call("lmrob", formula = Y ~ .)
> # set up folds for cross-validation
> folds <- cvFolds(nrow(coleman), K = 5, R = 10)
> # perform cross-validation
> cvTool(call, data = coleman, y = coleman$Y, cost = rtmspe,
+   folds = folds, costArgs = list(trim = 0.1))
           CV
 [1,]
 [2,]
 [3,]
 [4,]
 [5,]
 [6,]
 [7,]
 [8,]
 [9,]
[10,]
Warning messages:
1: In lmrob.S(x, y, control = control) :
  S refinements did not converge (to refine.tol=1e-07) in 200 (= k.max) steps
2: In lmrob.S(x, y, control = control) :
  S refinements did not converge (to refine.tol=1e-07) in 200 (= k.max) steps
3: In lmrob.S(x, y, control = control) :
  find_scale() did not converge in 'maxit.scale' (= 200) iterations
4: In lmrob.S(x, y, control = control) :
  find_scale() did not converge in 'maxit.scale' (= 200) iterations
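To see what cvFolds()/cvTool() automate, here is a hand-rolled 5-fold cross-validation of a plain lm() on the same data; a sketch for intuition only, not a replacement for the robust, trimmed rtmspe version above:
set.seed(42)
k <- 5
fold.id <- sample(rep(1:k, length.out = nrow(coleman)))   # random fold labels
rmse <- numeric(k)
for (i in 1:k) {
  train <- coleman[fold.id != i, ]
  test  <- coleman[fold.id == i, ]
  fit   <- lm(Y ~ ., data = train)
  rmse[i] <- sqrt(mean((test$Y - predict(fit, test))^2))  # per-fold RMSE
}
rmse
mean(rmse)   # overall CV estimate of prediction error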

Lab11b_12_2014.R
> cvFits
5-fold CV results:
   Fit  CV
1   LS
2   MM
3  LTS
Best model:
  CV
"MM"
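One way a comparison like this can be assembled (a sketch using cvTools::cvFit() and cvSelect(); check the argument names against your installed cvTools version, and note the LTS fits are built the same way on the next slide):
library(cvTools)
library(robustbase)
fitLm    <- lm(Y ~ ., data = coleman)       # least squares
fitLmrob <- lmrob(Y ~ ., data = coleman)    # MM-type robust regression
cvLS <- cvFit(fitLm,    data = coleman, y = coleman$Y, cost = rtmspe,
              folds = folds, costArgs = list(trim = 0.1))
cvMM <- cvFit(fitLmrob, data = coleman, y = coleman$Y, cost = rtmspe,
              folds = folds, costArgs = list(trim = 0.1))
cvFits <- cvSelect(LS = cvLS, MM = cvMM)    # combine and report the best model
cvFits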

50 and 75% subsets
fitLts50 <- ltsReg(Y ~ ., data = coleman, alpha = 0.5)
cvFitLts50 <- cvLts(fitLts50, cost = rtmspe, folds = folds,
  fit = "both", trim = 0.1)
# 75% subsets
fitLts75 <- ltsReg(Y ~ ., data = coleman, alpha = 0.75)
cvFitLts75 <- cvLts(fitLts75, cost = rtmspe, folds = folds,
  fit = "both", trim = 0.1)
# combine and plot results
cvFitsLts <- cvSelect("0.5" = cvFitLts50, "0.75" = cvFitLts75)

cvFitsLts (50/75)
> cvFitsLts
5-fold CV results:
   Fit  reweighted  raw
Best model:
reweighted        raw
    "0.75"     "0.75"

Tuning
# candidate tuning constants for the lmrob() psi function
tuning <- list(tuning.psi = c(3.14, 3.44, 3.88, 4.68))
# an initial fit is assumed, e.g. fitLmrob <- lmrob(Y ~ ., data = coleman)
# perform cross-validation over the tuning grid
cvFitsLmrob <- cvTuning(fitLmrob$call, data = coleman, y = coleman$Y,
  tuning = tuning, cost = rtmspe, folds = folds,
  costArgs = list(trim = 0.1))

cvFitsLmrob
5-fold CV results:
  tuning.psi  CV
Optimal tuning parameter:
   tuning.psi
CV
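Once cvTuning() has picked the best psi constant, refit lmrob() with it; a sketch, with 4.68 standing in for whichever value comes out optimal on your run:
library(robustbase)
best.fit <- lmrob(Y ~ ., data = coleman,
  control = lmrob.control(tuning.psi = 4.68))
summary(best.fit)   # compare with the default-tuned fit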

Lab11b_18
library(boot)                     # cv.glm(), glm.diag()
data(mammals, package = "MASS")   # brain vs. body weight data
mammals.glm <- glm(log(brain) ~ log(body), data = mammals)
(cv.err <- cv.glm(mammals, mammals.glm)$delta)
[1]
(cv.err.6 <- cv.glm(mammals, mammals.glm, K = 6)$delta)
[1]
# As this is a linear model we could calculate the leave-one-out
# cross-validation estimate without any extra model-fitting.
muhat <- fitted(mammals.glm)
mammals.diag <- glm.diag(mammals.glm)
(cv.err <- mean((mammals.glm$y - muhat)^2/(1 - mammals.diag$h)^2))
[1]
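The last line is the standard leave-one-out shortcut for linear models: with leverage (hat) values $h_i$ taken from glm.diag(), the LOOCV error can be written as
$$\mathrm{CV} \;=\; \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^{2},$$
so each observation's deleted residual is obtained without refitting the model.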

Lab11b_18
# leave-one-out and 11-fold cross-validation prediction error for
# the nodal data set. Since the response is a binary variable
# an appropriate cost function is
> cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)
> nodal.glm <- glm(r ~ stage + xray + acid, binomial, data = nodal)
> (cv.err <- cv.glm(nodal, nodal.glm, cost, K = nrow(nodal))$delta)
[1]
> (cv.11.err <- cv.glm(nodal, nodal.glm, cost, K = 11)$delta)
[1]

randomForest
> library(e1071)
> library(rpart)
> library(mlbench) # etc.
> data(kyphosis)
> require(randomForest) # or library(randomForest)
> fitKF <- randomForest(Kyphosis ~ Age + Number + Start, data = kyphosis)
> print(fitKF)       # view results
> importance(fitKF)  # importance of each predictor
# what else can you do?
data(swiss)                        # fertility? Lab10b_rf3_2015.R
data(Glass, package = "mlbench")   # Type ~ ?
data(Titanic)                      # Survived ~ .
Find - Mileage ~ Price + Country + Reliability + Type
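As a sketch for one of those follow-up exercises, predicting glass Type (the settings here are illustrative, not the lab's):
library(randomForest)
library(mlbench)
data(Glass, package = "mlbench")
set.seed(1)
fitGlass <- randomForest(Type ~ ., data = Glass, ntree = 500)
print(fitGlass)        # OOB error rate and confusion matrix
importance(fitGlass)   # which oxides matter most for Type
varImpPlot(fitGlass)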

MDS
Lab8b_mds1_2015.R
Lab8b_mds2_2015.R
Lab8b_mds3_2015.R
http://gastonsanchez.com/blog/how-to/2013/01/23/MDS-in-R.html

R – many ways (of course)
library(igraph)
# dist.au and city.names are defined on the next two slides
g <- graph.full(nrow(dist.au))
V(g)$label <- city.names
layout <- layout.mds(g, dist = as.matrix(dist.au))
plot(g, layout = layout, vertex.size = 3)

Distances between Australian cities
# dist.au <- read.csv(" ta/dist-Aus.csv")
# Lab8b_mds1_2015.R
row.names(dist.au) <- dist.au[, 1]
dist.au <- dist.au[, -1]
dist.au
##    A AS  B  D  H  M  P  S
## A
## AS
## B
## D
## H
## M
## P
## S

Distances between Australian cities
fit <- cmdscale(dist.au, eig = TRUE, k = 2)
x <- fit$points[, 1]
y <- fit$points[, 2]
plot(x, y, pch = 19, xlim = range(x) + c(0, 600))
city.names <- c("Adelaide", "Alice Springs", "Brisbane", "Darwin",
  "Hobart", "Melbourne", "Perth", "Sydney")
text(x, y, pos = 4, labels = city.names)
Try the other MDS functions...

In R: function (library)
cmdscale() (stats)
smacofSym() (smacof)
wcmdscale() (vegan)
pco() (ecodist)
pco() (labdsv)
pcoa() (ape)
Only stats is loaded by default, and the rest are not installed by default.
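A sketch of trying two of the listed alternatives on the same Australian-cities matrix (assuming the smacof and vegan packages are installed, and dist.au/city.names are defined as on the previous slides):
library(smacof)   # smacofSym()
library(vegan)    # wcmdscale()
d <- as.dist(dist.au)
fit.sm <- smacofSym(d, ndim = 2)            # SMACOF (stress majorization)
plot(fit.sm$conf, pch = 19)
text(fit.sm$conf, pos = 4, labels = city.names)
fit.wc <- wcmdscale(d, k = 2, eig = TRUE)   # weighted classical MDS
plot(fit.wc$points, pch = 19)
text(fit.wc$points, pos = 4, labels = city.names)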

Do these dimension reductions:
Lab8b_dr1_2015.R
Lab8b_dr2_2015.R
Lab8b_dr3_2015.R
Lab8b_dr4_2015.R
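If you want a self-contained warm-up before those scripts, here is a minimal PCA example on a built-in dataset (the labs themselves may use different data and additional methods):
data(USArrests)
pc <- prcomp(USArrests, scale. = TRUE)   # scale: variables are on different units
summary(pc)                              # proportion of variance per component
pc$rotation                              # loadings
biplot(pc)                               # observations plus variable directions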

Factor Analysis
library(psych)   # irt.fa(), score.irt(), and the iqitems/ability data
data(iqitems)    # data(ability)
ability.irt <- irt.fa(ability)
ability.scores <- score.irt(ability.irt, ability)
data(attitude)
cor(attitude)
# Compute eigenvalues and eigenvectors of the correlation matrix.
pfa.eigen <- eigen(cor(attitude))
pfa.eigen$values
# set a value for the number of factors (for clarity)
factors <- 2
# Extract and transform two components.
pfa.eigen$vectors[, 1:factors] %*%
  diag(sqrt(pfa.eigen$values[1:factors]), factors, factors)
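The eigen-decomposition above hand-builds principal-component style loadings; for comparison, base R's factanal() fits a maximum-likelihood factor model directly (a sketch on the same attitude data):
fa2 <- factanal(attitude, factors = 2, rotation = "varimax")
print(fa2, digits = 2, cutoff = 0.3, sort = TRUE)   # loadings, uniquenesses, fit test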

Glass
library(mlbench)                   # Glass data
data(Glass, package = "mlbench")
index <- 1:nrow(Glass)
testindex <- sample(index, trunc(length(index)/3))
testset <- Glass[testindex, ]
trainset <- Glass[-testindex, ]
cor(testset[, -10])   # Type (column 10) is a factor, so drop it before cor()
Factor Analysis?
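Since Type is a factor, restrict correlation and any factor or principal-component analysis to the nine numeric columns; a sketch on the training split defined above:
num.train <- trainset[, -10]   # oxide measurements plus refractive index
round(cor(num.train), 2)       # correlation structure
pc.glass <- prcomp(num.train, scale. = TRUE)
summary(pc.glass)              # variance explained, a starting point for a factor model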

Try these
example_exploratoryFactorAnalysis.R on dataset_exploratoryFactorAnalysis.csv
http://rtutorialseries.blogspot.com/2011/10/r-tutorial-series-exploratory-factor.html
http://stats.stackexchange.com/questions/1576/what-are-the-differences-between-factor-analysis-and-principal-component-analysi
Lab10b_fa{1,2,4,5}_2015.R