Peter Fox
Data Analytics – ITWS-4963/ITWS-6965
Week 14b, May 2, 2014
PCA and return to Big Data infrastructure… and assignment time.
Visual approaches for PCA/DR
Screeplot – a plot, in descending order of magnitude, of the eigenvalues of a correlation matrix. In the context of factor analysis or principal components analysis, a scree plot helps the analyst visualize the relative importance of the factors: a sharp drop in the plot signals that subsequent factors are ignorable.
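For example (a minimal sketch, my addition rather than the slides'): base R's screeplot() draws this plot directly from a prcomp fit, here using the USArrests data that appears on the next slide.

screeplot(prcomp(USArrests, scale = TRUE),  # eigenvalues of the correlation matrix
          type = "lines")                   # "lines" gives the classic scree look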
require(graphics)
## the variances of the variables in the USArrests data
## vary by orders of magnitude, so scaling is appropriate
prcomp(USArrests)  # inappropriate
prcomp(USArrests, scale = TRUE)
prcomp(~ Murder + Assault + Rape, data = USArrests, scale = TRUE)
plot(prcomp(USArrests))
summary(prcomp(USArrests, scale = TRUE))
biplot(prcomp(USArrests, scale = TRUE))
prcomp
> prcomp(USArrests)  # inappropriate
Standard deviations:
[1]
Rotation:
          PC1  PC2  PC3  PC4
Murder
Assault
UrbanPop
Rape

> prcomp(USArrests, scale = TRUE)
Standard deviations:
[1]
Rotation:
          PC1  PC2  PC3  PC4
Murder
Assault
UrbanPop
Rape
screeplot [figure]
> prcomp(~ Murder + Assault + Rape, data = USArrests, scale = TRUE)
Standard deviations:
[1]
Rotation:
          PC1  PC2  PC3
Murder
Assault
Rape

> summary(prcomp(USArrests, scale = TRUE))
Importance of components:
                        PC1  PC2  PC3  PC4
Standard deviation
Proportion of Variance
Cumulative Proportion
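The rows of this summary can be reproduced by hand (a minimal sketch, my addition): each "Proportion of Variance" entry is one component's variance divided by the total variance.

p <- prcomp(USArrests, scale = TRUE)
prop <- p$sdev^2 / sum(p$sdev^2)              # each PC's share of total variance
rbind(`Standard deviation`     = p$sdev,
      `Proportion of Variance` = prop,
      `Cumulative Proportion`  = cumsum(prop))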
biplot [figure]
Line plots (lab 6): prcomp (top) and metaPCA (bottom) [figure]
Looking for convergence as the iteration count increases.
Panels: EigenAngle, RobustAngle, SparseAngle
Prostate data (lab 7): 2D plot [figure]
Lab 9
library(dr)
data(ais)
# default fitting method is "sir"
s0 <- dr(LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) + log(WCC) +
           log(RCC) + log(Hc) + log(Ferr), data = ais)
# Refit, using a different function for slicing to agree with arc.
summary(s1 <- update(s0, slice.function = dr.slices.arc))
# Refit again, using save, with 10 slices; the default is max(8, ncol+3).
summary(s2 <- update(s1, nslices = 10, method = "save"))
# Fit using phdres; output is similar for phdy, but tests are not justifiable.
summary(s3 <- update(s1, method = "phdres"))
# Fit using ire.
summary(s4 <- update(s1, method = "ire"))
# Fit using Sex as a grouping variable.
s5 <- update(s4, group = ~Sex)
> s0
dr(formula = LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) +
    log(WCC) + log(RCC) + log(Hc) + log(Ferr), data = ais)
Estimated Basis Vectors for Central Subspace:
          Dir1  Dir2  Dir3  Dir4
log(SSF)
log(Wt)
log(Hg)
log(Ht)
log(WCC)
log(RCC)
log(Hc)
log(Ferr)
Eigenvalues:
[1]
> summary(s1 <- update(s0, slice.function = dr.slices.arc))
Call:
dr(formula = LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) +
    log(WCC) + log(RCC) + log(Hc) + log(Ferr), data = ais,
    slice.function = dr.slices.arc)
Method:
sir with 11 slices, n = 202.
Slice Sizes:
Estimated Basis Vectors for Central Subspace:
          Dir1  Dir2  Dir3  Dir4
log(SSF)
log(Wt)
log(Hg)
log(Ht)
log(WCC)
log(RCC)
log(Hc)
log(Ferr)
            Dir1  Dir2  Dir3  Dir4
Eigenvalues
R^2(OLS|dr)
Large-sample Marginal Dimension Tests:
             Stat  df  p.value
0D vs >= 1D
1D vs >= 2D
2D vs >= 3D
3D vs >= 4D
> summary(s2 <- update(s1, nslices = 10, method = "save"))
Call:
dr(formula = LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) +
    log(WCC) + log(RCC) + log(Hc) + log(Ferr), data = ais,
    slice.function = dr.slices.arc, nslices = 10, method = "save")
Method:
save with 10 slices, n = 202.
Slice Sizes:
Estimated Basis Vectors for Central Subspace:
          Dir1  Dir2  Dir3  Dir4
log(SSF)
log(Wt)
log(Hg)
log(Ht)
log(WCC)
log(RCC)
log(Hc)
log(Ferr)
            Dir1  Dir2  Dir3  Dir4
Eigenvalues
R^2(OLS|dr)
Large-sample Marginal Dimension Tests:
             Stat  df(Nor)  p.value(Nor)  p.value(Gen)
0D vs >= 1D
1D vs >= 2D
2D vs >= 3D
3D vs >= 4D
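The comparison figures on the next two slides are plausibly drawn with dr's plotting support; a hedged sketch (my addition, assuming the dr package's documented plot method and dr.direction() accessor):

plot(s2)                # scatterplot matrix of LBM against the first estimated directions
head(dr.direction(s2))  # the data projected onto the estimated basis vectors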
s0 vs. s2 [figure]
s3 and s4 [figure]
Infrastructure tools
In RStudio:
– Install the rmongodb package
– http://cran.r-project.org/web/packages/rmongodb/vignettes/rmongodb_cheat_sheet.pdf
– http://cran.r-project.org/web/packages/rmongodb/vignettes/rmongodb_introduction.html
MongoDB:
– http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis – get familiar with the choices
General idea:
– These are “backend” stores that can do various “things”
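A minimal connection sketch (my addition, following the pattern in the rmongodb introduction vignette; the namespace test.arrests is a hypothetical "database.collection" name):

library(rmongodb)
mongo <- mongo.create(host = "localhost")       # connect to a local MongoDB server
if (mongo.is.connected(mongo)) {
  ns <- "test.arrests"                          # hypothetical namespace
  # insert one document built from an R list
  mongo.insert(mongo, ns,
               mongo.bson.from.list(list(state = "Alaska", murder = 10.0)))
  print(mongo.count(mongo, ns))                 # number of documents in the collection
  print(mongo.find.one(mongo, ns,
        query = mongo.bson.from.list(list(state = "Alaska"))))
  mongo.destroy(mongo)                          # close the connection
}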
Back-ends
Files (e.g. csv), application files (e.g. Rdata, xls, mat, …) – essentially for reading/input
Databases – for reading and writing
– Also for advanced operations inside the database!
– Operations range from simple summaries to array operations and analytics functions
– Overhead lies in opening/maintaining/closing connections – easy on your laptop, harder when they are remote (network, authentication, etc.)
– There is also overhead in their internal storage formats (e.g. BSON for MongoDB)
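To make the connection-overhead point concrete, here is a minimal open/query/close sketch assuming the DBI and RSQLite packages (my choice of back-end, not the slides'); the same pattern applies to remote databases, where each step costs more:

library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")  # open: trivial locally, costly remotely
dbWriteTable(con, "arrests", USArrests)          # write an R data frame into the database
# push a simple summary into the database instead of pulling all rows back into R
dbGetQuery(con, "SELECT COUNT(*) AS n, AVG(Murder) AS mean_murder FROM arrests")
dbDisconnect(con)                                # close: always release the connection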
Functions versus languages
Libraries for R mean that you code in R, call functions, and the results return into R
– Whatever the function does (i.e. how it is implemented) is what you get (subject to setting parameters)
Languages (like Pig) provide more direct access to efficiently using the underlying capabilities of the application engine/database
– The cost is learning this new language
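A small illustration of the library pattern (my addition): even in base R the heavy computation happens inside compiled code, and R only sees the returned object; the same contract holds when the function wraps a database or a Hadoop job.

fit <- prcomp(USArrests, scale = TRUE)  # the numerical work runs in compiled LAPACK routines
str(fit$rotation)                       # R simply receives the result as an object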
Example layout – Hadoop [figure]
Relating Open-Source and Commercial [figure]
Even further
http://projects.apache.org/indexes/category.html#database
– Hadoop (MapReduce) – distributed execution (via disk when data is large)
– Pig ( )
– HIVE ( )
– Spark – in memory (RSpark still not easy to find/install)
~ Objectives
Provide an application view of data analytics, i.e. a predictive/prescriptive model view, by focusing on the “front-end” (RStudio)
Over a variety of data…
Provide enough of a view of the back-ends to know how you will need to interface to them (both open-source and commercial)
Layers across the Analytics Stack [figure]
Time for assignments