1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 14b, May 2, 2014 PCA and return to Big Data infrastructure…. and assignment time.


2 Visual approaches for PCA/DR
Screeplot: a plot, in descending order of magnitude, of the eigenvalues of a correlation matrix. In factor analysis or principal components analysis, a scree plot helps the analyst visualize the relative importance of the factors; a sharp drop in the plot signals that subsequent factors are ignorable.
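The definition above can be sketched directly in R. This is a minimal illustration, assuming the built-in USArrests data (used later in these slides) as the example; the variable name `ev` is only for illustration.

```r
# Eigenvalues of the correlation matrix; eigen() returns them in descending order
ev <- eigen(cor(USArrests), symmetric = TRUE)$values

# Scree plot drawn by hand: eigenvalue against component index
plot(ev, type = "b", xlab = "Component", ylab = "Eigenvalue",
     main = "Scree plot (USArrests, correlation matrix)")

# The same picture from the built-in helper applied to a scaled PCA
screeplot(prcomp(USArrests, scale. = TRUE), type = "lines")
```

A visible "elbow" after the first one or two components is the sharp drop the slide refers to.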

3 require(graphics)
## the variances of the variables in the
## USArrests data vary by orders of magnitude, so scaling is appropriate
prcomp(USArrests) # inappropriate
prcomp(USArrests, scale = TRUE)
prcomp(~ Murder + Assault + Rape, data = USArrests, scale = TRUE)
plot(prcomp(USArrests))
summary(prcomp(USArrests, scale = TRUE))
biplot(prcomp(USArrests, scale = TRUE))

4 prcomp
> prcomp(USArrests) # inappropriate
Standard deviations:
[1] 83.732400 14.212402  6.489426  2.482790
Rotation:
                PC1         PC2         PC3         PC4
Murder   0.04170432 -0.04482166  0.07989066 -0.99492173
Assault  0.99522128 -0.05876003 -0.06756974  0.03893830
UrbanPop 0.04633575  0.97685748 -0.20054629 -0.05816914
Rape     0.07515550  0.20071807  0.97408059  0.07232502
> prcomp(USArrests, scale = TRUE)
Standard deviations:
[1] 1.5748783 0.9948694 0.5971291 0.4164494
Rotation:
                PC1        PC2        PC3         PC4
Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
Rape     -0.5434321 -0.1673186  0.8177779  0.08902432
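Why the unscaled call is marked "inappropriate" can be checked numerically. The sketch below is illustrative only (the names `vars`, `share_raw` and `ev_cor` are not from the slides): it compares the raw variances, shows how much of the unscaled PC1 is just the Assault column, and confirms that the scaled PCA matches an eigen-decomposition of the correlation matrix.

```r
# Per-variable variances: Assault dominates by orders of magnitude
vars <- sapply(USArrests, var)
print(round(vars, 1))

pc_raw    <- prcomp(USArrests)                 # covariance-based PCA
pc_scaled <- prcomp(USArrests, scale. = TRUE)  # correlation-based PCA

# Share of total variance captured by PC1 without scaling
share_raw <- pc_raw$sdev[1]^2 / sum(pc_raw$sdev^2)
print(share_raw)

# With scaling, the component variances are the eigenvalues of cor(USArrests)
ev_cor <- eigen(cor(USArrests), symmetric = TRUE)$values
print(all.equal(pc_scaled$sdev^2, ev_cor))
```

Without scaling, PC1 is essentially the Assault axis (loading 0.995 in the output above), so the decomposition mostly reflects measurement units rather than structure.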

5 screeplot

6 > prcomp(~ Murder + Assault + Rape, data = USArrests, scale = TRUE)
Standard deviations:
[1] 1.5357670 0.6767949 0.4282154
Rotation:
               PC1        PC2        PC3
Murder  -0.5826006  0.5339532 -0.6127565
Assault -0.6079818  0.2140236  0.7645600
Rape    -0.5393836 -0.8179779 -0.1999436
> summary(prcomp(USArrests, scale = TRUE))
Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.5749 0.9949 0.59713 0.41645
Proportion of Variance 0.6201 0.2474 0.08914 0.04336
Cumulative Proportion  0.6201 0.8675 0.95664 1.00000
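The "Importance of components" rows can be reproduced from the standard deviations alone. A minimal sketch follows; the 85% cut-off used to pick a number of components is an assumption made here for illustration, not a rule from the slides.

```r
pc <- prcomp(USArrests, scale. = TRUE)

# Proportion of variance: squared standard deviation over the total
prop <- pc$sdev^2 / sum(pc$sdev^2)
cum  <- cumsum(prop)
print(round(rbind(prop, cum), 4))

# Smallest number of components reaching a (hypothetical) 85% threshold
k <- which(cum >= 0.85)[1]
print(k)
```

With the values shown on the slide, the first two components already pass that threshold (cumulative proportion 0.8675).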

7 biplot

8 Line plots (lab 6): prcomp (top) and metaPCA (bottom)
Looking for convergence as iteration increases
Panels: Eigen, Angle, RobustAngle, SparseAngle
http://cran.r-project.org/web/packages/MetaPCA/MetaPCA.pdf

9 prostate data (lab 7): 2D plot.

10 Lab 9
library(dr)
data(ais)
# default fitting method is "sir"
s0 <- dr(LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) + log(WCC) + log(RCC) +
         log(Hc) + log(Ferr), data = ais)
# Refit, using a different function for slicing to agree with arc.
summary(s1 <- update(s0, slice.function = dr.slices.arc))
# Refit again, using save, with 10 slices; the default is max(8, ncol + 3)
summary(s2 <- update(s1, nslices = 10, method = "save"))
# Refit, using phdres; output is similar for phdy, but tests are not justifiable.
summary(s3 <- update(s1, method = "phdres"))
# fit using ire:
summary(s4 <- update(s1, method = "ire"))
# fit using Sex as a grouping variable.
s5 <- update(s4, group = ~Sex)

11 > s0
dr(formula = LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) + log(WCC) + log(RCC) + log(Hc) + log(Ferr), data = ais)
Estimated Basis Vectors for Central Subspace:
                  Dir1          Dir2        Dir3         Dir4
log(SSF)   0.150963358 -0.0501785457  0.10898336 -0.002210206
log(Wt)   -0.916480522 -0.1942298625 -0.20123696 -0.089722026
log(Hg)   -0.131538894  0.6854750758  0.71997546 -0.663097774
log(Ht)   -0.093358860 -0.0433408964  0.46445398  0.290838658
log(WCC)   0.004467838  0.0001833808  0.04497590  0.071904557
log(RCC)  -0.188973540  0.3475652934  0.29496908  0.037056363
log(Hc)    0.274758965 -0.6058301419 -0.34196615  0.678877114
log(Ferr) -0.005631238  0.0130588502 -0.08702709  0.015547214
Eigenvalues:
[1] 0.95766163 0.24504161 0.10707594 0.09041305
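To make the "basis vectors" and "eigenvalues" in this output less of a black box, here is a hand-rolled sketch of the general sliced inverse regression (SIR) idea in base R. It is not the dr package's implementation: the data are simulated, and every name in it (`X`, `beta`, `H`, `M`, `dirs`) is an assumption chosen for illustration.

```r
set.seed(1)
n <- 500; p <- 4
X    <- matrix(rnorm(n * p), n, p)
beta <- c(1, 1, 0, 0) / sqrt(2)               # true direction (assumed for the demo)
y    <- as.vector(X %*% beta) + 0.1 * rnorm(n)

# 1. Standardize the predictors: Z = (X - mean) %*% Sigma^(-1/2)
Sigma <- cov(X)
es    <- eigen(Sigma, symmetric = TRUE)
Sigma_inv_sqrt <- es$vectors %*% diag(1 / sqrt(es$values)) %*% t(es$vectors)
Z <- sweep(X, 2, colMeans(X)) %*% Sigma_inv_sqrt

# 2. Cut the response into H slices of roughly equal size
H     <- 10
slice <- cut(rank(y, ties.method = "first"), breaks = H, labels = FALSE)

# 3. Slice means of Z and their weighted covariance matrix M
counts <- as.vector(table(slice))
means  <- rowsum(Z, slice) / counts           # one row of means per slice
M      <- t(means) %*% (means * (counts / n))

# 4. Eigen-decompose M; map eigenvectors back to the original predictor scale
eM   <- eigen(M, symmetric = TRUE)
dirs <- Sigma_inv_sqrt %*% eM$vectors

print(round(eM$values, 3))                    # one dominant eigenvalue is expected
print(round(dirs[, 1] / sqrt(sum(dirs[, 1]^2)), 3))
```

In this simulated setting only one direction carries information about y, so a single large eigenvalue followed by small ones is the expected pattern, analogous to the drop from 0.96 to 0.25 in the ais output above.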

12 > summary(s1 <- update(s0,slice.function=dr.slices.arc))
Call:
dr(formula = LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) + log(WCC) + log(RCC) + log(Hc) + log(Ferr), data = ais, slice.function = dr.slices.arc)
Method:
sir with 11 slices, n = 202.
Slice Sizes:
19 19 19 19 19 19 19 18 18 18 15
Estimated Basis Vectors for Central Subspace:
               Dir1       Dir2     Dir3      Dir4
log(SSF)   0.143177 -0.0476079 -0.02815  0.003785
log(Wt)   -0.879504 -0.1425841  0.23303 -0.094970
log(Hg)   -0.195963  0.6318503  0.24483 -0.509424
log(Ht)   -0.058923 -0.1100757 -0.87893  0.217803
log(WCC)  -0.007276 -0.0029772 -0.05309  0.043056
log(RCC)  -0.167736  0.3924936 -0.19711 -0.213689
log(Hc)    0.368652 -0.6418658 -0.26373  0.796849
log(Ferr) -0.002697  0.0002593  0.03492  0.039116
              Dir1   Dir2    Dir3    Dir4
Eigenvalues 0.9572 0.2275 0.09368 0.07319
R^2(OLS|dr) 0.9980 0.9981 0.99839 0.99864
Large-sample Marginal Dimension Tests:
              Stat df p.value
0D vs >= 1D 284.78 80 0.00000
1D vs >= 2D  91.43 63 0.01113
2D vs >= 3D  45.48 48 0.57690
3D vs >= 4D  26.55 35 0.84694

13 > summary(s2<-update(s1,nslices=10,method="save"))
Call:
dr(formula = LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) + log(WCC) + log(RCC) + log(Hc) + log(Ferr), data = ais, slice.function = dr.slices.arc, nslices = 10, method = "save")
Method:
save with 10 slices, n = 202.
Slice Sizes:
21 21 20 20 20 25 24 22 20 9
Estimated Basis Vectors for Central Subspace:
               Dir1     Dir2     Dir3     Dir4
log(SSF)   0.127709 -0.00907  0.01018 -0.06144
log(Wt)   -0.905004 -0.07107 -0.15734  0.25774
log(Hg)   -0.056187  0.50674 -0.34064 -0.38087
log(Ht)    0.399868  0.36613  0.68439 -0.54216
log(WCC)   0.032608  0.02733  0.02277  0.03474
log(RCC)  -0.008463  0.15137 -0.24136 -0.47219
log(Hc)   -0.021630 -0.76164  0.57591  0.51526
log(Ferr)  0.002116 -0.01670  0.01631 -0.03360
              Dir1   Dir2   Dir3   Dir4
Eigenvalues 0.9389 0.6611 0.5129 0.4653
R^2(OLS|dr) 0.9936 0.9950 0.9985 0.9989
Large-sample Marginal Dimension Tests:
             Stat df(Nor) p.value(Nor) p.value(Gen)
0D vs >= 1D 378.3     324      0.02012       0.1071
1D vs >= 2D 279.6     252      0.11214       0.3116
2D vs >= 3D 179.9     189      0.67101       0.5160
3D vs >= 4D 134.3     135      0.50176       0.2786

14 S0 v. S2

15 S3 and S4

16 Infrastructure tools
In RStudio
– Install the rmongodb package
– http://cran.r-project.org/web/packages/rmongodb/vignettes/rmongodb_cheat_sheet.pdf
– http://cran.r-project.org/web/packages/rmongodb/vignettes/rmongodb_introduction.html
MongoDB - http://www.mongodb.org/
– http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis - get familiar with the choices
General idea:
– These are “backend” stores that can do various “things”

17 Back-ends
Files (e.g. csv), application files (e.g. Rdata, xls, mat, …): essentially for reading/input
Databases: for reading and writing
– Also for advanced operations inside the database!!
– Operations range from simple summaries to array operations and analytics functions
– Overhead is opening/maintaining/closing connections: easy on your laptop, harder when they are remote (network, authentication, etc.)
– Overhead is also around their internal storage formats (e.g. BSON for MongoDB)
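The file back-end case can be illustrated with base R alone. This is a small sketch using temporary files and the built-in USArrests data; the file names are placeholders. The database case is not shown because it needs a running server and an extra package.

```r
# Plain-text file back-end: write a CSV, then read it back
tmp_csv <- tempfile(fileext = ".csv")
write.csv(USArrests, tmp_csv)
from_csv <- read.csv(tmp_csv, row.names = 1)

# Application-file back-end: R's own serialized format
tmp_rds <- tempfile(fileext = ".rds")
saveRDS(USArrests, tmp_rds)
from_rds <- readRDS(tmp_rds)

# Both round trips should give back the same table
print(all.equal(from_csv, USArrests))
print(identical(from_rds, USArrests))

# The summary is computed in R after loading; a database back-end could
# compute it before any data reaches R
print(colMeans(from_csv))
```

Note that the whole table is loaded before anything is computed, which is the contrast with databases drawn on this slide.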

18 Functions versus languages
Libraries for R mean that you code in R and call functions, and the result returns into R
– Whatever the function does (i.e. how it is implemented) is what you get (subject to setting parameters)
Languages (like Pig) provide more direct access to efficiently using the underlying capabilities of the application engine/database
– Cost is learning this new language

19 Example layout - Hadoop

20 Relating Open-Source and Commercial

21 Even further
http://projects.apache.org/indexes/category.html#database
– Hadoop (MapReduce): distributed execution (via disk when data is large)
– Pig (http://wiki.apache.org/pig/RunPig)
– HIVE (http://hive.apache.org/releases.html)
– Spark: in memory (RSpark still not easy to find/install) http://gigaom.com/2014/02/27/as-mapreduce-fades-apache-spark-is-now-a-top-level-project/

22 ~ Objectives
Provide an application view of data analytics, i.e. predictive/prescriptive models, by focusing on the “front-end” (RStudio)
Over a variety of data…
Provide enough of a view of the back-ends to know how you will need to interface to them (both open-source and commercial)

23 Layers across the Analytics Stack

24 Time for assignments

