1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 14b, May 2, 2014 PCA and return to Big Data infrastructure…. and assignment time.



Visual approaches for PCA/DR

Scree plot: a plot, in descending order of magnitude, of the eigenvalues of a correlation matrix. In the context of factor analysis or principal components analysis, a scree plot helps the analyst visualize the relative importance of the factors; a sharp drop in the plot signals that subsequent factors can be ignored.
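A minimal base-R sketch of producing a scree plot with prcomp, using the same USArrests data as the later slides (only base R, datasets, and graphics are assumed):

```r
# PCA on the correlation matrix (scale. = TRUE), as the slides advise.
pc   <- prcomp(USArrests, scale. = TRUE)
eig  <- pc$sdev^2        # eigenvalues of the correlation matrix
prop <- eig / sum(eig)   # proportion of variance per component
cumsum(prop)             # cumulative proportion of variance explained
# The scree plot itself; look for the sharp drop ("elbow"):
screeplot(pc, type = "lines", main = "USArrests scree plot")
```

Because the input is a 4-variable correlation matrix, the four eigenvalues sum to 4 and the proportions sum to 1.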

require(graphics)
## The variances of the variables in the USArrests data vary by
## orders of magnitude, so scaling is appropriate.
prcomp(USArrests)                # inappropriate: unscaled
prcomp(USArrests, scale = TRUE)
prcomp(~ Murder + Assault + Rape, data = USArrests, scale = TRUE)
plot(prcomp(USArrests))
summary(prcomp(USArrests, scale = TRUE))
biplot(prcomp(USArrests, scale = TRUE))
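A quick check of why the unscaled fit is "inappropriate", using only base R and the same USArrests data:

```r
# The raw variances differ by orders of magnitude; Assault dominates.
vars <- apply(USArrests, 2, var)
round(vars)
# Consequence: unscaled PC1 is almost entirely Assault, while the
# scaled fit spreads the variance across components.
p1_raw    <- prcomp(USArrests)$sdev[1]^2 / sum(prcomp(USArrests)$sdev^2)
p1_scaled <- prcomp(USArrests, scale. = TRUE)$sdev[1]^2 / 4
```

The unscaled first component captures well over 90% of the (raw) variance simply because Assault has the largest units, which is exactly what scaling corrects.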

prcomp

> prcomp(USArrests)   # inappropriate
> prcomp(USArrests, scale = TRUE)
[Output lost in transcription: each call prints four standard
deviations and the 4x4 rotation matrix of PC1-PC4 loadings for
Murder, Assault, UrbanPop, Rape.]

Scree plot of prcomp(USArrests) (figure)

> prcomp(~ Murder + Assault + Rape, data = USArrests, scale = TRUE)
> summary(prcomp(USArrests, scale = TRUE))
[Output lost in transcription: the three-variable fit prints standard
deviations and the PC1-PC3 rotation for Murder, Assault, Rape; the
summary prints the standard deviation, proportion of variance, and
cumulative proportion for PC1-PC4.]

Biplot of prcomp(USArrests, scale = TRUE) (figure)

Line plots (lab 6): prcomp (top) and metaPCA (bottom), looking for convergence as iteration increases. Panels: EigenAngle, RobustAngle, SparseAngle. (figure)

Prostate data (lab 7): 2D plot. (figure)

Lab 9

library(dr)
data(ais)
# The default fitting method is "sir".
s0 <- dr(LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) + log(WCC) +
           log(RCC) + log(Hc) + log(Ferr), data = ais)
# Refit, using a different function for slicing, to agree with arc.
summary(s1 <- update(s0, slice.function = dr.slices.arc))
# Refit again, using save, with 10 slices; the default is max(8, ncol+3).
summary(s2 <- update(s1, nslices = 10, method = "save"))
# Refit using phdres. Tests are different for phd; output is similar
# for phdy, but the tests are not justifiable.
summary(s3 <- update(s1, method = "phdres"))
# Fit using ire:
summary(s4 <- update(s1, method = "ire"))
# Fit using Sex as a grouping variable.
s5 <- update(s4, group = ~Sex)

> s0
dr(formula = LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) +
    log(WCC) + log(RCC) + log(Hc) + log(Ferr), data = ais)
[Output lost in transcription: estimated basis vectors Dir1-Dir4
for the central subspace over the eight log-predictors, followed
by the eigenvalues.]

> summary(s1 <- update(s0, slice.function = dr.slices.arc))
Call:
dr(formula = LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) +
    log(WCC) + log(RCC) + log(Hc) + log(Ferr), data = ais,
    slice.function = dr.slices.arc)
Method: sir with 11 slices, n = 202.
[Output lost in transcription: slice sizes; estimated basis vectors
Dir1-Dir4; eigenvalues and R^2(OLS|dr); large-sample marginal
dimension tests (0D vs >= 1D through 3D vs >= 4D) with statistic,
df, and p-value.]

> summary(s2 <- update(s1, nslices = 10, method = "save"))
Call:
dr(formula = LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) +
    log(WCC) + log(RCC) + log(Hc) + log(Ferr), data = ais,
    slice.function = dr.slices.arc, nslices = 10, method = "save")
Method: save with 10 slices, n = 202.
[Output lost in transcription: slice sizes; estimated basis vectors
Dir1-Dir4; eigenvalues and R^2(OLS|dr); large-sample marginal
dimension tests with normal and general p-values.]

s0 vs. s2 (figure)

s3 and s4 (figure)

Infrastructure tools

In RStudio:
–Install the rmongodb package
–http://cran.r-project.org/web/packages/rmongodb/vignettes/rmongodb_cheat_sheet.pdf
–http://cran.r-project.org/web/packages/rmongodb/vignettes/rmongodb_introduction.html
MongoDB –
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis – get familiar with the choices
General idea:
–These are “backend” stores that can do various “things”

Back-ends

Files (e.g. csv) and application files (e.g. Rdata, xls, mat, …) – essentially for reading/input
Databases – for reading and writing
–Also for advanced operations inside the database!
–Operations range from simple summaries to array operations and analytics functions
–One overhead is opening/maintaining/closing connections – easy on your laptop, harder when they are remote (network, authentication, etc.)
–Another overhead is their internal storage formats (e.g. BSON for MongoDB)
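As a minimal sketch of the simplest back-end, a csv file read and written from base R (a temporary file path is used for illustration; with a file back-end there is no connection to manage, and all computation happens in R after loading):

```r
# Round-trip a data frame through a csv file back-end.
tmp <- tempfile(fileext = ".csv")
write.csv(USArrests, tmp, row.names = TRUE)
df <- read.csv(tmp, row.names = 1)
# A "simple summary" computed on the R side; a database back-end
# could instead compute this inside the store.
means <- colMeans(df)
unlink(tmp)  # cleanup: just a file, no connection to close
```

Contrast this with a remote database, where the open/query/close cycle and the store's internal format (e.g. BSON for MongoDB) add the overheads noted above.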

Functions versus languages

Libraries for R mean that you code in R, call functions, and the results return into R.
–Whatever the function does (i.e. how it is implemented) is what you get (subject to setting parameters).
Languages (like Pig) provide more direct access to efficiently using the underlying capabilities of the application engine/database.
–The cost is learning a new language.

Example layout – Hadoop (figure)

Relating Open-Source and Commercial (figure)

Even further

http://projects.apache.org/indexes/category.html#database
–Hadoop (MapReduce) – distributed execution (via disk when data is large)
–Pig
–Hive
–Spark – in memory (RSpark is still not easy to find/install)
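To make the MapReduce idea behind Hadoop concrete, here is a single-machine sketch of the map/shuffle/reduce pattern in plain R (the word list is invented for illustration; real Hadoop distributes each phase across nodes and spills intermediate results to disk):

```r
words <- c("big", "data", "big", "hadoop", "data", "big")
# Map: emit (key, value) pairs, here (word, 1).
pairs   <- lapply(words, function(w) list(key = w, value = 1))
# Shuffle: group the emitted values by key.
keys    <- vapply(pairs, function(p) p$key, character(1))
grouped <- split(vapply(pairs, function(p) p$value, numeric(1)), keys)
# Reduce: sum each group's values to get per-word counts.
counts  <- vapply(grouped, sum, numeric(1))
counts   # big = 3, data = 2, hadoop = 1
```

Pig and Hive generate jobs following this same pattern; Spark keeps the grouped intermediates in memory rather than on disk.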

~ Objectives

Provide an application (i.e. predictive/prescriptive model) view of data analytics by focusing on the “front-end” (RStudio), over a variety of data.
Provide enough of a view of the back-ends to know how you will need to interface to them (both open-source and commercial).

Layers across the Analytics Stack (figure)

Time for assignments