1 Peter Fox, Data Analytics – ITWS-4600/ITWS-6600, Week 3b, February 12, 2016: Lab exercises / Assignment 2


2 Labs
Regression: new multivariate dataset
kNN: new Abalone dataset
Kmeans: (sort of) new Iris dataset
One each for Assignment 2
And then general exercises

3 The Dataset(s)
http://aquarius.tw.rpi.edu/html/DA
Some new ones: dataset_multipleRegression.csv, abalone.csv
Code fragments (i.e. they will not run as-is) are on the following slides, as Lab3b_knn1_2016.R, etc.

4 Remember a few useful commands
head( )
tail( )
summary( )
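As a quick reminder of what these commands return, here is a minimal sketch. It uses R's built-in iris data as a stand-in (an assumption made so the example is self-contained); the same calls apply to any data frame loaded for the labs.

```r
# Built-in iris data used as a stand-in for the lab datasets.
data("iris")

first_rows <- head(iris)         # first 6 rows by default
last_rows  <- tail(iris, n = 3)  # last 3 rows

print(first_rows)
print(last_rows)
print(summary(iris))  # quartiles for numeric columns, counts for factors
```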

5 How does this work?
The following slides have 3 lab assignments for you to complete.
These should be completed individually.
Once you complete one (or all), please raise your hand or approach me or Rahul to review what you obtained (together these = 10% of your grade).
There is nothing to hand in.
If you do not complete part or all today, that is okay, but you will need to schedule a time to show your results.

6 Refer to Tuesday's slides and the script fragments on the website.

7 Regression (1)
Retrieve this dataset: dataset_multipleRegression.csv
Using the unemployment rate (UNEM) and the number of spring high school graduates (HGRAD), predict the fall enrollment (ROLL) for this year, given that UNEM = 9% and HGRAD = 100,000.
Repeat and add per capita income (INC) to the model. Predict ROLL if INC = $30,000.
Summarize and compare the two models. Comment on significance.
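The workflow in this exercise (fit a two-predictor model, add a third predictor, predict at given values, compare the fits) might look roughly like the sketch below. It deliberately uses R's built-in mtcars data as a stand-in, not the enrollment dataset, so the variable names (mpg, wt, hp, qsec) and the values in new_car are illustrative assumptions only.

```r
# Stand-in data: built-in mtcars (NOT dataset_multipleRegression.csv).
data("mtcars")

# Model with two predictors, then the same model plus a third predictor.
fit2 <- lm(mpg ~ wt + hp, data = mtcars)
fit3 <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Hypothetical new observation; column names must match the predictors.
new_car <- data.frame(wt = 3.0, hp = 120, qsec = 18)
pred2 <- predict(fit2, newdata = new_car)
pred3 <- predict(fit3, newdata = new_car)
print(c(two_predictors = unname(pred2), three_predictors = unname(pred3)))

# Coefficient tables include the t-tests used to comment on significance.
print(summary(fit2)$coefficients)
print(summary(fit3)$coefficients)

# Adjusted R-squared and an F-test for the nested models.
print(c(adj_r2_fit2 = summary(fit2)$adj.r.squared,
        adj_r2_fit3 = summary(fit3)$adj.r.squared))
print(anova(fit2, fit3))
```

For the lab itself, the formula would use ROLL, UNEM, HGRAD and INC, and the new data frame would hold the values given on the slide.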

8 Classification (2)
Retrieve the abalone.csv dataset.
Predicting the age of abalone from physical measurements: the age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope, a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age.
Perform knn classification to get predictors for Age (Rings). Interpretation not required.
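A minimal sketch of a kNN classification run follows, assuming the knn() function from the class package. It uses the built-in iris data as a stand-in for abalone.csv; the 90/10 split, the seed, k = 5 and the scaling step are illustrative choices, not requirements from the slide.

```r
library(class)  # provides knn()

# Stand-in data: built-in iris (NOT abalone.csv).
data("iris")
set.seed(42)  # arbitrary seed so the split is reproducible

n <- nrow(iris)
train_idx <- sample(seq_len(n), size = round(0.9 * n))
test_idx  <- setdiff(seq_len(n), train_idx)

# kNN is distance-based, so put the features on a common scale,
# using the training set's mean and standard deviation for both sets.
train <- scale(iris[train_idx, 1:4])
test  <- scale(iris[test_idx, 1:4],
               center = attr(train, "scaled:center"),
               scale  = attr(train, "scaled:scale"))

train_labels <- iris$Species[train_idx]
true_labels  <- iris$Species[test_idx]

predicted <- knn(train, test, cl = train_labels, k = 5)

# Rows: predicted class; columns: true class.
confusion <- table(predicted, true_labels)
print(confusion)
accuracy <- mean(predicted == true_labels)
print(accuracy)
```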

9 Clustering (3)
The Iris dataset (in R use data("iris") to load it)
The 5th column is the species, and you want to find how many clusters there are without using that information.
Create a new data frame and remove the fifth column.
Apply kmeans (you choose k) with 1000 iterations.
Use table(iris[,5], ) to assess results.
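The steps above can be sketched as follows. The choice of k = 3 and the seed are assumptions made for illustration (the slide leaves k up to you), and passing the cluster assignments as the second argument of table() is one possible way to complete that call.

```r
data("iris")
set.seed(123)  # kmeans starts from random centers; fix the seed to reproduce

# New data frame without the species column.
iris_features <- iris[, -5]

# k = 3 is an illustrative choice; iter.max caps the number of iterations.
fit <- kmeans(iris_features, centers = 3, iter.max = 1000)

# Cross-tabulate true species (rows) against cluster labels (columns).
comparison <- table(iris[, 5], fit$cluster)
print(comparison)
```

Cluster labels are arbitrary integers, so a good clustering shows up as each species concentrated in one column, not as a match along the diagonal.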

10 Regression Exercises
Using the EPI dataset, find the single most important factor in increasing the EPI in a given region.
Examine distributions down to the leaf nodes and build up an EPI "model".

11 boxplot(ENVHEALTH,ECOSYSTEM)

12 qqplot(ENVHEALTH,ECOSYSTEM)

13 ENVHEALTH / ECOSYSTEM
> shapiro.test(ENVHEALTH)
Shapiro-Wilk normality test
data: ENVHEALTH
W = 0.9161, p-value = 1.083e-08
Reject.
> shapiro.test(ECOSYSTEM)
Shapiro-Wilk normality test
data: ECOSYSTEM
W = 0.9813, p-value = 0.02654
~reject (rejected at the 0.05 level but not at 0.01)
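The decision rule behind "reject" can be made explicit in code. The sketch below uses the built-in faithful data as a stand-in for the EPI variables, and the 0.05 threshold is an assumed convention; the null hypothesis of the Shapiro-Wilk test is that the sample comes from a normal distribution.

```r
# Stand-in data: eruption durations from the built-in faithful dataset.
data("faithful")

result <- shapiro.test(faithful$eruptions)
print(result)

# Assumed significance level; reject normality when p falls below it.
alpha <- 0.05
reject_normality <- result$p.value < alpha
print(reject_normality)
```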

14 Kolmogorov-Smirnov (KS) test
> ks.test(ENVHEALTH,ECOSYSTEM)
Two-sample Kolmogorov-Smirnov test
data: ENVHEALTH and ECOSYSTEM
D = 0.2965, p-value = 5.413e-07
alternative hypothesis: two-sided
Warning message:
In ks.test(ENVHEALTH, ECOSYSTEM) : p-value will be approximate in the presence of ties
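A small self-contained sketch of the two-sample KS test is shown below, again on stand-in data: sepal lengths of two iris species, chosen only because they ship with R. D is the largest vertical distance between the two empirical distribution functions, and a small p-value suggests the samples come from different distributions.

```r
data("iris")
setosa    <- iris$Sepal.Length[iris$Species == "setosa"]
virginica <- iris$Sepal.Length[iris$Species == "virginica"]

# Tied values can trigger a warning about the p-value; it is suppressed
# here only to keep the output short.
ks <- suppressWarnings(ks.test(setosa, virginica))
print(ks)

d_statistic <- unname(ks$statistic)  # the D value
print(d_statistic)
```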

15 Linear and least-squares
> EPI_data <- read.csv("EPI_data.csv")
> attach(EPI_data)
> boxplot(ENVHEALTH,DALY,AIR_H,WATER_H)
> lmENVH <- lm(ENVHEALTH~DALY+AIR_H+WATER_H)
> lmENVH
… (what should you get?)
> summary(lmENVH)
…
> cENVH <- coef(lmENVH)

16 Predict
> DALYNEW <- c(seq(5,95,5))
> AIR_HNEW <- c(seq(5,95,5))
> WATER_HNEW <- c(seq(5,95,5))
> # column names must match the predictors used in lmENVH
> NEW <- data.frame(DALY=DALYNEW, AIR_H=AIR_HNEW, WATER_H=WATER_HNEW)
> pENV <- predict(lmENVH,NEW,interval="prediction")
> cENV <- predict(lmENVH,NEW,interval="confidence")
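The difference between the two interval types can be seen in a short sketch. It uses the built-in mtcars data as a stand-in for the EPI data, so the model and the new values are illustrative assumptions. Note that predict() looks up predictors in the new data frame by name, so its columns must be named like the variables in the model formula.

```r
# Stand-in data: built-in mtcars (NOT the EPI data).
data("mtcars")
fit <- lm(mpg ~ wt + hp, data = mtcars)

# New data: columns named exactly like the predictors in the formula.
new_data <- data.frame(wt = seq(2, 5, by = 0.5), hp = 150)

# "confidence": uncertainty in the fitted mean response.
# "prediction": uncertainty for a single new observation (adds residual noise).
pred_int <- predict(fit, new_data, interval = "prediction")
conf_int <- predict(fit, new_data, interval = "confidence")

pred_width <- pred_int[, "upr"] - pred_int[, "lwr"]
conf_width <- conf_int[, "upr"] - conf_int[, "lwr"]
print(cbind(new_data, pred_width, conf_width))
```

Both calls return a matrix with columns fit, lwr and upr; the prediction interval is the wider of the two at every point.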

17 Repeat for AIR_E and CLIMATE

18 Classification Exercises (Lab3b_knn1_2016.R)
> library(class) # provides knn()
> nyt1 <- read.csv("nyt1.csv")
> nyt1 <- nyt1[which(nyt1$Impressions>0 & nyt1$Clicks>0 & nyt1$Age>0),]
> nnyt1 <- dim(nyt1)[1] # shrink it down!
> sampling.rate=0.9
> num.test.set.labels=nnyt1*(1.-sampling.rate)
> training <- sample(1:nnyt1,sampling.rate*nnyt1, replace=FALSE)
> train <- subset(nyt1[training,],select=c(Age,Impressions))
> testing <- setdiff(1:nnyt1,training)
> test <- subset(nyt1[testing,],select=c(Age,Impressions))
> cg <- nyt1$Gender[training]
> true.labels <- nyt1$Gender[testing]
> classif <- knn(train,test,cg,k=5)
> classif
> attributes(.Last.value)
# interpretation to come!

19 Classification Exercises (Lab3b_knn2_2016.R)
2 examples in the script

20 Clustering Exercises
Lab3b_kmeans1_2016.R
Lab3b_kmeans2_2016.R: plotting up results from the iris clustering

