Group 1 Lab 2 exercises and Assignment 2

Presentation transcript:

Group 1 Lab 2 exercises and Assignment 2
Peter Fox
Data Analytics ITWS-4600/ITWS-6600/MATP-4450/CSCI-4960
Group 1, Lab 2, September 14, 2018

Labs
2a. Regression: new multivariate dataset
2b. kNN: new Abalone dataset
2c. Kmeans: (sort of) new Iris dataset
Do all three for Assignment 2
And then general exercises

The Dataset(s)
http://aquarius.tw.rpi.edu/html/DA
See slides: DataAnalytics2018Fall_Assignment_2.pptx on LMS or https://tw.rpi.edu/web/Courses/DataAnalytics/2018Fall (under week 3)
Code fragments (i.e. they will not run as-is) appear on the following slides as group1/lab2_knn1.R, etc.

Remember a few useful commands
head(<object>)
tail(<object>)
summary(<object>)
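A minimal sketch of these three commands, using R's built-in iris data frame as a stand-in for <object> (the choice of iris and the variable names are assumptions for illustration, not part of the lab scripts):

```r
# The built-in iris data frame stands in for any <object> (illustrative assumption).
data(iris)

first_rows  <- head(iris)         # first 6 rows by default
last_rows   <- tail(iris, n = 3)  # last 3 rows
col_summary <- summary(iris)      # per-column summaries (quartiles, factor counts)

print(first_rows)
print(last_rows)
print(col_summary)
```

Running these on a freshly loaded dataset is a quick check that columns were read with the expected types and ranges.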

Regression Exercises
Using the EPI dataset (under /EPI on the web), find the single most important factor in increasing the EPI in a given region.
Examine distributions down to the leaf nodes and build up an EPI “model”.

boxplot(ENVHEALTH,ECOSYSTEM)

qqplot(ENVHEALTH,ECOSYSTEM)

ENVHEALTH / ECOSYSTEM
> shapiro.test(ENVHEALTH)
Shapiro-Wilk normality test
data: ENVHEALTH
W = 0.9161, p-value = 1.083e-08
------- Reject.
> shapiro.test(ECOSYSTEM)
data: ECOSYSTEM
W = 0.9813, p-value = 0.02654
----- ~reject
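The reject / ~reject reading of the Shapiro-Wilk output can be sketched on synthetic data. The two vectors below are simulated stand-ins (assumptions), not the EPI columns, and the 0.05 threshold is an assumed convention:

```r
# Synthetic stand-ins for two variables (NOT the EPI data; illustrative assumption).
set.seed(42)
normal_like <- rnorm(200, mean = 50, sd = 10)  # drawn from a normal distribution
skewed      <- rexp(200, rate = 1 / 50)        # strongly right-skewed

sw_normal <- shapiro.test(normal_like)
sw_skewed <- shapiro.test(skewed)

alpha <- 0.05  # assumed significance level
cat("normal-like: W =", sw_normal$statistic, " p =", sw_normal$p.value, "\n")
cat("skewed:      W =", sw_skewed$statistic, " p =", sw_skewed$p.value, "\n")

# TRUE means the null hypothesis of normality is rejected at level alpha.
reject_skewed <- sw_skewed$p.value < alpha
cat("reject normality for skewed sample:", reject_skewed, "\n")
```

A p-value just under the threshold, like the 0.02654 above, is a weaker rejection than one many orders of magnitude below it.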

Kolmogorov-Smirnov (KS) test
> ks.test(ENVHEALTH,ECOSYSTEM)
Two-sample Kolmogorov-Smirnov test
data: ENVHEALTH and ECOSYSTEM
D = 0.2965, p-value = 5.413e-07
alternative hypothesis: two-sided
Warning message:
In ks.test(ENVHEALTH, ECOSYSTEM) : p-value will be approximate in the presence of ties
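A small sketch of the two-sample KS test on simulated data, assuming two continuous samples with different means and spreads (the vectors a and b are hypothetical, not EPI columns). Continuous simulated values have no ties, so the warning shown above is not expected here:

```r
# Two simulated samples with different location and spread (illustrative assumption).
set.seed(1)
a <- rnorm(150, mean = 60, sd = 20)
b <- rnorm(150, mean = 45, sd = 10)

ks <- ks.test(a, b)  # two-sided by default

# D is the largest vertical gap between the two empirical CDFs.
cat("D =", ks$statistic, " p-value =", ks$p.value, "\n")
```

A small p-value says the two samples are unlikely to come from the same distribution. It does not say which sample, if either, is normal.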

Linear and least-squares
> EPI_data <- read.csv("EPI_data.csv")
> attach(EPI_data)
> boxplot(ENVHEALTH,DALY,AIR_H,WATER_H)
> lmENVH<-lm(ENVHEALTH~DALY+AIR_H+WATER_H)
> lmENVH
… (what should you get?)
> summary(lmENVH)
…
> cENVH<-coef(lmENVH)
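Because EPI_data.csv is not bundled here, the same fitting steps can be tried on a simulated data frame. Everything about the data below (the sample size, the coefficients 0.5/0.25/0.25, the noise level) is an assumption made up for illustration; only the column names mirror the slide:

```r
# Simulated stand-in for the EPI columns (values are made up, NOT real EPI data).
set.seed(7)
n <- 100
epi_like <- data.frame(
  DALY    = runif(n, 0, 100),
  AIR_H   = runif(n, 0, 100),
  WATER_H = runif(n, 0, 100)
)
# Assumed "true" relationship, used only to generate a response.
epi_like$ENVHEALTH <- 0.5 * epi_like$DALY + 0.25 * epi_like$AIR_H +
  0.25 * epi_like$WATER_H + rnorm(n, sd = 2)

# Passing data= avoids attach() and keeps variable lookup explicit.
fit <- lm(ENVHEALTH ~ DALY + AIR_H + WATER_H, data = epi_like)
print(summary(fit))  # coefficients, standard errors, R-squared

coefs <- coef(fit)   # named vector: intercept plus one slope per predictor
print(coefs)
```

With these simulated data, the fitted slopes should land close to the coefficients used to generate the response, which is a handy sanity check before moving to the real file.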

Predict
> DALYNEW<-c(seq(5,95,5))
> AIR_HNEW<-c(seq(5,95,5))
> WATER_HNEW<-c(seq(5,95,5))
> # column names must match the model's predictors for predict() to use NEW
> NEW<-data.frame(DALY=DALYNEW,AIR_H=AIR_HNEW,WATER_H=WATER_HNEW)
> pENV<- predict(lmENVH,NEW,interval="prediction")
> cENV<- predict(lmENVH,NEW,interval="confidence")
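The difference between the two interval types can be seen on a simulated fit. As before, the data frame and its generating coefficients are assumptions for illustration, not the EPI data:

```r
# Simulated stand-in data and fit (values are made up, NOT real EPI data).
set.seed(7)
n <- 100
epi_like <- data.frame(
  DALY    = runif(n, 0, 100),
  AIR_H   = runif(n, 0, 100),
  WATER_H = runif(n, 0, 100)
)
epi_like$ENVHEALTH <- 0.5 * epi_like$DALY + 0.25 * epi_like$AIR_H +
  0.25 * epi_like$WATER_H + rnorm(n, sd = 2)
fit <- lm(ENVHEALTH ~ DALY + AIR_H + WATER_H, data = epi_like)

# New predictor values; column names match the model's predictors.
new_data <- data.frame(
  DALY    = seq(5, 95, 5),
  AIR_H   = seq(5, 95, 5),
  WATER_H = seq(5, 95, 5)
)

p_pred <- predict(fit, new_data, interval = "prediction")  # range for a new observation
p_conf <- predict(fit, new_data, interval = "confidence")  # range for the mean response

# Both matrices have columns fit, lwr, upr; compare interval widths row by row.
width_pred <- p_pred[, "upr"] - p_pred[, "lwr"]
width_conf <- p_conf[, "upr"] - p_conf[, "lwr"]
print(head(cbind(width_pred, width_conf)))
```

The prediction interval is wider because it includes the residual scatter of individual observations on top of the uncertainty in the fitted mean.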

Repeat for AIR_E and CLIMATE

Classification Exercises (group1/lab2_knn1.R)
> library(class) # provides knn()
> nyt1<-read.csv("nyt1.csv")
> nyt1<-nyt1[which(nyt1$Impressions>0 & nyt1$Clicks>0 & nyt1$Age>0),]
> nnyt1<-dim(nyt1)[1] # shrink it down!
> sampling.rate=0.9
> num.test.set.labels=nnyt1*(1.-sampling.rate)
> training <-sample(1:nnyt1,sampling.rate*nnyt1, replace=FALSE)
> train<-subset(nyt1[training,],select=c(Age,Impressions))
> testing<-setdiff(1:nnyt1,training)
> test<-subset(nyt1[testing,],select=c(Age,Impressions))
> cg<-nyt1$Gender[training]
> true.labels<-nyt1$Gender[testing]
> classif<-knn(train,test,cg,k=5)
> classif
> attributes(.Last.value) # interpretation to come!
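The same train/test split and kNN call can be tried without nyt1.csv on a simulated data frame. The toy data below are an assumption for illustration: the column names mirror the slide, but the values, including the link between Age and Gender, are invented. knn() comes from the class package, which ships with standard R installations:

```r
library(class)  # provides knn()

# Toy stand-in for nyt1 (values are invented, NOT the NYT data).
set.seed(123)
n <- 200
toy <- data.frame(
  Age         = c(rnorm(n / 2, mean = 30, sd = 5), rnorm(n / 2, mean = 55, sd = 5)),
  Impressions = rpois(n, lambda = 5),
  Gender      = factor(rep(c(0, 1), each = n / 2))
)

# 90/10 split into training and test rows, without replacement.
sampling_rate <- 0.9
train_idx <- sample(seq_len(n), size = floor(sampling_rate * n))
test_idx  <- setdiff(seq_len(n), train_idx)

train <- toy[train_idx, c("Age", "Impressions")]
test  <- toy[test_idx,  c("Age", "Impressions")]

# Each test row gets the majority label among its 5 nearest training rows.
pred  <- knn(train, test, cl = toy$Gender[train_idx], k = 5)
truth <- toy$Gender[test_idx]

confusion <- table(predicted = pred, actual = truth)
accuracy  <- mean(pred == truth)
print(confusion)
cat("accuracy:", accuracy, "\n")
```

Comparing the predictions against the held-out true labels in a confusion table is one way to start interpreting the classifier's output.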

Classification Exercises (group1/lab2_knn2.R)
2 examples in the script

Clustering Exercises
group1/lab2_kmeans1.R
group1/lab2_kmeans2.R: plotting up results from the iris clustering
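A minimal k-means sketch on the built-in iris data, offered as an assumption about what such an exercise might look like rather than a copy of the lab scripts. The choices of 3 centers, nstart = 20, and the seed are illustrative:

```r
# k-means on the four numeric iris columns (parameter choices are illustrative).
data(iris)
set.seed(101)

features <- iris[, 1:4]                        # drop the Species label
km <- kmeans(features, centers = 3, nstart = 20)

# Cross-tabulate cluster assignments against the known species.
cluster_vs_species <- table(cluster = km$cluster, species = iris$Species)
print(cluster_vs_species)
print(km$centers)                              # one row of feature means per cluster
```

Cluster numbers are arbitrary labels, so the table is read by looking at which species dominates each cluster rather than by matching indices.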