Group 1 Lab 2 exercises /assignment 2

Presentation transcript:

Group 1 Lab 2 exercises /assignment 2 Peter Fox and Greg Hughes Data Analytics – ITWS-4600/ITWS-6600 Group 1, Lab 2, February 2, 2017

Labs
2a. Regression: new multivariate dataset
2b. kNN: new Abalone dataset
2c. Kmeans: (sort of) new Iris dataset
I.e. 1 each for Assignment 2
And then general exercises

The Dataset(s)
http://aquarius.tw.rpi.edu/html/DA
Some new ones: dataset_multipleRegression.csv, abalone.csv
The following slides show code fragments (i.e. they will not run as-is); the scripts are available as group1/lab2_knn1.R, etc.

Remember a few useful commands: head(<object>), tail(<object>), summary(<object>)
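As a quick illustration of these three commands, here is a minimal sketch. It uses R's built-in iris data frame as a stand-in object; the variable names are illustrative only.

```r
# Stand-in object: the built-in iris data frame (any data frame or vector works similarly)
data(iris)

first_rows <- head(iris)         # first 6 rows by default
last_rows  <- tail(iris, n = 3)  # last 3 rows; n is optional
overview   <- summary(iris)      # per-column summaries: quartiles for numerics, counts for factors

print(first_rows)
print(last_rows)
print(overview)
```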

How does this work?
The following slides have 3 lab assignments for you to complete. These should be completed individually.
Once you complete one or all of them, please raise your hand or approach one of us to review what you obtained (together these are 10% of your grade).
There is nothing to hand in.
If you do not complete part or all of them in the lab session, that is okay, but you will need to schedule a time to show your results.

Refer to the lecture slides and script fragments on the website.

Regression (1)
Retrieve this dataset: dataset_multipleRegression.csv
Using the unemployment rate (UNEM) and the number of spring high school graduates (HGRAD), predict the fall enrollment (ROLL) for this year, given UNEM=9% and HGRAD=100,000.
Repeat, adding per capita income (INC) to the model. Predict ROLL if INC=$30,000.
Summarize and compare the two models. Comment on significance.
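The workflow for this exercise (fit a two-predictor model, add a third predictor, predict, then compare) might look like the sketch below. It does not use the lab's CSV: the built-in mtcars data is a stand-in, with mpg playing the role of ROLL, wt and hp the roles of UNEM and HGRAD, and qsec the role of INC. The new-observation values are arbitrary.

```r
# Stand-in data: built-in mtcars (NOT dataset_multipleRegression.csv)
data(mtcars)

# Model 1: two predictors
fit2 <- lm(mpg ~ wt + hp, data = mtcars)

# Model 2: the same model plus a third predictor
fit3 <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# New observation to predict for; column names must match the model terms exactly
new_obs <- data.frame(wt = 3.0, hp = 120, qsec = 18)
pred2 <- predict(fit2, newdata = new_obs)
pred3 <- predict(fit3, newdata = new_obs)
cat("Model 1 prediction:", pred2, "\n")
cat("Model 2 prediction:", pred3, "\n")

# Coefficient tables with t-tests, R-squared, and the overall F-test
print(summary(fit2))
print(summary(fit3))

# Nested-model F-test: does the third predictor significantly improve the fit?
cmp <- anova(fit2, fit3)
print(cmp)
```

With the lab's file, the same pattern would presumably use ROLL ~ UNEM + HGRAD and then ROLL ~ UNEM + HGRAD + INC. Check how UNEM and HGRAD are coded in the file before entering 9% and 100,000 as new values.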

Classification (2)
Retrieve the abalone.csv dataset.
The task is to predict the age of abalone from physical measurements. The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope: a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age.
Perform kNN classification to predict Age (Rings). Interpretation not required.
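Because knn() predicts classes rather than numbers, one possible approach is to bin the numeric target into groups first and to scale the predictors. The sketch below shows that idea on the built-in iris data as a stand-in (not abalone.csv): Petal.Width plays the role of Rings, and the three bins, the 70/30 split, and k = 5 are all arbitrary choices made for illustration.

```r
library(class)  # provides knn(); a recommended package bundled with most R installations

# Stand-in data: built-in iris (NOT abalone.csv)
data(iris)
set.seed(42)

# Bin the numeric target into three equal-width groups so it can be used as a class label
target <- cut(iris$Petal.Width, breaks = 3, labels = c("low", "mid", "high"))

# Put the predictors on a common scale so no single measurement dominates the distance
predictors <- scale(iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length")])

# Hold out roughly 30% of the rows for testing
n <- nrow(iris)
train_idx <- sample(seq_len(n), size = round(0.7 * n))

pred <- knn(train = predictors[train_idx, ],
            test  = predictors[-train_idx, ],
            cl    = target[train_idx],
            k     = 5)

# Predicted class versus actual class on the held-out rows
print(table(predicted = pred, actual = target[-train_idx]))
```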

Clustering (3)
The Iris dataset (in R use data("iris") to load it).
The 5th column is the species, and you want to find how many clusters there are without using that information.
Create a new data frame and remove the fifth column.
Apply kmeans (you choose k) with 1000 iterations.
Use table(iris[,5],<your clustering>) to assess results.
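The steps above can be sketched as follows. The choice k = 3 and the seed are assumptions made for this illustration; the exercise leaves k up to you.

```r
data(iris)
set.seed(1)  # kmeans starts from random centers; fixing the seed makes runs repeatable

# Drop the species label (column 5) so clustering sees only the measurements
iris_features <- iris[, -5]

# k = 3 is one possible choice; iter.max caps the number of iterations
km <- kmeans(iris_features, centers = 3, iter.max = 1000)

# Cross-tabulate the true species against the assigned cluster
result <- table(iris[, 5], km$cluster)
print(result)
```

Cluster numbers are arbitrary labels, so read the table by looking for rows that concentrate in a single column rather than for matching indices.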

End of Lab assignment 2

Regression Exercises
Using the EPI dataset, find the single most important factor in increasing the EPI in a given region.
Examine distributions down to the leaf nodes and build up an EPI “model”.

boxplot(ENVHEALTH,ECOSYSTEM)

qqplot(ENVHEALTH,ECOSYSTEM)

ENVHEALTH / ECOSYSTEM
> shapiro.test(ENVHEALTH)
Shapiro-Wilk normality test
data: ENVHEALTH
W = 0.9161, p-value = 1.083e-08
Result: reject normality.
> shapiro.test(ECOSYSTEM)
data: ECOSYSTEM
W = 0.9813, p-value = 0.02654
Result: ~reject (borderline: p is below 0.05 but above 0.01).

Kolmogorov-Smirnov (KS) test
> ks.test(ENVHEALTH,ECOSYSTEM)
Two-sample Kolmogorov-Smirnov test
data: ENVHEALTH and ECOSYSTEM
D = 0.2965, p-value = 5.413e-07
alternative hypothesis: two-sided
Warning message:
In ks.test(ENVHEALTH, ECOSYSTEM) : p-value will be approximate in the presence of ties
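To read these test results programmatically rather than from printed output, a sketch along the following lines may help. It uses two columns of the built-in iris data as stand-ins for ENVHEALTH and ECOSYSTEM, and the 0.05 threshold is a conventional choice assumed here, not something the slides specify.

```r
# Stand-in samples: two columns of the built-in iris data (NOT the EPI variables)
data(iris)
x <- iris$Sepal.Length
y <- iris$Sepal.Width

# Shapiro-Wilk: the null hypothesis is that the sample comes from a normal distribution
sw_x <- shapiro.test(x)
sw_y <- shapiro.test(y)

# Two-sample Kolmogorov-Smirnov: the null hypothesis is that x and y share one distribution.
# Tied values may trigger a warning about approximate p-values, so it is suppressed here.
ks_xy <- suppressWarnings(ks.test(x, y))

alpha <- 0.05  # conventional significance threshold (an assumption of this sketch)
cat("Shapiro x: W =", sw_x$statistic, "p =", sw_x$p.value,
    "reject normality:", sw_x$p.value < alpha, "\n")
cat("Shapiro y: W =", sw_y$statistic, "p =", sw_y$p.value,
    "reject normality:", sw_y$p.value < alpha, "\n")
cat("KS: D =", ks_xy$statistic, "p =", ks_xy$p.value,
    "reject equal distributions:", ks_xy$p.value < alpha, "\n")
```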

Linear and least-squares
> EPI_data <- read.csv("EPI_data.csv")
> attach(EPI_data)
> boxplot(ENVHEALTH,DALY,AIR_H,WATER_H)
> lmENVH<-lm(ENVHEALTH~DALY+AIR_H+WATER_H)
> lmENVH
… (what should you get?)
> summary(lmENVH)
…
> cENVH<-coef(lmENVH)

Predict
> DALYNEW<-seq(5,95,5)
> AIR_HNEW<-seq(5,95,5)
> WATER_HNEW<-seq(5,95,5)
> # column names must match the model's predictors, otherwise predict() ignores the new values
> NEW<-data.frame(DALY=DALYNEW,AIR_H=AIR_HNEW,WATER_H=WATER_HNEW)
> pENV<-predict(lmENVH,NEW,interval="prediction")
> cENV<-predict(lmENVH,NEW,interval="confidence")
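The difference between the two interval types can be seen in a small self-contained sketch. It uses the built-in mtcars data as a stand-in for the EPI data, and the grid values are arbitrary. Note that the new data frame's column names match the model's predictor names.

```r
# Stand-in data: built-in mtcars (NOT EPI_data.csv)
data(mtcars)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Grid of new predictor values; names must match the variables used in the model formula
new_data <- data.frame(wt = seq(2, 5, by = 0.5),
                       hp = seq(80, 200, by = 20))

# Confidence interval: uncertainty in the mean response at each grid point
conf_int <- predict(fit, newdata = new_data, interval = "confidence")

# Prediction interval: uncertainty for a single new observation, so it is wider
pred_int <- predict(fit, newdata = new_data, interval = "prediction")

# Each result is a matrix with columns fit, lwr, upr
print(conf_int)
print(pred_int)
```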

Repeat for AIR_E and CLIMATE

Classification Exercises (group1/lab2_knn1.R)
> library(class) # provides knn()
> nyt1<-read.csv("nyt1.csv")
> # keep only rows with positive Impressions, Clicks and Age
> nyt1<-nyt1[which(nyt1$Impressions>0 & nyt1$Clicks>0 & nyt1$Age>0),]
> nnyt1<-dim(nyt1)[1] # shrink it down!
> sampling.rate=0.9
> num.test.set.labels=nnyt1*(1.-sampling.rate)
> training<-sample(1:nnyt1,sampling.rate*nnyt1, replace=FALSE)
> train<-subset(nyt1[training,],select=c(Age,Impressions))
> testing<-setdiff(1:nnyt1,training)
> test<-subset(nyt1[testing,],select=c(Age,Impressions))
> cg<-nyt1$Gender[training]
> true.labels<-nyt1$Gender[testing]
> classif<-knn(train,test,cg,k=5)
> classif
> attributes(.Last.value) # interpretation to come!
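One common way to check a kNN result like this is to compare the predicted labels with the held-out true labels. The sketch below mirrors the same 90/10 split on the built-in iris data as a stand-in (not nyt1.csv), using the first two measurement columns as predictors and Species as the class. The confusion table and accuracy step are an addition for illustration, not part of the lab script.

```r
library(class)  # provides knn()

# Stand-in data: built-in iris (NOT nyt1.csv)
data(iris)
set.seed(7)

n <- nrow(iris)
sampling.rate <- 0.9
training <- sample(1:n, round(sampling.rate * n), replace = FALSE)
testing  <- setdiff(1:n, training)

train <- iris[training, 1:2]        # two numeric predictors
test  <- iris[testing, 1:2]
cg <- iris$Species[training]        # class labels for the training rows
true.labels <- iris$Species[testing]

classif <- knn(train, test, cg, k = 5)

# Confusion table: rows are predicted classes, columns are the true classes
confusion <- table(predicted = classif, actual = true.labels)
print(confusion)

# Fraction of held-out rows classified correctly
accuracy <- mean(classif == true.labels)
cat("Accuracy on the held-out rows:", accuracy, "\n")
```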

Classification Exercises (group1/lab2_knn2.R) 2 examples in the script

Clustering Exercises group1/lab2_kmeans1.R group1/lab2_kmeans2.R – plotting up results from the iris clustering