Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4b, February 20, 2015 Lab: regression, kNN and K-means results, interpreting and evaluating models.



Classification (2)
Retrieve the abalone.csv dataset.
Predicting the age of abalone from physical measurements: the age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope, a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age.
Perform kNN classification to get predictors for Age (Rings). Interpretation not required.

What did you get? See pdf.

Clustering (3)
The Iris dataset (in R use data("iris") to load it).
The 5th column is the species, and you want to find how many clusters there are without using that information.
Create a new data frame and remove the fifth column.
Apply kmeans (you choose k) with 1000 iterations.
Use table(iris[,5], ) to assess results, passing the cluster assignments as the second argument.
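A minimal sketch of the clustering steps above, assuming k = 3 and a fixed random seed (both are arbitrary choices made here for illustration, not values given in the lab). The names `iris_num`, `km` and `confusion` are introduced only for this example.

```r
# Load the built-in iris data and drop the species column (column 5)
data("iris")
iris_num <- iris[, -5]

# k-means with k = 3 (an assumed choice) and up to 1000 iterations
set.seed(42)  # arbitrary seed so the run is repeatable
km <- kmeans(iris_num, centers = 3, iter.max = 1000)

# Cross-tabulate the true species (rows) against the assigned clusters (columns)
confusion <- table(iris[, 5], km$cluster)
print(confusion)
```

Cluster labels are arbitrary integers, so a good result shows each species concentrated in one column rather than on the diagonal.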

Return object
cluster: A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
centers: A matrix of cluster centres.
totss: The total sum of squares.
withinss: Vector of within-cluster sum of squares, one component per cluster.
tot.withinss: Total within-cluster sum of squares, i.e., sum(withinss).
betweenss: The between-cluster sum of squares, i.e. totss-tot.withinss.
size: The number of points in each cluster.
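The relationships between these components can be checked directly. This is a small sketch on the built-in iris data, again assuming k = 3 and an arbitrary seed; `km` and `explained` are names chosen for this example.

```r
data("iris")
set.seed(1)  # arbitrary seed
km <- kmeans(iris[, -5], centers = 3, iter.max = 1000)

km$size     # number of points in each cluster
km$centers  # one row of feature means per cluster

# tot.withinss is the sum of the per-cluster within sums of squares
all.equal(km$tot.withinss, sum(km$withinss))

# betweenss is totss minus tot.withinss
all.equal(km$betweenss, km$totss - km$tot.withinss)

# Share of the total variation accounted for by the clustering
explained <- km$betweenss / km$totss
explained
```

The ratio betweenss/totss is a convenient single number for comparing runs with different k, although it always increases as k grows.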

Contingency tables
See pdf file.

Contingency tables
> table(nyt1$Impressions,nyt1$Gender)
# Contingency table - displays the (multivariate) frequency distribution of the variable. Tests for significance (not now)
> table(nyt1$Clicks,nyt1$Gender)
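Since nyt1.csv is not included here, the following sketch uses a tiny made-up data frame with the same column names to show what table() returns. The values in `toy` are invented purely for illustration and are not taken from the nyt1 data.

```r
# Toy stand-in for nyt1 (values invented purely for illustration)
toy <- data.frame(
  Gender = c(0, 1, 1, 0, 1, 0, 0, 1),
  Clicks = c(0, 1, 0, 0, 2, 1, 0, 0)
)

# Rows are the Clicks values, columns are the Gender values
ct <- table(toy$Clicks, toy$Gender)
print(ct)

# Add row and column totals
addmargins(ct)

# Proportions within each Gender column
props <- prop.table(ct, margin = 2)
props
```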

Regression Exercises
Using the EPI dataset, find the single most important factor in increasing the EPI in a given region.
Examine distributions down to the leaf nodes and build up an EPI “model”.

Linear and least-squares
> EPI_data <- read.csv("EPI_data.csv")
> attach(EPI_data)
> boxplot(ENVHEALTH,DALY,AIR_H,WATER_H)
> lmENVH <- lm(ENVHEALTH~DALY+AIR_H+WATER_H)
> lmENVH
… (what should you get?)
> summary(lmENVH)
…
> cENVH <- coef(lmENVH)

Linear and least-squares
> lmENVH <- lm(ENVHEALTH~DALY+AIR_H+WATER_H)
> lmENVH
Call:
lm(formula = ENVHEALTH ~ DALY + AIR_H + WATER_H)
Coefficients:
(Intercept) DALY AIR_H WATER_H
[coefficient values lost in extraction]
> summary(lmENVH)
…
> cENVH <- coef(lmENVH)

Read the documentation!

Linear and least-squares
> summary(lmENVH)
Call:
lm(formula = ENVHEALTH ~ DALY + AIR_H + WATER_H)
Residuals:
Min 1Q Median 3Q Max
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) …
DALY 5.000e… <2e-16 ***
AIR_H 2.500e… <2e-16 ***
WATER_H 2.500e… <2e-16 ***
p < 0.01: very strong presumption against the null hypothesis vs. this fit
0.01 < p < 0.05: strong presumption against the null hypothesis
0.05 < p < 0.1: low presumption against the null hypothesis
p > 0.1: no presumption against the null hypothesis

Linear and least-squares
Continued:
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: … on 178 degrees of freedom
(49 observations deleted due to missingness)
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 3.983e+09 on 3 and 178 DF, p-value: < 2.2e-16
> names(lmENVH)
[1] "coefficients" "residuals" "effects" "rank" "fitted.values" "assign"
[7] "qr" "df.residual" "na.action" "xlevels" "call" "terms"
[13] "model"
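The pieces of a summary() printout can also be pulled out programmatically. This sketch uses the built-in cars data rather than the EPI data, and maps each p-value onto the four presumption bands listed on the summary slide; `fit`, `s`, `pvals` and `strength` are names chosen for this example.

```r
# Fit a simple model on a built-in dataset (stand-in for the EPI fit)
fit <- lm(dist ~ speed, data = cars)
s <- summary(fit)

# The coefficient table is a matrix; p-values are in the "Pr(>|t|)" column
coefs <- s$coefficients
pvals <- coefs[, "Pr(>|t|)"]

# Map each p-value onto the presumption bands from the slide
strength <- cut(pvals,
                breaks = c(0, 0.01, 0.05, 0.1, 1),
                labels = c("very strong", "strong", "low", "none"),
                include.lowest = TRUE)
data.frame(p = pvals, presumption = strength)

# Other quantities shown at the bottom of the printout
s$r.squared      # Multiple R-squared
s$adj.r.squared  # Adjusted R-squared
s$sigma          # Residual standard error
```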

Object of class lm:
An object of class "lm" is a list containing at least the following components:
coefficients: a named vector of coefficients
residuals: the residuals, that is response minus fitted values.
fitted.values: the fitted mean values.
rank: the numeric rank of the fitted linear model.
weights: (only for weighted fits) the specified weights.
df.residual: the residual degrees of freedom.
call: the matched call.
terms: the terms object used.
contrasts: (only where relevant) the contrasts used.
xlevels: (only where relevant) a record of the levels of the factors used in fitting.
offset: the offset used (missing if none were used).
y: if requested, the response used.
x: if requested, the model matrix used.
model: if requested (the default), the model frame used.
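A short sketch of how these components can be inspected, using the built-in cars data as a stand-in for the EPI fit; `fit` and `manual_resid` are names introduced for this example.

```r
fit <- lm(dist ~ speed, data = cars)

# Component names of the fitted object
names(fit)

# coef() is the accessor for fit$coefficients
coef(fit)

# Residuals are the response minus the fitted values
manual_resid <- cars$dist - fit$fitted.values
all.equal(unname(fit$residuals), unname(manual_resid))

# Residual degrees of freedom: 50 observations minus 2 coefficients
fit$df.residual
```

Accessor functions such as coef(), resid() and fitted() are generally preferable to reaching into the list with $, because they also apply any missing-value padding the model was fitted with.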

Plot original versus fitted
> plot(ENVHEALTH,col="red")
> points(lmENVH$fitted.values,col="blue")
Huh?

Try again!
> plot(ENVHEALTH[!is.na(ENVHEALTH)], col="red")
> points(lmENVH$fitted.values,col="blue")
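The mismatch arises because lm() drops rows with missing values before fitting, so the fitted values are shorter than the original response. The sketch below shows this on the built-in airquality data (a stand-in chosen here because it contains NAs) and two ways of lining the vectors up; all object names are assumptions of this example.

```r
data("airquality")
fit <- lm(Ozone ~ Wind + Temp, data = airquality)

# Rows with a missing response or predictor are dropped before fitting
n_all  <- nrow(airquality)
n_used <- length(fit$fitted.values)
c(all = n_all, used = n_used)

# Option 1: index the original response by the rows the model kept
# (the fitted values carry the original row names)
kept <- as.integer(names(fit$fitted.values))
observed_kept <- airquality$Ozone[kept]

# Option 2: refit with na.exclude so fitted() is padded with NA
fit2 <- lm(Ozone ~ Wind + Temp, data = airquality, na.action = na.exclude)
padded <- fitted(fit2)
length(padded) == n_all
```

Filtering on the response alone, as in the slide, lines up only when the predictors have no additional missing values; indexing by the kept rows or using na.exclude avoids that assumption.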

Predict
> cENVH <- coef(lmENVH)
> DALYNEW <- c(seq(5,95,5)) #2
> AIR_HNEW <- c(seq(5,95,5)) #3
> WATER_HNEW <- c(seq(5,95,5)) #4

Predict
> NEW <- data.frame(DALYNEW,AIR_HNEW,WATER_HNEW)
> pENV <- predict(lmENVH,NEW,interval="prediction")
> cENV <- predict(lmENVH,NEW,interval="confidence") # look up what this does

Predict object returns
predict.lm produces a vector of predictions or a matrix of predictions and bounds with column names fit, lwr, and upr if interval is set. Access via [,1] etc.
If se.fit is TRUE, a list with the following components is returned:
fit: vector or matrix as above
se.fit: standard error of predicted means
residual.scale: residual standard deviations
df: degrees of freedom for residual

Output from predict
> head(pENV)
fit lwr upr
1 NA NA NA
NA NA NA
…

> tail(pENV)
fit lwr upr
226 NA NA NA
227 NA NA NA
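One plausible reading of the NA-filled output (an interpretation, not something the slides state) is that the columns of NEW are named DALYNEW, AIR_HNEW and WATER_HNEW, while the model formula refers to DALY, AIR_H and WATER_H. predict() looks up predictors by name, so new data needs columns named exactly as in the formula. The sketch below shows the matching-name pattern on the built-in cars data; `new_ok`, `p_pred` and `p_conf` are names chosen for this example.

```r
fit <- lm(dist ~ speed, data = cars)

# The column name matches the predictor in the formula, so predict() uses it
new_ok <- data.frame(speed = seq(5, 25, 5))

# Prediction interval: range for a single new observation
p_pred <- predict(fit, new_ok, interval = "prediction")

# Confidence interval: range for the mean response at each speed
p_conf <- predict(fit, new_ok, interval = "confidence")

p_pred
p_conf

# Prediction intervals are wider than confidence intervals
width_pred <- p_pred[, "upr"] - p_pred[, "lwr"]
width_conf <- p_conf[, "upr"] - p_conf[, "lwr"]
all(width_pred > width_conf)
```

For the EPI model the equivalent would be a data frame whose columns are named DALY, AIR_H and WATER_H.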

Read the documentation!

Classification Exercises (Lab3b_knn1.R)
> library(class) # provides knn()
> nyt1<-read.csv("nyt1.csv")
> nyt1<-nyt1[which(nyt1$Impressions>0 & nyt1$Clicks>0 & nyt1$Age>0),]
> nnyt1<-dim(nyt1)[1] # shrink it down!
> sampling.rate=0.9
> num.test.set.labels=nnyt1*(1.-sampling.rate)
> training<-sample(1:nnyt1,sampling.rate*nnyt1, replace=FALSE)
> train<-subset(nyt1[training,],select=c(Age,Impressions))
> testing<-setdiff(1:nnyt1,training)
> test<-subset(nyt1[testing,],select=c(Age,Impressions))
> cg<-nyt1$Gender[training]
> true.labels<-nyt1$Gender[testing]
> classif<-knn(train,test,cg,k=5) #
> classif
> attributes(.Last.value) # interpretation to come!

K Nearest Neighbors (classification)
Script – Lab3b_knn1_2015.R
> nyt1<-read.csv("nyt1.csv")
… from week 3b slides or script
> classif<-knn(train,test,cg,k=5) #
> head(true.labels)
[1] …
> head(classif)
[1] …
Levels: 0 1
> ncorrect<-true.labels==classif
> table(ncorrect)["TRUE"] # or
> length(which(ncorrect))
What do you conclude?
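The same train/test/evaluate pattern can be run end to end on a built-in dataset. This sketch substitutes iris for nyt1 (an assumption made so the example is self-contained), keeps the 90/10 split and k = 5 from the slides, and uses an arbitrary seed; it turns the count of correct predictions into an accuracy.

```r
library(class)  # provides knn()

data("iris")
set.seed(123)  # arbitrary seed so the split is repeatable
n <- nrow(iris)

# 90% of the rows for training, the rest for testing
train_idx <- sample(1:n, size = round(0.9 * n), replace = FALSE)
test_idx  <- setdiff(1:n, train_idx)

train <- iris[train_idx, 1:4]
test  <- iris[test_idx, 1:4]
cl          <- iris$Species[train_idx]  # known labels for the training rows
true_labels <- iris$Species[test_idx]   # held-out labels for evaluation

pred <- knn(train, test, cl, k = 5)

# Fraction of test rows classified correctly
ncorrect <- true_labels == pred
accuracy <- mean(ncorrect)
accuracy

# Confusion matrix: true class (rows) vs predicted class (columns)
table(true_labels, pred)
```

A single split gives a noisy estimate; repeating with different seeds, or comparing against the majority-class rate, helps answer the "what do you conclude?" question.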

Classification Exercises (Lab3b_knn2_2015.R)
2 examples in the script

Clustering Exercises
Lab3b_kmeans1.R
Lab3b_kmeans2.R – plotting up results from the iris clustering

Regression
> bronx<-read.xlsx("sales/rollingsales_bronx.xls",pattern="BOROUGH",stringsAsFactors=FALSE,sheetIndex=1,startRow=5,header=TRUE)
> plot(log(bronx$GROSS.SQUARE.FEET), log(bronx$SALE.PRICE))
> m1<-lm(log(bronx$SALE.PRICE)~log(bronx$GROSS.SQUARE.FEET),data=bronx)
What’s wrong?

Clean up…
> bronx<-bronx[which(bronx$GROSS.SQUARE.FEET>0 & bronx$LAND.SQUARE.FEET>0 & bronx$SALE.PRICE>0),]
> m1<-lm(log(bronx$SALE.PRICE)~log(bronx$GROSS.SQUARE.FEET),data=bronx)
> summary(m1)

Call:
lm(formula = log(SALE.PRICE) ~ log(GROSS.SQUARE.FEET), data = bronx)
Residuals:
Min 1Q Median 3Q Max
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) … <2e-16 ***
log(GROSS.SQUARE.FEET) … <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.95 on 2435 degrees of freedom
Multiple R-squared: …, Adjusted R-squared: …
F-statistic: … on 1 and 2435 DF, p-value: < 2.2e-16

Plot
> plot(log(bronx$GROSS.SQUARE.FEET), log(bronx$SALE.PRICE))
> abline(m1,col="red",lwd=2)
# then
> plot(resid(m1))

Another model (2)?
Add two more variables to the linear model: LAND.SQUARE.FEET and NEIGHBORHOOD.
Repeat but suppress the intercept (2a).

Model 3/4
Model 3: Log(SALE.PRICE) vs. no intercept, Log(GROSS.SQUARE.FEET), Log(LAND.SQUARE.FEET), NEIGHBORHOOD, BUILDING.CLASS.CATEGORY
Model 4: Log(SALE.PRICE) vs. no intercept, Log(GROSS.SQUARE.FEET), Log(LAND.SQUARE.FEET), NEIGHBORHOOD*BUILDING.CLASS.CATEGORY
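The formula syntax behind these variants can be tried on a small built-in dataset before running it on the Bronx sales. This sketch uses mtcars as a stand-in, with cyl and am playing the roles of the two categorical variables; the model names and variable choices are assumptions of this example.

```r
data("mtcars")
mt <- transform(mtcars, cyl = factor(cyl), am = factor(am))

# With an intercept: the first cyl level is absorbed into the baseline
m_int <- lm(log(mpg) ~ log(wt) + cyl, data = mt)

# "0 +" suppresses the intercept, so every cyl level gets its own coefficient
m_noint <- lm(log(mpg) ~ 0 + log(wt) + cyl, data = mt)

# cyl * am expands to cyl + am + cyl:am (main effects plus interaction)
m_inter <- lm(log(mpg) ~ 0 + log(wt) + cyl * am, data = mt)

names(coef(m_int))
names(coef(m_noint))
names(coef(m_inter))

# With a factor present, dropping the intercept only re-parameterises the fit
all.equal(fitted(m_int), fitted(m_noint))
```

Note that summary() reports R-squared differently for a no-intercept model (relative to zero rather than to the mean), so R-squared values from models with and without an intercept are not directly comparable.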

Solution model 2
> m2<-lm(log(bronx$SALE.PRICE)~log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD),data=bronx)
> summary(m2)
> plot(resid(m2))
#
> m2a<-lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD),data=bronx)
> summary(m2a)
> plot(resid(m2a))


Solution model 3 and 4
> m3<-lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD)+factor(bronx$BUILDING.CLASS.CATEGORY),data=bronx)
> summary(m3)
> plot(resid(m3))
#
> m4<-lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD)*factor(bronx$BUILDING.CLASS.CATEGORY),data=bronx)
> summary(m4)
> plot(resid(m4))


Assignment 3
Preliminary and Statistical Analysis. Due ~ March 6. 15% (written).
Distribution analysis and comparison, visual ‘analysis’, statistical model fitting and testing of some of the nyt1…31 datasets.