Peter Fox and Greg Hughes Data Analytics – ITWS-4600/ITWS-6600


Lab exercises: working with real datasets, plotting, more regression, kNN and K-means…
Peter Fox and Greg Hughes
Data Analytics – ITWS-4600/ITWS-6600
Group 2, Lab 1, February 9, 2017

Plot tools/tips
http://statmethods.net/advgraphs/layout.html
http://flowingdata.com/2014/02/27/how-to-read-histograms-and-use-them-in-r/
pairs, gpairs, scatterplot.matrix, clustergram, etc.
data() # precip, presidents, iris, swiss, sunspot.month (!), environmental, ethanol, ionosphere
More script fragments in R will be available on the web site (http://aquarius.tw.rpi.edu/html/DA)
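
As a starting point for the plotting tools listed above, here is a minimal sketch using only base R and two of the built-in datasets named on the slide (iris and precip). The choice of columns, colours, break count and titles is an assumption made for illustration, not something taken from the lab scripts.

```r
# Scatterplot matrix of the four numeric iris measurements,
# coloured by species (colour mapping is an illustrative choice).
data(iris)
pairs(iris[, 1:4],
      col = as.integer(iris$Species),
      main = "iris: pairwise scatterplots")

# Histogram of the built-in precip dataset; hist() also returns
# the bin counts, which is handy for checking what was drawn.
h <- hist(precip, breaks = 10,
          main = "Annual precipitation, US cities",
          xlab = "Inches")
print(h$counts)
```

gpairs and scatterplot.matrix come from add-on packages, so they are left out here to keep the sketch runnable on a plain R installation.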

Scripts – work through these
See in folder group2/
lab1_pairs1.R
lab1_splom.R
lab1_gpairs1.R
lab1_mosaic.R
lab1_spm.R
lab1_wknn.R
lab1_kknn1.R
lab1_kknn2.R
lab1_kknn3.R
lab1_kmeans1.R
lab1_ctree2.R
lab1_nyt.R
lab1_bronx1.R
lab1_bronx2.R

K Nearest Neighbors (classification)
Script – group2/lab1_nyt.R
> nyt1<-read.csv("nyt1.csv")
… from week 3b slides or script
> classif<-knn(train,test,cg,k=5) #
> head(true.labels)
[1] 1 0 0 1 1 0
> head(classif)
[1] 1 1 1 1 0 0
Levels: 0 1
> ncorrect<-true.labels==classif
> table(ncorrect)["TRUE"] # or
> length(which(ncorrect))
What do you conclude?
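
The nyt1.csv preparation steps are elided on the slide, so the following sketch reproduces the same knn-then-count-correct pattern on the built-in iris data instead. The variable names train, test, cg, true.labels and classif mirror the slide; the dataset, the 100/50 split and the seed are assumptions made only so the example runs on its own.

```r
library(class)  # provides knn(); ships with standard R installations

set.seed(42)                      # assumed seed, for repeatability only
idx <- sample(nrow(iris), 100)    # assumed split: 100 training rows

train       <- iris[idx, 1:4]
test        <- iris[-idx, 1:4]
cg          <- iris$Species[idx]  # class labels of the training rows
true.labels <- iris$Species[-idx]

classif  <- knn(train, test, cg, k = 5)

# Same bookkeeping as on the slide, plus a proportion and a
# confusion table, which say more than a raw count of TRUEs.
ncorrect <- true.labels == classif
accuracy <- mean(ncorrect)
print(table(predicted = classif, actual = true.labels))
print(accuracy)
```

A raw count of correct predictions is hard to judge without the test-set size, which is why the sketch reports a proportion and a per-class breakdown.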

NYC Housing data http://aquarius.tw.rpi.edu/html/DA/rollingsales_bronx.xls

Bronx 1 = Regression
You were reminded that log(0) is … not fun
> plot(log(bronx$GROSS.SQUARE.FEET), log(bronx$SALE.PRICE) )
> m1<-lm(log(bronx$SALE.PRICE)~log(bronx$GROSS.SQUARE.FEET),data=bronx)
THINK through what you are doing…
Filtering is somewhat inevitable:
> bronx<-bronx[which(bronx$GROSS.SQUARE.FEET>0 & bronx$LAND.SQUARE.FEET>0 & bronx$SALE.PRICE>0),]
Lab5b_bronx1_2016.R
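
To make the log(0) warning concrete, here is a small sketch on a made-up data frame that reuses the three column names from the slide. The six rows are invented for illustration and are not taken from the Bronx rolling-sales file.

```r
# Hypothetical toy data: zeros stand in for missing or nominal entries.
sales <- data.frame(
  GROSS.SQUARE.FEET = c(0, 1200, 2500, 1800, 0, 3200),
  LAND.SQUARE.FEET  = c(2000, 0, 2500, 1700, 1500, 4000),
  SALE.PRICE        = c(350000, 0, 610000, 425000, 0, 780000)
)

print(log(0))  # -Inf, which a least-squares fit cannot use

# Same filter as on the slide: keep rows where all three are positive.
clean <- sales[which(sales$GROSS.SQUARE.FEET > 0 &
                     sales$LAND.SQUARE.FEET  > 0 &
                     sales$SALE.PRICE        > 0), ]

print(nrow(clean))
print(all(is.finite(log(clean$SALE.PRICE))))
```

Filtering before plotting and fitting, rather than after, keeps the plot and the model on the same set of rows.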

Interpreting this!
Call:
lm(formula = log(SALE.PRICE) ~ log(GROSS.SQUARE.FEET), data = bronx)

Residuals:
     Min       1Q   Median       3Q      Max
-14.4529   0.0377   0.4160   0.6572   3.8159

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)              7.0271     0.3088   22.75   <2e-16 ***
log(GROSS.SQUARE.FEET)   0.7013     0.0379   18.50   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.95 on 2435 degrees of freedom
Multiple R-squared:  0.1233,  Adjusted R-squared:  0.1229
F-statistic: 342.4 on 1 and 2435 DF,  p-value: < 2.2e-16
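
One way to read this output, under the standard interpretation of a log-log regression: the slope is an elasticity, so it describes percentage change in price per percentage change in floor area. The sketch below only does arithmetic on the three numbers printed above; the variable names are illustrative.

```r
# Values copied from the summary output above.
slope     <- 0.7013   # coefficient on log(GROSS.SQUARE.FEET)
r_squared <- 0.1233

# Multiplicative effect on the predicted price of doubling the area.
double_factor <- 2^slope

# Approximate percentage change in price for a 10% larger building.
pct_for_10pct <- (1.10^slope - 1) * 100

cat("Doubling area multiplies predicted price by", round(double_factor, 3), "\n")
cat("10% more area gives about", round(pct_for_10pct, 1), "% higher predicted price\n")
cat("Share of variance in log price explained:", r_squared * 100, "%\n")
```

The slope is highly significant, yet the model explains only about 12% of the variance in log price, and the minimum residual of -14.45 is far from the quartiles, so significance alone does not make this a good fit.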

Plots – tell me what they tell you!

Solution model 2
> m2<-lm(log(bronx$SALE.PRICE)~log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD),data=bronx)
> summary(m2)
> plot(resid(m2))
#
> m2a<-lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD),data=bronx)
> summary(m2a)
> plot(resid(m2a))
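
The only difference between m2 and m2a is the `0 +` term, which drops the intercept. When a factor is in the model this is a re-parameterisation rather than a different fit, which the following sketch checks on synthetic data. The data frame, the neighbourhood labels A/B/C and every coefficient used to simulate prices are invented for illustration.

```r
set.seed(1)
n <- 200

# Hypothetical stand-in for the housing data.
d <- data.frame(
  sqft = runif(n, 500, 5000),
  hood = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
hood_shift <- unname(c(A = 0, B = 0.3, C = -0.2)[as.character(d$hood)])
d$price <- exp(7 + 0.7 * log(d$sqft) + hood_shift + rnorm(n, sd = 0.5))

with_int <- lm(log(price) ~ log(sqft) + hood, data = d)      # like m2
no_int   <- lm(log(price) ~ 0 + log(sqft) + hood, data = d)  # like m2a

# Fitted values agree; only the coefficient layout changes
# (one level per neighbourhood instead of offsets from a baseline).
same_fit <- isTRUE(all.equal(unname(fitted(with_int)), unname(fitted(no_int))))
print(same_fit)

# R-squared is computed differently without an intercept, so the
# two values are not comparable even though the fit is identical.
print(summary(with_int)$r.squared)
print(summary(no_int)$r.squared)
```

This is worth keeping in mind when comparing the R-squared values reported by summary(m2) and summary(m2a).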

How do you interpret this residual plot?
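
plot(resid(m)) draws residuals against row index, which mostly reveals ordering effects and isolated extreme points. A sketch on synthetic data, with a few deliberately planted low-price rows, shows how that view compares with residuals against fitted values. The data and the planted rows are hypothetical and say nothing about what actually causes the pattern in the Bronx models.

```r
set.seed(3)
x <- runif(200, 500, 5000)
y <- 7 + 0.7 * log(x) + rnorm(200, sd = 0.5)
y[1:5] <- y[1:5] - 10   # planted: five rows far below the trend

m <- lm(y ~ log(x))

# Residuals by row index, as on the slides.
plot(resid(m), ylab = "Residual", main = "Residuals by row index")

# Residuals against fitted values, the more usual diagnostic.
plot(fitted(m), resid(m), xlab = "Fitted", ylab = "Residual",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)

# Rows with large standardized residuals deserve a closer look.
outliers <- which(abs(rstandard(m)) > 3)
print(outliers)
```

The cutoff of 3 for standardized residuals is a common rule of thumb, not a fixed criterion.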

Solution model 3 and 4
> m3<-lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD)+factor(bronx$BUILDING.CLASS.CATEGORY),data=bronx)
> summary(m3)
> plot(resid(m3))
#
> m4<-lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD)*factor(bronx$BUILDING.CLASS.CATEGORY),data=bronx)
> summary(m4)
> plot(resid(m4))
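
Models 3 and 4 differ only in `+` versus `*` between the two factors: `*` adds an interaction term for each combination of levels. The sketch below shows the effect on the number of coefficients and compares the nested models with anova(). The data, the level names and the simulated relationship are assumptions for illustration only.

```r
set.seed(2)
n <- 300

# Hypothetical data with a 3-level and a 2-level factor.
d <- data.frame(
  hood = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  cls  = factor(sample(c("one_family", "two_family"), n, replace = TRUE)),
  sqft = runif(n, 500, 5000)
)
d$logprice <- 7 + 0.7 * log(d$sqft) + rnorm(n, sd = 0.5)

additive <- lm(logprice ~ log(sqft) + hood + cls, data = d)  # like m3
interact <- lm(logprice ~ log(sqft) + hood * cls, data = d)  # like m4

# The interaction model carries (3 - 1) * (2 - 1) extra coefficients.
print(length(coef(additive)))
print(length(coef(interact)))

# F-test of whether the interaction terms improve the fit.
print(anova(additive, interact))
```

With many neighbourhoods and building classes the interaction model grows quickly and some combinations may have few or no sales, so a larger R-squared for m4 is not by itself evidence of a better model.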

And this one?

Bronx 2 = complex example
See lab1_bronx2.R
Manipulation
Mapping
knn
kmeans

KNN! Did you loop over k?
{
knnpred<-knn(mapcoord[trainid,3:4],mapcoord[testid,3:4],cl=mapcoord[trainid,2],k=5)
knntesterr<-sum(knnpred!=mappred$class)/length(testid)
}
knntesterr
[1] 0.1028037 0.1308411 0.1308411 0.1588785 0.1401869 0.1495327 0.1682243 0.1962617 0.1962617 0.1869159
What do you think?
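
The braces on the slide look like the body of a loop whose header did not survive in the transcript, and the ten error values suggest ten values of k. A sketch of one plausible form follows, run on the built-in iris data because mapcoord and mappred are built elsewhere in lab1_bronx2.R. The range k = 1..10, the split and the seed are all assumptions.

```r
library(class)  # provides knn()

set.seed(7)
trainid <- sample(nrow(iris), 100)                       # assumed split
testid  <- setdiff(seq_len(nrow(iris)), trainid)

ks <- 1:10                                               # assumed range of k
knntesterr <- numeric(length(ks))

for (i in seq_along(ks)) {
  knnpred <- knn(iris[trainid, 1:4], iris[testid, 1:4],
                 cl = iris$Species[trainid], k = ks[i])
  # Misclassification rate on the held-out rows for this k.
  knntesterr[i] <- sum(knnpred != iris$Species[testid]) / length(testid)
}

print(knntesterr)
best_k <- ks[which.min(knntesterr)]
print(best_k)
```

In the output on the slide the smallest error is the first entry and the error tends to rise afterwards, so if the entries do correspond to increasing k, small neighbourhoods work best for that data. A single train/test split is noisy, though, so repeating the split or cross-validating would give a steadier picture.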

Plot()