Peter Fox
Data Analytics – ITWS-4600/ITWS-6600
Week 3b, February 12, 2016
Lab exercises / Assignment 2

Labs
- Regression – new multivariate dataset
- kNN – new Abalone dataset
- K-means – (sort of) new Iris dataset
One each for Assignment 2, and then general exercises.

The Dataset(s)
Some new ones: dataset_multipleRegression.csv, abalone.csv.
Code fragments (i.e., they will not run as-is) appear on the following slides as Lab3b_knn1_2016.R, etc.

Remember a few useful commands: head(), tail(), summary()
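For example, a quick first look at one of this week's files might go as follows (a minimal sketch; the file name comes from the previous slide, and the object name df is just for illustration):

df <- read.csv("dataset_multipleRegression.csv")   # or abalone.csv, etc.
head(df)      # first six rows
tail(df)      # last six rows
summary(df)   # per-column summary statistics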

How does this work?
The following slides have three lab assignments for you to complete. These should be completed individually.
Once you complete one (or all), please raise your hand or approach me or Rahul to review what you obtained (together these = 10% of your grade).
There is nothing to hand in.
If you do not complete part or all of it today, that is okay, but you will need to schedule a time to show your results.

Refer to Tuesday's slides and the script fragments on the website.

Regression (1)
- Retrieve this dataset: dataset_multipleRegression.csv
- Using the unemployment rate (UNEM) and the number of spring high-school graduates (HGRAD), predict the fall enrollment (ROLL) for this year, given that UNEM = 9% and HGRAD = 100,000.
- Repeat, adding per-capita income (INC) to the model. Predict ROLL if INC = $30,000.
- Summarize and compare the two models; comment on significance (see the sketch below).
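A minimal sketch of the steps above (assuming the CSV's columns are named ROLL, UNEM, HGRAD and INC as on this slide, and that UNEM is recorded in percent):

mr <- read.csv("dataset_multipleRegression.csv")
fit1 <- lm(ROLL ~ UNEM + HGRAD, data = mr)                      # two-predictor model
predict(fit1, newdata = data.frame(UNEM = 9, HGRAD = 100000))   # predicted fall enrollment
fit2 <- lm(ROLL ~ UNEM + HGRAD + INC, data = mr)                # add per-capita income
predict(fit2, newdata = data.frame(UNEM = 9, HGRAD = 100000, INC = 30000))
summary(fit1)   # compare residuals, R-squared and coefficient p-values
summary(fit2)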

Classification (2)
- Retrieve the abalone.csv dataset.
- The goal is predicting the age of abalone from physical measurements. The age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings under a microscope: a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age.
- Perform kNN classification to get predictors for Age (Rings). Interpretation is not required (see the sketch below).
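A minimal sketch of the kNN step (assuming abalone.csv follows the usual UCI layout: column 1 = Sex, columns 2–8 = physical measurements, column 9 = Rings; the column positions, the 70/30 split and k = 5 are all assumptions):

library(class)                                  # provides knn()
abalone <- read.csv("abalone.csv")
n <- nrow(abalone)
train.idx <- sample(1:n, floor(0.7 * n))        # 70% training rows
train <- scale(abalone[train.idx, 2:8])         # numeric predictors only, standardized
test  <- scale(abalone[-train.idx, 2:8],
               center = attr(train, "scaled:center"),
               scale  = attr(train, "scaled:scale"))
pred <- knn(train, test, cl = as.factor(abalone$Rings[train.idx]), k = 5)
table(pred, abalone$Rings[-train.idx])          # rough look at predicted vs. actual rings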

Clustering (3)
- Use the Iris dataset (in R, use data(iris) to load it).
- The 5th column is the species; you want to find how many clusters there are without using that information.
- Create a new data frame with the fifth column removed.
- Apply kmeans (you choose k) with 1000 iterations.
- Use table(iris[,5], …) – cross-tabulating the species against your cluster assignments – to assess the results (see the sketch below).
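A minimal sketch of these steps (k = 3 and the seed are illustrative choices):

data(iris)
iris2 <- iris[, -5]                         # drop the Species column
set.seed(42)
km <- kmeans(iris2, centers = 3, iter.max = 1000)
table(iris[, 5], km$cluster)                # species vs. cluster assignments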

Regression Exercises
Using the EPI dataset, find the single most important factor in increasing the EPI in a given region.
Examine distributions down to the leaf nodes and build up an EPI "model".

boxplot(ENVHEALTH, ECOSYSTEM)

qqplot(ENVHEALTH, ECOSYSTEM)

ENVHEALTH / ECOSYSTEM
> shapiro.test(ENVHEALTH)
        Shapiro-Wilk normality test
data:  ENVHEALTH
W = …, p-value = 1.083e-…
→ Reject (the null hypothesis of normality).
> shapiro.test(ECOSYSTEM)
        Shapiro-Wilk normality test
data:  ECOSYSTEM
W = …, p-value = …
→ ~reject

Kolmogorov-Smirnov (KS) test
> ks.test(ENVHEALTH, ECOSYSTEM)
        Two-sample Kolmogorov-Smirnov test
data:  ENVHEALTH and ECOSYSTEM
D = …, p-value = 5.413e-07
alternative hypothesis: two-sided
Warning message:
In ks.test(ENVHEALTH, ECOSYSTEM) : p-value will be approximate in the presence of ties

Linear and least-squares
> multivariate <- read.csv("EPI_data.csv")
> attach(multivariate)     # so column names can be used directly
> boxplot(ENVHEALTH, DALY, AIR_H, WATER_H)
> lmENVH <- lm(ENVHEALTH ~ DALY + AIR_H + WATER_H)
> lmENVH                   # … (what should you get?)
> summary(lmENVH)          # …
> cENVH <- coef(lmENVH)

Predict
> DALYNEW <- c(seq(5, 95, 5))
> AIR_HNEW <- c(seq(5, 95, 5))
> WATER_HNEW <- c(seq(5, 95, 5))
> NEW <- data.frame(DALY = DALYNEW, AIR_H = AIR_HNEW, WATER_H = WATER_HNEW)   # column names must match the model terms
> pENV <- predict(lmENVH, NEW, interval = "prediction")
> cENV <- predict(lmENVH, NEW, interval = "confidence")

Repeat for AIR_E and CLIMATE.
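For example, a minimal sketch of the repeat step, assuming the same three predictors are reused with AIR_E (and likewise CLIMATE) as the response:

lmAIR_E <- lm(AIR_E ~ DALY + AIR_H + WATER_H)
summary(lmAIR_E)
pAIR_E <- predict(lmAIR_E, NEW, interval = "prediction")
lmCLIMATE <- lm(CLIMATE ~ DALY + AIR_H + WATER_H)
summary(lmCLIMATE)
pCLIMATE <- predict(lmCLIMATE, NEW, interval = "prediction")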

Classification Exercises (Lab3b_knn1_2016.R)
> library(class)   # needed for knn()
> nyt1<-read.csv("nyt1.csv")
> nyt1<-nyt1[which(nyt1$Impressions>0 & nyt1$Clicks>0 & nyt1$Age>0),]   # keep rows with positive Impressions, Clicks and Age
> nnyt1<-dim(nyt1)[1]   # shrink it down!
> sampling.rate=0.9
> num.test.set.labels=nnyt1*(1.-sampling.rate)
> training<-sample(1:nnyt1,sampling.rate*nnyt1,replace=FALSE)
> train<-subset(nyt1[training,],select=c(Age,Impressions))
> testing<-setdiff(1:nnyt1,training)
> test<-subset(nyt1[testing,],select=c(Age,Impressions))
> cg<-nyt1$Gender[training]
> true.labels<-nyt1$Gender[testing]
> classif<-knn(train,test,cg,k=5)
> classif
> attributes(.Last.value)   # interpretation to come!

Classification Exercises (Lab3b_knn2_2016.R)
There are two examples in the script.

Clustering Exercises
- Lab3b_kmeans1_2016.R
- Lab3b_kmeans2_2016.R – plotting up the results from the iris clustering
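As a rough idea of the kind of plot such a script can produce (this is a sketch, not the contents of Lab3b_kmeans2_2016.R; the choice of petal variables and k = 3 are assumptions):

data(iris)
km <- kmeans(iris[, -5], centers = 3, iter.max = 1000)
plot(iris$Petal.Length, iris$Petal.Width, col = km$cluster,
     pch = as.numeric(iris$Species),
     xlab = "Petal length", ylab = "Petal width")
points(km$centers[, c("Petal.Length", "Petal.Width")],
       col = 1:3, pch = 8, cex = 2)          # cluster centers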