Regression, classification and clustering – interpreting and exploring data
Peter Fox and Greg Hughes
Data Analytics – ITWS-4600/ITWS-6600
Group 2, Module 5, February 6, 2017

Regression
Retrieve this dataset: dataset_multipleRegression.csv
Using the unemployment rate (UNEM) and number of spring high school graduates (HGRAD), predict the fall enrollment (ROLL) for this year, given that UNEM = 9% and HGRAD = 100,000.
Repeat, adding per capita income (INC) to the model; predict ROLL if INC = $30,000.
Summarize and compare the two models, and comment on significance.
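A minimal sketch of these steps in R, assuming the CSV has columns named ROLL, UNEM, HGRAD and INC (the column names are assumptions):

enroll <- read.csv("dataset_multipleRegression.csv")

# Model 1: fall enrollment from unemployment rate and spring graduates
m1 <- lm(ROLL ~ UNEM + HGRAD, data = enroll)
predict(m1, newdata = data.frame(UNEM = 9, HGRAD = 100000))

# Model 2: add per capita income
m2 <- lm(ROLL ~ UNEM + HGRAD + INC, data = enroll)
predict(m2, newdata = data.frame(UNEM = 9, HGRAD = 100000, INC = 30000))

summary(m1)  # compare coefficients, p-values and R-squared
summary(m2)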

Object of class lm: an object of class "lm" is a list containing at least the following components:
coefficients: a named vector of coefficients
residuals: the residuals, that is, response minus fitted values
fitted.values: the fitted mean values
rank: the numeric rank of the fitted linear model
weights: (only for weighted fits) the specified weights
df.residual: the residual degrees of freedom
call: the matched call
terms: the terms object used
contrasts: (only where relevant) the contrasts used
xlevels: (only where relevant) a record of the levels of the factors used in fitting
offset: the offset used (missing if none were used)
y: if requested, the response used
x: if requested, the model matrix used
model: if requested (the default), the model frame used
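For example, these components can be pulled straight off a fitted object; a small sketch using the built-in cars dataset:

fit <- lm(dist ~ speed, data = cars)  # built-in cars dataset
coef(fit)            # equivalently fit$coefficients
head(fit$residuals)  # response minus fitted values
fit$df.residual      # residual degrees of freedom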

Regression Exercises (lab2)
Using the EPI dataset, find the single most important factor in increasing the EPI in a given region.
Examine distributions across all the columns and build up an EPI "model".
We will be interpreting and discussing these models next module (week)!
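One possible starting point; the file name and the assumption that EPI is a numeric column (with numeric candidate predictors) are hypothetical:

epi <- read.csv("EPI_data.csv")    # hypothetical file name
summary(epi)                       # quick look at every column's distribution
hist(epi$EPI)                      # shape of the response
epi.lm <- lm(EPI ~ ., data = epi)  # candidate full model; may need to drop non-numeric/id columns first
summary(epi.lm)                    # largest |t| (smallest p-value) suggests the most important factor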

Classification
abalone.csv dataset: predicting the age of abalone from physical measurements.
The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope: a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age.
Perform kNN classification to get predictors for Age (Rings).
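A minimal sketch, assuming abalone.csv follows the UCI layout (a Rings column plus numeric shell measurements; the feature names below are assumptions):

library(class)  # provides knn()
ab <- read.csv("abalone.csv")
n <- nrow(ab)
set.seed(1)
tr <- sample(1:n, floor(0.9 * n))            # 90% training split
feats <- c("Length", "Diameter", "Height")   # assumed column names
pred <- knn(train = ab[tr, feats], test = ab[-tr, feats],
            cl = factor(ab$Rings[tr]), k = 5)
table(pred, ab$Rings[-tr])                   # predicted vs. true rings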

Clustering
The Iris dataset (in R, use data("iris") to load it).
The 5th column is the species; you want to find how many clusters there are without using that information.
Create a new data frame and remove the fifth column.
Apply kmeans (you choose k) with 1000 iterations.
Use table(iris[,5], <your clustering>) to assess the results.
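A minimal sketch of those steps (k = 3 is one reasonable choice here):

data(iris)
iris.num <- iris[, -5]                 # drop the Species column
set.seed(42)                           # optional, for reproducibility
km <- kmeans(iris.num, centers = 3, iter.max = 1000)
table(iris[, 5], km$cluster)           # clusters vs. actual species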

Return object (kmeans)
cluster: a vector of integers (from 1:k) indicating the cluster to which each point is allocated
centers: a matrix of cluster centres
totss: the total sum of squares
withinss: vector of within-cluster sum of squares, one component per cluster
tot.withinss: total within-cluster sum of squares, i.e., sum(withinss)
betweenss: the between-cluster sum of squares, i.e., totss - tot.withinss
size: the number of points in each cluster
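A short usage sketch of those components, re-running the kmeans from the previous slide:

km <- kmeans(iris[, -5], centers = 3, iter.max = 1000)
km$size          # points per cluster
km$centers       # matrix of cluster centres
km$tot.withinss  # total within-cluster sum of squares (lower = tighter clusters)
km$betweenss     # between-cluster sum of squares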

K-Means Algorithm: Example Output

Describe v. Predict

Predict = Decide

Contingency table

Classification or clustering on the nyt dataset(s)
"Age","Gender","Impressions","Clicks","Signed_In"
36,0,3,0,1
73,1,3,0,1
30,0,3,0,1
49,1,3,0,1
47,1,11,0,1
47,0,11,1,1
(nyt datasets)
Model e.g.: if Age < 45 and Impressions > 5 then Gender = female (0).
Age ranges? 41-45, 46-50, etc.?
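If you want explicit age ranges, cut() will bin the column; the break points below are illustrative only:

nyt1 <- read.csv("nyt1.csv")
nyt1$AgeGroup <- cut(nyt1$Age,
                     breaks = c(0, 40, 45, 50, 55, Inf),
                     labels = c("<=40", "41-45", "46-50", "51-55", "55+"))
table(nyt1$AgeGroup, nyt1$Gender)   # age range vs. gender counts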

Contingency tables
> table(nyt1$Impressions, nyt1$Gender)
         0     1
  1     69    85
  2    389   395
  3    975   937
  4   1496  1572
  5   1897  2012
  6   1822  1927
  7   1525  1696
  8   1142  1203
  9    722   711
  10   366   400
  11   214   200
  12    86   101
  13    41    43
  14    10     9
  15     5     7
  16     0     4
  17     0     1
A contingency table displays the (multivariate) frequency distribution of the variables. Tests for significance (not now).
> table(nyt1$Clicks, nyt1$Gender)
        0     1
  1 10335 10846
  2   415   440
  3     9    17

Classification Exercises (group1/lab2_knn1.R)
> library(class)  # provides knn()
> nyt1 <- read.csv("nyt1.csv")
> nyt1 <- nyt1[which(nyt1$Impressions>0 & nyt1$Clicks>0 & nyt1$Age>0),]
> nnyt1 <- dim(nyt1)[1]  # shrink it down!
> sampling.rate <- 0.9
> num.test.set.labels <- nnyt1*(1.-sampling.rate)
> training <- sample(1:nnyt1, sampling.rate*nnyt1, replace=FALSE)
> train <- subset(nyt1[training,], select=c(Age,Impressions))
> testing <- setdiff(1:nnyt1, training)
> test <- subset(nyt1[testing,], select=c(Age,Impressions))
> cg <- nyt1$Gender[training]
> true.labels <- nyt1$Gender[testing]
> classif <- knn(train, test, cg, k=5)
> classif
> attributes(.Last.value)  # interpretation to come!

K Nearest Neighbors (classification)
> nyt1 <- read.csv("nyt1.csv")
… from week 3 lab slides or scripts
> classif <- knn(train, test, cg, k=5)
> head(true.labels)
[1] 1 0 0 1 1 0
> head(classif)
[1] 1 1 1 1 0 0
Levels: 0 1
> ncorrect <- true.labels==classif
> table(ncorrect)["TRUE"]  # or
> length(which(ncorrect))
What do you conclude?
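As a small follow-up, continuing the variables from the snippet above, the correct count converts to an accuracy rate:

> accuracy <- length(which(ncorrect)) / length(true.labels)
> accuracy  # fraction of test rows where the kNN label matched the truth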

Weighted kNN…
require(kknn)  # weighted k-nearest neighbours
data(iris)
m <- dim(iris)[1]
val <- sample(1:m, size = round(m/3), replace = FALSE, prob = rep(1/m, m))  # hold out 1/3
iris.learn <- iris[-val,]
iris.valid <- iris[val,]
iris.kknn <- kknn(Species~., iris.learn, iris.valid, distance = 1, kernel = "triangular")
summary(iris.kknn)
fit <- fitted(iris.kknn)
table(iris.valid$Species, fit)
pcol <- as.character(as.numeric(iris.valid$Species))
pairs(iris.valid[1:4], pch = pcol, col = c("green3", "red")[(iris.valid$Species != fit)+1])  # red = misclassified

summary
Call:
kknn(formula = Species ~ ., train = iris.learn, test = iris.valid, distance = 1, kernel = "triangular")

Response: "nominal"
          fit prob.setosa prob.versicolor prob.virginica
1  versicolor           0      1.00000000     0.00000000
2  versicolor           0      1.00000000     0.00000000
3  versicolor           0      0.91553003     0.08446997
4      setosa           1      0.00000000     0.00000000
5   virginica           0      0.00000000     1.00000000
6   virginica           0      0.00000000     1.00000000
7      setosa           1      0.00000000     0.00000000
8  versicolor           0      0.66860033     0.33139967
9   virginica           0      0.22534461     0.77465539
10 versicolor           0      0.79921042     0.20078958
11  virginica           0      0.00000000     1.00000000
…

table
             fit
              setosa versicolor virginica
  setosa          15          0         0
  versicolor       0         19         1
  virginica        0          2        13

pcol <- as.character(as.numeric(iris.valid$Species))
pairs(iris.valid[1:4], pch = pcol,
      col = c("green3", "red")[(iris.valid$Species != fit)+1])

Ctrees?
We want a means to make decisions, so how about an "if this, then this, otherwise that" approach == tree methods, or branching.
Conditional Inference – what is that? Instead of hand-coding a chain of conditions, if (This1 .and. This2 .and. This3 .and. …), a conditional inference tree chooses each split by a statistical significance test on the covariates.

Decision tree classifier

Conditional Inference Tree
> require(party)  # don't get me started!
> str(iris)
'data.frame': 150 obs. of 5 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
> iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris)

Ctree
> print(iris_ctree)

Conditional inference tree with 4 terminal nodes

Response: Species
Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
Number of observations: 150

1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264
  2)* weights = 50
1) Petal.Length > 1.9
  3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894
    4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865
      5)* weights = 46
    4) Petal.Length > 4.8
      6)* weights = 8
  3) Petal.Width > 1.7
    7)* weights = 46

plot(iris_ctree)
> plot(iris_ctree, type="simple")  # try this

Beyond plot: pairs
pairs(iris[1:4], main = "Anderson's Iris Data -- 3 species",
      pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])

But the means for branching…
…do not have to be threshold-based (~ distance).
They can be cluster-based: I am more similar to you if I possess these attributes (in this range).
Thus, trees + clusters = hierarchical clustering.
In R: hclust (and others) in the stats package.

Try hclust for iris
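A minimal sketch (dist() needs numeric input, so drop the Species column first):

d.iris <- dist(as.matrix(iris[, -5]))   # pairwise distances on the numeric columns
hc.iris <- hclust(d.iris)
plot(hc.iris)                           # dendrogram
table(iris$Species, cutree(hc.iris, k = 3))  # cut into 3 groups, compare to species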

require(gpairs)  # from the gpairs package
gpairs(iris)

Better scatterplots
install.packages("car")
require(car)
scatterplotMatrix(iris)

require(lattice)
splom(iris)  # default

splom extra!
require(lattice)
super.sym <- trellis.par.get("superpose.symbol")
splom(~iris[1:4], groups = Species, data = iris,
      panel = panel.superpose,
      key = list(title = "Three Varieties of Iris",
                 columns = 3,
                 points = list(pch = super.sym$pch[1:3], col = super.sym$col[1:3]),
                 text = list(c("Setosa", "Versicolor", "Virginica"))))
splom(~iris[1:3]|Species, data = iris, layout=c(2,2),
      pscales = 0,
      varnames = c("Sepal\nLength", "Sepal\nWidth", "Petal\nLength"),
      page = function(...) {
        ltext(x = seq(.6, .8, length.out = 4),
              y = seq(.9, .6, length.out = 4),
              labels = c("Three", "Varieties", "of", "Iris"),
              cex = 2)
      })
parallelplot(~iris[1:4] | Species, iris)
parallelplot(~iris[1:4], iris, groups = Species,
             horizontal.axis = FALSE, scales = list(x = list(rot = 90)))

Shift the dataset…

Hierarchical clustering
> d <- dist(as.matrix(mtcars))
> hc <- hclust(d)
> plot(hc)

data(swiss) - pairs
pairs(~ Fertility + Education + Catholic, data = swiss,
      subset = Education < 20, main = "Swiss data, Education < 20")

ctree
require(party)
swiss_ctree <- ctree(Fertility ~ Agriculture + Education + Catholic, data = swiss)
plot(swiss_ctree)

Hierarchical clustering
> dswiss <- dist(as.matrix(swiss))
> hs <- hclust(dswiss)
> plot(hs)

scatterplotMatrix

require(lattice); splom(swiss)

Start collecting your favorite plotting routines. Get familiar with annotating plots.

Assignment 3
Preliminary and Statistical Analysis. Due February 24. 15% (written).
Distribution analysis and comparison, visual 'analysis', statistical model fitting and testing of some of the nyt2…31 datasets.
See LMS … for Assignment and details.