Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 6b, February 28, 2014 Weighted kNN, clustering, more plotting, Bayes.


Plot tools/tips
r/ pairs, gpairs, scatterplot.matrix, clustergram, etc.
data() # precip, presidents, iris, swiss, sunspot.month (!), environmental, ethanol, ionosphere
More script fragments in Lab6b_*_2014.R on the web site (escience.rpi.edu/data/DA)

Weighted KNN?
require(kknn)
data(iris)
m <- dim(iris)[1]
# hold out a random third of the rows for validation
val <- sample(1:m, size = round(m/3), replace = FALSE, prob = rep(1/m, m))
iris.learn <- iris[-val,]
iris.valid <- iris[val,]
iris.kknn <- kknn(Species~., iris.learn, iris.valid, distance = 1, kernel = "triangular")
summary(iris.kknn)
fit <- fitted(iris.kknn)
table(iris.valid$Species, fit)
pcol <- as.character(as.numeric(iris.valid$Species))
# misclassified validation points are drawn in red
pairs(iris.valid[1:4], pch = pcol, col = c("green3", "red")[(iris.valid$Species != fit)+1])
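To make the "weighted" part concrete, here is a minimal base-R sketch of a single weighted vote. It is not the kknn implementation: the helper name weighted_knn_predict is hypothetical, the inputs are not standardized, and the scaling by the (k+1)-th neighbor distance is an assumption that only loosely follows the idea behind the triangular kernel.

```r
# Hypothetical helper (not part of kknn): weighted kNN vote for one query point.
weighted_knn_predict <- function(train_x, train_y, query, k = 7) {
  train_x <- as.matrix(train_x)
  query <- as.numeric(unlist(query))
  # Manhattan distance from the query to every training row (like distance = 1)
  d <- rowSums(abs(sweep(train_x, 2, query)))
  ord <- order(d)[seq_len(k + 1)]
  # Assumption: scale by the (k+1)-th neighbor distance so weights lie in [0, 1]
  w <- 1 - d[ord[seq_len(k)]] / d[ord[k + 1]]   # triangular kernel: closer = heavier
  # Sum the weights per class and return the class with the largest total
  votes <- tapply(w, train_y[ord[seq_len(k)]], sum)
  names(which.max(votes))
}

set.seed(250)
val <- sample(nrow(iris), 50)                    # validation rows
pred <- sapply(val, function(i)
  weighted_knn_predict(iris[-val, 1:4], iris$Species[-val], iris[i, 1:4]))
table(actual = iris$Species[val], predicted = pred)
```

Each of the k nearest neighbors contributes its kernel weight rather than a single vote, so a close neighbor can outweigh several distant ones.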

Try Lab6b_8_2014.R

New dataset - ionosphere
require(kknn)
data(ionosphere)
ionosphere.learn <- ionosphere[1:200,]
ionosphere.valid <- ionosphere[-c(1:200),]
fit.kknn <- kknn(class ~., ionosphere.learn, ionosphere.valid)
table(ionosphere.valid$class, fit.kknn$fit)
# vary kernel
(fit.train1 <- train.kknn(class ~., ionosphere.learn, kmax = 15, kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 1))
table(predict(fit.train1, ionosphere.valid), ionosphere.valid$class)
# alter distance
(fit.train2 <- train.kknn(class ~., ionosphere.learn, kmax = 15, kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 2))
table(predict(fit.train2, ionosphere.valid), ionosphere.valid$class)
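train.kknn picks k (and the kernel) by leave-one-out cross-validation. The sketch below shows the same selection idea in a simpler setting. As a stand-in, it uses unweighted kNN from the class package (knn.cv) on the iris data rather than kknn on ionosphere, so the numbers are not comparable with the slide.

```r
library(class)   # recommended package that ships with standard R installs

set.seed(250)
x <- scale(iris[, 1:4])          # standardize so no variable dominates the distance
ks <- 1:15                       # same upper limit as kmax = 15 above

# Leave-one-out misclassification rate for each candidate k
loo_error <- sapply(ks, function(k)
  mean(knn.cv(x, iris$Species, k = k) != iris$Species))
names(loo_error) <- ks

best_k <- ks[which.min(loo_error)]
print(round(loo_error, 3))
print(best_k)
```

knn.cv breaks ties at random, so the chosen k can differ slightly between runs unless the seed is fixed.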

Cluster plotting
source(" content/uploads/2012/01/source_https.r.txt") # source code from github
require(RCurl)
require(colorspace)
source_https(" snippets/master/clustergram.r")
data(iris)
set.seed(250)
par(cex.lab = 1.5, cex.main = 1.2)
Data <- scale(iris[,-5]) # scaling
clustergram(Data, k.range = 2:8, line.width = 0.004) # line.width - adjust according to Y-scale

Clustergram

Any good?
set.seed(500)
Data2 <- scale(iris[,-5])
par(cex.lab = 1.2, cex.main =.7)
par(mfrow = c(3,2))
for(i in 1:6) clustergram(Data2, k.range = 2:8, line.width =.004, add.center.points = T)

How can you tell it is good?
set.seed(250)
Data <- rbind(cbind(rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3)),
              cbind(rnorm(100,1, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,1, sd = 0.3)),
              cbind(rnorm(100,2, sd = 0.3),rnorm(100,2, sd = 0.3),rnorm(100,2, sd = 0.3)))
clustergram(Data, k.range = 2:5, line.width =.004, add.center.points = T)

More complex…
set.seed(250)
Data <- rbind(cbind(rnorm(100,1, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3)),
              cbind(rnorm(100,0, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3)),
              cbind(rnorm(100,0, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,0, sd = 0.3)),
              cbind(rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,1, sd = 0.3)))
clustergram(Data, k.range = 2:8, line.width =.004, add.center.points = T)

Look at the location of the cluster points on the Y axis. See when they remain stable, when they start flying around, and what happens to them at higher numbers of clusters (do they re-group together?).
Observe the strands of the datapoints. Even if the cluster centers are not ordered, the lines for each item might (needs more research and thinking) tend to move together – hinting at the real number of clusters.
Run the plot multiple times to observe the stability of the cluster formation (and location).
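A numeric check that can complement this visual reading is to track the total within-cluster sum of squares as k grows. The sketch below is an addition and not part of the clustergram code. It rebuilds a three-blob dataset like the simulated one above (using matrix() instead of cbind(), so the random draws differ) and looks for the k after which adding clusters stops helping much.

```r
set.seed(250)
# Three Gaussian blobs in 3 dimensions, centered at 0, 1 and 2
Data <- rbind(matrix(rnorm(300, 0, sd = 0.3), ncol = 3),
              matrix(rnorm(300, 1, sd = 0.3), ncol = 3),
              matrix(rnorm(300, 2, sd = 0.3), ncol = 3))

k.range <- 2:5
# Total within-cluster sum of squares for each k (best of 10 random starts)
wss <- sapply(k.range, function(k)
  kmeans(Data, centers = k, nstart = 10)$tot.withinss)
names(wss) <- k.range

# Improvement gained by each extra cluster; the largest drop marks the "elbow"
drops <- -diff(wss)
print(round(wss, 1))
print(round(drops, 1))
```

Here the biggest improvement comes from going from 2 to 3 clusters, which matches how the data were generated; on real data the elbow is usually less clear-cut.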


Swiss - pairs
pairs(~ Fertility + Education + Catholic, data = swiss, subset = Education < 20, main = "Swiss data, Education < 20")

ctree
require(party)
swiss_ctree <- ctree(Fertility ~ Agriculture + Education + Catholic, data = swiss)
plot(swiss_ctree)

Hierarchical clustering
> dswiss <- dist(as.matrix(swiss))
> hs <- hclust(dswiss)
> plot(hs)
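The dendrogram can also be turned into explicit cluster labels with cutree. This is a sketch under two assumptions that are not on the slide: the variables are standardized first, and k = 3 is an arbitrary cut chosen only for illustration.

```r
# Assumption: scale the columns so no single variable dominates the distances
d <- dist(scale(swiss))
hs <- hclust(d)                       # default linkage is "complete"

groups <- cutree(hs, k = 3)           # k = 3 is an illustrative choice
print(table(groups))                  # cluster sizes

# Per-cluster means of the original variables, to help interpret the groups
print(aggregate(swiss, by = list(cluster = groups), FUN = mean))
```

Comparing the cluster means with the tree from plot(hs) is one way to judge whether a given cut is meaningful.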

scatterplotMatrix

require(lattice); splom(swiss)

Decision tree (reminder)
> str(iris)
'data.frame': 150 obs. of 5 variables:
 $ Sepal.Length: num
 $ Sepal.Width : num
 $ Petal.Length: num
 $ Petal.Width : num
 $ Species : Factor w/ 3 levels "setosa","versicolor",..:
> str(swiss)
…

Beyond plot: pairs
pairs(iris[1:4], main = "Anderson's Iris Data -- 3 species", pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])
Try Lab6b_2_2014.R - USJudgeRatings

Try hclust for iris

require(gpairs)
gpairs(iris)
Try Lab6b_3_2014.R

Better scatterplots
install.packages("car")
require(car)
scatterplotMatrix(iris)
Try Lab6b_4_2014.R

splom(iris) # default
Try Lab6b_7_2014.R

splom extra!
require(lattice)
super.sym <- trellis.par.get("superpose.symbol")
splom(~iris[1:4], groups = Species, data = iris,
      panel = panel.superpose,
      key = list(title = "Three Varieties of Iris", columns = 3,
                 points = list(pch = super.sym$pch[1:3], col = super.sym$col[1:3]),
                 text = list(c("Setosa", "Versicolor", "Virginica"))))
splom(~iris[1:3]|Species, data = iris, layout=c(2,2), pscales = 0,
      varnames = c("Sepal\nLength", "Sepal\nWidth", "Petal\nLength"),
      page = function(...) {
        ltext(x = seq(.6,.8, length.out = 4), y = seq(.9,.6, length.out = 4),
              labels = c("Three", "Varieties", "of", "Iris"), cex = 2)
      })
parallelplot(~iris[1:4] | Species, iris)
parallelplot(~iris[1:4], iris, groups = Species, horizontal.axis = FALSE, scales = list(x = list(rot = 90)))
Lab6b_7_2014.R


Ctree
> iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris)
> print(iris_ctree)
Conditional inference tree with 4 terminal nodes
Response: Species
Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
Number of observations: 150
1) Petal.Length <= 1.9; criterion = 1, statistic =
  2)* weights = 50
1) Petal.Length > 1.9
  3) Petal.Width <= 1.7; criterion = 1, statistic =
    4) Petal.Length <= 4.8; criterion = 0.999, statistic =
      5)* weights = 46
    4) Petal.Length > 4.8
      6)* weights = 8
  3) Petal.Width > 1.7
    7)* weights = 46

plot(iris_ctree)
Try Lab6b_5_2014.R
> plot(iris_ctree, type="simple") # try this

Try these on mapmeans, etc.

Something simpler – kmeans and…
> mapmeans<- data.frame(as.numeric(mapcoord$NEIGHBORHOOD), adduse$GROSS.SQUARE.FEET, adduse$SALE.PRICE, adduse$'querylist$latitude', adduse$'querylist$longitude')
> mapobjnew<-kmeans(mapmeans,5, iter.max=10, nstart=5, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"))
> fitted(mapobjnew,method=c("centers","classes"))
Others?
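The mapmeans data come from the earlier labs, so the following sketch uses the built-in iris measurements as a stand-in (an assumption, chosen only so the code runs on its own). It shows what the two fitted() methods return for a kmeans object.

```r
set.seed(250)
x <- scale(iris[, 1:4])                      # stand-in for mapmeans
km <- kmeans(x, centers = 3, iter.max = 10, nstart = 5)

# method = "centers": one row per observation, holding its cluster's center
centers_per_row <- fitted(km, method = "centers")
# method = "classes": the cluster label of each observation
classes_per_row <- fitted(km, method = "classes")

print(head(centers_per_row, 3))
print(table(cluster = classes_per_row, species = iris$Species))
```

When the full vector c("centers", "classes") is passed, as on the slide, the first entry ("centers") is used.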

Plotting clusters (DIY)
library(cluster)
clusplot(mapmeans, mapobj$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
# Centroid Plot against 1st 2 discriminant functions
#library(fpc)
plotcluster(mapmeans, mapobj$cluster)
dendrogram?
library(fpc)
cluster.stats

Bayes
> require(e1071) # provides naiveBayes
> cl <- kmeans(iris[,1:4], 3)
> table(cl$cluster, iris[,5])
    setosa versicolor virginica
#
> m <- naiveBayes(iris[,1:4], iris[,5])
> table(predict(m, iris[,1:4]), iris[,5])
             setosa versicolor virginica
  setosa
  versicolor
  virginica
pairs(iris[1:4],main="Iris Data (red=setosa,green=versicolor,blue=virginica)", pch=21, bg=c("red","green3","blue")[unclass(iris$Species)])

Digging into iris
classifier<-naiveBayes(iris[,1:4], iris[,5])
table(predict(classifier, iris[,-5]), iris[,5], dnn=list('predicted','actual'))
classifier$apriori
classifier$tables$Petal.Length
plot(function(x) dnorm(x, 1.462, ), 0, 8, col="red", main="Petal length distribution for the 3 different species")
curve(dnorm(x, 4.260, ), add=TRUE, col="blue")
curve(dnorm(x, 5.552, ), add=TRUE, col = "green")
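The per-class means and standard deviations that the classifier stores for numeric predictors can be recomputed directly from the data. The sketch below is a hand-rolled Gaussian naive Bayes in base R, written only to illustrate the calculation; the helper name gnb_predict is hypothetical and this is not the e1071 code.

```r
x <- iris[, 1:4]
y <- iris$Species

# A-priori class probabilities
prior <- setNames(as.numeric(table(y)) / length(y), levels(y))
# Per-class mean and sd of every feature (rows = classes, columns = features)
mu    <- sapply(x, function(col) tapply(col, y, mean))
sigma <- sapply(x, function(col) tapply(col, y, sd))

# Hypothetical helper: pick the class with the highest log posterior
gnb_predict <- function(newx) {
  newx <- as.matrix(newx)
  scores <- do.call(cbind, lapply(levels(y), function(cl) {
    ll <- log(prior[[cl]])
    # "naive" step: add the log densities of the features as if independent
    for (f in colnames(newx))
      ll <- ll + dnorm(newx[, f], mu[cl, f], sigma[cl, f], log = TRUE)
    ll
  }))
  factor(levels(y)[max.col(scores, ties.method = "first")], levels = levels(y))
}

pred <- gnb_predict(x)
print(round(cbind(mean = mu[, "Petal.Length"], sd = sigma[, "Petal.Length"]), 3))
print(table(predicted = pred, actual = y))
```

The mean column for Petal.Length reproduces the values 1.462, 4.260 and 5.552 used in the dnorm() calls above, and the sd column gives the matching spread for each species.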


Using a contingency table
> data(Titanic)
> mdl <- naiveBayes(Survived ~., data = Titanic)
> mdl
Naive Bayes Classifier for Discrete Predictors
Call: naiveBayes.formula(formula = Survived ~., data = Titanic)
A-priori probabilities:
Survived
 No Yes
Conditional probabilities:
        Class
Survived 1st 2nd 3rd Crew
     No
     Yes
        Sex
Survived Male Female
     No
     Yes
        Age
Survived Child Adult
     No
     Yes
Try Lab6b_9_2014.R
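The quantities in that printout can be recomputed from the contingency table itself with base R. The sketch below does not call naiveBayes; the helper name cond_prob is hypothetical. It shows where the a-priori and conditional probabilities come from.

```r
data(Titanic)   # 4-way table with dimensions Class, Sex, Age, Survived

# A-priori probabilities: share of all people in each Survived level
prior <- prop.table(margin.table(Titanic, 4))

# Hypothetical helper: P(predictor level | Survived), so each row sums to 1
cond_prob <- function(var) {
  counts <- margin.table(Titanic, c(4, var))   # Survived x predictor counts
  prop.table(counts, 1)
}
cond_class <- cond_prob(1)
cond_sex   <- cond_prob(2)
cond_age   <- cond_prob(3)

print(round(prior, 4))
print(round(cond_class, 4))
print(round(cond_sex, 4))
print(round(cond_age, 4))
```

A naive Bayes prediction for a given passenger multiplies the prior by one entry from each conditional table and picks the larger product.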

ench/html/HouseVotes84.html
require(mlbench)
require(e1071) # provides naiveBayes
data(HouseVotes84)
model <- naiveBayes(Class ~., data = HouseVotes84)
predict(model, HouseVotes84[1:10,-1])
predict(model, HouseVotes84[1:10,-1], type = "raw")
pred <- predict(model, HouseVotes84[,-1])
table(pred, HouseVotes84$Class)

Exercise for you
> data(HairEyeColor)
> mosaicplot(HairEyeColor)
> margin.table(HairEyeColor,3)
Sex
  Male Female
> margin.table(HairEyeColor,c(1,3))
       Sex
Hair    Male Female
  Black
  Brown
  Red
  Blond
How would you construct a naïve Bayes classifier and test it?

Assignment 5
Project proposals… Let’s look at it
Assignment 4 - how is it going – assume you all start after today?

Assignment 6 preview
Your term projects should fall within the scope of a data analytics problem of the type you have worked with in class/labs, or know of yourself – the bigger the data the better. This means that the work must go beyond just making lots of figures. You should develop the project to show that you are thinking about and exploring the relationships and distributions within your data. Start with a hypothesis, think of a way to model and use the hypothesis, find or collect the necessary data, and do preliminary analysis, detailed modeling, and a summary (interpretation).
– Note: You do not have to come up with a positive result, i.e. disproving the hypothesis is just as good.
Please use the section numbering below for your written submission for this assignment.
1. Introduction (2%)
2. Data Description (3%)
3. Analysis (8%)
4. Model Development (8%)
5. Conclusions and Discussion (4%)
6. Oral presentation (5%) (10 mins)

Assignments to come
Term project (6). Due ~ week 13/14 – early May. 30% (25% written, 5% oral; individual). Available after spring break.
Assignment 7: Predictive and Prescriptive Analytics. Due ~ week % (15% written; individual);

Admin info (keep/print this slide)
Class: ITWS-4963/ITWS-6965
Hours: 12:00pm-1:50pm Tuesday/Friday
Location: SAGE 3101
Instructor: Peter Fox
Instructor contact: (do not leave a
Contact hours: Monday** 3:00-4:00pm (or by appt)
Contact location: Winslow 2120 (sometimes Lally 207A announced by )
TA: Lakshmi Chenicheri
Web site:
– Schedule, lectures, syllabus, reading, assignments, etc.