1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 7a, March 3, 2014, SAGE 3101 Interpreting weighted kNN, forms of clustering, decision trees and Bayesian.

Slides:

Advertisements

Similar presentations

The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 4: An Introduction To Data Mining (II) Johannes Gehrke

Advertisements

Random Forest Predrag Radenković 3237/10

Integrated Instance- and Class- based Generative Modeling for Text Classification Antti PuurulaUniversity of Waikato Sung-Hyon MyaengKAIST 5/12/2013 Australasian.

Notes Sample vs distribution “m” vs “µ” and “s” vs “σ” Bias/Variance Bias: Measures how much the learnt model is wrong disregarding noise Variance: Measures.

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 7a, March 10, 2015 Labs: more data, models, prediction, deciding with trees.

Data Mining Classification: Alternative Techniques

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 6a, February 25, 2014, SAGE 3101 kNN, K-Means, Clustering and Bayesian Inference.

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics.

Indian Statistical Institute Kolkata

CMPUT 466/551 Principal Source: CMU

Machine Learning in Practice Lecture 3 Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute.

Regression. So far, we've been looking at classification problems, in which the y values are either 0 or 1. Now we'll briefly consider the case where.

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 5a, February 24, 2015 Weighted kNN, ~ clustering, trees and Bayesian classification.

MACHINE LEARNING 9. Nonparametric Methods. Introduction Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1) 2 

Sparse vs. Ensemble Approaches to Supervised Learning

Ensemble Learning: An Introduction

Review Rong Jin. Comparison of Different Classification Models  The goal of all classifiers Predicating class label y for an input x Estimate p(y|x)

Ensemble Learning (2), Tree and Forest

1 Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data Presented by: Tun-Hsiang Yang.

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 3b, February 7, 2014 Lab exercises: datasets and data infrastructure.

Methods in Medical Image Analysis Statistics of Pattern Recognition: Classification and Clustering Some content provided by Milos Hauskrecht, University.

Tree-Based Methods (V&R 9.1) Demeke Kasaw, Andreas Nguyen, Mariana Alvaro STAT 6601 Project.

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 7b, March 13, 2015 Interpreting weighted kNN, decision trees, cross-validation, dimension reduction.

Machine Learning1 Machine Learning: Summary Greg Grudic CSCI-4830.

ENSEMBLE LEARNING David Kauchak CS451 – Fall 2013.

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4b, February 14, 2014 Lab exercises: regression, kNN and K-means.

LOGO Ensemble Learning Lecturer: Dr. Bo Yuan

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 10a, April 1, 2014 Support Vector Machines.

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4b, February 20, 2015 Lab: regression, kNN and K- means results, interpreting and evaluating models.

Overview of Supervised Learning Overview of Supervised Learning2 Outline Linear Regression and Nearest Neighbors method Statistical Decision.

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 1b, January 24, 2014 Relevant software and getting it installed.

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 8b, March 21, 2014 Using the models, prediction, deciding.

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 10b, April 4, 2014 Lab: More on Support Vector Machines, Trees, and your projects.

Combining multiple learners Usman Roshan. Bagging Randomly sample training data Determine classifier C i on sampled data Goto step 1 and repeat m times.

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 6b, February 28, 2014 Weighted kNN, clustering, more plottong, Bayes.

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 11a, April 7, 2014 Support Vector Machines, Decision Trees, Cross- validation.

WEKA Machine Learning Toolbox. You can install Weka on your computer from

Weka Just do it Free and Open Source ML Suite Ian Witten & Eibe Frank University of Waikato New Zealand.

Chapter1: Introduction Chapter2: Overview of Supervised Learning

An Exercise in Machine Learning

Random Forests Ujjwol Subedi. Introduction What is Random Tree? ◦ Is a tree constructed randomly from a set of possible trees having K random features.

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 5b, February 21, 2014 Interpreting regression, kNN and K-means results, evaluating models.

KNN Classifier.  Handed an instance you wish to classify  Look around the nearby region to see what other classes are around  Whichever is most common—make.

Classification and Prediction: Ensemble Methods Bamshad Mobasher DePaul University Bamshad Mobasher DePaul University.

Competition II: Springleaf Sha Li (Team leader) Xiaoyan Chong, Minglu Ma, Yue Wang CAMCOS Fall 2015 San Jose State University.

Machine Learning in Practice Lecture 2 Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute.

1 Statistics & R, TiP, 2011/12 Multivariate Methods  Multivariate data  Data display  Principal component analysis Unsupervised learning technique 

Decision Tree Lab. Load in iris data: Display iris data as a sanity.

SUPERVISED AND UNSUPERVISED LEARNING Presentation by Ege Saygıner CENG 784.

Tree and Forest Classification and Regression Tree Bagging of trees Boosting trees Random Forest.

1 Peter Fox Data Analytics – ITWS-4600/ITWS-6600 Week 7a, March 8, 2016 Decision trees, cross-validation.

1 Peter Fox Data Analytics – ITWS-4600/ITWS-6600 Week 5a, February 23, 2016 Weighted kNN, clustering, “early” trees and Bayesian.

By Subhasis Dasgupta Asst Professor Praxis Business School, Kolkata Classification Modeling Decision Tree (Part 2)

1 Peter Fox Data Analytics – ITWS-4600/ITWS-6600 Week 6b, March 4, 2016 Interpretation: Regression, Clustering (plotting), Clustergrams, Trees and Hierarchies…

Peter Fox and Greg Hughes Data Analytics – ITWS-4600/ITWS-6600

Using the models, prediction, deciding

More Bayes, Decision trees, and cross-validation

Clustering CSC 600: Data Mining Class 21.

Bagging and Random Forests

Table 1. Advantages and Disadvantages of Traditional DM/ML Methods

Discriminant Analysis

The Elements of Statistical Learning

Basic machine learning background with Python scikit-learn

Peter Fox and Greg Hughes Data Analytics – ITWS-4600/ITWS-6600

Weighted kNN, clustering, “early” trees and Bayesian

Weka Free and Open Source ML Suite Ian Witten & Eibe Frank

DataMining, Morgan Kaufmann, p Mining Lab. 김완섭 2004년 10월 27일

Classification and clustering - interpreting and exploring data

Peter Fox Data Analytics ITWS-4600/ITWS-6600/MATP-4450/CSCI-4960

Data Mining CSCI 307, Spring 2019 Lecture 6

Presentation transcript:

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 7a, March 3, 2014, SAGE 3101 Interpreting weighted kNN, forms of clustering, decision trees and Bayesian inference

Contents 2

Weighted KNN require(kknn) data(iris) m <- dim(iris)[1] val <- sample(1:m, size = round(m/3), replace = FALSE, prob = rep(1/m, m)) iris.learn <- iris[-val,] # train iris.valid <- iris[val,]# test iris.kknn <- kknn(Species~., iris.learn, iris.valid, distance = 1, kernel = "triangular") # Possible choices are "rectangular" (which is standard unweighted knn), "triangular", "epanechnikov" (or beta(2,2)), "biweight" (or beta(3,3)), "triweight" (or beta(4,4)), "cos", "inv", "gaussian", "rank" and "optimal". 3

names(iris.kknn) fitted.valuesVector of predictions. CLMatrix of classes of the k nearest neighbors. WMatrix of weights of the k nearest neighbors. DMatrix of distances of the k nearest neighbors. CMatrix of indices of the k nearest neighbors. probMatrix of predicted class probabilities. responseType of response variable, one of continuous, nominal or ordinal. distanceParameter of Minkowski distance. callThe matched call. termsThe 'terms' object used. 4

Look at the output > head(iris.kknn$W) [,1] [,2] [,3] [,4] [,5] [,6] [,7] [1,] [2,] [3,] [4,] [5,] [6,]

Look at the output > head(iris.kknn$D) [,1] [,2] [,3] [,4] [,5] [,6] [,7] [1,] [2,] [3,] [4,] [5,] [6,]

Look at the output > head(iris.kknn$C) [,1] [,2] [,3] [,4] [,5] [,6] [,7] [1,] [2,] [3,] [4,] [5,] [6,] > head(iris.kknn$prob) setosa versicolor virginica [1,] [2,] [3,] [4,] [5,] [6,]

Look at the output > head(iris.kknn$fitted.values) [1] virginica setosa versicolor setosa virginica virginica Levels: setosa versicolor virginica 8

Contingency tables fitiris <- fitted(iris.kknn) table(iris.valid$Species, fitiris) fitiris setosa versicolor virginica setosa versicolor virginica # rectangular – no weight fitiris2 setosa versicolor virginica setosa versicolor virginica

The plot pcol <- as.character(as.numeric(iris.valid$Species)) pairs(iris.valid[1:4], pch = pcol, col = c("green3", "red”)[(iris.valid$Species != fit)+1]) 10

New dataset - ionosphere require(kknn) data(ionosphere) ionosphere.learn <- ionosphere[1:200,] ionosphere.valid <- ionosphere[-c(1:200),] fit.kknn <- kknn(class ~., ionosphere.learn, ionosphere.valid) table(ionosphere.valid$class, fit.kknn$fit) b g b 19 8 g

Vary the parameters - ionosphere > (fit.train1 <- train.kknn(class ~., ionosphere.learn, kmax = 15, kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 1)) Call: train.kknn(formula = class ~., data = ionosphere.learn, kmax = 15, distance = 1, kernel = c("triangular", "rectangular", "epanechnikov", "optimal")) Type of response variable: nominal Minimal misclassification: 0.12 Best kernel: rectangular Best k: 2 > table(predict(fit.train1, ionosphere.valid), ionosphere.valid$class) b g b 25 4 g b g b 19 8 g 2 122

Alter distance - ionosphere > (fit.train2 <- train.kknn(class ~., ionosphere.learn, kmax = 15, kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 2)) Type of response variable: nominal Minimal misclassification: 0.12 Best kernel: rectangular Best k: 2 > table(predict(fit.train2, ionosphere.valid), ionosphere.valid$class) b g b 20 5 g #1 b g b 25 4 g #0 b g b 19 8 g 2 122

(Weighted) kNN Advantages –Robust to noisy training data (especially if we use inverse square of weighted distance as the “distance”) –Effective if the training data is large Disadvantages –Need to determine value of parameter K (number of nearest neighbors) –Distance based learning is not clear which type of distance to use and which attribute to use to produce the best results. Shall we use all attributes or certain attributes only? 14

Additional factors Dimensionality – with too many dimensions the closest neighbors are too far away to be considered close Overfitting – does closeness mean right classification (e.g. noise or incorrect data, like wrong street address -> wrong lat/lon) – beware of k=1! Correlated features – double weighting Relative importance – including/ excluding features 15

More factors Sparseness – the standard distance measure (Jaccard) loses meaning due to no overlap Errors – unintentional and intentional Computational complexity Sensitivity to distance metrics – especially due to different scales (recall ages, versus impressions, versus clicks and especially binary values: gender, logged in/not) Does not account for changes over time Model updating as new data comes in 16

Lots of clustering options _Cluster_analysishttp://wiki.math.yorku.ca/index.php/R: _Cluster_analysis Clustergram - This graph is useful in exploratory analysis for non- hierarchical clustering algorithms like k-means and for hierarchical cluster algorithms when the number of observations is large enough to make dendrograms impractical. (remember our attempt at a dendogram for mapmeans?) 17

Cluster plotting source(" content/uploads/2012/01/source_https.r.txt") # source code from github require(RCurl) require(colorspace) source_https(" snippets/master/clustergram.r") data(iris) set.seed(250) par(cex.lab = 1.5, cex.main = 1.2) Data <- scale(iris[,-5]) # scaling 18

> head(iris) Sepal.Length Sepal.Width Petal.Length Petal.Width Species setosa setosa setosa setosa setosa setosa > head(Data) Sepal.Length Sepal.Width Petal.Length Petal.Width [1,] [2,] [3,] [4,] [5,] [6,]

20 Look at the location of the cluster points on the Y axis. See when they remain stable, when they start flying around, and what happens to them in higher number of clusters (do they re- group together) Observe the strands of the datapoints. Even if the clusters centers are not ordered, the lines for each item might (needs more research and thinking) tend to move together – hinting at the real number of clusters Run the plot multiple times to observe the stability of the cluster formation (and location)

clustergram(Data, k.range = 2:8, line.width = 0.004) # line.width - adjust according to Y-scale 21

Any good? set.seed(500) Data2 <- scale(iris[,-5]) par(cex.lab = 1.2, cex.main =.7) par(mfrow = c(3,2)) for(i in 1:6) clustergram(Data2, k.range = 2:8, line.width =.004, add.center.points = T) # why does this produce different plots? # what defaults are used (kmeans) # PCA?? Remember your linear algebra 22

23

How can you tell it is good? set.seed(250) Data <- rbind(cbind(rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3)), cbind(rnorm(100,1, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,1, sd = 0.3)), cbind(rnorm(100,2, sd = 0.3),rnorm(100,2, sd = 0.3),rnorm(100,2, sd = 0.3))) clustergram(Data, k.range = 2:5, line.width =.004, add.center.points = T) 24

More complex… set.seed(250) Data <- rbind(cbind(rnorm(100,1, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3)), cbind(rnorm(100,0, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3)), cbind(rnorm(100,0, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,0, sd = 0.3)), cbind(rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,1, sd = 0.3))) clustergram(Data, k.range = 2:8, line.width =.004, add.center.points = T) 25

Exercise - swiss par(mfrow = c(2,3)) swiss.x <- scale(as.matrix(swiss[, -1])) set.seed(1); for(i in 1:6) clustergram(swiss.x, k.range = 2:6, line.width = 0.01) 26

27 clusplot

Hierarchical clustering 28 > dswiss <- dist(as.matrix(swiss)) > hs <- hclust(dswiss) > plot(hs)

ctree 29 require(party) swiss_ctree <- ctree(Fertility ~ Agriculture + Education + Catholic, data = swiss) plot(swiss_ctree)

30

pairs(iris[1:4], main = "Anderson's Iris Data -- 3 species”, pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)]) 31

splom extra! require(lattice) super.sym <- trellis.par.get("superpose.symbol") splom(~iris[1:4], groups = Species, data = iris, panel = panel.superpose, key = list(title = "Three Varieties of Iris", columns = 3, points = list(pch = super.sym$pch[1:3], col = super.sym$col[1:3]), text = list(c("Setosa", "Versicolor", "Virginica")))) splom(~iris[1:3]|Species, data = iris, layout=c(2,2), pscales = 0, varnames = c("Sepal\nLength", "Sepal\nWidth", "Petal\nLength"), page = function(...) { ltext(x = seq(.6,.8, length.out = 4), y = seq(.9,.6, length.out = 4), labels = c("Three", "Varieties", "of", "Iris"), cex = 2) }) 32

33 parallelplot(~iris[1:4] | Species, iris)

34 parallelplot(~iris[1:4], iris, groups = Species, horizontal.axis = FALSE, scales = list(x = list(rot = 90)))

hclust for iris 35

plot(iris_ctree) 36

Ctree > iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris) > print(iris_ctree) Conditional inference tree with 4 terminal nodes Response: Species Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width Number of observations: 150 1) Petal.Length <= 1.9; criterion = 1, statistic = )* weights = 50 1) Petal.Length > 1.9 3) Petal.Width <= 1.7; criterion = 1, statistic = ) Petal.Length <= 4.8; criterion = 0.999, statistic = )* weights = 46 4) Petal.Length > 4.8 6)* weights = 8 3) Petal.Width > 1.7 7)* weights = 46 37

> plot(iris_ctree, type="simple”) 38

New dataset to work with trees fitK <- rpart(Kyphosis ~ Age + Number + Start, method="class", data=kyphosis) printcp(fitK) # display the results plotcp(fitK) # visualize cross-validation results summary(fitK) # detailed summary of splits # plot tree plot(fitK, uniform=TRUE, main="Classification Tree for Kyphosis") text(fitK, use.n=TRUE, all=TRUE, cex=.8) # create attractive postscript plot of tree post(fitK, file = “kyphosistree.ps", title = "Classification Tree for Kyphosis") # might need to convert to PDF (distill) 39

40

41 > pfitK<- prune(fitK, cp= fitK$cptable[which.min(fitK$cptable[,"xerror"]),"CP"]) > plot(pfitK, uniform=TRUE, main="Pruned Classification Tree for Kyphosis") > text(pfitK, use.n=TRUE, all=TRUE, cex=.8) > post(pfitK, file = “ptree.ps", title = "Pruned Classification Tree for Kyphosis”)

42 > fitK <- ctree(Kyphosis ~ Age + Number + Start, data=kyphosis) > plot(fitK, main="Conditional Inference Tree for Kyphosis”)

43 > plot(fitK, main="Conditional Inference Tree for Kyphosis",type="simple")

randomForest > require(randomForest) > fitKF <- randomForest(Kyphosis ~ Age + Number + Start, data=kyphosis) > print(fitKF) # view results Call: randomForest(formula = Kyphosis ~ Age + Number + Start, data = kyphosis) Type of random forest: classification Number of trees: 500 No. of variables tried at each split: 1 OOB estimate of error rate: 20.99% Confusion matrix: absent present class.error absent present > importance(fitKF) # importance of each predictor MeanDecreaseGini Age Number Start Random forests improve predictive accuracy by generating a large number of bootstrapped trees (based on random samples of variables), classifying a case using each tree in this new "forest", and deciding a final predicted outcome by combining the results across all of the trees (an average in regression, a majority vote in classification).

More on another dataset. # Regression Tree Example library(rpart) # build the tree fitM <- rpart(Mileage~Price + Country + Reliability + Type, method="anova", data=cu.summary) printcp(fitM) # display the results …. Root node error: /60 = n=60 (57 observations deleted due to missingness) CP nsplit rel error xerror xstd

Mileage… plotcp(fitM) # visualize cross-validation results summary(fitM) # detailed summary of splits 46

47 par(mfrow=c(1,2)) rsq.rpart(fitM) # visualize cross-validation results

# plot tree plot(fitM, uniform=TRUE, main="Regression Tree for Mileage ") text(fitM, use.n=TRUE, all=TRUE, cex=.8) # prune the tree pfitM<- prune(fitM, cp= ) # from cptable # plot the pruned tree plot(pfitM, uniform=TRUE, main="Pruned Regression Tree for Mileage") text(pfitM, use.n=TRUE, all=TRUE, cex=.8) post(pfitM, file = ”ptree2.ps", title = "Pruned Regression Tree for Mileage”) 48

49

# Conditional Inference Tree for Mileage fit2M <- ctree(Mileage~Price + Country + Reliability + Type, data=na.omit(cu.summary)) 50

Enough of trees! 51

Bayes > cl <- kmeans(iris[,1:4], 3) > table(cl$cluster, iris[,5]) setosa versicolor virginica # > m <- naiveBayes(iris[,1:4], iris[,5]) > table(predict(m, iris[,1:4]), iris[,5]) setosa versicolor virginica setosa versicolor virginica pairs(iris[1:4],main="Iris Data (red=setosa,green=versicolor,blue=virginica)", pch=21, bg=c("red","green3","blue")[u nclass(iris$Species)])

Digging into iris classifier<-naiveBayes(iris[,1:4], iris[,5]) table(predict(classifier, iris[,-5]), iris[,5], dnn=list('predicted','actual')) actual predicted setosa versicolor virginica setosa versicolor virginica

Digging into iris > classifier$apriori iris[, 5] setosa versicolor virginica > classifier$tables$Petal.Length Petal.Length iris[, 5] [,1] [,2] setosa versicolor virginica

Digging into iris plot(function(x) dnorm(x, 1.462, ), 0, 8, col="red", main="Petal length distribution for the 3 different species") curve(dnorm(x, 4.260, ), add=TRUE, col="blue") curve(dnorm(x, 5.552, ), add=TRUE, col = "green") 55

ench/html/HouseVotes84.html > require(mlbench) > data(HouseVotes84) > model <- naiveBayes(Class ~., data = HouseVotes84) > predict(model, HouseVotes84[1:10,-1]) [1] republican republican republican democrat democrat democrat republican republican republican [10] democrat Levels: democrat republican 56

House Votes 1984 > predict(model, HouseVotes84[1:10,-1], type = "raw") democrat republican [1,] e e-01 [2,] e e-01 [3,] e e-01 [4,] e e-03 [5,] e e-02 [6,] e e-01 [7,] e e-01 [8,] e e-01 [9,] e e-01 [10,] e e-11 57

House Votes 1984 > pred <- predict(model, HouseVotes84[,-1]) > table(pred, HouseVotes84$Class) pred democrat republican democrat republican

So now you could complete this: > data(HairEyeColor) > mosaicplot(HairEyeColor) > margin.table(HairEyeColor,3) Sex Male Female > margin.table(HairEyeColor,c(1,3)) Sex Hair Male Female Black Brown Red Blond Construct a naïve Bayes classifier and test. 59

Assignments to come… Term project (A6). Due ~ week % (25% written, 5% oral; individual). Assignment 7: Predictive and Prescriptive Analytics. Due ~ week 9/10. 20% (15% written and 5% oral; individual); 60

Coming weeks I will be out of town Friday March 21 and 28 On March 21 you will have a lab – attendance will be taken – to work on assignments (term (6) and assignment 7). Your project proposals (Assignment 5) are on March 18. On March 28 you will have a lecture on SVM, thus the Tuesday March 25 will be a lab. Back to regular schedule in April (except 18 th ) 61

Admin info (keep/ print this slide) Class: ITWS-4963/ITWS 6965 Hours: 12:00pm-1:50pm Tuesday/ Friday Location: SAGE 3101 Instructor: Peter Fox Instructor contact: (do not leave a Contact hours: Monday** 3:00-4:00pm (or by appt) Contact location: Winslow 2120 (sometimes Lally 207A announced by ) TA: Lakshmi Chenicheri Web site: –Schedule, lectures, syllabus, reading, assignments, etc. 62