Using the models, prediction, deciding

Slides:

Advertisements

Similar presentations

Random Forest Predrag Radenković 3237/10

Advertisements

CART: Classification and Regression Trees Chris Franck LISA Short Course March 26, 2013.

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 7a, March 10, 2015 Labs: more data, models, prediction, deciding with trees.

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 6a, February 25, 2014, SAGE 3101 kNN, K-Means, Clustering and Bayesian Inference.

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 5a, February 24, 2015 Weighted kNN, ~ clustering, trees and Bayesian classification.

Machine Learning in R and its use in the statistical offices

Decision Tree Dr. Jieh-Shan George YEH

Ensemble Learning (2), Tree and Forest

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 3b, February 7, 2014 Lab exercises: datasets and data infrastructure.

Tree-Based Methods (V&R 9.1) Demeke Kasaw, Andreas Nguyen, Mariana Alvaro STAT 6601 Project.

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 7b, March 13, 2015 Interpreting weighted kNN, decision trees, cross-validation, dimension reduction.

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4b, February 14, 2014 Lab exercises: regression, kNN and K-means.

A Brief Introduction to R Programming Darren J. Fitzpatrick, PhD The Bioinformatics Support Team 27/08/2015.

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 10a, April 1, 2014 Support Vector Machines.

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 1b, January 24, 2014 Relevant software and getting it installed.

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 8b, March 21, 2014 Using the models, prediction, deciding.

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 10b, April 4, 2014 Lab: More on Support Vector Machines, Trees, and your projects.

Combining multiple learners Usman Roshan. Bagging Randomly sample training data Determine classifier C i on sampled data Goto step 1 and repeat m times.

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 6b, February 28, 2014 Weighted kNN, clustering, more plottong, Bayes.

IE 585 Competitive Network – Learning Vector Quantization & Counterpropagation.

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 7a, March 3, 2014, SAGE 3101 Interpreting weighted kNN, forms of clustering, decision trees and Bayesian.

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 11a, April 7, 2014 Support Vector Machines, Decision Trees, Cross- validation.

Linear Discriminant Analysis and Its Variations Abu Minhajuddin CSE 8331 Department of Statistical Science Southern Methodist University April 27, 2002.

JBR1 Support Vector Machines Classification Venables & Ripley Section 12.5 CSU Hayward Statistics 6601 Joseph Rickert & Timothy McKusick December 1, 2004.

Weka Just do it Free and Open Source ML Suite Ian Witten & Eibe Frank University of Waikato New Zealand.

Random Forests Ujjwol Subedi. Introduction What is Random Tree? ◦ Is a tree constructed randomly from a set of possible trees having K random features.

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 5b, February 21, 2014 Interpreting regression, kNN and K-means results, evaluating models.

Machine Learning in Practice Lecture 2 Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute.

1 Statistics & R, TiP, 2011/12 Multivariate Methods  Multivariate data  Data display  Principal component analysis Unsupervised learning technique 

Decision Tree Lab. Load in iris data: Display iris data as a sanity.

Tree and Forest Classification and Regression Tree Bagging of trees Boosting trees Random Forest.

1 Peter Fox Data Analytics – ITWS-4600/ITWS-6600 Week 7a, March 8, 2016 Decision trees, cross-validation.

1 Peter Fox Data Analytics – ITWS-4600/ITWS-6600 Week 5a, February 23, 2016 Weighted kNN, clustering, “early” trees and Bayesian.

By Subhasis Dasgupta Asst Professor Praxis Business School, Kolkata Classification Modeling Decision Tree (Part 2)

1 Peter Fox Data Analytics – ITWS-4600/ITWS-6600 Week 6b, March 4, 2016 Interpretation: Regression, Clustering (plotting), Clustergrams, Trees and Hierarchies…

Решение задач Data Mining. R и Hadoop. Классификация Decision Tree  Исходные данные >names(iris) [1] "Sepal.Length" "Sepal.Width" "Petal.Length“ [4]

Peter Fox and Greg Hughes Data Analytics – ITWS-4600/ITWS-6600

Data Analytics – ITWS-4963/ITWS-6965

Statistical Data Analysis STAT221A

More Bayes, Decision trees, and cross-validation

Data Analytics – ITWS-4600/ITWS-6600

Clustering CSC 600: Data Mining Class 21.

Bagging and Random Forests

Erich Smith Coleman Platt

Classification, Clustering and Bayes…

CS 235 Decision Tree Classification

Discriminant Analysis

Principal Component Analysis

Categorizing networks using Machine Learning

Peter Fox and Greg Hughes Data Analytics – ITWS-4600/ITWS-6600

Peter Fox and Greg Hughes Data Analytics – ITWS-4600/ITWS-6600

Weighted kNN, clustering, “early” trees and Bayesian

Weka Free and Open Source ML Suite Ian Witten & Eibe Frank

DataMining, Morgan Kaufmann, p Mining Lab. 김완섭 2004년 10월 27일

Classification and clustering - interpreting and exploring data

Classification, Clustering and Bayes…

Data Analytics – ITWS-4600/ITWS-6600/MATP-4450

Lab weighted kNN, decision trees, random forest (“cross-validation” built in – more labs on it later in the course) Peter Fox and Greg Hughes Data Analytics.

Peter Fox Data Analytics – ITWS-4600/ITWS-6600 Week 10b, April 8, 2016

Model generalization Brief summary of methods

Classification with CART

ITWS-4600/ITWS-6600/MATP-4450/CSCI-4960

Classification, Clustering and Bayes…

Predicting Loan Defaults

Data Analytics – ITWS-4600/ITWS-6600/MATP-4450

Arko Barman COSC 6335 Data Mining

Peter Fox Data Analytics ITWS-4600/ITWS-6600/MATP-4450/CSCI-4960

Advisor: Dr.vahidipour Zahra salimian Shaghayegh jalali Dec 2017

Data Mining CSCI 307, Spring 2019 Lecture 6

Presentation transcript:

Using the models, prediction, deciding Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 7b, March 7, 2014

scatterplotMatrix

Hierarchical clustering > dswiss <- dist(as.matrix(swiss)) > hs <- hclust(dswiss) > plot(hs)

ctree require(party) swiss_ctree <- ctree(Fertility ~ Agriculture + Education + Catholic, data = swiss) plot(swiss_ctree)

pairs(iris[1:4], main = "Anderson's Iris Data -- 3 species”, pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])

splom extra! require(lattice) super.sym <- trellis.par.get("superpose.symbol") splom(~iris[1:4], groups = Species, data = iris, panel = panel.superpose, key = list(title = "Three Varieties of Iris", columns = 3, points = list(pch = super.sym$pch[1:3], col = super.sym$col[1:3]), text = list(c("Setosa", "Versicolor", "Virginica")))) splom(~iris[1:3]|Species, data = iris, layout=c(2,2), pscales = 0, varnames = c("Sepal\nLength", "Sepal\nWidth", "Petal\nLength"), page = function(...) { ltext(x = seq(.6, .8, length.out = 4), y = seq(.9, .6, length.out = 4), labels = c("Three", "Varieties", "of", "Iris"), cex = 2) })

parallelplot(~iris[1:4] | Species, iris)

parallelplot(~iris[1:4], iris, groups = Species, horizontal.axis = FALSE, scales = list(x = list(rot = 90)))

hclust for iris

plot(iris_ctree)

Ctree > iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris) > print(iris_ctree) Conditional inference tree with 4 terminal nodes Response: Species Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width Number of observations: 150 1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264 2)* weights = 50 1) Petal.Length > 1.9 3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894 4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865 5)* weights = 46 4) Petal.Length > 4.8 6)* weights = 8 3) Petal.Width > 1.7 7)* weights = 46

> plot(iris_ctree, type="simple”)

New dataset to work with trees fitK <- rpart(Kyphosis ~ Age + Number + Start, method="class", data=kyphosis) printcp(fitK) # display the results plotcp(fitK) # visualize cross-validation results summary(fitK) # detailed summary of splits # plot tree plot(fitK, uniform=TRUE, main="Classification Tree for Kyphosis") text(fitK, use.n=TRUE, all=TRUE, cex=.8) # create attractive postscript plot of tree post(fitK, file = “kyphosistree.ps", title = "Classification Tree for Kyphosis") # might need to convert to PDF (distill)

> pfitK<- prune(fitK, cp= fitK$cptable[which > pfitK<- prune(fitK, cp= fitK$cptable[which.min(fitK$cptable[,"xerror"]),"CP"]) > plot(pfitK, uniform=TRUE, main="Pruned Classification Tree for Kyphosis") > text(pfitK, use.n=TRUE, all=TRUE, cex=.8) > post(pfitK, file = “ptree.ps", title = "Pruned Classification Tree for Kyphosis”)

> fitK <- ctree(Kyphosis ~ Age + Number + Start, data=kyphosis) > plot(fitK, main="Conditional Inference Tree for Kyphosis”)

> plot(fitK, main="Conditional Inference Tree for Kyphosis",type="simple")

randomForest > require(randomForest) > fitKF <- randomForest(Kyphosis ~ Age + Number + Start, data=kyphosis) > print(fitKF) # view results Call: randomForest(formula = Kyphosis ~ Age + Number + Start, data = kyphosis) Type of random forest: classification Number of trees: 500 No. of variables tried at each split: 1 OOB estimate of error rate: 20.99% Confusion matrix: absent present class.error absent 59 5 0.0781250 present 12 5 0.7058824 > importance(fitKF) # importance of each predictor MeanDecreaseGini Age 8.654112 Number 5.584019 Start 10.168591 Random forests improve predictive accuracy by generating a large number of bootstrapped trees (based on random samples of variables), classifying a case using each tree in this new "forest", and deciding a final predicted outcome by combining the results across all of the trees (an average in regression, a majority vote in classification).

More on another dataset. # Regression Tree Example library(rpart) # build the tree fitM <- rpart(Mileage~Price + Country + Reliability + Type, method="anova", data=cu.summary) printcp(fitM) # display the results …. Root node error: 1354.6/60 = 22.576 n=60 (57 observations deleted due to missingness) CP nsplit rel error xerror xstd 1 0.622885 0 1.00000 1.03165 0.176920 2 0.132061 1 0.37711 0.51693 0.102454 3 0.025441 2 0.24505 0.36063 0.079819 4 0.011604 3 0.21961 0.34878 0.080273 5 0.010000 4 0.20801 0.36392 0.075650

Mileage… plotcp(fitM) # visualize cross-validation results summary(fitM) # detailed summary of splits <we will leave this for Friday to look at>

par(mfrow=c(1,2)) rsq.rpart(fitM) # visualize cross-validation results

# plot tree plot(fitM, uniform=TRUE, main="Regression Tree for Mileage ") text(fitM, use.n=TRUE, all=TRUE, cex=.8) # prune the tree pfitM<- prune(fitM, cp=0.01160389) # from cptable # plot the pruned tree plot(pfitM, uniform=TRUE, main="Pruned Regression Tree for Mileage") text(pfitM, use.n=TRUE, all=TRUE, cex=.8) post(pfitM, file = ”ptree2.ps", title = "Pruned Regression Tree for Mileage”)

# Conditional Inference Tree for Mileage fit2M <- ctree(Mileage~Price + Country + Reliability + Type, data=na.omit(cu.summary))

Enough of trees!

Bayes > cl <- kmeans(iris[,1:4], 3) > table(cl$cluster, iris[,5]) setosa versicolor virginica 2 0 2 36 1 0 48 14 3 50 0 0 # > m <- naiveBayes(iris[,1:4], iris[,5]) > table(predict(m, iris[,1:4]), iris[,5]) setosa 50 0 0 versicolor 0 47 3 virginica 0 3 47 pairs(iris[1:4],main="Iris Data (red=setosa,green=versicolor,blue=virginica)", pch=21, bg=c("red","green3","blue")[unclass(iris$Species)])

Digging into iris classifier<-naiveBayes(iris[,1:4], iris[,5]) table(predict(classifier, iris[,-5]), iris[,5], dnn=list('predicted','actual')) actual predicted setosa versicolor virginica setosa 50 0 0 versicolor 0 47 3 virginica 0 3 47

Digging into iris > classifier$apriori iris[, 5] setosa versicolor virginica 50 50 50 > classifier$tables$Petal.Length Petal.Length iris[, 5] [,1] [,2] setosa 1.462 0.1736640 versicolor 4.260 0.4699110 virginica 5.552 0.5518947

Digging into iris plot(function(x) dnorm(x, 1.462, 0.1736640), 0, 8, col="red", main="Petal length distribution for the 3 different species") curve(dnorm(x, 4.260, 0.4699110), add=TRUE, col="blue") curve(dnorm(x, 5.552, 0.5518947 ), add=TRUE, col = "green")

http://www. ugrad. stat. ubc. ca/R/library/mlbench/html/HouseVotes84 http://www.ugrad.stat.ubc.ca/R/library/mlbench/html/HouseVotes84.html > require(mlbench) > data(HouseVotes84) > model <- naiveBayes(Class ~ ., data = HouseVotes84) > predict(model, HouseVotes84[1:10,-1]) [1] republican republican republican democrat democrat democrat republican republican republican [10] democrat Levels: democrat republican

House Votes 1984 > predict(model, HouseVotes84[1:10,-1], type = "raw") democrat republican [1,] 1.029209e-07 9.999999e-01 [2,] 5.820415e-08 9.999999e-01 [3,] 5.684937e-03 9.943151e-01 [4,] 9.985798e-01 1.420152e-03 [5,] 9.666720e-01 3.332802e-02 [6,] 8.121430e-01 1.878570e-01 [7,] 1.751512e-04 9.998248e-01 [8,] 8.300100e-06 9.999917e-01 [9,] 8.277705e-08 9.999999e-01 [10,] 1.000000e+00 5.029425e-11

House Votes 1984 > pred <- predict(model, HouseVotes84[,-1]) > table(pred, HouseVotes84$Class) pred democrat republican democrat 238 13 republican 29 155

So now you could complete this: > data(HairEyeColor) > mosaicplot(HairEyeColor) > margin.table(HairEyeColor,3) Sex Male Female 279 313 > margin.table(HairEyeColor,c(1,3)) Hair Male Female Black 56 52 Brown 143 143 Red 34 37 Blond 46 81 Construct a naïve Bayes classifier and test.

Assignments to come… Term project (A6). Due ~ week 13. 30% (25% written, 5% oral; individual). Assignment 7: Predictive and Prescriptive Analytics. Due ~ week 9/10. 20% (15% written and 5% oral; individual);

Coming weeks I will be out of town Friday March 21 and 28 On March 21 you will have a lab – attendance will be taken – to work on assignments (term (6) and assignment 7). Your project proposals (Assignment 5) are on March 18. On March 28 you will have a lecture on SVM, thus the Tuesday March 25 will be a lab. Back to regular schedule in April (except 18th)

Admin info (keep/ print this slide) Class: ITWS-4963/ITWS 6965 Hours: 12:00pm-1:50pm Tuesday/ Friday Location: SAGE 3101 Instructor: Peter Fox Instructor contact: pfox@cs.rpi.edu, 518.276.4862 (do not leave a msg) Contact hours: Monday** 3:00-4:00pm (or by email appt) Contact location: Winslow 2120 (sometimes Lally 207A announced by email) TA: Lakshmi Chenicheri chenil@rpi.edu Web site: http://tw.rpi.edu/web/courses/DataAnalytics/2014 Schedule, lectures, syllabus, reading, assignments, etc.