Peter Fox, Data Analytics – ITWS-4963/ITWS-6965, Week 7a, March 10, 2015. Labs: more data, models, prediction, deciding with trees.


Assignment 6 (on the website)
Your term project should fall within the scope of a data analytics problem of the type you have worked with in class/labs, or know of yourself; the bigger the data the better. The work must go beyond just making lots of figures: develop the project to show that you are thinking about and exploring the relationships and distributions within your data. Start with a hypothesis, think of a way to model and use that hypothesis, find or collect the necessary data, and do preliminary analysis, detailed modeling, and a summary (interpretation). Grad students must develop two types of models.
Note: you do not have to come up with a positive result, i.e. disproving the hypothesis is just as good.
Grading: Introduction (2%), Data Description (3%), Analysis (5%), Model Development (12%), Conclusions and Discussion (3%), Oral presentation (5%, ~5 mins).

Titanic – Bayes (from last week)

> library(e1071)   # naiveBayes() lives in e1071
> data(Titanic)
> mdl <- naiveBayes(Survived ~ ., data = Titanic)
> mdl

Naive Bayes Classifier for Discrete Predictors

Call: naiveBayes.formula(formula = Survived ~ ., data = Titanic)

The printout continues with the a-priori probabilities for Survived (No/Yes) and the conditional probability tables for Class (1st/2nd/3rd/Crew), Sex (Male/Female) and Age (Child/Adult); the numeric values are omitted here.

Try Lab6b_9_2014.R
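To actually use this model for prediction, the 4-way Titanic contingency table can be expanded to one row per passenger. A minimal sketch (the rep()-based expansion is one convenient way, not the only one):

> library(e1071)
> tdf <- as.data.frame(Titanic)                # columns Class, Sex, Age, Survived, Freq
> tdf <- tdf[rep(1:nrow(tdf), tdf$Freq), 1:4]  # one row per person
> mdl2 <- naiveBayes(Survived ~ ., data = tdf)
> predict(mdl2, tdf[1:6, ])                    # predicted survival for the first few cases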

Classification with Bayes (last week): Retrieve the abalone.csv dataset; the task is predicting the age of abalone from physical measurements. Perform naiveBayes classification to get predictors for Age (Rings). Interpret. Compare to what you got from kknn (weighted k-nearest neighbors) in class 4b. See the sketch below for one way to start.
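A possible starting point for the exercise (the file name, column names and age-group cut points below are assumptions; adjust them to your copy of abalone.csv):

library(e1071)
abalone <- read.csv("abalone.csv")   # assumed local file
# bin Rings into coarse age classes (assumed cut points)
abalone$AgeGrp <- cut(abalone$Rings, breaks = c(0, 8, 11, 35),
                      labels = c("young", "adult", "old"))
nb <- naiveBayes(AgeGrp ~ Length + Diameter + Height + Whole.weight, data = abalone)
table(predict(nb, abalone), abalone$AgeGrp)   # resubstitution confusion matrix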

House Votes 1984 (HouseVotes84, from the mlbench package)

> require(mlbench)
> data(HouseVotes84)
> model <- naiveBayes(Class ~ ., data = HouseVotes84)
> predict(model, HouseVotes84[1:10, -1])
 [1] republican republican republican democrat democrat democrat republican republican republican
[10] democrat
Levels: democrat republican

House Votes 1984

> predict(model, HouseVotes84[1:10, -1], type = "raw")

This prints a two-column matrix of raw posterior probabilities (democrat, republican) for the first ten records, in scientific notation; the values are omitted here.

House Votes 1984

> pred <- predict(model, HouseVotes84[,-1])
> table(pred, HouseVotes84$Class)

The result is a 2x2 confusion matrix of predicted (pred) versus actual class, with levels democrat and republican; the counts are omitted here.
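A quick way to turn such a confusion matrix into a single accuracy number:

> tab <- table(pred, HouseVotes84$Class)
> sum(diag(tab)) / sum(tab)   # fraction of records classified correctly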

Hair, eye color

> data(HairEyeColor)
> mosaicplot(HairEyeColor)
> margin.table(HairEyeColor, 3)        # counts by Sex (Male/Female)
> margin.table(HairEyeColor, c(1, 3))  # counts by Hair (Black/Brown/Red/Blond) and Sex

Construct a naïve Bayes classifier and test it! (One possible sketch follows.)
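One way to attempt the exercise is to expand the 3-way table to case-level data, as with Titanic earlier; a sketch (predicting Sex from Hair and Eye is just one possible choice of target):

> library(e1071)
> hec <- as.data.frame(HairEyeColor)           # columns Hair, Eye, Sex, Freq
> hec <- hec[rep(1:nrow(hec), hec$Freq), 1:3]  # one row per person
> m <- naiveBayes(Sex ~ Hair + Eye, data = hec)
> table(predict(m, hec), hec$Sex)              # test it (on the training data)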

Another example: hierarchical clustering

> A = c(1, 2.5); B = c(5, 10); C = c(23, 34)
> D = c(45, 47); E = c(4, 17); F = c(18, 4)
> df <- data.frame(rbind(A, B, C, D, E, F))
> colnames(df) <- c("x", "y")
> hc <- hclust(dist(df))
> plot(hc)
> df$cluster <- cutree(hc, k = 2)   # cut the dendrogram into 2 clusters
> plot(y ~ x, df, col = cluster)    # colour points by cluster
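To see the chosen clusters on the dendrogram itself, base R's rect.hclust() draws boxes around the cut:

> plot(hc)
> rect.hclust(hc, k = 2, border = "red")   # box the 2-cluster cut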

See also:
Lab5a_ctree_1_2015.R – try clustergram instead; try hclust
Lab3b_kmeans1_2015.R – try clustergram instead; try hclust

New dataset to work with: trees (kyphosis, from the rpart package)

require(rpart)
fitK <- rpart(Kyphosis ~ Age + Number + Start, method="class", data=kyphosis)
printcp(fitK) # display the results (the cp table)
plotcp(fitK) # visualize cross-validation results
summary(fitK) # detailed summary of splits

# plot tree
plot(fitK, uniform=TRUE, main="Classification Tree for Kyphosis")
text(fitK, use.n=TRUE, all=TRUE, cex=.8)

# create an attractive postscript plot of the tree
post(fitK, file = "kyphosistree.ps", title = "Classification Tree for Kyphosis")
# might need to convert to PDF (distill)

[Slide: plot of the kyphosis classification tree]

> pfitK <- prune(fitK, cp = fitK$cptable[which.min(fitK$cptable[,"xerror"]),"CP"])
> plot(pfitK, uniform=TRUE, main="Pruned Classification Tree for Kyphosis")
> text(pfitK, use.n=TRUE, all=TRUE, cex=.8)
> post(pfitK, file = "ptree.ps", title = "Pruned Classification Tree for Kyphosis")

> require(party)   # ctree() lives in party
> fitK <- ctree(Kyphosis ~ Age + Number + Start, data=kyphosis)
> plot(fitK, main="Conditional Inference Tree for Kyphosis")

> plot(fitK, main="Conditional Inference Tree for Kyphosis", type="simple")

Swiss - scatterplotMatrix
[Slide: scatterplot matrix of the swiss dataset]
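A sketch of how such a plot can be produced; scatterplotMatrix() is in the car package, and base R's pairs() is a simpler alternative:

> library(car)   # assumed installed
> scatterplotMatrix(swiss)
> pairs(swiss)   # base R equivalent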

Hierarchical clustering

> dswiss <- dist(as.matrix(swiss))
> hs <- hclust(dswiss)
> plot(hs)

ctree

require(party)
swiss_ctree <- ctree(Fertility ~ Agriculture + Education + Catholic, data = swiss)
plot(swiss_ctree)

[Slide: plot] How could you get this?

rpart – recursive partitioning

require(rpart)
swiss_rpart <- rpart(Fertility ~ Agriculture + Education + Catholic, data = swiss)
plot(swiss_rpart) # try some different plot options
text(swiss_rpart) # try some different text options
# try other data
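Once fitted, the tree can be used for prediction; a minimal sketch of in-sample prediction and error (predict() on an anova tree returns the mean Fertility of each leaf):

pred <- predict(swiss_rpart)             # fitted values on the training data
sqrt(mean((swiss$Fertility - pred)^2))   # in-sample RMSE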

rpart – recursive partitioning (continued)

Try this for "Rings" on the abalone dataset. Try ctree as well and compare; we'll discuss these Friday. If you do the ctree, you may want to try pruning. (A possible starting point follows.)
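A possible starting point (the file and column names are assumptions, as before):

require(rpart); require(party)
abalone <- read.csv("abalone.csv")   # assumed local file
fitA <- rpart(Rings ~ Length + Diameter + Height, method = "anova", data = abalone)
plot(fitA); text(fitA)               # regression tree for Rings
fitA2 <- ctree(Rings ~ Length + Diameter + Height, data = abalone)
plot(fitA2)                          # compare with the rpart tree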

Mileage dataset (cu.summary, from the rpart package)

# Regression Tree Example
require(rpart)
# build the tree
fitM <- rpart(Mileage ~ Price + Country + Reliability + Type, method="anova", data=cu.summary)
printcp(fitM) # display the results
…
The printout reports the root node error (n = 60, with 57 observations deleted due to missingness) and a cp table with columns CP, nsplit, rel error, xerror and xstd; the numeric values are omitted here.

Mileage (continued)

plotcp(fitM) # visualize cross-validation results
summary(fitM) # detailed summary of splits

par(mfrow=c(1,2))
rsq.rpart(fitM) # visualize cross-validation results

# plot tree
plot(fitM, uniform=TRUE, main="Regression Tree for Mileage")
text(fitM, use.n=TRUE, all=TRUE, cex=.8)

# prune the tree using a cp value from the cptable (here chosen programmatically
# as the cp that minimizes the cross-validated error, xerror)
pfitM <- prune(fitM, cp = fitM$cptable[which.min(fitM$cptable[,"xerror"]),"CP"])

# plot the pruned tree
plot(pfitM, uniform=TRUE, main="Pruned Regression Tree for Mileage")
text(pfitM, use.n=TRUE, all=TRUE, cex=.8)
post(pfitM, file = "ptree2.ps", title = "Pruned Regression Tree for Mileage")

[Slide: plot of the pruned regression tree for Mileage]

# Conditional Inference Tree for Mileage
require(party)
fit2M <- ctree(Mileage ~ Price + Country + Reliability + Type, data = na.omit(cu.summary))
plot(fit2M)
# na.omit() drops the rows with missing values before fitting

There are many other datasets. Try as many as you can. Titanic? (One possible sketch follows.)
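For Titanic, one option is a classification tree using the table's Freq column as case weights; a sketch:

require(rpart)
tdf <- as.data.frame(Titanic)
tdf <- subset(tdf, Freq > 0)   # drop empty cells of the table
fitT <- rpart(Survived ~ Class + Sex + Age, data = tdf, weights = Freq, method = "class")
plot(fitT, uniform = TRUE); text(fitT, use.n = TRUE)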

Enough of trees!

Coming weeks
Your project proposals (Assignment 5) are due March 17/20. Come prepared. On March 20 you will likely also have a lab; attendance will be taken.
Spring break: March 23–27.
On March 31/April 3 you will have lectures on support vector machines (SVM).
Back to the ~regular schedule in April.