Peter Fox and Greg Hughes Data Analytics – ITWS-4600/ITWS-6600

Interpreting: regression, weighted kNN, clustering, trees and Bayesian methods
Peter Fox and Greg Hughes
Data Analytics – ITWS-4600/ITWS-6600
Group 2 Module 6, February 13, 2017

K Nearest Neighbors (classification)
Script – group2/lab1_nyt.R
> nyt1<-read.csv("nyt1.csv")
… from week 3b slides or script
> classif<-knn(train,test,cg,k=5)
#
> head(true.labels)
[1] 1 0 0 1 1 0
> head(classif)
[1] 1 1 1 1 0 0
Levels: 0 1
> ncorrect<-true.labels==classif
> table(ncorrect)["TRUE"] # or
> length(which(ncorrect))
What do you conclude?
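The counting step on this slide can be turned into an accuracy figure and a confusion matrix. The sketch below is only an illustration: it uses the built-in iris data rather than nyt1.csv, and the 80/20 split and the seed are assumptions, not taken from the lab script.

```r
library(class)  # provides knn(); a recommended package shipped with R

set.seed(42)                                   # arbitrary seed, for reproducibility only
n       <- nrow(iris)
trainid <- sample.int(n, floor(0.8 * n))       # assumed 80% training split
train   <- iris[trainid, 1:4]
test    <- iris[-trainid, 1:4]
cg      <- iris$Species[trainid]               # training labels
true.labels <- iris$Species[-trainid]          # held-out labels

classif  <- knn(train, test, cg, k = 5)
ncorrect <- true.labels == classif
accuracy <- mean(ncorrect)                     # fraction of test rows classified correctly

# Confusion matrix: rows are predictions, columns are the true labels
conf <- table(predicted = classif, actual = true.labels)
print(conf)
print(accuracy)
```

The confusion matrix shows which classes are confused with which, something the single count of TRUE values does not.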

Bronx 1 = Regression
You were reminded that log(0) is … not fun. THINK through what you are doing…
> plot(log(bronx$GROSS.SQUARE.FEET), log(bronx$SALE.PRICE))
> m1<-lm(log(SALE.PRICE)~log(GROSS.SQUARE.FEET),data=bronx)
Filtering is somewhat inevitable:
> bronx<-bronx[which(bronx$GROSS.SQUARE.FEET>0 & bronx$LAND.SQUARE.FEET>0 & bronx$SALE.PRICE>0),]
Lab5b_bronx1_2016.R
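To see why the filter has to come before the fit, here is a minimal sketch on made-up data. The data frame `toy`, its values and the seed are all hypothetical; only the column names mirror the Bronx data.

```r
set.seed(7)                                    # arbitrary seed
n     <- 200
sqft  <- c(rep(0, 10), round(runif(n - 10, 500, 5000)))  # ten records with zero area
price <- round(exp(7 + 0.7 * log(pmax(sqft, 1)) + rnorm(n, sd = 0.3)))
price[sample.int(n, 5)] <- 0                   # a few zero-price "sales"
toy   <- data.frame(GROSS.SQUARE.FEET = sqft, SALE.PRICE = price)

# log(0) is -Inf, which lm() refuses to fit
n_bad <- sum(!is.finite(log(toy$SALE.PRICE)) | !is.finite(log(toy$GROSS.SQUARE.FEET)))

# Filter first, then fit on the cleaned data
toy_ok <- toy[toy$GROSS.SQUARE.FEET > 0 & toy$SALE.PRICE > 0, ]
m      <- lm(log(SALE.PRICE) ~ log(GROSS.SQUARE.FEET), data = toy_ok)
print(n_bad)
print(coef(m))
```

Writing the formula with bare column names plus `data =` also keeps the model usable with `predict()` on new data frames later.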

Interpreting this!
Call:
lm(formula = log(SALE.PRICE) ~ log(GROSS.SQUARE.FEET), data = bronx)
Residuals:
     Min       1Q   Median       3Q      Max
-14.4529   0.0377   0.4160   0.6572   3.8159
Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)              7.0271     0.3088   22.75   <2e-16 ***
log(GROSS.SQUARE.FEET)   0.7013     0.0379   18.50   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.95 on 2435 degrees of freedom
Multiple R-squared: 0.1233, Adjusted R-squared: 0.1229
F-statistic: 342.4 on 1 and 2435 DF, p-value: < 2.2e-16

Plots – tell me what they tell you!

Solution model 2
> m2<-lm(log(SALE.PRICE)~log(GROSS.SQUARE.FEET)+log(LAND.SQUARE.FEET)+factor(NEIGHBORHOOD),data=bronx)
> summary(m2)
> plot(resid(m2))
#
> m2a<-lm(log(SALE.PRICE)~0+log(GROSS.SQUARE.FEET)+log(LAND.SQUARE.FEET)+factor(NEIGHBORHOOD),data=bronx) # 0+ drops the intercept
> summary(m2a)
> plot(resid(m2a))

How do you interpret this residual plot?

Solution model 3 and 4
> m3<-lm(log(SALE.PRICE)~0+log(GROSS.SQUARE.FEET)+log(LAND.SQUARE.FEET)+factor(NEIGHBORHOOD)+factor(BUILDING.CLASS.CATEGORY),data=bronx)
> summary(m3)
> plot(resid(m3))
#
> m4<-lm(log(SALE.PRICE)~0+log(GROSS.SQUARE.FEET)+log(LAND.SQUARE.FEET)+factor(NEIGHBORHOOD)*factor(BUILDING.CLASS.CATEGORY),data=bronx) # * adds the interaction terms
> summary(m4)
> plot(resid(m4))

And this one?

Bronx 2 = complex example
See lab1_bronx2.R
Manipulation
Mapping
knn
kmeans

Did you get to create the neighborhood map?
table(mapcoord$NEIGHBORHOOD)
mapcoord$NEIGHBORHOOD <- as.factor(mapcoord$NEIGHBORHOOD)
The MAP!!

mapmeans<-cbind(adduse,as.numeric(mapcoord$NEIGHBORHOOD))
colnames(mapmeans)[26] <- "NEIGHBORHOOD" #This is the right way of renaming.
keeps <- c("ZIP.CODE","NEIGHBORHOOD","TOTAL.UNITS","LAND.SQUARE.FEET","GROSS.SQUARE.FEET","SALE.PRICE","Latitude","Longitude")
mapmeans<-mapmeans[keeps] #Dropping others
mapmeans$NEIGHBORHOOD<-as.numeric(mapcoord$NEIGHBORHOOD)
for(i in 1:8){
  mapmeans[,i]=as.numeric(mapmeans[,i])
} #Now done for conversion to numeric

#Classification
mapcoord$class<-as.numeric(mapcoord$NEIGHBORHOOD)
nclass<-dim(mapcoord)[1]
split<-0.8
trainid<-sample.int(nclass,floor(split*nclass))
testid<-(1:nclass)[-trainid]

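KNN! Did you loop over k?
# The slide shows only the loop body (with k=5) and compares against mappred$class, which is not defined in these slides.
# The loop range below is assumed from the ten error values printed, and the held-out labels are taken from column 2 of mapcoord.
knntesterr<-numeric(10)
for(k in 1:10){
  knnpred<-knn(mapcoord[trainid,3:4],mapcoord[testid,3:4],cl=mapcoord[trainid,2],k=k)
  knntesterr[k]<-sum(knnpred!=mapcoord[testid,2])/length(testid)
}
knntesterr
[1] 0.1028037 0.1308411 0.1308411 0.1588785 0.1401869 0.1495327 0.1682243 0.1962617 0.1962617 0.1869159
What do you think?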
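A self-contained version of "loop over k" might look like the sketch below. It uses iris instead of mapcoord, and the range 1:10, the seed and the variable `best_k` are illustrative assumptions.

```r
library(class)  # knn()

set.seed(1)                                    # arbitrary seed
n       <- nrow(iris)
split   <- 0.8
trainid <- sample.int(n, floor(split * n))
testid  <- (1:n)[-trainid]

ks         <- 1:10                             # assumed range of k to try
knntesterr <- numeric(length(ks))
for (i in seq_along(ks)) {
  knnpred <- knn(iris[trainid, 1:4], iris[testid, 1:4],
                 cl = iris$Species[trainid], k = ks[i])
  # test error = fraction of held-out rows that were misclassified
  knntesterr[i] <- mean(knnpred != iris$Species[testid])
}
best_k <- ks[which.min(knntesterr)]            # smallest k achieving the lowest test error
print(knntesterr)
print(best_k)
```

Because the error comes from a single random split, repeating it over several seeds (or using cross-validation) gives a better sense of how stable the choice of k is.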

Try these on mapmeans, etc.

K-Means!
> mapmeans<-data.frame(adduse$ZIP.CODE, as.numeric(mapcoord$NEIGHBORHOOD), adduse$TOTAL.UNITS, adduse$"LAND.SQUARE.FEET", adduse$GROSS.SQUARE.FEET, adduse$SALE.PRICE, adduse$'querylist$latitude', adduse$'querylist$longitude')
> mapobj<-kmeans(mapmeans,5, iter.max=10, nstart=5, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"))
> fitted(mapobj,method=c("centers","classes"))

> mapobj$centers
  adduse.ZIP.CODE as.numeric.mapcoord.NEIGHBORHOOD. adduse.TOTAL.UNITS adduse.LAND.SQUARE.FEET
1        10464.09                          19.47454           1.550926                2028.285
2        10460.65                          16.38710          25.419355               11077.419
3        10454.00                          20.00000           1.000000               29000.000
4        10463.45                          10.90909          42.181818               10462.273
5        10464.00                          17.42857           4.714286               14042.214
  adduse.GROSS.SQUARE.FEET adduse.SALE.PRICE adduse..querylist.latitude. adduse..querylist.longitude.
1                 1712.887          279950.4                    40.85280                    -73.87357
2                26793.516         2944099.9                    40.85597                    -73.89139
3                87000.000        24120881.0                    40.80441                    -73.92290
4                40476.636         6953345.4                    40.86009                    -73.88632
5                 9757.679          885950.9                    40.85300                    -73.87781

> plot(mapmeans,mapobj$cluster)
> mapobj$size
[1] 432 31 1 11 56
[Figure: plot panels labelled ZIP.CODE, NEIGHBORHOOD, TOTAL.UNITS, LAND.SQUARE.FEET, GROSS.SQUARE.FEET, SALE.PRICE, latitude, longitude]

Return object
cluster: A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
centers: A matrix of cluster centres.
totss: The total sum of squares.
withinss: Vector of within-cluster sum of squares, one component per cluster.
tot.withinss: Total within-cluster sum of squares, i.e., sum(withinss).
betweenss: The between-cluster sum of squares, i.e. totss-tot.withinss.
size: The number of points in each cluster.
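The components listed above can be inspected directly. This is a small sketch on the iris measurements (not mapmeans); the seed and the choice of 3 centers are assumptions for illustration.

```r
set.seed(123)                                  # arbitrary seed; kmeans starts are random
cl <- kmeans(iris[, 1:4], centers = 3, nstart = 5)

print(cl$size)                                 # number of points in each cluster
print(cl$centers)                              # one row per cluster, one column per variable

# The sums of squares are tied together as described in the return object
ok_between <- isTRUE(all.equal(cl$betweenss, cl$totss - cl$tot.withinss))
ok_within  <- isTRUE(all.equal(cl$tot.withinss, sum(cl$withinss)))

# Share of total variation explained by the clustering (closer to 1 = tighter clusters)
explained <- cl$betweenss / cl$totss
print(explained)
```

Since kmeans works on Euclidean distances, columns with very large values (such as SALE.PRICE next to latitude) tend to dominate the result; running the same call on `scale(mapmeans)` is one thing worth trying.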

Plotting clusters
library(cluster)
clusplot(mapmeans, mapobj$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
# Centroid Plot against 1st 2 discriminant functions
library(fpc)
plotcluster(mapmeans, mapobj$cluster)

Plot

Clusplot (k=17)

Dendrogram for this = tree of the clusters: Highly supported by data? Okay, this is a little complex – perhaps something simpler?

What else could you cluster/classify?
SALE.PRICE? If so, how would you measure error?
# I added SALE.PRICE as 5th column in adduse…
> pcolor<- color.scale(log(mapcoord[,5]),c(0,1,1),c(1,1,0),0)
> geoPlot(mapcoord,zoom=12,color=pcolor)
TAX.CLASS.AT.PRESENT? TAX.CLASS.AT.TIME.OF.SALE? How would you measure error?

Regression Exercises
Using the EPI dataset, find the single most important factor in increasing the EPI in a given region.
Examine distributions down to the leaf nodes and build up an EPI “model”.

Linear and least-squares
> EPI_data<- read.csv("EPI_data.csv")
> attach(EPI_data)
> boxplot(ENVHEALTH,DALY,AIR_H,WATER_H)
> lmENVH<-lm(ENVHEALTH~DALY+AIR_H+WATER_H)
> lmENVH
… (what should you get?)
> summary(lmENVH)
…
> cENVH<-coef(lmENVH)

Linear and least-squares
> lmENVH<-lm(ENVHEALTH~DALY+AIR_H+WATER_H)
> lmENVH
Call:
lm(formula = ENVHEALTH ~ DALY + AIR_H + WATER_H)
Coefficients:
(Intercept)         DALY        AIR_H      WATER_H
 -2.673e-05    5.000e-01    2.500e-01    2.500e-01
> summary(lmENVH)
…
> cENVH<-coef(lmENVH)

Read the documentation!

Linear and least-squares
> summary(lmENVH)
Call:
lm(formula = ENVHEALTH ~ DALY + AIR_H + WATER_H)
Residuals:
       Min         1Q     Median         3Q        Max
-0.0072734 -0.0027299  0.0001145  0.0021423  0.0055205
Coefficients:
              Estimate Std. Error   t value Pr(>|t|)
(Intercept) -2.673e-05  6.377e-04    -0.042    0.967
DALY         5.000e-01  1.922e-05 26020.669   <2e-16 ***
AIR_H        2.500e-01  1.273e-05 19645.297   <2e-16 ***
WATER_H      2.500e-01  1.751e-05 14279.903   <2e-16 ***
---
p < 0.01 : very strong presumption against null hypothesis vs. this fit
0.01 < p < 0.05 : strong presumption against null hypothesis
0.05 < p < 0.1 : low presumption against null hypothesis
p > 0.1 : no presumption against the null hypothesis

Linear and least-squares
Continued:
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.003097 on 178 degrees of freedom
  (49 observations deleted due to missingness)
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 3.983e+09 on 3 and 178 DF, p-value: < 2.2e-16
> names(lmENVH)
 [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"
 [7] "qr"            "df.residual"   "na.action"     "xlevels"       "call"          "terms"
[13] "model"

Plot original versus fitted
> plot(ENVHEALTH,col="red")
> points(lmENVH$fitted.values,col="blue")
Huh?

Try again!
> plot(ENVHEALTH[!is.na(ENVHEALTH)], col="red")
> points(lmENVH$fitted.values,col="blue")
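The mismatch arises because lm() drops every row with a missing value in any model variable, so the fitted values are shorter than the original vector. One alternative, sketched here on made-up data (the data frame `d` and its values are hypothetical), is to fit with `na.action = na.exclude`, after which `fitted()` pads the dropped rows with NA and lines up with the original observations.

```r
set.seed(3)                                    # arbitrary seed
d   <- data.frame(x = rnorm(20))
d$y <- 2 * d$x + rnorm(20, sd = 0.1)
d$x[c(2, 5)] <- NA                             # missing predictor values
d$y[9]       <- NA                             # missing response value

# na.omit: rows 2, 5 and 9 are dropped, so only 17 fitted values come back
fit_omit <- lm(y ~ x, data = d, na.action = na.omit)
n_omit   <- length(fit_omit$fitted.values)

# na.exclude: same fit, but fitted() returns 20 values with NA in the dropped rows
fit_excl <- lm(y ~ x, data = d, na.action = na.exclude)
fit_full <- fitted(fit_excl)
n_excl   <- length(fit_full)

if (interactive()) {
  plot(d$y, col = "red")                       # original values, by row
  points(fit_full, col = "blue")               # fitted values, aligned by row
}
```

Note that `fit_excl$fitted.values` is still the short vector; the padding is done by the `fitted()` (and `resid()`) extractor functions.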

Predict
> cENVH<-coef(lmENVH)
> DALYNEW<-c(seq(5,95,5)) #2
> AIR_HNEW<-c(seq(5,95,5)) #3
> WATER_HNEW<-c(seq(5,95,5)) #4

Predict
> NEW<-data.frame(DALYNEW,AIR_HNEW,WATER_HNEW)
> pENV<- predict(lmENVH,NEW,interval="prediction")
> cENV<- predict(lmENVH,NEW,interval="confidence") # look up what this does
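One point worth checking when reading the documentation: predict() matches the columns of the new data frame to the model's variable names. With columns named DALYNEW, AIR_HNEW and WATER_HNEW, the model variables are not found in NEW, and R appears to fall back to the attached originals, which would explain output with NA rows and row numbers well past 19. The sketch below uses made-up training data (the data frame `train` and the noise level are assumptions) and names the new columns after the model variables.

```r
set.seed(11)                                   # arbitrary seed
n     <- 100
train <- data.frame(DALY    = runif(n, 0, 100),
                    AIR_H   = runif(n, 0, 100),
                    WATER_H = runif(n, 0, 100))
train$ENVHEALTH <- 0.5 * train$DALY + 0.25 * train$AIR_H +
                   0.25 * train$WATER_H + rnorm(n, sd = 0.5)

fit <- lm(ENVHEALTH ~ DALY + AIR_H + WATER_H, data = train)

# Column names in the new data match the variable names in the formula
NEW <- data.frame(DALY = seq(5, 95, 5), AIR_H = seq(5, 95, 5), WATER_H = seq(5, 95, 5))

pENV <- predict(fit, NEW, interval = "prediction")  # bounds for a single new observation
cENV <- predict(fit, NEW, interval = "confidence")  # bounds for the mean response
print(head(pENV))
print(head(cENV))
```

Both calls return the same `fit` column; the prediction interval is wider because it also accounts for the scatter of individual observations around the mean.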

Predict object returns
predict.lm produces a vector of predictions or a matrix of predictions and bounds with column names fit, lwr, and upr if interval is set. Access via [,1] etc.
If se.fit is TRUE, a list with the following components is returned:
fit: vector or matrix as above
se.fit: standard error of predicted means
residual.scale: residual standard deviations
df: degrees of freedom for residual

Output from predict
> head(pENV)
       fit      lwr      upr
1       NA       NA       NA
2 11.55213 11.54591 11.55834
3 18.29168 18.28546 18.29791
4       NA       NA       NA
5 69.92533 69.91915 69.93151
6 90.20589 90.19974 90.21204
…

> tail(pENV)
         fit lwr upr
226       NA  NA  NA
227       NA  NA  NA
228 34.95256 34…

Read the documentation!

Ionosphere: group2/lab1_kknn2.R
require(kknn)
data(ionosphere)
ionosphere.learn <- ionosphere[1:200,]
ionosphere.valid <- ionosphere[-c(1:200),]
fit.kknn <- kknn(class ~ ., ionosphere.learn, ionosphere.valid)
table(ionosphere.valid$class, fit.kknn$fit)
# vary kernel
(fit.train1 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15, kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 1))
table(predict(fit.train1, ionosphere.valid), ionosphere.valid$class)
# alter distance
(fit.train2 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15, kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 2))
table(predict(fit.train2, ionosphere.valid), ionosphere.valid$class)

Results
ionosphere.learn <- ionosphere[1:200,] # convenience sampling!!!!
ionosphere.valid <- ionosphere[-c(1:200),]
fit.kknn <- kknn(class ~ ., ionosphere.learn, ionosphere.valid)
table(ionosphere.valid$class, fit.kknn$fit)
    b   g
b  19   8
g   2 122

> (fit.train1 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15,
+ kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 1))
Call:
train.kknn(formula = class ~ ., data = ionosphere.learn, kmax = 15, distance = 1, kernel = c("triangular", "rectangular", "epanechnikov", "optimal"))
Type of response variable: nominal
Minimal misclassification: 0.12
Best kernel: rectangular
Best k: 2
> table(predict(fit.train1, ionosphere.valid), ionosphere.valid$class)
    b   g
b  25   4
g   2 120

> (fit.train2 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15,
+ kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 2))
Call:
train.kknn(formula = class ~ ., data = ionosphere.learn, kmax = 15, distance = 2, kernel = c("triangular", "rectangular", "epanechnikov", "optimal"))
Type of response variable: nominal
Minimal misclassification: 0.12
Best kernel: rectangular
Best k: 2
> table(predict(fit.train2, ionosphere.valid), ionosphere.valid$class)
    b   g
b  20   5
g   7 119
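To compare the three ionosphere results on one number, the validation accuracy can be read off each confusion table as the diagonal over the total. The sketch below re-enters the counts printed in the slides by hand; the helper `acc` and the table names are made up for this illustration.

```r
# Hypothetical helper: overall accuracy = correctly classified / all cases
acc <- function(tab) sum(diag(tab)) / sum(tab)

lab <- list(c("b", "g"), c("b", "g"))
# matrix() fills column by column, so each vector is (b,b), (g,b), (b,g), (g,g)
tab_default <- matrix(c(19, 2, 8, 122), nrow = 2, dimnames = lab)  # kknn defaults
tab_dist1   <- matrix(c(25, 2, 4, 120), nrow = 2, dimnames = lab)  # train.kknn, distance = 1
tab_dist2   <- matrix(c(20, 7, 5, 119), nrow = 2, dimnames = lab)  # train.kknn, distance = 2

accs <- c(default = acc(tab_default), dist1 = acc(tab_dist1), dist2 = acc(tab_dist2))
print(round(accs, 4))
```

Both train.kknn runs report the same minimal misclassification (0.12) on the learning set, yet they differ on the validation rows, which is a reminder to judge models on held-out data.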

However… there is more

Naïve Bayes – what is it?
Example: testing for a specific item of knowledge that 1% of the population has been informed of (don’t ask how).
An imperfect test:
99% of knowledgeable people test positive
99% of ignorant people test negative
If a person tests positive – what is the probability that they know the fact?

Naïve approach…
We have 10,000 representative people: 100 know the fact/item, 9,900 do not.
We test them all:
Get 99 knowing people testing knowing (and 1 testing not knowing)
Get 9,801 not-knowing people testing not knowing
But 99 not-knowing people testing as knowing
Testing positive (knowing) – equally likely to know or not = 50%

Tree diagram
10000 ppl
  1% know (100 ppl)
    99% test to know (99 ppl)
    1% test not to know (1 person)
  99% do not know (9900 ppl)
    1% test to know (99 ppl)
    99% test not to know (9801 ppl)

Relation between probabilities
For outcomes x and y there are probabilities p(x) and p(y) that either happened.
If there’s a connection, then the joint probability (that both happen) = p(x,y)
Or x happens given y happens = p(x|y), or vice versa; then:
p(x|y)*p(y) = p(x,y) = p(y|x)*p(x)
So p(y|x) = p(x|y)*p(y)/p(x) (Bayes’ Law)
E.g. p(know|+ve) = p(+ve|know)*p(know)/p(+ve) = (.99*.01)/(.99*.01+.01*.99) = 0.5
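The worked example can be checked numerically in two ways, via Bayes' Law and via the head counts from the tree diagram. The variable names below are made up for this sketch; the numbers are the ones on the slides.

```r
p_know           <- 0.01   # 1% of the population knows the fact
p_pos_given_know <- 0.99   # knowledgeable people who test positive
p_pos_given_not  <- 0.01   # ignorant people who (wrongly) test positive

# Total probability of a positive test, over both groups
p_pos <- p_pos_given_know * p_know + p_pos_given_not * (1 - p_know)

# Bayes' Law: p(know | +ve) = p(+ve | know) * p(know) / p(+ve)
p_know_given_pos <- p_pos_given_know * p_know / p_pos

# Same answer from counts in a population of 10,000
n         <- 10000
true_pos  <- n * p_know * p_pos_given_know          # knowing and testing positive
false_pos <- n * (1 - p_know) * p_pos_given_not     # not knowing but testing positive
from_counts <- true_pos / (true_pos + false_pos)

print(c(bayes = p_know_given_pos, counts = from_counts))
```

Changing `p_know` shows how strongly the answer depends on the base rate: the rarer the knowledge, the less a positive test means.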

How do you use it?
If the population contains x, what is the chance that y is true?
p(SPAM|word) = p(word|SPAM)*p(SPAM)/p(word)
Base this on data:
p(spam) counts the proportion of spam versus not
p(word|spam) counts the prevalence of spam containing the ‘word’
p(word|!spam) counts the prevalence of non-spam containing the ‘word’
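As a sketch of "base this on data", the three quantities can be estimated from message counts. All counts below are invented for illustration and do not come from any real mail corpus.

```r
# Hypothetical counts from a labelled mailbox
n_spam      <- 40   # messages labelled spam
n_ham       <- 60   # messages labelled not spam
n_spam_word <- 30   # spam messages containing the word
n_ham_word  <- 6    # non-spam messages containing the word

p_spam      <- n_spam / (n_spam + n_ham)       # p(spam)
p_word_spam <- n_spam_word / n_spam            # p(word | spam)
p_word_ham  <- n_ham_word / n_ham              # p(word | !spam)

# p(word) by total probability, then Bayes' Law
p_word      <- p_word_spam * p_spam + p_word_ham * (1 - p_spam)
p_spam_word <- p_word_spam * p_spam / p_word   # p(spam | word)
print(p_spam_word)
```

With these counts the result equals the direct ratio 30 / (30 + 6): of all messages containing the word, the fraction that are spam.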

Or..
What is the probability that you are in one class (i) over another class (j) given another factor (X)?
Invoke Bayes: maximize p(X|Ci)p(Ci)/p(X) (p(X) is ~constant across classes, and the p(Ci) are taken as equal if not known)
So, assuming conditional independence: p(X|Ci) = p(x1|Ci) * p(x2|Ci) * … * p(xn|Ci)

P(xk | Ci) is estimated from the training samples
Categorical: Estimate P(xk | Ci) as the percentage of samples of class i with value xk. Training involves counting the percentage of occurrence of each possible value for each class.
Numeric: The actual form of the density function is generally not known, so a “normal” density (i.e. distribution) is often assumed.

Digging into iris
library(e1071) # provides naiveBayes()
classifier<-naiveBayes(iris[,1:4], iris[,5])
table(predict(classifier, iris[,-5]), iris[,5], dnn=list('predicted','actual'))
classifier$apriori
classifier$tables$Petal.Length
plot(function(x) dnorm(x, 1.462, 0.1736640), 0, 8, col="red", main="Petal length distribution for the 3 different species")
curve(dnorm(x, 4.260, 0.4699110), add=TRUE, col="blue")
curve(dnorm(x, 5.552, 0.5518947 ), add=TRUE, col = "green")
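The means and standard deviations passed to dnorm() above are the per-species statistics of Petal.Length. The sketch below recomputes them in base R and applies Bayes' Law with a normal density for that one variable. It is a hand-rolled, single-feature illustration, not the e1071 implementation, and the function name `posterior` is made up.

```r
x   <- iris$Petal.Length
cls <- iris$Species

prior <- as.numeric(table(cls)) / length(cls)  # p(Ci): 1/3 each in iris
mu    <- sapply(split(x, cls), mean)           # per-class mean
sigma <- sapply(split(x, cls), sd)             # per-class standard deviation

# Hypothetical helper: posterior class probabilities for one petal length,
# assuming a normal density within each class
posterior <- function(x_new) {
  unnorm <- dnorm(x_new, mean = mu, sd = sigma) * prior  # p(x | Ci) * p(Ci)
  setNames(unnorm / sum(unnorm), levels(cls))            # divide by p(x)
}

print(round(mu, 3))
print(round(sigma, 7))
print(round(posterior(1.5), 4))   # a short petal
print(round(posterior(6.0), 4))   # a long petal
```

With several features, the naïve (conditional independence) assumption means multiplying one such density per feature before normalising.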

Bayes
> cl <- kmeans(iris[,1:4], 3)
> table(cl$cluster, iris[,5])
  setosa versicolor virginica
2      0          2        36
1      0         48        14
3     50          0         0
#
> m <- naiveBayes(iris[,1:4], iris[,5])
> table(predict(m, iris[,1:4]), iris[,5])
           setosa versicolor virginica
setosa         50          0         0
versicolor      0         47         3
virginica       0          3        47
pairs(iris[1:4],main="Iris Data (red=setosa,green=versicolor,blue=virginica)", pch=21, bg=c("red","green3","blue")[unclass(iris$Species)])

And use a contingency table
> data(Titanic)
> mdl <- naiveBayes(Survived ~ ., data = Titanic)
> mdl
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.formula(formula = Survived ~ ., data = Titanic)
A-priori probabilities:
Survived
      No      Yes
0.676965 0.323035
Conditional probabilities:
        Class
Survived        1st        2nd        3rd       Crew
     No  0.08187919 0.11208054 0.35436242 0.45167785
     Yes 0.28551336 0.16596343 0.25035162 0.29817159
        Sex
Survived       Male     Female
     No  0.91543624 0.08456376
     Yes 0.51617440 0.48382560
        Age
Survived      Child      Adult
     No  0.03489933 0.96510067
     Yes 0.08016878 0.91983122
Try Lab5b_nbayes1_2016.R

Using a contingency table
> predict(mdl, as.data.frame(Titanic)[,1:3])
 [1] Yes No No No Yes Yes Yes Yes No No No No Yes Yes Yes Yes Yes No No No Yes Yes Yes Yes No
[26] No No No Yes Yes Yes Yes
Levels: No Yes

http://www.ugrad.stat.ubc.ca/R/library/mlbench/html/HouseVotes84.html
require(e1071) # provides naiveBayes()
require(mlbench)
data(HouseVotes84)
model <- naiveBayes(Class ~ ., data = HouseVotes84)
predict(model, HouseVotes84[1:10,-1])
predict(model, HouseVotes84[1:10,-1], type = "raw")
pred <- predict(model, HouseVotes84[,-1])
table(pred, HouseVotes84$Class)

Exercise for you
> data(HairEyeColor)
> mosaicplot(HairEyeColor)
> margin.table(HairEyeColor,3)
Sex
  Male Female
   279    313
> margin.table(HairEyeColor,c(1,3))
       Sex
Hair    Male Female
  Black   56     52
  Brown  143    143
  Red     34     37
  Blond   46     81
How would you construct a naïve Bayes classifier and test it?

The Titanic contingency-table example above is also in group2/lab2_nbayes1.R

nbayes1
> table(pred, HouseVotes84$Class)
pred         democrat republican
  democrat        238         13
  republican       29        155

> predict(model, HouseVotes84[1:10,-1], type = "raw")
          democrat   republican
 [1,] 1.029209e-07 9.999999e-01
 [2,] 5.820415e-08 9.999999e-01
 [3,] 5.684937e-03 9.943151e-01
 [4,] 9.985798e-01 1.420152e-03
 [5,] 9.666720e-01 3.332802e-02
 [6,] 8.121430e-01 1.878570e-01
 [7,] 1.751512e-04 9.998248e-01
 [8,] 8.300100e-06 9.999917e-01
 [9,] 8.277705e-08 9.999999e-01
[10,] 1.000000e+00 5.029425e-11

Ex: Classification Bayes
Retrieve the abalone.csv dataset (predicting the age of abalone from physical measurements).
Perform naïve Bayes classification to get predictors for Age (Rings).
Interpret. Discuss in next lab.

At this point…
You may realize that the inter-relation among classification and clustering methods, at an absolute and relative level (i.e. hierarchical -> trees…), is COMPLEX…
Trees are interesting from a decision perspective: if this or that, then this…. More in the next module.
Beyond just distance measures: from clustering (kmeans) to probabilities (Bayesian).
And, so many ways to visualize them…