
1 Peter Fox, Data Analytics – ITWS-4600/ITWS-6600, Week 5a, February 23, 2016: Weighted kNN, clustering, “early” trees and Bayesian

2 Plot tools/tips: http://statmethods.net/advgraphs/layout.html and http://flowingdata.com/2014/02/27/how-to-read-histograms-and-use-them-in-r/ — pairs, gpairs, scatterplot.matrix, clustergram, etc. data() # precip, presidents, iris, swiss, sunspot.month (!), environmental, ethanol, ionosphere. More script fragments in R are available on the web site (http://aquarius.tw.rpi.edu/html/DA).

3 Weighted kNN…
require(kknn)
data(iris)
m <- dim(iris)[1]
val <- sample(1:m, size = round(m/3), replace = FALSE, prob = rep(1/m, m))
iris.learn <- iris[-val,]
iris.valid <- iris[val,]
iris.kknn <- kknn(Species ~ ., iris.learn, iris.valid, distance = 1, kernel = "triangular")
summary(iris.kknn)
fit <- fitted(iris.kknn)
table(iris.valid$Species, fit)
pcol <- as.character(as.numeric(iris.valid$Species))
pairs(iris.valid[1:4], pch = pcol, col = c("green3", "red")[(iris.valid$Species != fit) + 1])

4 summary
Call: kknn(formula = Species ~ ., train = iris.learn, test = iris.valid, distance = 1, kernel = "triangular")
Response: "nominal"
          fit   prob.setosa prob.versicolor prob.virginica
1  versicolor   0           1.00000000      0.00000000
2  versicolor   0           1.00000000      0.00000000
3  versicolor   0           0.91553003      0.08446997
4  setosa       1           0.00000000      0.00000000
5  virginica    0           0.00000000      1.00000000
6  virginica    0           0.00000000      1.00000000
7  setosa       1           0.00000000      0.00000000
8  versicolor   0           0.66860033      0.33139967
9  virginica    0           0.22534461      0.77465539
10 versicolor   0           0.79921042      0.20078958
11 virginica    0           0.00000000      1.00000000
12 …

5 table
             fit
              setosa versicolor virginica
  setosa          15          0         0
  versicolor       0         19         1
  virginica        0          2        13
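
A quick way to read this table as a single number is the overall accuracy (correct predictions divided by the total). A minimal sketch, assuming fit and iris.valid from the Lab5b_kknn1 fragment above:
tab <- table(iris.valid$Species, fit)  # rows = actual, columns = predicted
sum(diag(tab)) / sum(tab)              # overall accuracy; 47/50 = 0.94 for the table above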

6 Look at Lab5b_kknn1_2016.R
pcol <- as.character(as.numeric(iris.valid$Species))
pairs(iris.valid[1:4], pch = pcol, col = c("green3", "red")[(iris.valid$Species != fit) + 1])

7 Ctrees? We want a means to make decisions – so how about an “if this, then this; otherwise that” approach == tree methods, or branching. Conditional inference – what is that? Instead of: if (This1 .and. This2 .and. This3 .and. …)

8 Conditional Inference Tree
> require(party) # don’t get me started!
> str(iris)
'data.frame': 150 obs. of 5 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
> iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris)

9 Ctree
> print(iris_ctree)
Conditional inference tree with 4 terminal nodes
Response: Species
Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
Number of observations: 150
1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264
  2)* weights = 50
1) Petal.Length > 1.9
  3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894
    4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865
      5)* weights = 46
    4) Petal.Length > 4.8
      6)* weights = 8
  3) Petal.Width > 1.7
    7)* weights = 46

10 plot(iris_ctree) — Lab5b_ctree2_2016.R
> plot(iris_ctree, type = "simple") # try this
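
A hedged follow-on sketch (not part of the lab script): the fitted tree can be checked against the training labels with party’s predict method, assuming iris_ctree from the previous slide:
table(predict(iris_ctree), iris$Species)   # resubstitution confusion matrix for the ctree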

11 Beyond plot: pairs
pairs(iris[1:4], main = "Anderson's Iris Data -- 3 species", pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])
Try Lab5b_pairs1_2016.R – USJudgeRatings

12 But the means for branching… do not have to be threshold-based (~ distance). They can be cluster-based: I am more similar to you if I possess these attributes (in this range). Thus: trees + clusters = hierarchical clustering. In R: hclust (and others) in the stats package.

13 Try hclust for iris
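
One possible sketch (variable names are illustrative; using the four numeric columns with Euclidean distance is just one reasonable choice):
d_iris  <- dist(as.matrix(iris[, 1:4]))                        # distances on the numeric attributes only
hc_iris <- hclust(d_iris)                                      # complete linkage by default
plot(hc_iris, labels = as.character(iris$Species), cex = 0.5)  # dendrogram labelled by species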

14 require(gpairs); gpairs(iris) — Try Lab5b_gpairs1_2016.R

15 Better scatterplots
install.packages("car")
require(car)
scatterplotMatrix(iris)
Try Lab5b_spm_2016.R

16 require(lattice); splom(iris) # default — Try Lab5b_splom_2016.R

17 splom extra!
require(lattice)
super.sym <- trellis.par.get("superpose.symbol")
splom(~iris[1:4], groups = Species, data = iris,
      panel = panel.superpose,
      key = list(title = "Three Varieties of Iris", columns = 3,
                 points = list(pch = super.sym$pch[1:3], col = super.sym$col[1:3]),
                 text = list(c("Setosa", "Versicolor", "Virginica"))))
splom(~iris[1:3] | Species, data = iris, layout = c(2, 2), pscales = 0,
      varnames = c("Sepal\nLength", "Sepal\nWidth", "Petal\nLength"),
      page = function(...) {
        ltext(x = seq(.6, .8, length.out = 4), y = seq(.9, .6, length.out = 4),
              labels = c("Three", "Varieties", "of", "Iris"), cex = 2)
      })
parallelplot(~iris[1:4] | Species, iris)
parallelplot(~iris[1:4], iris, groups = Species, horizontal.axis = FALSE, scales = list(x = list(rot = 90)))

18 (figure only)

19 (figure only)

20 Shift the dataset…

21 Hierarchical clustering
> d <- dist(as.matrix(mtcars))
> hc <- hclust(d)
> plot(hc)
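
A possible next step, not shown on the slide: cut the dendrogram to recover discrete cluster labels (the choice of k = 3 here is arbitrary):
groups <- cutree(hc, k = 3)   # assumes 'hc' from the hclust() call above
table(groups)                 # how many cars fall in each cluster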

22 Swiss – pairs
pairs(~ Fertility + Education + Catholic, data = swiss, subset = Education < 20, main = "Swiss data, Education < 20")

23 ctree
require(party)
swiss_ctree <- ctree(Fertility ~ Agriculture + Education + Catholic, data = swiss)
plot(swiss_ctree)

24 Hierarchical clustering
> dswiss <- dist(as.matrix(swiss))
> hs <- hclust(dswiss)
> plot(hs)

25 scatterplotMatrix (figure)

26 require(lattice); splom(swiss)

27 (figure only)

28 (figure only)

29 And use a contingency table
> require(e1071)
> data(Titanic)
> mdl <- naiveBayes(Survived ~ ., data = Titanic)
> mdl
Naive Bayes Classifier for Discrete Predictors
Call: naiveBayes.formula(formula = Survived ~ ., data = Titanic)
A-priori probabilities:
Survived
      No      Yes
0.676965 0.323035
Conditional probabilities:
        Class
Survived        1st        2nd        3rd       Crew
     No  0.08187919 0.11208054 0.35436242 0.45167785
     Yes 0.28551336 0.16596343 0.25035162 0.29817159
        Sex
Survived       Male     Female
     No  0.91543624 0.08456376
     Yes 0.51617440 0.48382560
        Age
Survived      Child      Adult
     No  0.03489933 0.96510067
     Yes 0.08016878 0.91983122
Try Lab5b_nbayes1_2016.R

30 http://www.ugrad.stat.ubc.ca/R/library/mlbench/html/HouseVotes84.html
require(mlbench)
require(e1071)
data(HouseVotes84)
model <- naiveBayes(Class ~ ., data = HouseVotes84)
predict(model, HouseVotes84[1:10, -1])
predict(model, HouseVotes84[1:10, -1], type = "raw")
pred <- predict(model, HouseVotes84[, -1])
table(pred, HouseVotes84$Class)
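
Two optional follow-ons, assuming model and pred from the fragment above: summarize the confusion table as an accuracy, and refit with Laplace smoothing (the laplace argument of e1071’s naiveBayes) to handle sparse vote categories:
tab <- table(pred, HouseVotes84$Class)
sum(diag(tab)) / sum(tab)                                             # resubstitution accuracy
model_lap <- naiveBayes(Class ~ ., data = HouseVotes84, laplace = 1)  # smoothed conditional estimates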

31 Exercise for you
> data(HairEyeColor)
> mosaicplot(HairEyeColor)
> margin.table(HairEyeColor, 3)
Sex
  Male Female
   279    313
> margin.table(HairEyeColor, c(1, 3))
       Sex
Hair    Male Female
  Black   56     52
  Brown  143    143
  Red     34     37
  Blond   46     81
How would you construct a naïve Bayes classifier and test it? (One sketch follows below.)
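
One possible sketch of an answer, patterned directly on the Titanic contingency-table example earlier in the deck (treating Sex as the class to predict is just one choice):
require(e1071)
hec_mdl <- naiveBayes(Sex ~ ., data = HairEyeColor)    # e1071 accepts a contingency table here, as with Titanic
hec_mdl
predict(hec_mdl, as.data.frame(HairEyeColor)[, 1:2])   # predictions for each Hair/Eye cell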

32 Cars? (figure)

33 Linear regression? Or? (figure)

34 Ionosphere: Lab5b_kknn2_2016.R
require(kknn)
data(ionosphere)
ionosphere.learn <- ionosphere[1:200, ]
ionosphere.valid <- ionosphere[-c(1:200), ]
fit.kknn <- kknn(class ~ ., ionosphere.learn, ionosphere.valid)
table(ionosphere.valid$class, fit.kknn$fit)
# vary kernel
(fit.train1 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15, kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 1))
table(predict(fit.train1, ionosphere.valid), ionosphere.valid$class)
# alter distance
(fit.train2 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15, kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 2))
table(predict(fit.train2, ionosphere.valid), ionosphere.valid$class)

35 Results
ionosphere.learn <- ionosphere[1:200, ]    # convenience sampling!!!!
ionosphere.valid <- ionosphere[-c(1:200), ]
fit.kknn <- kknn(class ~ ., ionosphere.learn, ionosphere.valid)
table(ionosphere.valid$class, fit.kknn$fit)
     b   g
  b 19   8
  g  2 122

36
(fit.train1 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15, kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 1))
Call: train.kknn(formula = class ~ ., data = ionosphere.learn, kmax = 15, distance = 1, kernel = c("triangular", "rectangular", "epanechnikov", "optimal"))
Type of response variable: nominal
Minimal misclassification: 0.12
Best kernel: rectangular
Best k: 2
table(predict(fit.train1, ionosphere.valid), ionosphere.valid$class)
     b   g
  b 25   4
  g  2 120

37
(fit.train2 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15, kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 2))
Call: train.kknn(formula = class ~ ., data = ionosphere.learn, kmax = 15, distance = 2, kernel = c("triangular", "rectangular", "epanechnikov", "optimal"))
Type of response variable: nominal
Minimal misclassification: 0.12
Best kernel: rectangular
Best k: 2
table(predict(fit.train2, ionosphere.valid), ionosphere.valid$class)
     b   g
  b 20   5
  g  7 119
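
To compare the three fits numerically, the confusion tables above can be reduced to accuracies; acc is a hypothetical helper, and the fitted objects are assumed from the preceding slides:
acc <- function(tab) sum(diag(tab)) / sum(tab)
acc(table(ionosphere.valid$class, fit.kknn$fit))                           # plain kknn
acc(table(predict(fit.train1, ionosphere.valid), ionosphere.valid$class))  # train.kknn, distance = 1
acc(table(predict(fit.train2, ionosphere.valid), ionosphere.valid$class))  # train.kknn, distance = 2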

38 However… there is more

39 Naïve Bayes – what is it?
Example: testing for a specific item of knowledge that 1% of the population has been informed of (don’t ask how). An imperfect test:
– 99% of knowledgeable people test positive
– 99% of ignorant people test negative
If a person tests positive – what is the probability that they know the fact?

40 Naïve approach…
We have 10,000 representative people.
100 know the fact/item, 9,900 do not.
We test them all:
– 99 knowing people test as knowing
– 9,801 not-knowing people test as not knowing
– But 99 not-knowing people test as knowing
Testing positive (knowing) – equally likely to know or not = 50%

41 Tree diagram
10,000 ppl
  1% know (100 ppl)
    99% test to know (99 ppl)
    1% test not to know (1 person)
  99% do not know (9,900 ppl)
    1% test to know (99 ppl)
    99% test not to know (9,801 ppl)

42 Relation between probabilities
For outcomes x and y there are probabilities p(x) and p(y) that either happened.
If there’s a connection, then the joint probability – that both happen – is p(x,y).
Or x happens given y happens = p(x|y), or vice versa; then:
– p(x|y)*p(y) = p(x,y) = p(y|x)*p(x)
So p(y|x) = p(x|y)*p(y)/p(x) (Bayes’ Law)
E.g. p(know|+ve) = p(+ve|know)*p(know)/p(+ve) = (.99*.01)/(.99*.01 + .01*.99) = 0.5
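
The same arithmetic, written out in R with the numbers from the example above:
p_know           <- 0.01
p_pos_given_know <- 0.99
p_pos_given_not  <- 0.01
p_pos <- p_pos_given_know * p_know + p_pos_given_not * (1 - p_know)   # total probability of a +ve test
p_pos_given_know * p_know / p_pos                                     # p(know | +ve) = 0.5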

43 How do you use it?
If the population contains x, what is the chance that y is true?
p(SPAM|word) = p(word|SPAM)*p(SPAM)/p(word)
Base this on data:
– p(spam) counts the proportion of spam versus not
– p(word|spam) counts the prevalence of spam containing the ‘word’
– p(word|!spam) counts the prevalence of non-spam containing the ‘word’
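
A toy sketch of that bookkeeping with made-up counts (the numbers are purely illustrative):
n_spam <- 400; n_ham <- 600              # hypothetical message counts
n_word_spam <- 120; n_word_ham <- 30     # hypothetical counts of messages containing the word
p_spam            <- n_spam / (n_spam + n_ham)
p_word            <- (n_word_spam + n_word_ham) / (n_spam + n_ham)
p_word_given_spam <- n_word_spam / n_spam
p_word_given_spam * p_spam / p_word      # p(SPAM | word) = 0.8 for these counts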

44 Or…
What is the probability that you are in one class (i) over another class (j), given another factor (X)?
Invoke Bayes: maximize p(X|Ci)p(Ci)/p(X) (p(X) is ~constant, and the p(Ci) are taken as equal if not known).
So: conditional independence – (the factorization shown on the slide is reproduced below)
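
Presumably what the missing figure showed is the standard naïve Bayes factorization, in the notation of the next slide:
P(X | C_i) = P(x_1 | C_i) × P(x_2 | C_i) × … × P(x_n | C_i)
i.e. each attribute x_k is assumed independent of the others, given the class C_i.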

45
P(x_k | C_i) is estimated from the training samples:
– Categorical: estimate P(x_k | C_i) as the percentage of samples of class i with value x_k. Training involves counting the percentage of occurrence of each possible value for each class.
– Numeric: the actual form of the density function is generally not known, so a “normal” density (i.e. distribution) is often assumed.

46 Digging into iris
require(e1071)
classifier <- naiveBayes(iris[, 1:4], iris[, 5])
table(predict(classifier, iris[, -5]), iris[, 5], dnn = list('predicted', 'actual'))
classifier$apriori
classifier$tables$Petal.Length
# the means and sds below are the per-species Petal.Length values from classifier$tables$Petal.Length
plot(function(x) dnorm(x, 1.462, 0.1736640), 0, 8, col = "red", main = "Petal length distribution for the 3 different species")
curve(dnorm(x, 4.260, 0.4699110), add = TRUE, col = "blue")
curve(dnorm(x, 5.552, 0.5518947), add = TRUE, col = "green")

47 (figure only)

48 Bayes
> cl <- kmeans(iris[, 1:4], 3)
> table(cl$cluster, iris[, 5])
    setosa versicolor virginica
  2      0          2        36
  1      0         48        14
  3     50          0         0
#
> m <- naiveBayes(iris[, 1:4], iris[, 5])
> table(predict(m, iris[, 1:4]), iris[, 5])
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         3
  virginica       0          3        47
pairs(iris[1:4], main = "Iris Data (red=setosa, green=versicolor, blue=virginica)", pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])

49 Ex: Classification – Bayes
Retrieve the abalone.csv dataset. Predicting the age of abalone from physical measurements: perform naïve Bayes classification to get predictors for Age (Rings). Interpret. Discuss on Friday. (A possible starting point is sketched below.)
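
One possible starting point (file name as given in the exercise; column names such as Rings are assumed from the standard UCI abalone description):
require(e1071)
abalone <- read.csv("abalone.csv")
abalone$Rings <- as.factor(abalone$Rings)                  # naiveBayes needs a categorical response
ab_mdl  <- naiveBayes(Rings ~ ., data = abalone)
table(predict(ab_mdl, abalone[, names(abalone) != "Rings"]), abalone$Rings)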

50 Using a contingency table
> data(Titanic)
> mdl <- naiveBayes(Survived ~ ., data = Titanic)
> mdl
Naive Bayes Classifier for Discrete Predictors
Call: naiveBayes.formula(formula = Survived ~ ., data = Titanic)
A-priori probabilities:
Survived
      No      Yes
0.676965 0.323035
Conditional probabilities:
        Class
Survived        1st        2nd        3rd       Crew
     No  0.08187919 0.11208054 0.35436242 0.45167785
     Yes 0.28551336 0.16596343 0.25035162 0.29817159
        Sex
Survived       Male     Female
     No  0.91543624 0.08456376
     Yes 0.51617440 0.48382560
        Age
Survived      Child      Adult
     No  0.03489933 0.96510067
     Yes 0.08016878 0.91983122

51 Using a contingency table
> predict(mdl, as.data.frame(Titanic)[, 1:3])
 [1] Yes No No No Yes Yes Yes Yes No No No No Yes Yes Yes Yes Yes No No No Yes Yes Yes Yes No
[26] No No No Yes Yes Yes Yes
Levels: No Yes
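
To see the posterior probabilities behind those hard classifications, e1071’s predict method also accepts type = "raw" (same model and newdata as above):
predict(mdl, as.data.frame(Titanic)[, 1:3], type = "raw")   # per-cell P(No) and P(Yes)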

52 At this point…
You may realize that the inter-relationships among classification and clustering methods, at an absolute and relative level (i.e. hierarchical -> trees…), are COMPLEX…
– Trees are interesting from a decision perspective: if this or that, then this….
Beyond just distance measures: from clustering (kmeans) to probabilities (Bayesian).
And there are so many ways to visualize them…

