Peter Fox Data Analytics ITWS-4600/ITWS-6600/MATP-4450/CSCI-4960


Interpretation of lab assignment 2, Weighted kNN, and introduction to Bayesian methods Peter Fox Data Analytics ITWS-4600/ITWS-6600/MATP-4450/CSCI-4960 Group 2 Module 6, September 25, 2018

Regression (1) Retrieve this dataset: dataset_multipleRegression.csv Using the unemployment rate (UNEM) and number of spring high school graduates (HGRAD), predict the fall enrollment (ROLL) for this year by knowing that UNEM=7% and HGRAD=90,000. Repeat and add per capita income (INC) to the model. Predict ROLL if INC=$25,000 Summarize and compare the two models. Comment on significance, anything else?
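The two regressions in this exercise can be sketched as follows. This is a minimal sketch, not a model answer: it assumes the CSV has columns named ROLL, UNEM, HGRAD and INC, and that UNEM is stored in percentage points (so 7% is entered as 7). If the file is not in the working directory, the code falls back to simulated placeholder data so that it still runs; those simulated numbers are invented for illustration only.

```r
# Load the lab data if present; otherwise simulate placeholder data
# (the simulated values are NOT the real enrollment data).
path <- "dataset_multipleRegression.csv"
if (file.exists(path)) {
  d <- read.csv(path)
} else {
  set.seed(1)
  n <- 30
  d <- data.frame(UNEM  = runif(n, 5, 10),
                  HGRAD = runif(n, 9000, 20000),
                  INC   = runif(n, 1500, 3500))
  d$ROLL <- 500 * d$UNEM + 0.5 * d$HGRAD + 2 * d$INC + rnorm(n, sd = 500)
}

# Model 1: enrollment from unemployment rate and number of graduates
m1 <- lm(ROLL ~ UNEM + HGRAD, data = d)
# Model 2: add per capita income
m2 <- lm(ROLL ~ UNEM + HGRAD + INC, data = d)

# Values given in the exercise
new <- data.frame(UNEM = 7, HGRAD = 90000, INC = 25000)
pred1 <- predict(m1, newdata = new)
pred2 <- predict(m2, newdata = new)

summary(m1)    # coefficient p-values and R-squared for model 1
summary(m2)    # same for model 2
anova(m1, m2)  # F-test: does adding INC improve the fit?
```

If HGRAD=90,000 or INC=$25,000 lies outside the range observed in the data, the prediction is an extrapolation and should be interpreted with caution.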

Object of class lm: An object of class "lm" is a list containing at least the following components:
coefficients: a named vector of coefficients
residuals: the residuals, that is, response minus fitted values
fitted.values: the fitted mean values
rank: the numeric rank of the fitted linear model
weights: (only for weighted fits) the specified weights
df.residual: the residual degrees of freedom
call: the matched call
terms: the terms object used
contrasts: (only where relevant) the contrasts used
xlevels: (only where relevant) a record of the levels of the factors used in fitting
offset: the offset used (missing if none were used)
y: if requested, the response used
x: if requested, the model matrix used
model: if requested (the default), the model frame used

Classification (2) Retrieve the abalone.csv dataset Predicting the age of abalone from physical measurements. The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope: a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Perform knn classification to get predictors for Age (Rings). Interpretation not required.

Classification Exercises (group1/lab2_knn1.R)
> library(class) # provides knn()
> nyt1<-read.csv("nyt1.csv")
> nyt1<-nyt1[which(nyt1$Impressions>0 & nyt1$Clicks>0 & nyt1$Age>0),]
> nnyt1<-dim(nyt1)[1]
> # shrink it down!
> sampling.rate=0.9
> num.test.set.labels=nnyt1*(1.-sampling.rate)
> training <-sample(1:nnyt1,sampling.rate*nnyt1, replace=FALSE)
> train<-subset(nyt1[training,],select=c(Age,Impressions))
> testing<-setdiff(1:nnyt1,training)
> test<-subset(nyt1[testing,],select=c(Age,Impressions))
> cg<-nyt1$Gender[training]
> true.labels<-nyt1$Gender[testing]
> classif<-knn(train,test,cg,k=5)
> classif
> attributes(.Last.value)

K Nearest Neighbors (classification)
> head(true.labels)
[1] 1 0 0 1 1 0
> head(classif)
[1] 1 1 1 1 0 0
Levels: 0 1
> ncorrect<-true.labels==classif
> table(ncorrect)["TRUE"] # or
> length(which(ncorrect))
What do you conclude?
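One way to turn the comparison of true.labels and classif into a number is a confusion matrix and an accuracy score. The sketch below uses only the six labels printed by head() on this slide, so the result says nothing about the full test set; in the lab the same two lines would be applied to the complete vectors.

```r
# First six labels as printed on the slide
true.labels <- factor(c(1, 0, 0, 1, 1, 0), levels = c(0, 1))
classif     <- factor(c(1, 1, 1, 1, 0, 0), levels = c(0, 1))

# Confusion matrix: rows = predicted, columns = actual
conf <- table(predicted = classif, actual = true.labels)
conf

# Accuracy = correct predictions / all predictions
accuracy <- sum(diag(conf)) / sum(conf)
accuracy
```

On these six values the accuracy is 0.5, which is no better than guessing for a two-class problem. That is a hint to check the full test set before concluding anything.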

For the abalone assignment How many levels (of Rings)? What may make more sense for classification? How many variables to include? Value of k? Categorical variable?
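One possible answer to "what may make more sense for classification" is to bin Rings into a few age groups rather than treating every ring count as its own class. The sketch below shows the mechanics with cut(); the ring values, the break points and the group labels are all illustrative assumptions, not values prescribed by the assignment.

```r
# Hypothetical ring counts; in the lab these would come from abalone$Rings
rings <- c(3, 8, 9, 11, 20)

# Break points and labels are illustrative assumptions
age_group <- cut(rings,
                 breaks = c(-Inf, 8, 11, Inf),
                 labels = c("young", "adult", "old"))

table(age_group)  # number of samples per class
```

The resulting factor can then serve as the class argument for knn, with far fewer levels than the raw ring counts.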

Clustering (3) The Iris dataset (in R use data("iris") to load it). The 5th column is the species, and you want to find how many clusters there are without using that information. Create a new data frame and remove the fifth column. Apply kmeans (you choose k) with 1000 iterations. Use table(iris[,5],<your clustering>) to assess your results.

Bayes
> cl <- kmeans(iris[,1:4], 3)
> table(cl$cluster, iris[,5])
    setosa versicolor virginica
  2      0          2        36
  1      0         48        14
  3     50          0         0
# pairs(iris[1:4],main="Iris Data (red=setosa,green=versicolor,blue=virginica)", pch=21, bg=c("red","green3","blue")[unclass(iris$Species)])

Return object (cl in last slide)
cluster: A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
centers: A matrix of cluster centres.
totss: The total sum of squares.
withinss: Vector of within-cluster sum of squares, one component per cluster.
tot.withinss: Total within-cluster sum of squares, i.e., sum(withinss).
betweenss: The between-cluster sum of squares, i.e. totss-tot.withinss.
size: The number of points in each cluster.
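The components of the return object can be inspected directly. The sketch below runs kmeans on the four numeric iris columns with 1000 iterations, as the exercise asks; the seed and nstart = 10 are assumptions added here to make the run repeatable and less sensitive to the random starting centres.

```r
set.seed(42)                  # k-means starts from random centres
x  <- iris[, 1:4]             # drop the species column
cl <- kmeans(x, centers = 3, iter.max = 1000, nstart = 10)

table(iris[, 5], cl$cluster)  # species versus assigned cluster
cl$size                       # number of points in each cluster
cl$centers                    # one row of feature means per cluster

# The sums of squares decompose as totss = tot.withinss + betweenss
all.equal(cl$totss, cl$tot.withinss + cl$betweenss)
```

Cluster numbers are arbitrary labels, so rows of the table will not necessarily line up with species in any fixed order from one run to the next.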

Weighted knn But more importantly learning to vary your input choices, optimize, compare, choose

Ionosphere: group2/lab2_kknn2.R
require(kknn)
data(ionosphere)
ionosphere.learn <- ionosphere[1:200,]
ionosphere.valid <- ionosphere[-c(1:200),]
fit.kknn <- kknn(class ~ ., ionosphere.learn, ionosphere.valid)
table(ionosphere.valid$class, fit.kknn$fit)
# vary kernel
(fit.train1 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15, kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 1))
table(predict(fit.train1, ionosphere.valid), ionosphere.valid$class)
# alter distance
(fit.train2 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15, kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 2))
table(predict(fit.train2, ionosphere.valid), ionosphere.valid$class)

Results
ionosphere.learn <- ionosphere[1:200,] # convenience sampling!!!!
ionosphere.valid <- ionosphere[-c(1:200),]
fit.kknn <- kknn(class ~ ., ionosphere.learn, ionosphere.valid)
table(ionosphere.valid$class, fit.kknn$fit)
      b   g
  b  19   8
  g   2 122

> (fit.train1 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15,
+ kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 1))
Call:
train.kknn(formula = class ~ ., data = ionosphere.learn, kmax = 15, distance = 1, kernel = c("triangular", "rectangular", "epanechnikov", "optimal"))
Type of response variable: nominal
Minimal misclassification: 0.12
Best kernel: rectangular
Best k: 2
> table(predict(fit.train1, ionosphere.valid), ionosphere.valid$class)
      b   g
  b  25   4
  g   2 120

> (fit.train2 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15,
+ kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 2))
Call:
train.kknn(formula = class ~ ., data = ionosphere.learn, kmax = 15, distance = 2, kernel = c("triangular", "rectangular", "epanechnikov", "optimal"))
Type of response variable: nominal
Minimal misclassification: 0.12
Best kernel: rectangular
Best k: 2
> table(predict(fit.train2, ionosphere.valid), ionosphere.valid$class)
      b   g
  b  20   5
  g   7 119
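To compare the three ionosphere runs, the validation accuracy can be read off each confusion table as the diagonal sum divided by the total. The sketch below re-enters the counts printed on the slides by hand, so it does not need the kknn package; the list names are labels chosen here for illustration.

```r
# Counts copied from the three 2 x 2 tables on the slides (classes b and g)
tabs <- list(
  default   = matrix(c(19, 8, 2, 122), nrow = 2, byrow = TRUE),
  distance1 = matrix(c(25, 4, 2, 120), nrow = 2, byrow = TRUE),
  distance2 = matrix(c(20, 5, 7, 119), nrow = 2, byrow = TRUE)
)

# Accuracy = correctly classified (diagonal) / all validation cases
acc <- sapply(tabs, function(t) sum(diag(t)) / sum(t))
round(acc, 3)
```

The first table is laid out as actual by fitted and the other two as predicted by actual, but the diagonal, and therefore the accuracy, is unaffected by that transposition. Each table sums to the 151 validation cases.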

However… there is more

Naïve Bayes – what is it? Example: testing for a specific item of knowledge that 1% of the population has been informed of (don’t ask how). An imperfect test: 99% of knowledgeable people test positive 99% of ignorant people test negative If a person tests positive – what is the probability that they know the fact?

Naïve approach… We have 10,000 representative people: 100 know the fact/item, 9,900 do not. We test them all: 99 knowing people test as knowing (and 1 tests as not knowing); 9,801 not-knowing people test as not knowing; but 99 not-knowing people test as knowing. So of the 198 people who test positive (knowing), a person is equally likely to know or not: 99/198 = 50%

Tree diagram (conditional …)
10000 ppl
  1% know (100 ppl)
    99% test to know (99 ppl)
    1% test not to know (1 person)
  99% do not know (9900 ppl)
    1% test to know (99 ppl)
    99% test not to know (9801 ppl)

Relation between probabilities
For outcomes x and y there are probabilities p(x) and p(y) that each happens.
If there’s a connection, the joint probability that both happen is p(x,y).
The conditional probability that x happens given that y happens is p(x|y), and vice versa, so: p(x|y)*p(y) = p(x,y) = p(y|x)*p(x)
So p(y|x) = p(x|y)*p(y)/p(x) (Bayes’ Law)
E.g. p(know|+ve) = p(+ve|know)*p(know)/p(+ve) = (.99*.01)/(.99*.01+.01*.99) = 0.5
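The worked example can be checked numerically. The sketch below computes p(know|+ve) twice: once from Bayes' Law using the probabilities, and once from head counts in a population of 10,000. The variable names (sensitivity, specificity and so on) are labels chosen here, not terms used on the slides.

```r
p_know      <- 0.01  # prior: 1% of the population know the fact
sensitivity <- 0.99  # p(+ve | know)
specificity <- 0.99  # p(-ve | do not know)

# Total probability of a positive test: p(+ve)
p_pos <- sensitivity * p_know + (1 - specificity) * (1 - p_know)

# Bayes' Law: p(know | +ve)
posterior <- sensitivity * p_know / p_pos

# The same answer from counts
n <- 10000
true_pos  <- n * p_know * sensitivity              # knowers testing positive
false_pos <- n * (1 - p_know) * (1 - specificity)  # non-knowers testing positive
posterior_counts <- true_pos / (true_pos + false_pos)

c(posterior, posterior_counts)
```

Changing p_know shows how strongly the prior drives the answer: with a rarer fact, the same test produces mostly false positives.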

How do you use it? If the population contains x what is the chance that y is true? p(SPAM|word)=p(word|SPAM)*p(SPAM)/p(word) Base this on data: p(spam) counts proportion of spam versus not p(word|spam) counts prevalence of spam containing the ‘word’ p(word|!spam) counts prevalence of non-spam containing the ‘word’
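The counting recipe for the spam example can be sketched as below. All message counts are hypothetical numbers invented for illustration; only the formula comes from the slide.

```r
# Hypothetical counts from a labelled mail corpus (invented for illustration)
n_spam <- 40               # spam messages
n_ham  <- 60               # non-spam messages
spam_with_word <- 30       # spam messages containing the word
ham_with_word  <- 6        # non-spam messages containing the word

p_spam      <- n_spam / (n_spam + n_ham)   # p(spam)
p_word_spam <- spam_with_word / n_spam     # p(word | spam)
p_word_ham  <- ham_with_word / n_ham       # p(word | !spam)

# p(word) by total probability over spam and non-spam
p_word <- p_word_spam * p_spam + p_word_ham * (1 - p_spam)

# Bayes' Law: p(spam | word)
p_spam_word <- p_word_spam * p_spam / p_word
p_spam_word
```

With these counts the result equals 30/36, which is simply the fraction of word-containing messages that are spam; Bayes' Law and direct counting agree, as they must.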

Or.. What is the probability that you are in one class (i) over another class (j) given another factor (X)? Invoke Bayes: Maximize p(X|Ci)p(Ci)/p(X) (p(X) is constant across classes, and the priors p(Ci) are taken as equal if not known). So: assume conditional independence of the attributes given the class.
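The formula that followed "conditional indep" did not survive in this transcript. What is presumably meant is the standard naive Bayes assumption, written here with n for the number of attributes in X = (x_1, ..., x_n); the symbol n and the hat notation are choices made here, not taken from the slide.

```latex
P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)
\qquad\Longrightarrow\qquad
\hat{C} = \arg\max_{i} \; P(C_i) \prod_{k=1}^{n} P(x_k \mid C_i)
```

Because p(X) is the same for every class, it drops out of the maximization, which is why only the prior and the per-attribute likelihoods remain.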

P(xk | Ci) is estimated from the training samples Categorical: Estimate P(xk | Ci) as percentage of samples of class i with value xk Training involves counting percentage of occurrence of each possible value for each class Numeric: Actual form of density function is generally not known, so “normal” density (i.e. distribution) is often assumed
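For numeric attributes, the "normal density" estimate amounts to computing a mean and standard deviation per class and per attribute, then multiplying the resulting densities. The sketch below does this by hand in base R on iris. It is an illustration of the idea under the conditional independence assumption, not the implementation used by any particular package.

```r
X <- iris[, 1:4]
y <- iris$Species
classes <- levels(y)

# Per-class mean and standard deviation of each attribute
# (rows = classes, columns = attributes)
mu    <- sapply(X, function(col) tapply(col, y, mean))
sigma <- sapply(X, function(col) tapply(col, y, sd))

# Class priors estimated as class proportions
prior <- setNames(as.numeric(table(y)) / length(y), classes)

# Log of p(Ci) * prod_k p(xk | Ci) for every row and class
log_post <- sapply(classes, function(cl) {
  ll <- sapply(names(X), function(f) {
    dnorm(X[[f]], mean = mu[cl, f], sd = sigma[cl, f], log = TRUE)
  })
  rowSums(ll) + log(prior[[cl]])
})

# Predict the class with the largest score
pred <- factor(classes[max.col(log_post, ties.method = "first")],
               levels = classes)
table(predicted = pred, actual = y)
```

Working on the log scale replaces the product of densities by a sum, which avoids numerical underflow when there are many attributes.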

Digging into iris
library(e1071) # provides naiveBayes()
classifier<-naiveBayes(iris[,1:4], iris[,5])
table(predict(classifier, iris[,-5]), iris[,5], dnn=list('predicted','actual'))
classifier$apriori
classifier$tables$Petal.Length
plot(function(x) dnorm(x, 1.462, 0.1736640), 0, 8, col="red", main="Petal length distribution for the 3 different species")
curve(dnorm(x, 4.260, 0.4699110), add=TRUE, col="blue")
curve(dnorm(x, 5.552, 0.5518947 ), add=TRUE, col = "green")

Bayes
> cl <- kmeans(iris[,1:4], 3)
> table(cl$cluster, iris[,5])
    setosa versicolor virginica
  2      0          2        36
  1      0         48        14
  3     50          0         0
#
> m <- naiveBayes(iris[,1:4], iris[,5])
> table(predict(m, iris[,1:4]), iris[,5])
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         3
  virginica       0          3        47
pairs(iris[1:4],main="Iris Data (red=setosa,green=versicolor,blue=virginica)", pch=21, bg=c("red","green3","blue")[unclass(iris$Species)])

Weighted KNN…
require(kknn)
data(iris)
m <- dim(iris)[1]
val <- sample(1:m, size = round(m/3), replace = FALSE, prob = rep(1/m, m))
iris.learn <- iris[-val,]
iris.valid <- iris[val,]
iris.kknn <- kknn(Species~., iris.learn, iris.valid, distance = 1, kernel = "triangular")
summary(iris.kknn)
fit <- fitted(iris.kknn)
table(iris.valid$Species, fit)
pcol <- as.character(as.numeric(iris.valid$Species))
pairs(iris.valid[1:4], pch = pcol, col = c("green3", "red")[(iris.valid$Species != fit)+1])

summary
Call:
kknn(formula = Species ~ ., train = iris.learn, test = iris.valid, distance = 1, kernel = "triangular")
Response: "nominal"
          fit prob.setosa prob.versicolor prob.virginica
1  versicolor           0      1.00000000     0.00000000
2  versicolor           0      1.00000000     0.00000000
3  versicolor           0      0.91553003     0.08446997
4      setosa           1      0.00000000     0.00000000
5   virginica           0      0.00000000     1.00000000
6   virginica           0      0.00000000     1.00000000
7      setosa           1      0.00000000     0.00000000
8  versicolor           0      0.66860033     0.33139967
9   virginica           0      0.22534461     0.77465539
10 versicolor           0      0.79921042     0.20078958
    virginica           0      0.00000000     1.00000000
......

table
            fit
             setosa versicolor virginica
  setosa         15          0         0
  versicolor      0         19         1
  virginica       0          2        13

And you get a contingency table
> data(Titanic)
> mdl <- naiveBayes(Survived ~ ., data = Titanic)
> mdl
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.formula(formula = Survived ~ ., data = Titanic)
A-priori probabilities:
Survived
      No      Yes
0.676965 0.323035
Conditional probabilities:
        Class
Survived        1st        2nd        3rd       Crew
     No  0.08187919 0.11208054 0.35436242 0.45167785
     Yes 0.28551336 0.16596343 0.25035162 0.29817159
        Sex
Survived       Male     Female
     No  0.91543624 0.08456376
     Yes 0.51617440 0.48382560
        Age
Survived      Child      Adult
     No  0.03489933 0.96510067
     Yes 0.08016878 0.91983122
group2/lab2_nbayes1.R

Predict
> predict(mdl, as.data.frame(Titanic)[,1:3])
 [1] Yes No  No  No  Yes Yes Yes Yes No  No  No  No  Yes Yes Yes Yes Yes No  No  No  Yes Yes Yes Yes No
[26] No  No  No  Yes Yes Yes Yes
Levels: No Yes

http://www.ugrad.stat.ubc.ca/R/library/mlbench/html/HouseVotes84.html
require(mlbench)
data(HouseVotes84)
model <- naiveBayes(Class ~ ., data = HouseVotes84)
predict(model, HouseVotes84[1:10,-1])
predict(model, HouseVotes84[1:10,-1], type = "raw")
pred <- predict(model, HouseVotes84[,-1])
table(pred, HouseVotes84$Class)

Exercise for you
> data(HairEyeColor)
> mosaicplot(HairEyeColor)
> margin.table(HairEyeColor,3)
Sex
  Male Female
   279    313
> margin.table(HairEyeColor,c(1,3))
       Sex
Hair    Male Female
  Black   56     52
  Brown  143    143
  Red     34     37
  Blond   46     81
How would you construct a naïve Bayes classifier and test it?

Ex: Classification Bayes Retrieve the abalone.csv dataset Predicting the age of abalone from physical measurements. Perform naivebayes classification to get predictors for Age (Rings). Interpret. Discuss in lab or on LMS.

At this point… You may realize the inter-relation among classifications and clustering methods, at an absolute and relative level (i.e. hierarchical -> trees…) is COMPLEX… Trees are interesting from a decision perspective: if this or that, then this…. Their “conditional” nature, i.e. branching based on threshold criteria, probability, significance…. Beyond just distance measures: weighting, clustering (kmeans) to probabilities (Bayesian) And, so many ways to visualize them…