Classification and clustering - interpreting and exploring data


Classification and clustering - interpreting and exploring data Peter Fox Data Analytics ITWS-4600/ITWS-6600/MATP-4450/CSCI-4960 Group 2, Module 5, September 18, 2018

http://www.mathworks.com/help/examples/stats/ClassifyUsingKNearestNeighborsExample_01.png

K-nearest neighbors (knn) Can be used for both regression and classification ("non-parametric"). It is supervised, i.e. it uses a training set and a test set. KNN classifies an object based on the closest training examples in the feature space: the object is assigned the class held by the majority of its K nearest neighbors, where K is a positive integer. The neighbors are taken from a set of objects for which the correct classification is known. Euclidean distance is the usual choice, though other distance measures such as the Manhattan distance could in principle be used instead. ~ http://www.codeproject.com/Articles/32970/K-Nearest-Neighbor-Algorithm-Implementation-and-Ov

Algorithm Computing the K-nearest neighbors works as follows (a sketch in R follows below): Determine the parameter K = number of nearest neighbors beforehand; this value is up to you. Calculate the distance between the query instance and all the training samples; any distance measure can be used. Sort the distances and take the K training samples with the smallest distances (the K-th minimum distance and below). Since this is supervised learning, look up the known categories of those K training samples. Use the majority category among the nearest neighbors as the prediction. http://www.codeproject.com/Articles/32970/K-Nearest-Neighbor-Algorithm-Implementation-and-Ov
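The same five steps can be written out directly in R. A minimal sketch; the iris data, plain Euclidean distance, the query point and k = 5 are illustrative choices, not part of the slides:

# Steps 1-5 of the KNN recipe above, written by hand (illustrative only)
knn_predict <- function(train_x, train_y, query, k = 5) {
  # Step 2: distance from the query to every training sample (Euclidean)
  d <- sqrt(rowSums(sweep(train_x, 2, query)^2))
  # Steps 3-4: take the k nearest samples and collect their known labels
  nearest <- train_y[order(d)[1:k]]
  # Step 5: majority vote among the k neighbors
  names(which.max(table(nearest)))
}

data(iris)
train_x <- as.matrix(iris[, 1:4])
train_y <- iris$Species
knn_predict(train_x, train_y, query = c(6.0, 3.0, 4.8, 1.8), k = 5)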

Distance metrics Euclidean distance is the most commonly used distance measure; when people talk about "distance", this is usually what they mean. Euclidean distance, or simply 'distance', is the square root of the sum of squared differences between the coordinates of a pair of objects, and follows from the Pythagorean theorem. The taxicab metric is also known as rectilinear distance, L1 distance or L1 norm, city block distance, Manhattan distance, or Manhattan length, with corresponding variations in the name of the geometry. It represents the distance between points in a city road grid and sums the absolute differences between the coordinates of a pair of objects. http://www.codeproject.com/Articles/32292/Quantitative-Variable-Distances

More generally The general metric for distance is the Minkowski distance, d(x,y) = ( sum_i |x_i - y_i|^lambda )^(1/lambda). When lambda equals 1 it becomes the city block distance, and when lambda equals 2 it becomes the Euclidean distance. The special case lambda equal to infinity (taking the limit) is the Chebyshev distance, also called the maximum value distance: the distance between two vectors is the greatest of the absolute differences along any single coordinate dimension. http://www.codeproject.com/Articles/32292/Quantitative-Variable-Distances
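For concreteness, a small sketch of the three special cases using base R's dist(); the two example points are made up for illustration:

x <- rbind(c(1, 2, 3), c(4, 0, 7))    # two arbitrary points in 3-D

dist(x, method = "minkowski", p = 1)  # lambda = 1: Manhattan / city block
dist(x, method = "minkowski", p = 2)  # lambda = 2: Euclidean
dist(x, method = "maximum")           # lambda -> infinity: Chebyshev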

http://www.mathworks.com/help/examples/stats/ClassifyUsingKNearestNeighborsExample_01.png

Choice of k? Don't you hate it when the instructions read: "the choice of 'k' is all up to you"? Loop over different k and evaluate the results, e.g. as in the sketch below… http://cgm.cs.mcgill.ca/~soss/cs644/projects/perrier/Image25.gif
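One way to "loop over different k": hold out part of the data, classify it for several values of k, and compare accuracy. A sketch using the class package and iris; the 70/30 split, the seed, and the range of k are arbitrary choices:

library(class)
data(iris)
set.seed(1)
idx   <- sample(1:nrow(iris), size = 0.7 * nrow(iris))   # 70% for training
train <- iris[idx, 1:4];    test  <- iris[-idx, 1:4]
cl    <- iris$Species[idx]; truth <- iris$Species[-idx]

for (k in c(1, 3, 5, 7, 9, 15)) {
  pred <- knn(train, test, cl, k = k)
  cat("k =", k, " accuracy =", mean(pred == truth), "\n")
}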

What does "Near" mean? More on this in the next topic, but consider: DISTANCE – and what does that mean? RANGE – what is acceptable or expected? SHAPE – i.e. the form of the data.

Training and Testing We are going to do much more on this going forward… Regression, as we have used it so far (with no held-out data), uses all the data to 'train' the model, i.e. to calculate the coefficients; the residuals are the differences between the actual and modeled values for all the data. Supervision means that not all the data is used to train, because you want to test on the held-out (untrained) set before you predict for new values. What is the 'sampling' strategy for training? (1b)

Summing up 'knn'. Advantages: robust to noisy training data (especially if the inverse square of the weighted distance is used as the "distance"); effective if the training data is large. Disadvantages: the value of the parameter K (number of nearest neighbors) must be chosen; with distance-based learning it is not clear which type of distance, or which attributes, produce the best results (should we use all attributes or only certain ones?); computation cost is high because the distance from each query instance to every training sample must be computed, although indexing (e.g. a K-D tree) can reduce this cost. http://www.codeproject.com/Articles/32970/K-Nearest-Neighbor-Algorithm-Implementation-and-Ov

K-means Unsupervised classification, i.e. no classes are known beforehand. Types: Hierarchical: successively determine new clusters from previously determined clusters (parent/child clusters). Partitional: establish all clusters at once, at the same level.

http://www.mathworks.com/help/examples/stats/ClassifyUsingKNearestNeighborsExample_01.png

Distance Measure Clustering is about finding “similarity”. To find how similar two objects are, one needs a “distance” measure. Similar objects (same cluster) should be close to one another (short distance).

Distance Measure There are many ways to define a distance measure. Some elements may be close according to one distance measure and further away according to another. Selecting a good distance measure is an important step in clustering.

Some Distance Functions (again) Euclidean distance (2-norm): the most commonly used, also called "crow" (as the crow flies) distance: d(x,y) = sqrt( sum_i (x_i - y_i)^2 ). Manhattan distance (1-norm): also called "taxicab distance": d(x,y) = sum_i |x_i - y_i|. In general, the Minkowski metric (p-norm): d(x,y) = ( sum_i |x_i - y_i|^p )^(1/p).

K-Means Clustering Separate the objects (data points) into K clusters. Cluster center (centroid) = the average of all the data points in the cluster. Assign each data point to the cluster whose centroid is nearest (using the distance function).

K-Means Algorithm Place K points into the space of the objects being clustered. They represent the initial group centroids. Assign each object to the group that has the closest centroid. Recalculate the positions of the K centroids. Repeat Steps 2 & 3 until the group centroids no longer move.
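A minimal kmeans() sketch of these steps on the iris measurements; K = 3, the seed, and nstart = 20 are illustrative choices:

data(iris)
set.seed(42)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)   # steps 1-4, iterated internally
km$centers                       # the final centroids
table(km$cluster, iris$Species)  # how the clusters line up with the known species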

K-Means Algorithm: Example Output

Describe v. Predict

Predict = Decide

K-means "Age","Gender","Impressions","Clicks","Signed_In" 36,0,3,0,1 73,1,3,0,1 30,0,3,0,1 49,1,3,0,1 47,1,11,0,1 47,0,11,1,1 (nyt datasets) Model e.g.: If Age<45 and Impressions >5 then Gender=female (0) Age ranges? 41-45, 46-50, etc?

Contingency tables A contingency table displays the (multivariate) frequency distribution of the variables; tests for significance (not now).
> table(nyt1$Impressions,nyt1$Gender)
        0    1
  1    69   85
  2   389  395
  3   975  937
  4  1496 1572
  5  1897 2012
  6  1822 1927
  7  1525 1696
  8  1142 1203
  9   722  711
  10  366  400
  11  214  200
  12   86  101
  13   41   43
  14   10    9
  15    5    7
  16    0    4
  17    0    1
> table(nyt1$Clicks,nyt1$Gender)
        0     1
  1 10335 10846
  2   415   440
  3     9    17

Classification Exercises (group1/lab2_knn1.R)
> library(class)   # knn() lives in the class package
> nyt1<-read.csv("nyt1.csv")
> nyt1<-nyt1[which(nyt1$Impressions>0 & nyt1$Clicks>0 & nyt1$Age>0),]
> nnyt1<-dim(nyt1)[1]   # shrink it down!
> sampling.rate=0.9
> num.test.set.labels=nnyt1*(1.-sampling.rate)
> training <-sample(1:nnyt1,sampling.rate*nnyt1, replace=FALSE)
> train<-subset(nyt1[training,],select=c(Age,Impressions))
> testing<-setdiff(1:nnyt1,training)
> test<-subset(nyt1[testing,],select=c(Age,Impressions))
> cg<-nyt1$Gender[training]
> true.labels<-nyt1$Gender[testing]
> classif<-knn(train,test,cg,k=5)
> classif
> attributes(.Last.value)   # interpretation to come!

K Nearest Neighbors (classification)
> nyt1<-read.csv("nyt1.csv")
… from week 3 lab slides or scripts
> classif<-knn(train,test,cg,k=5)
> head(true.labels)
[1] 1 0 0 1 1 0
> head(classif)
[1] 1 1 1 1 0 0
Levels: 0 1
> ncorrect<-true.labels==classif
> table(ncorrect)["TRUE"]   # or
> length(which(ncorrect))
What do you conclude?
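One quick way to summarise the result, continuing the variables from the script above (a sketch, not part of the original lab):

table(true.labels, classif)   # confusion matrix: rows = truth, columns = prediction
mean(true.labels == classif)  # overall fraction classified correctly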

Weighted KNN…
require(kknn)
data(iris)
m <- dim(iris)[1]
val <- sample(1:m, size = round(m/3), replace = FALSE, prob = rep(1/m, m))
iris.learn <- iris[-val,]
iris.valid <- iris[val,]
iris.kknn <- kknn(Species~., iris.learn, iris.valid, distance = 1, kernel = "triangular")
summary(iris.kknn)
fit <- fitted(iris.kknn)
table(iris.valid$Species, fit)
pcol <- as.character(as.numeric(iris.valid$Species))
pairs(iris.valid[1:4], pch = pcol, col = c("green3", "red")[(iris.valid$Species != fit)+1])

summary
Call:
kknn(formula = Species ~ ., train = iris.learn, test = iris.valid, distance = 1, kernel = "triangular")

Response: "nominal"
          fit prob.setosa prob.versicolor prob.virginica
1  versicolor           0      1.00000000     0.00000000
2  versicolor           0      1.00000000     0.00000000
3  versicolor           0      0.91553003     0.08446997
4      setosa           1      0.00000000     0.00000000
5   virginica           0      0.00000000     1.00000000
6   virginica           0      0.00000000     1.00000000
7      setosa           1      0.00000000     0.00000000
8  versicolor           0      0.66860033     0.33139967
9   virginica           0      0.22534461     0.77465539
10 versicolor           0      0.79921042     0.20078958
11  virginica           0      0.00000000     1.00000000
......

table
            fit
             setosa versicolor virginica
  setosa         15          0         0
  versicolor      0         19         1
  virginica       0          2        13
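Overall accuracy can be read off the table above: the diagonal holds the correctly classified flowers. Continuing the kknn example:

tab <- table(iris.valid$Species, fit)
sum(diag(tab)) / sum(tab)   # here (15 + 19 + 13) / 50 = 0.94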

pcol <- as.character(as.numeric(iris.valid$Species))
pairs(iris.valid[1:4], pch = pcol, col = c("green3", "red")[(iris.valid$Species != fit)+1])

Ctrees? We want a means to make decisions, so how about an "if this then this, otherwise that" approach == tree methods, or branching. Conditional Inference – what is that? Instead of: if (This1 .and. This2 .and. This3 .and. …)

Decision tree classifier

Conditional Inference Tree
> require(party)   # don't get me started!
> str(iris)
'data.frame': 150 obs. of 5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
> iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris)

Ctree
> print(iris_ctree)

  Conditional inference tree with 4 terminal nodes

Response:  Species
Inputs:  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
Number of observations:  150

1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264
  2)* weights = 50
1) Petal.Length > 1.9
  3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894
    4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865
      5)* weights = 46
    4) Petal.Length > 4.8
      6)* weights = 8
  3) Petal.Width > 1.7
    7)* weights = 46
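To see how well the fitted tree recovers the known species, a one-line check (continuing from the ctree fit above):

table(predict(iris_ctree), iris$Species)   # fitted class vs. true class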

plot(iris_ctree)
> plot(iris_ctree, type="simple")   # try this

Beyond plot: pairs
pairs(iris[1:4], main = "Anderson's Iris Data -- 3 species", pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])

But the means for branching do not have to be threshold based (~ distance). They can be cluster based: I am more similar to you if I possess these attributes (in this range). Thus trees + clusters = hierarchical clustering. In R: hclust (and others) in the stats package.

Try hclust for iris
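A minimal sketch for trying hclust on iris: cluster on the four measurements, cut the tree into three groups, and compare with Species. The "complete" linkage and k = 3 are illustrative choices:

d  <- dist(as.matrix(iris[, 1:4]))   # Species column excluded from the distances
hc <- hclust(d, method = "complete")
plot(hc, labels = FALSE)
groups <- cutree(hc, k = 3)          # cut the dendrogram into 3 clusters
table(groups, iris$Species)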

require(gpairs); gpairs(iris)   # gpairs() comes from the gpairs package

Better scatterplots
install.packages("car")
require(car)
scatterplotMatrix(iris)

require(lattice); splom(iris)   # default

splom extra!
require(lattice)
super.sym <- trellis.par.get("superpose.symbol")
splom(~iris[1:4], groups = Species, data = iris,
      panel = panel.superpose,
      key = list(title = "Three Varieties of Iris",
                 columns = 3,
                 points = list(pch = super.sym$pch[1:3], col = super.sym$col[1:3]),
                 text = list(c("Setosa", "Versicolor", "Virginica"))))
splom(~iris[1:3]|Species, data = iris, layout=c(2,2),
      pscales = 0,
      varnames = c("Sepal\nLength", "Sepal\nWidth", "Petal\nLength"),
      page = function(...) {
        ltext(x = seq(.6, .8, length.out = 4),
              y = seq(.9, .6, length.out = 4),
              labels = c("Three", "Varieties", "of", "Iris"),
              cex = 2)
      })
parallelplot(~iris[1:4] | Species, iris)
parallelplot(~iris[1:4], iris, groups = Species,
             horizontal.axis = FALSE, scales = list(x = list(rot = 90)))

Shift the dataset…

Hierarchical clustering
> d <- dist(as.matrix(mtcars))
> hc <- hclust(d)
> plot(hc)

data(swiss) - pairs
pairs(~ Fertility + Education + Catholic, data = swiss, subset = Education < 20, main = "Swiss data, Education < 20")

ctree
require(party)
swiss_ctree <- ctree(Fertility ~ Agriculture + Education + Catholic, data = swiss)
plot(swiss_ctree)

Hierarchical clustering
> dswiss <- dist(as.matrix(swiss))
> hs <- hclust(dswiss)
> plot(hs)

scatterplotMatrix
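This slide presumably shows the car package's scatterplotMatrix() applied to the swiss data; a minimal sketch under that assumption:

require(car)
scatterplotMatrix(swiss)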

require(lattice); splom(swiss)

Start collecting your favorite plotting routines. Get familiar with annotating plots.

Assignment 3 Preliminary and Statistical Analysis. Due October 5. 15% (written). Distribution analysis and comparison, visual 'analysis', and statistical model fitting and testing of some of the nyt2…31 datasets. See LMS … for the Assignment and details. Assignments 4, 5 and 6 are available…