Introduction to Data Science: Lecture 6


Introduction to Data Science: Lecture 6 Dr. Lev Faivishevsky March 2016

Agenda Clustering: Hierarchical, K-means, GMM Anomaly Detection Change Detection

Clustering Partition unlabeled examples into disjoint subsets (clusters), such that: Examples within a cluster are very similar Examples in different clusters are very different Discover new categories in an unsupervised manner (no sample category labels provided).

Clustering Example (figure)

Hierarchical Clustering Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples. Recursive application of a standard clustering algorithm can produce a hierarchical clustering. (Example dendrogram: animal splits into vertebrate and invertebrate; vertebrate into fish, reptile, amphibian, mammal; invertebrate into worm, insect, crustacean.)

Agglomerative vs. Divisive Clustering Agglomerative (bottom-up) methods start with each example in its own cluster and iteratively combine them to form larger and larger clusters. Divisive (partitional, top-down) methods separate all examples immediately into clusters.

Direct Clustering Method Direct clustering methods require a specification of the number of clusters, k, desired. A clustering evaluation function assigns a real-valued quality measure to a clustering. The number of clusters can be determined automatically by explicitly generating clusterings for multiple values of k and choosing the best result according to a clustering evaluation function.

Hierarchical Agglomerative Clustering (HAC) Assumes a similarity function for determining the similarity of two instances. Starts with all instances in a separate cluster and then repeatedly joins the two clusters that are most similar until there is only one cluster. The history of merging forms a binary tree or hierarchy.

HAC Algorithm Start with all instances in their own cluster. Until there is only one cluster: Among the current clusters, determine the two clusters, ci and cj, that are most similar. Replace ci and cj with a single cluster ci ∪ cj.
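A minimal Python sketch of this loop, assuming a precomputed instance-to-instance similarity matrix and a pluggable cluster-similarity function (the linkage options on the next slide); the names hac and cluster_sim are illustrative, not from the lecture.

```python
import numpy as np

def hac(sim_matrix, cluster_sim):
    """Naive hierarchical agglomerative clustering.

    sim_matrix  : (n, n) array of instance-to-instance similarities.
    cluster_sim : function(cluster_a, cluster_b, sim_matrix) -> float.
    Returns the merge history as a list of (cluster_a, cluster_b) pairs.
    """
    clusters = [[i] for i in range(sim_matrix.shape[0])]   # each instance in its own cluster
    merges = []
    while len(clusters) > 1:
        # Among the current clusters, find the most similar pair (ci, cj).
        best, best_pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = cluster_sim(clusters[a], clusters[b], sim_matrix)
                if s > best:
                    best, best_pair = s, (a, b)
        a, b = best_pair
        merges.append((clusters[a], clusters[b]))
        # Replace ci and cj with the single cluster ci ∪ cj.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merges[-1][0] + merges[-1][1])
    return merges
```

This naive version recomputes pairwise cluster similarities from scratch in every iteration; the complexity slide further below explains why practical implementations update them in constant time instead.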

Cluster Similarity Assume a similarity function that determines the similarity of two instances: sim(x,y), e.g. cosine similarity of document vectors. How do we compute the similarity of two clusters, each possibly containing multiple instances? Single Link: similarity of the two most similar members. Complete Link: similarity of the two least similar members. Group Average: average similarity between members.
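These three options can be written as drop-in choices for the cluster_sim argument in the sketch above; a hedged illustration (the group-average variant follows the "ordered pairs in the merged cluster" definition given a few slides below):

```python
import numpy as np

def single_link(ca, cb, sim):
    # Similarity of the two MOST similar members across the clusters.
    return max(sim[i, j] for i in ca for j in cb)

def complete_link(ca, cb, sim):
    # Similarity of the two LEAST similar members across the clusters.
    return min(sim[i, j] for i in ca for j in cb)

def group_average(ca, cb, sim):
    # Average similarity over all ordered pairs of distinct members of the merged cluster.
    merged = ca + cb
    return float(np.mean([sim[i, j] for i in merged for j in merged if i != j]))
```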

Single Link Agglomerative Clustering Use the maximum similarity of pairs: sim(c_i, c_j) = max_{x∈c_i, y∈c_j} sim(x, y). Can result in "straggly" (long and thin) clusters due to a chaining effect. Appropriate in some domains, such as clustering islands.

Single Link Example

Complete Link Agglomerative Clustering Use the minimum similarity of pairs: sim(c_i, c_j) = min_{x∈c_i, y∈c_j} sim(x, y). Makes tighter, more spherical clusters that are typically preferable.

Complete Link Example

Computational Complexity In the first iteration, all HAC methods need to compute the similarity of all pairs of n individual instances, which is O(n²). In each of the subsequent n−2 merging iterations, they must compute the distance between the most recently created cluster and all other existing clusters. In order to maintain an overall O(n²) performance, computing similarity to each other cluster must be done in constant time.

Computing Cluster Similarity After merging c_i and c_j, the similarity of the resulting cluster to any other cluster, c_k, can be computed by: Single Link: sim((c_i ∪ c_j), c_k) = max(sim(c_i, c_k), sim(c_j, c_k)). Complete Link: sim((c_i ∪ c_j), c_k) = min(sim(c_i, c_k), sim(c_j, c_k)).

Group Average Agglomerative Clustering Use average similarity across all pairs within the merged cluster to measure the similarity of two clusters. Compromise between single and complete link. Averaged across all ordered pairs in the merged cluster instead of unordered pairs between the two clusters to encourage tight clusters.

Computing Group Average Similarity Assume cosine similarity and vectors normalized to unit length. Always maintain the sum of vectors in each cluster, s(c) = Σ_{x∈c} x. The group-average similarity of a merger can then be computed in constant time: sim(c_i ∪ c_j) = (||s(c_i) + s(c_j)||² − (|c_i| + |c_j|)) / ((|c_i| + |c_j|)(|c_i| + |c_j| − 1)).
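A small sketch of that constant-time computation, assuming unit-length vectors and cosine similarity (so the sum of all ordered-pair similarities equals the squared norm of the merged cluster's sum vector minus the self-similarities); the function name is illustrative.

```python
import numpy as np

def group_average_of_merge(s_i, n_i, s_j, n_j):
    """Group-average similarity of merging two clusters, in O(d) time.

    s_i, s_j : sum vectors of the two clusters (members assumed unit length).
    n_i, n_j : numbers of members in the two clusters.
    """
    s = s_i + s_j            # sum vector of the merged cluster (maintained incrementally)
    n = n_i + n_j
    # Sum over ordered pairs x != y of x·y  =  ||s||^2  minus the n self-similarities (each = 1).
    return (np.dot(s, s) - n) / (n * (n - 1))
```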

Non-Hierarchical Clustering Typically must provide the number of desired clusters, k. Randomly choose k instances as seeds, one per cluster. Form initial clusters based on these seeds. Iterate, repeatedly reallocating instances to different clusters to improve the overall clustering. Stop when clustering converges or after a fixed number of iterations.

Distances: Ordinal and Categorical Variables Ordinal variables can be forced to lie within (0, 1), and then a quantitative metric can be applied: for example, replace the i-th of M ordered values by (i − 1/2)/M. For categorical variables, a distance must be specified by the user between each pair of categories. To combine variables of different types, a weighted sum of per-variable distances is often used.
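A hedged sketch of such a mixed-type dissimilarity, assuming the (i − 1/2)/M mapping for ordinal levels and a user-supplied distance for each pair of categories; the variable names and weights are illustrative, not from the lecture.

```python
def ordinal_to_unit(level, n_levels):
    # Map the i-th of M ordered levels (1-based) into the interval (0, 1).
    return (level - 0.5) / n_levels

def mixed_distance(x, y, n_levels, cat_dist, w_ord=0.5, w_cat=0.5):
    """Weighted sum of an ordinal distance and a user-specified categorical distance.

    x, y     : dicts with keys 'ord' (1-based ordinal level) and 'cat' (category label).
    cat_dist : dict mapping (category_a, category_b) -> distance chosen by the user
               (assumed to contain both key orders).
    """
    d_ord = abs(ordinal_to_unit(x['ord'], n_levels) - ordinal_to_unit(y['ord'], n_levels))
    d_cat = cat_dist[(x['cat'], y['cat'])]
    return w_ord * d_ord + w_cat * d_cat
```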

K-means Overview An unsupervised clustering algorithm. "K" stands for the number of clusters; it is typically a user input to the algorithm, though some criteria can be used to estimate K automatically. K-means is an approximation to an NP-hard combinatorial optimization problem. The algorithm is iterative in nature; it converges, but only to a local minimum. Works only for numerical data. Easy to implement.

K-means: Setup x_1,…, x_N are data points or vectors of observations. Each observation (vector x_i) is assigned to one and only one cluster; C(i) denotes the cluster number of the i-th observation. Dissimilarity measure: the Euclidean distance metric. K-means minimizes the within-cluster point scatter W(C) = Σ_{k=1}^{K} N_k Σ_{C(i)=k} ||x_i − m_k||², where m_k is the mean vector of the k-th cluster and N_k is the number of observations in the k-th cluster.

K-means Algorithm For a given cluster assignment C of the data points, compute the cluster means: m_k = (1/N_k) Σ_{C(i)=k} x_i, for k = 1,…,K. For the current set of cluster means, assign each observation to the nearest mean: C(i) = argmin_{1≤k≤K} ||x_i − m_k||². Iterate the two steps above until convergence.
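A minimal NumPy sketch of these two alternating steps, with the random seeding described earlier for non-hierarchical clustering; kmeans is an illustrative name, not lecture code.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Alternate the mean-update and nearest-mean assignment steps until convergence."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]     # k observations as seeds
    for _ in range(n_iter):
        # Assignment step: C(i) = argmin_k ||x_i - m_k||^2
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Mean step: m_k = average of the observations currently assigned to cluster k
        new_means = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else means[j]
                              for j in range(k)])
        if np.allclose(new_means, means):                    # converged (to a local minimum)
            break
        means = new_means
    return means, labels
```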

K-means clustering example

K-means: Example 2, Steps 1–5 (figures; Euclidean distance metric, centroids k1, k2, k3 updated and points reassigned at each step)

K-means: summary Algorithmically very simple to implement. K-means converges, but it finds only a local minimum of the cost function. Works only for numerical observations. K is a user input. Outliers can cause considerable trouble for K-means.

The Problem You have data that you believe is drawn from n populations. You want to identify the parameters of each population, but you don't know anything about the populations a priori, except that you believe they're Gaussian…

Gaussian Mixture Models Rather than identifying clusters by "nearest" centroids, fit a set of k Gaussians to the data by maximum likelihood over a mixture model: p(x) = π_1 f_1(x) + π_2 f_2(x) + … + π_k f_k(x).

GMM example

Mixture Models Formally, a mixture model is the weighted sum of a number of pdfs, where the weights are determined by a distribution π: p(x) = Σ_k π_k f_k(x), with π_k ≥ 0 and Σ_k π_k = 1.

Gaussian Mixture Models GMM: the weighted sum of a number of Gaussians, where the weights are determined by a distribution π: p(x) = Σ_k π_k N(x | μ_k, Σ_k).
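A short SciPy sketch evaluating such a weighted sum of Gaussians; the mixture parameters below are made up purely for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

weights = np.array([0.5, 0.3, 0.2])                     # pi_k, summing to 1
means = [np.zeros(2), np.array([3.0, 3.0]), np.array([-3.0, 2.0])]
covs = [np.eye(2), 0.5 * np.eye(2), np.diag([2.0, 0.5])]

def gmm_pdf(x):
    # p(x) = sum_k pi_k N(x | mu_k, Sigma_k)
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

print(gmm_pdf(np.array([1.0, 1.0])))
```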

Graphical Models with Unobserved Variables What if you have variables in a graphical model that are never observed? These are latent variables. Training latent-variable models is an unsupervised learning application. (Figure: example model with hidden states such as "uncomfortable" and "amused" and observed symptoms such as "sweating" and "laughing".)

Latent Variable HMMs We can cluster sequences using an HMM with unobserved state variables We will train latent variable models using Expectation Maximization

Expectation Maximization Both the training of GMMs and Graphical Models with latent variables can be accomplished using Expectation Maximization Step 1: Expectation (E-step) Evaluate the “responsibilities” of each cluster with the current parameters Step 2: Maximization (M-step) Re-estimate parameters using the existing “responsibilities” Similar to k-means training.

Latent Variable Representation We can represent a GMM using a latent variable z: a one-of-K indicator with p(z_k = 1) = π_k and p(x | z_k = 1) = N(x | μ_k, Σ_k), so that marginalizing over z recovers p(x) = Σ_k π_k N(x | μ_k, Σ_k). What does this give us?

GMM data and Latent variables

One last bit We have representations of the joint p(x, z) and the marginal p(x)… The conditional p(z|x) can be derived using Bayes' rule: γ(z_k) ≡ p(z_k = 1 | x) = π_k N(x | μ_k, Σ_k) / Σ_j π_j N(x | μ_j, Σ_j). This is the responsibility that a mixture component takes for explaining an observation x.

Maximum Likelihood over a GMM As usual: identify a likelihood function, ln p(X | π, μ, Σ) = Σ_{n=1}^{N} ln Σ_{k=1}^{K} π_k N(x_n | μ_k, Σ_k), and set its partial derivatives to zero…

Maximum Likelihood of a GMM Optimization of the means: setting the partial derivative with respect to μ_k to zero gives μ_k = (1/N_k) Σ_{n=1}^{N} γ(z_nk) x_n, where N_k = Σ_n γ(z_nk).

Maximum Likelihood of a GMM Optimization of the covariances: Σ_k = (1/N_k) Σ_{n=1}^{N} γ(z_nk)(x_n − μ_k)(x_n − μ_k)^T.

Maximum Likelihood of a GMM Optimization of the mixing terms: maximize ln p(X | π, μ, Σ) + λ(Σ_{k=1}^{K} π_k − 1) with respect to π_k; setting the derivative to zero yields π_k = N_k / N.

MLE of a GMM

EM for GMMs Initialize the parameters Evaluate the log likelihood Expectation-step: Evaluate the responsibilities Maximization-step: Re-estimate Parameters Check for convergence

EM for GMMs E-step: evaluate the responsibilities γ(z_nk) = π_k N(x_n | μ_k, Σ_k) / Σ_j π_j N(x_n | μ_j, Σ_j).

EM for GMMs M-step: re-estimate the parameters using the current responsibilities: μ_k = (1/N_k) Σ_n γ(z_nk) x_n, Σ_k = (1/N_k) Σ_n γ(z_nk)(x_n − μ_k)(x_n − μ_k)^T, π_k = N_k / N, with N_k = Σ_n γ(z_nk).
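Putting the E-step and M-step together, a hedged NumPy/SciPy sketch of EM for a GMM following the standard update equations above; em_gmm is an illustrative name, not code from the lecture.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, pi, mu, Sigma, n_iter=100, tol=1e-6):
    """X: (N, D) data; pi: (K,) weights; mu: (K, D) means; Sigma: (K, D, D) covariances."""
    N, K = X.shape[0], len(pi)
    Sigma = Sigma.copy()
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                                for k in range(K)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibilities.
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        pi = Nk / N
        # Convergence check on the log likelihood.
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mu, Sigma, gamma
```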

Visual example of EM

Potential Problems Incorrect number of Mixture Components Singularities

Incorrect Number of Gaussians

Incorrect Number of Gaussians

Singularities A minority of the data can have a disproportionate effect on the model likelihood. For example, a component whose mean sits on a single data point can shrink its variance toward zero, driving the likelihood to infinity.

GMM example

Relationship to K-means K-means makes hard decisions. Each data point gets assigned to a single cluster. GMM/EM makes soft decisions. Each data point can yield a posterior p(z|x) Soft K-means is a special case of EM.

Soft K-Means as GMM/EM Assume equal, isotropic covariance matrices εI for every mixture component. Likelihood: p(x | μ_k, Σ_k) = (2πε)^(−M/2) exp(−||x − μ_k||² / (2ε)). Responsibilities: γ(z_nk) = π_k exp(−||x_n − μ_k||² / (2ε)) / Σ_j π_j exp(−||x_n − μ_j||² / (2ε)). As ε approaches zero, the responsibility of the nearest mean approaches unity (and all others approach zero).

Soft K-Means as GMM/EM Overall log likelihood as ε approaches zero: the expected complete-data log likelihood reduces to the within-cluster scatter minimized by K-means. Note: only the means are re-estimated in soft K-means; the covariance matrices are all tied.
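A brief sketch of that limiting behaviour, assuming the shared isotropic covariance εI from the previous slide and equal mixing weights; as eps shrinks, the rows of gamma harden into one-hot K-means assignments (names are illustrative).

```python
import numpy as np

def soft_kmeans_responsibilities(X, means, eps):
    # gamma[n, k] proportional to exp(-||x_n - mu_k||^2 / (2 * eps));
    # shared covariance eps * I and equal mixing weights assumed.
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    logits = -d2 / (2.0 * eps)
    logits -= logits.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    gamma = np.exp(logits)
    return gamma / gamma.sum(axis=1, keepdims=True)

# As eps -> 0, each row approaches a one-hot vector: the hard assignment used by K-means.
```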

General form of EM Given a joint distribution p(X, Z | θ) over observed and latent variables, we want to maximize the likelihood p(X | θ). Initialize the parameters θ_old. E-step: evaluate p(Z | X, θ_old). M-step: re-estimate the parameters as θ_new = argmax_θ Q(θ, θ_old), where Q(θ, θ_old) = Σ_Z p(Z | X, θ_old) ln p(X, Z | θ) is the expectation of the complete-data log likelihood. Check for convergence of the parameters or the likelihood.

Anomaly Detection

Explored Methods Change detection: KNN-based Kullback-Leibler divergence, compared with the Kolmogorov-Smirnov test (1D). Anomaly detection: One-Class SVM, compared with the Mahalanobis distance.

Anomaly detection Single-sample detection: is the sample an outlier w.r.t. the baseline behavior? Technique: quantify the usual behavior (train) as a multidimensional distribution, measure the probability of the current point (test), and declare it an outlier if p < threshold. Methods used: One-Class SVM, Mahalanobis distance.

Mahalanobis distance Data are assumed to be N(µ, Σ). Fit (µ, Σ) from the training data. The distance of a point x is d(x) = sqrt((x − µ)^T Σ^{-1} (x − µ)). Fine-tune the threshold on a validation set. Detect x as an outlier if its distance exceeds the threshold.
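A hedged NumPy sketch of this recipe (class and method names are illustrative; the threshold would be tuned on a validation set as described):

```python
import numpy as np

class MahalanobisDetector:
    def fit(self, X_train):
        # Assume the data are N(mu, Sigma); estimate both from the training set.
        self.mu = X_train.mean(axis=0)
        self.Sigma_inv = np.linalg.inv(np.cov(X_train, rowvar=False))
        return self

    def distance(self, X):
        # d(x) = sqrt((x - mu)^T Sigma^{-1} (x - mu)), computed row-wise.
        diff = X - self.mu
        return np.sqrt(np.einsum('ij,jk,ik->i', diff, self.Sigma_inv, diff))

    def is_outlier(self, X, threshold):
        # The threshold is fine-tuned on a validation set for the desired false-alarm rate.
        return self.distance(X) > threshold
```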

One-Class SVM A recasting of the ordinary binary SVM; state-of-the-art novelty detection. Finds the smallest-volume sphere with a fraction (1 − ν) of the data inside, so that Prob(outlier) = ν; ν enters the SVM objective function explicitly. This gives robustness and control of the false-alarm rate. Optionally fine-tune the threshold on a validation set to define the precise location of the decision surface ρ.
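A short usage sketch with scikit-learn's OneClassSVM, where the nu parameter plays the role of the expected outlier fraction ν; the data and hyperparameter values below are illustrative only.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 20))            # "usual" behaviour (train)
X_test = rng.normal(loc=1.0, size=(200, 20))     # possibly anomalous points (test)

ocsvm = OneClassSVM(kernel="rbf", nu=0.02, gamma="scale").fit(X_train)

scores = ocsvm.decision_function(X_test)         # signed distance to the decision surface
is_outlier = ocsvm.predict(X_test) == -1         # -1 = outlier, +1 = inlier

# Optionally, fine-tune a threshold on `scores` using a validation set instead of
# relying on the default location of the decision surface.
```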

Multidimensional anomaly detection, 10K runs, N(0, I(D)), FA tuned to 0.02
Test | Actual FA | Detection Rate
SVM, 20D, Σ = Σ + I(20) | 0.015 | 0.596
Mahalanobis, 20D, Σ = Σ + I(20) | 0.016 | 0.601
SVM, 2D, Σ = Σ + I(2) | 0.028 | 0.144
Mahalanobis, 2D, Σ = Σ + I(2) | 0.023 | 0.150
SVM, 2D, µ = µ + 1 | 0.022 | 0.112
Mahalanobis, 2D, µ = µ + 1 | | 0.131
SVM, 20D, µ = µ + 1 | 0.017 | 0.613
Mahalanobis, 20D, µ = µ + 1 | | 0.622
SVM, 2D, ρ = ρ + 0.9 | 0.018 | 0.038
Mahalanobis, 2D, ρ = ρ + 0.9 | 0.020 | 0.041

Keystroke – Real-world finger-typing timings Real-world data of finger-typing timings: the same 10-letter password is typed repeatedly by 51 human subjects, with 8 daily sessions per subject and 50 repetitions in each daily session. Each typing is characterized by 20 key-up / key-pressed timings, so overall each subject is represented by 400 samples × 20 sensors. The dataset is applicable to anomaly detection, change detection, and knowledge extraction (multiclass classification). Full description and some R implementations at http://www.cs.cmu.edu/~keystroke/

Performance comparison on real data
Method | Detection rate (tuned for FA 0.05) | Actual FA rate (tuned for FA 0.05) | Detection rate (tuned for FA 0.02) | Actual FA rate (tuned for FA 0.02)
SVM One Class, 20D | 0.59 ± 0.281 | 0.050 ± 0.068 | 0.448 ± 0.304 | 0.024 ± 0.035
Mahalanobis, 20D | 0.55 ± 0.295 | 0.062 ± 0.074 | 0.464 ± 0.307 | 0.045 ± 0.065
SVM One Class, 2D | 0.441 ± 0.230 | 0.054 ± 0.069 | 0.319 ± 0.257 | 0.034 ± 0.058
Mahalanobis, 2D | 0.446 ± 0.226 | 0.077 ± 0.077 | 0.362 ± 0.241 | 0.055 ± 0.073
SVM performance is preferable: better control of the false-alarm rate and a higher detection rate.

Change detection A consistent change in system behavior: different distributions in the past and in the future. Technique: quantify the distributions in the past and in the future, measure the distance between them, and detect a change if the distance exceeds a threshold. One-dimensional case: the Kolmogorov-Smirnov test avoids explicit estimation of the distributions, its score is distribution-independent, and fine-tuning of the threshold may be avoided.

Kolmogorov-Smirnov Test Quantifies the difference between the empirical distributions of two samples from a 1D continuous random variable. It measures the maximal difference between the two empirical CDFs and returns the probability that the two samples come from the same distribution.
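The statistic is the maximal gap between the two empirical CDFs, and SciPy provides it directly via ks_2samp; the synthetic samples below are for illustration only.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
past = rng.normal(loc=0.0, size=500)       # samples before the candidate change point
future = rng.normal(loc=1.0, size=500)     # samples after it

stat, p_value = ks_2samp(past, future)     # stat = max |F1(x) - F2(x)| over x
change_detected = p_value < 0.01           # reject "same distribution" at a chosen level
```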

Multivariate change detection Technique: quantify the distributions in the past and in the future, measure the distance between them, and detect a change if the distance exceeds a threshold. (Figure: points t1–t9 plotted in the Temperature–Pressure plane.)

Score by a KNN estimator of the Kullback-Leibler divergence KNN avoids multidimensional distribution estimation. For each point in the Past cloud P, calculate the nearest-neighbor distance ρ within P (Past-to-Past) and the nearest-neighbor distance ν to the Future cloud Q (Past-to-Future). Compute Score = D(Past||Future) + D(Future||Past). (Figure: sliding Past and Future windows around time t = 0.) Faivishevsky, "Information Theoretic Multivariate Change Detection for Multisensory Information Processing in Internet of Things", ICASSP 2016.
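A hedged sketch of a k-nearest-neighbour KL estimator in the spirit of this slide, using the standard distance-ratio form D(P||Q) ≈ (d/n) Σ_i log(ν_i/ρ_i) + log(m/(n−1)); this is an illustrative implementation under that assumption, not the paper's code.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_kl(P, Q, k=1):
    """kNN estimate of D(P || Q) from samples P (n, d) and Q (m, d); assumes no duplicate points."""
    n, d = P.shape
    m = Q.shape[0]
    # rho_i: distance from P[i] to its k-th neighbour within P (index 0 is the point itself).
    rho = cKDTree(P).query(P, k=k + 1)[0][:, -1]
    # nu_i: distance from P[i] to its k-th neighbour in Q.
    nu = cKDTree(Q).query(P, k=k)[0]
    nu = nu if nu.ndim == 1 else nu[:, -1]
    return d * np.mean(np.log(nu / rho)) + np.log(m / (n - 1))

def change_score(past, future, k=8):
    # Symmetrised score from the slide: D(Past || Future) + D(Future || Past).
    return knn_kl(past, future, k) + knn_kl(future, past, k)
```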

Information-Theoretic Multivariate Change Detection algorithm Train: set a threshold on the KNN KL score (Past vs. Future) for a predefined alarm rate f. Test: check whether KNN KL (Past vs. Future) > threshold. (Figure: histogram of scores over the windows, with the threshold placed so that a fraction f of the scores exceed it and 1 − f fall below.)
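A short sketch of this train/test split, assuming a collection of change-free reference (Past, Future) window pairs for calibration; score_fn would be the symmetrised KNN KL change_score from the previous sketch, and the quantile-based calibration and names are illustrative.

```python
import numpy as np

def calibrate_threshold(reference_window_pairs, score_fn, f):
    """Pick the score threshold exceeded by a fraction f of change-free (Past, Future) pairs."""
    scores = [score_fn(past, future) for past, future in reference_window_pairs]
    return float(np.quantile(scores, 1.0 - f))

def detect_change(past, future, score_fn, threshold):
    # Test: declare a change whenever the score exceeds the calibrated threshold.
    return score_fn(past, future) > threshold
```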

Change detection comparison, 10K runs, N(0,1)
Test | Window | FA tuned | Actual FA | Detection Rate
KL, detect µ = µ + 1 | 50,10 (k=8) | 0.001 | 0.0009 | 0.139
KS, detect µ = µ + 1 | 50,10 | | 0.0004 | 0.131
KL, detect µ = µ + 2 | | | 0.0006 | 0.921
KS, detect µ = µ + 2 | | | | 0.871
KL, detect µ = µ + 1 | 30,10 (k=5) | 0.01 | 0.013 | 0.31
KS, detect µ = µ + 1 | 30,10 | | 0.005 | 0.326
KL, detect µ = µ + 1 | 30,10 (k=8) | | 0.006 | 0.343
KL, detect µ = µ + 2 | | | 0.008 | 0.960
KS, detect µ = µ + 2 | | | | 0.94
KL, detect σ = σ + 1 | | | 0.012 | 0.063
KS, detect σ = σ + 1 | | | 0.007 | 0.015
KL, detect σ = σ + 2 | | | 0.011 | 0.132
KS, detect σ = σ + 2 | | | 0.004 | 0.025
KL, detect σ = σ + 3 | | | 0.009 | 0.264
KS, detect σ = σ + 3 | | | | 0.036
KL and KS are similar for Δµ detection; KL is better for Δσ; KL is more controllable.

Multidimensional change detection, N(0, I(D))
Test | Window | FA tuned | Actual FA | Detection Rate
KL, 20D, Σ = Σ + I(20) | 30,10 (k=8) | 0.01 | 0.010 | 0.560
KL, 2D, Σ = Σ + I(2) | | | 0.007 | 0.067
KL, 20D, µ = µ + 1 | | | | 1.000
KL, 2D, µ = µ + 1 | | | | 0.638
KL, 2D, ρ = ρ + 0.9 | | | 0.02 | 0.35
KNN KL leverages multidimensional information to detect changes that cannot be detected by one-dimensional methods: small changes in µ, small changes in Σ, and changes in ρ.

Application of the Keystroke data to change detection 1. Use a session of consecutive 20-dimensional timing vectors from one subject as the start. 2. Append a session of consecutive 20-dimensional timing vectors from another subject. 3. Check whether the change-detection method detects the stitch point, and whether it raises false alarms elsewhere. 4. Repeat steps 1–3 to get substantial statistics.

KNN KL Change Detection performance on real data
Method | Window Size | Change Detection rate (tuned for FA 0.01) | Actual FA rate (tuned for FA 0.01) | K statistics
KNN KL Divergence, 20D | 10 | 0.974 ± 0.056 | 0.029 ± 0.064 | 2
KNN KL Divergence, 20D | 4 | 0.761 ± 0.175 | 0.019 ± 0.020 |
KNN KL Divergence, 2D | | 0.704 ± 0.184 | 0.019 ± 0.032 |
KNN KL Divergence, 2D | | 0.489 ± 0.077 | 0.016 ± 0.021 |

Thank you!