1
Introduction to Data Science: Lecture 6
March 2016, Dr. Lev Faivishevsky
2
Agenda
Clustering: Hierarchical, K-means, GMM
Anomaly Detection
Change Detection
3
Clustering
Partition unlabeled examples into disjoint subsets (clusters), such that:
Examples within a cluster are very similar.
Examples in different clusters are very different.
Discover new categories in an unsupervised manner (no sample category labels provided).
4
Clustering Example (figure)
5
Hierarchical Clustering
Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples. Recursive application of a standard clustering algorithm can produce a hierarchical clustering. (Example dendrogram: animal splits into vertebrate and invertebrate; vertebrate into fish, reptile, amphibian, and mammal; invertebrate into worm, insect, and crustacean.)
6
Agglomerative vs. Divisive Clustering
Agglomerative (bottom-up) methods start with each example in its own cluster and iteratively combine them to form larger and larger clusters. Divisive (partitional, top-down) methods separate all examples into clusters immediately.
7
Direct Clustering Method
Direct clustering methods require a specification of the number of clusters, k, desired. A clustering evaluation function assigns a real-value quality measure to a clustering. The number of clusters can be determined automatically by explicitly generating clusterings for multiple values of k and choosing the best result according to a clustering evaluation function.
8
Hierarchical Agglomerative Clustering (HAC)
Assumes a similarity function for determining the similarity of two instances. Starts with all instances in a separate cluster and then repeatedly joins the two clusters that are most similar until there is only one cluster. The history of merging forms a binary tree or hierarchy.
9
HAC Algorithm
Start with each instance in its own cluster.
Until there is only one cluster:
Among the current clusters, determine the two clusters, ci and cj, that are most similar.
Replace ci and cj with a single cluster ci ∪ cj.
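As an illustration of the algorithm above, here is a minimal, naive HAC sketch in Python. The function names and the choice of cosine similarity are assumptions for the example, and a practical implementation would cache cluster similarities rather than recompute them (see the complexity slide later).

```python
import numpy as np

def cosine_sim(x, y):
    # Cosine similarity between two vectors.
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def hac(points, linkage="single"):
    """Naive agglomerative clustering (recomputes similarities each pass); returns the merge history."""
    clusters = [[i] for i in range(len(points))]   # start: each point in its own cluster
    history = []
    agg = max if linkage == "single" else min      # single = most similar pair, complete = least similar pair
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # cluster-to-cluster similarity from pairwise point similarities
                sim = agg(cosine_sim(points[i], points[j])
                          for i in clusters[a] for j in clusters[b])
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        sim, a, b = best
        history.append((clusters[a], clusters[b], sim))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return history
```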
10
Cluster Similarity
Assume a similarity function that determines the similarity of two instances, sim(x, y), e.g. cosine similarity of document vectors. How do we compute the similarity of two clusters, each possibly containing multiple instances?
Single Link: similarity of the two most similar members.
Complete Link: similarity of the two least similar members.
Group Average: average similarity between members.
11
Single Link Agglomerative Clustering
Use maximum similarity of pairs: sim(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} sim(x, y). Can result in "straggly" (long and thin) clusters due to the chaining effect. Appropriate in some domains, such as clustering islands.
12
Single Link Example
13
Complete Link Agglomerative Clustering
Use minimum similarity of pairs: sim(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} sim(x, y). Makes tighter, more spherical clusters that are typically preferable.
14
Complete Link Example
15
Computational Complexity
In the first iteration, all HAC methods need to compute the similarity of all pairs of n individual instances, which is O(n^2). In each of the subsequent n - 2 merging iterations, the method must compute the distance between the most recently created cluster and all other existing clusters. To maintain overall O(n^2) performance, computing the similarity to each other cluster must be done in constant time.
16
Computing Cluster Similarity
After merging ci and cj, the similarity of the resulting cluster to any other cluster, ck, can be computed by:
Single Link: sim((c_i \cup c_j), c_k) = \max\big(sim(c_i, c_k),\, sim(c_j, c_k)\big)
Complete Link: sim((c_i \cup c_j), c_k) = \min\big(sim(c_i, c_k),\, sim(c_j, c_k)\big)
17
Group Average Agglomerative Clustering
Use average similarity across all pairs within the merged cluster to measure the similarity of two clusters. Compromise between single and complete link. Averaged across all ordered pairs in the merged cluster instead of unordered pairs between the two clusters to encourage tight clusters.
18
Computing Group Average Similarity
Assume cosine similarity and vectors normalized to unit length. Always maintain the sum of the vectors in each cluster, s(c_j) = \sum_{x \in c_j} x. The group-average similarity of two clusters can then be computed in constant time:
sim(c_i, c_j) = \frac{\big(s(c_i)+s(c_j)\big) \cdot \big(s(c_i)+s(c_j)\big) - (|c_i|+|c_j|)}{(|c_i|+|c_j|)(|c_i|+|c_j|-1)}
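A minimal sketch of the constant-time computation above, assuming unit-length vectors and dot-product (cosine) similarity as stated on the slide; the function name is illustrative.

```python
import numpy as np

def group_average_sim(sum_i, n_i, sum_j, n_j):
    """Group-average similarity of the cluster formed by merging clusters i and j.

    With unit-length vectors, the sum of similarities over all ordered pairs of
    distinct members equals ||s||^2 - n, so the average is computable in O(d)
    time from the maintained sum vectors alone.
    """
    s = sum_i + sum_j        # sum vector of the merged cluster
    n = n_i + n_j            # size of the merged cluster
    return (np.dot(s, s) - n) / (n * (n - 1))
```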
19
Non-Hierarchical Clustering
Typically must provide the number of desired clusters, k. Randomly choose k instances as seeds, one per cluster. Form initial clusters based on these seeds. Iterate, repeatedly reallocating instances to different clusters to improve the overall clustering. Stop when clustering converges or after a fixed number of iterations.
20
Distances: Ordinal and Categorical Variables
Ordinal variables can be mapped to lie within (0, 1), after which a quantitative metric can be applied. For categorical variables, the distance between each pair of categories must be specified by the user; often a weighted sum is used.
21
K-means Overview
An unsupervised clustering algorithm.
"K" stands for the number of clusters; it is typically a user input to the algorithm, although some criteria can be used to estimate K automatically.
It is an approximation to an NP-hard combinatorial optimization problem.
The K-means algorithm is iterative in nature.
It converges, but only to a local minimum.
Works only for numerical data.
Easy to implement.
22
K-means: Setup
x_1, ..., x_N are data points or vectors of observations.
Each observation (vector x_i) will be assigned to one and only one cluster; C(i) denotes the cluster number for the i-th observation.
Dissimilarity measure: the Euclidean distance metric.
K-means minimizes the within-cluster point scatter:
W(C) = \sum_{k=1}^{K} N_k \sum_{C(i)=k} \lVert x_i - m_k \rVert^2
where m_k is the mean vector of the k-th cluster and N_k is the number of observations in the k-th cluster.
23
K-means Algorithm
For a given cluster assignment C of the data points, compute the cluster means:
m_k = \frac{1}{N_k} \sum_{i:\, C(i)=k} x_i
For the current set of cluster means, assign each observation to the nearest mean:
C(i) = \arg\min_{k} \lVert x_i - m_k \rVert^2
Iterate the two steps above until convergence (see the sketch below).
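A compact NumPy sketch of the two alternating steps, with random data points as seeds as on the earlier non-hierarchical clustering slide; the function and parameter names are illustrative.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means: alternate the assignment and mean steps until assignments stop changing."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]   # k random seed points
    assign = np.zeros(len(X), dtype=int)
    for it in range(n_iter):
        # Assignment step: C(i) = argmin_k ||x_i - m_k||^2
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_assign = dists.argmin(axis=1)
        if it > 0 and np.array_equal(new_assign, assign):
            break                                          # converged: assignments unchanged
        assign = new_assign
        # Mean step: m_k = mean of the points currently assigned to cluster k
        for j in range(k):
            if np.any(assign == j):                        # keep the old mean if a cluster goes empty
                means[j] = X[assign == j].mean(axis=0)
    return means, assign
```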
24
K-means clustering example
25
K-means: Example 2, Step 1 (figure; Euclidean distance metric, centroids k1, k2, k3)
26
K-means: Example 2, Step 2 (figure; Euclidean distance metric, centroids k1, k2, k3)
27
K-means: Example 2, Step 3 (figure; Euclidean distance metric, centroids k1, k2, k3)
28
K-means: Example 2, Step 4 (figure; Euclidean distance metric, centroids k1, k2, k3)
29
K-means: Example 2, Step 5 (figure; Euclidean distance metric, centroids k1, k2, k3)
30
K-means: Summary
Algorithmically very simple to implement.
K-means converges, but only to a local minimum of the cost function.
Works only for numerical observations.
K is a user input.
Outliers can cause considerable trouble for K-means.
31
The Problem You have data that you believe is drawn from n populations
You want to identify the parameters of each population. You don't know anything about the populations a priori, except that you believe they're Gaussian.
32
Gaussian Mixture Models
Rather than identifying clusters by "nearest" centroids, fit a set of K Gaussians to the data by maximum likelihood over a mixture model:
p(x) = \pi_1 f_1(x) + \pi_2 f_2(x) + \ldots + \pi_K f_K(x)
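For reference, a short sketch of fitting such a mixture with scikit-learn's GaussianMixture; the synthetic two-population dataset is only for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data drawn from two Gaussian populations (illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(4.0, 0.5, size=(200, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
print(gmm.weights_)            # mixing coefficients pi_k
print(gmm.means_)              # component means mu_k
labels = gmm.predict(X)        # hard cluster assignments
resp = gmm.predict_proba(X)    # soft "responsibilities" p(z | x)
```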
33
GMM example
34
Mixture Models
Formally, a mixture model is the weighted sum of a number of pdfs, where the weights are determined by a distribution \pi:
p(x) = \sum_{k=1}^{K} \pi_k f_k(x), \qquad \sum_{k=1}^{K} \pi_k = 1, \; \pi_k \ge 0
35
Gaussian Mixture Models
GMM: the weighted sum of a number of Gaussians, where the weights are determined by a distribution \pi:
p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)
36
Graphical Models with unobserved variables
What if you have variables in a graphical model that are never observed? These are latent variables. Training latent-variable models is an unsupervised learning application. (Example graphical model: latent states "uncomfortable" and "amused" with observed variables "sweating" and "laughing".)
37
Latent Variable HMMs We can cluster sequences using an HMM with unobserved state variables We will train latent variable models using Expectation Maximization
38
Expectation Maximization
Expectation Maximization
Both the training of GMMs and of graphical models with latent variables can be accomplished using Expectation Maximization.
Step 1: Expectation (E-step): evaluate the "responsibilities" of each cluster with the current parameters.
Step 2: Maximization (M-step): re-estimate the parameters using the existing "responsibilities".
Similar to K-means training.
39
Latent Variable Representation
We can represent a GMM in terms of a latent variable z. What does this give us?
40
GMM data and Latent variables
41
One last bit
We have representations of the joint p(x, z) and the marginal p(x). The conditional p(z | x) can be derived using Bayes' rule; it is the responsibility that a mixture component takes for explaining the observation x:
\gamma(z_k) = \frac{\pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x \mid \mu_j, \Sigma_j)}
42
Maximum Likelihood over a GMM
As usual: identify a likelihood function and set its partial derivatives to zero. For a GMM the log likelihood is
\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln\left(\sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\right)
43
Maximum Likelihood of a GMM
Optimization of the means:
\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n, \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk})
44
Maximum Likelihood of a GMM
Optimization of the covariances:
\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^T
45
Maximum Likelihood of a GMM
Optimization of the mixing coefficients, enforcing \sum_k \pi_k = 1 with a Lagrange multiplier:
\frac{\partial}{\partial \pi_k}\left[\ln p(X \mid \pi, \mu, \Sigma) + \lambda\left(\sum_{k=1}^{K} \pi_k - 1\right)\right] = 0 \;\Rightarrow\; \pi_k = \frac{N_k}{N}
46
MLE of a GMM
47
EM for GMMs
Initialize the parameters.
Evaluate the log likelihood.
E-step: evaluate the responsibilities.
M-step: re-estimate the parameters.
Check for convergence.
48
EM for GMMs
E-step: evaluate the responsibilities:
\gamma(z_{nk}) = \frac{\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}
49
EM for GMMs
M-step: re-estimate the parameters:
\mu_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n
\Sigma_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k^{new})(x_n - \mu_k^{new})^T
\pi_k^{new} = \frac{N_k}{N}, \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk})
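Putting the E- and M-steps together, a compact NumPy/SciPy sketch of the EM loop (illustrative, not production code; the small ridge added to the covariances is an assumption made here to guard against the singularities discussed later):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, tol=1e-6, seed=0):
    """A compact EM loop for a GMM, mirroring the E-step and M-step slides."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)                        # mixing coefficients
    mu = X[rng.choice(N, size=K, replace=False)]    # means initialised from data points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities gamma(z_nk)
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                                for k in range(K)])
        ll = np.log(dens.sum(axis=1)).sum()         # current log likelihood
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi_k, mu_k, Sigma_k
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        if abs(ll - prev_ll) < tol:                 # convergence check on the likelihood
            break
        prev_ll = ll
    return pi, mu, Sigma, gamma
```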
50
Visual example of EM
51
Potential Problems Incorrect number of Mixture Components
Singularities
52
Incorrect Number of Gaussians
53
Incorrect Number of Gaussians
54
Singularities
A minority of the data can have a disproportionate effect on the model likelihood. For example, if a component's mean collapses onto a single data point, its variance can shrink toward zero and the likelihood grows without bound.
55
GMM example
56
Relationship to K-means
K-means makes hard decisions. Each data point gets assigned to a single cluster. GMM/EM makes soft decisions. Each data point can yield a posterior p(z|x) Soft K-means is a special case of EM.
57
Soft K-Means as GMM/EM
Assume equal, shared covariance matrices \epsilon I for every mixture component. Likelihood:
p(x|\mu_k,\Sigma_k) = \frac{1}{(2\pi\epsilon)^{M/2}}\exp\left\{-\frac{1}{2\epsilon}\lVert x-\mu_k\rVert^2\right\}
Responsibilities: as \epsilon approaches zero, the responsibility of the nearest cluster approaches one and all others approach zero, recovering the hard K-means assignment.
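A small numerical sketch of the claim above: with equal mixing weights and a shared covariance \epsilon I, the responsibilities computed from this likelihood harden toward a one-hot assignment of the nearest mean as \epsilon shrinks (the function name is illustrative).

```python
import numpy as np

def soft_kmeans_resp(X, means, eps):
    """Responsibilities under a GMM with equal weights and shared covariance eps * I,
    i.e. p(x | mu_k) proportional to exp(-||x - mu_k||^2 / (2 * eps))."""
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    logits = -d2 / (2.0 * eps)
    logits -= logits.max(axis=1, keepdims=True)     # stabilise the softmax
    r = np.exp(logits)
    return r / r.sum(axis=1, keepdims=True)

means = np.array([[0.0, 0.0], [3.0, 0.0]])
x = np.array([[1.0, 0.0]])
for eps in (5.0, 0.5, 0.05):
    print(eps, soft_kmeans_resp(x, means, eps))     # responsibilities harden toward [1, 0]
```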
58
Soft K-Means as GMM/EM
Overall log likelihood as \epsilon approaches zero: the expected complete-data log likelihood reduces (up to constants) to the negative within-cluster point scatter, i.e. the K-means distortion measure. Note: only the means are re-estimated in soft K-means; the covariance matrices are all tied.
59
General form of EM
Given a joint distribution p(X, Z \mid \theta) over observed and latent variables, we want to maximize the likelihood p(X \mid \theta).
Initialize the parameters \theta^{old}.
E-step: evaluate p(Z \mid X, \theta^{old}).
M-step: re-estimate the parameters based on the expectation of the complete-data log likelihood, \theta^{new} = \arg\max_{\theta} \sum_{Z} p(Z \mid X, \theta^{old}) \ln p(X, Z \mid \theta).
Check for convergence of the parameters or the likelihood.
60
Anomaly Detection
61
Explored Methods
Change detection: KNN-based Kullback-Leibler divergence, compared with Kolmogorov-Smirnov (1D).
Anomaly detection: One-Class SVM, compared with Mahalanobis distance.
62
Anomaly detection
Single-sample detection: an outlier with respect to baseline behavior.
Technique:
Quantify the usual behavior (train): a "multidimensional distribution".
Measure the probability of the current point (test).
Declare an 'outlier' if p < threshold.
Methods used: One-Class SVM, Mahalanobis distance.
63
Mahalanobis distance
Data are assumed to be N(µ, Σ).
Fit (µ, Σ) from the data (train).
Fine-tune the threshold on a validation set.
Detect x as an outlier if its Mahalanobis distance exceeds the threshold:
D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}
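A minimal sketch of this detector, assuming a held-out validation set is available for tuning the threshold to a target false alarm rate; the class and parameter names are illustrative.

```python
import numpy as np

class MahalanobisDetector:
    """Gaussian-baseline anomaly detector sketch: fit (mu, Sigma) on normal data,
    then flag points whose Mahalanobis distance exceeds a threshold tuned on validation data."""

    def fit(self, X_train, X_val, false_alarm=0.02):
        self.mu = X_train.mean(axis=0)
        self.Sigma_inv = np.linalg.inv(np.cov(X_train.T) + 1e-6 * np.eye(X_train.shape[1]))
        # Tune the threshold so that roughly `false_alarm` of validation points are flagged.
        self.threshold = np.quantile(self.score(X_val), 1.0 - false_alarm)
        return self

    def score(self, X):
        # Mahalanobis distance of each row of X to the fitted Gaussian.
        diff = X - self.mu
        return np.sqrt(np.einsum("ij,jk,ik->i", diff, self.Sigma_inv, diff))

    def predict(self, X):
        return self.score(X) > self.threshold   # True = outlier
```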
64
One-Class SVM
A recasting of the ordinary binary SVM; state-of-the-art novelty detection.
Finds the smallest-volume sphere with a fraction (1 - ν) of the data inside, so Prob(outlier) = ν.
ν enters the SVM objective function explicitly, giving robustness and direct control of the false alarm rate.
Optionally, fine-tune the threshold on a validation set to define the precise location of the decision surface ρ.
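A short usage sketch with scikit-learn's OneClassSVM; the data here are synthetic and the ν value is just an example of targeting a 2% false alarm rate.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 20))       # "usual" behaviour (illustration only)
X_test = rng.normal(loc=1.0, size=(200, 20))

# nu upper-bounds the fraction of training points treated as outliers,
# which is how the target false-alarm rate enters the objective directly.
oc_svm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.02).fit(X_train)
pred = oc_svm.predict(X_test)               # +1 = inlier, -1 = outlier
scores = oc_svm.decision_function(X_test)   # signed distance to the decision surface
print((pred == -1).mean())                  # fraction of test points flagged as outliers
```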
65
Multidim. anomaly detection, 10K runs, N(0,I(D))
Test | FA tuned | Actual FA | Detection Rate
SVM, 20D, Σ = Σ + I(20) | 0.02 | 0.015 | 0.596
Mahalanobis, 20D, Σ = Σ + I(20) | | 0.016 | 0.601
SVM, 2D, Σ = Σ + I(2) | | 0.028 | 0.144
Mahalanobis, 2D, Σ = Σ + I(2) | | 0.023 | 0.150
SVM, 2D, µ = µ + 1 | | 0.022 | 0.112
Mahalanobis, 2D, µ = µ + 1 | | | 0.131
SVM, 20D, µ = µ + 1 | | 0.017 | 0.613
Mahalanobis, 20D, µ = µ + 1 | | | 0.622
SVM, 2D, ρ = ρ + 0.9 | | 0.018 | 0.038
Mahalanobis, 2D, ρ = ρ + 0.9 | | 0.020 | 0.041
66
Keystroke – Real world finger typing timings
Real-world data of finger-typing timings:
The same 10-letter password is typed repeatedly.
51 human subjects, 8 daily sessions per subject, 50 repetitions in each daily session.
Each typing is characterized by 20 key-up/key-press timings.
Overall, each subject is represented by 400 samples x 20 sensors.
The dataset is applicable to anomaly detection, change detection, and knowledge extraction (multiclass classification).
A full description and some R implementations are available online.
67
Performance comparison on real data
Method | Detection rate (FA tuned to 0.05) | Actual FA rate (FA tuned to 0.05) | Detection rate (FA tuned to 0.02) | Actual FA rate (FA tuned to 0.02)
SVM One-Class, 20D | 0.59 ± 0.281 | 0.050 ± 0.068 | 0.448 ± 0.304 | 
Mahalanobis, 20D | 0.55 ± 0.295 | 0.062 ± 0.074 | 0.464 ± 0.307 | 0.045 ± 0.065
SVM One-Class, 2D | 0.441 ± 0.230 | 0.054 ± 0.069 | ± 0.257 | 0.034 ±
Mahalanobis, 2D | 0.446 ± 0.226 | 0.077 ± 0.077 | | 0.055 ±

SVM performance is preferable: better control of the false alarm rate and a higher detection rate.
68
Change detection
A consistent change in system behavior: different distributions in the past and the future.
Technique:
Quantify the distributions in the past and the future.
Measure the distance between the distributions.
Detect a change if the distance is higher than a threshold.
One-dimensional case: the Kolmogorov-Smirnov test.
Avoids explicit estimation of the distributions.
The score is distribution-independent, so fine tuning of the threshold may be avoided.
69
Kolmogorov Smirnov Test
Quantifies the difference between the empirical distributions of two samples from a 1D continuous random variable. The statistic is the maximal difference between the two empirical CDFs. The test returns a p-value for the hypothesis that the two samples come from the same distribution.
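A two-sample KS test is available in SciPy; a minimal usage sketch on synthetic data (for illustration only).

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
past = rng.normal(0.0, 1.0, size=500)
future = rng.normal(1.0, 1.0, size=500)     # shifted mean -> a change

res = ks_2samp(past, future)                # statistic = max |ECDF_past - ECDF_future|
print(res.statistic, res.pvalue)            # small p-value: reject "same distribution"
```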
70
Multivariate change detection
Techniques:
Quantify the distributions in the past and the future.
Measure the distance between the distributions.
Detect a change if the distance is higher than a threshold.
(Figure: time-indexed points t1 - t9 in the temperature vs. pressure plane.)
71
Score by KNN estimator of Kullback Leibler divergence
KNN estimation avoids explicit multidimensional distribution estimation. The data are split at time t = 0 into a Past window and a Future window. For each point in the Past cloud, calculate the nearest-neighbor distance within the Past cloud (ρ, Past-to-Past) and to the Future cloud (ν, Past-to-Future), and likewise for the Future cloud. Compute Score = D(Past || Future) + D(Future || Past).
Faivishevsky, "Information Theoretic Multivariate Change Detection for Multisensory Information Processing in Internet of Things", ICASSP 2016.
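A minimal sketch of a kNN-based KL-divergence score in this spirit, using the standard kNN estimator D(P||Q) ≈ (d/n) Σ_i log(ν_i/ρ_i) + log(m/(n-1)); this is an illustrative implementation, not necessarily the exact estimator variant of the cited paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_kl(P, Q, k=8):
    """kNN estimate of D(P || Q); P is an (n, d) array, Q is an (m, d) array."""
    n, d = P.shape
    m = Q.shape[0]
    # rho: distance from each P point to its k-th nearest neighbour inside P (excluding itself)
    rho = cKDTree(P).query(P, k=k + 1)[0][:, -1]
    # nu: distance from each P point to its k-th nearest neighbour in Q
    nu = cKDTree(Q).query(P, k=k)[0]
    nu = nu[:, -1] if k > 1 else nu
    return d * np.mean(np.log(nu / rho)) + np.log(m / (n - 1.0))

def change_score(past, future, k=8):
    """Symmetrised score from the slide: D(Past || Future) + D(Future || Past)."""
    return knn_kl(past, future, k) + knn_kl(future, past, k)
```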
72
Information Theoretic Multivariate Change detection algorithm
Train: set a threshold on the KNN KL score between Past and Future windows for a predefined false alarm rate f.
Test: declare a change when the KNN KL score between Past and Future exceeds the threshold.
(Figure: histogram of scores over windows; the threshold separates the lower fraction 1 - f of change-free windows from the upper fraction f.)
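A sketch of the train/test logic above, assuming a set of change-free reference windows is available to calibrate the threshold; the function names are illustrative.

```python
import numpy as np

def calibrate_threshold(reference_scores, false_alarm=0.01):
    """Train step: pick the score threshold so that a fraction `false_alarm`
    of change-free reference windows would exceed it."""
    return np.quantile(reference_scores, 1.0 - false_alarm)

def detect_change(score, threshold):
    """Test step: declare a change if the KNN KL score exceeds the threshold."""
    return score > threshold
```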
73
Change detection comparison, 10K runs, N(0,1)
Test | Window | FA tuned | Actual FA | Detection Rate
KL, detect µ = µ + 1 | 50,10 (k=8) | 0.001 | 0.0009 | 0.139
KS, detect µ = µ + 1 | 50,10 | | 0.0004 | 0.131
KL, detect µ = µ + 2 | | | 0.0006 | 0.921
KS, detect µ = µ + 2 | | | | 0.871
 | 30,10 (k=5) | 0.01 | 0.013 | 0.31
KS, detect µ = µ + 1 | 30,10 | | 0.005 | 0.326
KL, detect µ = µ + 1 | 30,10 (k=8) | | 0.006 | 0.343
KL, detect µ = µ + 2 | | | 0.008 | 0.960
KS, detect µ = µ + 2 | | | | 0.94
KL, detect σ = σ + 1 | | | 0.012 | 0.063
KS, detect σ = σ + 1 | | | 0.007 | 0.015
KL, detect σ = σ + 2 | | | 0.011 | 0.132
KS, detect σ = σ + 2 | | | 0.004 | 0.025
KL, detect σ = σ + 3 | | | 0.009 | 0.264
KS, detect σ = σ + 3 | | | | 0.036

KL and KS perform similarly for Δµ detection; KL is better for Δσ, and its false alarm rate is better controllable.
74
Multidimensional change detection, N(0,I(D))
Test | Window | FA tuned | Actual FA | Detection Rate
KL, 20D, Σ = Σ + I(20) | 30,10 (k=8) | 0.01 | 0.010 | 0.560
KL, 2D, Σ = Σ + I(2) | | | 0.007 | 0.067
KL, 20D, µ = µ + 1 | | | | 1.000
KL, 2D, µ = µ + 1 | | | | 0.638
KL, 2D, ρ = ρ + 0.9 | | | 0.02 | 0.35

KNN KL leverages multidimensional information to detect changes that cannot be detected by one-dimensional methods: small changes in µ, small changes in Σ, and changes in ρ.
75
Application of Keystrokes data to change detection
1. Use a session of consecutive 20-dimensional timing vectors of one subject as the start.
2. Append a session of consecutive 20-dimensional timing vectors of another subject.
3. Check whether the change detection method detects the stitch point, or fires elsewhere (false alarm).
Repeat steps 1-3 to get substantial statistics.
76
KNN KL Change Detection performance on real data
Method | Window size | Change detection rate (FA tuned to 0.01) | Actual FA rate (FA tuned to 0.01) | K statistics
KNN KL Divergence, 20D | 10 | 0.974 ± 0.056 | 0.029 ± 0.064 | 2
KNN KL Divergence, 20D | 4 | 0.761 ± 0.175 | 0.019 ± 0.020 | 
KNN KL Divergence, 2D | | 0.704 ± 0.184 | 0.019 ± 0.032 | 
KNN KL Divergence, 2D | | 0.489 ± 0.077 | 0.016 ± 0.021 | 
77
Thank you!