
1 Clustering II

2 Finite Mixtures
Model data using a mixture of distributions
–Each distribution represents one cluster
–Each distribution gives the probabilities of attribute values in that cluster
Finite mixtures: a finite number of clusters
Individual distributions are usually normal
Combine distributions using cluster weights
Each normal distribution can be described in terms of μ (mean) and σ (standard deviation)
For a single attribute with two clusters:
–μ_A, σ_A for cluster A and μ_B, σ_B for cluster B
–Attribute values are obtained by combining values from cluster A with probability P_A and from cluster B with probability P_B
–Five parameters, μ_A, σ_A, μ_B, σ_B and P_A (because P_A + P_B = 1), describe the attribute value distribution
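As a concrete illustration of the five-parameter description above, here is a minimal sketch (my own, with made-up parameter values; not part of the slides) of the mixture density and of sampling attribute values from it:

```python
# Hedged sketch: a two-cluster, one-attribute finite mixture defined by the
# five parameters on this slide. Names and values are illustrative only.
import numpy as np

mu_a, sd_a = 0.0, 1.0     # cluster A
mu_b, sd_b = 5.0, 2.0     # cluster B
p_a = 0.6                 # cluster weight for A; P_B = 1 - P_A

def mixture_density(x):
    """Probability density of an attribute value under the mixture."""
    def normal(x, mu, sd):
        return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    return p_a * normal(x, mu_a, sd_a) + (1 - p_a) * normal(x, mu_b, sd_b)

def sample(n, rng=np.random.default_rng(0)):
    """Draw n attribute values: pick a cluster, then draw from its normal."""
    from_a = rng.random(n) < p_a
    return np.where(from_a, rng.normal(mu_a, sd_a, n), rng.normal(mu_b, sd_b, n))
```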

3 EM Algorithm
EM = Expectation–Maximization
–Generalizes k-means to a probabilistic setting
Input: a collection of instances and the number of clusters, k
Output: the probabilities with which each instance belongs to each of the k clusters
Method:
–Start by guessing values for all the parameters of the k clusters (similar to guessing centroids in k-means)
–Repeat
  'Expectation' step: calculate the cluster probabilities for each instance
  'Maximization' step: estimate the distribution parameters from the cluster probabilities
    Store the cluster probabilities as instance weights
    Estimate the parameters from the weighted instances
–Until the estimated parameters converge, i.e. they explain the input data and stop improving
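A minimal runnable sketch of these steps for the two-cluster, single-attribute case from the previous slide (function and variable names are my own; the log-likelihood convergence test is an assumed stopping rule, not stated on the slides):

```python
# Hedged sketch of EM for a two-cluster, one-attribute Gaussian mixture.
import numpy as np

def em_two_gaussians(x, n_iter=100, tol=1e-6):
    # Guess initial parameters (analogous to guessing centroids in k-means)
    mu_a, mu_b = x.min(), x.max()
    sd_a = sd_b = x.std() + 1e-9
    p_a = 0.5                                  # cluster weight; p_b = 1 - p_a

    def pdf(x, mu, sd):
        return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

    prev_ll = -np.inf
    for _ in range(n_iter):
        # E step: probability that each instance belongs to cluster A
        wa = p_a * pdf(x, mu_a, sd_a)
        wb = (1 - p_a) * pdf(x, mu_b, sd_b)
        resp = wa / (wa + wb)                  # instance weights for cluster A

        # M step: re-estimate parameters from the weighted instances
        mu_a = np.average(x, weights=resp)
        mu_b = np.average(x, weights=1 - resp)
        sd_a = np.sqrt(np.average((x - mu_a) ** 2, weights=resp)) + 1e-9
        sd_b = np.sqrt(np.average((x - mu_b) ** 2, weights=1 - resp)) + 1e-9
        p_a = resp.mean()

        # Stop when the log-likelihood of the data stops improving
        ll = np.log(wa + wb).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return mu_a, sd_a, mu_b, sd_b, p_a
```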

4 Incremental Clustering (Cobweb/Classit)
Input: a collection of instances
Output: a hierarchy of clusters
Method:
–Start with an empty root node of the tree
–Add instances one by one
–If any of the existing leaves is a good 'host' for the incoming instance, form a cluster with it
  A good host has high category utility (next slide)
–If required, restructure the tree
Cobweb – for nominal attributes
Classit – for numerical attributes

5 Category Utility
Category utility:
CU(C_1, C_2, \dots, C_k) = \frac{1}{k} \sum_{l} P[C_l] \sum_{i} \sum_{j} \left( P[a_i = v_{ij} \mid C_l]^2 - P[a_i = v_{ij}]^2 \right)
It measures the advantage gained in predicting the attribute values of the instances in a cluster
–If knowing the cluster of an instance does not help in predicting the values of its attributes, the cluster isn't worth forming
The inner difference of squared probabilities, P[a_i = v_{ij} | C_l]^2 − P[a_i = v_{ij}]^2, captures this gain for one attribute value
Dividing by the number of clusters, k, gives the gain per cluster
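A hedged sketch of this computation for nominal attributes (the function name and the dictionary representation of instances are my own assumptions):

```python
# Category utility of a partition, following the formula on this slide.
from collections import Counter

def category_utility(clusters):
    """clusters: list of clusters; each cluster is a list of instances,
    each instance a dict mapping attribute name -> nominal value."""
    all_instances = [inst for cluster in clusters for inst in cluster]
    n = len(all_instances)
    attributes = all_instances[0].keys()

    # Unconditional term: sum_i sum_j P[a_i = v_ij]^2
    base = 0.0
    for a in attributes:
        counts = Counter(inst[a] for inst in all_instances)
        base += sum((c / n) ** 2 for c in counts.values())

    cu = 0.0
    for cluster in clusters:
        p_cluster = len(cluster) / n
        cond = 0.0                       # sum_i sum_j P[a_i = v_ij | C_l]^2
        for a in attributes:
            counts = Counter(inst[a] for inst in cluster)
            cond += sum((c / len(cluster)) ** 2 for c in counts.values())
        cu += p_cluster * (cond - base)
    return cu / len(clusters)            # per-cluster average
```

Calling category_utility on a candidate two-cluster split of the weather instances on the next slide would indicate whether that split is worth forming.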

6 Weather Data with ID

ID  Outlook   Temperature  Humidity  Windy  Play
a   sunny     hot          high      false  no
b   sunny     hot          high      true   no
c   overcast  hot          high      false  yes
d   rainy     mild         high      false  yes
e   rainy     cool         normal    false  yes
f   rainy     cool         normal    true   no
g   overcast  cool         normal    true   yes
h   sunny     mild         high      false  no
i   sunny     cool         normal    false  yes
j   rainy     mild         normal    false  yes
k   sunny     mild         normal    true   yes
l   overcast  mild         high      true   yes
m   overcast  hot          normal    false  yes
n   rainy     mild         high      true   no

Artificial data, therefore it is not possible to find natural clusters (two clean clusters of yeses and nos are not possible).

7 Trace of Cobweb
[Figure: Cobweb tree snapshots 1–3 for instances a–f]
–Snapshots 1–2: no good host for the first five instances (a–e)
–Snapshot 3: when f arrives, e is the best host; the category utility of e&f as a cluster is high (e and f are similar), so they form a cluster

8 Trace of Cobweb (Contd)
[Figure: Cobweb tree snapshots 4–5 for instances g and h]
–Snapshot 4 (g arrives): at the root, the e&f cluster is the best host; within e&f there is no good host, so no new cluster is formed and g is added to the e&f cluster (f and g are similar)
–Snapshot 5 (h arrives): at the root, a is the best host and d is the runner-up; before h is inserted, the runner-up d is evaluated; the category utility of a&d is high, so d is merged with a to form a new cluster; within a&d there is no good host, so h is added to the a&d cluster

9 Trace of Cobweb (Contd)
[Figure: final Cobweb tree over all 14 instances a–n]
For large data sets, growing the tree down to individual instances might lead to overfitting. A similarity threshold called the cutoff is used to suppress growth.

10 Hierarchical Agglomerative Clustering
Input: a collection of instances
Output: a hierarchy of clusters
Method:
–Start with individual instances as clusters
–Repeat
  Merge the two 'closest' clusters
–Until only one cluster remains
Ward's method: the closeness or proximity between two clusters is defined as the increase in squared error that results when the two clusters are merged
The squared-error measure is used only for the local decision of merging clusters
–No global optimization
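A minimal sketch of this procedure using SciPy's Ward linkage (the library choice and the toy data are my own assumptions; the slides do not prescribe an implementation):

```python
# Hedged sketch: hierarchical agglomerative clustering with Ward's method.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 4)            # 20 instances, 4 numeric attributes (toy data)

# Ward's method: at each step merge the pair of clusters whose union
# gives the smallest increase in squared error (a local, greedy decision).
Z = linkage(X, method='ward')

# Cut the resulting hierarchy into, e.g., 3 clusters.
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)
```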

11 HCE
A visual knowledge discovery tool for analysing and understanding multi-dimensional (> 3D) data
Offers multiple views of
–the input data and the clustered input data
–where the views are coordinated
Many other similar tools do a patchwork of statistics and graphics
HCE follows two fundamental statistical principles of exploratory data analysis:
–Examine each dimension first, then find relationships among dimensions
–Try graphical displays first, then find numerical summaries

12 GRID Principles
GRID – graphics, ranking and interaction for discovery
Two principles:
–Study 1D, study 2D, and then find features
–Ranking guides insight, statistics confirm
These principles help users organize their knowledge discovery process
Because of GRID, HCE is more than R + visualization
GRID can be used to derive scripts that organize exploratory data analysis in R (or a similar statistics package)

13 Rank-by-Feature Framework
A user interface framework based on the GRID principles
The framework
–uses interactive information visualization techniques combined with
–statistical methods and data mining algorithms
–to enable users to examine the input data in an orderly fashion
HCE implements the rank-by-feature framework
–This means HCE uses existing statistical and data mining methods to analyse the input data and
–communicates those results using interactive information visualization techniques

14 Multiple Views in HCE
–Dendrogram
–Colour mosaic
–1D histograms
–2D scatterplots
–And more

15 Dendrogram Display
Results of HAC are shown visually using a dendrogram
A dendrogram is a tree
–with data items at the terminal (leaf) nodes
–where the distance from the root node represents similarity among leaf nodes
Two visual controls:
–the minimum similarity bar allows users to adjust the number of clusters
–the detail cut-off bar allows users to reduce clutter
[Figure: example dendrogram over items A, B, C, D]
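A hedged sketch of rendering an HAC result as a dendrogram with SciPy and matplotlib; the two parameters below only loosely mimic HCE's two visual controls, which is my analogy, not HCE's actual interface:

```python
# Hedged sketch: dendrogram display of a Ward clustering result.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.rand(30, 4)                 # toy data
Z = linkage(X, method='ward')

dendrogram(
    Z,
    color_threshold=0.5 * Z[:, 2].max(),  # ~ minimum similarity bar: how many clusters to colour
    truncate_mode='lastp', p=12,          # ~ detail cut-off bar: collapse detail to reduce clutter
)
plt.show()
```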

16 Colour Mosaic
The input data is shown using this view
It is a colour-coded visual display of tabular data
Each cell in the table is painted in a colour that reflects the cell's value
Two variations:
–the layout of the mosaic is the same as the original table
–a transpose of the original layout
HCE uses the transposed layout because data sets usually have more rows than columns
A colour mapping control lets users adjust how cell values map to colours
[Figure: a small example table shown in its original layout and in the transposed layout]
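A minimal sketch of a colour mosaic drawn with matplotlib (an assumption of mine for illustration; this is not HCE's own rendering code):

```python
# Hedged sketch: colour-coded display of tabular data in the transposed layout.
import numpy as np
import matplotlib.pyplot as plt

data = np.random.rand(50, 6)          # 50 rows (instances), 6 columns (attributes)

# Transposed layout: attributes as rows, instances as columns,
# convenient when the data set has far more rows than columns.
plt.imshow(data.T, aspect='auto', cmap='RdYlBu')
plt.colorbar(label='cell value')      # rough stand-in for a colour mapping control
plt.xlabel('instances')
plt.ylabel('attributes')
plt.show()
```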

17 1D Histogram Ordering
This data view is part of the rank-by-feature framework
Data belonging to one column (variable) is displayed as a histogram + box plot
–the histogram shows the scale and skewness
–the box plot shows the data distribution, centre and spread
For the entire data set many such views are possible
By studying individual variables in detail, users can select the variables for the other visualizations
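A small sketch of this 1D view with matplotlib (illustrative only, using a made-up skewed variable; not HCE's display):

```python
# Hedged sketch: histogram plus box plot for one variable.
import numpy as np
import matplotlib.pyplot as plt

x = np.random.gamma(shape=2.0, size=500)   # one skewed variable (toy data)

fig, (ax_hist, ax_box) = plt.subplots(2, 1, sharex=True)
ax_hist.hist(x, bins=30)                   # scale and skewness
ax_box.boxplot(x, vert=False)              # centre and spread
plt.show()
```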

18 2D Scatter Plot Ordering
This data view is again part of the rank-by-feature framework
Three categories of 2D presentation are possible:
–axes of the plot obtained from Principal Component Analysis (linear or non-linear combinations of the original variables)
–axes of the plot obtained directly from the original variables
–parallel coordinates
HCE uses the second option: plotting pairs of the original variables
Both the 1D and 2D plots can be sorted according to a user-selected criterion such as the number of outliers
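A hedged sketch of ranking pairs of original variables and plotting the top-ranked pair; absolute correlation is used here only as a stand-in criterion (HCE offers others, such as the number of outliers), and the data are made up:

```python
# Hedged sketch: rank all variable pairs by a criterion, plot the best pair.
import numpy as np
import matplotlib.pyplot as plt
from itertools import combinations

data = np.random.rand(200, 5)              # 200 instances, 5 variables (toy data)

pairs = sorted(
    combinations(range(data.shape[1]), 2),
    key=lambda ij: abs(np.corrcoef(data[:, ij[0]], data[:, ij[1]])[0, 1]),
    reverse=True,
)
i, j = pairs[0]                            # best pair under the chosen criterion
plt.scatter(data[:, i], data[:, j], s=10)
plt.xlabel(f'variable {i}')
plt.ylabel(f'variable {j}')
plt.show()
```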

