
1 CogNova Technologies 1 COMP3503 Automated Discovery and Clustering Methods Daniel L. Silver

2 CogNova Technologies 2 Agenda  Automated Exploration/Discovery (unsupervised clustering methods)  K-Means Clustering Method  Kohonen Self Organizing Maps (SOM)

3 CogNova Technologies 3 Overview of Modeling Methods  Automated Exploration/Discovery: e.g. discovering new market segments; distance and probabilistic clustering algorithms  Prediction/Classification: e.g. forecasting gross sales given current factors; statistics (regression, K-nearest neighbour), artificial neural networks, genetic algorithms  Explanation/Description: e.g. characterizing customers by demographics; inductive decision trees/rules, rough sets, Bayesian belief nets [Figures: clustering in (x1, x2) space; a fitted function f(x); a rule "if age > 35 and income < $35k then ..." separating classes A and B]

4 CogNova Technologies 4 Automated Exploration/Discovery Through Unsupervised Learning Objective: To induce a model without use of a target (supervisory) variable such that similar examples are grouped into self-organized clusters or categories. This can be considered a method of unsupervised concept learning. There is no explicit teaching signal. [Figure: clusters A, B and C in (x1, x2) space]

5 CogNova Technologies 5 Classification Systems and Inductive Learning Basic Framework for Inductive Learning: the Environment supplies Training Examples (x, f(x)) to an Inductive Learning System, which produces an Induced Model or Classifier; Testing Examples (x, h(x)) yield the Output Classification, and we ask whether h(x) ~= f(x). A problem of representation and search for the best hypothesis, h(x). THIS FRAMEWORK IS NOT APPLICABLE TO UNSUPERVISED LEARNING

6 CogNova Technologies 6 Automated Exploration/Discovery Through Unsupervised Learning [Figure: multi-dimensional feature space with axes Age, Income ($) and Education]

7 CogNova Technologies 7 Automated Exploration/Discovery Through Unsupervised Learning Common Uses  Market segmentation  Population categorization  Product/service categorization  Automated subject indexing (WEBSOM)  Multi-variable (vector) quantization: reduce several variables to one

8 CogNova Technologies 8 Clustering (WHF 3.6 & 4.8) ● Clustering techniques apply when there is no class to be predicted ● Aim: divide instances into “natural” groups ● Clusters can be: ● disjoint vs. overlapping ● deterministic vs. probabilistic ● flat vs. hierarchical

9 CogNova Technologies 9 Representing clusters I [Figures: simple 2-D representation of disjoint clusters; Venn diagram of overlapping clusters]

10 CogNova Technologies 10 Representing clusters II: probabilistic assignment and dendrogram (NB: dendron is the Greek word for tree)
Probabilistic assignment:
      1    2    3
 a  0.4  0.1  0.5
 b  0.1  0.8  0.1
 c  0.3  0.3  0.4
 d  0.1  0.1  0.8
 e  0.4  0.2  0.4
 f  0.1  0.4  0.5
 g  0.7  0.2  0.1
 h  0.5  0.4  0.1
 ...
[Figure: dendrogram]

11 CogNova Technologies 11 ● The classic clustering algorithm -- k-means  k-means clusters are disjoint, deterministic, and flat 3.6 & 4.8 Clustering

12 CogNova Technologies 12 K-Means Clustering Method  Consider m examples each with 2 attributes [ x,y ] in a 2D input space  Method depends on storing all examples  Set number of clusters, K  Centroid of clusters is initially the average coordinates of first K examples or randomly chosen coordinates

13 CogNova Technologies 13 K-Means Clustering Method  Until cluster boundaries stop changing: Assign each example to the cluster whose centroid is nearest, using some distance measure (e.g. Euclidean distance). Recalculate the centroid of each cluster, e.g. [mean(x), mean(y)] for all examples currently in cluster K. (A code sketch of this loop follows below.)
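The loop on this slide maps directly onto a few lines of NumPy. The sketch below is only an illustration of the method as described here (Euclidean distance, centroids initialised to the first K examples); the function name `kmeans` and the synthetic data are illustrative, not from the course materials.

```python
import numpy as np

def kmeans(X, k, max_iter=100):
    """Minimal k-means for an (m, d) array; centroids start at the first k examples."""
    centroids = X[:k].astype(float).copy()
    assignment = None
    for _ in range(max_iter):
        # Assign each example to the cluster whose centroid is nearest (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break  # cluster boundaries stopped changing
        assignment = new_assignment
        # Recalculate each centroid as the mean of the examples currently in the cluster
        for j in range(k):
            members = X[assignment == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, assignment

# Tiny usage example with two obvious groups of 2-D points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1, size=(50, 2)),
               rng.normal([5, 5], 1, size=(50, 2))])
centres, labels = kmeans(X, k=2)
print(centres)
```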

14 CogNova Technologies 14 DEMO (3.6 & 4.8 Clustering): http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html

15 CogNova Technologies 15 K-Means Clustering Method  Advantages: simple algorithm to implement; can form very interesting groupings; clusters can be characterized by attributes/values; sequential learning method  Disadvantages: all variables must be ordinal in nature; problems transforming categorical variables; requires lots of memory, computation time; does poorly for large numbers of attributes (curse of dimensionality)

16 CogNova Technologies 16 Discussion ● Algorithm minimizes squared distance to cluster centers ● Result can vary significantly based on initial choice of seeds ● Can get trapped in a local minimum  Example: [figure showing instances and initial cluster centres that lead to a poor local minimum] ● To increase chance of finding global optimum: restart with different random seeds (a sketch of this follows below) ● Can be applied recursively with k = 2
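A common way to apply the restart advice is to run k-means several times from different random seeds and keep the run with the lowest total squared distance. A minimal sketch, assuming scikit-learn is available (the data `X` is a placeholder):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(1).normal(size=(200, 2))   # placeholder data

# n_init=20 restarts k-means from 20 different random initialisations and
# keeps the run with the lowest inertia (total squared distance to centres).
km = KMeans(n_clusters=3, n_init=20, random_state=0).fit(X)
print(km.inertia_)          # best (lowest) squared distance found
print(km.cluster_centers_)  # centres of the best run
```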

17 CogNova Technologies 17 Kohonen SOM (Self Organizing Feature Map)  Implements a version of K-Means  Two layer feed-forward neural network  Input layer fully connected to an output layer of N outputs arranged in 2D; weights initialized to small random values  Objective is to arrange the outputs into a map which is topologically organized according to the features presented in the data

18 CogNova Technologies 18 Kohonen SOM The Training Algorithm  Present an example to the inputs  The winning output is the one whose weights are "closest" to the input values (e.g. Euclidean dist.)  Adjust the winner's weights slightly to make it more like the example  Adjust the weights of the neighbouring output nodes relative to their proximity to the chosen output node  Reduce the neighbourhood size  Repeat for I iterations through the examples (see the sketch below)
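A minimal NumPy sketch of this training loop, assuming a small rectangular output grid, a Gaussian neighbourhood, and linearly decaying learning rate and neighbourhood radius; the specific schedules and grid size are illustrative choices, not prescribed by the slide.

```python
import numpy as np

def train_som(X, rows=10, cols=10, iters=20, lr0=0.5, radius0=5.0, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=0.1, size=(rows, cols, d))   # small random initial weights
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    n_steps = iters * len(X)
    step = 0
    for _ in range(iters):
        for x in X:
            frac = step / n_steps
            lr = lr0 * (1 - frac)                     # decaying learning rate
            radius = max(radius0 * (1 - frac), 0.5)   # shrinking neighbourhood
            # Winning output: node whose weights are closest to the input (Euclidean)
            dists = np.linalg.norm(W - x, axis=2)
            win = np.unravel_index(dists.argmin(), dists.shape)
            # Gaussian neighbourhood around the winner on the 2-D grid
            grid_dist2 = ((grid - np.array(win)) ** 2).sum(axis=-1)
            h = np.exp(-grid_dist2 / (2 * radius ** 2))
            # Move the winner (and, more weakly, its neighbours) towards the example
            W += lr * h[..., None] * (x - W)
            step += 1
    return W

# Usage: map 3-D examples onto a 10x10 topologically ordered grid
X = np.random.default_rng(1).random((500, 3))
weights = train_som(X)
```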

19 CogNova Technologies 19 SOM Demos  http://www.eee.metu.edu.tr/~alatan/Courses/Demo/Kohonen.htm  http://davis.wpi.edu/~matt/courses/soms/applet.html

20 CogNova Technologies 20 Kohonen SOM  In the end, the network effectively quantizes the input vector of each example to a single output node  The weights to that node indicate the feature values that characterize the cluster  Topology of map shows association/proximity of clusters  Biological justification - evidence of localized topological mapping in neocortex

21 CogNova Technologies 21 Faster distance calculations ● Can we use kD-trees or ball trees to speed up the process? Yes: ● First, build tree, which remains static, for all the data points ● At each node, store number of instances and sum of all instances ● In each iteration, descend tree and find out which cluster each node belongs to ● Can stop descending as soon as we find out that a node belongs entirely to a particular cluster ● Use statistics stored at the nodes to compute new cluster centers
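The full trick on this slide (a static tree over the data points with per-node counts and sums, pruned once a node falls entirely inside one cluster) takes a fair amount of code. The smaller sketch below shows only the easier half of the idea and is not the algorithm the slide describes: it uses SciPy's k-d tree to answer the nearest-centre queries quickly inside an otherwise ordinary k-means loop.

```python
import numpy as np
from scipy.spatial import cKDTree

def kmeans_kdtree(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(iters):
        # k-d tree over the current centres: one nearest-neighbour query per
        # point instead of computing every point-centre distance explicitly.
        tree = cKDTree(centres)
        _, labels = tree.query(X)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centres[j] = members.mean(axis=0)
    return centres, labels

X = np.random.default_rng(2).normal(size=(1000, 4))
centres, labels = kmeans_kdtree(X, k=5)
```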

22 CogNova Technologies 22 Example

23 CogNova Technologies 23 6.8 Clustering: how many clusters? ● How to choose k in k-means? Possibilities: ● Choose k that minimizes cross-validated squared distance to cluster centers ● Use penalized squared distance on the training data (e.g. using an MDL criterion); a rough sketch of this idea follows below ● Apply k-means recursively with k = 2 and use a stopping criterion (e.g. based on MDL) ● Seeds for subclusters can be chosen by seeding along the direction of greatest variance in the cluster (one standard deviation away in each direction from the cluster center of the parent cluster) ● Implemented in an algorithm called X-means (using the Bayesian Information Criterion instead of MDL)
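As a rough illustration of the "penalized squared distance" idea only (this is a common BIC-style heuristic, not the MDL or X-means formulation used in the book), each candidate k can be scored by the within-cluster squared distance plus a penalty that grows with the number of free parameters:

```python
import numpy as np
from sklearn.cluster import KMeans

def penalized_score(X, k):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    n, d = X.shape
    # Heuristic BIC-style score: data-fit term plus a penalty on the
    # number of free parameters (k centres with d coordinates each).
    return n * np.log(km.inertia_ / n) + k * d * np.log(n)

X = np.random.default_rng(3).normal(size=(300, 2))   # placeholder data
best_k = min(range(2, 10), key=lambda k: penalized_score(X, k))
print("chosen k:", best_k)
```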

24 CogNova Technologies 24 Hierarchical clustering ● Recursively splitting clusters produces a hierarchy that can be represented as a dendrogram  Could also be represented as a Venn diagram of sets and subsets (without intersections)  Height of each node in the dendrogram can be made proportional to the dissimilarity between its children

25 CogNova Technologies 25 Agglomerative clustering ● Bottom-up approach ● Simple algorithm  Requires a distance/similarity measure  Start by considering each instance to be a cluster  Find the two closest clusters and merge them  Continue merging until only one cluster is left  The record of mergings forms a hierarchical clustering structure – a binary dendrogram

26 CogNova Technologies 26 Distance measures ● Single-linkage  Minimum distance between the two clusters  Distance between the two clusters' closest members  Can be sensitive to outliers ● Complete-linkage  Maximum distance between the two clusters  Two clusters are considered close only if all instances in their union are relatively similar  Also sensitive to outliers  Seeks compact clusters

27 CogNova Technologies 27 Distance measures cont. ● Compromise between the extremes of minimum and maximum distance ● Represent clusters by their centroid, and use distance between centroids – centroid linkage  Works well for instances in multidimensional Euclidean space  Not so good if all we have is pairwise similarity between instances ● Calculate average distance between each pair of members of the two clusters – average-linkage ● Technical deficiency of both: results depend on the numerical scale on which distances are measured

28 CogNova Technologies 28 More distance measures ● Group-average clustering  Uses the average distance between all members of the merged cluster  Differs from average-linkage because it includes pairs from the same original cluster ● Ward's clustering method  Calculates the increase in the sum of squares of the distances of the instances from the centroid before and after fusing two clusters  Minimize the increase in this squared distance at each clustering step ● All measures will produce the same result if the clusters are compact and well separated
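For reference, SciPy's hierarchical-clustering routines implement the linkage criteria discussed on the last few slides directly; a small sketch on placeholder data (the method names 'single', 'complete', 'average', 'centroid' and 'ward' correspond to the measures above):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(4).normal(size=(30, 2))   # placeholder instances

for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)                     # bottom-up merging, recorded as a dendrogram
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
    print(method, np.bincount(labels))
```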

29 CogNova Technologies 29 Example hierarchical clustering ● 50 examples of different creatures from the zoo data [Figures: dendrogram and polar plot, complete-linkage]

30 CogNova Technologies 30 Example hierarchical clustering 2 [Figure: single-linkage]

31 CogNova Technologies 31 Incremental clustering ● Heuristic approach (COBWEB/CLASSIT) ● Form a hierarchy of clusters incrementally ● Start: ● tree consists of empty root node ● Then: ● add instances one by one ● update tree appropriately at each stage ● to update, find the right leaf for an instance ● May involve restructuring the tree ● Base update decisions on category utility

32 CogNova Technologies 32 Clustering weather data
ID  Outlook   Temp.  Humidity  Windy
A   Sunny     Hot    High      False
B   Sunny     Hot    High      True
C   Overcast  Hot    High      False
D   Rainy     Mild   High      False
E   Rainy     Cool   Normal    False
F   Rainy     Cool   Normal    True
G   Overcast  Cool   Normal    True
H   Sunny     Mild   High      False
I   Sunny     Cool   Normal    False
J   Rainy     Mild   Normal    False
K   Sunny     Mild   Normal    True
L   Overcast  Mild   High      True
M   Overcast  Hot    Normal    False
N   Rainy     Mild   High      True
[Figures: incremental clustering trees at steps 1, 2 and 3]

33 CogNova Technologies 33 Clustering weather data (same table as the previous slide) [Figures: clustering trees at subsequent steps] Merge best host and runner-up. Consider splitting the best host if merging doesn't help.

34 CogNova Technologies 34 Final hierarchy

35 CogNova Technologies 35 Example: the iris data (subset)

36 CogNova Technologies 36 Clustering with cutoff

37 CogNova Technologies 37 Category utility ● Category utility: quadratic loss function defined on conditional probabilities (see below) ● Every instance in a different category ⇒ numerator becomes maximum number of attributes
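The formula this slide refers to appears only as an image in the original deck; the standard Witten–Frank definition it corresponds to is:

```latex
CU(C_1, C_2, \dots, C_k) = \frac{1}{k} \sum_{l=1}^{k} \Pr[C_l]
  \sum_{i} \sum_{j} \left( \Pr[a_i = v_{ij} \mid C_l]^2 - \Pr[a_i = v_{ij}]^2 \right)
```

where the C_l are the clusters, the a_i the attributes, and the v_ij the possible values of attribute a_i.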

38 CogNova Technologies 38 Numeric attributes
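The body of this slide is also an equation image in the original. In the book's treatment (restated here from the standard text, so treat it as a hedged reconstruction), numeric attributes are assumed normally distributed, and the inner sum over values is replaced by an integral over the density:

```latex
\sum_{j} \Pr[a_i = v_{ij}]^2 \;\longrightarrow\; \int f(a_i)^2 \, da_i = \frac{1}{2\sqrt{\pi}\,\sigma_i}
```

so that category utility for numeric attributes becomes

```latex
CU = \frac{1}{2\sqrt{\pi}\,k} \sum_{l} \Pr[C_l] \sum_{i} \left( \frac{1}{\sigma_{il}} - \frac{1}{\sigma_i} \right)
```

where sigma_il is the standard deviation of attribute a_i within cluster C_l and sigma_i its standard deviation over all the data.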

39 CogNova Technologies 39 Probability-based clustering ● Problems with heuristic approach: ● Division by k? ● Order of examples? ● Are restructuring operations sufficient? ● Is result at least a local minimum of category utility? ● Probabilistic perspective ⇒ seek the most likely clusters given the data ● Also: an instance belongs to a particular cluster with a certain probability

40 CogNova Technologies 40 Finite mixtures ● Model data using a mixture of distributions ● One cluster, one distribution ● governs probabilities of attribute values in that cluster ● Finite mixtures : finite number of clusters ● Individual distributions are normal (Gaussian) ● Combine distributions using cluster weights

41 CogNova Technologies 41 Two-class mixture model data: A 51 A 43 B 62 B 64 A 45 A 42 A 46 A 45 A 45 B 62 A 47 A 52 B 64 A 51 B 65 A 48 A 49 A 46 B 64 A 51 A 52 B 62 A 49 A 48 B 62 A 43 A 40 A 48 B 64 A 51 B 63 A 43 B 65 B 66 B 65 A 46 A 39 B 62 B 64 A 52 B 63 B 64 A 48 B 64 A 48 A 51 A 48 B 64 A 42 A 48 A 41 model: μA = 50, σA = 5, pA = 0.6; μB = 65, σB = 2, pB = 0.4

42 CogNova Technologies 42 Using the mixture model
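The equations for this slide are images in the original; the standard working they stand for is Bayes' rule applied with the normal density. The probability that instance x was generated by cluster A is

```latex
\Pr[A \mid x] = \frac{\Pr[x \mid A]\,\Pr[A]}{\Pr[x]}
             = \frac{f(x;\,\mu_A,\sigma_A)\, p_A}{\Pr[x]},
\qquad
f(x;\,\mu,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)
```

and similarly for cluster B; the denominator Pr[x] cancels when the two membership probabilities are normalised to sum to one.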

43 CogNova Technologies 43 Learning the clusters ● Assume: we know there are k clusters ● Learn the clusters ⇒ determine their parameters, i.e. means and standard deviations ● Performance criterion: probability of training data given the clusters ● EM algorithm finds a local maximum of the likelihood

44 CogNova Technologies 44 EM algorithm ● EM = Expectation-Maximization ● Generalize k-means to probabilistic setting ● Iterative procedure: ● E “expectation” step: Calculate cluster probability for each instance ● M “maximization” step: Estimate distribution parameters from cluster probabilities ● Store cluster probabilities as instance weights ● Stop when improvement is negligible
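A minimal sketch of these two steps for the one-dimensional, two-Gaussian mixture of the earlier slide; the initialisation, iteration cap and convergence threshold are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import norm

def em_two_gaussians(x, iters=200, tol=1e-6):
    # Crude initialisation: split the sorted data in half
    x = np.sort(np.asarray(x, dtype=float))
    mu_a, mu_b = x[:len(x) // 2].mean(), x[len(x) // 2:].mean()
    sd_a = sd_b = x.std()
    p_a = 0.5
    prev_ll = -np.inf
    for _ in range(iters):
        # E step: cluster probability (responsibility) of A for each instance
        like_a = p_a * norm.pdf(x, mu_a, sd_a)
        like_b = (1 - p_a) * norm.pdf(x, mu_b, sd_b)
        w_a = like_a / (like_a + like_b)
        # M step: re-estimate distribution parameters from the weighted instances
        p_a = w_a.mean()
        mu_a = np.sum(w_a * x) / np.sum(w_a)
        mu_b = np.sum((1 - w_a) * x) / np.sum(1 - w_a)
        sd_a = np.sqrt(np.sum(w_a * (x - mu_a) ** 2) / np.sum(w_a))
        sd_b = np.sqrt(np.sum((1 - w_a) * (x - mu_b) ** 2) / np.sum(1 - w_a))
        # Stop when the log-likelihood improvement is negligible
        ll = np.sum(np.log(like_a + like_b))
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return mu_a, sd_a, mu_b, sd_b, p_a

data = np.concatenate([np.random.default_rng(5).normal(50, 5, 60),
                       np.random.default_rng(6).normal(65, 2, 40)])
print(em_two_gaussians(data))
```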

45 CogNova Technologies 45 More on EM ● Estimate parameters from weighted instances ● Stop when log-likelihood saturates ● Log-likelihood:
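The expression this bullet points to is an image in the original slide; for the two-cluster case it is the standard mixture log-likelihood

```latex
\log L = \sum_{i} \log\bigl( p_A \Pr[x_i \mid A] + p_B \Pr[x_i \mid B] \bigr)
```

computed from the weighted instances and monitored until it stops improving (saturates).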

46 CogNova Technologies 46 Extending the mixture model ● More than two distributions: easy ● Several attributes: easy—assuming independence! ● Correlated attributes: difficult ● Joint model: bivariate normal distribution with a (symmetric) covariance matrix ● n attributes: need to estimate n + n(n+1)/2 parameters

47 CogNova Technologies 47 More mixture model extensions ● Nominal attributes: easy if independent ● Correlated nominal attributes: difficult ● Two correlated attributes ⇒ v1 × v2 parameters ● Missing values: easy ● Can use other distributions than normal: ● "log-normal" if predetermined minimum is given ● "log-odds" if bounded from above and below ● Poisson for attributes that are integer counts ● Use cross-validation to estimate k!

48 CogNova Technologies 48 Bayesian clustering ● Problem: many parameters ⇒ EM overfits ● Bayesian approach: give every parameter a prior probability distribution ● Incorporate prior into overall likelihood figure ● Penalizes introduction of parameters ● E.g.: Laplace estimator for nominal attributes ● Can also have prior on number of clusters! ● Implementation: NASA's AUTOCLASS

49 CogNova Technologies 49 Discussion ● Can interpret clusters by using supervised learning ● post-processing step ● Decrease dependence between attributes? ● pre-processing step ● E.g. use principal component analysis ● Can be used to fill in missing values ● Key advantage of probabilistic clustering: ● Can estimate likelihood of data ● Use it to compare different models objectively

50 CogNova Technologies 50 WEKA Tutorial: K-means, EM and Cobweb with the Weather data http://www.cs.ccsu.edu/~markov/ccsu_courses/DataMining-Ex3.html

51 CogNova Technologies 51

52 CogNova Technologies 52 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6) Semisupervised learning ● Semisupervised learning: attempts to use unlabeled data as well as labeled data  The aim is to improve classification performance ● Why try to do this? Unlabeled data is often plentiful and labeling data can be expensive  Web mining: classifying web pages  Text mining: identifying names in text  Video mining: classifying people in the news ● Leveraging the large pool of unlabeled examples would be very attractive

53 CogNova Technologies 53 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 7) Clustering for classification ● Idea: use naïve Bayes on labeled examples and then apply EM ● First, build naïve Bayes model on labeled data ● Second, label unlabeled data based on class probabilities ("expectation" step) ● Third, train new naïve Bayes model based on all the data ("maximization" step) ● Fourth, repeat the 2nd and 3rd steps until convergence (a sketch follows below) ● Essentially the same as EM for clustering with fixed cluster membership probabilities for labeled data and #clusters = #classes
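A hedged sketch of this loop using scikit-learn's Gaussian naïve Bayes; the fractional-label trick via sample weights, the synthetic data and the fixed iteration count are all illustrative assumptions, not the book's implementation (which uses a multinomial model for documents).

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def nb_em(X_lab, y_lab, X_unlab, iters=10):
    clf = GaussianNB().fit(X_lab, y_lab)          # step 1: model from labeled data only
    classes = clf.classes_
    for _ in range(iters):
        # "Expectation": class probabilities for the unlabeled examples
        proba = clf.predict_proba(X_unlab)
        # "Maximization": refit on all the data, giving each unlabeled example one
        # fractional copy per class, weighted by its predicted class probability
        X_all = np.vstack([X_lab, np.repeat(X_unlab, len(classes), axis=0)])
        y_all = np.concatenate([y_lab, np.tile(classes, len(X_unlab))])
        w_all = np.concatenate([np.ones(len(X_lab)), proba.ravel()])
        clf = GaussianNB().fit(X_all, y_all, sample_weight=w_all)
    return clf

rng = np.random.default_rng(7)
X_lab = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(4, 1, (10, 2))])
y_lab = np.array([0] * 10 + [1] * 10)
X_unlab = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(4, 1, (200, 2))])
model = nb_em(X_lab, y_lab, X_unlab)
```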

54 CogNova Technologies 54 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 7) Comments ● Has been applied successfully to document classification ● Certain phrases are indicative of classes ● Some of these phrases occur only in the unlabeled data, some in both sets ● EM can generalize the model by taking advantage of co-occurrence of these phrases ● Refinement 1: reduce weight of unlabeled data ● Refinement 2: allow multiple clusters per class

55 CogNova Technologies 55 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 7) Co-training ● Method for learning from multiple views (multiple sets of attributes), eg: ● First set of attributes describes content of web page ● Second set of attributes describes links that link to the web page ● Step 1: build model from each view ● Step 2: use models to assign labels to unlabeled data ● Step 3: select those unlabeled examples that were most confidently predicted (ideally, preserving ratio of classes) ● Step 4: add those examples to the training set ● Step 5: go to Step 1 until data exhausted ● Assumption: views are independent
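A compact sketch of the five steps above with two views and logistic-regression base models; the base learner, the per-round quota, the confidence rule and the synthetic data are arbitrary illustration choices (in particular, this version does not preserve the class ratio when selecting examples).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(X1, X2, y, X1_u, X2_u, rounds=10, per_round=10):
    """Pool-based co-training sketch over two views of the same instances."""
    X1, X2, y = X1.copy(), X2.copy(), y.copy()
    unlab = np.arange(len(X1_u))              # indices of still-unlabeled examples
    for _ in range(rounds):
        # Step 1: build one model per view from the current labeled pool
        m1 = LogisticRegression(max_iter=1000).fit(X1, y)
        m2 = LogisticRegression(max_iter=1000).fit(X2, y)
        if len(unlab) == 0:
            break
        # Step 2: label the remaining unlabeled data with each view's model
        p1 = m1.predict_proba(X1_u[unlab]).max(axis=1)
        p2 = m2.predict_proba(X2_u[unlab]).max(axis=1)
        # Step 3: select the most confidently predicted unlabeled examples
        pick = np.argsort(-np.maximum(p1, p2))[:per_round]
        chosen = unlab[pick]
        labels = np.where(p1[pick] >= p2[pick],
                          m1.predict(X1_u[chosen]),
                          m2.predict(X2_u[chosen]))
        # Step 4: add them (with their predicted labels) to the training set
        X1 = np.vstack([X1, X1_u[chosen]])
        X2 = np.vstack([X2, X2_u[chosen]])
        y = np.concatenate([y, labels])
        unlab = np.setdiff1d(unlab, chosen)   # Step 5: repeat until data exhausted
    return m1, m2

# Illustrative usage with synthetic two-view data
rng = np.random.default_rng(8)
y0 = np.array([0] * 15 + [1] * 15)
X1 = rng.normal(y0[:, None] * 3.0, 1.0, (30, 2))
X2 = rng.normal(y0[:, None] * 3.0, 1.0, (30, 2))
X1_u = rng.normal(1.5, 2.0, (100, 2))
X2_u = rng.normal(1.5, 2.0, (100, 2))
m1, m2 = co_train(X1, X2, y0, X1_u, X2_u)
```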

56 CogNova Technologies 56 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 7) EM and co-training ● Like EM for semisupervised learning, but view is switched in each iteration of EM ● Uses all the unlabeled data (probabilistically labeled) for training ● Has also been used successfully with support vector machines ● Using logistic models fit to output of SVMs ● Co-training also seems to work when views are chosen randomly! ● Why? Possibly because co-trained classifier is more robust

57 CogNova Technologies 57 THE END danny.silver@acadiau.ca

