1
Clustering Algorithms for Numerical Data Sets
2
Contents
1. Data Clustering Introduction
2. Hierarchical Clustering Algorithms
3. Partitional Data Clustering Algorithms: K-means clustering
4. Density-based Clustering Algorithms: Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
3
What is clustering?
Clustering: the process of grouping a set of objects into classes of similar objects.
It is the most common form of unsupervised learning.
Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification (label) is given for each training example.
4
Clustering
6
Clustering Applications
Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.
Land use: identification of areas of similar land use in an earth-observation database.
Insurance: identifying groups of motor-insurance policy holders with a high average claim cost.
City planning: identifying groups of houses according to their house type, value, and geographical location.
Earthquake studies: observed earthquake epicenters should be clustered along continental faults.
7
Examples of Knowledge Extracted by Data Clustering
For intelligent web search, data clustering can be conducted in advance on the terms contained in a set of training documents. When the user submits a search term, the intelligent search engine can then expand the query according to the term clusters.
8
For example, when the user submits “Federal Reserve Board”, the search engine automatically expands the query to include additional search terms such as {“Greenspan”, “FED”}. The search engine may further rank the retrieved documents based on their correlation to the search terms.
9
Clustering – Reference matching Fahlman, Scott & Lebiere, Christian (1989). The cascade-correlation learning architecture. In Touretzky, D., editor, Advances in Neural Information Processing Systems (volume 2), (pp. 524-532), San Mateo, CA. Morgan Kaufmann. Fahlman, S.E. and Lebiere, C., “The Cascade Correlation Learning Architecture,” NIPS, Vol. 2, pp. 524-532, Morgan Kaufmann, 1990. Fahlman, S. E. (1991) The recurrent cascade-correlation learning architecture. In Lippman, R.P. Moody, J.E., and Touretzky, D.S., editors, NIPS 3, 190-205.
11
Citation ranking
12
Clustering: Navigation of Search Results
For grouping search results thematically – e.g., clusty.com / Vivisimo.
13
Clustering: Corpus Browsing
(Figure: a browsable topic hierarchy in the style of www.yahoo.com/Science, with top-level categories such as agriculture, biology, physics, CS, and space, and subcategories such as dairy, crops, agronomy, forestry; AI, HCI, craft, missions; botany, evolution, cell; magnetism, relativity; courses.)
14
Clustering Considerations
What does it mean for objects to be similar?
What algorithm and approach do we take?
– Top-down: k-means
– Bottom-up: hierarchical agglomerative clustering
Do we need a hierarchical arrangement of clusters?
How many clusters?
Can we label or name the clusters?
How do we make it efficient and scalable?
15
Hierarchical Clustering: Dendrogram
Decompose the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
16
Hierarchical Clustering
Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents.
(Figure: an example taxonomy – animal → vertebrate (fish, reptile, amphibian, mammal) and invertebrate (worm, insect, crustacean).)
17
Hierarchical Clustering Algorithms
Agglomerative (bottom-up):
– Start with each instance as a single cluster.
– Eventually all instances belong to the same cluster.
Divisive (top-down):
– Start with all instances in the same cluster.
– Eventually each instance forms a cluster on its own.
These algorithms do not require the number of clusters k in advance, but they need a termination/readout condition.
18
Hierarchical Agglomerative Clustering (HAC)
Assumes a similarity function for determining the similarity of two instances.
Starts with each instance in a separate cluster and then repeatedly joins the two clusters that are most similar, until there is only one cluster.
The history of merging forms a binary tree, or hierarchy.
19
Dendrogram: Hierarchical Clustering
A clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.
20
Hierarchical Agglomerative Clustering (HAC)
Starts with each document in a separate cluster, then repeatedly joins the closest pair of clusters, until there is only one cluster. The history of merging forms a binary tree or hierarchy.
How do we measure the distance between clusters?
21
Closest Pair of Clusters
There are many variants for defining the closest pair of clusters (a runnable sketch follows below):
– Single-link: distance of the “closest” points.
– Complete-link: distance of the “furthest” points.
– Centroid: distance of the centroids (centers of gravity).
– Average-link: average distance between pairs of elements.
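The linkage variants above can be tried directly with SciPy's hierarchical-clustering routines. The following is a minimal sketch, not part of the original slides; the random 2-D data, the choice of two clusters, and the variable names are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative 2-D data: two loose groups of points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)),
               rng.normal(5, 1, (20, 2))])

# Build dendrograms with different definitions of the "closest pair".
for method in ("single", "complete", "centroid", "average"):
    Z = linkage(X, method=method)                     # (n-1) x 4 merge history
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram into 2 clusters
    print(method, np.bincount(labels)[1:])            # cluster sizes
```

Cutting the dendrogram with `fcluster` corresponds to the earlier slide: choosing a level of the tree determines how many connected components (clusters) remain.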
22
Single-Link Agglomerative Clustering
Uses the maximum similarity of pairs:
sim(c_i, c_j) = max_{x ∈ c_i, y ∈ c_j} sim(x, y)
Can result in “straggly” (long and thin) clusters due to a chaining effect.
After merging c_i and c_j, the similarity of the resulting cluster to another cluster c_k is:
sim(c_i ∪ c_j, c_k) = max(sim(c_i, c_k), sim(c_j, c_k))
23
Single Link Example
24
Complete-Link Agglomerative Clustering
Uses the minimum similarity of pairs:
sim(c_i, c_j) = min_{x ∈ c_i, y ∈ c_j} sim(x, y)
Makes “tighter,” more spherical clusters that are typically preferable.
After merging c_i and c_j, the similarity of the resulting cluster to another cluster c_k is:
sim(c_i ∪ c_j, c_k) = min(sim(c_i, c_k), sim(c_j, c_k))
25
Complete Link Example
26
Key Notion: Cluster Representative
We want a notion of a representative point in a cluster.
The representative should be some sort of “typical” or central point in the cluster, e.g.:
– the point inducing the smallest radius to the docs in the cluster,
– the point with the smallest squared distances, etc.,
– the point that is the “average” of all docs in the cluster: the centroid or center of gravity.
27
Centroid-Based Similarity
Always maintain the average of the vectors in each cluster:
μ(c) = (1/|c|) Σ_{x ∈ c} x
Compute the similarity of two clusters from their centroids:
sim(c_i, c_j) = sim(μ(c_i), μ(c_j))
For non-vector data, we can't always compute a centroid.
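As a minimal sketch of this idea (not from the slides), centroids can be computed as the mean of each cluster's vectors and compared with, for example, cosine similarity; the example vectors and the choice of cosine similarity are illustrative assumptions.

```python
import numpy as np

def centroid(cluster: np.ndarray) -> np.ndarray:
    """Mean vector of the points (rows) in a cluster."""
    return cluster.mean(axis=0)

def cosine_sim(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Two illustrative clusters of 3-D vectors.
c_i = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]])
c_j = np.array([[0.0, 1.0, 0.0], [0.1, 0.9, 0.1]])

# Cluster-to-cluster similarity via the centroids.
print(cosine_sim(centroid(c_i), centroid(c_j)))
```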
28
Partitioning Algorithms
Partitioning method: construct a partition of n documents into a set of K clusters.
Given: a set of documents and the number K.
Find: a partition into K clusters that optimizes the chosen partitioning criterion.
– Globally optimal: exhaustively enumerate all partitions (infeasible in practice).
– Effective heuristic method: the K-means algorithm.
29
K-Means
Assumes instances are real-valued vectors.
Clusters are based on the centroids (aka the center of gravity or mean) of the points in a cluster c:
μ(c) = (1/|c|) Σ_{x ∈ c} x
Reassignment of instances to clusters is based on the distance to the current cluster centroids.
30
K-Means Algorithm
Select K random seeds s_1, ..., s_K.
Until the clustering converges (or another stopping criterion is met):
– For each doc d_i: assign d_i to the cluster c_j such that dist(d_i, s_j) is minimal.
– Then update the seeds to the centroid of each cluster: for each cluster c_j, s_j = μ(c_j).
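A minimal NumPy sketch of this loop (not from the slides): the synthetic data, K = 2, and the convergence test on centroid movement are illustrative assumptions, and empty clusters are not handled.

```python
import numpy as np

def kmeans(X: np.ndarray, k: int, n_iter: int = 100, seed: int = 0):
    rng = np.random.default_rng(seed)
    # Pick k random data points as the initial seeds (centroids).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (empty clusters are not handled in this minimal sketch).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Illustrative data: two Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)
print(np.bincount(labels), centroids.round(2))
```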
31
K-Means Example (K = 2)
(Figure: pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged!)
32
Termination Conditions
Several possibilities, e.g.:
– A fixed number of iterations.
– The partition is unchanged.
– The centroid positions don't change.
Does this mean that the instances in a cluster are unchanged?
33
Convergence
Why should the K-means algorithm ever reach a fixed point – a state in which the clusters don't change?
K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm.
– EM is known to converge.
– Theoretically, the number of iterations could be large.
– In practice it typically converges quickly.
34
K-means Clustering: Step 1 – Decide on a value for k.
(Figure: 2-D scatter plot of the objects to be clustered.)
35
K-means Clustering: Step 2 – Initialize the k cluster centers (k1, k2, k3 in the figure).
36
K-means Clustering: Step 3 – Decide the class memberships of the N objects by assigning them to the nearest cluster center.
37
K-means Clustering: Step 4 – Re-estimate the k cluster centers, assuming the memberships found above are correct.
38
K-means Clustering: Step 5 – If none of the N objects changed membership in the last iteration, exit. Otherwise go to Step 3.
39
How Many Clusters?
The number of clusters K is given:
– Partition n docs into a predetermined number of clusters.
Finding the “right” number of clusters is part of the problem:
– Given the data, partition it into an “appropriate” number of subsets.
– E.g., for query results, the ideal value of K is not known up front, though the UI may impose limits.
We can usually take an algorithm of one flavor and convert it to the other.
40
K Not Specified in Advance
Say we are clustering the results of a query.
Solve an optimization problem: penalize having lots of clusters.
– This is application dependent, e.g., a compressed summary of a search-results list.
There is a tradeoff between having more clusters (better focus within each cluster) and having too many clusters.
41
K Not Specified in Advance
Given a clustering, define the Benefit of a doc to be some inverse of its distance to its centroid.
Define the Total Benefit to be the sum of the individual doc Benefits.
42
Penalize Lots of Clusters
For each cluster we have a cost C, so for a clustering with K clusters the Total Cost is KC.
Define the Value of a clustering as Total Benefit − Total Cost.
Find the clustering of highest Value over all choices of K.
– The Total Benefit increases with increasing K, but we can stop when it no longer increases by “much”; the Cost term enforces this.
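A minimal sketch of this model-selection idea (not from the slides), using scikit-learn's KMeans: the Benefit of a doc is taken here as the negative distance to its centroid, and the per-cluster cost value and synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative data: three Gaussian blobs.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal((0, 0), 1, (40, 2)),
               rng.normal((6, 0), 1, (40, 2)),
               rng.normal((0, 8), 1, (40, 2))])

def value_of_clustering(X, k, cost_per_cluster=20.0):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Benefit of a doc: negative distance to its centroid; Total Benefit: sum over docs.
    total_benefit = -np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1).sum()
    return total_benefit - cost_per_cluster * k   # Value = Total Benefit - Total Cost

best_k = max(range(1, 11), key=lambda k: value_of_clustering(X, k))
print(best_k)
```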
43
Density-Based Clustering
Why density-based clustering methods? They can discover clusters of arbitrary shape.
Clusters are dense regions of objects separated by regions of low density.
– DBSCAN: the first density-based clustering algorithm.
– OPTICS: density-based cluster ordering.
– DENCLUE: a general density-based description of clusters and clustering.
44
Density-Based Clustering
Why density-based clustering? (Figure: results of a k-medoid algorithm for k = 4.)
Basic idea: clusters are dense regions in the data space, separated by regions of lower object density.
Different density-based approaches exist; here we discuss the ideas underlying the DBSCAN algorithm.
45
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
Proposed by Ester, Kriegel, Sander, and Xu (KDD 1996).
Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points.
Discovers clusters of arbitrary shape in spatial databases with noise.
46
DBSCAN
Density-based clustering locates regions of high density that are separated from one another by regions of low density.
– Density = number of points within a specified radius (Eps).
DBSCAN is a density-based algorithm:
– A point is a core point if it has more than a specified number of points (MinPts) within Eps. These are points in the interior of a cluster.
– A border point has fewer than MinPts points within Eps, but is in the neighborhood of a core point.
47
DBSCAN (continued)
– A noise point is any point that is neither a core point nor a border point.
– Any two core points that are close enough – within a distance Eps of one another – are put in the same cluster.
– Any border point that is close enough to a core point is put in the same cluster as that core point.
– Noise points are discarded.
A sketch of this core/border/noise classification follows below.
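A minimal NumPy sketch of the core/border/noise classification (not from the slides): Euclidean distance, the example Eps/MinPts values, and the convention that a point's Eps-neighborhood includes the point itself are illustrative assumptions.

```python
import numpy as np

def classify_points(X: np.ndarray, eps: float, min_pts: int):
    """Label each point as 'core', 'border', or 'noise'."""
    # Pairwise Euclidean distances; a point counts as its own neighbor,
    # so a core point needs at least MinPts neighbors including itself.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = dists <= eps
    n_neighbors = neighbors.sum(axis=1)

    core = n_neighbors >= min_pts
    # Border: not core, but within Eps of at least one core point.
    border = ~core & (neighbors & core[None, :]).any(axis=1)
    return np.where(core, "core", np.where(border, "border", "noise"))

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), [[5.0, 5.0]]])  # dense blob + one outlier
print(classify_points(X, eps=0.5, min_pts=5))
```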
48
Border & Core
(Figure: core, border, and outlier points illustrated with a radius (Eps) of 1 unit and MinPts = 5.)
49
Concepts: ε-Neighborhood
ε-Neighborhood: the objects within a radius of ε from an object (the epsilon-neighborhood).
Core object: an object whose ε-neighborhood contains at least MinPts objects.
(Figure: with MinPts = 4, p is a core object because its ε-neighborhood contains at least 4 objects; q is not a core object.)
50
Concepts: Reachability
Directly density-reachable: an object q is directly density-reachable from an object p if q is within the ε-neighborhood of p and p is a core object.
(Figure: q is directly density-reachable from p; p is not directly density-reachable from q, because q is not a core object.)
51
Concepts: Reachability
Density-reachable: an object p is density-reachable from q w.r.t. ε and MinPts if there is a chain of objects p_1, ..., p_n, with p_1 = q and p_n = p, such that p_{i+1} is directly density-reachable from p_i w.r.t. ε and MinPts for all 1 ≤ i < n.
Density-reachability is the transitive closure of direct density-reachability, and it is asymmetric.
(Figure: q is density-reachable from p, but p is not density-reachable from q.)
52
Concepts: Connectivity
Density-connectivity: an object p is density-connected to an object q w.r.t. ε and MinPts if there is an object o such that both p and q are density-reachable from o w.r.t. ε and MinPts.
Density-connectivity is symmetric.
(Figure: p and q are density-connected to each other through r.)
53
Concepts: Cluster & Noise
Cluster: a cluster C in a set of objects D w.r.t. ε and MinPts is a non-empty subset of D satisfying:
– Maximality: for all p, q, if p ∈ C and q is density-reachable from p w.r.t. ε and MinPts, then also q ∈ C.
– Connectivity: for all p, q ∈ C, p is density-connected to q w.r.t. ε and MinPts in D.
– Note: a cluster contains core objects as well as border objects.
Noise: objects that are not directly density-reachable from at least one core object.
54
(Figure: (indirectly) density-reachable – p is reachable from q via the intermediate point p1; density-connected – p and q are both density-reachable from o.)
55
DBSCAN: The Algorithm
– Select a point p.
– Retrieve all points density-reachable from p w.r.t. Eps and MinPts.
– If p is a core point, a cluster is formed.
– If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
– Continue the process until all of the points have been processed.
The result is independent of the order in which the points are processed. A minimal sketch of the algorithm follows below.
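The following is a compact sketch of this procedure in plain NumPy (not from the slides): it grows clusters from unassigned core points via direct density-reachability and leaves everything else as noise. The function and variable names are illustrative, and no spatial index is used, so it runs in O(n²).

```python
import numpy as np
from collections import deque

NOISE = -1

def dbscan(X: np.ndarray, eps: float, min_pts: int) -> np.ndarray:
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighborhoods = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    labels = np.full(n, NOISE)              # everything starts as noise / unassigned
    cluster_id = 0
    for p in range(n):
        if labels[p] != NOISE or len(neighborhoods[p]) < min_pts:
            continue                        # already assigned, or p is not a core point
        # p is an unassigned core point: grow a new cluster from it.
        labels[p] = cluster_id
        queue = deque(neighborhoods[p])
        while queue:
            q = queue.popleft()
            if labels[q] == NOISE:
                labels[q] = cluster_id      # border or core point joins the cluster
                if len(neighborhoods[q]) >= min_pts:
                    queue.extend(neighborhoods[q])   # core point: keep expanding
        cluster_id += 1
    return labels

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.3, (40, 2)),
               rng.normal(3, 0.3, (40, 2)),
               [[10.0, 10.0]]])             # two dense blobs plus one noise point
print(np.unique(dbscan(X, eps=0.5, min_pts=5), return_counts=True))
```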
56
An Example
(Figure: with MinPts = 4, the density-connected points form a single cluster C1.)
57
DBSCAN: Determining Eps and MinPts
The idea is that for points in a cluster, their k-th nearest neighbors are at roughly the same distance.
Noise points have their k-th nearest neighbor at a farther distance.
So, plot the sorted distance of every point to its k-th nearest neighbor.
58
DBSCAN: Determining Eps and MinPts
The distance from a point to its k-th nearest neighbor is called its k-dist.
For points that belong to some cluster, the value of k-dist will be small as long as k is not larger than the cluster size.
For points that are not in any cluster, such as noise points, the k-dist will be relatively large.
Compute k-dist for all points for some k, sort the values in increasing order, and plot them (a sketch follows below).
A sharp change in the sorted k-dist curve corresponds to a suitable value of Eps, with the chosen k used as MinPts.
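A minimal sketch of the sorted k-dist plot (not from the slides), using scikit-learn's nearest-neighbor search and matplotlib; k = 4 and the synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.3, (100, 2)),
               rng.normal(4, 0.3, (100, 2)),
               rng.uniform(-2, 6, (20, 2))])   # two clusters plus scattered noise

k = 4
# Ask for k+1 neighbors because each point is its own nearest neighbor at distance 0.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dists, _ = nn.kneighbors(X)
k_dist = np.sort(dists[:, k])                  # distance to the k-th nearest neighbor

plt.plot(k_dist)
plt.xlabel("points sorted by k-dist")
plt.ylabel(f"{k}-dist")
plt.show()   # look for the sharp "knee" to choose Eps; use k as MinPts
```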
59
DBSCAN: Determining Eps and MinPts
A sharp change in the sorted k-dist curve corresponds to a suitable value of Eps, with k used as MinPts:
– Points whose k-dist is less than Eps will be labeled as core points, while the other points will be labeled as noise or border points.
If k is too large, small clusters (of size less than k) are likely to be labeled as noise.
If k is too small, even a small number of closely spaced points that are noise or outliers will be incorrectly labeled as clusters.
60
Clusters Identified by the DBSCAN Algorithm
A density-based cluster is a set of density-connected objects that is maximal with respect to density-reachability. An object not contained in any cluster is considered to be noise.
61
In-Class Exercise 1
Data: Iris.arff and your own data (if applicable)
Method: hierarchical algorithms; parameter: number of clusters = 3
Software: Weka 3.7.3
Steps: Explorer -> Cluster -> Clusterer (HierarchicalClusterer)
62
In-Class Exercise 2
Data: Iris.arff and your own data (if applicable)
Method: K-means; parameter: number of clusters = 3
Software: Weka 3.7.3
Steps: Explorer -> Cluster -> Clusterer (SimpleKMeans)
63
In-Class Exercise 3
Data: Iris.arff and your own data (if applicable)
Method: DBSCAN; parameters: epsilon and minPoints (DBSCAN does not take a number of clusters directly)
Software: Weka 3.7.3
Steps: Explorer -> Cluster -> Clusterer (DBScan)
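For readers without Weka, here is a sketch of the three exercises in scikit-learn (not part of the slides): it uses the Iris data bundled with scikit-learn rather than Iris.arff, and the DBSCAN parameter values shown are illustrative rather than tuned.

```python
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering, KMeans, DBSCAN

X = load_iris().data   # 150 x 4 numeric matrix, the same data as Iris.arff

# Exercise 1: hierarchical clustering with 3 clusters.
hier = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Exercise 2: K-means with 3 clusters.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Exercise 3: DBSCAN; controlled by eps/min_samples rather than a cluster count.
db = DBSCAN(eps=0.6, min_samples=5).fit_predict(X)   # label -1 marks noise points

for name, labels in [("hierarchical", hier), ("k-means", km), ("dbscan", db)]:
    print(name, sorted(set(labels)))
```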