
Chapter 5: Clustering

Searching for groups
Clustering is unsupervised or undirected: unlike classification, there is no pre-classified data. We search for groups, or clusters, of data points (records) that are similar to one another. Similar points may mean similar customers or products that will behave in similar ways.

Group similar points together
Group points into classes using some distance measure. Two notions matter: within-cluster distance and between-cluster distance.
Applications: as a stand-alone tool to get insight into the data distribution, and as a preprocessing step for other algorithms.

An Illustration

Examples of Clustering Applications
Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.
Insurance: identify groups of motor insurance policy holders with some interesting characteristics.
City planning: identify groups of houses according to their house type, value, and geographical location.

Concepts of Clustering
Clusters can be represented in different ways: divisions with boundaries, spheres, probabilistic assignments (membership probabilities for items I1, I2, ..., In), dendrograms, and so on.

Clustering quality
Inter-cluster distance should be maximized; intra-cluster distance should be minimized.
The quality of a clustering result depends on both the similarity measure used by the method and its application. The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
Clustering vs. classification: which one is more difficult? Why?
There are a huge number of clustering techniques.
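To make the two quality measures concrete, here is a minimal Python sketch (not from the slides; the Euclidean metric and the helper names are illustrative choices) that computes average intra-cluster and inter-cluster distances for a toy clustering:

```python
from itertools import combinations
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def avg_intra_distance(clusters):
    """Average pairwise distance between points in the same cluster."""
    dists = [euclidean(p, q)
             for cluster in clusters
             for p, q in combinations(cluster, 2)]
    return sum(dists) / len(dists)

def avg_inter_distance(clusters):
    """Average pairwise distance between points in different clusters."""
    dists = [euclidean(p, q)
             for c1, c2 in combinations(clusters, 2)
             for p in c1 for q in c2]
    return sum(dists) / len(dists)

clusters = [[(1, 1), (2, 1)], [(8, 8), (9, 9)]]
print(avg_intra_distance(clusters))  # small: points share a cluster
print(avg_inter_distance(clusters))  # large: points sit in different clusters
```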

Dissimilarity/Distance Measure
Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, which is typically a metric: d(i, j).
The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, and ratio variables.
Weights should be associated with different variables based on applications and data semantics.
It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.

Types of data in clustering analysis
Interval-scaled variables; binary variables; nominal, ordinal, and ratio variables; variables of mixed types.

Interval-valued variables
Continuous measurements on a roughly linear scale, e.g., weight, height, temperature, etc.
Standardize the data (depending on the application):
Calculate the mean absolute deviation s_f = (1/n)(|x_{1f} - m_f| + |x_{2f} - m_f| + ... + |x_{nf} - m_f|), where m_f = (1/n)(x_{1f} + x_{2f} + ... + x_{nf}).
Calculate the standardized measurement (z-score): z_{if} = (x_{if} - m_f) / s_f.
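A minimal sketch of this standardization for a single variable (the function name is illustrative):

```python
def standardize(values):
    n = len(values)
    m = sum(values) / n                      # mean m_f
    s = sum(abs(x - m) for x in values) / n  # mean absolute deviation s_f
    return [(x - m) / s for x in values]     # z-scores z_if

heights = [160.0, 170.0, 180.0, 190.0]
print(standardize(heights))  # [-1.5, -0.5, 0.5, 1.5]
```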

Similarity Between Objects
Distances are normally used to measure the similarity or dissimilarity between two data objects. A popular one is the Minkowski distance:
d(i, j) = (|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + ... + |x_{ip} - x_{jp}|^q)^{1/q}
where (x_{i1}, x_{i2}, ..., x_{ip}) and (x_{j1}, x_{j2}, ..., x_{jp}) are two p-dimensional data objects and q is a positive integer.
If q = 1, d is the Manhattan distance: d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + ... + |x_{ip} - x_{jp}|.

Similarity Between Objects (Cont.)
If q = 2, d is the Euclidean distance: d(i, j) = sqrt(|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + ... + |x_{ip} - x_{jp}|^2).
Properties: d(i, j) >= 0; d(i, i) = 0; d(i, j) = d(j, i); d(i, j) <= d(i, k) + d(k, j).
One can also use weighted distance and many other similarity/distance measures.
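A minimal sketch of the Minkowski distance just defined; setting q = 1 and q = 2 recovers the Manhattan and Euclidean cases:

```python
def minkowski(xi, xj, q=2):
    # d(i, j) = (sum of |x_if - x_jf|^q over all p dimensions)^(1/q)
    return sum(abs(a - b) ** q for a, b in zip(xi, xj)) ** (1.0 / q)

p1, p2 = (1.0, 2.0), (4.0, 6.0)
print(minkowski(p1, p2, q=1))  # 7.0  (Manhattan)
print(minkowski(p1, p2, q=2))  # 5.0  (Euclidean)
```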

Binary Variables
A contingency table for binary data: for objects i and j, let a = the number of variables equal to 1 for both objects, b = the number equal to 1 for i but 0 for j, c = the number equal to 0 for i but 1 for j, and d = the number equal to 0 for both.
Simple matching coefficient (invariant, if the binary variable is symmetric): d(i, j) = (b + c) / (a + b + c + d).
Jaccard coefficient (noninvariant if the binary variable is asymmetric): d(i, j) = (b + c) / (a + b + c).

Dissimilarity of Binary Variables
Example (patient records):
Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N
Gender is a symmetric attribute (not used below); the remaining attributes are asymmetric. Let the values Y and P be set to 1, and the value N be set to 0. Using the Jaccard coefficient:
d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(Jack, Jim) = (1 + 1) / (1 + 1 + 1) = 0.67
d(Jim, Mary) = (1 + 2) / (1 + 1 + 2) = 0.75
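A minimal sketch that reproduces the Jaccard dissimilarities in this example (the Y/P -> 1, N -> 0 encoding follows the slide; variable names are illustrative):

```python
def jaccard_dissim(x, y):
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    return (b + c) / (a + b + c)   # 0-0 matches are ignored (asymmetric)

# fever, cough, test-1 ... test-4, encoded with Y/P -> 1 and N -> 0
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(jaccard_dissim(jack, mary), 2))  # 0.33
print(round(jaccard_dissim(jack, jim), 2))   # 0.67
print(round(jaccard_dissim(jim, mary), 2))   # 0.75
```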

Nominal Variables
A generalization of the binary variable in that it can take more than two states, e.g., red, yellow, blue, green, etc.
Method 1: simple matching. d(i, j) = (p - m) / p, where m is the number of matches and p is the total number of variables.
Method 2: use a large number of binary variables, creating a new binary variable for each of the M nominal states.

Ordinal Variables
An ordinal variable can be discrete or continuous; order is important, e.g., rank.
It can be treated like an interval-scaled variable (f is a variable):
replace x_{if} by its rank r_{if} in {1, ..., M_f};
map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by z_{if} = (r_{if} - 1) / (M_f - 1);
compute the dissimilarity using methods for interval-scaled variables.
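A minimal sketch of this rank mapping (the state list and names are illustrative):

```python
def ordinal_to_interval(values, ordered_states):
    # replace each value by its rank r_if in {1, ..., M_f},
    # then rescale to [0, 1] via (r_if - 1) / (M_f - 1)
    m = len(ordered_states)
    ranks = {state: i + 1 for i, state in enumerate(ordered_states)}
    return [(ranks[v] - 1) / (m - 1) for v in values]

grades = ["bronze", "gold", "silver", "bronze"]
print(ordinal_to_interval(grades, ["bronze", "silver", "gold"]))
# [0.0, 1.0, 0.5, 0.0]
```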

Ratio-Scaled Variables
Ratio-scaled variable: a measurement on a nonlinear scale, approximately at exponential scale, such as Ae^{Bt} or Ae^{-Bt}, e.g., the growth of a bacteria population.
Methods:
treat them like interval-scaled variables (not a good idea! why? the scale can be distorted);
apply a logarithmic transformation, y_{if} = log(x_{if});
treat them as continuous ordinal data and then treat their ranks as interval-scaled.

Variables of Mixed Types
A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio.
One may use a weighted formula to combine their effects:
d(i, j) = (Σ_{f=1}^{p} δ_{ij}^{(f)} d_{ij}^{(f)}) / (Σ_{f=1}^{p} δ_{ij}^{(f)})
If f is binary or nominal: d_{ij}^{(f)} = 0 if x_{if} = x_{jf}, and d_{ij}^{(f)} = 1 otherwise.
If f is interval-based: use the normalized distance.
If f is ordinal or ratio-scaled: compute the ranks r_{if}, set z_{if} = (r_{if} - 1) / (M_f - 1), and treat z_{if} as interval-scaled.
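A minimal sketch of the weighted mixed-type formula, assuming no missing values so every δ weight is 1 (variable types and names are illustrative; ordinal and ratio variables are presumed already mapped to the interval scale as described above):

```python
def mixed_dissim(x, y, types, ranges):
    total, weight = 0.0, 0.0
    for f, (xf, yf) in enumerate(zip(x, y)):
        if types[f] in ("binary", "nominal"):
            d = 0.0 if xf == yf else 1.0       # exact-match contribution
        else:  # interval-scaled (ordinal/ratio already mapped)
            d = abs(xf - yf) / ranges[f]       # normalized distance
        total += d
        weight += 1.0                          # delta = 1: value present
    return total / weight

# one nominal variable and one interval variable with range 100
print(mixed_dissim(["red", 20.0], ["blue", 70.0],
                   types=["nominal", "interval"], ranges=[None, 100.0]))
# (1 + 0.5) / 2 = 0.75
```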

Major Clustering Techniques
Partitioning algorithms: construct various partitions and then evaluate them by some criterion.
Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion.
Density-based: based on connectivity and density functions.
Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the model to the data.

Partitioning Algorithms: Basic Concept
Partitioning method: construct a partition of a database D of n objects into a set of k clusters.
Given k, find a partition of k clusters that optimizes the chosen partitioning criterion.
Global optimum: exhaustively enumerate all partitions.
Heuristic methods: the k-means and k-medoids algorithms.
k-means: each cluster is represented by the center of the cluster.
k-medoids or PAM (Partitioning Around Medoids): each cluster is represented by one of the objects in the cluster.

The K-Means Clustering
Given k, the k-means algorithm is as follows:
1) Choose k cluster centers to coincide with k randomly chosen points.
2) Assign each data point to the closest cluster center.
3) Recompute the cluster centers using the current cluster memberships.
4) If a convergence criterion is not met, go to 2). Typical convergence criteria are: no (or minimal) reassignment of data points to new cluster centers, or a minimal decrease in the squared error E = Σ_{i=1}^{k} Σ_{p ∈ C_i} |p - m_i|^2, where p is a point and m_i is the mean of cluster C_i.

Example
For simplicity, one-dimensional data and k = 2. Data: 1, 2, 5, 6, 7.
K-means: randomly select 5 and 6 as initial centroids.
=> Two clusters {1, 2, 5} and {6, 7}; mean_C1 = 8/3, mean_C2 = 6.5.
=> {1, 2} and {5, 6, 7}; mean_C1 = 1.5, mean_C2 = 6 => no change.
Aggregate dissimilarity (squared error) = 0.5^2 + 0.5^2 + 1^2 + 0^2 + 1^2 = 2.5.
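A minimal one-dimensional k-means sketch (not from the slides) that reproduces this example, including the aggregate squared error of 2.5:

```python
def kmeans_1d(data, centroids, max_iter=100):
    for _ in range(max_iter):
        # step 2: assign each point to the closest centroid
        clusters = [[] for _ in centroids]
        for x in data:
            i = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[i].append(x)
        # step 3: recompute centroids from the current memberships
        new_centroids = [sum(c) / len(c) for c in clusters]
        if new_centroids == centroids:       # step 4: convergence check
            break
        centroids = new_centroids
    sse = sum((x - centroids[i]) ** 2
              for i, c in enumerate(clusters) for x in c)
    return clusters, centroids, sse

clusters, centroids, sse = kmeans_1d([1, 2, 5, 6, 7], [5.0, 6.0])
print(clusters, centroids, sse)  # [[1, 2], [5, 6, 7]] [1.5, 6.0] 2.5
```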

Comments on K-Means
Strength: efficient: O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations. Normally k, t << n.
Comment: often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
Weaknesses:
applicable only when the mean is defined; difficult for categorical data;
need to specify k, the number of clusters, in advance;
sensitive to noisy data and outliers;
not suitable for discovering clusters with non-convex shapes;
sensitive to the initial seeds.

Variations of the K-Means Method
A few variants of k-means differ in:
the selection of the initial k seeds;
the dissimilarity measures;
the strategies used to calculate cluster means.
Handling categorical data: k-modes:
replace the means of clusters with modes;
use new dissimilarity measures to deal with categorical objects;
use a frequency-based method to update the modes of clusters.

The k-Medoids Clustering Method
The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
Medoid: the most centrally located point in a cluster, used as a representative point of the cluster. In contrast, a centroid is not necessarily inside a cluster.

Partitioning Around Medoids
PAM:
1. Given k.
2. Randomly pick k instances as initial medoids.
3. Assign each data point to the nearest medoid x.
4. Calculate the objective function: the sum of dissimilarities of all points to their nearest medoids (squared-error criterion).
5. Randomly select a point y.
6. Swap x with y if the swap reduces the objective function.
7. Repeat (3-6) until no change.
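A minimal PAM sketch following steps 1-7 (one-dimensional data and absolute-difference dissimilarity for brevity; it exhaustively tries swaps rather than sampling a random y, a common simplification):

```python
import random

def cost(data, medoids):
    # objective: sum of dissimilarities of all points to their nearest medoid
    return sum(min(abs(x - m) for m in medoids) for x in data)

def pam(data, k):
    medoids = random.sample(data, k)          # step 2: k random initial medoids
    while True:
        improved = False
        for x in medoids[:]:                  # candidate medoid to replace ...
            for y in data:                    # ... with a non-medoid point y
                if y in medoids:
                    continue
                swapped = [y if m == x else m for m in medoids]
                if cost(data, swapped) < cost(data, medoids):
                    medoids, improved = swapped, True   # step 6: keep the swap
        if not improved:                      # step 7: stop when no change
            return sorted(medoids)

data = [1, 2, 5, 6, 7]
m = pam(data, k=2)
print(m, cost(data, m))   # e.g. [2, 6] with objective value 3
```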

Comments on PAM
PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean (why?).
PAM works well for small data sets but does not scale well to large data sets: O(k(n - k)^2) per change, where n is the number of data points and k is the number of clusters.

CLARA: Clustering Large Applications
CLARA is built into statistical analysis packages such as S+.
It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output.
Strength: deals with larger data sets than PAM.
Weaknesses: efficiency depends on the sample size; a good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased.
There are other scale-up methods, e.g., CLARANS.

Hierarchical Clustering
Uses the distance matrix for clustering. This method does not require the number of clusters k as an input, but it needs a termination condition.
(figure: points a, b, c, d, e are merged step by step, Step 0 through Step 4, in the agglomerative direction; the divisive direction runs the same steps in reverse, Step 4 back to Step 0)

Agglomerative Clustering
At the beginning, each data point forms a cluster (also called a node). Merge the nodes/clusters that have the least dissimilarity, and go on merging. Eventually, all nodes belong to the same cluster.

A Dendrogram Shows How the Clusters are Merged Hierarchically
Decompose the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
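A minimal sketch of agglomerative clustering with a dendrogram cut, using SciPy (an assumption: the slides name no particular library):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0], [2.0], [5.0], [6.0], [7.0]])
Z = linkage(X, method="single")          # bottom-up, nearest-neighbor merging
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)                            # e.g. [1 1 2 2 2]
```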

Divisive Clustering
The inverse order of agglomerative clustering: eventually each node forms a cluster on its own.

More on Hierarchical Methods
Major weaknesses of agglomerative clustering methods:
they do not scale well: the time complexity is at least O(n^2), where n is the total number of objects;
they can never undo what was done previously.
Integration of hierarchical and distance-based clustering to scale up these clustering methods:
BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters;
CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction.

Summary
Cluster analysis groups objects based on their similarity and has wide applications.
Measures of similarity can be computed for various types of data.
Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, etc.
Clustering can also be used for outlier detection, which is useful for fraud detection.
What is the best clustering algorithm?

Other Data Mining Methods

Sequence Analysis
Market basket analysis analyzes things that happen at the same time. What about things that happen over time? E.g., if a customer buys a bed, he/she is likely to come back later to buy a mattress.
Sequential analysis needs:
a time stamp for each data record;
a customer identification.

Sequence Analysis (cont...)
The analysis shows which items come before, after, or at the same time as other items. Sequential patterns can be used for analyzing cause and effect.
Other applications:
finding cycles in association rules: some association rules hold strongly in certain periods of time, e.g., every Monday people buy items X and Y together;
stock market prediction;
predicting possible failures in networks, etc.

Discovering Holes in Data
Holes are empty (sparse) regions in the data space that contain few or no data points. Holes may represent impossible value combinations in the application domain.
E.g., in a disease database, we may find that certain test values and/or symptoms do not go together, or that when a certain medicine is used, some test value never goes beyond a certain range.
Such information could lead to significant discoveries: a cure for a disease or some biological law.

Data and Pattern Visualization
Data visualization: use computer graphics effects to reveal the patterns in data: 2-D and 3-D scatter plots, bar charts, pie charts, line plots, animation, etc.
Pattern visualization: use a good interface and graphics to present the results of data mining: rule visualizers, cluster visualizers, etc.

Scaling Up Data Mining Algorithms
Adapt data mining algorithms to work on very large databases, where the data resides on hard disk (too large to fit in main memory):
make fewer passes over the data;
quadratic algorithms are too expensive: many data mining algorithms are quadratic, especially clustering algorithms.