October 27, 2015Data Mining: Concepts and Techniques1 Data Mining: Concepts and Techniques — Slides for Textbook — — Chapter 7 — ©Jiawei Han and Micheline.

Slides:



Advertisements
Similar presentations
Copyright Jiawei Han, modified by Charles Ling for CS411a
Advertisements

Clustering Clustering of data is a method by which large sets of data is grouped into clusters of smaller sets of similar data. The example below demonstrates.
What is Cluster Analysis?
CS690L: Clustering References:
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 12 —
Cluster Analysis.
Clustering.
Cluster Analysis.
What is Cluster Analysis
Segmentação (Clustering) (baseado nos slides do Han)
1 Chapter 8: Clustering. 2 Searching for groups Clustering is unsupervised or undirected. Unlike classification, in clustering, no pre- classified data.
Spatial and Temporal Data Mining
Cluster Analysis.
Cluster Analysis CS 536 – Data Mining These slides are adapted from J. Han and M. Kamber’s book slides (
What is Cluster Analysis?
Clustering Part2 BIRCH Density-based Clustering --- DBSCAN and DENCLUE
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Cluster Analysis Part I
CS685 : Special Topics in Data Mining, UKY The UNIVERSITY of KENTUCKY Clustering CS 685: Special Topics in Data Mining Spring 2008 Jinze Liu.
11/15/2012ISC471 / HCI571 Isabelle Bichindaritz 1 Clustering.
1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods.
1 Clustering Sunita Sarawagi
Han/Eick: Clustering II 1 Clustering Part2 continued 1. BIRCH skipped 2. Density-based Clustering --- DBSCAN and DENCLUE 3. GRID-based Approaches --- STING.
November 1, 2015Data Mining: Concepts and Techniques1 Data Mining: Concepts and Techniques Clustering.
Han/Eick: Clustering II 1 Clustering Part2 continued 1. BIRCH skipped 2. Density-based Clustering --- DBSCAN and DENCLUE 3. GRID-based Approaches --- STING.
CIS664-Knowledge Discovery and Data Mining Vasileios Megalooikonomou Dept. of Computer and Information Sciences Temple University Clustering I (based on.
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Clustering COMP Research Seminar BCB 713 Module Spring 2011 Wei Wang.
Chapter 2: Getting to Know Your Data
Data Mining and Warehousing: Chapter 8
Cluster Analysis Potyó László. Cluster: a collection of data objects Similar to one another within the same cluster Similar to one another within the.
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Clustering COMP Research Seminar BCB 713 Module Spring 2011 Wei Wang.
Presented by Ho Wai Shing
Clustering.
Han, Kamber, Eick: Object Similarity & Clustering for COSC Clustering and Similarity Assessment ©Jiawei Han and Micheline Kamber with major Additions.
Han: Clustering1 Clustering — Slides for Textbook — — Chapter 8 — ©Jiawei Han and Micheline Kamber Intelligent Database Systems Research Lab School of.
Data Mining Algorithms
CS685 : Special Topics in Data Mining, UKY The UNIVERSITY of KENTUCKY Clustering Analysis CS 685: Special Topics in Data Mining Jinze Liu.
Han, Kamber, Eick: Object Similarity & Clustering for COSC March 4+9: Introduction to KDD March 11: Association Rule Mining March 23: Similarity.
Data Mining Lecture 6. Course Syllabus Case Study 1: Working and experiencing on the properties of The Retail Banking Data Mart (Week 4 – Assignment1)
Han, Kamber, Eick: Object Similarity & Clustering 1 Clustering and Object Similarity Evaluation ©Jiawei Han and Micheline Kamber with Additions and Modifications.
Cluster Analysis Dr. Bernard Chen Assistant Professor Department of Computer Science University of Central Arkansas.
Mr. Idrissa Y. H. Assistant Lecturer, Geography & Environment Department of Social Sciences School of Natural & Social Sciences State University of Zanzibar.
Clustering By : Babu Ram Dawadi. 2 Clustering cluster is a collection of data objects, in which the objects similar to one another within the same cluster.
Cluster Analysis Dr. Bernard Chen Ph.D. Assistant Professor Department of Computer Science University of Central Arkansas Fall 2010.
Clustering Wei Wang. Outline What is clustering Partitioning methods Hierarchical methods Density-based methods Grid-based methods Model-based clustering.
Data Mining Lecture 7. Course Syllabus Clustering Techniques (Week 6) –K-Means Clustering –Other Clustering Techniques.
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Clustering COMP Research Seminar GNET 713 BCB Module Spring 2007 Wei Wang.
Cluster Analysis What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods.
1 Cluster Analysis What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods Density-Based.
Topic 4: Cluster Analysis Analysis of Customer Behavior and Service Modeling.
GROUP 6 KIIZA FELIX 2013/BIT/110 MUHANGUZI EUSTUS 2013/BIT/104/PS TUGIROKWIKIRIZA FLAVIA 2013/BIT/111/PS HAMSTONE NATOSHA 2013/BIT/122/PS GILBERT MUMBERE.
Cluster Analysis This work is created by Dr. Anamika Bhargava, Ms. Pooja Kaul, Ms. Priti Bali and Ms. Rajnipriya Dhawan and licensed under a Creative Commons.
Data Mining Comp. Sc. and Inf. Mgmt. Asian Institute of Technology
What Is Cluster Analysis?
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 10 —
©Jiawei Han and Micheline Kamber
©Jiawei Han and Micheline Kamber Department of Computer Science
©Jiawei Han and Micheline Kamber
Data Mining: Concepts and Techniques Clustering
Topic 3: Cluster Analysis
©Jiawei Han and Micheline Kamber Department of Computer Science
Self organizing networks
Clustering and Object Similarity Evaluation
Data Mining 資料探勘 分群分析 (Cluster Analysis) Min-Yuh Day 戴敏育
©Jiawei Han and Micheline Kamber
CSCI N317 Computation for Scientific Applications Unit Weka
What Is Good Clustering?
Clustering Wei Wang.
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 10 —
Topic 5: Cluster Analysis
Presentation transcript:

October 27, 2015Data Mining: Concepts and Techniques1 Data Mining: Concepts and Techniques — Slides for Textbook — — Chapter 7 — ©Jiawei Han and Micheline Kamber Intelligent Database Systems Research Lab School of Computing Science Simon Fraser University, Canada

October 27, 2015Data Mining: Concepts and Techniques2 Chapter 7. Cluster Analysis What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods Hierarchical Methods Density-Based Methods Grid-Based Methods Model-Based Clustering Methods Outlier Analysis Summary

What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping a set of data objects into clusters Clustering is unsupervised classification: no predefined classes Typical applications As a stand-alone tool to get insight into data distribution As a preprocessing step for other algorithms

October 27, 2015Data Mining: Concepts and Techniques4 General Applications of Clustering Pattern Recognition Spatial Data Analysis create thematic maps in GIS by clustering feature spaces detect spatial clusters and explain them in spatial data mining Image Processing Economic Science (especially market research) WWW Document classification Cluster Weblog data to discover groups of similar access patterns

October 27, 2015Data Mining: Concepts and Techniques5 Examples of Clustering Applications Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs Land use: Identification of areas of similar land use in an earth observation database Insurance: Identifying groups of motor insurance policy holders with a high average claim cost City-planning: Identifying groups of houses according to their house type, value, and geographical location Earth-quake studies: Observed earth quake epicenters should be clustered along continent faults

October 27, 2015Data Mining: Concepts and Techniques6 What Is Good Clustering? A good clustering method will produce high quality clusters with high intra-class similarity low inter-class similarity The quality of a clustering result depends on both the similarity measure used by the method and its implementation. The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

October 27, 2015Data Mining: Concepts and Techniques7 Requirements of Clustering in Data Mining Scalability Ability to deal with different types of attributes Discovery of clusters with arbitrary shape Minimal requirements for domain knowledge to determine input parameters Able to deal with noise and outliers Insensitive to order of input records High dimensionality Incorporation of user-specified constraints Interpretability and usability

October 27, 2015Data Mining: Concepts and Techniques8 Chapter 8. Cluster Analysis What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods Hierarchical Methods Density-Based Methods Grid-Based Methods Model-Based Clustering Methods Outlier Analysis Summary

October 27, 2015Data Mining: Concepts and Techniques9 Data Structures Data matrix (two modes) Dissimilarity matrix (one mode)

October 27, 2015Data Mining: Concepts and Techniques10 Measure the Quality of Clustering Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function, which is typically metric: d(i, j) There is a separate “quality” function that measures the “goodness” of a cluster. The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables. Weights should be associated with different variables based on applications and data semantics. It is hard to define “similar enough” or “good enough” the answer is typically highly subjective.

October 27, 2015Data Mining: Concepts and Techniques11 Type of data in clustering analysis Interval-scaled variables: Binary variables: Nominal, ordinal, and ratio variables: Variables of mixed types:

October 27, 2015Data Mining: Concepts and Techniques12 Interval-valued variables Standardize data Calculate the mean absolute deviation: where Calculate the standardized measurement (z-score) Using mean absolute deviation is more robust than using standard deviation

October 27, 2015Data Mining: Concepts and Techniques13 Similarity and Dissimilarity Between Objects Distances are normally used to measure the similarity or dissimilarity between two data objects Some popular ones include: Minkowski distance: where i = (x i1, x i2, …, x ip ) and j = (x j1, x j2, …, x jp ) are two p-dimensional data objects, and q is a positive integer If q = 1, d is Manhattan distance

October 27, 2015Data Mining: Concepts and Techniques14 Similarity and Dissimilarity Between Objects (Cont.) If q = 2, d is Euclidean distance: Properties d(i,j)  0 d(i,i) = 0 d(i,j) = d(j,i) d(i,j)  d(i,k) + d(k,j) Also one can use weighted distance, parametric Pearson product moment correlation, or other disimilarity measures.

October 27, 2015Data Mining: Concepts and Techniques15 Binary Variables A contingency table for binary data Simple matching coefficient (invariant, if the binary variable is symmetric): Jaccard coefficient (noninvariant if the binary variable is asymmetric): Object i Object j

October 27, 2015Data Mining: Concepts and Techniques16 Dissimilarity between Binary Variables Example gender is a symmetric attribute the remaining attributes are asymmetric binary let the values Y and P be set to 1, and the value N be set to 0

October 27, 2015Data Mining: Concepts and Techniques17 References (1) R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD'98 M. R. Anderberg. Cluster Analysis for Applications. Academic Press, M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to identify the clustering structure, SIGMOD’99. P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scietific, 1996 M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96. M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. SSD'95. D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2: , D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. In Proc. VLDB’98. S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large databases. SIGMOD'98. A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Printice Hall, 1988.

October 27, 2015Data Mining: Concepts and Techniques18 References (2) L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons, E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB’98. G. J. McLachlan and K.E. Bkasford. Mixture Models: Inference and Applications to Clustering. John Wiley and Sons, P. Michaud. Clustering techniques. Future Generation Computer systems, 13, R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. VLDB'94. E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets. Proc Int. Conf. on Pattern Recognition, G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. VLDB’98. W. Wang, Yang, R. Muntz, STING: A Statistical Information grid Approach to Spatial Data Mining, VLDB’97. T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : an efficient data clustering method for very large databases. SIGMOD'96.

October 27, 2015Data Mining: Concepts and Techniques19 Thank you !!!