CSCI N317 Computation for Scientific Applications Unit Weka


CSCI N317 Computation for Scientific Applications Unit 3 - 4 Weka Cluster Analysis

What is Cluster Analysis? Cluster analysis is the process of grouping a set of physical or abstract objects into classes of similar objects. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. Clustering is also called data segmentation in some applications because it partitions large data sets into groups according to their similarity. Clustering can be used for outlier detection, where outliers may be more interesting than common cases. Clustering is a form of unsupervised learning (learning by observation).

Examples of Clustering Applications Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs Land use: Identification of areas of similar land use in an earth observation database Insurance: Identifying groups of motor insurance policy holders with a high average claim cost City-planning: Identifying groups of houses according to their house type, value, and geographical location

Types of Data in Cluster Analysis Clustering algorithms typically operate on either of the following two data structures: Data matrix: represents n objects, such as persons, with p variables (measurements or attributes), such as age, height, weight and so on. The structure is in the form of an n-by-p matrix.

Types of Data in Cluster Analysis Dissimilarity matrix: Stores a collection of proximities that are available for all pairs of n objects. The structure is in the form of an n-by-n matrix, where d(i,j) is the measured difference or dissimilarity between objects i and j. In general, d(i,j) is a nonnegative number that is close to 0 when objects i and j are highly similar or “near” each other, and becomes larger the more they differ. The matrix is symmetric with a zero diagonal: d(i,j) = d(j,i) and d(i,i) = 0.

Types of Data in Cluster Analysis Many clustering algorithms operate on a dissimilarity matrix. If the data are presented in the form of a data matrix, it can first be transformed into a dissimilarity matrix before applying clustering algorithms. The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables. Different algorithms have been developed for computing distances between different types of variables and between objects with mixed-type variables.
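The transformation from a data matrix to a dissimilarity matrix can be sketched as follows. This is a minimal illustration, assuming every variable is interval-scaled and that Euclidean distance is an acceptable choice of d(i,j); the slides do not prescribe a particular distance function, and the function name and sample data are hypothetical.

```python
import math

def dissimilarity_matrix(data):
    """Build the n-by-n dissimilarity matrix for an n-by-p data matrix.

    Assumption: all p variables are interval-scaled, so Euclidean
    distance is used as d(i,j).
    """
    n = len(data)
    d = [[0.0] * n for _ in range(n)]  # d(i,i) = 0 on the diagonal
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.dist(data[i], data[j])
            d[i][j] = dist
            d[j][i] = dist  # symmetry: d(i,j) = d(j,i)
    return d

# Hypothetical data matrix: 3 objects (rows), 2 variables (columns)
data = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = dissimilarity_matrix(data)
for row in D:
    print(row)
```

Only the upper triangle is computed; the lower triangle is filled in by symmetry.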

Cautions for Interval-Scaled Variables Interval-scaled variables are continuous measurements on a roughly linear scale, e.g. weight, height, latitude and longitude coordinates, temperature. The measurement units used can affect the clustering analysis. E.g. changing measurements from meters to inches for height may lead to a very different clustering structure. Therefore the data need to be standardized.

Cautions for Interval-Scaled Variables Standardize data: convert the original measurements of variable f to unitless variables. Calculate the mean absolute deviation, s_f = (1/n)(|x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f|), where x_1f, ..., x_nf are the n measurements of f and m_f is the mean of f. Calculate the standardized measurement or z-score: z_if = (x_if - m_f) / s_f. Whether and how to perform standardization is the choice of the user.
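A short sketch of this standardization, using the mean absolute deviation as the scale, may make the unit-independence concrete. The function name and the height values are made up for illustration.

```python
def standardize(values):
    """Return z-scores of one variable f using the mean absolute deviation.

    m_f  = mean of the n measurements
    s_f  = (1/n) * sum(|x_if - m_f|)
    z_if = (x_if - m_f) / s_f
    """
    n = len(values)
    m_f = sum(values) / n
    s_f = sum(abs(x - m_f) for x in values) / n
    if s_f == 0:
        raise ValueError("all measurements are equal; z-scores are undefined")
    return [(x - m_f) / s_f for x in values]

# Hypothetical heights, first in meters and then converted to inches
heights_m = [1.50, 1.60, 1.70, 1.80, 1.90]
heights_in = [h * 39.37 for h in heights_m]

z_m = standardize(heights_m)
z_in = standardize(heights_in)
print(z_m)
print(z_in)
```

Both calls give the same z-scores up to floating-point rounding, so the choice of meters or inches no longer influences the distances computed afterwards.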

Variables of Mixed Types Example: calculate the dissimilarity using all tests. After distance computation, the resulting matrix is [matrix not reproduced in the transcript]. Thus, objects 1 and 4 are the most similar.

Clustering Methods Partitioning methods: Given a database of n objects, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k<=n. Each group must contain at least one object, and each object must belong to exactly one group. A partitioning method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another.

Clustering Methods Hierarchical methods Creates a hierarchical decomposition of the given set of data objects. The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one, or until a termination condition holds. The divisive approach, also called the top-down approach, starts with all of the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until eventually each object is one cluster, or until a termination condition holds.

Clustering Methods The choice of clustering algorithm depends both on the type of data available and on the particular purpose of the application. If cluster analysis is used as a descriptive or exploratory tool, it is possible to try several algorithms on the same data to see what the data may disclose.

Partitioning Methods Given D, a data set of n objects, and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k<=n), where each partition represents a cluster. K-means method Cluster similarity is measured in regard to the mean values of the objects in a cluster, which can be viewed as the cluster’s centroid or center of gravity.

Partitioning Methods Example

Partitioning Methods Example Let k = 3. Arbitrarily choose three objects as the three initial cluster centers, where centers are marked by a “+”. Each object is distributed to a cluster based on the center to which it is the nearest. The cluster centers are updated by calculating the new mean based on the current objects in the cluster. Using the new cluster centers, the objects are redistributed to the clusters based on which cluster center is the nearest. Eventually no redistribution of the objects in any cluster occurs, and so the process terminates.
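The steps above can be sketched in a few lines. This is an illustrative sketch rather than the Weka implementation: it assumes Euclidean distance, takes the initial centers as an argument instead of choosing them at random, and uses made-up 2-D points.

```python
import math

def k_means(points, initial_centers, max_iter=100):
    """Plain k-means: assign each object to its nearest center, then
    recompute each center as the mean of its cluster, until no object
    changes cluster (or max_iter is reached)."""
    centers = [tuple(c) for c in initial_centers]
    assignment = None
    for _ in range(max_iter):
        # Distribute every object to the cluster with the nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no redistribution occurred, so the process terminates
        assignment = new_assignment
        # Update each center to the mean of the objects now in its cluster.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centers

# Hypothetical data: three visually separate groups of three points each
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (1, 8), (2, 9), (1, 9)]
# Let k = 3 and pick three objects as the initial centers (a deliberately poor choice)
assignment, centers = k_means(points, [points[0], points[1], points[3]])
print(assignment)
print(centers)
```

Two of the initial centers sit in the same group, so the first assignment is poor; the relocation step corrects it in the next pass, and the pass after that confirms nothing has moved.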

Partitioning Methods The k-means method can only be applied when the mean of a cluster is defined, so it cannot be applied to data with categorical attributes. K-modes method: extends the k-means paradigm to cluster categorical data by replacing the means of clusters with modes, using new dissimilarity measures to deal with categorical objects and a frequency-based method to update the modes of clusters. The k-means and the k-modes methods can be integrated to cluster data with mixed numeric and categorical values. The k-means method is sensitive to noise and outlier data points because a small number of such data can substantially influence the mean value.
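The two ingredients that k-modes swaps in can be illustrated briefly. The slides do not say which dissimilarity measure is meant, so this sketch assumes simple matching (the number of attributes on which two objects differ); the function names and sample records are hypothetical.

```python
from collections import Counter

def mismatch_count(a, b):
    """Simple-matching dissimilarity (an assumption): the number of
    categorical attributes on which objects a and b differ."""
    return sum(x != y for x, y in zip(a, b))

def cluster_mode(members):
    """Frequency-based update: for each attribute, take the most
    frequent category among the cluster's members."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))

# Hypothetical cluster of categorical objects (color, size, shape)
members = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "small", "round"),
]
mode = cluster_mode(members)
print(mode)
print(mismatch_count(("blue", "large", "square"), mode))
```

In a full k-modes loop these two functions would take the place of the distance and mean computations in k-means.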

Hierarchical Methods Group data objects into a tree of clusters. Two types: Agglomerative: a bottom-up strategy that places each object in its own cluster and merges these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied. Divisive: a top-down strategy that starts with all objects in one cluster. It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until certain termination conditions are satisfied, such as a desired number of clusters being obtained.
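A small sketch of the agglomerative strategy follows. The slides do not specify how the distance between two clusters is measured, so single linkage (the distance between their closest members) is assumed here, with a desired number of clusters as the termination condition; the data are made up.

```python
import math

def agglomerative(points, target_clusters):
    """Bottom-up clustering: start with one cluster per object and
    repeatedly merge the two closest clusters.

    Assumption: single linkage (distance between closest members).
    Returns the final clusters as lists of point indices, plus the
    sequence of merges performed.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []

    def linkage(a, b):
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > target_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        _, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        merges.append((clusters[x], clusters[y]))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return clusters, merges

# Hypothetical one-dimensional data, written as 1-tuples
points = [(0.0,), (1.0,), (5.0,), (6.5,), (20.0,)]
clusters, merges = agglomerative(points, target_clusters=2)
print(clusters)
print(merges)
```

Setting target_clusters to 1 continues merging until all objects are in a single cluster, and the recorded merges are the information a dendrogram would display.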

Hierarchical Methods

Outlier Discovery What are outliers? Objects that are considerably dissimilar from the remainder of the data. Example: Sports: Michael Jordan, Wayne Gretzky, ... Problem: Define and find outliers in large data sets. Applications: credit card fraud detection, telecom fraud detection, customer segmentation, medical analysis.

Videos Weka https://www.youtube.com/watch?v=HCA0Z9kL7Hg https://www.youtube.com/watch?v=9aODdNSAauI R https://www.youtube.com/watch?v=sAtnX3UJyN0 Data site used in video - http://archive.ics.uci.edu/ml/ Data file on Canvas: iris.csv Sample code on Canvas: clusterAnalysis.R