Classification: Cluster Analysis and Related Techniques Tanya, Caroline, Nick.


Introduction to Classification
- Search for divisions within data: identify groups of individuals with similar characteristics and cluster them together
- Like ordination, classification helps researchers explore data and generate hypotheses
  – Ordination techniques vs. classification techniques

Objective??
- What is a cluster?
- No formal rule exists for identifying clusters: it is subjective; you make the call

Hierarchical vs. Non-Hierarchical
- Hierarchical techniques divide data into clusters, then look for relationships between those clusters to create higher-order clusters, producing dendrograms
  – Dendrograms subdivide a set of individuals into progressively smaller clusters until a stopping condition is encountered
- Non-hierarchical techniques divide data into clusters without looking at relationships between clusters

Dendrogram of Classification Techniques

Hierarchical Techniques
- Monothetic vs. polythetic
  – Monothetic: imposes classifications based on the presence or absence of one attribute at a time
    - Association analysis
  – Polythetic: uses all information within the data
    - Most common modern approach
    - Cluster analysis
    - TWINSPAN

Cluster Analysis
- Many procedures and algorithms may be used to create a valid dendrogram
- Similar in technique to Bray-Curtis ordination
- Procedure:
  – Start from a square matrix of dissimilarities
    - Find the lowest distance in the matrix
    - Identify the pair that generated it
    - Fuse the two observations together (first cluster)
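The first fusion step of the procedure can be sketched in a few lines of Python. This is only an illustration: the labels A to D, the dissimilarity values, and the helper name `closest_pair` are made up for the example and do not come from the slides.

```python
def closest_pair(dist):
    """Return (i, j, d) for the smallest off-diagonal entry of a square matrix."""
    best = None
    n = len(dist)
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only; the matrix is symmetric
            if best is None or dist[i][j] < best[2]:
                best = (i, j, dist[i][j])
    return best


# Hypothetical dissimilarities among four observations (not from the slides).
labels = ["A", "B", "C", "D"]
dist = [
    [0.0, 0.3, 0.8, 0.9],
    [0.3, 0.0, 0.7, 0.6],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.6, 0.2, 0.0],
]

i, j, d = closest_pair(dist)             # find the lowest distance and its pair
first_cluster = (labels[i], labels[j])   # fuse the two observations
print(first_cluster, d)
```

Repeating the search after each fusion, with some rule for measuring the distance to the new cluster, builds up the full dendrogram.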

Example

Example

Dissimilarity Matrix

Rules for cluster formation
- Single-link clustering (AKA nearest-neighbor clustering)
  – Clusters are defined by fusing the pairs of individuals with the smallest distance
  – Chaining: two individuals end up in the same cluster despite a large dissimilarity; this occurs when they are linked by a series of closely connected points
  – Constituent clusters may grow gradually, each fusion adding one or a small number of elements, which makes the result inconclusive and hard to interpret
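A minimal sketch of single-link clustering, written to make chaining visible. The one-dimensional positions and the function name `single_link` are assumptions chosen for the illustration, not data or code from the presentation.

```python
def single_link(dist, n_clusters):
    """Agglomerate with the single-link rule until n_clusters remain."""
    clusters = [{i} for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: cluster distance is the smallest member-to-member distance.
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] |= clusters[b]  # fuse the two closest clusters
        del clusters[b]
    return clusters


# Hypothetical observations placed along a line (illustrative values only).
positions = [0, 1, 2, 3, 4, 6, 7]
dist = [[abs(p - q) for q in positions] for p in positions]

clusters = single_link(dist, 2)
print(sorted(sorted(c) for c in clusters))
```

With these values the observations at positions 0 and 4 end up in the same cluster, even though 4 is closer to 6 than to 0, because a chain of unit-distance neighbours links them.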

Other Rules
- Complete-link clustering
  – Judges the distance between two clusters by their most widely separated members, and fuses the pair of clusters for which that greatest distance is smallest
  – Exact opposite of single link
  – May end up separating individuals that are very similar
- Minimum variance clustering (Ward's technique)
  – Intermediate
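To compare the three rules side by side, the sketch below scores the same pair of clusters under each of them. It assumes one-dimensional observations with made-up values, and it expresses Ward's criterion as the increase in the within-cluster sum of squares caused by the fusion; the function names are hypothetical.

```python
def single_link_distance(a, b):
    """Smallest distance between a member of a and a member of b."""
    return min(abs(x - y) for x in a for y in b)


def complete_link_distance(a, b):
    """Greatest distance between a member of a and a member of b."""
    return max(abs(x - y) for x in a for y in b)


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if a and b are fused."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return len(a) * len(b) / (len(a) + len(b)) * (mean_a - mean_b) ** 2


# Two hypothetical clusters of one-dimensional observations.
cluster_a = [0, 1]
cluster_b = [3, 6]

single = single_link_distance(cluster_a, cluster_b)
complete = complete_link_distance(cluster_a, cluster_b)
ward = ward_cost(cluster_a, cluster_b)
print(single, complete, ward)
```

Under every rule, the pair of clusters with the lowest score is the next to be fused; the rules differ only in how that score is computed.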

Interpretation
- There are NO objective rules for interpreting dendrograms
- Use the dendrogram for hypothesis formation: look for divisions that coincide with existing knowledge about the data (metadata, Chapter 1)
- Complementary analysis

Divisive Classification Techniques
- Take an entire dataset and divide it into categories
- As always, the boundaries for these categories are subjective
- On the plus side, this forces us to admit that there is some uncertainty, which a software package wouldn't tell us

TWINSPAN
- Acronym for Two-Way Indicator Species Analysis
- Polythetic divisive classification technique
- Output is in two-way tables

TWINSPAN Tables
- There are two ordered lists, one for species and one for observations
- There are two dendrograms, one to classify species and one to classify observations
- Pseudospecies are constructs that convert continuous distributions to presence/absence (discrete) data
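The pseudospecies idea can be sketched as follows. The cut levels (0, 2, 5, 10, 20) and the exact presence rule are assumptions made for the illustration, as is the function name; the slides do not specify them, and this is not TWINSPAN's own code.

```python
def pseudospecies(abundance, cut_levels=(0, 2, 5, 10, 20)):
    """Convert one continuous abundance into presence/absence flags, one per cut level.

    Assumed rule: a pseudospecies is present (1) when the species occurs at all
    and its abundance reaches the cut level; otherwise it is absent (0).
    """
    return [1 if abundance > 0 and abundance >= cut else 0 for cut in cut_levels]


# Hypothetical percent-cover values for one species in three observations.
print(pseudospecies(0.0))
print(pseudospecies(7.0))
print(pseudospecies(25.0))
```

Each continuous value becomes a short row of discrete presence/absence entries, which is the form a two-way table of species against observations needs.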

HOMEWORK!!!!!!
1) What is the difference between hierarchical and non-hierarchical classification techniques?
2) Define "cluster".
3) T/F: There can be only one valid dendrogram for a single data set. (Correct if false.)
**********Bonus********** What is the background of the PowerPoint supposed to represent?