Clustering.

Slides:



Advertisements
Similar presentations
Mathematical Preliminaries
Advertisements

Advanced Piloting Cruise Plot.
1 Vorlesung Informatik 2 Algorithmen und Datenstrukturen (Parallel Algorithms) Robin Pomplun.
Copyright © 2003 Pearson Education, Inc. Slide 1 Computer Systems Organization & Architecture Chapters 8-12 John D. Carpinelli.
Chapter 1 The Study of Body Function Image PowerPoint
1 Copyright © 2013 Elsevier Inc. All rights reserved. Appendix 01.
and 6.855J Spanning Tree Algorithms. 2 The Greedy Algorithm in Action
Jeopardy Q 1 Q 6 Q 11 Q 16 Q 21 Q 2 Q 7 Q 12 Q 17 Q 22 Q 3 Q 8 Q 13
Jeopardy Q 1 Q 6 Q 11 Q 16 Q 21 Q 2 Q 7 Q 12 Q 17 Q 22 Q 3 Q 8 Q 13
Exit a Customer Chapter 8. Exit a Customer 8-2 Objectives Perform exit summary process consisting of the following steps: Review service records Close.
FACTORING ax2 + bx + c Think “unfoil” Work down, Show all steps.
Year 6 mental test 10 second questions
1 Discreteness and the Welfare Cost of Labour Supply Tax Distortions Keshab Bhattarai University of Hull and John Whalley Universities of Warwick and Western.
REVIEW: Arthropod ID. 1. Name the subphylum. 2. Name the subphylum. 3. Name the order.
Configuration management
Randomized Algorithms Randomized Algorithms CS648 1.
Projects Project 4 due tonight Project 5 (last project!) out later
David Luebke 1 6/7/2014 ITCS 6114 Skip Lists Hashing.
ABC Technology Project
Hash Tables.
1 Undirected Breadth First Search F A BCG DE H 2 F A BCG DE H Queue: A get Undiscovered Fringe Finished Active 0 distance from A visit(A)
VOORBLAD.
Name Convolutional codes Tomashevich Victor. Name- 2 - Introduction Convolutional codes map information to code bits sequentially by convolving a sequence.
1 Breadth First Search s s Undiscovered Discovered Finished Queue: s Top of queue 2 1 Shortest path from s.
Clustering k-mean clustering Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
BIOLOGY AUGUST 2013 OPENING ASSIGNMENTS. AUGUST 7, 2013  Question goes here!
Factor P 16 8(8-5ab) 4(d² + 4) 3rs(2r – s) 15cd(1 + 2cd) 8(4a² + 3b²)
Basel-ICU-Journal Challenge18/20/ Basel-ICU-Journal Challenge8/20/2014.
© 2012 National Heart Foundation of Australia. Slide 2.
Understanding Generalist Practice, 5e, Kirst-Ashman/Hull
K-means algorithm 1)Pick a number (k) of cluster centers 2)Assign every gene to its nearest cluster center 3)Move each cluster center to the mean of its.
Model and Relationships 6 M 1 M M M M M M M M M M M M M M M M
25 seconds left…...
1 Using one or more of your senses to gather information.
Januar MDMDFSSMDMDFSSS
Analyzing Genes and Genomes
Systems Analysis and Design in a Changing World, Fifth Edition
We will resume in: 25 Minutes.
©Brooks/Cole, 2001 Chapter 12 Derived Types-- Enumerated, Structure and Union.
Local Search Jim Little UBC CS 322 – CSP October 3, 2014 Textbook §4.8
CPSC 322, Lecture 14Slide 1 Local Search Computer Science cpsc322, Lecture 14 (Textbook Chpt 4.8) Oct, 5, 2012.
PSSA Preparation.
Essential Cell Biology
Basics of Statistical Estimation
Conceptual Clustering
CS 478 – Tools for Machine Learning and Data Mining Clustering: Distance-based Approaches.
Clustering AMCS/CS 340: Data Mining Xiangliang Zhang
1 Machine Learning: Lecture 10 Unsupervised Learning (Based on Chapter 9 of Nilsson, N., Introduction to Machine Learning, 1996)
Clustering. 2 Outline  Introduction  K-means clustering  Hierarchical clustering: COBWEB.
Clustering. 2 Outline  Introduction  K-means clustering  Hierarchical clustering: COBWEB.
Clustering II. 2 Finite Mixtures Model data using a mixture of distributions –Each distribution represents one cluster –Each distribution gives probabilities.
START OF DAY 8 Reading: Chap. 14. Midterm Go over questions General issues only Specific issues: visit with me Regrading may make your grade go up OR.
Data Mining Practical Machine Learning Tools and Techniques Chapter 4: Algorithms: The Basic Methods Section 4.7: Instance-Based Learning Rodney Nielsen.
Chapter 11 Statistical Techniques. Data Warehouse and Data Mining Chapter 11 2 Chapter Objectives  Understand when linear regression is an appropriate.
Clustering. 2 Outline  Introduction  K-means clustering  Hierarchical clustering: COBWEB.
Data Mining Practical Machine Learning Tools and Techniques By I. H. Witten, E. Frank and M. A. Hall 6.8: Clustering Rodney Nielsen Many / most of these.
Compiled By: Raj Gaurang Tiwari Assistant Professor SRMGPC, Lucknow Unsupervised Learning.
1 Machine Learning Lecture 9: Clustering Moshe Koppel Slides adapted from Raymond J. Mooney.
Rodney Nielsen Many of these slides were adapted from: I. H. Witten, E. Frank and M. A. Hall Data Science Algorithms: The Basic Methods Clustering WFH:
Data Science Practical Machine Learning Tools and Techniques 6.8: Clustering Rodney Nielsen Many / most of these slides were adapted from: I. H. Witten,
Data Science Algorithms: The Basic Methods
Machine Learning Lecture 9: Clustering
Clustering.
Text Categorization Berlin Chen 2003 Reference:
Clustering Techniques
Data Mining CSCI 307, Spring 2019 Lecture 24
Presentation transcript:

Clustering

Outline Introduction K-means clustering Hierarchical clustering: COBWEB

Classification vs. Clustering Classification: Supervised learning: Learns a method for predicting the instance class from pre-labeled (classified) instances

Clustering Unsupervised learning: Finds “natural” grouping of instances given un-labeled data

Clustering Methods Many different method and algorithms: For numeric and/or symbolic data Deterministic vs. probabilistic Exclusive vs. overlapping Hierarchical vs. flat Top-down vs. bottom-up

Clusters: exclusive vs. overlapping Simple 2-D representation Non-overlapping Venn diagram Overlapping a k j i h g f e d c b

Clustering Evaluation Manual inspection Benchmarking on existing labels Cluster quality measures distance measures high similarity within a cluster, low across clusters

The distance function Simplest case: one numeric attribute A Distance(X,Y) = A(X) – A(Y) Several numeric attributes: Distance(X,Y) = Euclidean distance between X,Y Nominal attributes: distance is set to 1 if values are different, 0 if they are equal Are all attributes equally important? Weighting the attributes might be necessary

Simple Clustering: K-means Works with numeric data only Pick a number (K) of cluster centers (at random) Assign every item to its nearest cluster center (e.g. using Euclidean distance) Move each cluster center to the mean of its assigned items Repeat steps 2,3 until convergence (change in cluster assignments less than a threshold)

K-means example, step 1 k1 Y Pick 3 initial k2 cluster centers (randomly)

K-means example, step 2 k1 Y k2 Assign each point to the closest cluster center k3

K-means example, step 3 X Y k1 k1 k2 Move each cluster center to the mean of each cluster k3 k2 k3

K-means example, step 4 Reassign k1 points Y Reassign points closest to a different new cluster center Q: Which points are reassigned? k1 k3 k2

K-means example, step 4 … X Y k1 A: three points with animation k3 k2

K-means example, step 4b X Y k1 re-compute cluster means k3 k2

K-means example, step 5 k1 Y k2 move cluster centers to cluster means

initial cluster centers Discussion Result can vary significantly depending on initial choice of seeds Can get trapped in local minimum Example: To increase chance of finding global optimum: restart with different random seeds instances initial cluster centers

K-means clustering summary Advantages Simple, understandable items automatically assigned to clusters Disadvantages Must pick number of clusters before hand All items forced into a cluster Too sensitive to outliers

K-means variations K-medoids – instead of mean, use medians of each cluster Mean of 1, 3, 5, 7, 9 is Mean of 1, 3, 5, 7, 1009 is Median of 1, 3, 5, 7, 1009 is Median advantage: not affected by extreme values For large databases, use sampling 5 205 5

*Hierarchical clustering Bottom up Start with single-instance clusters At each step, join the two closest clusters Design decision: distance between clusters E.g. two closest instances in clusters vs. distance between means Top down Start with one universal cluster Find two clusters Proceed recursively on each subset Can be very fast Both methods produce a dendrogram

*Incremental clustering Heuristic approach (COBWEB/CLASSIT) Form a hierarchy of clusters incrementally Start: tree consists of empty root node Then: add instances one by one update tree appropriately at each stage to update, find the right leaf for an instance May involve restructuring the tree Base update decisions on category utility

*Clustering weather data 1 ID Outlook Temp. Humidity Windy A Sunny Hot High False B True C Overcast D Rainy Mild E Cool Normal F G H I J K L M N 2 3

*Clustering weather data 4 ID Outlook Temp. Humidity Windy A Sunny Hot High False B True C Overcast D Rainy Mild E Cool Normal F G H I J K L M N 5 Merge best host and runner-up 3 Consider splitting the best host if merging doesn’t help

Oops! a and b are actually very similar *Final hierarchy ID Outlook Temp. Humidity Windy A Sunny Hot High False B True C Overcast D Rainy Mild Oops! a and b are actually very similar

*Example: the iris data (subset)

*Clustering with cutoff

*Category utility Category utility: quadratic loss function defined on conditional probabilities: Every instance in different category  numerator becomes maximum number of attributes

*Overfitting-avoidance heuristic If every instance gets put into a different category the numerator becomes (maximal): Where n is number of all possible attribute values. So without k in the denominator of the CU-formula, every cluster would consist of one instance! Maximum value of CU

Other Clustering Approaches EM – probability based clustering Bayesian clustering SOM – self-organizing maps …

Discussion Can interpret clusters by using supervised learning learn a classifier based on clusters Decrease dependence between attributes? pre-processing step E.g. use principal component analysis Can be used to fill in missing values Key advantage of probabilistic clustering: Can estimate likelihood of data Use it to compare different models objectively

Examples of Clustering Applications Marketing: discover customer groups and use them for targeted marketing and re-organization Astronomy: find groups of similar stars and galaxies Earth-quake studies: Observed earth quake epicenters should be clustered along continent faults Genomics: finding groups of gene with similar expressions …

Clustering Summary unsupervised many approaches K-means – simple, sometimes useful K-medoids is less sensitive to outliers Hierarchical clustering – works for symbolic attributes Evaluation is a problem