Clustering (Chap 7): Algorithms and applications

Introduction

Clustering is an important data mining task. It makes it possible to summarize large amounts of high-dimensional data "almost automatically". Clustering (also known as segmentation) is applied in many domains: retail, the web, image analysis, bioinformatics, psychology, brain science, medical diagnosis, etc.

Key Idea

Given: (i) a data set D and (ii) a similarity function s.
Goal: break D up into groups such that items that are similar to each other end up in the same group, and items that are dissimilar to each other end up in different groups.
Examples: customers with similar buying patterns end up in the same cluster; people with similar "biological signals" end up in the same cluster; regions with similar "weather" patterns end up in the same cluster.

Clustering Example: Iris Data

The Iris data set is a very small data set consisting of attributes of 3 flower types.
Flower types: Iris setosa, Iris versicolor, Iris virginica.
4 attributes: sepal-length, sepal-width, petal-length, petal-width.
Thus each flower is a data vector in a four-dimensional space. The "fifth" dimension is a label. Example: [5.1, 3.5, 1.4, 0.2, iris-setosa]. It is hard to visualize in four dimensions!

Example: Iris Data Set

Which is the "best" pair of attributes for distinguishing the three flower types?

Attributes and Features

In almost all cases the labels are not given, but we can acquire attributes and construct new features. Notice the terminology: attributes, features, variables, and dimensions are often used interchangeably. I prefer to distinguish between attributes and features: we acquire attributes and construct features. Thus there are finitely many attributes but infinitely many possible features. For example, the Iris data set has four attributes. Let's construct a new feature, sepal-length/sepal-width, as in the sketch below.
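A minimal MATLAB sketch of this feature construction, using the built-in Fisher iris data (fisheriris ships with the Statistics and Machine Learning Toolbox; the variable names ratio and measPlus are ours):

load fisheriris                     % meas is the 150x4 attribute matrix
ratio = meas(:,1) ./ meas(:,2);     % new feature: sepal-length / sepal-width
measPlus = [meas, ratio];           % data with the constructed fifth feature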

Similarity and Distance Functions

We have already seen the cosine similarity function:
sim(x,y) = (x . y) / (||x|| ||y||)
We can also define a "distance" function, which in some sense is the "opposite" of similarity: entities that are highly similar to each other have a small distance between them, while items that are less similar have a large distance between them.
Example: dist(x,y) = 1 - sim(x,y)
"Identical" objects: sim(x,y) = 1, so dist(x,y) = 0.
Completely dissimilar objects: sim(x,y) = 0, so dist(x,y) = 1.
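A small MATLAB sketch of this pair of functions (the names cosSim and cosDist are ours):

% Cosine similarity of two (nonzero) vectors.
cosSim  = @(x,y) dot(x, y) / (norm(x) * norm(y));

% The derived distance: 1 minus the similarity.
cosDist = @(x,y) 1 - cosSim(x, y);

cosSim([1 0], [2 0])    % 1: same direction  -> distance 0
cosDist([1 0], [0 3])   % 1: orthogonal      -> maximally distant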

Euclidean Distance

Take two data vectors D1 and D2:
D1 = (x1, x2, ..., xn)
D2 = (y1, y2, ..., yn)
Then the Euclidean distance between D1 and D2 is
dist(D1,D2) = sqrt( (x1-y1)^2 + (x2-y2)^2 + ... + (xn-yn)^2 )

Euclidean Distance: Examples

D1 = (1,1,1); D2 = (1,1,1): dist(D1,D2) = 0
D1 = (1,2,1); D2 = (2,1,1): dist(D1,D2) = sqrt((1-2)^2 + (2-1)^2 + (1-1)^2) = sqrt(2) ≈ 1.41
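The second example can be checked in a couple of lines of MATLAB:

D1 = [1 2 1];
D2 = [2 1 1];

% Square root of the sum of squared coordinate differences.
sqrt(sum((D1 - D2).^2))   % ans = 1.4142

% Equivalently, the norm of the difference vector:
norm(D1 - D2)             % ans = 1.4142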

Simple Clustering: One-Dimensional/One-Centroid

Three data points: 1, 2, and 6. We want to summarize them with one point. The average of (1, 2, 6) is 9/3 = 3, and 3 is called the centroid.

High-Dimensional/One-Centroid

Let (2,4), (4,6), and (8,10) be three data points in two dimensions. The average (centroid) is computed coordinate-wise:
( (2+4+8)/3, (4+6+10)/3 ) = (14/3, 20/3) ≈ (4.67, 6.67)

Different Interpretation of the Centroid

Let x1, x2, x3 be data points. Define a function f as
f(x) = ||x - x1||^2 + ||x - x2||^2 + ||x - x3||^2
Now, it turns out that the minimum of the function f occurs at the centroid of the three points. This interpretation generalizes easily to any number of dimensions and any number of points, and is checked numerically in the sketch below.
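A quick numeric check in MATLAB, using the three points from the previous slide (the element-wise X - x relies on implicit expansion, R2016b or later):

X = [2 4; 4 6; 8 10];       % three 2-D data points, one per row
c = mean(X, 1);             % the centroid: [4.6667 6.6667]

% f(x) = sum of squared Euclidean distances from x to the data points.
f = @(x) sum(sum((X - x).^2, 2));

f(c)                        % 37.333: the minimum
f(c + [0.5 0])              % 38.083: any other point gives a larger value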

High-Dimension/Many Centroids

Let X be a set of d-dimensional data points (data vectors). Find k points (not necessarily in X; let's call them cluster centers) such that the sum of the squared Euclidean distances of each point in X to its nearest cluster center is minimized. This is called the K-means problem. The data points assigned to their nearest cluster center end up forming the k clusters.
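In symbols, the objective is (notation ours; the slide states it only in words):

\min_{c_1, \ldots, c_k \in \mathbb{R}^d} \; \sum_{x \in X} \; \min_{1 \le j \le k} \, \lVert x - c_j \rVert^2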

K-means Algorithm

Let C = the initial k cluster centroids (often selected randomly).
Mark C as unstable.
While C is unstable:
  1. Assign every data point to its nearest centroid in C.
  2. Compute the centroid of the points assigned to each element of C.
  3. Update C to be the set of new centroids.
  4. Mark C as stable or unstable by comparing it with the previous set of centroids.
End While

Complexity: O(nkdI), where n = number of points, k = number of clusters, d = dimension, and I = number of iterations.
Takeaway: the complexity is linear in n. A runnable sketch follows below.
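A minimal MATLAB transcription of this loop (a sketch under stated assumptions: the function name simpleKmeans is ours, empty clusters are not handled, and implicit expansion requires R2016b or later):

function [labels, C] = simpleKmeans(X, k)
    n = size(X, 1);
    C = X(randperm(n, k), :);       % initial centroids: k random data points
    prevC = inf(k, size(X, 2));     % dummy value to force a first iteration
    while ~isequal(C, prevC)        % "while C is unstable"
        % Step 1: assign every point to its nearest centroid.
        D = zeros(n, k);
        for j = 1:k
            D(:, j) = sum((X - C(j, :)).^2, 2);  % squared distances suffice
        end
        [~, labels] = min(D, [], 2);
        % Steps 2-4: recompute the centroids and compare with the old set.
        prevC = C;
        for j = 1:k
            C(j, :) = mean(X(labels == j, :), 1);
        end
    end
end

For example, [labels, C] = simpleKmeans([randn(50,2); randn(50,2) + 4], 2) recovers two well-separated Gaussian blobs.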

An Important Question

Does the K-means algorithm "solve" the K-means problem? Answer: only partially. The algorithm converges to a local optimum, which need not be the global one. In fact, nobody has yet been able to design an efficient algorithm guaranteed to solve the K-means problem exactly; the problem is NP-hard, so this is part of the famous "P vs NP" question. There is a million-dollar prize if you can produce such a "solution" or show that no such solution can exist.

Example: 2 Clusters

Four data points: A(-1,2), B(1,2), C(-1,-2), D(1,-2).
K-means problem: the optimal solution is the centers (0,2) and (0,-2), with clusters {A,B} and {C,D}.
K-means algorithm: suppose the initial centroids are (-1,0) and (1,0); then {A,C} and {B,D} end up as the two clusters, and the algorithm is stuck at this suboptimal solution.
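This behavior can be reproduced with MATLAB's kmeans by fixing the starting centroids via the 'Start' option (Statistics and Machine Learning Toolbox; with the default batch updates the bad initialization is a fixed point):

X = [-1 2; 1 2; -1 -2; 1 -2];              % points A, B, C, D
[idx, C] = kmeans(X, 2, 'Start', [-1 0; 1 0])
% idx: A and C in one cluster, B and D in the other;
% C remains [-1 0; 1 0], the suboptimal fixed point.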

K-means Algorithm on the Iris Data

Note: the K-means algorithm does not know the labels!

K-means in MATLAB

In the tutorial you will be working with MATLAB. MATLAB is a very easy-to-use data processing language. For example, the K-means algorithm can be run using one line:
>> [IDX,C] = kmeans(irisdata,3);
The cluster labels (1, 2, 3) are stored in the variable IDX, and the centroids in C.
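A fuller sketch, assuming the built-in Fisher iris data stands in for irisdata (kmeans, fisheriris, grp2idx, and crosstab all require the Statistics and Machine Learning Toolbox):

load fisheriris                 % meas: 150x4 attributes; species: labels
[IDX, C] = kmeans(meas, 3);     % cluster into k = 3 groups

% The labels were never shown to kmeans; compare after the fact:
crosstab(IDX, grp2idx(species))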

Summary

Clustering is a way of compressing data into information from which we can derive "knowledge". The first step is to map the data into "data vectors" and then choose a "distance" function. We formalized clustering as the K-means problem and then used the K-means algorithm to obtain an (approximate) solution. The applications of clustering are almost limitless: almost every data mining exercise begins with clustering, to get an initial understanding of the data.