Basic Kmeans Clustering and Optimization

Slides:

Advertisements

Similar presentations

Algorithms and applications

Advertisements

Unsupervised Learning Clustering K-Means. Recall: Key Components of Intelligent Agents Representation Language: Graph, Bayes Nets, Linear functions Inference.

Least squares CS1114

X-Informatics Clustering to Parallel Computing February 18 and Geoffrey Fox

Content Based Image Clustering and Image Retrieval Using Multiple Instance Learning Using Multiple Instance Learning Xin Chen Advisor: Chengcui Zhang Department.

Clustering… in General In vector space, clusters are vectors found within  of a cluster vector, with different techniques for determining the cluster.

© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.

Motion Analysis (contd.) Slides are from RPI Registration Class.

CSci 6971: Image Registration Lecture 4: First Examples January 23, 2004 Prof. Chuck Stewart, RPI Dr. Luis Ibanez, Kitware Prof. Chuck Stewart, RPI Dr.

Optimization via Search CPSC 315 – Programming Studio Spring 2009 Project 2, Lecture 4 Adapted from slides of Yoonsuck Choe.

Prénom Nom Document Analysis: Data Analysis and Clustering Prof. Rolf Ingold, University of Fribourg Master course, spring semester 2008.

Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 16: Flat Clustering 1.

Clustering Color/Intensity

Lecture 4 Unsupervised Learning Clustering & Dimensionality Reduction

Unsupervised Learning

Cluster Analysis (1).

KNN, LVQ, SOM. Instance Based Learning K-Nearest Neighbor Algorithm (LVQ) Learning Vector Quantization (SOM) Self Organizing Maps.

What is Cluster Analysis?

Clustering and greedy algorithms Prof. Noah Snavely CS1114

Collaborative Filtering Matrix Factorization Approach

Tal Mor  Create an automatic system that given an image of a room and a color, will color the room walls  Maintaining the original texture.

X-Informatics Case Study: e-Commerce and Life Style Informatics: Recommender Systems III: Algorithms Continued and Software February Geoffrey Fox.

Remarks on Big Data Clustering (and its visualization) Big Data and Extreme-scale Computing (BDEC) Charleston SC May Geoffrey Fox

COMMON EVALUATION FINAL PROJECT Vira Oleksyuk ECE 8110: Introduction to machine Learning and Pattern Recognition.

CSC321: Neural Networks Lecture 12: Clustering Geoffrey Hinton.

CPSC 502, Lecture 15Slide 1 Introduction to Artificial Intelligence (AI) Computer Science cpsc502, Lecture 16 Nov, 3, 2011 Slide credit: C. Conati, S.

CSIE Dept., National Taiwan Univ., Taiwan

Generative Topographic Mapping by Deterministic Annealing Jong Youl Choi, Judy Qiu, Marlon Pierce, and Geoffrey Fox School of Informatics and Computing.

Image segmentation Prof. Noah Snavely CS1114

Data Mining: Knowledge Discovery in Databases Peter van der Putten ALP Group, LIACS Pre-University College LAPP-Top Computer Science February 2005.

Machine Learning Queens College Lecture 7: Clustering.

Optimization Problems

Chapter 13 (Prototype Methods and Nearest-Neighbors )

INTRO TO OPTIMIZATION MATH-415 Numerical Analysis 1.

6.S093 Visual Recognition through Machine Learning Competition Image by kirkh.deviantart.com Joseph Lim and Aditya Khosla Acknowledgment: Many slides from.

Optimization Indiana University July Geoffrey Fox

1 Learning Bias & Clustering Louis Oliphant CS based on slides by Burr H. Settles.

May 2003 SUT Color image segmentation – an innovative approach Amin Fazel May 2003 Sharif University of Technology Course Presentation base on a paper.

COMP24111 Machine Learning K-means Clustering Ke Chen.

Unsupervised Learning Part 2. Topics How to determine the K in K-means? Hierarchical clustering Soft clustering with Gaussian mixture models Expectation-Maximization.

Optimization Problems

Semi-Supervised Clustering

Machine Learning Clustering: K-means Supervised Learning

Heuristic Optimization Methods

Ke Chen Reading: [7.3, EA], [9.1, CMB]

Data Mining K-means Algorithm

Unsupervised Learning

Haim Kaplan and Uri Zwick

Reading: Pedro Domingos: A Few Useful Things to Know about Machine Learning source: /cacm12.pdf reading.

Data Mining (and machine learning)

AIM: Clustering the Data together

Pattern Recognition CS479/679 Pattern Recognition Dr. George Bebis

Advanced Artificial Intelligence

CSSE463: Image Recognition Day 23

Collaborative Filtering Matrix Factorization Approach

Ke Chen Reading: [7.3, EA], [9.1, CMB]

Optimization Problems

CSE 589 Applied Algorithms Spring 1999

Clustering 77B Recommender Systems

School of Computer Science & Engineering

Simple Kmeans Examples

LECTURE 21: CLUSTERING Objectives: Mixture Densities Maximum Likelihood Estimates Application to Gaussian Mixture Models k-Means Clustering Fuzzy k-Means.

Indiana University July Geoffrey Fox

Text Categorization Berlin Chen 2003 Reference:

CSSE463: Image Recognition Day 23

More on HW 2 (due Jan 26) Again, it must be in Python 2.7.

Recommendation Systems

CSSE463: Image Recognition Day 23

Approximation Algorithms

Presentation transcript:

Basic Kmeans Clustering and Optimization September 25 2018 Geoffrey Fox gcf@indiana.edu http://www.infomall.org School of Informatics, Computing and Engineering Digital Science Center Indiana University Bloomington

Kmeans Clustering http://en.wikipedia.org/wiki/Kmeans

1) k initial "means" (in this case k=3) are randomly generated within the data domain (shown in color). 2) k clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means. Wikipedia

4) Steps 2 and 3 are repeated until convergence has been reached. But red and green centers (circles) are clearly in wrong place!!! (stars added by me about right for centers) 3) The centroid of each of the k clusters becomes the new mean. Wikipedia

Clustering of Recommender System Example

Next Set of Slides Compares original labelling of points with a fresh clustering into 3 clusters (using deterministic annealing advanced method in paper but k means similar) Fresh clustering has centers marked The 1000 points are identically placed in side by side pictures

Clustering v User Rating I

Clustering v User Rating II

Clustering v User Rating III

Typical K=4 Small Clustering

Typical K=4 Very Large Clustering

Typical K=6 Large Clustering

A second look at a different solution of K=2 Small Clustering Note Symmetry related solutions A second look at a different solution of K=2 Small Clustering

Typical K=2 Small Clustering

Messy K=4 large Clustering

Messy K=4 Small Clustering

Local Optima in Clustering

Getting the wrong answer Remember all things in life – including clustering – are optimization problems Lets say it’s a minimization (change sign if maximizing) k means always ends at a minimum but maybe a local minimum out of which small changes are stable We will see examples of false minima later in course in Kmeans with Python lesson Unhappiness Real Minimum Local Minimum Wrong Answer Initial Position Center Positions

Annealing Physics systems find true lowest energy state if you anneal i.e. you equilibrate at each temperature as you cool Allow atoms to move around If system cools too fast it finds a false minima Corresponds to fuzzy algorithms

Annealing T1 T2 T3 T3 < T2 < T1 Objective Function Configuration – Center Positions Y(k) Fixed Temperature – false minima Annealing -- correct minima

Greedy Algorithms Many algorithms like k Means are greedy They consist of iterations Each iterations makes the step that most obviously minimizes function Set of short term optima; not always global optima Same is true of life (politics, wall street)

General Optimization Problem Have a function E that depends on up to billions of parameters Can always make optimization as minimization Often E guaranteed to be positive as sum of squares “Continuous Parameters” – e.g. Cluster centers Expectation Maximization “Discrete Parameters” – e.g. Assignment problems Genetic Algorithms Some methods just use function evaluations Safest methods – Calculate first but not second Derivatives Steepest Descent always gets stuck but always decreases E One dimension – line searches – very easy Fastest Methods – Newton’s method with second derivatives Always diverges in naïve version For least squares, second derivative of E only needs first derivatives of components Constraints Use penalty functions

Some Examples Clustering. Optimize position of clusters to minimize least squares sum of (particle position – center position)2 Analysis of CReSIS snow layer data. Optimize continuous layers to reproduce observed radar data Dimension reduction MDS. Assign 3D position to each point to minimize least sum of (original distance – Euclidean distance)2 for each point pair Web Search. Find web pages that are “nearest your query” Recommender Systems. Find movies (Netflix), books (Amazon) etc. that you want and which will cause you to spend as much money on site that originator will get credit Deep learning: Optimize classification (learning of features) of faces etc. Distances in “funny spaces” often important

Distances in Funny Spaces I In user-based collaborative filtering, we can think of users in a space of dimension N where there are N items and M users. Let i run over items and u over users Then each user is represented as a vector Ui(u) in “item-space” where ratings are vector components. We are looking for users u u’ that are near each other in this space as measured by some distance between Ui(u) and Ui(u’) If u and u’ rate all items then these are “real” vectors but almost always they each only rates a small fraction of items and the number in common is even smaller The “Pearson coefficient” is just one distance measure that can be used Only sum over i rated by u and u’ Last.fm uses this for songs as does Amazon, Netflix

Distances in Funny Spaces II In item-based collaborative filtering, we can think of items in a space of dimension M where there are N items and M users. Let i run over items and u over users Then each item is represented as a vector Ru(i) in “user-space” where ratings are vector components. We are looking for items i i’ that are near each other in this space as measured by some distance between Ru(i) and Ru(i’) If i and i’ rated by all users then these are “real” vectors but almost always they are each only rated by a small fraction of users and the number in common is even smaller The “Cosine measure” is just one distance measure that can be used Only sum over users u rating both i and i’

Distances in Funny Spaces III In content based recommender systems, we can think of items in a space of dimension M where there are N items and M properties. Let i run over items and p over properties Then each item is represented as a vector Pp(i) in “property-space” where values of properties are vector components. We are looking for items i i’ that are near each other in this space as measured by some distance between Pp(i) and Rp(i’) Properties could be “reading age” or “character highlighted” or “author” for books Properties can be genre or artist for songs and video Properties can characterize pixel structure for images used in face recognition, driving etc. Pandora uses this for songs (Music Genome) as does Amazon, Netflix

Do we need “real” spaces? Much of AI analysis involves “points” Events in particle physics analysis Users (people) or items (books, jobs, music, other people) These points can be thought of being in a “space” or “bag” Set of all books Set of all physics reactions Set of all Internet users However as in recommender systems where a given user only rates some items, we don’t know “full position” However we can nearly always define a useful distance d(a,b) between points Always d(a,b) >= 0 Usually d(a,b) = d(b,a) Rarely d(a,b) + d(b,c) >= d(a,c) Triangle Inequality

Using Distances The simplest way to use distances is “nearest neighbor algorithms” – given one point, find a set of points near it – cut off by number of identified nearby points and/or distance to initial point Here point is either user or item Another approach is divide space into regions (topics, latent factors) consisting of nearby points This is clustering Also other algorithms like Gaussian mixture models or Latent Semantic Analysis or Latent Dirichlet Allocation which use a more sophisticated model

Clustering in General

Purpose of Clustering This is used in several somewhat different ways First is taking a set of points in a space and divide into distinct groups with (clear) separation Second is dividing points into neighboring points but not groups that are separate There are also latent factor and mixture models which use a different definition of cluster that adds probabilities These are used to group “items” together in content based approaches Further one has methods/applications that work in Real vector spaces Spaces in which there are no vectors but some or all of distances defined

Clusters v. Regions In Lymphocytes clusters are distinct; DA useful Lymphocytes 4D Pathology 54D In Lymphocytes clusters are distinct; DA useful In Pathology, clusters divide space into regions and sophisticated methods like deterministic annealing are probably unnecessary

“Clustering in Pathology Images” Originally 54 features. 8 or 100 clusters

Heuristics http://en.wikipedia.org/wiki/Heuristics

Heuristics (Wikipedia) In computer science, artificial intelligence, and mathematical optimization, a heuristic is a technique designed for solving a problem more quickly when classic methods are too slow, or for finding an approximate solution when classic methods fail to find any exact solution. This is achieved by trading optimality, completeness, accuracy, and/or precision for speed. Results about NP-hardness in theoretical computer science make heuristics the only viable option for a variety of complex optimization problems that need to be routinely solved in real-world applications. Clustering is NP hard although collaborative filtering is just “hard” and computationally intractable for real time use Roughly NP hard means only (known) exact algorithms are exponential and not polynomial in time to solve

Heuristics II Collaborative Filtering (user based) is O(MN) for M customers and N items which is polynomial Clustering is harder to estimate as we don’t have an exact algorithm There is combinatorial estimate KM where we loop over K assignments for each of M items (K number of clusters) This is exponential Note many real world problems are intrinsically “inexact” as Don’t have exact data Don’t have exact problem So heuristics are often/usually just fine and IMHO exponential behavior (NP hard from previous slide) is not so difficult in practice