CSC 4510/9010: Applied Machine Learning

Slides:



Advertisements
Similar presentations
CLUSTERING.
Advertisements

Copyright Jiawei Han, modified by Charles Ling for CS411a
Cluster Analysis: Basic Concepts and Algorithms
Clustering Basic Concepts and Algorithms
PARTITIONAL CLUSTERING
©2012 Paula Matuszek CSC 9010: Text Mining Applications: Document Clustering l Dr. Paula Matuszek l
Clustering Paolo Ferragina Dipartimento di Informatica Università di Pisa This is a mix of slides taken from several presentations, plus my touch !
1 Machine Learning: Lecture 10 Unsupervised Learning (Based on Chapter 9 of Nilsson, N., Introduction to Machine Learning, 1996)
Unsupervised learning
Data Mining Techniques: Clustering
© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.
1 Text Clustering. 2 Clustering Partition unlabeled examples into disjoint subsets of clusters, such that: –Examples within a cluster are very similar.
Clustering. 2 Outline  Introduction  K-means clustering  Hierarchical clustering: COBWEB.
Cluster Analysis (1).
What is Cluster Analysis?
What is Cluster Analysis?
K-means Clustering. What is clustering? Why would we want to cluster? How would you determine clusters? How can you do this efficiently?
Clustering Ram Akella Lecture 6 February 23, & 280I University of California Berkeley Silicon Valley Center/SC.
Birch: An efficient data clustering method for very large databases
Clustering Unsupervised learning Generating “classes”
CSC 4510 – Machine Learning Dr. Mary-Angela Papalaskari Department of Computing Sciences Villanova University Course website:
Unsupervised Learning. CS583, Bing Liu, UIC 2 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate.
DATA MINING CLUSTERING K-Means.
Unsupervised Learning Reading: Chapter 8 from Introduction to Data Mining by Tan, Steinbach, and Kumar, pp , , (
1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods.
Text Clustering.
Clustering Supervised vs. Unsupervised Learning Examples of clustering in Web IR Characteristics of clustering Clustering algorithms Cluster Labeling 1.
Basic Machine Learning: Clustering CS 315 – Web Search and Data Mining 1.
1 Motivation Web query is usually two or three words long. –Prone to ambiguity –Example “keyboard” –Input device of computer –Musical instruments How can.
Clustering What is clustering? Also called “unsupervised learning”Also called “unsupervised learning”
Unsupervised Learning. Supervised learning vs. unsupervised learning.
CLUSTER ANALYSIS Introduction to Clustering Major Clustering Methods.
Clustering Clustering is a technique for finding similarity groups in data, called clusters. I.e., it groups data instances that are similar to (near)
CS 8751 ML & KDDData Clustering1 Clustering Unsupervised learning Generating “classes” Distance/similarity measures Agglomerative methods Divisive methods.
Slide 1 EE3J2 Data Mining Lecture 18 K-means and Agglomerative Algorithms.
Compiled By: Raj Gaurang Tiwari Assistant Professor SRMGPC, Lucknow Unsupervised Learning.
Basic Machine Learning: Clustering CS 315 – Web Search and Data Mining 1.
1 Machine Learning Lecture 9: Clustering Moshe Koppel Slides adapted from Raymond J. Mooney.
Cluster Analysis Dr. Bernard Chen Assistant Professor Department of Computer Science University of Central Arkansas.
Mr. Idrissa Y. H. Assistant Lecturer, Geography & Environment Department of Social Sciences School of Natural & Social Sciences State University of Zanzibar.
Cluster Analysis Dr. Bernard Chen Ph.D. Assistant Professor Department of Computer Science University of Central Arkansas Fall 2010.
Clustering Approaches Ka-Lok Ng Department of Bioinformatics Asia University.
Introduction to Data Mining Clustering & Classification Reference: Tan et al: Introduction to data mining. Some slides are adopted from Tan et al.
Cluster Analysis What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods.
Data Mining and Text Mining. The Standard Data Mining process.
Topic 4: Cluster Analysis Analysis of Customer Behavior and Service Modeling.
COMP24111 Machine Learning K-means Clustering Ke Chen.
Cluster Analysis This work is created by Dr. Anamika Bhargava, Ms. Pooja Kaul, Ms. Priti Bali and Ms. Rajnipriya Dhawan and licensed under a Creative Commons.
What Is Cluster Analysis?
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 10 —
Machine Learning Clustering: K-means Supervised Learning
Clustering Analysis Basics
Ke Chen Reading: [7.3, EA], [9.1, CMB]
Machine Learning Lecture 9: Clustering
Data Mining K-means Algorithm
Topic 3: Cluster Analysis
CSE 5243 Intro. to Data Mining
Revision (Part II) Ke Chen
Roberto Battiti, Mauro Brunato
Clustering.
Information Organization: Clustering
Ke Chen Reading: [7.3, EA], [9.1, CMB]
Data Mining 資料探勘 分群分析 (Cluster Analysis) Min-Yuh Day 戴敏育
Revision (Part II) Ke Chen
CSCI N317 Computation for Scientific Applications Unit Weka
Clustering Wei Wang.
Text Categorization Berlin Chen 2003 Reference:
Topic 5: Cluster Analysis
Dimensionality Reduction
Presentation transcript:

CSC 4510/9010: Applied Machine Learning Clustering Part 1 Dr. Paula Matuszek Paula.Matuszek@villanova.edu Paula.Matuszek@gmail.com (610) 647-9789

What is Clustering? Given some instances with data: group instances such that examples within a group are similar examples in different groups are different These groups are clusters Unsupervised learning — the instances do not include a class attribute.

Based on: www.cs.utexas.edu/~mooney/cs388/slides/TextClustering.ppt Clustering Example . . . Based on: www.cs.utexas.edu/~mooney/cs388/slides/TextClustering.ppt 3

A Different Example How would you group 'The price of crude oil has increased significantly’ 'Demand for crude oil outstrips supply' 'Some people do not like the flavor of olive oil' 'The food was very oily' 'Crude oil is in short supply' 'Oil platforms extract crude oil' 'Canola oil is supposed to be healthy' 'Iraq has significant oil reserves' 'There are different types of cooking oil

Another (Familiar) Example

Introduction Real Applications: Emerging Applications Server clustering Grouping customers based on their favours Social networks: discover close-knit cluster Use galaxy cluster to study universal mass 8

In the Real World A technique demanded by many real world tasks Bank/Internet Security: fraud/spam pattern discovery Biology: taxonomy of living things such as kingdom, phylum, class, order, family, genus and species City-planning: Identifying groups of houses according to their house type, value, and geographical location Climate change: understanding earth climate, find patterns of atmospheric and ocean Finance: stock clustering analysis to uncover correlation underlying shares Image Compression/segmentation: coherent pixels grouped Information retrieval/organisation: Google search, topic-based news Land use: Identification of areas of similar land use in an earth observation database Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs Social network mining: special interest group automatic discovery 9

Clustering Basics Collect examples Compute similarity among examples according to some metric Group examples together such that examples within a cluster are similar examples in different clusters are different Summarize each cluster Sometimes -- assign new instances to the most similar cluster

Measures of Similarity In order to do clustering we need some kind of measure of similarity. This is basically our “critic” Vector of values, depends on domain: documents: bag of words, linguistic features purchases: cost, purchaser data, item data census data: most of what is collected Multiple different measures available

Measures of Similarity Semantic similarity -- but this is hard. Similar attribute counts Number of attributes with the same value. Appropriate for large, sparse vectors, such as BOW. More complex vector comparisons: Euclidian Distance Cosine Similarity

Euclidean Distance Euclidean distance: distance between two measures summed across each feature Squared differences to give more weight to larger difference dist(xi, xj) = sqrt((xi1-xj1)2+(xi2-xj2)2+..+(xin-xjn)2)

Euclidian dist(x1, x2) = sqrt((0-1)2+(3-1)2+..+(2-4)2)=sqrt(9)=3 Calculate differences Ears: pointy? Muzzle: how many inches long? Tail: how many inches long? dist(x1, x2) = sqrt((0-1)2+(3-1)2+..+(2-4)2)=sqrt(9)=3 dist(x1, x3) = sqrt((0-0)2+(3-3)2+..+(2-3)2)=sqrt(1)=1

Cosine similarity measurement Cosine similarity is a measure of similarity between two vectors by measuring the cosine of the angle between them. The result of the Cosine function is equal to 1 when the angle is 0, and it is less than 1 when the angle is of any other value. As the angle between the vectors shortens, the cosine angle approaches 1, meaning that the two vectors are getting closer, meaning that the similarity of whatever is represented by the vectors increases. Based on home.iitk.ac.in/~mfelixor/Files/non-numeric-Clustering-seminar.pp

Cosine Similarity B(1,4) 4 A(3,2) C(3,3) 3 Muzzle 2 1 1 2 3 4 Tail

Clustering Algorithms Flat K means Hierarchical Bottom up Top down (not common) Probabilistic Expectation Maximumization (E-M)

Partitioning (Flat) Algorithms Partitioning method: Construct a partition of n documents into a set of K clusters Given: a set of documents and the number K Find: a partition of K clusters that optimizes the chosen partitioning criterion Globally optimal: exhaustively enumerate all partitions. Usually too expensive. Effective heuristic methods: K-means algorithm. http://www.csee.umbc.edu/~nicholas/676/MRSslides/lecture17-clustering.ppt http://www.csee.umbc.edu/~nicholas/676/MRSslides/lecture17-clustering.ppt

K-Means Clustering Simplest hierarchical method, widely used Create clusters based on a centroid; each instance is assigned to the closest centroid K is given as a parameter. Heuristic and iterative

K-Means Clustering Provide number of desired clusters, k. Randomly choose k instances as seeds. Form initial clusters based on these seeds. Calculate the centroid of each cluster. Iterate, repeatedly reallocating instances to closest centroids and calculating the new centroids Stop when clustering converges or after a fixed number of iterations. Based on: www.cs.utexas.edu/~mooney/cs388/slides/TextClustering.ppt 18

K Means Example (K=2) Pick seeds Reassign clusters Compute centroids Reasssign clusters x Compute centroids Reassign clusters Converged! www.cs.utexas.edu/~mooney/cs388/slides/TextClustering.ppt 22

K-Means Tradeoff between having more clusters (better focus within each cluster) and having too many clusters. Overfitting again. Results can vary based on random seed selection. Some seeds can result in poor convergence rate, or convergence to sub-optimal clusterings. The algorithm is sensitive to outliers Data points that are far from other data points. Could be errors in the data recording or some special data points with very different values. http://www.csee.umbc.edu/~nicholas/676/MRSslides/lecture17-clustering.ppt

Problem! Poor clusters based on initial seeds https://datasciencelab.wordpress.com/2014/01/15/improved-seeding-for-clustering-with-k-means/

Strengths of K-Means Strengths: Simple: easy to understand and to implement Efficient: Time complexity: O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations. Since both k and t are small. k-means is considered a linear algorithm. K-means is most popular clustering algorithm. In practice, performs well, especially on text. www.cs.uic.edu/~liub/teach/cs583-fall-05/CS583-unsupervised-learning.ppt

K-Means Weaknesses Must choose K; poor choice can lead to poor clusters Clusters may differ in size or density All attributes are weighted Heuristic, based on initial random seeds; clusters may differ from run to run