
Clustering Specific Issues related to Project 2

Reducing dimensionality
– Lowering the number of dimensions makes the problem more manageable:
  - Less memory
  - Less time
  - Less noise
– Doesn't have to be particularly sophisticated
– Get rid of noise:
  - Superfluous terms
  - Stop-list
– Identify important terms:
  - White list
  - Term weighting?
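The stop-list and term-weighting ideas above can be sketched in Python. This is a minimal illustration, not the project's actual preprocessing: the stop-list and documents are toy examples, and the weighting shown is a plain tf-idf.

```python
import math
from collections import Counter

# Toy stop-list for illustration; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def filter_terms(tokens):
    """Drop stop-list terms to reduce the number of dimensions."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter()                  # document frequency per term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)           # term frequency within this document
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [filter_terms("the syntax of verbs".split()),
        filter_terms("stock prices and the market".split())]
print(tf_idf(docs))
```

Terms that appear in every document get an idf of zero, which is one cheap way of down-weighting superfluous terms without a hand-built stop-list.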

Project 2
Clustering
– Goal: cluster documents
  - Include linguistic documents
  - Exclude non-linguistic documents
– Given:
  - We can represent documents as vectors in multi-dimensional space
    (vectors composed of words, etc., drawn from the documents)
  - We have a mechanism for measuring the distance between vectors
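A minimal sketch of the distance mechanism, assuming cosine similarity over term-count vectors (the vectors here are toy examples over a shared vocabulary):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length document vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy term-count vectors over the same vocabulary.
d1 = [2, 1, 0, 3]
d2 = [1, 1, 0, 2]
print(cosine(d1, d2))   # close to 1.0: similar term distributions
```

Cosine similarity ranges from 0 (no shared terms) to 1 (identical direction), so a distance can be taken as 1 minus the similarity.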

Project 2
Clustering
– Multiple methods exist
– Two main families:
  - Hierarchical
  - Non-hierarchical (partitional)
– Hierarchical methods can be:
  - Agglomerative (bottom-up)
  - Divisive (top-down)

Clustering Methods
Agglomerative (bottom-up)
1. Assume all vectors are in separate clusters
2. Calculate distances between all pairs of vectors, and put them in an ordered list
3. Iteratively and progressively cluster based on these distances
4. The closest pairs get clustered first, etc.
5. Repeat from step 2 until all vectors are clustered (or some threshold is reached)
Divisive (top-down)
– Assume all vectors are in one cluster
– Calculate distances
– Separate based on least coherence, splitting off the most distant vectors
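The agglomerative steps above can be sketched as follows. This is a naive single-link version over toy 2-D points with Euclidean distance; a real implementation would cache the distance matrix rather than recompute it on every merge.

```python
import math

def agglomerative(points, k):
    """Single-link agglomerative clustering down to k clusters."""
    clusters = [[p] for p in points]          # 1. every vector is its own cluster
    while len(clusters) > k:
        # 2./3. find the closest pair of clusters (single link: nearest members)
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))   # 4./5. merge the pair and repeat
    return clusters

print(agglomerative([(0, 0), (0, 1), (5, 5), (5, 6)], 2))
```

Stopping at k clusters stands in for the "threshold" in step 5; a distance cutoff would work the same way.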

Clustering Methods
Suppose we have vectors {a}, {b}, {c}, {d}.
Using the agglomerative method, after one merge we get: {a}, {b}, {c, d}.
In the next iteration, how do we calculate the distance between {a} and {c, d}?
We can measure the distance between two vectors (e.g., cosine), but how do we measure the distance between clusters?

Cluster Distance
Methods for measuring distance between clusters:
– Single link
– Complete link
– Average link
  - Average distance between all vectors in the two clusters
  - Can be computationally expensive: in the worst case, it requires calculating the distance between each vector in one cluster and each vector in the other (O(n²))
– Centroid distance
  - Measure the similarity between the centroids of the two clusters

Cluster Distance
Methods for measuring distance between clusters:
– Single link: similarity between two clusters is the similarity of their two closest members
  - Tends to produce "long and straggly" clusters
– Complete link: similarity is measured by the similarity of their two most dissimilar members
  - Tends to produce "tighter" clusters

Cluster Distance
Methods for measuring cluster distance:
– Average-link clustering
  - Measure the average distance across all pairs of vectors in the two clusters
– Problem: averaging sounds good on the surface, but it can be computationally expensive (O(n²) to O(n³))
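The single-, complete-, and average-link criteria can be sketched side by side. Toy 2-D clusters and Euclidean distance are used here for illustration; for documents the same criteria would be applied to cosine similarities instead.

```python
import math

def single_link(c1, c2):
    """Distance between the two closest members, one from each cluster."""
    return min(math.dist(a, b) for a in c1 for b in c2)

def complete_link(c1, c2):
    """Distance between the two most distant members."""
    return max(math.dist(a, b) for a in c1 for b in c2)

def average_link(c1, c2):
    """Mean over all cross-cluster pairs -- this is the O(n^2) part."""
    return sum(math.dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

c1 = [(0, 0), (0, 2)]
c2 = [(3, 0), (5, 0)]
print(single_link(c1, c2), complete_link(c1, c2), average_link(c1, c2))
```

By construction, the average-link distance always falls between the single-link and complete-link distances for the same pair of clusters.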

Centroid
Centroid of a cluster c containing M points:
  μ(c) = (1/M) Σ_{x ∈ c} x
Effectively, each component of μ(c) is the average of the values for that component over the M points in c.
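A minimal sketch of the centroid computation, for vectors stored as tuples of numbers:

```python
def centroid(cluster):
    """Component-wise mean of the M vectors in the cluster."""
    m = len(cluster)
    dims = len(cluster[0])
    return tuple(sum(v[i] for v in cluster) / m for i in range(dims))

print(centroid([(0, 0), (2, 0), (1, 3)]))   # (1.0, 1.0)
```

Centroid distance then reduces cluster-to-cluster comparison to a single vector-to-vector measurement, avoiding the O(n²) pairwise cost of average link.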

Remaining Problems
Polysemy/homography
– Alternate meanings of a term can have a negative effect on clustering
– May cause clustering when we don't want it
Synonymy
– Terms that essentially mean the same thing (especially in a given context and across documents) won't help clustering

Cluster Reading
– Parts of Ch. 17 in J&M
– Ch. 14 of M&S, esp. the first couple of sections
– Jain & Murty 1999, "Data Clustering: A Review"
– Sparck Jones & Willett (eds.) 1997, Readings in Information Retrieval (in library)