Advanced Multimedia Text Clustering Tamara Berg

Reminder - Classification: Given some labeled training documents, determine the best label for a test (query) document.

What if we don’t have labeled data? We can’t do classification. What can we do?
– Clustering: the assignment of objects into groups (called clusters) so that objects from the same cluster are more similar to each other than objects from different clusters.
– Often similarity is assessed according to a distance measure.
– Clustering is a common technique for statistical data analysis, used in many fields including machine learning, data mining, pattern recognition, image analysis, and bioinformatics.

The distance measure can be any of the similarity metrics we talked about before (e.g., sum of squared distances, angle between vectors).

Document Clustering: Clustering is the process of grouping a set of documents into clusters of similar documents. Documents within a cluster should be similar; documents from different clusters should be dissimilar.

[Figure-only slide. Source: Hinrich Schütze]

Examples: Google News, Flickr clusters.

[Figure-only slide. Source: Hinrich Schütze]

How to Cluster Documents

Reminder - Vector Space Model
– Documents are represented as vectors in term space
– A vector distance/similarity measure between two documents is used to compare documents
Slide from Mitch Marcus

Document Vectors: One location for each word.
Terms (columns): nova, galaxy, heat, h’wood, film, role, diet, fur
Documents (rows): A, B, C, D, E, F, G, H, I
Example: “Nova” occurs 10 times in text A, “Galaxy” occurs 5 times in text A, “Heat” occurs 3 times in text A. (Blank means 0 occurrences.)
Slide from Mitch Marcus

Document Vectors: the same table of term counts (nova, galaxy, heat, h’wood, film, role, diet, fur), where the rows A–I are the document ids.
Slide from Mitch Marcus

TF x IDF Calculation: each document (e.g., A) is represented as a vector of TF x IDF weights over the vocabulary terms W1, W2, W3, …, Wn.
Slide from Mitch Marcus
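
To make the TF x IDF weighting above concrete, here is a minimal sketch in Python. It assumes raw term counts for TF and the log(N / df) form of IDF; the toy documents and the function name tfidf_vectors are illustrative, not from the slides.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute TF x IDF weight vectors for a list of tokenized documents.

    TF  = raw count of a term in the document
    IDF = log(N / df), where N is the number of documents and
          df is the number of documents containing the term.
    """
    N = len(docs)
    vocab = sorted({term for doc in docs for term in doc})
    df = {term: sum(1 for doc in docs if term in doc) for term in vocab}

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[t] * math.log(N / df[t]) for t in vocab])
    return vocab, vectors

# Hypothetical toy corpus (not the documents or counts from the slides).
docs = [["nova", "nova", "galaxy", "heat"],
        ["film", "role", "diet"],
        ["galaxy", "heat", "fur"]]
vocab, vecs = tfidf_vectors(docs)
print(vocab)
print(vecs[0])  # TF x IDF vector for document A
```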

Features: a document (e.g., A) can also be represented by a vector of features F1, F2, F3, …, Fn. Define whatever features you like:
– Length of the longest string of CAP’s
– Number of $’s
– Useful words for the task
– …
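
The sketch below shows how such hand-designed features might be extracted. The first two features follow the slide’s examples (longest run of capital letters, number of $’s); the “useful word” check and the specific word chosen are hypothetical.

```python
import re

def extract_features(text):
    """Build a small hand-designed feature vector for one document."""
    # F1: length of the longest run of consecutive capital letters.
    caps_runs = re.findall(r"[A-Z]+", text)
    longest_caps = max((len(run) for run in caps_runs), default=0)

    # F2: number of '$' characters in the document.
    dollar_count = text.count("$")

    # F3: whether a task-specific "useful word" appears (hypothetical example).
    has_useful_word = int("galaxy" in text.lower())

    return [longest_caps, dollar_count, has_useful_word]

print(extract_features("NASA found a new galaxy. Budget: $2,000,000."))
# -> [4, 1, 1]
```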

Similarity between documents: given document vectors A = [ … ], G = [ … ], E = [ … ], compute the Sum of Squared Distances, SSD(x, y) = Σ_i (x_i − y_i)².
SSD(A,G) = ? SSD(A,E) = ? SSD(G,E) = ?
Which pair of documents is the most similar?
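
A minimal sketch of the SSD computation; the vectors used for A, G, and E below are hypothetical stand-ins, since the actual values from the slide were not preserved.

```python
def ssd(x, y):
    """Sum of squared distances between two equal-length vectors."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

# Hypothetical document vectors (term counts); not the values from the slide.
A = [10, 5, 3, 0, 0]
G = [9, 4, 2, 0, 1]
E = [0, 0, 1, 8, 7]

print(ssd(A, G))  # 4   -> smallest distance, so A and G are the most similar pair
print(ssd(A, E))  # 242
print(ssd(G, E))  # 198
```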

[Figure-only slide. Source: Hinrich Schütze]

[Figure-only slide. Source: Dan Klein]

K-means clustering: we want to minimize the sum of squared Euclidean distances between points x_i and their nearest cluster centers m_k:
D = Σ_k Σ_{i ∈ cluster k} ||x_i − m_k||²
source: Svetlana Lazebnik
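
A compact sketch of the standard k-means (Lloyd's) iteration, alternating the reassignment and recomputation steps discussed on the convergence slide below. It is written from this description rather than taken from the original slides; the numpy-based implementation, the random initialization, and the toy data are all illustrative.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Simple k-means: X is an (n_points, n_dims) array, k is the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialize centers by picking k distinct data points at random (one common choice).
    centers = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(n_iters):
        # Reassignment step: assign each point to its nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)

        # Recomputation step: move each center to the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # fixed point: nothing moved, so nothing will change
        centers = new_centers

    return labels, centers

# Toy 2-D example with two obvious clusters.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centers = kmeans(X, k=2)
print(labels)   # e.g., [0 0 0 1 1 1] (cluster ids may be permuted)
print(centers)
```

The loop stops early once the centers stop moving, which is exactly the fixed point argued for on the convergence slide.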

[Figure-only slide. Source: Dan Klein]

Convergence of K-means: K-means converges to a fixed point in a finite number of iterations.
Proof:
– The sum of squared distances (RSS) decreases during reassignment (because each vector is moved to a closer centroid).
– RSS decreases during recomputation.
– There is only a finite number of possible clusterings, so we must reach a fixed point.
But we don’t know how long convergence will take! If we don’t care about a few docs switching back and forth, then convergence is usually fast.
Source: Hinrich Schütze

[Figure-only slide. Source: Dan Klein]

[Figure-only slide. Source: Hinrich Schütze]

Hierarchical clustering strategies
Agglomerative clustering:
– Start with each point in a separate cluster
– At each iteration, merge two of the “closest” clusters
Divisive clustering:
– Start with all points grouped into a single cluster
– At each iteration, split the “largest” cluster
source: Svetlana Lazebnik
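
As a concrete illustration of the agglomerative strategy, here is a brief sketch using SciPy's hierarchical clustering routines. The choice of single linkage, the cut into three flat clusters, and the toy points are illustrative assumptions, not specified on the slides.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D points (hypothetical data).
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9],
              [9.0, 0.0], [9.2, 0.1]])

# Agglomerative clustering: start with every point in its own cluster and
# repeatedly merge the two closest clusters (here, single-link distance).
Z = linkage(X, method="single", metric="euclidean")

# Cut the resulting tree into 3 flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)  # e.g., [1 1 1 2 2 3 3] (cluster ids may differ)
```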

[Figure-only slide. Source: Dan Klein]

Divisive Clustering
– Top-down (instead of bottom-up as in agglomerative clustering)
– Start with all docs in one big cluster
– Then recursively split clusters
– Eventually each node forms a cluster on its own
Source: Hinrich Schütze
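
One common way to realize this top-down strategy is bisecting k-means, sketched below with scikit-learn. The recursion depth, the 2-means splitter, and the toy data are illustrative choices, not something prescribed by the slides.

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive_split(X, depth=2):
    """Recursively split the data top-down with 2-means (a bisecting k-means sketch)."""
    if depth == 0 or len(X) < 2:
        return [X]  # this node becomes a leaf cluster
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    left, right = X[labels == 0], X[labels == 1]
    return divisive_split(left, depth - 1) + divisive_split(right, depth - 1)

# Toy data: start with everything in one big cluster, then split recursively.
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9],
              [9.0, 0.0], [9.2, 0.1], [0.1, 9.0], [0.0, 9.2]])
for i, cluster in enumerate(divisive_split(X, depth=2)):
    print(f"cluster {i}: {cluster.tolist()}")
```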

Flat or hierarchical clustering?
– For high efficiency, use flat clustering (e.g., k-means)
– For deterministic results: hierarchical clustering
– When a hierarchical structure is desired: a hierarchical algorithm
– Hierarchical clustering can also be applied if K cannot be predetermined (can start without knowing K)
Source: Hinrich Schütze

For Thursday: read Chapter 6 of the textbook.