Similarity/Clustering. 인공지능연구실 (Artificial Intelligence Laboratory), 문홍구. 2006. 1. 17.

Presentation transcript:

1 Similarity/Clustering
Artificial Intelligence Laboratory (인공지능연구실), 문홍구

2 Content
• What is Clustering
• Clustering Method
• Distance-based
  - Hierarchical
  - Flat
• Geometric embedding approach
  - Self-organizing maps
  - Multidimensional scaling
  - Latent semantic indexing

3 Formulations and Approaches
• Partitioning approaches
  - One possible goal that we can set for a clustering algorithm is to partition the document collection into k subsets, or clusters, D_1, ..., D_k so as to minimize the intracluster distance or maximize the intracluster resemblance.
• Bottom-up clustering
• Top-down clustering
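One way to write the partitioning goal above as a formula is sketched below. The symbol \delta (a distance between two documents) is an assumption introduced here for illustration; the slide does not fix a particular distance or similarity measure.

```latex
\min_{D_1,\dots,D_k} \; \sum_{i=1}^{k} \; \sum_{d_1, d_2 \in D_i} \delta(d_1, d_2)
```

Replacing \delta with a similarity measure and the minimum with a maximum gives the "maximize intracluster resemblance" form of the same goal.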

4 Formulations and Approaches

5 Distance based
• Hierarchical clustering
  - Produces a tree (hierarchy) of clusters
• Bottom-up (agglomerative clustering)
  - Start with the individual objects and group the most similar ones
  - At each step, join the pair of clusters with maximum similarity
• Top-down (divisive clustering)
  - Start with all the objects in one cluster and divide them into groups so as to maximize within-group similarity
  - At each step, split the least coherent cluster

6 Three methods in hierarchical clustering
• Single link
  - Similarity of the two most similar members
• Complete link
  - Similarity of the two least similar members
• Group average
  - Average similarity between members
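The three linkage rules can be compared side by side in a small bottom-up clustering routine. The following is a minimal sketch, not the presenter's implementation: it works with Euclidean distance instead of similarity (so "most similar" becomes "smallest distance"), it takes "group average" to mean the average distance between members of the two clusters, and the names `euclidean`, `cluster_distance`, and `agglomerative` are hypothetical.

```python
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def cluster_distance(c1, c2, points, linkage):
    """Distance between two clusters (lists of point indices)."""
    dists = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":      # two closest (most similar) members
        return min(dists)
    if linkage == "complete":    # two farthest (least similar) members
        return max(dists)
    if linkage == "average":     # average over all cross-cluster pairs (assumed definition)
        return sum(dists) / len(dists)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerative(points, k, linkage="single"):
    """Naive bottom-up clustering: merge the closest pair until k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: cluster_distance(
                clusters[ab[0]], clusters[ab[1]], points, linkage
            ),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return [sorted(c) for c in clusters]


points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0), (11, 0)]
print(agglomerative(points, 2, "single"))
print(agglomerative(points, 2, "complete"))
```

This naive version recomputes every cluster-pair distance at every merge, so it is slower than the complexities quoted on the slides; it is meant only to make the three definitions concrete.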

7 Single-link clustering
• Cluster similarity is the similarity of the two most similar members => O(n^2)
• Locally coherent
  - Close objects end up in the same cluster
• Chaining effect
  - Merging follows a chain of large similarities without taking the global context into account => low global cluster quality

8 Complete-link clustering
• Cluster similarity is the similarity of the two least similar members => O(n^3)
• The similarity function focuses on global cluster quality
  - Avoids elongated clusters
  - a/f or b/e is tighter than a/d (tight clusters are better than "straggly" clusters)

9 Group-average agglomerative clustering
• Uses the average similarity between members
• The complexity of computing the average similarity is O(n^2)
• Average similarities are recomputed each time a new group is formed
• A compromise between single link and complete link

10 Comparison
• Single link
  - Relatively efficient
  - Long, straggly clusters (ellipsoidal clusters)
  - Loosely bound clusters
• Complete link
  - Tightly bound clusters
• Group average
  - Intermediate between single link and complete link

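11 Distance based
• Flat clustering
  - k-means
  - Unlike hierarchical cluster analysis, k-means is a mutually exclusive clustering method: each object belongs to exactly one cluster. The number of clusters is fixed in advance, and the method then determines which cluster each object belongs to. It is widely used for cluster analysis of large data sets.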
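The k-means procedure described above (fix k in advance, then assign every object to exactly one cluster) can be sketched as follows. This is a minimal illustration under assumptions not stated on the slide: squared Euclidean distance, initial centroids sampled from the data, and a hypothetical function name `kmeans`.

```python
import random


def kmeans(points, k, iters=100, seed=0):
    """Plain k-means on a list of tuples; returns (assignment, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed initialization: k data points
    assignment = None
    for _ in range(iters):
        # Assignment step: each point goes to exactly one (the nearest) centroid.
        new_assignment = [
            min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the centroids are stable
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, 2)
print(labels, centroids)
```

The cluster labels themselves are arbitrary (which group is called 0 depends on the initialization); only the grouping is meaningful.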

12 Distance based
• k-means

13 Geometric Embedding Approaches
• Self-organizing maps
• Multidimensional scaling
• Latent semantic indexing
★ A different form of partition-based clustering is to identify dense regions in space.

14 Geometric Embedding Approaches
• Self-organizing maps (SOMs)
  - Self-organizing maps are a close cousin of k-means. Unlike k-means, which is concerned only with determining the association between clusters and documents, the SOM algorithm also embeds the clusters in a low-dimensional space right from the beginning and proceeds in a way that places related clusters close together in that space.
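The idea that a SOM both assigns inputs to clusters and keeps related clusters near each other in a low-dimensional layout can be sketched with a one-dimensional chain of nodes. This is a simplified illustration, not the algorithm used for the map on the next slide: the Gaussian neighborhood, the linear decay schedules, and all names and default parameters (`train_som`, `lr0`, `sigma0`) are assumptions.

```python
import math
import random


def train_som(data, n_nodes=5, epochs=50, lr0=0.5, sigma0=2.0, seed=0):
    """Train a 1-D self-organizing map; returns one weight vector per node."""
    rng = random.Random(seed)
    # Assumed initialization: each node starts at a randomly chosen data point.
    weights = [list(rng.choice(data)) for _ in range(n_nodes)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs
        lr = lr0 * frac                   # learning rate decays linearly
        sigma = max(sigma0 * frac, 0.5)   # neighborhood width shrinks, floored at 0.5
        for x in rng.sample(data, len(data)):  # visit the data in random order
            # Best-matching unit: the node whose weight vector is closest to x.
            bmu = min(
                range(n_nodes),
                key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)),
            )
            # Pull the BMU and its neighbors on the chain towards x.
            for i in range(n_nodes):
                h = math.exp(-((i - bmu) ** 2) / (2 * sigma ** 2))
                weights[i] = [w + lr * h * (v - w) for w, v in zip(weights[i], x)]
    return weights


data = [(0, 0), (1, 1), (2, 0), (8, 9), (9, 10), (10, 8)]
for node, w in enumerate(train_som(data)):
    print(node, [round(v, 2) for v in w])
```

The neighborhood update is what distinguishes this from k-means: moving a node also drags its neighbors on the chain, so nodes that are adjacent in the layout tend to represent nearby regions of the input space.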

15 SOM: Example
SOM computed from over a million documents taken from 80 Usenet newsgroups. Light areas have a high density of documents.

16 Geometric Embedding Approaches
• Multidimensional scaling (MDS)
  - The goal of MDS is to represent documents as points in a low-dimensional space (often 2D or 3D) such that the Euclidean distance between any pair of points is as close as possible to the distance between them specified by the input.
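How well a candidate embedding meets the MDS goal is usually measured by a "stress" value that compares embedded distances with the input distances. The sketch below uses one common normalized form (sum of squared differences divided by the sum of squared input distances); this particular formula and the function name `stress` are assumptions, since the slide does not give a definition.

```python
def stress(coords, target):
    """Normalized stress of an embedding.

    coords: list of low-dimensional points (tuples).
    target: symmetric matrix of input distances, target[i][j].
    """
    num = 0.0
    den = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            # Euclidean distance between the embedded points i and j.
            d_hat = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j])) ** 0.5
            num += (d_hat - target[i][j]) ** 2
            den += target[i][j] ** 2
    return num / den


target = [[0, 5, 10], [5, 0, 5], [10, 5, 0]]
print(stress([(0, 0), (3, 4), (6, 8)], target))   # embedding that reproduces the distances
print(stress([(0, 0), (1, 0), (2, 0)], target))   # distorted embedding, larger stress
```

An MDS algorithm searches for the coordinates that make this value as small as possible; the search itself (for example, by gradient descent) is not shown here.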

17 Geometric Embedding Approaches
• Latent semantic indexing (LSI)
  - The latent semantic indexing (LSI) method is an attempt to solve the synonymy problem while staying within the vector space model framework.

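18 Latent semantic indexing (LSI)
[Figure: SVD of the term-document matrix A (terms × documents) into factors U, D, and V; terms such as "car" and "auto" and the documents are mapped to k-dimensional vectors.]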
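The decomposition behind LSI can be written out as follows. This is the standard truncated-SVD formulation, stated here as an assumption about what the figure shows; t, d, and r denote the number of terms, the number of documents, and the rank of A, and \hat{q} is a hypothetical name for the reduced representation of a term-space vector q.

```latex
A_{t \times d} = U_{t \times r} \, D_{r \times r} \, V_{d \times r}^{\top},
\qquad
A \approx A_k = U_k \, D_k \, V_k^{\top} \quad (k \ll r),
\qquad
\hat{q} = D_k^{-1} U_k^{\top} q
```

Keeping only the k largest singular values gives every term and document a k-dimensional vector; terms that occur in similar contexts, such as "car" and "auto", tend to receive similar vectors, which is how LSI addresses synonymy.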

19 EM algorithm
• A soft version of k-means clustering
  - Both clusters move towards the centroid of all three objects
  - The clusters then reach a stable final state

20 EM algorithm (2)
• We want to calculate the probability P(c_j | x_i) that vector x_i belongs to cluster c_j
• Assume that each cluster c_j has a normal distribution
• Maximum likelihood of the form

21 Procedure of EM
• Expectation step (E)
  - Compute h_ij, the expectation of z_ij
• Maximization step (M)
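The two steps can be made concrete for a one-dimensional mixture of normal distributions. The following is a minimal sketch under assumptions the slides do not spell out: z_ij is taken to be the indicator that x_i was generated by cluster j, so h_ij = P(c_j | x_i); the M step re-estimates each cluster's mean, variance, and prior from these soft assignments; and the names `gaussian` and `em_gmm_1d` are hypothetical.

```python
import math


def gaussian(x, mu, var):
    """Density of a normal distribution with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(xs, mus, variances, priors, iters=50):
    """EM for a 1-D Gaussian mixture; returns (mus, variances, priors, h)."""
    k = len(mus)
    mus, variances, priors = list(mus), list(variances), list(priors)
    h = []
    for _ in range(iters):
        # E step: h[i][j] = P(c_j | x_i), the expected value of z_ij.
        h = []
        for x in xs:
            joint = [priors[j] * gaussian(x, mus[j], variances[j]) for j in range(k)]
            total = sum(joint)
            h.append([p / total for p in joint])
        # M step: re-estimate the parameters from the soft assignments.
        for j in range(k):
            nj = sum(row[j] for row in h)
            mus[j] = sum(row[j] * x for row, x in zip(h, xs)) / nj
            var = sum(row[j] * (x - mus[j]) ** 2 for row, x in zip(h, xs)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsing variance
            priors[j] = nj / len(xs)
    return mus, variances, priors, h


xs = [0.0, 0.5, 1.0, 9.0, 9.5, 10.0]
mus, variances, priors, h = em_gmm_1d(xs, [1.0, 8.0], [1.0, 1.0], [0.5, 0.5])
print(mus, priors)
```

With hard 0/1 values in place of h_ij and fixed equal variances, the same loop reduces to k-means, which is the sense in which EM is a soft version of it.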