Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev

SET/IR – W/S 2009 … 13. Clustering …

Clustering
– Exclusive/overlapping clusters
– Hierarchical/flat clusters
– The cluster hypothesis
  – Documents in the same cluster are relevant to the same query
  – How do we use it in practice?

Representations for document clustering
– Typically vector-based
  – Words: “cat”, “dog”, etc.
  – Features: document length, author name, etc.
– Each document is represented as a vector in an n-dimensional space
– Similar documents appear nearby in the vector space (distance measures are needed)
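A minimal sketch (not from the slides) of this representation: documents as term-frequency vectors, compared with cosine similarity. The example documents and the whitespace tokenizer are made up for illustration.

    from collections import Counter
    import math

    def tf_vector(text):
        # term-frequency vector as a sparse dict: term -> count
        return Counter(text.lower().split())

    def cosine(u, v):
        # cosine similarity; similar documents score closer to 1.0
        dot = sum(u[t] * v[t] for t in u if t in v)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    d1 = tf_vector("the cat sat on the mat")
    d2 = tf_vector("the dog sat on the log")
    print(cosine(d1, d2))  # 0.75: the two documents are fairly similar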

Scatter-gather
– Introduced by Cutting, Karger, and Pedersen
– Iterative process
  – Show terms for each cluster
  – User picks some of them
  – System produces new clusters
– Example: …es/sg-example1.html

k-means
– Iteratively determine which cluster each point belongs to, then adjust the cluster centroid, then repeat
– Needed: small number k of desired clusters
– Hard (0/1) assignment decisions
– Example: Weka

k-means
1  initialize cluster centroids to arbitrary vectors
2  while further improvement is possible do
3    for each document d do
4      find the cluster c whose centroid is closest to d
5      assign d to cluster c
6    end for
7    for each cluster c do
8      recompute the centroid of cluster c based on its documents
9    end for
10 end while
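A runnable NumPy rendering of the loop above (a sketch: random initialization and this particular stopping test are one common choice among several).

    import numpy as np

    def kmeans(docs, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # line 1: initialize centroids, here to k randomly chosen documents
        centroids = docs[rng.choice(len(docs), size=k, replace=False)].astype(float)
        for _ in range(n_iter):
            # lines 3-6: assign each document to the cluster with the closest centroid
            dists = np.linalg.norm(docs[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # lines 7-9: recompute each centroid from the documents assigned to it
            new = np.array([docs[labels == c].mean(axis=0) if (labels == c).any()
                            else centroids[c] for c in range(k)])
            if np.allclose(new, centroids):  # line 2: no further improvement
                break
            centroids = new
        return labels, centroids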

K-means (cont’d)
In practice (to avoid suboptimal clusters), run hierarchical agglomerative clustering on a sample of size sqrt(N) and then use the resulting clusters as seeds for k-means.
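A hedged sketch of that seeding step, essentially the Buckshot procedure from the Scatter/Gather work (the SciPy calls are real; the wiring to the kmeans sketch above is illustrative).

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    def buckshot_seeds(docs, k, seed=0):
        rng = np.random.default_rng(seed)
        n = max(int(np.sqrt(len(docs))), k)
        sample = docs[rng.choice(len(docs), size=n, replace=False)]
        tree = linkage(sample, method="average")            # agglomerative clustering
        labels = fcluster(tree, t=k, criterion="maxclust")  # cut the tree into k clusters
        # means of the HAC clusters become the k-means seeds
        return np.array([sample[labels == c].mean(axis=0) for c in range(1, k + 1)])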

Example
Cluster the following vectors into two groups:
– A =
– B =
– C =
– D =
– E =
– F =

Weka
– A general environment for machine learning (e.g., for classification and clustering)
– Book by Witten and Frank

cd /data2/tools/weka
export CLASSPATH=$CLASSPATH:./weka.jar
java weka.clusterers.SimpleKMeans -t ~/e.arff
java weka.clusterers.SimpleKMeans -p 1-2 -t ~/e.arff
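The contents of ~/e.arff are not shown on the slide; a hypothetical minimal ARFF file that SimpleKMeans could cluster would look like this:

    % hypothetical e.arff: two numeric attributes, two obvious groups
    @relation e
    @attribute x numeric
    @attribute y numeric
    @data
    1.0,1.2
    0.8,1.1
    5.0,5.3
    5.2,4.9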

Demos
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html

Probability and likelihood
Example: What is the likelihood in this case?

Bayesian formulation
Posterior ∝ likelihood × prior
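In symbols, with θ the model parameters and D the data (standard Bayes’ rule, not spelled out on the slide):

P(θ | D) = P(D | θ) P(θ) / P(D) ∝ P(D | θ) P(θ)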

E-M algorithms [Dempster et al. 77]
A class of iterative algorithms for maximum-likelihood estimation in problems with incomplete data: given a model of data generation and data with some missing values, EM alternately uses the current model to estimate the missing values, and then uses the missing-value estimates to improve the model. Using all the available data, EM will locally maximize the likelihood of the generative parameters, giving estimates for the missing values. [McCallum & Nigam 98]

E-M algorithm
– Initialize probability model
– Repeat
  – E-step: use the best available current classifier to classify some datapoints
  – M-step: modify the classifier based on the classes produced by the E-step
– Until convergence
Soft clustering method
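The slide describes EM generically; a minimal concrete instantiation (my sketch, a two-component 1-D Gaussian mixture) shows the alternation, with the responsibilities playing the role of soft cluster assignments.

    import numpy as np

    def em_gmm(x, n_iter=50):
        # initialize the probability model: means, variances, mixing weights
        mu = np.array([x.min(), x.max()], dtype=float)
        var = np.array([x.var(), x.var()])
        pi = np.array([0.5, 0.5])
        for _ in range(n_iter):
            # E-step: soft-assign each point to each component (responsibilities)
            dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
            resp = dens / dens.sum(axis=1, keepdims=True)
            # M-step: re-estimate the parameters from the soft assignments
            nk = resp.sum(axis=0)
            mu = (resp * x[:, None]).sum(axis=0) / nk
            var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
            pi = nk / len(x)
        return mu, var, pi, resp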

EM example
Figures from Chris Bishop (not reproduced).

Demos

“Online” centroid method

Centroid method

Online centroid-based clustering
– If sim ≥ T: add the document to the closest existing cluster
– If sim < T: start a new cluster
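A single-pass sketch of this method (T is the slide’s threshold; cosine is a typical similarity choice, not specified on the slide).

    import numpy as np

    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    def online_centroid_clustering(docs, T=0.5):
        clusters = []  # list of (centroid, member indices) pairs
        for i, d in enumerate(docs):
            sims = [cosine(d, c) for c, _ in clusters]
            if sims and max(sims) >= T:
                # sim >= T: join the closest cluster and update its centroid incrementally
                best = int(np.argmax(sims))
                centroid, members = clusters[best]
                members.append(i)
                clusters[best] = (centroid + (d - centroid) / len(members), members)
            else:
                # sim < T: the document starts a new cluster
                clusters.append((d.astype(float), [i]))
        return clusters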

Sample centroids

Evaluation of clustering
– Formal definition
– Objective function
– Purity (considering the majority class in each cluster)
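The standard purity formula, with Ω = {ω_1, …, ω_K} the clusters, C = {c_1, …, c_J} the classes, and N the number of documents:

purity(Ω, C) = (1/N) Σ_k max_j |ω_k ∩ c_j|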

RAND index
– Accuracy when preserving object-object relationships
– RI = (TP + TN) / (TP + FP + FN + TN)
In the example:

RAND index
                   Same cluster    Different cluster
Same class         TP = 20         FN = 24
Different class    FP = 20         TN = 72

RI = (20 + 72) / (20 + 20 + 24 + 72) = 92/136 ≈ 0.68
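A sketch of the pair-counting computation behind RI (the label encodings are illustrative; on the counts above it gives (20+72)/136 ≈ 0.68).

    from itertools import combinations

    def rand_index(classes, clusters):
        # count the four kinds of document pairs
        tp = fp = fn = tn = 0
        for i, j in combinations(range(len(classes)), 2):
            same_class = classes[i] == classes[j]
            same_cluster = clusters[i] == clusters[j]
            if same_cluster and same_class:
                tp += 1
            elif same_cluster:
                fp += 1
            elif same_class:
                fn += 1
            else:
                tn += 1
        return (tp + tn) / (tp + fp + fn + tn)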

Hierarchical clustering methods
– Single-linkage
  – One common pair is sufficient
  – Disadvantage: long chains
– Complete-linkage
  – All pairs have to match
  – Disadvantage: too conservative
– Average-linkage
Demo (see the sketch below)
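A short sketch comparing the three linkage criteria with SciPy (the API calls are real; the toy data is random, for illustration only).

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    X = np.random.default_rng(0).normal(size=(20, 2))
    for method in ("single", "complete", "average"):
        Z = linkage(X, method=method)  # (n-1) merges: the dendrogram structure
        dendrogram(Z)
        plt.title(method + " linkage")
        plt.show()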

Non-hierarchical methods
– Also known as flat clustering
– Centroid method (online)
– K-means
– Expectation maximization

Hierarchical clustering
Single link produces straggly clusters (e.g., ((1 2) (5 6)))

Hierarchical agglomerative clustering
– Dendrograms (code in /data2/tools/clustering)
– E.g., language similarity (dendrogram figure not reproduced)

Clustering using dendrograms
REPEAT
  Compute pairwise similarities
  Identify closest pair
  Merge pair into single node
UNTIL only one node left
Q: what is the equivalent Venn diagram representation?
Example: cluster the following sentences: A B C B A A D C C A D E C D E F C D A E F G F D A A C D A B A
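A direct, unoptimized Python rendering of the REPEAT/UNTIL loop (a sketch: nodes are represented by size-weighted centroids; real implementations cache the similarity matrix rather than recomputing it).

    import numpy as np
    from itertools import combinations

    def hac(docs):
        # each node: (centroid vector, tuple of member indices)
        nodes = [(np.asarray(d, float), (i,)) for i, d in enumerate(docs)]
        merges = []
        while len(nodes) > 1:
            # compute pairwise similarities and identify the closest pair
            a, b = min(combinations(range(len(nodes)), 2),
                       key=lambda p: np.linalg.norm(nodes[p[0]][0] - nodes[p[1]][0]))
            (va, ma), (vb, mb) = nodes[a], nodes[b]
            # merge the pair into a single node (size-weighted centroid)
            merged = ((va * len(ma) + vb * len(mb)) / (len(ma) + len(mb)), ma + mb)
            merges.append((ma, mb))  # records the dendrogram structure
            nodes = [n for j, n in enumerate(nodes) if j not in (a, b)] + [merged]
        return merges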

Paper reading
Mark Newman, “The structure and function of complex networks” (sections I, II, III, IV, VI, VII, and VIIIa)