1 CS 430 / INFO 430 Information Retrieval Lecture 27 Classification 2.

Presentation transcript:

1 CS 430 / INFO 430 Information Retrieval Lecture 27 Classification 2

2 Course Administration

3 Cluster Analysis Methods that divide a set of n objects into m non-overlapping subsets. For information discovery, cluster analysis is applied to: terms, for thesaurus construction; documents, to divide them into categories (sometimes called automatic classification, but classification usually requires a pre-determined set of categories).

4 Cluster Analysis Metrics
  Documents clustered on the basis of a similarity measure calculated from the terms that they contain.
  Documents clustered on the basis of co-occurring citations.
  Terms clustered on the basis of the documents in which they co-occur.

5 Non-hierarchical and Hierarchical Methods
  Non-hierarchical methods: Elements are divided into m non-overlapping sets where m is predetermined.
  Hierarchical methods: m is varied progressively to create a hierarchy of solutions.
  Agglomerative methods: m is initially equal to n, the total number of elements, where every element is considered to be a cluster with one element. The hierarchy is produced by incrementally combining clusters.

6 Simple Hierarchical Methods: Single Link Concept: Similarity between clusters is the similarity between the most similar elements. [Figure: points marked x]

7 Single Link A simple agglomerative method. Initially, each element is its own cluster with one element. At each step, calculate the similarity between each pair of clusters as the similarity of the most similar pair of elements, one taken from each cluster. Merge the two clusters that are most similar. May lead to long, straggling clusters (chaining). Very simple computation.
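The single-link procedure on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `single_link`, the toy items p, q, r, s and their similarity values are all hypothetical, and ties are broken by taking the first best pair found.

```python
from itertools import combinations

def single_link(sim, items):
    """Agglomerative single-link clustering; returns the merges in order."""
    clusters = [frozenset([x]) for x in items]
    merges = []

    def pair_sim(a, b):
        # Similarities are stored once per unordered pair.
        return sim[(a, b)] if (a, b) in sim else sim[(b, a)]

    def cluster_sim(c1, c2):
        # Single link: similarity of the MOST similar cross-cluster pair.
        return max(pair_sim(a, b) for a in c1 for b in c2)

    while len(clusters) > 1:
        # Most similar pair of clusters (max() keeps the first one on ties).
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_sim(clusters[ij[0]], clusters[ij[1]]))
        merges.append((set(clusters[i]), set(clusters[j]),
                       cluster_sim(clusters[i], clusters[j])))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.insert(i, merged)
    return merges

# Hypothetical toy data: p-q, q-r and r-s are similar, p-s is not.
toy_sim = {("p", "q"): 0.9, ("q", "r"): 0.8, ("r", "s"): 0.7,
           ("p", "r"): 0.1, ("q", "s"): 0.1, ("p", "s"): 0.05}
merges = single_link(toy_sim, ["p", "q", "r", "s"])
for left, right, value in merges:
    print(sorted(left), "+", sorted(right), "at", value)
```

In this toy run p and s end up in the same cluster through the chain p-q-r-s even though their own similarity is only 0.05, which is the chaining effect the slide mentions.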

8 Similarities: Incidence array
D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha
D3: bravo charlie bravo echo foxtrot bravo
D4: foxtrot alpha alpha golf golf delta

      alpha  bravo  charlie  delta  echo  foxtrot  golf
D1      1      1       1       1     1       1      1
D2      1      0       0       1     0       0      1
D3      0      1       1       0     1       1      0
D4      1      0       0       1     0       1      1
n       3      2       2       3     2       3      3
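As a small sketch, the incidence array can be derived directly from the four documents on this slide. The variable names are illustrative; the row `n` is computed here as the number of documents that contain each term, which is an interpretation of the slide's label.

```python
docs = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}

# Vocabulary in alphabetical order: alpha ... golf.
terms = sorted({t for text in docs.values() for t in text.split()})

# Incidence: 1 if the term occurs in the document at all, else 0
# (repeated occurrences are not counted).
incidence = {d: [1 if t in text.split() else 0 for t in terms]
             for d, text in docs.items()}

# n: number of documents containing each term (column sums).
n = [sum(col) for col in zip(*incidence.values())]

for d, row in incidence.items():
    print(d, row)
print("n ", n)
```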

9 Term similarity matrix
         alpha  bravo  charlie  delta  echo  foxtrot  golf
alpha            0.2    0.2      0.5    0.2   0.33     0.5
bravo                   0.5      0.2    0.5   0.4      0.2
charlie                          0.2    0.5   0.4      0.2
delta                                   0.2   0.33     0.5
echo                                          0.4      0.2
foxtrot                                                0.33
golf
Using incidence matrix and dice weighting
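The sketch below shows one way to compute term similarities from the four example documents. It assumes the weighting c / (n_i + n_j), where c is the number of documents containing both terms and n_i, n_j are the numbers of documents containing each term. That formula is an inference: it reproduces the pair values listed on the complete-linkage slides (for example ab/.2, ad/.5, af/.33), whereas the conventional Dice coefficient, 2c / (n_i + n_j), would give twice those values.

```python
from itertools import combinations

docs = [
    "alpha bravo charlie delta echo foxtrot golf",   # D1
    "golf golf golf delta alpha",                    # D2
    "bravo charlie bravo echo foxtrot bravo",        # D3
    "foxtrot alpha alpha golf golf delta",           # D4
]

# term -> set of indices of the documents that contain it
term_docs = {}
for i, text in enumerate(docs):
    for term in text.split():
        term_docs.setdefault(term, set()).add(i)

def similarity(t1, t2):
    """Assumed weighting: co-occurrence count over the sum of document counts."""
    common = len(term_docs[t1] & term_docs[t2])
    return common / (len(term_docs[t1]) + len(term_docs[t2]))

# Upper triangle of the term similarity matrix, rounded to two places.
sim = {(a, b): round(similarity(a, b), 2)
       for a, b in combinations(sorted(term_docs), 2)}

for pair, value in sim.items():
    print(pair, value)
```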

10 Example -- single link Agglomerative: step 1 [Dendrogram with leaves alpha, delta, golf, bravo, echo, charlie, foxtrot; merge 1 marked]

11 Example -- single link Agglomerative: step 2 [Dendrogram with leaves alpha, delta, golf, bravo, echo, charlie, foxtrot; merges 1 and 2 marked]

12 Example -- single link Agglomerative: step 3 [Dendrogram with leaves alpha, delta, golf, bravo, echo, charlie, foxtrot]

13 Example -- single link This style of diagram is called a dendrogram. [Dendrogram with leaves alpha, delta, golf, bravo, echo, charlie, foxtrot]

14 Simple Hierarchical Methods: Complete Linkage Concept: Similarity between clusters is the similarity between the least similar elements. [Figure: points marked x]

15 Complete linkage A simple agglomerative method. Initially, each element is its own cluster with one element. At each step, calculate the similarity between each pair of clusters as the similarity between the least similar pair of elements in the two clusters. Merge the two clusters that are most similar. Generates small, tightly bound clusters.

16 Term similarity matrix
         alpha  bravo  charlie  delta  echo  foxtrot  golf
alpha            0.2    0.2      0.5    0.2   0.33     0.5
bravo                   0.5      0.2    0.5   0.4      0.2
charlie                          0.2    0.5   0.4      0.2
delta                                   0.2   0.33     0.5
echo                                          0.4      0.2
foxtrot                                                0.33
golf
Using incidence matrix and dice weighting

17 Example – complete linkage
Least similar pair / similarity for each pair of clusters
Cluster   a    b      c      d      e      f       g
a         -    ab/.2  ac/.2  ad/.5  ae/.2  af/.33  ag/.5
b              -      bc/.5  bd/.2  be/.5  bf/.4   bg/.2
c                     -      cd/.2  ce/.5  cf/.4   cg/.2
d                            -      de/.2  df/.33  dg/.5
e                                   -      ef/.4   eg/.2
f                                          -       fg/.33
g                                                  -
Step 1. Merge clusters {a} and {d}

18 Example – complete linkage
Least similar pair / similarity for each pair of clusters
Cluster   a,d  b      c      e      f       g
a,d       -    ab/.2  ac/.2  ae/.2  df/.33  ag/.5
b              -      bc/.5  be/.5  bf/.4   bg/.2
c                     -      ce/.5  cf/.4   cg/.2
e                            -      ef/.4   eg/.2
f                                   -       fg/.33
g                                           -
Step 2. Merge clusters {a,d} and {g}

19 Example – complete linkage
Least similar pair / similarity for each pair of clusters
Cluster   a,d,g  b      c      e      f
a,d,g     -      ab/.2  ac/.2  ae/.2  af/.33
b                -      bc/.5  be/.5  bf/.4
c                       -      ce/.5  cf/.4
e                              -      ef/.4
f                                     -
Step 3. Merge clusters {b} and {c}

20 Example – complete linkage
Least similar pair / similarity for each pair of clusters
Cluster   a,d,g  b,c    e      f
a,d,g     -      ab/.2  ae/.2  af/.33
b,c              -      be/.5  bf/.4
e                       -      ef/.4
f                              -
Step 4. Merge clusters {b,c} and {e}

21 Example -- complete linkage [Dendrogram with leaves alpha, delta, golf, bravo, charlie, echo, foxtrot; merges labelled Step 1 to Step 6]
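The complete-linkage example can be replayed with a short Python sketch. The pair similarities are the values listed on the example slides (a = alpha through g = golf). The tie-breaking rule is an assumption: many pairs share the value .5, and taking the first best pair in scan order happens to reproduce Steps 1 to 4 as stated on the slides. The last two merges are simply what this sketch computes under that rule.

```python
from itertools import combinations

# Pair similarities as listed on the complete-linkage slides.
SIM = {
    "ab": .2, "ac": .2, "ad": .5, "ae": .2, "af": .33, "ag": .5,
    "bc": .5, "bd": .2, "be": .5, "bf": .4, "bg": .2,
    "cd": .2, "ce": .5, "cf": .4, "cg": .2,
    "de": .2, "df": .33, "dg": .5,
    "ef": .4, "eg": .2,
    "fg": .33,
}

def complete_link_sim(c1, c2):
    # Complete linkage: similarity of the LEAST similar cross-cluster pair.
    return min(SIM["".join(sorted(x + y))] for x in c1 for y in c2)

def complete_linkage(items):
    """Each cluster is a string of letters; returns the merges in order."""
    clusters = list(items)
    merges = []
    while len(clusters) > 1:
        # Most similar pair of clusters; max() keeps the first on ties
        # (assumed tie-breaking rule).
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: complete_link_sim(clusters[ij[0]],
                                                    clusters[ij[1]]))
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]   # merged cluster replaces the first
        del clusters[j]
    return merges

merges = complete_linkage("abcdefg")
for step, (left, right) in enumerate(merges, start=1):
    print(f"Step {step}. Merge clusters {{{','.join(left)}}} and {{{','.join(right)}}}")
```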

22 Non-Hierarchical Methods: K-means
1. Define a similarity measure between any two points in the space (e.g., square of distance).
2. Choose k points as initial group centroids.
3. Assign each object to the group that has the closest centroid.
4. When all objects have been assigned, recalculate the positions of the k centroids.
5. Repeat Steps 3 and 4 until the centroids no longer move.
This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
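The five steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions: points are tuples of numbers, squared Euclidean distance is the measure, the caller supplies the initial centroids, and an empty group keeps its previous centroid. The function name `kmeans`, the `max_iter` safeguard and the sample points are not from the lecture.

```python
def kmeans(points, centroids, max_iter=100):
    """K-means with caller-supplied initial centroids (Step 2)."""

    def sqdist(p, q):
        # Step 1: squared Euclidean distance between two points.
        return sum((a - b) ** 2 for a, b in zip(p, q))

    groups = []
    for _ in range(max_iter):
        # Step 3: assign each object to the group with the closest centroid.
        groups = [[] for _c in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda c: sqdist(p, centroids[c]))
            groups[nearest].append(p)
        # Step 4: recalculate each centroid as the mean of its group.
        updated = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
        # Step 5: stop when the centroids no longer move.
        if updated == centroids:
            break
        centroids = updated
    return centroids, groups

# Hypothetical sample: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids, groups = kmeans(points, [(0, 0), (9, 9)])
print(centroids)
print(groups)
```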

23 K-means
Iteration converges under a very general set of conditions.
Results depend on the choice of the k initial centroids.
Methods can be used to generate a sequence of solutions for k increasing from 1 to n. Note that, in general, the results will not be hierarchical.
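The dependence on the initial centroids can be seen on a tiny, hypothetical example: four points at the corners of a 4 by 1 rectangle, clustered with k = 2 from two different starting positions. The compact `kmeans` and the `cost` function (sum of squared distances to the assigned centroid) are illustrative names, not taken from the lecture.

```python
def sqdist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centroids, max_iter=100):
    """Compact K-means; returns (centroids, groups)."""
    groups = []
    for _ in range(max_iter):
        groups = [[] for _c in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda c: sqdist(p, centroids[c]))
            groups[nearest].append(p)
        updated = [tuple(sum(x) / len(g) for x in zip(*g)) if g else c
                   for g, c in zip(groups, centroids)]
        if updated == centroids:      # centroids no longer move
            break
        centroids = updated
    return centroids, groups

def cost(centroids, groups):
    # Sum of squared distances from each point to its group's centroid.
    return sum(sqdist(p, c) for c, g in zip(centroids, groups) for p in g)

corners = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Start near the left and right edges: converges to left/right groups.
good = kmeans(corners, [(0, 0), (4, 0)])
# Start in the middle of the bottom and top edges: converges immediately
# to bottom/top groups, a stable but much worse solution.
poor = kmeans(corners, [(2, 0), (2, 1)])

print(good[1], cost(*good))
print(poor[1], cost(*poor))
```

Both runs satisfy the stopping rule, so in practice several starting configurations are often tried and the solution with the lowest cost kept.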

24 Problems with cluster analysis in information retrieval
  Selection of attributes on which items are clustered
  Choice of similarity measure and algorithm
  Computational resources
  Assessing validity and stability of clusters
  Updating clusters as data changes
  Method for using the clusters in information retrieval

25 Example 1: Concept Spaces for Scientific Terms Large-scale searches can only match terms specified by the user to terms appearing in documents. Cluster analysis can be used to provide information retrieval by concepts, rather than by terms. Bruce Schatz, William H. Mischo, Timothy W. Cole, Joseph B. Hardin, Ann P. Bishop (University of Illinois), Hsinchun Chen (University of Arizona), Federating Diverse Collections of Scientific Literature, IEEE Computer, May

26 Concept Spaces: Methodology Concept space: A similarity matrix based on co-occurrence of terms. Approach: Use cluster analysis to generate "concept spaces" automatically, i.e., clusters of terms that embrace a single semantic concept. Arrange concepts in a hierarchical classification.

27 Concept Spaces: INSPEC Data Data set 1: All terms in 400,000 records from INSPEC, containing 270,000 terms with 4,000,000 links. [24.5 hours of CPU on 16-node Silicon Graphics supercomputer.]
Sample entry:
computer-aided instruction
  see also education
  UF teaching machines
  BT educational computing
  TT computer applications
  RT education
  RT teaching

28 Concept Space: Compendex Data Data set 2: (a) 4,000,000 abstracts from the Compendex database covering all of engineering as the collection, partitioned along classification code lines into some 600 community repositories. [Four days of CPU on 64-processor Convex Exemplar.] (b) In the largest experiment, 10,000,000 abstracts were divided into sets of 100,000 and the concept space for each set was generated separately. The sets were selected by the existing classification scheme.

29 Objectives
  Semantic retrieval (using concept spaces for term suggestion)
  Semantic interoperability (vocabulary switching across subject domains)
  Semantic indexing (concept identification of document content)
  Information representation (information units for uniform manipulation)

30 Use of Concept Space: Term Suggestion

31 Future Use of Concept Space: Vocabulary Switching "I'm a civil engineer who designs bridges. I'm interested in using fluid dynamics to compute the structural effects of wind currents on long structures. Ocean engineers who design undersea cables probably do similar computations for the structural effects of water currents on long structures. I want you [the system] to change my civil engineering fluid dynamics terms into the ocean engineering terms and search the undersea cable literature."

32 Example 2: Visual thesaurus for geographic images Methodology: Divide images into small regions. Create a similarity measure based on properties of these images. Use cluster analysis tools to generate clusters of similar images. Provide alternative representations of clusters. Marshall Ramsey, Hsinchun Chen, Bin Zhu, A Collection of Visual Thesauri for Browsing Large Collections of Geographic Images, May Thesaurus.html

33

34 The End [Diagram labels: Search index, Return hits, Browse content, Return objects, Scan results]