Term and Document Clustering
Manual thesaurus generation
Automatic thesaurus generation
Term clustering techniques:
–Cliques, connected components, stars, strings
–Clustering by refinement
–One-pass clustering
Automatic document clustering
Hierarchies of clusters

Introduction
Our information database can be viewed as a set of documents indexed by a set of terms. This view lends itself to two types of clustering:
–Clustering of terms (a statistical thesaurus)
–Clustering of documents
Both types of clustering are applied in the search process:
–Term clustering allows expanding searches with terms that are similar to terms mentioned in the query (increasing recall)
–Document clustering allows expanding answers by including documents that are similar to documents retrieved by a query (increasing recall)

Introduction (cont.)
Both kinds of clustering reflect ancient concepts:
–Term clusters correspond to a thesaurus: a "dictionary" that provides, for each word, not its definition but its synonyms and antonyms
–Document clusters correspond to the traditional arrangement of books in libraries by their subject
Electronic document clustering allows documents to belong to more than one cluster, whereas physical clustering is "one-dimensional".

Manual Thesaurus Generation
The first step is to determine the domain of clustering; this helps reduce ambiguities caused by homographs.
An important decision is the selection of words to be included; for example, avoiding words with a high frequency of occurrence (and hence little information value).
The thesaurus creator uses dictionaries and various indexes that are compiled from the document collection:
–KWOC (Key Word Out of Context), also called a concordance
–KWIC (Key Word In Context)
–KWAC (Key Word And Context)
The terms selected are clustered based on word relationships, and the strength of these relationships, using the judgment of the human creator.

KWOC, KWIC, and KWAC
Example: the various displays for the sentence "computer design contains memory chips". KWIC and KWAC are useful in resolving homographs.

KWOC:
TERM      FREQ  ITEM ID
chips     2     doc2, doc4
computer  3     doc1, doc4, doc10
design    1     doc4
memory    3     doc3, doc4, doc8, doc12

KWIC:
chips     / computer design contains memory
computer  design contains memory chips /
design    contains memory chips / computer
memory    chips / computer design contains

KWAC:
chips     computer design contains memory chips
computer  computer design contains memory chips
design    computer design contains memory chips
memory    computer design contains memory chips
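
As a quick illustration, here is a minimal Python sketch that could generate the three displays for the example sentence; the stop-word list and the item id "doc4" are illustrative assumptions, not data from a real collection.

```python
# Minimal sketch of KWOC / KWIC / KWAC displays for one sentence.
# The stop-word list and the item id "doc4" are illustrative assumptions.
STOP_WORDS = {"contains"}
sentence = "computer design contains memory chips"
words = sentence.split()
keywords = sorted(w for w in words if w not in STOP_WORDS)

# KWOC: the keyword out of context, with a pointer to the item it came from.
for kw in keywords:
    print(f"KWOC  {kw:<9} doc4")

# KWIC: the sentence rotated so the keyword comes first;
# "/" marks the original end of the sentence.
for kw in keywords:
    i = words.index(kw)
    print(f"KWIC  {kw:<9} {' '.join(words[i + 1:] + ['/'] + words[:i])}")

# KWAC: the keyword shown alongside the full, unrotated sentence.
for kw in keywords:
    print(f"KWAC  {kw:<9} {sentence}")
```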

Automatic Term Clustering
Principle: the more frequently two terms co-occur in the same documents, the more likely they are to be about the same concept. Easiest to understand within the vector model.
Given:
–A set of documents D1, …, Dm
–A set of terms that occur in these documents, T1, …, Tn
–For each term Ti and document Dj, a weight wji, indicating how strongly the term represents the document
–A term similarity measure SIM(Ti, Tj) expressing the proximity of two terms
The documents, terms, and weights can be represented in a matrix where rows are documents and columns are terms.
Example of a similarity measure: SIM(Ti, Tj) = Σk wki * wkj, i.e., the similarity of two columns is computed by multiplying the corresponding values and accumulating.

Example
A matrix representation of 5 documents and 8 terms (the term vectors also appear in the one-pass clustering example below):

      T1  T2  T3  T4  T5  T6  T7  T8
Doc1   0   4   0   0   0   2   1   3
Doc2   3   1   4   3   1   2   0   1
Doc3   3   0   0   0   3   0   3   0
Doc4   0   1   0   3   0   0   2   0
Doc5   2   2   2   3   1   4   0   2

The similarity between Term1 and Term2, using the previous measure: 0*4 + 3*1 + 3*0 + 0*1 + 2*2 = 7
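
A small Python sketch of this computation; DOC_TERM holds the matrix above, and term_sim multiplies two term columns row by row and accumulates, so term_sim(1, 2) reproduces the value 7.

```python
# Sketch of the dot-product similarity between two term columns.
DOC_TERM = [
    [0, 4, 0, 0, 0, 2, 1, 3],   # Doc1
    [3, 1, 4, 3, 1, 2, 0, 1],   # Doc2
    [3, 0, 0, 0, 3, 0, 3, 0],   # Doc3
    [0, 1, 0, 3, 0, 0, 2, 0],   # Doc4
    [2, 2, 2, 3, 1, 4, 0, 2],   # Doc5
]

def term_sim(i, j):
    """SIM(Ti, Tj): multiply the two term columns row by row and accumulate."""
    return sum(row[i - 1] * row[j - 1] for row in DOC_TERM)

print(term_sim(1, 2))   # 0*4 + 3*1 + 3*0 + 0*1 + 2*2 = 7
```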

Automatic Term Clustering (cont.)
Next, compute the similarity between every two different terms.
–Because this similarity measure is symmetric (SIM(Ti, Tj) = SIM(Tj, Ti)), we need to compute only n*(n-1)/2 similarities.
This data is stored in a Term-Term similarity matrix.

Automatic Term Clustering (cont.)
Next, choose a threshold that determines if two terms are similar enough to be in the same class. This data is stored in a new, binary Term-Term similarity matrix. In this example, the threshold is 10 (two terms are similar if their similarity measure is > 10).
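
Continuing in the same sketch style, the snippet below computes all pairwise term similarities and keeps only the pairs above the threshold; the term vectors are the columns of the example matrix (they are also listed verbatim in the one-pass clustering example later).

```python
# Sketch: build the Term-Term similarities and threshold them into a binary relation.
TERM = {
    1: (0, 3, 3, 0, 2), 2: (4, 1, 0, 1, 2), 3: (0, 4, 0, 0, 2),
    4: (0, 3, 0, 3, 3), 5: (0, 1, 3, 0, 1), 6: (2, 2, 0, 0, 4),
    7: (1, 0, 3, 2, 0), 8: (3, 1, 0, 0, 2),
}
THRESHOLD = 10

def sim(a, b):
    return sum(x * y for x, y in zip(TERM[a], TERM[b]))

# The measure is symmetric, so only n*(n-1)/2 pairs need to be computed.
similar_pairs = sorted((a, b) for a in TERM for b in TERM
                       if a < b and sim(a, b) > THRESHOLD)
print(similar_pairs)
# [(1, 3), (1, 4), (1, 5), (1, 6), (2, 4), (2, 6), (2, 8),
#  (3, 4), (3, 6), (4, 6), (6, 8)]
```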

Automatic Term Clustering (cont.)
Finally, assign the terms to clusters. Common algorithms:
–Cliques
–Connected components
–Stars
–Strings

Graphical Representation
The various clustering techniques are easy to visualize using a graph view of the binary Term-Term matrix:
[Graph: nodes T1–T8, with an edge between each pair of terms marked similar in the binary matrix; T7 has no edges.]

Cliques
Cliques require all terms in a cluster (thesaurus class) to be similar to all other terms. In the graph, a clique is a maximal set of nodes such that each node is directly connected to every other node in the set.
Algorithm:
1. i = 1
2. Place Termi in a new class
3. r = k = i + 1
4. Validate if Termk is within the threshold of all terms in the current class
5. If not, k = k + 1
6. If k > n (the number of terms) then r = r + 1;
   if r = n then goto 7
   else k = r, create a new class with Termi in it, goto 4
   else goto 4
7. If the current class has only Termi in it and there are other classes with Termi in them, then delete the current class; else i = i + 1
8. If i = n + 1 then goto 9, else goto 2
9. Eliminate any classes that are subsets of (or equal to) other classes

Example (cont.)
Classes created:
Class1 = (Term1, Term3, Term4, Term6)
Class2 = (Term1, Term5)
Class3 = (Term2, Term4, Term6)
Class4 = (Term2, Term6, Term8)
Class5 = (Term7)
Not a partition (Term1 and Term6 are in more than one class). Terms that appear in two classes are homographs.
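
One way to reproduce these classes is to enumerate the maximal cliques of the similarity graph; the sketch below uses the networkx library on the edge list obtained from the thresholded matrix above (Term7 is added as an isolated node).

```python
# Sketch: thesaurus classes as maximal cliques of the binary Term-Term graph.
import networkx as nx

EDGES = [(1, 3), (1, 4), (1, 5), (1, 6), (2, 4), (2, 6), (2, 8),
         (3, 4), (3, 6), (4, 6), (6, 8)]
G = nx.Graph(EDGES)
G.add_node(7)                       # Term7 is not similar to any other term

for clique in nx.find_cliques(G):   # each maximal clique is a thesaurus class
    print(sorted(clique))
# The five classes (in some order): [1, 3, 4, 6], [1, 5], [2, 4, 6], [2, 6, 8], [7]
```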

Connected Components
Connected components require all terms in a cluster (thesaurus class) to be similar to at least one other term. In the graph, a connected component is a maximal set of nodes such that each node is reachable from every other node in the set.
Algorithm:
1. Select a term not in a class and place it in a new class (if all terms are in classes, stop)
2. Place in that class all other terms that are similar to it
3. For each term placed in the class, repeat Step 2
4. When no new terms are identified in Step 2, goto Step 1
Example, classes created:
Class1 = (Term1, Term3, Term4, Term5, Term6, Term2, Term8)
Class2 = (Term7)
The algorithm partitions the set of terms into thesaurus classes. It is possible that two terms in the same class have similarity 0.
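
The corresponding sketch for connected components, run on the same graph:

```python
# Sketch: thesaurus classes as connected components of the binary Term-Term graph.
import networkx as nx

EDGES = [(1, 3), (1, 4), (1, 5), (1, 6), (2, 4), (2, 6), (2, 8),
         (3, 4), (3, 6), (4, 6), (6, 8)]
G = nx.Graph(EDGES)
G.add_node(7)

for component in nx.connected_components(G):
    print(sorted(component))
# [1, 2, 3, 4, 5, 6, 8] and [7]
```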

Stars
Algorithm: a term not yet in a class is selected, and then all terms similar to it are placed in its class.
Many different clusterings are possible, depending on the selection of "seed" terms.
Example: assume that the term selected is the lowest-numbered term not already in a class. Classes created:
Class1 = (Term1, Term3, Term4, Term5, Term6)
Class2 = (Term2, Term4, Term6, Term8)
Class3 = (Term7)
Not a partition (Term4 is in two classes). The algorithm may be modified to create partitions by excluding any term that has already been selected for a previous class.
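
A short sketch of the star technique under the same seed-selection rule (lowest-numbered term not yet in a class):

```python
# Sketch: star clusters around the lowest-numbered unassigned term.
EDGES = {(1, 3), (1, 4), (1, 5), (1, 6), (2, 4), (2, 6), (2, 8),
         (3, 4), (3, 6), (4, 6), (6, 8)}
TERMS = range(1, 9)

def similar(a, b):
    return (a, b) in EDGES or (b, a) in EDGES

assigned, classes = set(), []
for seed in TERMS:                 # lowest-numbered term not yet in a class
    if seed in assigned:
        continue
    cls = [seed] + [t for t in TERMS if similar(seed, t)]
    classes.append(cls)
    assigned.update(cls)
print(classes)
# [[1, 3, 4, 5, 6], [2, 4, 6, 8], [7]]
```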

Strings
Algorithm:
1. Select a term not yet in a class and place it in a new class (if all terms are in classes, stop)
2. Add to this class a term similar to the selected term and not yet in the class
3. Repeat Step 2 with the new term, until no new terms may be added
4. When no new terms are identified in Step 2, goto Step 1
Many different clusterings are possible, depending on the selections in Step 1 and Step 2. The clusters are not necessarily a partition.
Example: assume that the term selected in either Step 1 or Step 2 is the lowest-numbered candidate, and that the term selected in Step 2 may not be in an existing class (this assures a partition). Classes created:
Class1 = (Term1, Term3, Term4, Term2, Term8, Term6)
Class2 = (Term5)
Class3 = (Term7)
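
And a sketch of the string technique; the class memberships match the slide, although the order in which terms are picked up inside the first class can differ slightly under the lowest-numbered rule.

```python
# Sketch: string clusters, following a chain of similar, still-unassigned terms.
EDGES = {(1, 3), (1, 4), (1, 5), (1, 6), (2, 4), (2, 6), (2, 8),
         (3, 4), (3, 6), (4, 6), (6, 8)}
TERMS = range(1, 9)

def neighbors(t):
    return sorted(b if a == t else a for a, b in EDGES if t in (a, b))

assigned, classes = set(), []
for seed in TERMS:                         # Step 1: lowest-numbered free term
    if seed in assigned:
        continue
    cls, current = [seed], seed
    assigned.add(seed)
    while True:                            # Steps 2-3: extend the string
        free = [n for n in neighbors(current) if n not in assigned]
        if not free:
            break
        current = free[0]                  # lowest-numbered free neighbor
        cls.append(current)
        assigned.add(current)
    classes.append(cls)
print(classes)
# [[1, 3, 4, 2, 6, 8], [5], [7]] -- the same classes as on the slide
```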

Summary
The clique technique:
–Produces classes with the strongest relationship among terms
–Classes are strongly associated with concepts
–Produces more classes
–Provides the highest precision when used for query term expansion
–Most costly to compute

Summary (cont.)
The connected component technique:
–Produces classes with the weakest relationship among terms
–Classes are not strongly associated with concepts
–Produces the fewest classes
–Maximizes recall when used for query term expansion, but can hurt precision
–Least costly to compute
Other techniques lie between these two extremes.

Clustering by Refinement
Algorithm:
1. Determine an initial assignment of terms to classes
2. For each class, calculate a centroid
3. Calculate the similarity between every term and every centroid
4. Reassign each term to the class whose centroid is the most similar
5. If any terms were reassigned then goto Step 2; otherwise stop
Example: assume the document-term matrix from the earlier example.
Iteration 1, initial classes and centroids:
Class1 = (Term1, Term2)
Class2 = (Term3, Term4)
Class3 = (Term5, Term6)
Centroid1 = (4/2, 4/2, 3/2, 1/2, 4/2)
Centroid2 = (0/2, 7/2, 0/2, 3/2, 5/2)
Centroid3 = (2/2, 3/2, 3/2, 0/2, 5/2)

Clustering by Refinement (cont.)
Term-Class similarities and reassignment.
Iteration 2, revised classes and centroids:
Class1 = (Term2, Term7, Term8)
Class2 = (Term1, Term3, Term4, Term6)
Class3 = (Term5)
Centroid1 = (8/3, 2/3, 3/3, 3/3, 4/3)
Centroid2 = (2/4, 12/4, 3/4, 3/4, 11/4)
Centroid3 = (0/1, 1/1, 3/1, 0/1, 1/1)

Clustering by Refinement (cont.)
Term-Class similarities and reassignment.
Summary:
–The process requires fewer calculations.
–The number of classes is defined at the start and cannot grow.
–The number of classes can decrease (a class may become empty).
–A term may be assigned to a class even if its similarity to that class is very weak (compared to other terms in the class).
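
A sketch of the refinement loop (a k-means-like iteration with the dot-product similarity); breaking ties in favour of a term's current class is an assumption made here, which is one way to reproduce the first-pass assignment shown on the slides.

```python
# Sketch of clustering by refinement with the dot-product similarity.
TERM = {
    1: (0, 3, 3, 0, 2), 2: (4, 1, 0, 1, 2), 3: (0, 4, 0, 0, 2),
    4: (0, 3, 0, 3, 3), 5: (0, 1, 3, 0, 1), 6: (2, 2, 0, 0, 4),
    7: (1, 0, 3, 2, 0), 8: (3, 1, 0, 0, 2),
}

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def centroid(terms):
    vecs = [TERM[t] for t in terms]
    return tuple(sum(col) / len(vecs) for col in zip(*vecs))

# Initial assignment from the slide; Term7 and Term8 start unassigned.
classes = {1: [1, 2], 2: [3, 4], 3: [5, 6]}

for _ in range(10):                               # bounded number of passes
    cents = {c: centroid(ts) for c, ts in classes.items() if ts}
    new = {c: [] for c in classes}
    for t in TERM:
        current = next((c for c, ts in classes.items() if t in ts), None)
        sims = {c: dot(TERM[t], cents[c]) for c in cents}
        best = max(sims.values())
        # stay in the current class on a tie, otherwise move to the best class
        if current in sims and sims[current] == best:
            target = current
        else:
            target = max(sims, key=sims.get)
        new[target].append(t)
    if new == classes:
        break
    classes = new
    print(classes)
# First pass: {1: [2, 7, 8], 2: [1, 3, 4, 6], 3: [5]}, matching the slide.
```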

One-Pass Clustering
Algorithm:
1. Assign the next term to a new class.
2. Compute the centroid of the modified class.
3. Compare the next term to the centroids of all existing classes.
   If the similarity to all existing centroids is less than a predetermined threshold, then goto Step 1.
   Otherwise, assign this term to the class with the most similar centroid and goto Step 2.

One-Pass Clustering Example
Term1 = (0,3,3,0,2)
Assign Term1 to new Class1. Centroid1 = (0/1, 3/1, 3/1, 0/1, 2/1)
Term2 = (4,1,0,1,2)
Similarity(Term2, Centroid1) = 7 (below threshold)
Assign Term2 to new Class2. Centroid2 = (4/1, 1/1, 0/1, 1/1, 2/1)
Term3 = (0,4,0,0,2)
Similarity(Term3, Centroid1) = 16 (highest)
Similarity(Term3, Centroid2) = 8
Assign Term3 to Class1. Centroid1 = (0/2, 7/2, 3/2, 0/2, 4/2)
Term4 = (0,3,0,3,3)
Similarity(Term4, Centroid1) = 16.5 (highest)
Similarity(Term4, Centroid2) = 12
Assign Term4 to Class1. Centroid1 = (0/3, 10/3, 3/3, 3/3, 7/3)

Example (cont.)
Term5 = (0,1,3,0,1)
Similarity(Term5, Centroid1) = 8.67 (below threshold)
Similarity(Term5, Centroid2) = 3 (below threshold)
Assign Term5 to new Class3. Centroid3 = (0/1, 1/1, 3/1, 0/1, 1/1)
Term6 = (2,2,0,0,4)
Similarity(Term6, Centroid1) = 16
Similarity(Term6, Centroid2) = 18 (highest)
Similarity(Term6, Centroid3) = 6
Assign Term6 to Class2. Centroid2 = (6/2, 3/2, 0/2, 1/2, 6/2)
Term7 = (1,0,3,2,0)
Similarity(Term7, Centroid1) = 5 (below threshold)
Similarity(Term7, Centroid2) = 4 (below threshold)
Similarity(Term7, Centroid3) = 9 (below threshold)
Assign Term7 to new Class4. Centroid4 = (1/1, 0/1, 3/1, 2/1, 0/1)

One-Pass Clustering (cont.)
Example (cont.)
Term8 = (3,1,0,0,2)
Similarity(Term8, Centroid1) = 8
Similarity(Term8, Centroid2) = 16.5 (highest)
Similarity(Term8, Centroid3) = 3
Similarity(Term8, Centroid4) = 3
Assign Term8 to Class2. Centroid2 = (9/3, 4/3, 0/3, 1/3, 8/3)
Final classes:
Class1 = (Term1, Term3, Term4)
Class2 = (Term2, Term6, Term8)
Class3 = (Term5)
Class4 = (Term7)
Summary:
–Least expensive to calculate.
–Classes created depend on the order of processing the terms.
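
A compact sketch of the whole one-pass example, using threshold 10 and the term vectors listed above; it ends with the same four classes.

```python
# Sketch of one-pass clustering over the eight term vectors (threshold 10).
TERM = {
    1: (0, 3, 3, 0, 2), 2: (4, 1, 0, 1, 2), 3: (0, 4, 0, 0, 2),
    4: (0, 3, 0, 3, 3), 5: (0, 1, 3, 0, 1), 6: (2, 2, 0, 0, 4),
    7: (1, 0, 3, 2, 0), 8: (3, 1, 0, 0, 2),
}
THRESHOLD = 10

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def centroid(terms):
    vecs = [TERM[t] for t in terms]
    return tuple(sum(col) / len(vecs) for col in zip(*vecs))

classes = []                              # each class is a list of term ids
for t in sorted(TERM):
    sims = [dot(TERM[t], centroid(c)) for c in classes]
    if not sims or max(sims) <= THRESHOLD:
        classes.append([t])               # below threshold everywhere: new class
    else:
        classes[sims.index(max(sims))].append(t)
print(classes)
# [[1, 3, 4], [2, 6, 8], [5], [7]]
```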

Automatic Document Clustering
The techniques are analogous to those of automatic term clustering. As before, we are given:
–A set of documents D1, …, Dm
–A set of terms that occur in these documents, T1, …, Tn
–For each document Di and term Tj, a weight wij, indicating how strongly the term represents the document
However, here we use a document similarity measure SIM(Di, Dj) expressing the proximity of two documents.
The documents, terms, and weights can be represented in a matrix where rows are documents and columns are terms.
Example of a similarity measure: SIM(Di, Dj) = Σk wik * wjk, i.e., the similarity of two rows is computed by multiplying the corresponding values and accumulating.
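
The document-side measure in the same sketch style; doc_sim takes the dot product of two rows of the matrix reconstructed earlier.

```python
# Sketch: SIM(Di, Dj) as the dot product of two document rows.
DOC_TERM = [
    [0, 4, 0, 0, 0, 2, 1, 3],   # Doc1
    [3, 1, 4, 3, 1, 2, 0, 1],   # Doc2
    [3, 0, 0, 0, 3, 0, 3, 0],   # Doc3
    [0, 1, 0, 3, 0, 0, 2, 0],   # Doc4
    [2, 2, 2, 3, 1, 4, 0, 2],   # Doc5
]

def doc_sim(i, j):
    """SIM(Di, Dj) = sum over terms k of w_ik * w_jk."""
    return sum(a * b for a, b in zip(DOC_TERM[i - 1], DOC_TERM[j - 1]))

for i in range(1, 6):
    for j in range(i + 1, 6):
        print(f"SIM(Doc{i}, Doc{j}) = {doc_sim(i, j)}")
```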

Automatic Document Clustering (cont.)
The Document-Document similarity matrix and the binary Document-Document matrix (using threshold 10) are built in the same way as for terms.

Automatic Document Clustering (cont.)
The same clustering techniques would yield:
Cliques:
Class1 = (Doc1, Doc2, Doc5)
Class2 = (Doc2, Doc3)
Class3 = (Doc2, Doc4, Doc5)
Connected components:
Class1 = (Doc1, Doc2, Doc5, Doc3, Doc4)
Stars:
Class1 = (Doc1, Doc2, Doc5)
Class2 = (Doc2, Doc3, Doc4, Doc5)
Strings:
Class1 = (Doc1, Doc2, Doc3)
Class2 = (Doc2, Doc3)
Class3 = (Doc4, Doc5)
Clustering by refinement:
Initial: Class1 = (Doc1, Doc3), Class2 = (Doc2, Doc4)
Final: Class1 = (Doc1), Class2 = (Doc2, Doc3, Doc4, Doc5)

Cluster hierarchies
General idea: the initial set of clusters is clustered into "second-level" clusters, and so on; a new level is created if the number of clusters at the current level is considered too large, until a "root" object is created for the entire collection of documents or terms.
[Figure: a hierarchy of centroids over the documents.]
Similarity between clusters: defined as the similarity between every object in one cluster and every object in the other cluster; it can be approximated by the similarity between the corresponding centroids.
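
A sketch of one hierarchy step: take some first-level classes (here, the one-pass clustering result), represent each by its centroid, and compare the centroids; the second-level threshold for merging is left as an illustrative parameter, not a value from the slides.

```python
# Sketch: approximate cluster-to-cluster similarity by centroid-to-centroid similarity.
TERM = {
    1: (0, 3, 3, 0, 2), 2: (4, 1, 0, 1, 2), 3: (0, 4, 0, 0, 2),
    4: (0, 3, 0, 3, 3), 5: (0, 1, 3, 0, 1), 6: (2, 2, 0, 0, 4),
    7: (1, 0, 3, 2, 0), 8: (3, 1, 0, 0, 2),
}
LEVEL1 = [[1, 3, 4], [2, 6, 8], [5], [7]]     # first-level classes (one-pass result)

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def centroid(terms):
    vecs = [TERM[t] for t in terms]
    return tuple(sum(col) / len(vecs) for col in zip(*vecs))

cents = [centroid(c) for c in LEVEL1]
for i in range(len(cents)):
    for j in range(i + 1, len(cents)):
        print(f"SIM(cluster{i + 1}, cluster{j + 1}) = {dot(cents[i], cents[j]):.2f}")
# Pairs of clusters above a chosen second-level threshold would be merged into
# "second-level" clusters, and the process repeated until a root is reached.
```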

Cluster hierarchies (cont.)
Benefits:
–Reduces search overhead by performing top-down searches, where at each level only the cluster centroids at that level are compared with the search object.
–Having found an object of interest, users can expand the search to see other objects in the containing cluster (this holds for nonhierarchical clustering as well).
–Can be used to provide a compact visual representation of the information space.
Practicality:
–More useful for creating document hierarchies than for creating term hierarchies.
–Automatic creation of term hierarchies (hierarchical statistical thesauri) introduces too many errors.