Published by Janice Morrison. Modified over 9 years ago.
1
Document Clustering
Sung-Chien Lin (林頌堅)
Department of Library and Information Studies, Shih-Hsin University (世新大學圖書資訊學系)
2
Contents
Research on Document Clustering
Possible Applications of Document Clustering
Document Clustering in a Networked Environment
Conclusions
3
Research on Document Clustering
4
Document Clustering
Definition
–Documents with similar properties are assigned to automatically created groups
Importance
–To improve the efficiency and effectiveness of retrieval: time, space, quality
–To determine the structure of the literature of a field: exploration of latent information in documents; reduction of users’ cognitive load
5
Block Diagram
Document Set → Feature Extraction → Features → Clustering → Clustered Documents → Applications
Determining Clustering Parameters (input to Clustering)
–Cluster Structure: Nonhierarchical, Hierarchical
–Halting Criteria: Number of Desired Clusters, Number of Iterations
6
Research on Document Clustering
Features to represent documents
–Linguistic structure in documents: co-occurrences of terms; semantic structure
–Metadata of documents: authors; citations
–Co-citation: based on the documents that cite the examined documents
–Bibliographic coupling: based on the documents that are cited by the examined documents
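The two citation-based features can be sketched as simple counts. This is a minimal illustration, assuming the usual reading that two documents are co-cited when a third document cites both, and bibliographically coupled when they share a reference. The citation data and the function names `co_citation` and `bibliographic_coupling` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical citation data: each key cites the documents in its set.
cites = {
    "P1": {"A", "B"},
    "P2": {"A", "B", "C"},
    "A": {"X", "Y"},
    "B": {"Y", "Z"},
    "C": {"Z"},
}

def co_citation(cites):
    """For each pair of documents, count how many documents cite both."""
    counts = Counter()
    for cited in cites.values():
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1
    return counts

def bibliographic_coupling(cites):
    """For each pair of citing documents, count the references they share."""
    counts = Counter()
    for a, b in combinations(sorted(cites), 2):
        shared = len(cites[a] & cites[b])
        if shared:
            counts[(a, b)] = shared
    return counts

print(co_citation(cites)[("A", "B")])                # A and B are cited together by P1 and P2
print(bibliographic_coupling(cites)[("P1", "P2")])   # P1 and P2 share the references A and B
```

Either count can serve as a similarity feature between two documents in place of, or alongside, term-based features.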
7
Research on Document Clustering
Measures of relevance between documents
–Highly dependent on the choice of features used to represent documents
–Several relevance measures: vector space model (VSM, Salton); latent semantic indexing (LSI, Schütze)
–LSI is based on the singular value decomposition (SVD) algorithm, reduces the dimensions of the feature vectors in VSM, and exploits latent semantic features of documents
Measure of relevance between documents d_i and d_j
–w_ik and w_jk: weights of the kth term in d_i and d_j, computed from the frequency of the kth term in the document and the inverse document frequency of the kth term
–L: vocabulary size
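A minimal sketch of such a relevance measure, assuming the common cosine measure of the vector space model with weights w_ik = tf_ik × idf_k. The exact formula and the logarithmic form of idf are assumptions, and the function names `tfidf_vectors` and `cosine` are illustrative.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document, with
    weight = term frequency * log(N / document frequency)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / df[term]) for term in df}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine of the angle between two sparse weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = [
    ["document", "clustering", "retrieval"],
    ["document", "clustering", "algorithm"],
    ["web", "link", "analysis"],
]
vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]))  # shares two terms, so greater than zero
print(cosine(vectors[0], vectors[2]))  # no shared terms, so 0.0
```

Sparse dictionaries are used so that the sum only runs over terms that actually occur, rather than over the whole vocabulary of size L.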
8
Research on Document Clustering
Clustering algorithms
–Agglomerative hierarchical clustering (AHC) algorithm, O(N^2)
1. Put each document in the collection into its own cluster
2. Identify the two closest clusters and combine them into a new cluster
3. Repeat Step 2 until the halting criterion is met
–K-Means algorithm, O(NK)
–Buckshot algorithm: a fast, linear-time algorithm; a K-Means algorithm in which the initial cluster centroids are created by applying AHC to a sample of the documents in the collection
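The three algorithms fit together as sketched below. This is an illustrative sketch, not a tuned implementation. It assumes dense numeric feature vectors, Euclidean distance between cluster centroids, a fixed number of K-Means iterations as the halting criterion, and a sample of about sqrt(K·N) documents for Buckshot. The naive AHC loop here is simple rather than efficient.

```python
import math
import random

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(xs) / len(points) for xs in zip(*points)]

def ahc(points, k):
    """Start with one cluster per point, then repeatedly merge the two
    clusters with the closest centroids until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

def k_means(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, recompute the centroids,
    and repeat for a fixed number of iterations (at least one)."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        # An empty group keeps its previous centroid.
        centroids = [centroid(g) if g else c for g, c in zip(groups, centroids)]
    return groups

def buckshot(points, k, seed=0):
    """K-Means seeded with centroids obtained by running AHC on a sample."""
    rng = random.Random(seed)
    sample_size = min(len(points), max(k, math.isqrt(k * len(points))))
    sample = rng.sample(points, sample_size)
    seeds = [centroid(cluster) for cluster in ahc(sample, k)]
    return k_means(points, seeds)

points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
print(buckshot(points, k=2))
```

Running AHC only on the sample keeps its quadratic cost small, while the K-Means pass over the full collection stays linear in the number of documents.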
9
Possible Applications of Document Clustering
10
Query Routing
Documents are distributed across several information servers
–Relevant documents are clustered and placed on one server or on nearby servers
–A description is generated to represent all of the documents in a cluster
When retrieval takes place
–Identify relevant clusters based on the relevance between the query and the cluster descriptions
–Forward the query to the servers for those clusters
–Merge the results
An example
–Query: document clustering
–Clusters: Library Science, Computer Science, Zoology, Geology
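The routing steps can be sketched as follows. This is a simplified illustration. It assumes that a cluster description is the set of its most frequent terms and that relevance is plain term overlap. The server contents and the function names `describe`, `route`, and `search` are hypothetical.

```python
from collections import Counter

def describe(docs, top_n=3):
    """Describe a cluster by its top_n most frequent terms."""
    counts = Counter(term for doc in docs for term in doc)
    return {term for term, _ in counts.most_common(top_n)}

def route(query_terms, descriptions, max_servers=2):
    """Return the servers whose description overlaps the query, best first."""
    query = set(query_terms)
    scored = [(len(query & desc), name) for name, desc in descriptions.items()]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [name for _, name in ranked[:max_servers]]

def search(query_terms, servers, descriptions):
    """Forward the query to the routed servers and merge the ranked results."""
    query = set(query_terms)
    results = []
    for name in route(query_terms, descriptions):
        for doc_id, terms in servers[name].items():
            score = len(query & set(terms))
            if score:
                results.append((score, doc_id))
    return [doc_id for _, doc_id in sorted(results, key=lambda r: (-r[0], r[1]))]

# Hypothetical servers, one cluster of documents per server.
servers = {
    "library_science": {
        "ls1": ["document", "clustering", "catalog"],
        "ls2": ["document", "clustering", "catalog", "library"],
    },
    "computer_science": {
        "cs1": ["document", "clustering", "algorithm"],
        "cs2": ["document", "clustering", "algorithm", "graph"],
    },
    "zoology": {"z1": ["animal", "species", "habitat"]},
    "geology": {"g1": ["rock", "mineral", "strata"]},
}
descriptions = {name: describe(list(docs.values())) for name, docs in servers.items()}

print(route(["document", "clustering"], descriptions))
print(search(["document", "clustering"], servers, descriptions))
```

Only the servers selected by `route` are contacted, which is where the savings over broadcasting the query to every server come from.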
11
Cluster-based Browsing
The problem of expressing a vague information need as a formal query
Scatter/Gather (Cutting et al., SIGIR’92)
–Documents are clustered into topic-coherent groups
–Descriptive summaries of the clusters are presented to users
–Users browse the cluster hierarchy and select promising clusters
–Documents in the selected clusters are clustered again and new summaries are generated
–Finally, documents are retrieved
[Figure: example clusters Library Science, Computer Science, Zoology, Geology; subclusters Information Retrieval, Library Automation]
12
Result Set Clustering
Users’ queries are often very short (about 1-3 words)
–The result set includes both relevant and irrelevant documents
Clustering the documents in the result set according to their degree of relevance
–Helps users figure out their real information needs
–Makes relevant documents easier to retrieve
An example
–Query: Multimedia
–Clusters: Hypermedia, Virtual Reality, Video
13
Result Set Expansion
Relevant documents may not match the input queries well
Relevant documents are clustered using sophisticated features and clustering algorithms in the data preparation phase
A core set of documents that match the query is retrieved
The results are expanded with documents that do not match the query but are clustered with documents in the core set
[Figure: Query → Core Set → Expanded Result Set]
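A minimal sketch of the expansion step, assuming that cluster assignments were computed beforehand and that a document matches the query when it contains every query term. The data and the function name `expand_results` are hypothetical.

```python
def expand_results(query_terms, docs, cluster_of):
    """Return (core, expansion).

    docs:       {doc_id: list of terms}
    cluster_of: {doc_id: cluster id}, computed in the data preparation phase
    """
    query = set(query_terms)
    # Core set: documents that contain every query term.
    core = [d for d, terms in docs.items() if query <= set(terms)]
    core_clusters = {cluster_of[d] for d in core}
    # Expansion: non-matching documents that share a cluster with the core set.
    expansion = [d for d in docs
                 if d not in core and cluster_of[d] in core_clusters]
    return core, expansion

docs = {
    "d1": ["document", "clustering"],
    "d2": ["text", "grouping"],
    "d3": ["rock", "strata"],
}
cluster_of = {"d1": 0, "d2": 0, "d3": 1}

core, expansion = expand_results(["clustering"], docs, cluster_of)
print(core, expansion)  # d2 is added because it shares a cluster with d1
```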
14
Query Refinement
Terms in queries do not match the information needs of users
Dynamically computing and suggesting recall- and precision-enhancing terms for a given query
Term suggestion
–Grouping retrieved documents into topic-cohesive clusters
–Terms in centroid documents: general concepts
–Terms in margin documents: specific concepts
15
Document Clustering in a Networked Environment
16
Web Pages vs. Plain Texts
The lexical distributions of these two kinds of documents are significantly different
–Web pages include more proper nouns and terms but fewer verbs
Information in web pages may be in multimedia form
–Currently difficult to represent and retrieve
Web pages contain rich link information
–More than 90% of web pages include tags
–Each web page contains 15 links on average
Term-based clustering techniques designed for plain texts are not directly applicable to web pages
Link structure provides useful information for determining relevance among web pages
17
HTML Tags in Web Pages
Tags provide helpful information for understanding the meaning expressed by the pages
–Tags for web composition: Bold, Italic, Underline, Font
–Tags for document structure: Title, Header, Headline, List Items
–Tags for link structure across pages: Anchor
–Tagged terms are information that the authors consider important
Tagged terms could be given extra weight to enhance the effectiveness of retrieval
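One way to weight tagged terms is sketched below with Python's standard `html.parser` module. The weight values in `TAG_WEIGHTS` are hypothetical (the slide gives none), and the rule that a term takes the largest weight among its enclosing tags is an assumption.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights per tag; untagged text counts 1.0.
TAG_WEIGHTS = {"title": 5.0, "h1": 4.0, "h2": 3.0, "a": 2.0,
               "b": 1.5, "i": 1.5, "u": 1.5, "li": 1.2}

class WeightedTermParser(HTMLParser):
    """Accumulate a weight for each term based on the tags enclosing it."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.weights = Counter()

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        weight = max((TAG_WEIGHTS.get(t, 1.0) for t in self.open_tags),
                     default=1.0)
        for term in data.lower().split():
            self.weights[term] += weight

def weight_terms(html):
    """Return a Counter mapping each term in the page to its total weight."""
    parser = WeightedTermParser()
    parser.feed(html)
    parser.close()
    return parser.weights

page = ("<html><head><title>Document Clustering</title></head>"
        "<body><p>clustering of <b>web</b> pages</p></body></html>")
print(weight_terms(page).most_common(3))
```

The resulting weights can replace raw term frequencies when building the feature vectors used for clustering or retrieval.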
18
An Example of a Web Page
[Figure: sample web page with callouts for anchor text and list item tags]
19
Connectivity Analysis
A link between two pages establishes a relation between them
The similarity between two pages could be estimated using
–The length of the shortest path between the two pages
–The path length from the two pages to their least common ancestor
–The path length from the two pages to their greatest common descendants
[Figure: link graph over pages A to J]
E is more similar to A than D
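The first two estimates can be sketched with breadth-first search. The link graph below is hypothetical and does not reproduce the slide's figure. Treating the common-ancestor estimate as the smallest combined path length from a shared ancestor is also an assumption.

```python
from collections import deque

def bfs_distances(links, source):
    """Shortest path length (in links) from source to every reachable page."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        page = queue.popleft()
        for nxt in links.get(page, ()):
            if nxt not in dist:
                dist[nxt] = dist[page] + 1
                queue.append(nxt)
    return dist

def shortest_path_length(links, source, target):
    """Length of the shortest directed path, or None if target is unreachable."""
    return bfs_distances(links, source).get(target)

def reverse(links):
    """Invert the link direction: page -> pages that link to it."""
    rev = {}
    for page, targets in links.items():
        for target in targets:
            rev.setdefault(target, []).append(page)
    return rev

def common_ancestor_distance(links, p, q):
    """Smallest combined path length from a common ancestor down to p and q."""
    rev = reverse(links)
    up_p = bfs_distances(rev, p)
    up_q = bfs_distances(rev, q)
    shared = up_p.keys() & up_q.keys()
    return min((up_p[a] + up_q[a] for a in shared), default=None)

# Hypothetical link graph: each page lists the pages it links to.
links = {"A": ["B", "C", "D"], "B": ["E", "F"], "C": ["G"], "D": ["H"]}

print(shortest_path_length(links, "A", "E"))      # 2
print(common_ancestor_distance(links, "E", "F"))  # 2, through B
print(common_ancestor_distance(links, "E", "G"))  # 4, through A
```

The estimate based on common descendants follows the same pattern with the forward links in place of the reversed ones.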
20
Information of Link Structure
Authority page: a page that contains a lot of information about the topic
–Authority: if page p has a link to page q, the authors of page p confer authority on q
–Link popularity as an indicator of page authority
Hub page: a page that has links to authority pages
Mutually reinforcing relationship
–A good hub page points to many good authority pages
–A good authority page is pointed to by many good hub pages
[Figure: hubs pointing to authorities]
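The mutually reinforcing relationship can be computed iteratively, in the manner of the hub and authority scores described in Kleinberg's report listed in the references. The sketch below is illustrative. The link graph is hypothetical, and the fixed iteration count and Euclidean normalization are assumptions.

```python
import math

def hits(links, iterations=20):
    """Return (hub, authority) score dicts for a directed link graph.

    links: {page: list of pages it links to}
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = dict.fromkeys(pages, 1.0)
    auth = dict.fromkeys(pages, 1.0)
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = dict.fromkeys(pages, 0.0)
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # A page's hub score is the sum of the authority scores it links to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical graph: three hub-like pages pointing to two candidate authorities.
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(links)
print(max(auth, key=auth.get))  # a1, the page pointed to by all three hubs
print(max(hub, key=hub.get))    # h1 or h2, which point to both authorities
```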
21
Information of Anchor Text
The text around links pointing to a page is often a description of the page
–The information in anchor text could be used to determine the relevance of the link
[Figure: distribution of “Yahoo” in the anchor texts of 5000 web pages pointing to Yahoo!]
From: http://decweb.ethz.ch/WWW7/1898/com1898.htm
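Collecting anchor text per target page can be sketched with Python's standard `html.parser` module. For simplicity this sketch only gathers the text inside the anchor element, not the surrounding text. The sample pages and the function name `collect_anchor_terms` are hypothetical.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Count the terms used in anchor text, grouped by link target."""

    def __init__(self):
        super().__init__()
        self.current_href = None
        self.anchor_terms = defaultdict(Counter)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.current_href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "a":
            self.current_href = None

    def handle_data(self, data):
        if self.current_href:
            self.anchor_terms[self.current_href].update(data.lower().split())

def collect_anchor_terms(pages):
    """Return {target URL: Counter of anchor-text terms} over all pages."""
    collector = AnchorCollector()
    for html in pages:
        collector.feed(html)
    collector.close()
    return collector.anchor_terms

pages = [
    '<p>Try <a href="http://example.org/">Example Search</a> today</p>',
    '<p>See the <a href="http://example.org/">example directory</a></p>',
]
terms = collect_anchor_terms(pages)
print(terms["http://example.org/"].most_common(1))  # "example" appears in both anchors
```

The term counts collected for a target page can then be added to that page's own features as a description written by other authors.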
22
Conclusions
23
Document clustering is an important technique for improving efficiency and effectiveness in information retrieval
–It has a wide range of possible applications
Technologies of document clustering
–Extraction of features to represent documents
–Relevance functions between documents
–Clustering algorithms
Retrieval of web information relies more and more on information about the web structure
24
Important References
P. Willett, “Recent Trends in Hierarchic Document Clustering: A Critical Review,” Information Processing and Management, 24(5), 577-597.
E. Rasmussen, “Clustering Algorithms,” Information Retrieval: Data Structures and Algorithms, ed. by W. B. Frakes and R. Baeza-Yates, Chap. 16, 419-442.
D. R. Cutting, D. Karger and J. O. Pedersen, “A Cluster-based Approach to Browsing Large Document Collections,” Proceedings of SIGIR’92, 318-329.
J. Kleinberg, “Authoritative Sources in a Hyperlinked Environment,” IBM Research Report RJ 10076, May 1997.