1
How (Not) to Use a Semi-automated Clustering Tool
Kat Hagedorn
University of Michigan
April 11, 2006
2
Update on UM’s efforts
- Built three research portals: DLF, MODS, Aquifer
- Improvements for search / display
  - Integration of MODS format records
  - Simple vs. advanced searching
  - Inclusion of thumbnails
3
The need to cluster
- Want to offer more than search within a generic, large corpus of data
- How to partition the data?
- Emory’s MetaCombine tool is promising as a topical clustering agent
- (Also interested in clustering by format, access restriction, OAI software used, etc.)
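The non-topical partitions mentioned in the last bullet amount to grouping records by a single metadata field. A minimal sketch, assuming each harvested record is available as a dictionary; the `partition_by` helper, the field names, and the sample records are all hypothetical and not taken from the talk:

```python
from collections import defaultdict


def partition_by(records, field, missing="unknown"):
    """Group record identifiers by the value of one metadata field."""
    groups = defaultdict(list)
    for record in records:
        # Records that lack the field are collected under a catch-all key.
        groups[record.get(field, missing)].append(record["id"])
    return dict(groups)


# Hypothetical harvested records; identifiers and values are invented.
records = [
    {"id": "oai:example:1", "format": "image", "access": "open"},
    {"id": "oai:example:2", "format": "text", "access": "restricted"},
    {"id": "oai:example:3", "format": "image"},
]

by_format = partition_by(records, "format")
by_access = partition_by(records, "access")
print(by_format)
print(by_access)
```

Unlike topical clustering, this needs no clustering algorithm at all, only consistent values in the chosen field.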
4
Clustering vs. classification
- Clustering is main focus
  - Huge amount of data
  - Needed a tool to “find the topic”
  - Preferably a disjunctive tool (placing files under more than one topic)
- Classification is secondary focus
  - Have potential classification (UM’s browse)
  - Marrying to current system nigh on impossible
5
Results: duration
- First tried with small repository of ~5500 records (amnh)
  - Took around 25 minutes
- Multiple tries with larger repository of ~270K records (dlps)
  - Took around 12 hours
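For comparison, the two timings can be converted to a rough throughput. This is only back-of-the-envelope arithmetic on the approximate figures quoted on the slide ("~5500", "around 25 minutes", and so on), so the results are indicative at best:

```python
def records_per_minute(records, minutes):
    """Average throughput for one clustering run."""
    return records / minutes


# Approximate figures from the slide; both counts and durations are rounded.
small_run = records_per_minute(5500, 25)          # amnh
large_run = records_per_minute(270_000, 12 * 60)  # dlps

print(f"amnh: {small_run:.0f} records/minute")  # 220
print(f"dlps: {large_run:.0f} records/minute")  # 375
```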
6
Results: cluster names
- Examples of set names from clustering UM’s metadata
  - Good: “europe”, “mechanical”, “architecture”
  - Not so good: “general”, “michigan”, “build”
  - Favorite: “southern literari literature fine messenger”
- Granted…
  - Only asked for 20 clusters
  - Didn’t cluster hierarchically
7
Caveats
- Metadata will always be difficult to cluster
- Using a tool developed as a Web service, with obvious benefits
- Expect necessity of mapping set names to real topical cluster names
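The mapping step in the last caveat could be as simple as a hand-curated lookup table. A minimal sketch, assuming the generated set names arrive as plain strings; the `DISPLAY_LABELS` table and its right-hand labels are invented for illustration and are not the names UM actually chose:

```python
# Hypothetical curated mapping from generated set names to display labels.
DISPLAY_LABELS = {
    "europe": "Europe",
    "mechanical": "Mechanical engineering",
    "southern literari literature fine messenger": "Southern literature",
}


def display_name(set_name, labels=DISPLAY_LABELS):
    """Return the curated label, falling back to the raw set name."""
    return labels.get(set_name, set_name)


def needs_review(set_names, labels=DISPLAY_LABELS):
    """List generated set names that have no curated label yet."""
    return [name for name in set_names if name not in labels]


generated = ["europe", "build", "general"]
print([display_name(name) for name in generated])
print(needs_review(generated))
```

Falling back to the raw name keeps the portal usable while names such as “build” and “general” wait for a human decision.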
8
What we need
- Running the tool locally, with a local WSDL instance, would save lots (and lots) of time
- Better set names… does this mean a better algorithm?
- Ability to cluster by any criteria, not just topic, i.e., a post-processing module
- Disjunctive clustering, meaning (so as not to hog storage) filename (not file) clustering
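The last point, clustering filenames rather than files, can be pictured as an index from cluster name to a set of filenames: a record may then sit under several topics without its file being copied. A minimal sketch under that assumption; the function names and sample assignments are hypothetical and say nothing about how MetaCombine stores its output:

```python
from collections import defaultdict


def build_index(assignments):
    """Build cluster -> set of filenames from (filename, cluster) pairs."""
    index = defaultdict(set)
    for filename, cluster in assignments:
        index[cluster].add(filename)
    return index


def clusters_for(index, filename):
    """Return, in sorted order, every cluster that lists the filename."""
    return sorted(name for name, members in index.items() if filename in members)


# Invented assignments; rec1.xml is placed under two topics (disjunctive).
assignments = [
    ("rec1.xml", "europe"),
    ("rec1.xml", "architecture"),
    ("rec2.xml", "mechanical"),
]

index = build_index(assignments)
print(clusters_for(index, "rec1.xml"))
```

Only the short filename strings are repeated across clusters, which is the storage saving the slide is after.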