How (Not) to Use a Semi-automated Clustering Tool
Kat Hagedorn
University of Michigan
April 11, 2006
Update on UM’s efforts
  - Built three research portals
      - DLF
      - MODS
      - Aquifer
  - Improvements for search / display
      - Integration of MODS format records
      - Simple vs. advanced searching
      - Inclusion of thumbnails
The need to cluster
  - Want to offer more than search within a generic, large corpus of data
  - How to partition the data?
  - Emory’s MetaCombine tool promising as a topical clustering agent
  - (Also interested in clustering by format, access restriction, OAI software used, etc.)
Clustering vs. classification
  - Clustering is main focus
      - Huge amount of data
      - Needed a tool to “find the topic”
      - Preferably a disjunctive tool (placing files under more than one topic)
  - Classification is secondary focus
      - Have potential classification (UM’s browse)
      - Marrying to current system nigh on impossible
Results: duration
  - First tried with small repository of ~5500 records (amnh)
      - Took around 25 minutes
  - Multiple tries with larger repository of ~270K records (dlps)
      - Took around 12 hours
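
As a rough check on the two timings above, the throughput of each run can be worked out directly. This is a minimal sketch using only the approximate figures from the slide; the dictionary layout and the function name `records_per_minute` are illustrative choices, not part of the clustering tool.

```python
# Approximate figures from the slide ("~5500 records", "around 25 minutes", etc.).
runs = {
    "amnh": {"records": 5_500, "minutes": 25},
    "dlps": {"records": 270_000, "minutes": 12 * 60},
}


def records_per_minute(records, minutes):
    """Return the average number of records processed per minute."""
    return records / minutes


# Compute the average rate for each repository.
rates = {name: records_per_minute(**run) for name, run in runs.items()}

for name, rate in rates.items():
    print(f"{name}: ~{rate:.0f} records/minute")
```

Since the inputs are themselves approximate, the resulting rates are only order-of-magnitude guides to how long a larger repository might take.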
Results: cluster names
  - Examples of set names from clustering UM’s metadata
      - Good: “europe”, “mechanical”, “architecture”
      - Not so good: “general”, “michigan”, “build”
      - Favorite: “southern literari literature fine messenger”
  - Granted…
      - Only asked for 20 clusters
      - Didn’t cluster hierarchically
Caveats
  - Metadata will always be difficult to cluster
  - Using a tool developed as a Web service, with obvious benefits
  - Expect necessity of mapping set names to real topical cluster names
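
The last caveat, mapping generated set names to real topical names, could be handled with a small hand-maintained lookup table. The sketch below is an assumption about how that might look, not a description of what UM built: the right-hand labels are illustrative guesses, and `DISPLAY_NAMES`, `display_name`, and `unmapped` are hypothetical names.

```python
# Hypothetical curated mapping from generated set names to display labels.
# The left-hand names appear on the "Results: cluster names" slide;
# the right-hand labels are illustrative guesses only.
DISPLAY_NAMES = {
    "europe": "Europe",
    "mechanical": "Mechanical Engineering",
    "architecture": "Architecture",
    "build": "Buildings",
}


def display_name(set_name, mapping=DISPLAY_NAMES):
    """Return the curated label, falling back to the raw set name."""
    return mapping.get(set_name, set_name)


def unmapped(set_names, mapping=DISPLAY_NAMES):
    """List the set names that still need a human-assigned label."""
    return [name for name in set_names if name not in mapping]


generated = ["europe", "general", "build", "michigan"]
print([display_name(name) for name in generated])
print("Needs review:", unmapped(generated))
```

Falling back to the raw name keeps the portal usable while leaving vague sets such as “general” visible for manual review.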
What we need
  - Running the tool locally, with a local WSDL instance, would save lots (and lots) of time
  - Better set names…does this mean a better algorithm?
  - Ability to cluster by any criterion, not just topic, i.e., a post-processing module
  - Disjunctive clustering, meaning (so as not to hog storage) filename (not file) clustering
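
The filename (not file) idea in the last bullet can be sketched as an index that stores only record identifiers under each cluster, so a record placed under several topics is never copied. This is an illustration under stated assumptions: the record identifiers, the input shape, and the function name `build_cluster_index` are hypothetical and do not reflect MetaCombine's actual output format.

```python
from collections import defaultdict


def build_cluster_index(assignments):
    """Map each cluster name to the set of record identifiers placed in it.

    `assignments` is an iterable of (record_id, cluster_names) pairs.
    Only identifiers are stored, so a record that belongs to several
    clusters costs one short string per cluster rather than a file copy.
    """
    index = defaultdict(set)
    for record_id, cluster_names in assignments:
        for cluster in cluster_names:
            index[cluster].add(record_id)
    return index


# Hypothetical assignments: rec-002 belongs to two clusters at once.
assignments = [
    ("rec-001", ["europe"]),
    ("rec-002", ["europe", "architecture"]),
    ("rec-003", ["mechanical"]),
]

index = build_cluster_index(assignments)
for cluster in sorted(index):
    print(cluster, sorted(index[cluster]))
```

The same structure would also serve clustering by non-topical criteria such as format or access restriction, since the keys are arbitrary labels.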