How (Not) to Use a Semi-automated Clustering Tool Kat Hagedorn University of Michigan April 11, 2006.

1 How (Not) to Use a Semi-automated Clustering Tool Kat Hagedorn University of Michigan April 11, 2006

2 Update on UM’s efforts
- Built three research portals
  - DLF
  - MODS
  - Aquifer
- Improvements for search / display
  - Integration of MODS format records
  - Simple vs. advanced searching
  - Inclusion of thumbnails

3 The need to cluster
- Want to offer more than search within a generic, large corpus of data
- How to partition the data?
- Emory’s MetaCombine tool promising as a topical clustering agent
- (Also interested in clustering by format, access restriction, OAI software used, etc.)

4 Clustering vs. classification
- Clustering is main focus
  - Huge amount of data
  - Needed a tool to “find the topic”
  - Preferably a disjunctive tool (placing files under more than one topic)
- Classification is secondary focus
  - Have potential classification (UM’s browse)
  - Marrying to current system nigh on impossible

5 Results: duration
- First tried with small repository of ~5500 records (amnh)
  - Took around 25 minutes
- Multiple tries with larger repository of ~270K records (dlps)
  - Took around 12 hours
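For a rough sense of scale, the two runs above can be turned into throughput figures. This is only back-of-the-envelope arithmetic on the approximate numbers from the slide ("~5500", "around 25 minutes", and so on); the function name is made up for illustration.

```python
def records_per_minute(records: int, minutes: float) -> float:
    """Average throughput for one clustering run."""
    return records / minutes

# Approximate figures taken from the slide; both are rounded estimates.
amnh_rate = records_per_minute(5_500, 25)         # small repository
dlps_rate = records_per_minute(270_000, 12 * 60)  # large repository

print(f"amnh: {amnh_rate:.0f} records/minute")
print(f"dlps: {dlps_rate:.0f} records/minute")
```

On these numbers the larger run did not get slower per record, so the 12 hours reflects the size of the repository more than any degradation at scale.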

6 Results: cluster names
- Examples of set names from clustering UM’s metadata
  - Good: “europe”, “mechanical”, “architecture”
  - Not so good: “general”, “michigan”, “build”
  - Favorite: “southern literari literature fine messenger”
- Granted…
  - Only asked for 20 clusters
  - Didn’t cluster hierarchically

7 Caveats
- Metadata will always be difficult to cluster
- Using a tool developed as a Web service, with obvious benefits
- Expect necessity of mapping set names to real topical cluster names
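The mapping step in the last caveat could be as simple as a hand-maintained lookup table. The sketch below is an assumption about how that might look, not something described in the talk: the set names come from the "Results: cluster names" slide, while the display names, the `display_name` helper, and the fallback label are all hypothetical.

```python
# Generated set name -> curated topical name.
# Left-hand keys are from the talk; right-hand values are illustrative guesses.
SET_NAME_TO_TOPIC = {
    "europe": "Europe",
    "mechanical": "Mechanical Engineering",
    "architecture": "Architecture",
    "southern literari literature fine messenger": "Southern Literary Messenger",
}

def display_name(set_name: str, mapping: dict[str, str],
                 fallback: str = "Uncategorized") -> str:
    """Return the curated name for a generated set name, or a fallback."""
    return mapping.get(set_name.strip().lower(), fallback)

print(display_name("mechanical", SET_NAME_TO_TOPIC))
print(display_name("build", SET_NAME_TO_TOPIC))  # no curated name yet
```

Weak set names such as “general” or “build” fall through to the fallback, which makes it easy to see which clusters still need a human-assigned label.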

8 What we need
- Running the tool locally, with a local WSDL instance, would save lots (and lots) of time
- Better set names… does this mean a better algorithm?
- Ability to cluster by any criteria, not just topic, i.e., a post-processing module
- Disjunctive clustering, meaning (so as not to hog storage) filename (not file) clustering
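The last point, clustering filenames rather than files, amounts to keeping an index from each topic to a set of filenames, so a record can sit under several topics without being copied. The following is a minimal sketch of that idea only; the function name, the input shape, and the sample filenames are hypothetical, and nothing here reflects how MetaCombine stores its output.

```python
from collections import defaultdict

def build_cluster_index(assignments):
    """Build a topic -> set-of-filenames index from (filename, topic) pairs.

    Only the filename string is stored under each topic, so a file that
    belongs to several clusters is never duplicated on disk.
    """
    index = defaultdict(set)
    for filename, topic in assignments:
        index[topic].add(filename)
    return index

# Hypothetical assignments: one record placed under two topics.
assignments = [
    ("rec-0001.xml", "europe"),
    ("rec-0001.xml", "architecture"),
    ("rec-0002.xml", "mechanical"),
]
index = build_cluster_index(assignments)

for topic in sorted(index):
    print(topic, sorted(index[topic]))
```

The same structure would work for the non-topical criteria mentioned above (format, access restriction, OAI software), since the key can be any label produced in post-processing.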

