Tim Brody
University of Southampton
CiteBase Services
13/07/2001

Content
History
What is CiteBase
Problems:
–Searching & Information Retrieval
–OAI & Distribution Problems
–Usage Questions
Future

History
Researcher for OpCit summer 2000
Started CiteBase as part of my 3rd year project Sept
Dienst/Santa Fe
Most work done during Spring 2001
Thesis completed May 2001

What is CiteBase
Prototype Database (MySQL)
MetaData - OAI (arXiv, cogprints)
Citation Data - OpCit (arXiv)
Ranked searches, a la Google/CiteSeer
Static hit data, demo of other criteria
Re-exports metadata + citation data via OAI, opcit_dc => AMF?

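The "ranked searches" idea can be sketched as a join between a metadata table and a citation table, ordering matches by how often each record is cited. This is only an illustration: the schema, table names and sample rows are assumptions, and SQLite (from the Python standard library) stands in for the MySQL database named on the slide.

```python
import sqlite3

# Hypothetical schema: one row per record, one row per citation link.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE records (id TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE citations (citing TEXT, cited TEXT);
""")
conn.executemany("INSERT INTO records VALUES (?, ?)", [
    ("a1", "Quantum gravity in two dimensions"),
    ("a2", "Notes on quantum field theory"),
    ("a3", "Classical mechanics revisited"),
])
conn.executemany("INSERT INTO citations VALUES (?, ?)", [
    ("a1", "a2"), ("a3", "a2"), ("a2", "a1"),
])

def ranked_search(conn, term):
    """Return (id, title, citation_count) for matching titles, most cited first."""
    sql = """
        SELECT r.id, r.title, COUNT(c.citing) AS cites
        FROM records r
        LEFT JOIN citations c ON c.cited = r.id
        WHERE r.title LIKE ?
        GROUP BY r.id, r.title
        ORDER BY cites DESC, r.id
    """
    return conn.execute(sql, (f"%{term}%",)).fetchall()

results = ranked_search(conn, "quantum")
```

The LEFT JOIN keeps records that nobody cites (count 0), so ranking reorders the matches rather than filtering them.
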
Problems 1
Searching & Information Retrieval
–Large data sets: ~170,000 records, 170 MB of searchable data, potentially millions (bigger than web?)
–Requires custom ranking, not just best match
–SQL search is >O(N)
–MySQL text-index is too fuzzy
–ARC uses Oracle, expensive!
–SQL best solution for metadata?

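One way to read the scan-cost and custom-ranking points together: a substring search has to touch every record, whereas an inverted index looks up each query word directly and leaves the ordering to whatever score the service chooses. The sketch below is a toy in-memory version under that assumption; it is not how CiteBase was built, and the function names and sample data are hypothetical.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of record ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query, score):
    """Ids matching every query word, ordered by a caller-supplied score."""
    words = query.lower().split()
    if not words:
        return []
    # Exact word match only: no stemming or fuzzy matching.
    hits = set.intersection(*(index.get(w, set()) for w in words))
    # Highest score first; ties broken by id for a stable order.
    return sorted(hits, key=lambda d: (-score.get(d, 0), d))

# Hypothetical sample data: titles and citation counts.
docs = {
    "a1": "quantum gravity in two dimensions",
    "a2": "notes on quantum field theory",
    "a3": "classical field theory",
}
cites = {"a1": 1, "a2": 5, "a3": 9}
index = build_index(docs)
```

Here `search(index, "field theory", cites)` ranks `a3` ahead of `a2` because the score, not the text match, decides the order.
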
Problems 2
OAI & Distribution Problems
–Reliant upon source archives … (XML problems, format, semantic, reliability)
–"Harvest from when?" problem
–Identifier change problems/deletion
–Redistribution: should datestamps be changed?
–Subjects: nice idea, in practice …
–Only texts identified; what about people/institutions/journals?
–No clear solution for peer-archives/conflicts

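A possible approach to the "harvest from when?" and deletion points: request records from slightly before the last successful harvest, then merge by identifier so the overlap does not create duplicates. The sketch assumes the OAI `ListRecords` verb with its `from` and `metadataPrefix` arguments and day-granularity datestamps; the base URL, the one-day overlap and the `None`-means-deleted convention are illustrative assumptions, not part of the protocol or of CiteBase.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix, last_harvest=None, overlap_days=1):
    """Build a ListRecords request, backing off from the last harvest date."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if last_harvest is not None:
        # Re-request the overlap so same-day updates are not missed.
        start = last_harvest - timedelta(days=overlap_days)
        params["from"] = start.isoformat()
    return base_url + "?" + urlencode(params)

def merge_harvest(store, harvested):
    """Upsert by identifier; overlap records overwrite instead of duplicating."""
    for identifier, record in harvested:
        if record is None:
            # Assumption: None marks a record the source reports as deleted.
            store.pop(identifier, None)
        else:
            store[identifier] = record
    return store

# Hypothetical archive URL, for illustration only.
url = list_records_url("http://archive.example.org/oai", "oai_dc", date(2001, 7, 13))
```

This does not address changed identifiers: a record that reappears under a new identifier would still be stored twice.
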
Problems 3
Usage Questions
–Should we store full-text?
–Who is going to use these services?
–Are we all things to all people, or subject specific?
–How careful should we be with ranking (how do we prevent abuse)?
–What archives do we expose (trust question)?
–How to keep citation links up-to-date?
–Multiple language handling?

Future
Implement AMF at source archives
Re-assess metadata requirements/storage
Find a better solution for I.R.: Cheshire, gnoSearch, Oracle?
Prevent abuse (self-citation etc.)
Implement usage tracking (hit ranking)

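As one possible reading of "prevent abuse (self-citation etc.)", citation counts could skip any link where the citing and cited papers share an author. The sketch below assumes author names are already normalised to comparable strings, which is itself a hard problem; the function name and sample data are hypothetical.

```python
def citation_counts(authors, citations, exclude_self=True):
    """Count citations per paper, optionally ignoring self-citations.

    authors:   dict mapping paper id -> set of author names
    citations: iterable of (citing_id, cited_id) pairs
    """
    counts = {paper: 0 for paper in authors}
    for citing, cited in citations:
        if cited not in counts:
            continue  # cited paper is outside the database
        # A self-citation here means the two papers share at least one author.
        shared = authors.get(citing, set()) & authors[cited]
        if exclude_self and shared:
            continue
        counts[cited] += 1
    return counts

# Hypothetical sample data.
authors = {"p1": {"Smith"}, "p2": {"Smith", "Jones"}, "p3": {"Lee"}}
citations = [("p2", "p1"), ("p3", "p1"), ("p1", "p2"), ("p3", "p2")]
filtered = citation_counts(authors, citations)
raw = citation_counts(authors, citations, exclude_self=False)
```

Comparing `raw` with `filtered` shows how much of a paper's count comes from its own authors, which could feed a ranking penalty rather than a hard exclusion.
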