OAIster: Metadata Pointing to Digital Objects Kat Hagedorn Metadata Harvesting/DLXS Librarian University of Michigan Libraries February 18, 2004
background One-year Mellon grant project to test the feasibility of making OAI-enabled metadata for digital objects accessible to the public Digital Library Production Service at University of Michigan Libraries began work in December 2001 Launched in June 2002
highlights Any audience Any subject matter Any format Freely accessible No dead ends One-stop shopping …retrieving the “hidden web”
tool we borrowed University of Illinois at Urbana-Champaign open-source OAI protocol harvester Java edition for our Unix environment Worked collaboratively to iron out kinks –resumptionToken / retryAfter –inexplicable kill –bogus records in MySQL table
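The resumptionToken / retryAfter kinks concern two OAI-PMH flow-control mechanisms. A repository returns a long list in chunks, each ending in a resumptionToken that the harvester sends back to get the next chunk. It may also answer HTTP 503 with a Retry-After header asking the harvester to wait. The following is a minimal Python sketch of that loop, not the UIUC harvester's code. `harvest`, `fetch`, and the `(status, headers, body)` return shape are hypothetical names chosen for illustration, and Retry-After is assumed to be given in seconds.

```python
import time
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def harvest(fetch, metadata_prefix="oai_dc", sleep=time.sleep, max_retries=5):
    """Yield every <record> element from a ListRecords harvest.

    `fetch(params)` is a hypothetical callable supplied by the caller; it
    performs the HTTP GET and returns (status_code, headers, body_text).
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    retries = 0
    while True:
        status, headers, body = fetch(params)
        if status == 503:
            # The repository asks us to back off. Assumption: Retry-After
            # holds a number of seconds, not an HTTP date.
            retries += 1
            if retries > max_retries:
                raise RuntimeError("repository kept answering 503")
            sleep(int(headers.get("Retry-After", "60")))
            continue
        retries = 0
        root = ET.fromstring(body)
        for record in root.iter(OAI_NS + "record"):
            yield record
        token = root.find(".//" + OAI_NS + "resumptionToken")
        if token is None or not (token.text or "").strip():
            return  # no token, or an empty one: the list is complete
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

Passing `fetch` and `sleep` in as arguments keeps the loop testable without a network connection or real waiting.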
development environment Digital Library Extension Service (DLXS) Develops open-source middleware and licenses the XPAT search engine for building and mounting digital libraries Middleware consists of document classes, i.e., Text, Image, Bib, FindAid Originally designed to make SGML-encoded texts available online
tool we developed Runs in DLXS environment using BibClass Current BibClass web templates modified Additional Java-based transformation tool to: –Concatenate DC metadata records –Filter out records with no digital object –Count records –Convert from UTF-8 to ISO –Transform DC records into BibClass records using XSLT
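The filtering and counting steps can be sketched as follows. This is a Python illustration, not the Java tool itself. It assumes a hypothetical heuristic in which a record "has a digital object" when at least one dc:identifier looks like a URL; the rule OAIster actually applied is not stated here. `has_object_link` and `filter_records` are names invented for the example.

```python
import xml.etree.ElementTree as ET

DC_NS = "{http://purl.org/dc/elements/1.1/}"


def has_object_link(record):
    """Assumed heuristic: the record points somewhere if any
    dc:identifier value looks like a URL."""
    for ident in record.iter(DC_NS + "identifier"):
        value = (ident.text or "").strip().lower()
        if value.startswith(("http://", "https://", "ftp://")):
            return True
    return False


def filter_records(records):
    """Return (kept_records, counts) for an iterable of parsed DC records."""
    kept = []
    counts = {"seen": 0, "kept": 0, "dropped": 0}
    for record in records:
        counts["seen"] += 1
        if has_object_link(record):
            kept.append(record)
            counts["kept"] += 1
        else:
            counts["dropped"] += 1
    return kept, counts
```

A URL in the identifier does not guarantee that the link reaches the object rather than a further description of it, so a check like this can only be a first pass.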
system design UIUC harvester Record storage XSLT transformation tool BibClass indexes OAI-enabled DC records Non-OAI-enabled DC records XSL stylesheets (per source type) Search interface (XPAT)
result One place to look for digital objects Big –3,016,251 metadata records –267 institutions (as of last week…) Popular –Averages 3300 search sessions / month –Picked up in March ‘03: average 3500 now –43,894 searches in one year (June 2002 – July 2003)
search
limiters
sort
results
repositories
repositories: e.g., arXiv Eprint Archive: math and physics pre- and post-prints Online Archive of California: manuscripts, photographs, and works of art held in institutions across California Sammelpunkt, Elektronisch Archivierte Theorie: archive of philosophical publications British Women Romantic Poets Project: collection of poems written by British women between 1789 and 1832
repositories: stats As of February ‘04, out of 267 repositories… International and U.S. –U.S.: 50.5% (135) –Intl: 49.5% (132) By subject –Humanities: 24% (65) –Science: 30% (81) –Mixed: 46% (121) E-prints and pre-prints –Using eprints.org software: 39% (104) –Not using eprints.org software: 61% (163)
major issues encountered Metadata variation Records not leading to digital objects Access restrictions on digital objects described in records Duplicate records for a single digital object
issue: metadata variation With more records, users need more restrictions Consistent metadata needed to facilitate these restrictions One option: normalization of data
issue: metadata variation Type: the obvious quick win –240 metadata values mapped to four generic values (text, image, audio, video) –e.g., audio, sound = audio motion, animation, newsreels, etc. = video watercolour, watercolor, slides, etc. = image article, articles, booklet, diss, story, etc. = text
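A lookup table is one simple way to express this kind of mapping. The Python sketch below uses only the example values listed on the slide, not the full set of 240, and `TYPE_MAP` and `normalize_type` are hypothetical names. Returning `None` for unmapped values is an assumption about how leftovers might be handled.

```python
# Sample raw dc:type values mapped to the four generic values.
TYPE_MAP = {
    "audio": "audio", "sound": "audio",
    "motion": "video", "animation": "video", "newsreels": "video",
    "watercolour": "image", "watercolor": "image", "slides": "image",
    "article": "text", "articles": "text", "booklet": "text",
    "diss": "text", "story": "text",
}


def normalize_type(raw):
    """Map a raw type value to text/image/audio/video, or None if unmapped."""
    key = " ".join((raw or "").split()).lower()  # collapse whitespace, lowercase
    return TYPE_MAP.get(key)
```

For example, `normalize_type(" Watercolour ")` gives `"image"`, while a value missing from the table gives `None` and can be set aside for review.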
issue: metadata variation Date: where to begin? –Most records with at least one date –Some records include up to seven dates –No consistent style of date Subject: out of context, what meaning? –Many records with at least one subject element –But over 100 records with more than 50 subjects –And one record with 1000!
issue: metadata variation Sample date values between 1827 and ? November 13, 1947 SEP bce Summer, 1948
issue: metadata variation Sample subject values 30,51, , Apr. 22. E[veritt] Judson, letter to Philuta [Judson]. Slavery--United States--Controversial literature view of interior with John Henry sculpture Particles (Nuclear physics) -- Research.
issue: no digital objects Some records contain links to further description of digital object But not the digital object itself Culling difficult One option: add explanatory text to site Or, unfortunately, spot-check and remove repositories with this issue
issue: access restrictions No records where metadata itself is restricted in use (as far as we know!) Definitely some records where objects are restricted to licensed users One option: add explanatory text to site Or sub-set OAIster into free and “partially” free repositories
issue: duplicate records Two records harvested, different identifiers, same object described and pointed to Two records harvested inadvertently through aggregators and original repositories
issue: duplicate records Need algorithm to automate de-duplication Once duplicates are identified, how to deal with them? –Suppress? –Group? –Flag? So far, not addressed in OAIster
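One possible starting point for such an algorithm is to group records by the URL of the object they point to. The Python sketch below shows that idea only; it is not something OAIster implemented. The `(oai_identifier, object_url)` input shape, the function names, and the normalization rules (lowercase scheme and host, drop trailing slash and fragment) are all assumptions made for the example.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Reduce trivial differences between URLs for the same object."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def group_duplicates(records):
    """records: iterable of (oai_identifier, object_url) pairs.

    Returns lists of OAI identifiers whose records share one object URL.
    """
    groups = defaultdict(list)
    for oai_id, url in records:
        groups[normalize_url(url)].append(oai_id)
    return [ids for ids in groups.values() if len(ids) > 1]
```

Grouping leaves the suppress / group / flag decision open, since each returned list can then be handled by whichever policy is chosen. It will miss duplicates whose records point to the same object through different URLs.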
future of OAIster Advanced searching Grouping to aid browsing Further normalization of data Handling duplicate records Saving/e-mailing/downloading records Collaboration with other services: search, instructional… More user testing…
current state of protocol Popular As Peter Suber says: –“…no other single idea or technology in the [open-source movement] has enjoyed this density of endorsement and adoption in a six month period.” Data providers over one year: –June ‘02: 56 repositories / 274,062 records –June ‘03: 187 repositories / 1,246,953 records –Over three-fold increase for repositories –Over four-fold increase for records
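The fold increases follow directly from the two snapshots. A quick check in Python, where `fold_increase` is just an illustrative helper name:

```python
def fold_increase(before, after):
    """Ratio of the later count to the earlier one."""
    return after / before


repositories = fold_increase(56, 187)           # a little over 3
records = fold_increase(274_062, 1_246_953)     # a little over 4.5
```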
future of protocol Branching out –DC required vs. highly recommended –Use of OAI in closed environments –Static repository protocol –OAI-rights committee OAI evangelism
contact info Kat Hagedorn University of Michigan Libraries, Digital Library Production Service