Metadata Harvesting
  The Hague, 13 & 14 January 2009
  Julie Verleyen, Scientific Coordinator, Europeana Office
  EuropeanaLocal Knowledge Sharing Workshop
Table of Contents
  Harvesting in Europeana: workflow and requirements
  Best-practices
  Recommendations
  Common issues
  Tools / Software
  Resources
  Documentation
Harvesting in Europeana
  1. Determine collections to be contributed
    Questionnaire
Harvesting in Europeana
  2. Obtain OAI-PMH repository parameters:
    – Absolute minimum (enough for fully implemented, tested and documented OAI repositories)
        Server base URL
    – Very useful to have:
        Mapping between described collection(s) and OAI-PMH set(s)
        Prefix of the metadata format preferred for Europeana (if not described in the ListMetadataFormats response), e.g. oai_dc, mods, tel, ese
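As a rough illustration of how these parameters fit together, the sketch below assembles a ListRecords request from a base URL, a metadata prefix and an optional set. The function name and the base URL `http://repository.example.org/oai` are hypothetical; only the `verb`, `metadataPrefix` and `set` arguments come from the OAI-PMH protocol itself.

```python
from urllib.parse import urlencode


def build_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Return a ListRecords request URL for an OAI-PMH repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # "set" restricts the harvest to one OAI-PMH set (selective harvesting).
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Hypothetical repository and set name, for illustration only.
print(build_list_records_url("http://repository.example.org/oai", "ese", "maps"))
```

If the mapping between collections and sets is known, one such URL per set is enough to start a selective harvest.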
Harvesting in Europeana
  3. Configuration of harvester
  4. Full harvest with ListRecords request
    – Records collected in XML files ≤ 10 MB
    – Harvest stored in SVN
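A full harvest with ListRecords usually means following resumptionTokens until the repository signals the end of the list. The following is a minimal sketch of that loop, not the Europeana harvester itself: the `harvest` and `http_fetch` names are made up for this example, and splitting the output into files of at most 10 MB and committing them to SVN are left out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def http_fetch(url):
    """Fetch one response page over HTTP and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()


def harvest(base_url, metadata_prefix, set_spec=None, fetch=http_fetch):
    """Yield each raw ListRecords response page until the list is complete."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    while True:
        page = fetch(base_url + "?" + urlencode(params))
        yield page
        token = ET.fromstring(page).find(f".//{OAI_NS}resumptionToken")
        # A missing or empty resumptionToken marks the last page.
        if token is None or not (token.text or "").strip():
            break
        # resumptionToken is an exclusive argument: send it with the verb only.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}


# Usage (hypothetical repository URL):
# for page in harvest("http://repository.example.org/oai", "oai_dc"):
#     ...  # write the page to disk
```

Passing the fetch function as a parameter keeps the loop testable without a live repository.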
Best-practices: implementation
  Compliance with the OAI-PMH 2.0 protocol specification
  Follow the OAI-PMH v2 implementation guidelines for repository implementers
  Full functional tests!
Best-practices: OAI validation
  OAI validation = your OAI repository correctly implements the OAI-PMH!
  Correct response to all OAI-PMH requests: with arguments, under various error conditions, with every OAI response valid against its XML schema, ...
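One of the error conditions a functional test can cover is a request with an illegal verb, to which a compliant repository should answer with a `badVerb` error. The sketch below only shows how such a response might be checked; the sample response is hand-written for illustration and the `error_codes` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def error_codes(response_xml):
    """Return the code attribute of every <error> element in an OAI-PMH response."""
    root = ET.fromstring(response_xml)
    # <error> elements are direct children of the <OAI-PMH> root element.
    return [error.get("code") for error in root.findall(f"{OAI_NS}error")]


# Hand-written example of a response to a request with an illegal verb.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2009-01-13T10:00:00Z</responseDate>
  <request>http://repository.example.org/oai</request>
  <error code="badVerb">Illegal OAI verb</error>
</OAI-PMH>"""

print(error_codes(SAMPLE))
```

The same check can be repeated for the other protocol error codes, such as `badArgument` or `badResumptionToken`.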
Recommended approach to OAI validation
  Follow the Open Archives Initiative protocol conformance testing
  Validate your server using the validator supplied by the OAI.
  To validate without registering, tick the checkbox "only validate and do not register (you may then register later)."
#Protocol_Conformance_Testing
=> bottom of the page
Issues and recommendations: sets
  Set = "an optional construct for grouping items for the purpose of selective harvesting."
Number of obstacles related to sets:
  Interpreting how a repository has organized sets and determining which sets to harvest
    – Issue: setName not human-understandable and/or no setDescription provided.
    – Issue: Large number of sets to sort through.
  Knowing when there are records that belong to no sets
    – Issue: Items that belong to no sets are included in the OAI repository.
  Knowing when there are empty sets
    – Issue: Data provider exposes sets with no records.
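A harvester can at least detect which sets come without a setDescription by inspecting the ListSets response. The sketch below assumes a simplified, hand-written response (real setDescription containers usually hold a richer XML description), and the helper name is hypothetical.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def sets_missing_description(list_sets_xml):
    """Return the setSpec of every set that has no setDescription element."""
    missing = []
    for set_element in ET.fromstring(list_sets_xml).iter(f"{OAI_NS}set"):
        if set_element.find(f"{OAI_NS}setDescription") is None:
            missing.append(set_element.findtext(f"{OAI_NS}setSpec"))
    return missing


# Simplified, hand-written ListSets response for illustration.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set>
      <setSpec>maps</setSpec>
      <setName>Historical maps</setName>
      <setDescription><description>Scanned maps</description></setDescription>
    </set>
    <set>
      <setSpec>col17</setSpec>
      <setName>col17</setName>
    </set>
  </ListSets>
</OAI-PMH>"""

print(sets_missing_description(SAMPLE))
```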
Number of obstacles related to sets:
  Understanding relationships between sets
    – Issue: Relationships between sets are not expressed.
  A mechanism exists to express relationships between hierarchical sets
  But no mechanism to express relationships between overlapping sets!
  The only way to know: harvest the identifiers or records, whose headers list the sets each record belongs to
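Since overlaps can only be discovered from the record headers, a harvester might rebuild the set membership from a ListIdentifiers response, roughly as sketched here. The sample response and the `set_membership` helper are made up for illustration; the same pass also reveals records that belong to no set.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def set_membership(list_identifiers_xml):
    """Map each setSpec to its record identifiers; also return records in no set."""
    members = defaultdict(set)
    no_set = set()
    for header in ET.fromstring(list_identifiers_xml).iter(f"{OAI_NS}header"):
        identifier = header.findtext(f"{OAI_NS}identifier")
        specs = [spec.text for spec in header.findall(f"{OAI_NS}setSpec")]
        if not specs:
            no_set.add(identifier)
        for spec in specs:
            members[spec].add(identifier)
    return dict(members), no_set


# Hand-written ListIdentifiers response for illustration.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:example:a</identifier>
      <datestamp>2009-01-01</datestamp>
      <setSpec>maps</setSpec>
      <setSpec>photos</setSpec>
    </header>
    <header>
      <identifier>oai:example:b</identifier>
      <datestamp>2009-01-02</datestamp>
      <setSpec>maps</setSpec>
    </header>
    <header>
      <identifier>oai:example:c</identifier>
      <datestamp>2009-01-03</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""

members, no_set = set_membership(SAMPLE)
overlap = members["maps"] & members["photos"]  # records shared by both sets
print(sorted(overlap), sorted(no_set))
```

Comparing the keys of `members` with the sets announced in ListSets would also expose empty sets.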
Number of obstacles related to sets:
  Knowing how many records there are within a set before harvesting
    – Issue: The number of records within a set is not expressed; it can be given via a completeListSize attribute in a resumptionToken or within the set description.
  Knowing when a set structure has been substantially changed
    – Issue: Changes in a set structure have not been communicated.
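When a repository does provide completeListSize, the harvester can read it from the first response of a list request. A minimal sketch follows, with a hypothetical helper name and a hand-written sample; the attribute is optional in OAI-PMH, so the function returns None when it is absent.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def complete_list_size(response_xml):
    """Return completeListSize from the resumptionToken, or None if not given."""
    token = ET.fromstring(response_xml).find(f".//{OAI_NS}resumptionToken")
    if token is None:
        return None
    size = token.get("completeListSize")
    return int(size) if size is not None else None


# Hand-written first page of a ListIdentifiers response for illustration.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:example:a</identifier>
      <datestamp>2009-01-01</datestamp>
    </header>
    <resumptionToken completeListSize="6500" cursor="0">token123</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""

print(complete_list_size(SAMPLE))
```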
Sets: recommendations
  No single best practice for the organization of sets.
  Realistically: data providers organize sets in a way which best meets the needs of their primary service provider and can be easily done within their own internal workflows.
  Useful to organize the metadata items into sets according to the collections of resources they represent.
    – Concept of collections varies and is not completely clear in Europeana.
    – Useful for the harvester to understand the data providers' notion of a collection
Basic requirements
  Repository implementation following OAI-PMH v2.0 + tested
  Inform the person responsible for harvesting at Europeana of any repository changes / maintenance
  No regular harvesting schema determined yet
  “SLA” between data providers and harvesters
Common issues
  Unavailability / unreliability of repository server
  Implementation of OAI-PMH v2 incomplete
    – resumptionToken not supported
    – Only ListIdentifiers
  XML syntax errors
  Character encoding errors
  Short lifetime of resumptionToken
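Where a repository states the lifetime of a resumptionToken through the optional expirationDate attribute, a harvester can check how long it has to send the next request. The sketch below assumes the UTC timestamp form with seconds granularity; the helper name and sample are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def token_seconds_left(response_xml, now):
    """Return seconds until the resumptionToken expires, or None if unknown."""
    token = ET.fromstring(response_xml).find(f".//{OAI_NS}resumptionToken")
    if token is None or token.get("expirationDate") is None:
        return None
    # Assumes the YYYY-MM-DDThh:mm:ssZ form (UTC) of expirationDate.
    expires = datetime.strptime(token.get("expirationDate"), "%Y-%m-%dT%H:%M:%SZ")
    expires = expires.replace(tzinfo=timezone.utc)
    return (expires - now).total_seconds()


# Hand-written response fragment for illustration.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken expirationDate="2009-01-13T12:05:00Z">token123</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

now = datetime(2009, 1, 13, 12, 0, 0, tzinfo=timezone.utc)
print(token_seconds_left(SAMPLE, now))
```

A harvester that processes each page slowly could use this value to decide whether to fetch the next page first and process later.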
Tools / Software
  TEL/Europeana OAI-PMH Harvester
    – Offline documentation
    – Harvester
    – Java standalone application with GUI
    – Multiple harvesting jobs
    – Resuming unfinished jobs
    – Logging
    – No scheduling, no configuration interface
Tools / Software
  REPOX - Repository + Harvester
  Java standalone application with web GUI
  Multiple harvesting jobs, Scheduler
  Statistics
  Management of XML metadata repository
    – Versioning and identification of records
    – Different metadata formats
    – User interface to create metadata crosswalks: Schema mapper
Tools / Software
  OAIcat from OCLC
  Framework conforming to the OAI-PMH v2.0
  Repository + Harvesting
  Java web application
  Scheduling, logging
  Limited scalability (~2M records)
Tools / Software (TELplus D2.1)
  Other implementations in different languages to plug into a Library Management System:
    – PHP: OAIbiblio, a data provider implementation of the OAI-PMH, version 2.0. This toolkit can be easily customized to communicate with an already existing, multi-table MySQL database.
    – Perl: Celestial, an OAI aggregator/cache application that imports OAI metadata from version 1.0, 1.1 and 2.0 OAI-compliant repositories and re-exposes that metadata through either an aggregated or per-repository OAI-compliant 2.0 interface. Celestial requires oai-perl v2, MySQL, Perl 5.6.x and a CGI-capable web server.
    – Ruby: ruby-oai, which includes a client library, a server/provider library and an interactive harvesting shell.
    – Python: pyoai, a package that enables high-level access to an OAI-PMH metadata repository and also implements a framework for quickly creating OAI-PMH compliant servers.
Tools / Software
  ESE XML validation schemas developed by partners
Resources
  The Open Archives Initiative Protocol for Metadata Harvesting v2.0
  col.html
  TELplus D2.1, “OAI-PMH implementation and tools guidelines”, 21 pages
    – Protocol overview and description of main concepts
    – OAI-PMH implementation in libraries
    – References
Resources
  Wiki “Best Practices for OAI Data Provider Implementations and Shareable Metadata”:
  Excellent source of guidelines, tutorials, recommendations, implementation software and tools, references, etc.
  dex.php/Main_Page
Documentation in Europeana context
  Requirements:
    – Europeana OAI-PMH Harvesting
    – Europeana OAI-PMH Repositories
  ESE XML validation schema
  Europeana OAI-PMH data providers registry & forum/mailing list
    – Local systems
    – OAI-PMH repository solution
    – Contact
Thank you Questions? Remarks?...