1
Getting a Leg Up on OAI for the NSDL
Naomi Dushay NSDL Core Integration Cornell University
2
What is OAI?
- Open Archives Initiative … Protocol for Metadata Harvesting (OAI-PMH)
- intended as an easy way to share metadata over the internet
- “pull” model of exchange
3
OAI Harvesting
[Diagram: a service using an OAI harvester sends OAI queries to several OAI repositories; each repository returns metadata in an OAI response, and the service uses the harvested metadata.]
4
How Does OAI Work?
- OAI Protocol runs on top of HTTP
- Requests for data are encapsulated in URLs: [base URL]?verb=[OAI verb]{other arguments as needed}
- Responses are XML documents
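The request format described on this slide can be sketched in a few lines of Python. This is only an illustration: the base URL is a made-up example, and build_oai_request is a hypothetical helper, not part of any OAI toolkit.

```python
from urllib.parse import urlencode


def build_oai_request(base_url, verb, **arguments):
    """Return an OAI-PMH GET request URL: base URL, verb, then other arguments."""
    # The verb goes first; urlencode() URL-encodes every argument value.
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


# Hypothetical base URL, used only for illustration.
# "from" is a Python keyword, so it is passed through a dict.
url = build_oai_request(
    "http://example.org/oai",
    "ListRecords",
    metadataPrefix="oai_dc",
    **{"from": "2004-01-01"},
)
print(url)
```

The server answers such a request with an XML document, never with HTML.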
5
Required Know-How
- HTTP: sending XML responses to HTTP GET and POST requests
  - Web server
- XML: namespaces (URIs and prefixes), XML Schema validity
  - XML schema validator(s)
  - Possibly XML schema creation
- Metadata: it depends on your situation
6
OAI and the NSDL Metadata Repository (“union catalog”)
- “normalized” metadata with Qualified Dublin Core as its base, to improve:
  - services (e.g. search results, or UI display)
  - metadata quality, when possible
  - predictability of data for re-harvesting services
- automated harvest/expose model, with OAI at each end
7
OAI in the NSDL Infrastructure
[Diagram: your collection’s metadata is served by your collection’s OAI server and harvested into the NSDL Metadata Repository (MR). The NSDL MR OAI server exposes your collection’s metadata, scrubbed & normalized, to the NSDL Search Service, the NSDL Archive Service, and other OAI services.]
8
Automated MR ingest process
[Diagram: NSDL Collection Registration → Validation of your collection’s OAI server → OAI Harvest → “raw” or “native” metadata → Validation → Normalize → normalized metadata → Metadata Repository → NSDL MR OAI server. On problems: notify collection; may need to halt processing.]
9
OAI-PMH: Key points
- OAI-PMH requests are embedded in HTTP
  - it’s a web request/response service, not a flat file
- XML, not HTML
- multiple metadata formats are allowed
  - OAI ≠ simple DC only!
- Each metadata format MUST have a valid XML schema
10
Metadata Formats and Schemas
Format | XML namespace | XML Schema location | OAI metadataPrefix
Simple Dublin Core, OAI flavor | [URL missing] | [URL missing] | oai_dc
Qualified Dublin Core, latest NSDL flavor | [URL missing] | [URL missing] | (As you like; We use “nsdl_dc”)
Your format | (An appropriate URI) | (URL for an XML schema) | (As you like)
11
MR ingest requires: compliant OAI 2.0 server
- Correct implementation of OAI-PMH: correct responses to all queries
- Every OAI response must be (deeply) XML schema valid
- Proper encoding in proper places
  - XML encoding
  - URL encoding
  - UTF-8 encoding
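The three encodings named above apply in different places. A minimal sketch using only the Python standard library; the title and token values are invented for illustration:

```python
from urllib.parse import quote
from xml.sax.saxutils import escape

# Invented sample values; "\u00e9" is the letter e with an acute accent.
title = "Plants & Fungi: Caf\u00e9 notes"
token = "page 2&more/records"

# XML encoding: escape markup characters before placing text inside an element.
xml_text = escape(title)

# URL encoding: percent-encode argument values placed in a request URL.
token_arg = quote(token, safe="")

# UTF-8 encoding: the bytes of the XML response must be valid UTF-8.
utf8_bytes = f"<dc:title>{xml_text}</dc:title>".encode("utf-8")

print(xml_text)
print(token_arg)
print(utf8_bytes)
```

Applying an encoding in the wrong place, or applying it twice, is the usual source of trouble, for example escaping "&" twice gives a double XML encoding.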
12
OAI 2.0 – Identify
- baseURL
- email address(es)
- protocol version
- description for OAI identifier syntax, especially if adhering to oai-identifier syntax described in Implementation Guidelines
13
OAI 2.0 – ListMetadataFormats
- correct XML namespace for each format
- a valid XML schema for each format
  - targetNamespace MUST match XML namespace above
- super easy out: use oai_dc
- easy out: use nsdl_dc
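The targetNamespace rule can be checked mechanically. The sketch below compares the namespace declared in a ListMetadataFormats response with the targetNamespace of the schema it points to. Both XML fragments, the "my_format" prefix, and the example.org URLs are hypothetical; a real check would download the schema from the <schema> URL.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
XSD_NS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical fragment of a ListMetadataFormats response.
FORMATS_XML = """<ListMetadataFormats xmlns="http://www.openarchives.org/OAI/2.0/">
  <metadataFormat>
    <metadataPrefix>my_format</metadataPrefix>
    <schema>http://example.org/schemas/my_format.xsd</schema>
    <metadataNamespace>http://example.org/ns/my_format/</metadataNamespace>
  </metadataFormat>
</ListMetadataFormats>"""

# Hypothetical schema document that the <schema> URL would serve.
SCHEMA_XML = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    targetNamespace="http://example.org/ns/my_format/"/>"""


def declared_namespaces(formats_xml):
    """Map each metadataPrefix to the XML namespace the server declares for it."""
    root = ET.fromstring(formats_xml)
    result = {}
    for fmt in root.iter(OAI_NS + "metadataFormat"):
        prefix = fmt.findtext(OAI_NS + "metadataPrefix")
        result[prefix] = fmt.findtext(OAI_NS + "metadataNamespace")
    return result


def target_namespace(schema_xml):
    """Return the targetNamespace attribute of an XML Schema document."""
    root = ET.fromstring(schema_xml)
    if root.tag != XSD_NS + "schema":
        raise ValueError("not an XML Schema document")
    return root.get("targetNamespace")


declared = declared_namespaces(FORMATS_XML)
matches = declared["my_format"] == target_namespace(SCHEMA_XML)
print(matches)
```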
14
OAI 2.0 – ListSets
- super easy out: if all your metadata is NSDL relevant, don’t use sets for our sake.
- if you want the NSDL to harvest only SOME of your OAI server’s metadata, then use sets.
  - We will harvest only the sets you specify … but our default is to harvest all of them.
- super easy setSpec strings: use only alpha-num characters
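The "alpha-num only" advice for setSpec strings is easy to test for. A small sketch, assuming ASCII letters and digits; is_simple_set_spec is a hypothetical helper, and the check is deliberately stricter than the protocol itself, which also permits other characters.

```python
import re

# ASCII letters and digits only, at least one character.
SIMPLE_SET_SPEC = re.compile(r"[A-Za-z0-9]+")


def is_simple_set_spec(set_spec):
    """True if the setSpec uses only alphanumeric characters."""
    return SIMPLE_SET_SPEC.fullmatch(set_spec) is not None


for candidate in ["biology", "grade9", "earth science", "bio&chem"]:
    print(candidate, is_simple_set_spec(candidate))
```

Strings that pass this check need no URL encoding when used as the "set" argument of a request.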
15
OAI 2.0 – ListRecords
- Every metadata record served must (deeply) validate to its indicated XML schema
- If used, resumptionTokens must be implemented properly
  - resumptionToken is an exclusive argument
  - Last response has an empty resumptionToken
- Selective Harvesting must work properly
  - “from” and “until” arguments must limit the results appropriately
  - “set” arguments must limit the results appropriately, if implemented
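From the harvester's side, the resumptionToken rules look roughly like the loop below. It is a sketch, not the NSDL harvester: harvest_identifiers and fetch_page are hypothetical names, and error handling is left out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def fetch_page(url):
    """Download one OAI-PMH response over HTTP."""
    with urlopen(url) as response:
        return response.read()


def harvest_identifiers(base_url, metadata_prefix, fetch=fetch_page):
    """Yield the header identifier of every record, following resumptionTokens."""
    arguments = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(f"{base_url}?{urlencode(arguments)}"))
        for header in root.iter(OAI_NS + "header"):
            yield header.findtext(OAI_NS + "identifier")
        token = root.find(f".//{OAI_NS}resumptionToken")
        # A missing or empty resumptionToken marks the last response.
        if token is None or not (token.text or "").strip():
            break
        # resumptionToken is exclusive: send it with the verb and nothing else.
        arguments = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

The fetch parameter exists so the loop can be exercised against canned responses instead of a live server.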
16
Common Points of Confusion - 1
about the metadata vs. about the resource
- identifiers: OAI vs. DC
  - record/header/identifier vs. record/metadata/../dc:identifier
- dates: OAI vs. DC
  - record/header/datestamp vs. record/metadata/../dc:date
- OAI about containers are about the metadata
- rights: OAI about vs. DC
  - record/about/../(dc:rights?) vs. record/metadata/../dc:rights
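The header/metadata distinction is easier to see on a concrete record. The record below is invented for illustration (identifiers, dates, and URLs are all made up); the header values describe the metadata record, while the dc: values describe the resource.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented example record.
RECORD_XML = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:example.org:rec-42</identifier>
    <datestamp>2004-03-15</datestamp>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:identifier>http://example.org/resources/frogs.html</dc:identifier>
      <dc:date>1998-07-01</dc:date>
    </oai_dc:dc>
  </metadata>
</record>"""

record = ET.fromstring(RECORD_XML)

# About the metadata record: its OAI identifier and when it last changed.
about_metadata = {
    "identifier": record.findtext("oai:header/oai:identifier", namespaces=NS),
    "datestamp": record.findtext("oai:header/oai:datestamp", namespaces=NS),
}

# About the resource: where it lives and a date associated with it.
about_resource = {
    "identifier": record.findtext(".//dc:identifier", namespaces=NS),
    "date": record.findtext(".//dc:date", namespaces=NS),
}

print(about_metadata)
print(about_resource)
```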
17
OAI identifiers
- Must uniquely identify individual metadata records at your site
  - for OAI harvest and OAI reharvest
- Must stay the same for your metadata records
  - metadata is updated; OAI identifier unchanged
18
Common Points of Confusion - 2
Dates
- format confusion
  - OAI dates must be encoded as ISO8601 and must be in UTC (≈ GMT). OAI-PMH allows YYYY-MM-DD and YYYY-MM-DDThh:mm:ssZ.
  - DC date encoding: “Recommended best practice for encoding the date value is defined in a profile of ISO 8601 [W3CDTF] and follows the YYYY-MM-DD format.”
- <responseDate> (all OAI-PMH responses)
  - Time when the OAI server responds to a request
  - OAI-PMH says: ‘must be the time and date of the response in UTC. This is encoded using the "Complete date plus hours, minutes, and seconds" variant of ISO8601. This format is YYYY-MM-DDThh:mm:ssZ.’
- <datestamp> (OAI-PMH <record>/<header>)
- “from” and “until” arguments in OAI requests
- <dc:date>
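The two date forms OAI-PMH allows, and the UTC requirement for responseDate, can be sketched as follows. response_date and is_oai_date are hypothetical helper names, and the pattern checks only the shape of the string, not whether the date exists on the calendar.

```python
import re
from datetime import datetime, timedelta, timezone

# YYYY-MM-DD, optionally followed by Thh:mm:ssZ.
OAI_DATE = re.compile(r"\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}:\d{2}Z)?")


def response_date(moment=None):
    """Format a timezone-aware datetime as YYYY-MM-DDThh:mm:ssZ in UTC."""
    moment = moment or datetime.now(timezone.utc)
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def is_oai_date(text):
    """True if text has one of the two date shapes OAI-PMH allows."""
    return OAI_DATE.fullmatch(text) is not None


# A server five hours behind UTC must still report UTC.
local = datetime(2004, 3, 15, 19, 30, 0, tzinfo=timezone(timedelta(hours=-5)))
print(response_date(local))
print(is_oai_date("2004-03-15T19:30:00-05:00"))
```

Note how the local evening of 15 March becomes 16 March in UTC; serving local time with a "Z" suffix would silently shift every datestamp.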
19
When a Collection Deletes Records
- if not indicated in OAI server
  - incremental harvest for MR never shows update; MR copy never deleted!
- if indicated in OAI server transiently
  - reharvested soon enough: MR finds delete on incremental harvest
  - not reharvested soon enough: incremental harvest for MR never shows update; MR copy never deleted!
- if indicated in OAI server and persistent
  - MR finds delete on incremental harvest
20
In an ideal world, we’d like
- nsdl_dc
  - Information, example records, etc. in the NSDL Metadata Primer
- Persistent deleted records
- OAI identifier syntax, per OAI Implementation Guidelines
22
How do we normalize metadata?
- Perform “safe” transforms to “smarten up” metadata
  - XSL stylesheets -- from your XML metadata to our normalized XML metadata
- Principles:
  - Do no harm (Don’t lose information)
  - Add information, when possible
    - Indicate schemes for valid values
  - Remove meaningless text
    - “…”, “not available”, “-”
    - Empty elements
  - Correct wrong information
    - “text/pdf” → “application/pdf”
  - Remove characters that impede functionality or display
    - Encoding fixes (e.g. “&”, double XML encodings, bad UTF-8 …)
    - Scrub URLs
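The NSDL does this work with XSL stylesheets. Purely to illustrate two of the principles (remove meaningless text, correct wrong information), here is the same idea in Python on a record held as (element, value) pairs. The normalize function, the placeholder list, and the sample record are assumptions made for this sketch.

```python
# Placeholder values treated as meaningless (an assumed list; compared in lower case).
MEANINGLESS = {"", "...", "\u2026", "-", "not available"}

# Known-wrong values mapped to their corrections.
FORMAT_FIXES = {"text/pdf": "application/pdf"}


def normalize(record):
    """Drop empty or meaningless values and correct known-wrong formats."""
    cleaned = []
    for element, value in record:
        text = value.strip()
        if text.lower() in MEANINGLESS:
            continue  # covers empty elements and placeholder text
        if element == "dc:format":
            text = FORMAT_FIXES.get(text.lower(), text)
        cleaned.append((element, text))
    return cleaned


raw = [
    ("dc:title", "Frogs of North America"),
    ("dc:description", "Not available"),
    ("dc:format", "text/pdf"),
    ("dc:subject", " - "),
    ("dc:creator", ""),
]
print(normalize(raw))
```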
23
Automated MR Ingest process
1. Your collection info and harvesting info are registered
2. OAI validation – can we run our harvester on your OAI server? (see handouts)
3. OAI harvest of your metadata (nsdl_dc if available; oai_dc if not; other formats soon)
4. XML schema validation of all of your metadata
5. UTF-8 encoding validation (bad UTF-8 chars changed into harmless ones)
6. Normalized nsdl_dc created.
7. Your metadata, “raw” and normalized, is loaded into the MR tables and made available to the NSDL’s MR OAI server.
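The UTF-8 validation step can be approximated in Python as shown below. Which "harmless" character the NSDL process substitutes is not stated here; this sketch assumes the Unicode replacement character U+FFFD, and scrub_utf8 is a hypothetical name.

```python
def scrub_utf8(raw_bytes):
    """Decode bytes as UTF-8, replacing invalid sequences with U+FFFD."""
    return raw_bytes.decode("utf-8", errors="replace")


def is_valid_utf8(raw_bytes):
    """True if the bytes decode as UTF-8 without any error."""
    try:
        raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# b"\xe9" is a Latin-1 accented e, which is not valid UTF-8 on its own.
bad = b"Caf\xe9 menu"
print(is_valid_utf8(bad))
print(scrub_utf8(bad))
```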
24
Deleted Records – Our Solution
- “Full reharvest”
  - Mark all the site’s records in MR “deleted”
  - Harvest all metadata records for the collection
  - As we ingest each newly retrieved record into the MR, if we over-write an old record, “un-delete” it.
- Expensive
  - network bandwidth
  - processing time
- Okay for small collections (under ~15,000)
- Okay for metadata that changes infrequently
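The mark, harvest, and un-delete steps of a full reharvest can be sketched with an in-memory dictionary standing in for the MR tables. The data layout and the full_reharvest name are assumptions for illustration, not the actual MR implementation.

```python
def full_reharvest(repository, harvested_records):
    """Mark every record deleted, then un-delete each one the harvest returns.

    repository: dict mapping OAI identifier -> {"metadata": ..., "deleted": bool}
    harvested_records: iterable of (identifier, metadata) pairs from a full harvest
    """
    # Step 1: mark all of the site's records as deleted.
    for entry in repository.values():
        entry["deleted"] = True
    # Steps 2 and 3: ingest each harvested record; overwriting an old record
    # un-deletes it, and brand-new records arrive as not deleted.
    for identifier, metadata in harvested_records:
        repository[identifier] = {"metadata": metadata, "deleted": False}
    return repository


mr = {
    "oai:example.org:1": {"metadata": "old-1", "deleted": False},
    "oai:example.org:2": {"metadata": "old-2", "deleted": False},
}
full_reharvest(mr, [("oai:example.org:1", "new-1"), ("oai:example.org:3", "new-3")])
print(mr)
```

Record 2 is absent from the harvest, so it stays marked deleted. Every record is touched on every run, which is where the bandwidth and processing cost comes from.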