Infrastructures for Using Metadata RSS and OAI-PMH CS 431 – March 14, 2005 Carl Lagoze – Cornell University.

Presentation transcript:

RSS
Format to expose news and content of news-like sites
– Wired
– Slashdot
– Weblogs
“News” has very wide meaning
– Any dynamic content that can be broken down into discrete items
– Wiki changes
– CVS checkins
Roles
– Provider syndicates by placing an RSS-formatted XML file on the Web
– Aggregator runs an RSS-aware program to check feeds for changes

RSS History
Original design (0.90) for Netscape, for building portals of headlines to news sites
– Loosely RDF based
Simplified for 0.91, dropping RDF connections
RDF branch was continued with namespaces and extensibility in RSS 1.0
Non-RDF branch continued to 2.0 release
Alternately called:
– Rich Site Summary
– RDF Site Summary
– Really Simple Syndication

RSS is in wide use
All sorts of origins
– News
– Blogs
– Corporate sites
– Libraries
– Commercial
http://blogs.law.harvard.edu/tech/2005/01/04#a821

RSS components
Channel
– Single tag that encloses the main body of the RSS document
– Contains metadata about the channel: title, link, description, language, image
Item
– Channel may contain multiple items
– Each item is a “story”
– Contains metadata about the story (title, description, etc.) and possibly a link to the story
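
A minimal sketch of the channel/item structure, assuming a plain RSS 2.0 document without namespaces. The feed text, the example.org URLs and the helper name read_feed are invented for illustration and do not come from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, made up for illustration only.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://example.org/</link>
    <description>An invented channel used for illustration</description>
    <language>en-us</language>
    <item>
      <title>First story</title>
      <link>http://example.org/stories/1</link>
      <description>Summary of the first story</description>
    </item>
    <item>
      <title>Second story</title>
      <description>A story with no link</description>
    </item>
  </channel>
</rss>"""


def read_feed(xml_text):
    """Return (channel metadata, list of item metadata) from RSS 2.0 text."""
    channel = ET.fromstring(xml_text).find("channel")
    channel_meta = {
        name: channel.findtext(name)
        for name in ("title", "link", "description", "language")
    }
    items = [
        {
            "title": item.findtext("title"),
            "description": item.findtext("description"),
            "link": item.findtext("link"),  # None when the item has no link
        }
        for item in channel.findall("item")
    ]
    return channel_meta, items


channel_meta, items = read_feed(SAMPLE_FEED)
print(channel_meta["title"], len(items))
```

An aggregator would run something like read_feed on each poll and compare the items with those it has already seen.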

RSS 1.0 Example

RSS 2.0 Example

RSS Validation

And of course….

RSS applications
Automated discovery of RSS feeds
Aggregators
– AmphetaDesk
– NewsGator
– NetNewsWire

RSS 2.0 and publish and subscribe
The cloud element of channel
– Specifies a web service that supports the rssCloud interface, which can be implemented in HTTP-POST, XML-RPC or SOAP 1.1
– Allows processes to register with a cloud to be notified of updates to the channel via a callback

The Open Archives Initiative (OAI) and the Protocol for Metadata Harvesting (OAI-PMH)

Origins of the OAI
“The Open Archives Initiative has been set up to create a forum to discuss and solve matters of interoperability between electronic preprint solutions, as a way to promote their global acceptance.” (Paul Ginsparg, Rick Luce & Herbert Van de Sompel)

What is the OAI now?
– Technological framework around OAI-PMH protocol
– Application independent
– Independent of economic model for content
Also … a community and a “brand” (and you need it for an assignment due in May)
“The OAI develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content.” (from OAI mission statement)

Where does the OAI fit?
[Diagram labels: OAI; DLESE Earth Science Digital Library; EPrints; DSpace; Library of Congress; arXiv; NSDL Metadata Repo.; OAIster Search Service]

OAI and Open Access
There is “A” difference
– Open Archives Initiative
– Open Access
The OAI is not tied to a particular political agenda: its focus is technical
BUT… the OAI provides functionality that is essential for many Open Access proposals

OAI-PMH
[Diagram: Service Provider (Harvester) sends protocol requests (GET, POST) to Data Provider (Repository), which returns XML metadata]
– PMH -> Protocol for Metadata Harvesting
– Simple protocol, just 6 verbs
– Designed to allow harvesting of any XML (meta)data (schema described)
– For batch-mode, not interactive, use

OAI for discovery
[Diagram: User facing repositories R1, R2, R3, R4 as “information islands”]

OAI for discovery
[Diagram: metadata from repositories R1, R2, R3, R4 harvested by a search service in the service layer, which the User queries]

OAI for XYZ
[Diagram: global network of resources (R1, R2, R3, R4) exposing XML data to an XYZ service in the service layer, used by the User]

OAI-PMH Data Model
resource: the sculpture
item: all available metadata about this sculpture; item has identifier
records: Dublin Core metadata, MARC21 metadata, branding metadata; record has identifier + metadata format + datestamp

OAI and Metadata Formats
Protocol based on the notion that a record can be described in multiple metadata formats
Dublin Core is required for “interoperability”
Extended to include XML compound object formats: e.g., METS, DIDL
– ndesompel.htmlhttp:// ndesompel.html

OAI-PMH and HTTP
OAI-PMH uses HTTP as transport
– Encoding OAI-PMH in GET: &arg1=...
– Example: verb=GetRecord& identifier=oai:arXiv.org:hep-th/ & metadataPrefix=oai_dc
Error handling
– All OK at HTTP level? => 200 OK
– Something wrong at OAI-PMH level? => OAI-PMH error (e.g. badVerb)
– HTTP codes 302 (redirect), 503 (retry-after), etc. still available to implementers, but do not represent OAI-PMH events
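
The GET encoding can be sketched with the standard library. The base URL http://example.org/oai, the identifier oai:example.org:item-1 and the helper name build_request_url are assumptions made for this example; the verb and metadataPrefix values come from the slide.

```python
from urllib.parse import urlencode


def build_request_url(base_url, verb, arguments=None):
    """Encode an OAI-PMH request as an HTTP GET URL.

    `arguments` is a dict so that names such as "from", which is a
    Python keyword, can be passed without trouble.
    """
    params = {"verb": verb}
    params.update(arguments or {})
    # urlencode percent-escapes reserved characters such as ":" and "/".
    return base_url + "?" + urlencode(params)


# Hypothetical repository and identifier, for illustration only.
url = build_request_url(
    "http://example.org/oai",
    "GetRecord",
    {"identifier": "oai:example.org:item-1", "metadataPrefix": "oai_dc"},
)
print(url)
```

A 200 OK for this URL only says that HTTP worked; the XML body still has to be checked for an OAI-PMH error.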

OAI-PMH verbs
Harvesting verbs
– GetRecord: listing of a single record
– ListRecords: listing of N records
– ListIdentifiers: OAI unique ids contained in archive
Metadata about the repository
– ListSets: sets defined by archive
– ListMetadataFormats: metadata formats supported by archive
– Identify: description of archive
Most verbs take arguments: dates, sets, ids, metadata formats and resumption token (for flow control)

Identify verb
Information about the repository; start any harvest with Identify
[Example response, XML markup lost in extraction; surviving values: Library of Congress, transient, T00:00:00Z, YYYY-MM-DDThh:mm:ssZ, deflate]

GetRecord - Normal response
[Example response, XML markup lost in extraction; surviving values: T08:55:46Z, oai:arXiv:cs/, cs, math. Annotations: namespace info not shown here; note no HTTP encoding of the OAI-PMH request]

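Error/exception response
Same schema for all responses, including error responses.
[Example response, XML markup lost in extraction; surviving values: T08:55:46Z, “ShowMe is not a valid OAI-PMH verb”. Annotation: with errors, only the correct attributes are echoed in the request element]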
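
Because errors arrive inside a normal response document, a harvester has to look for error elements after parsing. The sketch below assumes the OAI-PMH 2.0 namespace; the sample response (including its date and request URL), the OAIError class and check_response are invented for illustration.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Hypothetical error response; date and URL are invented.
SAMPLE_ERROR = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2005-01-01T00:00:00Z</responseDate>
  <request>http://example.org/oai</request>
  <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>"""


class OAIError(Exception):
    """Raised when a response carries an OAI-PMH level error."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code


def check_response(xml_text):
    """Parse a response and raise OAIError if it contains an error element."""
    root = ET.fromstring(xml_text)
    errors = root.findall(OAI_NS + "error")
    if errors:
        first = errors[0]
        raise OAIError(first.get("code"), (first.text or "").strip())
    return root


try:
    check_response(SAMPLE_ERROR)
    raised = None
except OAIError as exc:
    raised = exc
print(raised)
```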

Identifiers
Items have identifiers (all records of same item share identifier)
Identifiers must have URI syntax
Unless you can recognize a global URI scheme, identifiers must be assumed to be local to the repository
Complete identification of a record is baseURL+identifier+metadataPrefix+datestamp
The provenance container may be used to express harvesting/transformation history

Selective Harvesting
RSS is mainly a “tail” format
OAI-PMH is more “grep” like
Two “selectors” for harvesting
– Date
– Set
Why not general search?
– Out of scope
– Not low-barrier
– Difficulty in achieving consensus

Datestamps
All dates/times are UTC, encoded in ISO8601, Z notation: T20:30:00Z
Datestamps may be either full date/time as above or date only (YYYY-MM-DD). Must be consistent over whole repository; ‘granularity’ is specified in the Identify response.
Earlier version of the protocol specified “local time”, which caused lots of misunderstandings. Not good for global interoperability!

Harvesting granularity
Mandatory support of YYYY-MM-DD
Optional support of YYYY-MM-DDThh:mm:ssZ (must look at Identify response)
Granularity of from and until arguments in ListIdentifiers/ListRecords must match
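
A small sketch of producing from and until values in UTC at a single granularity. The helper names format_datestamp and harvest_window are hypothetical, and the sample date is invented.

```python
from datetime import datetime, timedelta, timezone

DAY = "YYYY-MM-DD"
SECONDS = "YYYY-MM-DDThh:mm:ssZ"


def format_datestamp(moment, granularity):
    """Format a timezone-aware datetime as a UTC datestamp."""
    if moment.tzinfo is None:
        raise ValueError("datestamps need an explicit timezone")
    utc = moment.astimezone(timezone.utc)
    if granularity == DAY:
        return utc.strftime("%Y-%m-%d")
    if granularity == SECONDS:
        return utc.strftime("%Y-%m-%dT%H:%M:%SZ")
    raise ValueError(f"unknown granularity: {granularity}")


def harvest_window(start, end, granularity):
    """Build from/until arguments that share one granularity."""
    return {
        "from": format_datestamp(start, granularity),
        "until": format_datestamp(end, granularity),
    }


# 22:00 at UTC-5 is 03:00 UTC on the following day.
local = datetime(2005, 3, 14, 22, 0, 0, tzinfo=timezone(timedelta(hours=-5)))
print(format_datestamp(local, SECONDS))
print(format_datestamp(local, DAY))
```

Converting to UTC before truncating to a day avoids the off-by-one-day errors that local time would introduce.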

Sets
Simple notion of grouping at the item level to support selective harvesting
– Hierarchical set structure
– Multiple set membership permitted
– E.g.: repo has sets A, A:B, A:B:C, D, D:E, D:F
  If item1 is in A:B then it is in A
  If item2 is in D:E then it is in D, may also be in D:F
  Item3 may be in no sets at all
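
The hierarchy rule (membership in A:B implies membership in A) can be sketched as follows. The function names expand_sets and in_set are hypothetical; the set names are the ones from the slide.

```python
def expand_sets(set_specs):
    """Return the given setSpecs together with all of their ancestors."""
    expanded = set()
    for spec in set_specs:
        parts = spec.split(":")
        # "A:B:C" contributes "A", "A:B" and "A:B:C".
        for depth in range(1, len(parts) + 1):
            expanded.add(":".join(parts[:depth]))
    return expanded


def in_set(item_sets, target):
    """True when an item listed in item_sets belongs to the target set."""
    return target in expand_sets(item_sets)


print(sorted(expand_sets(["A:B"])))
print(in_set(["D:E"], "D"), in_set(["D:E"], "D:F"))
```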

Record headers
Header contains set membership of item
[Example header, XML markup lost in extraction; surviving values: oai:arXiv:cs/, cs, math]
Eliminates the need for the “double harvest” that 1.x required to get all records and all set information

resumptionToken
Protocol supports the notion of partial responses in a very simple way: the response includes a ‘token’ at the end, which is used to get the next chunk.
Idempotency of resumptionToken: return same incomplete list when resumptionToken is reissued
– while no changes occur in the repo: strict
– while changes occur in the repo: all items with unchanged datestamp
Optional attributes for the resumptionToken: expirationDate, completeListSize, cursor
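
The flow control can be sketched as a loop that keeps requesting until no token, or an empty token, comes back. The fetch callable is a stand-in for an HTTP request, and fake_fetch, _page and the example.org identifiers are invented so that the sketch runs without a network.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def harvest_identifiers(fetch, arguments):
    """Yield identifiers from ListIdentifiers, following resumptionTokens.

    `fetch` takes a dict of request parameters and returns response XML.
    """
    params = {"verb": "ListIdentifiers", **arguments}
    while True:
        root = ET.fromstring(fetch(params))
        for header in root.iter(OAI_NS + "header"):
            yield header.findtext(OAI_NS + "identifier")
        token = root.find(f".//{OAI_NS}resumptionToken")
        if token is None or not (token.text or "").strip():
            return  # a missing or empty token ends the list
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListIdentifiers",
                  "resumptionToken": token.text.strip()}


def _page(identifiers, token):
    """Build a hypothetical partial response for the demo below."""
    headers = "".join(
        f"<header><identifier>{i}</identifier></header>" for i in identifiers
    )
    return (
        '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
        f"<ListIdentifiers>{headers}"
        f"<resumptionToken>{token}</resumptionToken>"
        "</ListIdentifiers></OAI-PMH>"
    )


def fake_fetch(params):
    """Stand-in repository that serves two chunks."""
    if params.get("resumptionToken") == "page-2":
        return _page(["oai:example.org:3"], "")
    return _page(["oai:example.org:1", "oai:example.org:2"], "page-2")


harvested = list(harvest_identifiers(fake_fetch, {"metadataPrefix": "oai_dc"}))
print(harvested)
```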

Harvesting strategy
Issue Identify request
– Check all as expected (validate, version, baseURL, granularity, compression…)
Check sets/metadata formats as necessary (ListSets, ListMetadataFormats)
Do harvest; initial complete harvest done with no from and until parameters
Subsequent incremental harvests start from datestamp that is responseDate of last response
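
A sketch of how the request arguments change between the initial and the incremental harvests. The function name harvest_arguments and the sample responseDate are invented; truncating to the day for day-granularity repositories is an assumption of this sketch, and it means records from that day may be fetched again.

```python
DAY = "YYYY-MM-DD"
SECONDS = "YYYY-MM-DDThh:mm:ssZ"


def harvest_arguments(metadata_prefix, last_response_date=None,
                      granularity=DAY):
    """Return ListRecords arguments for the next harvest.

    With no stored responseDate this is the initial complete harvest,
    so neither from nor until is sent.
    """
    args = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if last_response_date is not None:
        if granularity == DAY:
            # responseDate has seconds precision; keep only YYYY-MM-DD.
            args["from"] = last_response_date[:10]
        else:
            args["from"] = last_response_date
    return args


first = harvest_arguments("oai_dc")
later = harvest_arguments("oai_dc", "2005-03-14T20:30:00Z", granularity=DAY)
print(first)
print(later)
```

The granularity passed in would come from the Identify response checked at the start of the harvest.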