Subject Repositories: European collaboration in the international context
28-29 January 2010
Workshop: Technical infrastructure & interoperability
Benoit Pauwels, Université Libre de Bruxelles, Belgium
Workshop plan
  Theme 1: The Economists Online network of data providers
    - General infrastructure of the EO solution
    - DIDL/MODS: the EO metadata exchange format
    - RDF/XML Admin file: decentralized administration
    - Enrichment of metadata
  Theme 2: Economists Online and RePEc
    - Pulling metadata from RePEc
    - Pushing metadata to RePEc
    - Contribute to LogEc
    - Use CitEc
Workshop plan
  Theme (45’)
    - Introduction (BP, 20’)
    - 3 topics for brainstorming (breakout groups, 10’)
    - Breakout groups reporting back (all, 15’)
Theme 1: The Economists Online network of data providers
  - General infrastructure of the EO solution
  - DIDL/MODS: the EO metadata exchange format
  - RDF/XML Admin file: decentralized administration
  - Enrichment of metadata
[Diagram: general infrastructure of the EO solution]
  - Inputs: metadata, logs, objects (over OAI-PMH and HTTP)
  - Meresco: harvester, crawler, Lucene
  - Exporter engine (homemade, FOSS)
  - Output protocols: SRU, OAI-PMH, RSS
  - Consumers: EO portal (homemade, FOSS), RePEc, other portals
[Diagram: the EO infrastructure annotated with its exchange formats]
  - Metadata exchange format: DIDL / MODS (NEEO specs)
  - Usage metadata exchange format: SWUP OFI Comm Profile
  - Inputs: metadata, logs, objects (over OAI-PMH and HTTP)
  - Meresco: harvester, crawler, Lucene
  - Exporter engine (homemade, FOSS)
  - Output protocols: SRU, OAI-PMH, RSS
  - Consumers: EO portal (homemade, FOSS), RePEc, other portals
Technical decisions
  Desired EO functionality -> Technical decision
  - Facetted search & find experience -> Normalized/normalizable metadata
  - APA formatted citations -> Granular metadata
  - Publication list per author -> Unambiguous identification of authors
  - Full text indexing/searching -> Unambiguous links to full texts
  - Enrichment of metadata (JEL, datasets, citations, ReDIF) -> Extensible metadata format
Metadata exchange format: DIDL
  XML container structure that can hold semantically distinct metadata:
  - descriptive metadata
  - object files (by-ref)
  - splash page
  - enriched metadata
      - JEL
      - full text (by-ref)
      - datasets (by-ref)
      - [ references ]
      - RePEc handle and metadata (by-ref)
  Based on the existing container structure defined by SurfShare
  - “info:eu-repo” vocabularies (objectfile accessRights, version, ...)
Metadata exchange format: MODS
  Granular descriptive metadata: MODS (3.2)
  - Based on the existing metadata structure defined by SurfShare
  - “info:eu-repo” vocabularies (publication type, ...)
  Unambiguous identification of authors: DAI (Digital Author Identifier)
  - National or institution-unique persistent identifier
  These solutions are not specific to the NEEO project; the continuous aim is standardization at a level that surpasses the project
EO Data model
  Publication is described as a complex (compound) object
  - persistent identifier
  - Aggregation of 3 types of components:
      - descriptiveMetadata (MODS)
      - objectFiles
      - humanStartPage
  - Extensible: additional items can be stored within the complex object
  - MODS contains the Digital Author Identifier (DAI) of the EO author
  Structure:
  DIDL[1]
    Item[1]
      Descriptor/Identifier (persistent identifier)
      Item[1..∞] (of type descriptiveMetadata)
        Descriptor/type (« descriptiveMetadata »)
        Component/Resource -- representation by value (XML)
      Item[0..∞] (of type objectFile)
        Component/Resource -- representation by ref. (URL)
        Descriptor/modified
        Descriptor/type (« objectFile »)
      Item[0..1] (of type humanStartPage)
        Descriptor/type (« humanStartPage »)
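The cardinalities in the data model above can be made concrete with a small in-memory sketch. This is not the NEEO schema: the class and field names are hypothetical, and only the constraints stated on the slide (one persistent identifier, at least one descriptiveMetadata item, any number of objectFiles, at most one humanStartPage, extensible extra items) are modelled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DescriptiveMetadata:
    mods_xml: str  # carried by value (the XML itself)


@dataclass
class ObjectFile:
    url: str  # carried by reference (a URL)


@dataclass
class Publication:
    """Hypothetical in-memory view of one compound object."""
    persistent_id: str
    descriptive_metadata: List[DescriptiveMetadata]            # Item[1..n]
    object_files: List[ObjectFile] = field(default_factory=list)  # Item[0..n]
    human_start_page: Optional[str] = None                     # Item[0..1]
    # Extensibility: further item types keyed by their type name.
    extra_items: Dict[str, List[str]] = field(default_factory=dict)

    def validate(self) -> List[str]:
        """Return a list of violated constraints (empty when valid)."""
        problems = []
        if not self.persistent_id:
            problems.append("missing persistent identifier")
        if len(self.descriptive_metadata) < 1:
            problems.append("at least one descriptiveMetadata item is required")
        return problems


pub = Publication(
    persistent_id="hypothetical-id-1",
    descriptive_metadata=[DescriptiveMetadata("<mods/>")],
    object_files=[ObjectFile("http://repository.example.org/file.pdf")],
)
print(pub.validate())
```

Keeping `extra_items` as an open mapping mirrors the "extensible" point: new component types can be added without changing the core shape.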
Metadata exchange format: implementations in NEEO
  - DIDL application profile
  - MODS application profile
  - Vocabularies in DIDL and MODS
  - Technical guidelines for project partners
  Solutions: home-made or with external support
  - ARNO: home-made
  - DSpace: home-made, AtMire
  - EPrints: home-made, ECS-University of Southampton
  - Fedora: METS/MODS -> DIDL/MODS
  - DigiTool: METS/MARC -> DIDL/MODS
Decentralized registry service
  XML-RDF admin file
  - FOAF + NEEO-specific vocabulary
  - Maintained by each data provider on a local web server
  - Contents:
      - information on the institution: name, description, ...
      - OAI baseURL + OAI sets to harvest
      - EO authors: photograph, full name, affiliation, DAI
  - Fetched over HTTP GET and validated by the EO Gateway at regular intervals
  - Automated harvesting process
  - Made visible through the portal
  New partner
  - Create the admin file
  - Ask for registration at [...], declaring the location of the admin file and validating it
  - If valid, you’re in
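As a rough illustration of what the gateway-side validation could look like, the sketch below parses a tiny admin file and checks that the pieces listed above are present. The RDF and FOAF namespaces are the real ones; the `neeo:` namespace URI, the element names `oaiBaseURL`, `oaiSet` and `dai`, and all sample values are hypothetical stand-ins for the NEEO-specific vocabulary, which the slides do not spell out. The HTTP GET is left out so that the example stays self-contained.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF = "http://xmlns.com/foaf/0.1/"
NEEO = "http://example.org/neeo#"  # hypothetical namespace, for illustration only

SAMPLE_ADMIN_FILE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:foaf="{FOAF}" xmlns:neeo="{NEEO}">
  <foaf:Organization>
    <foaf:name>Example University</foaf:name>
    <neeo:oaiBaseURL>http://repository.example.org/oai</neeo:oaiBaseURL>
    <neeo:oaiSet>economics</neeo:oaiSet>
  </foaf:Organization>
  <foaf:Person>
    <foaf:name>Jane Doe</foaf:name>
    <neeo:dai>dai-0001</neeo:dai>
  </foaf:Person>
</rdf:RDF>"""


def parse_admin_file(xml_text: str) -> Dict:
    """Extract institution, OAI endpoint, sets and authors from the file."""
    root = ET.fromstring(xml_text)
    org = root.find(f"{{{FOAF}}}Organization")
    authors = [
        {"name": p.findtext(f"{{{FOAF}}}name"),
         "dai": p.findtext(f"{{{NEEO}}}dai")}
        for p in root.findall(f"{{{FOAF}}}Person")
    ]
    if org is None:
        return {"institution": None, "oai_base_url": None,
                "oai_sets": [], "authors": authors}
    return {
        "institution": org.findtext(f"{{{FOAF}}}name"),
        "oai_base_url": org.findtext(f"{{{NEEO}}}oaiBaseURL"),
        "oai_sets": [s.text for s in org.findall(f"{{{NEEO}}}oaiSet")],
        "authors": authors,
    }


def validate_admin_file(admin: Dict) -> List[str]:
    """Return the list of problems; an empty list means 'you're in'."""
    problems = []
    if not admin["institution"]:
        problems.append("institution name missing")
    if not admin["oai_base_url"]:
        problems.append("OAI baseURL missing")
    for author in admin["authors"]:
        if not author["dai"]:
            problems.append(f"author without DAI: {author['name']}")
    return problems


admin = parse_admin_file(SAMPLE_ADMIN_FILE)
print(validate_admin_file(admin))
```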
[Diagram: the EO infrastructure extended with an enrichment service]
  - Inputs: metadata, logs, objects (over OAI-PMH and HTTP)
  - Meresco: harvester, crawler, Lucene
  - Enrichment service: linked to Meresco over SRU and OAI-PMH
  - Exporter engine (homemade, FOSS)
  - Output protocols: SRU, OAI-PMH, RSS/Atom
  - Consumers: EO portal (homemade, FOSS), RePEc, other portals
Metadata enrichment
  “Automated” enrichment: JEL, full text
  1. ES (enrichment service) gets the records to be enriched from EO, over SRU
       - based on the date of the request for enrichment of a certain type and version
       - based on a flag set in the EO record
  2. ES creates enrichment record(s)
  3. ES makes the enrichment records available to EO, over OAI-PMH
  4. EO harvests the enrichment records from ES and integrates them into the original record
  5. EO reuses the enrichment information in its services: index & present
  “Manual” enrichment: datasets
  - The partner enters the permalink of the publication on the DVN platform
  - EO PMH-harvests DDI from DVN, and stores by-ref information
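The automated enrichment cycle above can be sketched as an in-memory simulation. Nothing here talks to a real service: `select_for_enrichment` stands in for the SRU query, `integrate` for the OAI-PMH harvest and merge, and the keyword rule that assigns a JEL code is a toy placeholder rather than the classifier used by the project. All names and the sample code value are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EORecord:
    identifier: str
    full_text: str
    needs_enrichment: bool = True  # the "flag set in EO record"
    enrichments: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class EnrichmentRecord:
    target: str        # identifier of the EO record being enriched
    kind: str          # e.g. "JEL"
    values: List[str]


def select_for_enrichment(records: List[EORecord]) -> List[EORecord]:
    """Step 1 (stand-in for the SRU query): pick the flagged records."""
    return [r for r in records if r.needs_enrichment]


def create_enrichment(record: EORecord) -> EnrichmentRecord:
    """Step 2: toy rule producing an illustrative JEL code."""
    codes = ["O3"] if "innovation" in record.full_text.lower() else []
    return EnrichmentRecord(record.identifier, "JEL", codes)


def integrate(records: List[EORecord],
              enrichment_records: List[EnrichmentRecord]) -> None:
    """Steps 3-4 (stand-in for the OAI-PMH harvest): merge into originals."""
    by_id = {r.identifier: r for r in records}
    for enrichment in enrichment_records:
        record = by_id.get(enrichment.target)
        if record is None:
            continue  # enrichment for a record EO no longer holds
        record.enrichments.setdefault(enrichment.kind, []).extend(enrichment.values)
        record.needs_enrichment = False


records = [
    EORecord("rec-1", "A pilot study of national innovation systems"),
    EORecord("rec-2", "Already processed", needs_enrichment=False),
]
produced = [create_enrichment(r) for r in select_for_enrichment(records)]
integrate(records, produced)
print(records[0].enrichments)
```

Keeping enrichment records separate from the original record, and merging them only on the EO side, is what lets the enrichment service run and fail independently of the gateway.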
Enriched publication (LinkedData / SemanticWeb / ORE ready)
  [Diagram] IR / ES content (PDF, HTML, TXT, dataset DDI, review) aggregated into the EO compound object:
  DIDL[1]
    Item[1]
      Descriptor/Identifier (persistent identifier)
      Descriptor/modified
      Item[1..∞] (of type descriptiveMetadata)
      Item[0..∞] (of type objectFile)
      Item[0..1] (of type humanStartPage)
      Item[0..∞] (of type text)
      Item[0..∞] (of type enrichedMetadata)
      Item[0..∞] (of type dataset)
      Item[0..∞] (of type review)
  A nested compound object, shown next to the dataset and review items, repeats the same pattern:
      Descriptor/Identifier (persistent identifier)
      Descriptor/modified
      Item[1..∞] (of type descriptiveMetadata)
      Item[0..∞] (of type objectFile)
Theme 1: The Economists Online network of data providers
  BO Group 1: DIDL/MODS
  - Scalable? Implementation by 100s of partners
  - Local experiences from existing partners: implementation issues you want to share?
  - Can this become a standard for exchange of metadata of IR-contained publications?
  - Where does this stand next to (flavours of) DC, SWAP, ...?
  BO Group 2: XML Admin file
  - DAI?
  BO Group 3: Enrichment model
  - Extensibility: vocabulary for semantics of components
  - Manual enrichment: need for an enriched submission form, making it easy for people to make enriched publications
  - Automated (JEL, full text): sustainable?
Theme 2: Economists Online and RePEc
  - Pulling metadata from RePEc
  - Pushing metadata to RePEc
  - Contribute to LogEc
  - Use CitEc
RePEc model
  - RePEc archives contain RePEc series, which contain working papers, articles, books, book chapters, software
  - Manually maintained by research centres, journal publishers, university departments all over the world
  - +/- 900 archives, more than 4000 series
  - ReDIF metadata format
  - Network accessible over FTP or HTTP
  - Aggregation by RePEc services:
      - EconPapers
      - IDEAS
  - Central PMH-accessible aggregated archive of AMF-formatted metadata
RePEc model: example ReDIF record
  Template-type: ReDIF-Paper 1.0
  Author-Name: Capron, Henri
  Author-Email:
  Author-Name: Meeusen, Wim
  Author-Email:
  Author-Name: Dumont, Michel
  Author-Person: pdu51
  Author-Name: Cincera, Michele
  Author-Person: pci5
  Title: National innovation systems: pilot study of the Belgian innovation system
  Creation-Date: 1998
  Publication-Status: Published as a report for the Belgian Federal Office for Scientific, Technical and Cultural Affairs (OSTC)
  File-URL:
  File-Format: application/pdf
  Handle: RePEc:dul:ecoulb:2013-941
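To show how the ReDIF record above is structured, here is a minimal reading sketch. It treats each line as `Field: value`, splits on the first colon only (titles and handles contain colons themselves), and groups the `Author-*` fields into one cluster per `Author-Name`. Treating an indented line as a continuation of the previous field is an assumption of this sketch, and the function names are hypothetical; this is not a complete ReDIF parser.

```python
from typing import Dict, List, Tuple

SAMPLE = """Template-type: ReDIF-Paper 1.0
Author-Name: Capron, Henri
Author-Email:
Author-Name: Meeusen, Wim
Author-Email:
Author-Name: Dumont, Michel
Author-Person: pdu51
Author-Name: Cincera, Michele
Author-Person: pci5
Title: National innovation systems: pilot study of the Belgian innovation system
Creation-Date: 1998
File-Format: application/pdf
Handle: RePEc:dul:ecoulb:2013-941"""


def parse_redif(text: str) -> List[Tuple[str, str]]:
    """Return the fields in order as (name, value) pairs."""
    fields: List[Tuple[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0].isspace() and fields:
            # Assumption: an indented line continues the previous field.
            key, value = fields[-1]
            fields[-1] = (key, (value + " " + line.strip()).strip())
            continue
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            fields.append((key.strip(), value.strip()))
    return fields


def author_clusters(fields: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Group Author-* fields; every Author-Name starts a new cluster."""
    clusters: List[Dict[str, str]] = []
    for key, value in fields:
        if key == "Author-Name":
            clusters.append({key: value})
        elif key.startswith("Author-") and clusters:
            clusters[-1][key] = value
    return clusters


fields = parse_redif(SAMPLE)
authors = author_clusters(fields)
print(len(authors), dict(fields)["Handle"])
```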
RePEc model compared to IR model
  Very similar, BUT
  RePEc model:
  - Harvests only from “official” publisher repositories
  - Therefore: 1 work exists once in RePEc, and it is guaranteed to be the one and only “official” manifestation of the work
  IR model:
  - Holds publications for which the institution is typically not the publisher
  - 1 work: 1 official manifestation + multiple author manifestations
  - One work can exist:
      - in one or more repositories
      - as different publication types
      - with different descriptive metadata
      - with different object files attached
      - with different object file metadata
  Pushing and pulling metadata records from RePEc and IR into one system is bound to raise problems
Pull metadata from RePEc
  - EO harvests AMF-formatted metadata records from [...]
  - Overlap!! The same records are harvested from IR and RePEc
  - Solution: the XML Admin file contains the directive <not-from-repec-series>
      - Specifies which RePEc series do not need to be harvested from RePEc, since they are already delivered through the IR
  - BUT:
      - The IR contains articles produced by its authors
      - These articles are contained in a journal's RePEc series
      - Overlap in EO cannot be avoided
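A possible reading of the <not-from-repec-series> directive, sketched as a filter over harvested handles. The assumption is that a series is identified by the first three colon-separated parts of a RePEc handle (`RePEc:archive:series`); the function names and the sample handles other than the one quoted in the slides are hypothetical.

```python
from typing import Iterable, List


def series_of(handle: str) -> str:
    """Return the 'RePEc:archive:series' prefix of a RePEc handle."""
    parts = handle.split(":")
    if len(parts) < 4 or parts[0] != "RePEc":
        raise ValueError(f"not a RePEc handle: {handle!r}")
    return ":".join(parts[:3])


def filter_repec_records(handles: Iterable[str],
                         not_from_repec_series: Iterable[str]) -> List[str]:
    """Drop records whose series the partner already delivers through its IR."""
    excluded = set(not_from_repec_series)
    return [h for h in handles if series_of(h) not in excluded]


harvested = [
    "RePEc:dul:ecoulb:2013-941",   # series delivered through the IR
    "RePEc:xxx:journl:v2y2004p1",  # hypothetical journal series
]
kept = filter_repec_records(harvested, ["RePEc:dul:ecoulb"])
print(kept)
```

The second record illustrates the "BUT" above: an article in a journal series is not covered by the directive, so a copy held in the IR would still overlap.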
Push metadata to RePEc
  - EO sets up the “RePEc:ner” archive, containing ReDIF-X formatted records
  - ReDIF-X: all records are delivered as “ReDIF-Paper”, but with extra fields denoting the “real” publication status and the version of the text
  - Overlap!! Most institutions already maintain RePEc series: these records must not be pushed by EO
  - The XML Admin file controls which series to feed into this “ner” archive:
      - <feed-repec>: boolean, to feed or not to feed
      - <feed-repec-series>: pairs a RePEc series with the OAI setSpec of the DIDL/MODS record
          - If not given: all records with full text that are not working papers are mapped to one series for that institution
  - BUT: the IR-inherent problem of multiple copies/versions is pushed to RePEc
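The two admin-file directives can be read as a small routing decision per record. The sketch below is one interpretation under stated assumptions: the record keys (`setspec`, `has_fulltext`, `type`), the `workingPaper` type label, the series names, and the choice to skip records whose setSpec is not listed are all hypothetical and not taken from the NEEO specifications.

```python
from typing import Dict, Optional


def target_series(record: Dict,
                  feed_repec: bool,
                  series_by_setspec: Optional[Dict[str, str]],
                  default_series: str) -> Optional[str]:
    """Return the 'ner' series a record is pushed to, or None to skip it."""
    if not feed_repec:
        return None  # <feed-repec> is false: nothing is pushed
    if series_by_setspec:
        # <feed-repec-series> given: only listed setSpecs are fed (assumption).
        return series_by_setspec.get(record["setspec"])
    # Directive absent: full-text records that are not working papers
    # all go to one series for the institution.
    if record["has_fulltext"] and record["type"] != "workingPaper":
        return default_series
    return None


article = {"setspec": "articles", "has_fulltext": True, "type": "article"}
paper = {"setspec": "wp", "has_fulltext": True, "type": "workingPaper"}

print(target_series(article, True, None, "inst-default"))
print(target_series(paper, True, None, "inst-default"))
```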
Push metadata to RePEc: ReDIF-X
  Template-type: ReDIF-Paper 1.0
  Title: Block investments and the race for corporate control in Belgium
  Author-Name: Chapelle, Ariane
  Language: en
  Note: info:eu-repo/semantics/published
  X-PublishedAs-Type: article
  X-PublishedAs-Article-Year: 2004
  X-PublishedAs-Article-Journal: Corporate Ownership & Control
  X-PublishedAs-Article-Volume: 2
  X-PublishedAs-Article-Issue: 1
  Order-URL:
  File-URL:
  File-Format: application/pdf
  File-Version: authorVersion
  Handle: RePEc:ulb:ecoulb:2013/9943
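A sketch of how an exporter might serialize a record into the ReDIF-X layout shown above. The field names are the ones on the slide; the input dictionary shape and the function name are hypothetical, only the `article` case is handled because it is the only one illustrated, and the URL fields are left out because their values are not given.

```python
from typing import Dict, List


def to_redif_x(record: Dict) -> str:
    """Serialize one record as a ReDIF-Paper template with X- fields."""
    lines: List[str] = [
        "Template-type: ReDIF-Paper 1.0",
        f"Title: {record['title']}",
    ]
    lines += [f"Author-Name: {name}" for name in record["authors"]]
    lines.append(f"Language: {record['language']}")
    lines.append(f"Note: {record['status']}")
    published = record.get("published_as")
    if published and published["type"] == "article":
        # The "real" publication type travels in extra X- fields.
        lines.append("X-PublishedAs-Type: article")
        for label in ("Year", "Journal", "Volume", "Issue"):
            value = published.get(label.lower())
            if value is not None:
                lines.append(f"X-PublishedAs-Article-{label}: {value}")
    lines.append(f"File-Format: {record['file_format']}")
    lines.append(f"File-Version: {record['file_version']}")
    lines.append(f"Handle: {record['handle']}")
    return "\n".join(lines)


record = {
    "title": "Block investments and the race for corporate control in Belgium",
    "authors": ["Chapelle, Ariane"],
    "language": "en",
    "status": "info:eu-repo/semantics/published",
    "published_as": {"type": "article", "year": 2004,
                     "journal": "Corporate Ownership & Control",
                     "volume": 2, "issue": 1},
    "file_format": "application/pdf",
    "file_version": "authorVersion",
    "handle": "RePEc:ulb:ecoulb:2013/9943",
}
output = to_redif_x(record)
print(output)
```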
LogEc
  Aim: track abstract views and download clicks of publications presented through RePEc services (EconPapers, IDEAS, ... Economists Online)
  - NOT: tracking of usage at the level of the archives
  - Downloads of publications contained in RePEc archives that are initiated by a Google user do not show up in LogEc
  How:
  - EO logs abstract views and download clicks of object files
  - On a monthly basis, EO transforms these log entries into the requested LogEc format, using “”
      2009-10 EconomistsOnline RePEc:aah:aarhec:1987-21 a: d:
  The RePEc handle of the publication is necessary
  - EO partners delivering content to RePEc directly (whose records EO therefore harvests from the IR, not from RePEc) must include the RePEc handle in the DIDL/MODS record
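The monthly transformation can be pictured as a simple aggregation of log entries per RePEc handle. The sketch below assumes a log entry is a `(date, handle, event)` tuple with `event` being `"abstract"` or `"download"`; that shape, the function name and the output tuples are hypothetical, and the actual LogEc submission format is not reproduced here.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple

LogEntry = Tuple[str, Optional[str], str]  # (ISO date, RePEc handle, event)


def monthly_counts(entries: Iterable[LogEntry],
                   month: str) -> List[Tuple[str, str, int, int]]:
    """Return (month, handle, abstract views, downloads) per handle."""
    views: Counter = Counter()
    downloads: Counter = Counter()
    for date, handle, event in entries:
        if not date.startswith(month) or not handle:
            continue  # other month, or no RePEc handle: cannot be reported
        if event == "abstract":
            views[handle] += 1
        elif event == "download":
            downloads[handle] += 1
    handles = sorted(set(views) | set(downloads))
    return [(month, h, views[h], downloads[h]) for h in handles]


log = [
    ("2009-10-02", "RePEc:aah:aarhec:1987-21", "abstract"),
    ("2009-10-02", "RePEc:aah:aarhec:1987-21", "download"),
    ("2009-10-15", "RePEc:aah:aarhec:1987-21", "abstract"),
    ("2009-10-20", None, "download"),  # record without a RePEc handle
    ("2009-11-01", "RePEc:aah:aarhec:1987-21", "download"),
]
report = monthly_counts(log, "2009-10")
print(report)
```

The entry without a handle is silently dropped, which is exactly why the handle has to be present in the DIDL/MODS record.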
LogEc
  [Diagram] The RePEc record (AMF metadata, with its RePEc handle) is attached by reference to the EO compound object:
  DIDL[1]
    Item[1]
      Descriptor/Identifier (persistent identifier)
      Descriptor/modified
      Item[1..∞] (of type descriptiveMetadata)
      Item[0..∞] (of type objectFile)
      Item[0..1] (of type humanStartPage)
      Item[0..∞] (of type descriptiveMetadata): RePEc (AMF metadata), byRef
        RePEc handle
        Descriptor/modified
CitEc
  Aim: citation analysis for RePEc publications
  How:
  - Analyze text: extract and parse the list of references from publications
  - References are checked for availability in RePEc
      - Cites: references to other RePEc publications
      - Textual references
      - CitedBy
      - Co-citations
  - EO publications (from our IRs) are pushed to RePEc and are therefore pulled through the CitEc processing
  - EO has access to the resulting CitEc data, and presents this through the EO portal (not yet, will be in Feb 2010)
  The RePEc handle of the publication is necessary
  - EO partners delivering content to RePEc directly (whose records EO therefore harvests from the IR, not from RePEc) must include the RePEc handle in the DIDL/MODS record
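To make the relations named above concrete, the sketch below derives CitedBy and co-citation counts from a Cites mapping. It illustrates the definitions only and is not CitEc's implementation; the handles and function names are hypothetical. Two publications are counted as co-cited each time a single publication cites both.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Set, Tuple


def cited_by(cites: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Invert the Cites relation: cited handle -> set of citing handles."""
    result: Dict[str, Set[str]] = defaultdict(set)
    for citing, cited_set in cites.items():
        for cited in cited_set:
            result[cited].add(citing)
    return dict(result)


def co_citations(cites: Dict[str, Set[str]]) -> Dict[Tuple[str, str], int]:
    """Count, per pair of handles, how many publications cite both."""
    counts: Dict[Tuple[str, str], int] = defaultdict(int)
    for cited_set in cites.values():
        for a, b in combinations(sorted(cited_set), 2):
            counts[(a, b)] += 1
    return dict(counts)


# Hypothetical handles: papers 3 and 4 both cite papers 1 and 2.
cites = {
    "RePEc:xxx:wpaper:3": {"RePEc:xxx:wpaper:1", "RePEc:xxx:wpaper:2"},
    "RePEc:xxx:wpaper:4": {"RePEc:xxx:wpaper:1", "RePEc:xxx:wpaper:2"},
    "RePEc:xxx:wpaper:5": {"RePEc:xxx:wpaper:1"},
}
print(cited_by(cites)["RePEc:xxx:wpaper:1"])
print(co_citations(cites))
```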
Theme 2: Economists Online and RePEc
  BO Group 1: Push/pull to/from RePEc
  - ReDIF-X data structure
  - Duplicates; different versions of identical publication
  BO Group 2: Publishing models
  - Advantages/disadvantages of the RePEc publishing model as opposed to the IR publishing model
  - Push the two models together?
  - Do we need to foresee specific services in the gateway or portal to make these two live together in peace?
  BO Group 3: Future RePEc/EO services
  - What services should EO and RePEc jointly be looking at in the future in the interest of the economics researcher?