Towards a Persistent Identifier Infrastructure for European e-Research Daan Broeder CLARIN / MPG 2008 CNRI Handle System Workshop
Content
Domain & Scope
Organizational embedding
Further requirements
Services for e-research with PIDs
Domain & Scope
Reliable references & citations of web-accessible resources
Language resource domain
  – Audio & video recordings, pictures, primary texts, annotations
  – Lexica, grammar descriptions, …
  – Concepts in terminology registries and ontologies
  – …
The number of resources is very large, depending on how you approach the granularity issue
References and citations
  – Embedded in (web) documents
  – In data structures
  – In DBs
  – …
CLARIN Common Language Resources and Technology Infrastructure
The CLARIN project is a large-scale pan-European collaborative effort to create, coordinate and make language resources and technology available and readily usable.
As one of its goals, CLARIN will create a federation of LR repositories and aims to create a unified resource registry using persistent identifiers.
CLARIN Common Language Resources and Technology Infrastructure
Preparatory phase (Construction phase)
European dimension (ICT FP7)
  – 112 members from 35 countries
  – Preparatory phase funded with EUR 4.2 million
National dimension:
  – Funding until now EUR 6.5 million, more to come
  – …
DAM-LR Distributed Access Management for Language Resources
Small European project (4 partners) aimed at federation building in the LR repository domain
Unified metadata catalogue
Identity federation using Shibboleth
Single resource identifier system for all published resources, using the Handle System
DAM-LR HS infrastructure
Developed special tools
Mover
  – Updates Handle DB + catalogue
  – Updates metadata XML files*
Restore operations
  – Recreate the Handle DB (and others) from scratch
Lessons learned
  – Federation technology is not for all organizations
[Figure: Lund, MPI and INL archives holding resources (R), with handle servers labelled primary and sec (secondary); prefix 1839 shown]
User benefits
MPG (Max Planck Society)
Proposal within the MPG to support an MPG-wide PID registration service based on the HS
Run by the MPG computing center GWDG
Will also give support to non-MPG German scientific organizations and (hopefully) CLARIN
Requirements
(Political) independence: European GHR mirror & proxy + no single point of failure
Wide(r) acceptance of the PID scheme
Support for object part addressing, from the ISO TC37/SC4 CITER work
Support for (secure) management of resource copies
CLARIN PID Infrastructure
[Figure: proxy and GHR mirror; MPI archive (Class A) with a primary handle server for prefix 1839; Class C archive; PID registration service; primary and secondary (sec.) servers; example handles 1839/R1 and 1111/R5]
PID Scheme
Difficult to gain acceptance
  – Without the PID syntax being official
  – W3C seems to have problems with anything other than HTTP (see recent XRI events)
Can the HS user community help?
Possibly acceptance only via urlified handles:
Perhaps follow ARK for elegance:
  –
PIDs & Resource Parts
Wasteful to issue a PID for each part (think of 100k entries in a lexicon), so use part identifiers.
The resolver can make an adequate translation: A#z -> objectA?part=z
This requires enough flexibility from the resolver to accommodate the object server.
The syntax of z should be standard for the specific data type. Borrow from existing fragment identifier syntax standards.
[Figure: object A with parts x, y and z. Instead of separate handles 1839/x, 1839/y and 1839/z, the parts are addressed as 1839/A#x, 1839/A#y and 1839/A#z; the PID resolver passes 1839/A#z to the object server, which returns part z of A]
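The translation A#z -> objectA?part=z can be illustrated with a minimal Python sketch. It is not the Handle System resolver; the lookup table, the example.org URL and the `part` query parameter name are assumptions made for illustration only.

```python
from urllib.parse import quote

# Hypothetical table: handle -> object server URL (illustrative values only).
HANDLE_TABLE = {"1839/A": "http://objects.example.org/objectA"}


def split_pid(pid):
    """Split 'prefix/suffix#part' into (handle, part); part is None if absent."""
    handle, sep, part = pid.partition("#")
    return handle, (part if sep else None)


def resolve(pid, table=HANDLE_TABLE):
    """Translate a PID with an optional part identifier into an object server URL."""
    handle, part = split_pid(pid)
    base = table[handle]  # raises KeyError for an unknown handle
    if part is None:
        return base
    # Pass the part identifier on to the object server as a query parameter.
    return f"{base}?part={quote(part, safe='')}"


print(resolve("1839/A#z"))
```

Only one handle is registered for the whole object; the part syntax after `#` stays opaque to the resolver and is interpreted by the object server.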
Resource duplicates
What if MPI moves the resource copy?
MPI should have write access to the Lund Handle record
This would enable changing the Lund URL record too!
[Figure: Lund archive with resource R and handle 10050/R pointing to a copy in the MPI archive (primary server, prefix 1839); the MPI Manager moves the copy; LHS and Access monitor shown]
Resource duplicates
Indirect handles*
TYPE = URL
  – IE-Plugin: ok
  – HS proxy: not ok
TYPE = HS_ALIAS (problem*)
  – IE-Plugin: ok
  – HS proxy: ok
Status of the 1839/Rcpy handle?
  – Use in documents? -> hdl:1839/Rcpy
[Figure: Lund handle 10050/R points to the copy via the indirect handle 1839/Rcpy, which points to the copy in the MPI archive; when the MPI Manager moves the copy, only 1839/Rcpy changes]
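The indirection can be sketched as a toy model in Python. This is not the Handle System client API: the in-memory record store, the example.org URLs and the choice to keep a URL value and an HS_ALIAS value side by side in one record are simplifying assumptions.

```python
# Toy handle records: handle -> list of (value type, data). Illustrative only.
RECORDS = {
    "10050/R": [
        ("URL", "http://lund.example.org/R"),   # Lund's own copy
        ("HS_ALIAS", "1839/Rcpy"),              # indirect pointer to the MPI copy
    ],
    "1839/Rcpy": [("URL", "http://mpi.example.org/copies/R")],
}


def resolve_urls(handle, records, max_hops=5):
    """Collect all URLs reachable from a handle, following HS_ALIAS values."""
    urls, seen = [], set()
    stack = [(handle, 0)]
    while stack:
        current, hops = stack.pop()
        if current in seen or hops > max_hops:
            continue  # guard against alias loops and overly long chains
        seen.add(current)
        for value_type, data in records.get(current, []):
            if value_type == "URL":
                urls.append(data)
            elif value_type == "HS_ALIAS":
                stack.append((data, hops + 1))
    return urls


def move_copy(records, copy_handle, new_url):
    """MPI moves its copy: only the record MPI itself manages is rewritten."""
    records[copy_handle] = [("URL", new_url)]
```

In this model MPI never needs write access to the Lund record: after `move_copy(RECORDS, "1839/Rcpy", ...)`, resolving 10050/R picks up the new location through the alias.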
Possible Added PID Services
Establishing resource authenticity
Resource Collection Registration
Resource Citation Information
Lost Resource Detective
…
Collection Registration Service
Much scientific work depends on seemingly accidental, distributed collections of material that have no independent embodiment.
Such a collection needs to be citable with one single PID
  – Encode the collection's resource URIs directly in a handle record
  – Attach a link to a map of the collection's URIs
Compare the recent Aggregation Map concept from ORE
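The two options can be compared with a small Python sketch. Handle values are modelled as index/type/data dictionaries; the value type `COLLECTION_MAP`, the function names and the `fetch_map` callback are hypothetical and not part of the Handle System.

```python
def inline_collection_values(member_uris):
    """Option 1: one URL value per member, stored in the handle record itself."""
    return [
        {"index": i, "type": "URL", "data": uri}
        for i, uri in enumerate(member_uris, start=1)
    ]


def mapped_collection_values(map_uri):
    """Option 2: a single value linking to an external map of the members."""
    # "COLLECTION_MAP" is a made-up value type used only in this sketch.
    return [{"index": 1, "type": "COLLECTION_MAP", "data": map_uri}]


def members(values, fetch_map=None):
    """List member URIs for either encoding.

    fetch_map is a caller-supplied function that retrieves the map document
    and returns the member URIs it lists.
    """
    for value in values:
        if value["type"] == "COLLECTION_MAP":
            return list(fetch_map(value["data"]))
    return [v["data"] for v in values if v["type"] == "URL"]
```

Inline encoding keeps the collection resolvable from the handle record alone, while the map option keeps the record small and moves the member list into a separate document.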
Citation Information Service
(Collections of) resources need to be cited in documents.
Acknowledgement & credit are also important for primary scientific data
E.g. Dutch Spoken Corpus, © Institute for Dutch Lexicography, ….
Make this citation information part of the metadata associated with the PID
Lost Resource Detective
Establishing Provenance
If the handle-to-URI mapping was accidentally not maintained properly, special metadata could be made available from the handle record to establish the resource's location or find a copy.
  – URI history, Repository, Depositor, …
Labor intensive
Only for a limited number of resources, unless there is a pattern
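A minimal Python sketch of how such metadata might be used to look for a lost resource. The field names (`url`, `uri_history`, `repository`, `depositor`) and the injected reachability check are assumptions; the slide only names the kinds of information.

```python
def candidate_locations(record, is_reachable):
    """Return the reachable URIs for a resource, most recent first.

    record: dict with the current 'url' and an optional 'uri_history'
            list ordered oldest to newest (hypothetical field names).
    is_reachable: caller-supplied function that checks a single URI.
    """
    history = list(reversed(record.get("uri_history", [])))
    candidates = [record["url"]] + history
    return [uri for uri in candidates if is_reachable(uri)]


def contacts(record):
    """People or organizations to ask when no recorded URI works any more."""
    return [record[key] for key in ("repository", "depositor") if key in record]
```

When `candidate_locations` comes back empty, the remaining work is manual (contacting the repository or depositor), which is why the slide calls the service labor intensive.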
The End
DAM-LR HS infrastructure
Integration: it should be an optional extension
  – Make sure the HS is not a single point of failure
  – IMDI/LAT software also functions without the HS
Issue handles for objects
  – Only for local resources
Need special tools
Mover
  – Updates Handle DB + catalogue
  – Updates IMDI XML files*
Restore operations
  – Recreate the Handle DB (and others) from scratch
[Figure: LAT webapps, LHS, Handle DB, catalogue, mover, IMDI harvester and sync; example entries MPI1001# mpi_url and 1839/087-D mpi_url]