EUDAT & CKAN
Heinrich Widmann, widmann@dkrz.de
EUDAT
The European Data Infrastructure project (EUDAT, http://eudat.eu)
Motivation:
  Manage the rising tide of research data
  Improve interoperability across a wide, cross-disciplinary scope
Objective:
  Build a Collaborative Data Infrastructure based on common data services (https://eudat.eu/services), driven by the requirements of the research communities
B2FIND
The metadata service of EUDAT (info and docs: https://eudat.eu/services/b2find)
  Based on a comprehensive joint metadata catalogue of research data collections stored in EUDAT data centres and other (external) repositories
  Provides a powerful and user-friendly discovery portal (http://b2find.eudat.eu) on metadata covering a wide range of cross-disciplinary research communities
Used Technologies
CentOS 6 (production instance)
Modular ingestion workflow
  Harvesting: OAI-PMH (JSONAPI etc. are supported as well)
  Own mapping module (+ community-specific metadata schemas and ontologies, closed vocabularies, ...)
  Upload to CKAN: common B2FIND metadata schema plus many additional facets (extra fields)
Apache + Varnish 3 cache + CKAN version 2.2.3 with extensions
CKAN itself could harvest OAI-PMH. Why is there a separate mapping and harvesting module?
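The harvest, map, upload chain above can be illustrated with a minimal sketch. It is not the B2FIND code: the sample record, the function names `harvest` and `map_record`, and the extra field `Discipline` are assumptions made for illustration (`PublicationYear` is taken from the facet mentioned on the extensions slide). The sketch parses an OAI-PMH `ListRecords` response carrying Dublin Core metadata and maps each record to a dictionary shaped like the input of CKAN's `package_create` action. The actual upload is left out.

```python
import re
import xml.etree.ElementTree as ET

# XML namespaces defined by OAI-PMH 2.0 and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical ListRecords response, standing in for a real harvest request.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:rec-1</identifier>
        <datestamp>2015-01-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample climate data set</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:subject>Climate</dc:subject>
          <dc:date>2014</dc:date>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def harvest(xml_text):
    """Yield (identifier, dc_element) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text.strip())
    for record in root.iterfind(".//oai:record", NS):
        ident = record.findtext("oai:header/oai:identifier", namespaces=NS)
        dc = record.find("oai:metadata/oai_dc:dc", NS)
        if dc is not None:  # records flagged as deleted carry no metadata
            yield ident, dc


def map_record(ident, dc):
    """Map one Dublin Core record to a CKAN-style dataset dictionary."""
    def values(tag):
        return [e.text.strip() for e in dc.findall("dc:" + tag, NS) if e.text]

    titles, dates = values("title"), values("date")
    # Extra (facet) fields; the key names are illustrative assumptions.
    extras = {
        "Discipline": "; ".join(values("subject")),
        "PublicationYear": dates[0][:4] if dates else "",
    }
    return {
        # CKAN dataset names allow only lowercase letters, digits, '-' and '_'.
        "name": re.sub(r"[^a-z0-9_-]", "-", ident.lower()),
        "title": titles[0] if titles else ident,
        "author": "; ".join(values("creator")),
        "extras": [{"key": k, "value": v} for k, v in extras.items() if v],
    }


datasets = [map_record(ident, dc) for ident, dc in harvest(SAMPLE_RESPONSE)]
```

Each resulting dictionary could then be sent as JSON to the `package_create` action of the CKAN API. Keeping the mapping as a separate step, as in this sketch, is one way to apply community-specific rules before anything reaches CKAN.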
CKAN extensions
ckanext-b2find (+ B2FIND facets, legal pages etc.)
ckanext-spatial (supported by CKAN, but with compatibility issues, now fixed)
ckanext-timeline (own development for 'Temporal coverage' on different time scales => makes usability quite complex)
  (How) can it be added to the supported CKAN extensions?
  'Commitment' by CKAN to support and maintenance?
  Are others interested in further development of this extension?
ckanext-datesearch (PublicationYear)
Planned:
  Support of more extensions, e.g.
    Use the potential of the semantic web / LOD (+ DCAT, SPARQL, RDF)
    Recombinant ??, Kettle ??, ...
  Improve web appearance (+ Elasticsearch, ...)
Issues
Scalability / performance (mostly Postgres related)
  Status: > 450000 records harvested
  Upload / indices (re-index takes > 3 days!)
  Download / search (esp. when accessing the PG DB)
  Delete (purge!) datasets (often not removed completely from DB + SOLR)
Upgrade to newer CKAN versions
  Compatibility of CKAN extensions (spatial, temporal)
  Compatibility with own schema
Decouple upload and search
  Two SOLR indices (one 'read only', one 'write and update')?
Issues (cont.)
History: how to get rid of it (in Postgres)?
  Something like 'paster clean history'?
Support of taxonomies/hierarchies for facets (hierarchical tree of (sub-)disciplines)
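One common way to get a hierarchical tree out of a flat facet field is to index depth-prefixed path tokens and filter them by prefix at query time (Solr offers the `facet.prefix` parameter for this). The sketch below only illustrates the token scheme; it is not something B2FIND is known to use, and the discipline names and the helper names `facet_tokens` and `children` are assumptions.

```python
def facet_tokens(path):
    """Expand a discipline path into depth-prefixed tokens.

    ['Humanities', 'History'] -> ['0/Humanities', '1/Humanities/History']
    """
    return ["%d/%s" % (depth, "/".join(path[:depth + 1]))
            for depth in range(len(path))]


def children(tokens, parent):
    """Return the direct sub-disciplines of `parent` found in `tokens`.

    `parent` is a list of names; an empty list selects the top level.
    """
    prefix = "%d/%s" % (len(parent), "/".join(parent + [""]))
    return sorted({t[len(prefix):] for t in tokens if t.startswith(prefix)})


# Hypothetical discipline paths of three datasets.
indexed = (
    facet_tokens(["Humanities", "History"])
    + facet_tokens(["Humanities", "Linguistics"])
    + facet_tokens(["Natural Sciences", "Physics"])
)

top_level = children(indexed, [])
sub_humanities = children(indexed, ["Humanities"])
```

The scheme assumes that discipline names never contain the `/` separator; a real vocabulary would need escaping or a different delimiter.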
Outlook
More records from more communities (will this scale to > 1 or 10 million records?)
Use tools such as Kibana and Elasticsearch to provide on-the-fly statistics in a dashboard
Community customisation (switch between different SOLR cores and adapted search facets)
Further search/dissemination functionality: annotations, SRU interface
Links
Docs: https://eudat.eu/services/userdoc/b2find
Portal: http://b2find.eudat.eu
Source code: https://github.com/EUDAT-B2FIND
Support: https://eudat.eu/support-request
Contact: widmann@dkrz.de