DASISH Metadata Catalogue Binyam Gebrekidan Gebre, Stephanie Roth, Olof Olsson, Catharina Wasner, Matej Durco, Bartholomäus Wloka, Przemyslaw Lenkiewicz,

DASISH Metadata Catalogue
Binyam Gebrekidan Gebre, Stephanie Roth, Olof Olsson, Catharina Wasner, Matej Durco, Bartholomäus Wloka, Przemyslaw Lenkiewicz, Kees Jan van de Looij, Daan Broeder
UGOT, GESIS, OEAW, MPG-PL

Talk outline
- Introduction
- Our approach to Metadata Catalogue development for SSH disciplines
- Outcomes

Introduction
Background
- CLARIN (VLO for linguistics)
- EUDAT (B2FIND for several disciplines)
Objectives
- To investigate metadata availability in the social sciences and humanities (SSH)
- To provide a single tool for metadata-based resource discovery, visualization, and search for several disciplines in SSH

Our workflow
1. Collect a list of metadata providers
2. Harvest metadata
3. Map to common facets
4. Normalize/harmonize
5. Import into a Metadata Catalogue

Our workflow, step 1: Collect a list of metadata providers
- Challenge: where do we get the list from?

List of metadata providers
- CESSDA (9 providers)
- CLARIN (20 providers)
- DARIAH (25 providers)
Total: 54 providers

Our workflow, step 2: Harvest metadata
- Challenge: it takes time to harvest metadata

Metadata harvesting
- CESSDA: harvested from 7 out of 9 providers, 49,894 records
- CLARIN: harvested from 4 out of 20 providers, 160,613 records
- DARIAH: harvested from 14 out of 25 providers, 302,164 records
Total: 25 providers with 512,671 records
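The summary slide names OAI-PMH as the harvesting protocol. The following is a minimal sketch of one harvesting step, not the project's harvester: it builds a ListRecords request URL and reads the record identifiers and the resumption token from one response page. The endpoint URL, the sample XML page, and the function names are assumptions made for illustration.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 response documents.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    if resumption_token:
        # A resumption token is an exclusive argument: it replaces all others.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urllib.parse.urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None) for one page."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text
        for el in root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token


# Invented response page, used instead of a live provider.
SAMPLE_PAGE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

ids, token = parse_list_records(SAMPLE_PAGE)
next_url = list_records_url("https://example.org/oai", resumption_token=token)
```

A full harvester would repeat the request with each new token until none is returned, which is one reason harvesting large providers takes time.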

Our workflow, step 3: Map to common facets
- Challenge: which facets to use, and how to map different metadata to these facets

Mapping to 19 facets
Source schemas
- CESSDA: ddi.xml, ddi-2.5.xml, ddi-3.1.xml, datacite-3.0.xml
- CLARIN: cmdi.xml (heterogeneously structured metadata records)
- DARIAH: dc.xml, ese.xml
Facets
- Creator
- Language
- Creation date
- Publication date
- Data provider
- Country
- Collection
- Discipline
- Subject
- OAI origin
- Spatial coverage
- Temporal coverage
- Contributor
- Metadata schema
- Metadata source
- Resource type
- Access [Rights]
- Community
- Data format

Mapping - challenges
Which of these is “the creator”?
- author
- originator
- creator
- researcher
- annotator
- recorder
We raise the same question for each field/facet; based on the answers, we define map rules.

Map rules
Objectives
- Extensible, easy-to-modify mapping
- Not “hardcoded”
- Editing requires no advanced development skills
Chain evaluation of simple rules
Types of operations
- Select
- Combine
- Remove duplicates
- Conditional action
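The idea of chaining simple rules can be sketched as follows. This is an illustration only: the rule format, the operation names (`select`, `combine`, `dedupe`, `if_empty`), and the sample record are assumptions, not the project's actual map file syntax. Each facet gets a chain of steps, and every step transforms the list of values produced by the previous one.

```python
import json

# Hypothetical rule format: facet -> chain of (operation, argument) steps.
MAP_RULES = {
    "Creator": [
        ("select", ["creator", "author", "originator"]),
        ("dedupe", None),
    ],
    "Spatial coverage": [
        ("select", ["city", "country"]),
        ("combine", ", "),
    ],
    "Access": [
        ("select", ["rights"]),
        ("if_empty", "unknown"),
    ],
}


def apply_step(values, operation, argument, record):
    """Apply one rule step to the values collected so far."""
    if operation == "select":
        # Append the values of each named source field, in the order listed.
        return values + [v for field in argument for v in record.get(field, [])]
    if operation == "combine":
        # Join all values into a single string.
        return [argument.join(values)] if values else []
    if operation == "dedupe":
        # Remove duplicates while keeping the first occurrence of each value.
        return list(dict.fromkeys(values))
    if operation == "if_empty":
        # Conditional action: fall back to a default when nothing was selected.
        return values if values else [argument]
    raise ValueError(f"unknown operation: {operation}")


def map_record(record, rules):
    """Evaluate each facet's rule chain against one source record."""
    mapped = {}
    for facet, steps in rules.items():
        values = []
        for operation, argument in steps:
            values = apply_step(values, operation, argument, record)
        mapped[facet] = values
    return mapped


# Invented source record: field name -> list of values.
record = {
    "author": ["A. Researcher"],
    "creator": ["A. Researcher", "B. Annotator"],
    "city": ["Nijmegen"],
    "country": ["Netherlands"],
}
mapped = map_record(record, MAP_RULES)
as_json = json.dumps(mapped, sort_keys=True)
```

Because the rules are plain data rather than code, they could be kept in an editable file, which is in the spirit of the "not hardcoded" objective above.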

Map rules
- CESSDA: ddi, ddi-2.5, ddi-3.1, datacite-3.0
- DARIAH: dc, ese
- CLARIN: cmdi allows very heterogeneously structured metadata records
  - The structures are governed by metadata profiles (annotated by ConceptLinks)
  - We have a script that generates map files

Map rules + mapper
- We run the mapper using the map rules for each community
- We get JSON (key-value pair) results

Our workflow, step 4: Normalize/harmonize
- Challenge: how to normalize various spellings of the same concept (e.g. nl, nld, Dutch, Nederlands)

Normalization
- Dates: yyyy-mm-dd (UTC format)
- Country names: ISO 3166 (pycountry)
- Language names: ISO 639-3 language standard
Challenge:
- Other facets are normalized using a simple, manually filled configuration file
- Organization names (e.g. MPI)
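A minimal sketch of the normalization step, under stated assumptions: the slides name pycountry and ISO 639-3, but to stay self-contained this example substitutes a small hand-filled lookup table, and the accepted input date layouts are invented for illustration. It shows the language example from the workflow slide (nl, nld, Dutch, Nederlands) collapsing to one code, and dates being rewritten as yyyy-mm-dd.

```python
from datetime import datetime

# Hand-filled lookup standing in for a full ISO 639-3 table (assumption).
LANGUAGE_CODES = {
    "nl": "nld",
    "nld": "nld",
    "dutch": "nld",
    "nederlands": "nld",
    "en": "eng",
    "eng": "eng",
    "english": "eng",
}

# Input date layouts to try, in order (assumption).
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%Y/%m/%d")


def normalize_language(value):
    """Return the ISO 639-3 code for a known spelling, else None."""
    return LANGUAGE_CODES.get(value.strip().lower())


def normalize_date(value):
    """Return the date as yyyy-mm-dd, or None if no known layout matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None


languages = [normalize_language(v) for v in ("nl", "nld", "Dutch", "Nederlands")]
dates = [normalize_date(v) for v in ("07.03.2014", "2014/03/07", "last spring")]
```

Returning None for unrecognized values, rather than guessing, leaves those records visible for manual additions to the lookup table.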

Our workflow, step 5: Import into a Metadata Catalogue
- Challenge: which catalogue system? What are the advantages and disadvantages of the selected catalogue?

CKAN
CKAN is an open source, off-the-shelf catalogue developed by the Open Knowledge Foundation
- Solr
- Postgres database
- Python
Advantages
- It is open source
- Actively developed/improved
- Easy to use and adapt
- Has a web interface and an API
- Has a lot of features (access control, data visualization and analytics, etc.)

Importing into CKAN
Challenge
- Importing data into CKAN takes a long time if it is not optimized and there are many datasets (e.g. millions)
Optimizations
- CKAN config file
- Postgres database
- Postgres config file
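As an illustration of importing through the CKAN API, the sketch below prepares a `package_create` request for one normalized record without sending it. It is not the project's importer and does not show the optimizations listed above. The CKAN URL, API key, record identifier, and the choice to store facets as CKAN "extras" are assumptions.

```python
import json
import re
import urllib.request


def to_ckan_name(identifier):
    """Derive a dataset name using only lowercase ASCII letters, digits, '-' and '_'."""
    name = re.sub(r"[^a-z0-9_-]+", "-", identifier.lower()).strip("-")
    # CKAN limits dataset names to 100 characters.
    return name[:100]


def build_package_create(ckan_url, api_key, identifier, title, facets):
    """Prepare (but do not send) a package_create request for one record."""
    payload = {
        "name": to_ckan_name(identifier),
        "title": title,
        # Facets are stored as CKAN "extras" key-value pairs in this sketch.
        "extras": [{"key": k, "value": v} for k, v in sorted(facets.items())],
    }
    return urllib.request.Request(
        ckan_url.rstrip("/") + "/api/3/action/package_create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": api_key},
        method="POST",
    )


# All values below are invented for illustration.
request = build_package_create(
    "https://ckan.example.org",
    "hypothetical-api-key",
    "oai:example.org:Record 1",
    "Example record",
    {"Language": "nld", "Country": "Netherlands"},
)
payload = json.loads(request.data)
# urllib.request.urlopen(request) would send it to a real CKAN instance.
```

One HTTP request per record is the simplest approach and also the slowest, which is consistent with the import-time challenge described above.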

Summary
Provider -> OAI-PMH Harvester -> XML files -> Mapper -> JSON files -> Normalizer -> JSON files -> Web portal (CKAN)
- Map rules (used by the Mapper)
- Normalization rules (used by the Normalizer)
- CLARIN CMDI

Outcomes
- List of data providers
- Selected useful facets (19 of them)
- Developed tools for
  - Harvesting
  - Mapping
  - Normalization
  - Concept mapping (map concepts or XPaths to facets)
- Better understanding of CKAN benefits and limitations
- Source code is open source
- Catalogue demo

Conclusions
- Provided an overview of the available metadata in SSH: metadata providers and schemas used
- Creating mapping and normalization rules is challenging
- Improving the quality of the metadata catalogue is a long process (it requires much domain expertise and patience)
- All products will be transferred to the EUDAT project (B2FIND)

Contributors
- Olof Olsson (UGOT)
- Stephanie Roth (UGOT)
- Catharina Wasner (GESIS)
- Matej Durco (OEAW)
- Bartholomäus Wloka (OEAW)
- Daan Broeder (MPG-PL)
- Kees Jan van de Looij (MPG-PL)
- Menzo Windhouwer (MPG-PL)
- Binyam Gebrekidan Gebre (MPG-PL)

Next: demo

List of data providers
CESSDA (7 out of 9 providers)
- DANS_Easy_Archive (28404 records)
- GESIS_via_DataCite (6225 records)
- LiDA (546 records)
- SND_via_DataCite (2245 records)
- the_Swedish_Language_Banks_resources (115 records)
- UK_Data_Archive_OAI_Repository (6286 records)
- UKDA_via_DataCite (6073 records)
CLARIN (4 out of 20 providers)
- CLARIN_Centre_Vienna_Language_Resources_Portal (7 records)
- CLARIN_DK_UCPH_Repository (14324 records)
- DANS_CMDI_Provider (1000 records)
- The_Language_Archive_s_IMDI_portal (145282 records)
DARIAH (14 out of 25 providers)
- ACDH_Repository
- Demo_instance_for_the_imeji_community
- Sistory_si_OAI_Repository
- 11 others

Example of a harvested file