Data catalogues and the data repository ADMIRe JISC MRD Dr Tom Parsons March 2013 Sunday, November 11, 2018 ADMIRe
A world-class university
- One of the world’s top 100 universities, Nottingham is recognised globally for ground-breaking research and teaching excellence.
- 40,000 students from more than 150 countries, two overseas campuses and strong links with universities around the world.
- Heavily focused on research: Medical & Health Sciences, Sciences, Engineering, Social Sciences and Arts.
- Large research income (£100m), primarily from RCUK, UK/EU government, commercial sources and charities.
Key priorities for ADMIRe
RDM policy: “1.5. The University will provide mechanisms and services for storage, backup, registration, deposit, retention and preservation of research data assets in support of current and future access, during and after completion of research projects.”
- Is the current provision good enough?
- Where are the gaps?
- What do we need to provide?
Understanding requirements
Approaches:
- Survey (summer 2012)
- Focus groups (November 2012)
- Interviews (May 2012 onwards)
- Mixture of ADMIRe, in-house, JISC MRD & Sero
Outputs: service model, detailed requirements catalogue, logical models & prototype
Institutional requirements: “Enterprise Architecture compliant”, use and integrate with existing systems
Survey results: Types of data
Survey results: Data storage
Survey results: Metadata…
Sharing data?
Survey results: Total research data estimates
From the survey’s 366 responses: 75 GB average (mean/frequency)
Total research data estimates
- 75 GB average × approx. number of PIs and postgraduates (4,000) = 300 TB (±90%)
- Large number of unknowns
- A large amount of data, a large number of files and a good case for managing it
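The estimate above is simple arithmetic, sketched here in Python. The figures (75 GB mean, roughly 4,000 PIs and postgraduates, ±90% uncertainty) come from the slide; the function name and the use of decimal units (1 TB = 1,000 GB) are assumptions made for illustration.

```python
def estimate_total_tb(mean_gb: float, researchers: int, uncertainty: float):
    """Return (low, central, high) estimates in TB.

    Assumes decimal units (1 TB = 1000 GB) and a symmetric relative uncertainty.
    """
    central = mean_gb * researchers / 1000.0
    low = central * (1 - uncertainty)
    high = central * (1 + uncertainty)
    return low, central, high


low, central, high = estimate_total_tb(75, 4000, 0.90)
print(f"{central:.0f} TB (range {low:.0f} to {high:.0f} TB)")
# prints "300 TB (range 30 to 570 TB)"
```

The width of that range is the point: even the low end is tens of terabytes, so the case for managing the data holds across the whole uncertainty band.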
Focus groups to understand more
- Five Faculty-based focus groups (30 people in total)
- Based upon California Digital Library model
Active data
Archive data
Preservation activities
Function Actors Req. Freq R S A
1 – Tag: Enter metadata describing a bag of research data assets M
2 – Bag: Zip the data files up in a bag C
3 – Transfer +: Transfer a bag to archival storage
4 – Ingest: Ingest a bag into storage
5 – Update: Update (enhance, correct) metadata for a stored bag O L
6 – GetDOI: Get (public, private) DOIs for designated assets
7 – Publish: Publish assets appropriately on landing pages
8 – Relocate: Relocate assets and update locators
9 – Search: Search for assets by keyword or field H
10 – Access: Access metadata and data according to permissions
11 – Notify: Notify actors automatically about data events P
12 – Annotate: Create notes about a bag or its contents
13 – Check: Check (verify) that the contents of a bag are in order
14 – Report: Run reports on aspects of the system (DOI, bag, user)
15 – Administer: Administer permissions and system parameters
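As an illustration of the Tag, Bag and Check functions listed above, here is a minimal Python sketch using only the standard library. It is not the ADMIRe tooling: the function names, the metadata keys and the file layout (a `data/` folder, a `manifest-sha256.txt` and a `bag-info.txt`, loosely modelled on the BagIt packaging convention) are assumptions chosen for the example.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path


def make_bag(files, metadata, bag_path):
    """Tag and Bag: zip files with descriptive metadata and a checksum manifest.

    Assumes the input files have distinct names (they share one data/ folder).
    """
    manifest_lines = []
    with zipfile.ZipFile(bag_path, "w", zipfile.ZIP_DEFLATED) as bag:
        for f in files:
            payload = Path(f).read_bytes()
            arcname = f"data/{Path(f).name}"
            bag.writestr(arcname, payload)
            digest = hashlib.sha256(payload).hexdigest()
            manifest_lines.append(f"{digest}  {arcname}")
        bag.writestr("manifest-sha256.txt", "\n".join(manifest_lines) + "\n")
        # Tag: free-form "Key: value" metadata describing the bag.
        info = "".join(f"{key}: {value}\n" for key, value in metadata.items())
        bag.writestr("bag-info.txt", info)


def check_bag(bag_path):
    """Check: verify every payload file in the bag against the manifest."""
    with zipfile.ZipFile(bag_path) as bag:
        names = set(bag.namelist())
        manifest = bag.read("manifest-sha256.txt").decode("utf-8")
        listed = set()
        for line in manifest.splitlines():
            if not line.strip():
                continue
            digest, arcname = line.split("  ", 1)
            listed.add(arcname)
            if arcname not in names:
                return False  # file named in the manifest is missing
            if hashlib.sha256(bag.read(arcname)).hexdigest() != digest:
                return False  # contents changed since bagging
        payload = {n for n in names if n.startswith("data/")}
        return payload == listed  # no unlisted payload files


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "readings.csv"
    sample.write_text("t,value\n0,1.5\n")
    bag_file = Path(tmp) / "bag.zip"
    make_bag([sample], {"Title": "Example readings", "Creator": "A. Researcher"}, bag_file)
    print(check_bag(bag_file))
```

Storing the checksums inside the bag means the same check can be repeated after Transfer, Ingest or Relocate without access to the original files.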
Mapping requirements
Where are we now?
Solutions (columns: Solution, Description, Scope, Interfaces/Integrations, Direct Users)

Data Retention Platform
  Description: A storage platform that enables storage of “unstructured” data files. BPM Metastorm frontend.
  Scope: Storage of files and very basic metadata (file type, size, retention period, user).
  Interfaces/Integrations: AD to support access. (Note that Open Access will be supported by providing a persistent account, used by the Research data web site server, that has read-only access to all “Open” data sets.)
  Direct Users: Researchers

Research data search and retrieve web site
  Description: Web site. Expected to be CMS or possibly SharePoint.
  Scope: Web site with relevant information and screens to search and return results.
  Interfaces/Integrations:
    1. Data Retention Platform via REST to enable http(s) data transfer.
    2. FAST (embedded function) to allow search from a web page.
    3. Equella (API) to expose metadata onto search results.
    4. Active Directory/LDAP to authenticate file access.
  Direct Users: Those searching for data sets

Equella
  Description: Metadata database
  Scope: Stores metadata
  Interfaces/Integrations: See Metastorm, FAST and Research Web Site
  Direct Users: N/A

FAST
  Description: Search engine
  Scope: Provides search results and rich search functionality on the metadata
  Interfaces/Integrations:
    1. Potential federation to Primo
    2. Crawl of Equella
  Direct Users: Anyone

Baggit
  Description: File collection tool
  Scope: Tool to assist researchers in selecting and bringing files into a collection
  Interfaces/Integrations: Linked to from Metastorm
  Direct Users: PI

DMP Online
  Description: Online tool providing support for creating a Data Management Plan that is managed to ensure Research Council requirements are met
  Scope: Used to create Data Management Plan
  Interfaces/Integrations:
    1. Metastorm will link this within curation workflow
    2. Metastorm will take the XML output of this and read key fields directly to automate some metadata creation in Equella
    3. Metastorm will save the output file of this tool
  Direct Users: PI

DOI
  Description: Online tool for creating a unique digital object identifier
  Scope: Workflow to fork out to this system to allow researcher to create a persistent object identifier
  Interfaces/Integrations: See Metastorm

Active File Services
  Description: File services primarily for storage of active (i.e. not curated) files
  Scope: The source of files for curation (“Bagging”)
  Interfaces/Integrations: Selectable by browsing using Baggit tool

“Other Repository”
  Scope: Sometimes the source of files for curation (“Bagging”). However, these may be databases or alternative repositories that are used instead. If used, and where possible, the DOI will point to these.
  Interfaces/Integrations: Selectable by browsing using Baggit tool
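The DMP Online integration (reading key fields from the plan's XML output to pre-fill metadata) could look roughly like the following Python sketch. Everything specific here is hypothetical: the XML element names, the sample plan and the target field names are invented for illustration and do not reflect the actual DMP Online export schema, the Equella metadata schema or the Metastorm workflow.

```python
import xml.etree.ElementTree as ET

# Hypothetical plan export; the real DMP Online XML schema may differ.
SAMPLE_PLAN = """<plan>
  <title>Example project</title>
  <principal_investigator>A. Researcher</principal_investigator>
  <funder>Example Council</funder>
  <retention_years>10</retention_years>
</plan>"""

# Hypothetical mapping from plan element names to catalogue field names.
FIELD_MAP = {
    "title": "project_title",
    "principal_investigator": "pi_name",
    "funder": "funder",
    "retention_years": "retention_period_years",
}


def extract_metadata(plan_xml):
    """Read the mapped fields from a plan export, skipping absent or empty ones."""
    root = ET.fromstring(plan_xml)
    record = {}
    for element_name, field_name in FIELD_MAP.items():
        text = root.findtext(element_name)
        if text is not None and text.strip():
            record[field_name] = text.strip()
    return record


print(extract_metadata(SAMPLE_PLAN))
```

Keeping the mapping in one table makes it easy to see which catalogue fields are filled automatically and which still need manual entry by the PI.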
ADMIRe Phasing: Drop 1 (to June 2013)
Objective: Deliver key functions but without over-integration
Deliverables:
1. Instructions and links on web site on how and why to use DMP Online
2. Instructions and links on web site on how and why to use DOI
3. Implementation (but not integration) of Baggit for Research users
4. Delivery of Metadata in Equella
   - Including instructions and links on web site on how and why to use
5. Creation of Research Data Search Page
   - Implementation of FAST search crawl
   - Embed of FAST in web page
   - Delivery of Results page to include relevant information
6. Metastorm development that:
   - Creates User (PI Researcher) interface to Equella
   - Provides fields to add all metadata into Equella, including Research Project Information, Subject Specific Information, Technical Metadata
   - Allows Researcher to choose when a page is searchable
ADMIRe Phasing: Drop 2 (to Dec 2013)
Deliverables:
1. Delivery of Retention platform
   - Delivered outside of ADMIRe project
2. Delivery of Open Access Platform (subset of Retention platform)
3. Definition and delivery of end-to-end workflow automation and integration for data management process with a vision of “Input Once”
   - Integrations of Baggit, Agresso Awards Management, DMP Online, DOI
4. Definition and delivery of a report for Research Councils that:
   - Confirms project adherence (at project close) to funding requirements for data management and access
   - Enables non-conformance to be addressed
Reusable outputs
- Focus groups/interview formats
- Requirements catalogue
- Use cases
- Survey: questions, write-up etc.
- Software? No…
Questions?
tom.parsons@nottingham.ac.uk
ADMIRe Project Manager