Using DSpace as a Disciplinary Data Repository

Presentation transcript:

Using DSpace as a Disciplinary Data Repository Ryan Scherle National Evolutionary Synthesis Center

NESCent is funded by NSF and jointly run by Duke, UNC, and NC State. It is one of only three synthesis centers in the US.

NESCent’s Mission: support synthetic research; develop informatics tools; increase public understanding of science; promote a culture of data sharing. Data sharing = promote open source, build a repository.

A Repository of Data Underlying Journal Articles. Repository != database. Data != publications. Journal articles => relationships with publishers; collect primary data as it is produced. Coupling of publication and data submission.

Dryad Partners: The project is led by NESCent and the UNC Metadata Research Center. Additional partners include University of New Mexico (integration w/ Knowledge Network for Biocomplexity), Yale (integration w/ TreeBASE), and NCSU (LOCKSS replication, server management). Primary funding comes from NSF, with IMLS support for one targeted project. NSF grants supporting Dryad include NSF #EF-0423641, NSF #DBI-0743720, and NSF #DBI-0753138.

Databases in Biology: GenBank, MorphBank, Morphobank, PaleoDB, Phylota, Protein Data Bank, TreeBASE, Tree of Life, AntWeb, FishBase, FlyBase, HerpNet, MaNIS, ORNIS, WormBase, ZFIN. Biology has a long history of databases, but they are specialized to store a single type of data or data about a single type of organism. There are many more; I made this list off the top of my head without really trying, and I’m not even a biologist.

The Goal: Store all data underlying publications in evolutionary biology, ecology, and related disciplines, at the time of publication. (Diagram labels: GenBank, TreeBASE, Dryad; sample sequence text: ccaattggct gttcttcgat tctggcgagt.) Archiving at publication time is the only model that has been proven to work. Let’s say I went to the nearest river and started collecting frogs. I might sequence some DNA, measure various characteristics of the frogs, and use the data to build a phylogenetic tree illustrating the relationships between the species I found. I write a paper and send it off to be published. The sequences could be deposited in GenBank, and the tree could be deposited in TreeBASE. But what if someone wants to replicate my work? Where would they find the leg measurements I made? Or the latitude/longitude of my collection sites? This is the type of data Dryad wants to capture.

(Riju et al., 2007) hdl:10255/dryad.158 This is what most people think of when they imagine data from biology: DNA sequences. Here, we have a sequence with many possible mutation points indicated.

(Payne et al., 2008) hdl:10255/dryad.222 Typical synthesis data, in spreadsheet form. The work of many researchers has been combined to investigate the size of the largest living organism at various points in history.

(Sidlauskas 2007) hdl:10255/dryad.23 Add pictures of biologists and their data (Sidlauskas, McClain). It’s very diverse, so for now we’re just archiving, and we will build more detailed access/analysis tools later. But I didn’t come to talk about data, I came to talk about DSpace.

(Taylor and Naish 2007) hdl:10255/dryad.31

(Price et al., 2004) hdl:10255/dryad.82 Other than Excel, the most common data format in evolutionary biology is Nexus, an ASCII format that can store genetic sequences and evolutionary (phylogenetic) trees.

Joint Data Archiving Policy: deposit at time of publication; repeatability; embargo; exceptions; coordination. Policy text: <<This journal>> requires, as a condition for publication, that data used in the paper should be archived in an appropriate public archive, such as <<list of approved archives here>>. The data should be given with sufficient details that, together with the contents of the paper, allow each result in the published paper to be re-created. Authors may elect to have the data publicly available at time of publication, or, if the technology of the archive allows, may opt to embargo access to the data for a period up to a year after publication. Exceptions may be granted at the discretion of the editor, especially for sensitive information such as the location of endangered species. Full list of partner journals at http://datadryad.org/partners
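The embargo clause in the policy above can be read as a simple date rule. The following is a minimal sketch of that reading, not Dryad's actual implementation. The names `release_date` and `is_public` are hypothetical, and treating "up to a year" as 365 days is an assumption, since the policy text does not define the period more precisely.

```python
from datetime import date, timedelta

# Assumption: "up to a year after publication" is capped at 365 days.
MAX_EMBARGO = timedelta(days=365)


def release_date(published: date, requested_embargo_days: int = 0) -> date:
    """Return the date on which the archived data become public.

    A request of 0 days means the data are public at publication time.
    Requests longer than the cap are shortened to the cap.
    """
    if requested_embargo_days < 0:
        raise ValueError("embargo length cannot be negative")
    embargo = min(timedelta(days=requested_embargo_days), MAX_EMBARGO)
    return published + embargo


def is_public(published: date, requested_embargo_days: int, today: date) -> bool:
    """True once the embargo (if any) has expired."""
    return today >= release_date(published, requested_embargo_days)
```

Editor-granted exceptions (for example, locations of endangered species) are deliberately left out of the sketch, because the policy leaves them to editorial discretion rather than to a rule.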

Why DSpace? Aren’t data objects usually stored in Fedora? User registration; submission system; administrative interface; search/browse system; Manakin; speed of initial implementation. Data objects in Fedora: of course more universities are starting to use DSpace for data, but as far as I know, none of these are dedicated to data. Initially, cultural change is more important than the data itself. We needed to get something up quickly while we had the community’s interest. The submission system must be easy! Unfortunately, DSpace/Manakin has a steep learning curve. Fedora would have taken longer to get to version 0.5, but perhaps the same time for 1.0. No one is more excited about DuraSpace than I am; I miss a lot of the features of Fedora!

Disciplinary repositories: don’t serve the needs of a single institution; lack a formal organization; no formal structure; the repository is the “organization”; must locate pockets of dedicated users; must integrate with other community resources. Our "audience" is researchers in the discipline, and we must implement features that appeal to them (it can be argued that the current DSpace is optimized for appealing to librarians and university administrators). Integration with other repositories is paramount: people want to be able to find things in the other repositories they use. If we're not connected to the other repositories, people will say "why should I put it in yet another database/repository?" (ref the talk on "Biology doesn't need another database").

Data repositories: customized metadata fields are required; data often lacks complete metadata; publications can provide context; data comes in a wide variety of formats; connections to other data are valuable.

We performed many minor tweaks. Hid communities and collections: the collections/communities model doesn’t fit well for an unstructured community. Modified handles: needed to add a bit of branding to handles. Expanded metadata fields: added metadata fields that were useful to the community (thinking about removing many standard fields to reduce curator confusion). Added surrounding static pages: surrounding static pages provide instructions for usage and allow additional repository functionality (Dry-ed).
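The handles cited on the data slides have the shape `hdl:10255/dryad.158`: a numeric prefix, a slash, and a local suffix. The sketch below splits such a handle and builds a link through the public Handle System proxy at hdl.handle.net. The helper names are hypothetical, and reading the `dryad.` part of the suffix as the "branding" mentioned above is an assumption based only on the handles shown in the slides.

```python
from typing import NamedTuple


class Handle(NamedTuple):
    prefix: str  # naming authority, e.g. "10255"
    suffix: str  # local name, e.g. "dryad.158"


def parse_handle(text: str) -> Handle:
    """Split 'hdl:PREFIX/SUFFIX' (the 'hdl:' scheme is optional) into its parts."""
    value = text.strip()
    if value.lower().startswith("hdl:"):
        value = value[4:]
    prefix, sep, suffix = value.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"not a handle: {text!r}")
    return Handle(prefix, suffix)


def resolver_url(handle: Handle) -> str:
    """Build a link through the public Handle System proxy."""
    return f"https://hdl.handle.net/{handle.prefix}/{handle.suffix}"


def is_branded(handle: Handle, brand: str = "dryad") -> bool:
    """Assumption: a branded handle carries 'dryad.' at the start of its suffix."""
    return handle.suffix.startswith(brand + ".")
```

For example, `resolver_url(parse_handle("hdl:10255/dryad.158"))` yields a URL ending in `10255/dryad.158`.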

Embargo

Embargo We contracted with @mire for some of this functionality.

Search modifications: search results grouping; more configurable advanced search; OpenURL/COinS support. Again, these modifications were performed by @mire.
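COinS embeds an OpenURL ContextObject in a page as an empty `span` with class `Z3988`, so that reference managers can pick up the citation. The slides do not show how the @mire modifications generate it. The sketch below is one plausible way to do it, assuming the Dublin Core key/value format; the function name `coins_span` and the choice of fields are illustrative.

```python
from html import escape
from typing import Sequence
from urllib.parse import urlencode


def coins_span(title: str, creators: Sequence[str], year: str, identifier: str) -> str:
    """Return a COinS <span> describing one repository item.

    Assumption: the item is described with the Dublin Core KEV format
    (info:ofi/fmt:kev:mtx:dc); a real deployment might use another format.
    """
    fields = [
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:dc"),
        ("rft.title", title),
        ("rft.date", year),
        ("rft.identifier", identifier),
    ]
    # Repeat the creator key once per author.
    fields += [("rft.creator", name) for name in creators]
    kev = urlencode(fields)  # percent-encodes values and joins pairs with "&"
    # The "&" separators must be HTML-escaped inside the title attribute.
    return f'<span class="Z3988" title="{escape(kev, quote=True)}"></span>'
```

The span is invisible to readers; only tools that look for the `Z3988` class make use of it.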

The default DSpace submission system is infamous. Even with the Configurable Submission System: too complex for the average user; lots of metadata fields are off-putting; doesn’t handle relationships between objects. Simplify, simplify, simplify! (We're dealing with biologists.)

(screenshot: submission summary page) Submission “from scratch” is greatly simplified from the normal DSpace. Our basic submission system collapses 7 “steps” into 3 (there are still a few steps, but they’re hidden within the 3 pages). Journal integration is even more simplified: the journal sends an acceptance letter with a link to the submission system, and all bibliographic metadata is automatically imported.

Submission system: For data entry, all they have to do is upload the file; everything else is automatic or optional. Note that authors and keywords are automatically inherited from the publication, but can be changed. Ongoing research for automated metadata. Eventually, we want the author to get an email, say “duhh… ok”, and everything else is automatic from there. This is not the ideal implementation; we will better integrate w/ DSpace in the future.
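The inheritance rule described above (a data file takes its authors and keywords from the publication unless the submitter changes them) can be sketched as follows. This is an illustration of the idea, not Dryad's code; the class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Publication:
    title: str
    authors: List[str]
    keywords: List[str]


@dataclass
class DataFile:
    filename: str
    publication: Publication
    # None means "inherit from the publication"; any list (even an empty one)
    # is treated as an explicit change made by the submitter.
    authors_override: Optional[List[str]] = None
    keywords_override: Optional[List[str]] = None

    @property
    def authors(self) -> List[str]:
        if self.authors_override is not None:
            return list(self.authors_override)
        return list(self.publication.authors)

    @property
    def keywords(self) -> List[str]:
        if self.keywords_override is not None:
            return list(self.keywords_override)
        return list(self.publication.keywords)
```

With this shape, uploading a file needs only a filename and a link to the publication, which matches the "upload the file, everything else is automatic or optional" goal.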

What’s next? OAI harvesting; versioning; authority control; ontology integration; curation interface; faceted search; replication services; tagging, annotation; integration with more journals; integration with partner repositories; more submission enhancements; data-specific analysis tools. Linking to outside sources: with a data repository, the linkages are more important. Articles have a certain type of structure, and people typically just want to read the article. With data, people want to read the article, but they also want to know about related datasets. Bending DSpace to your will is possible, but even with the latest improvements (Manakin, Configurable Submission), it’s still a serious pain. We're aware the community will move forward but will always be tangential to our needs. We’re willing to accept some extra overhead, but we can’t keep up this level of involvement forever. We’re putting a lot of hope into 2.0 and a long-term roadmap that is based more on the Fedora model. Better developer documentation is needed, especially on how to hack Manakin. Too much of what I know comes from Caveat Lector and random mailing list posts. (But I'm as guilty as anyone, because I've learned a lot that I haven't posted yet.) We want to implement new functionality so it best fits into the DSpace roadmap (@mire has been very helpful with this).

Broader Collaboration. Remember, data benefits from being remixed. Dryad is part of the D1 cooperative: integrating the suite of “investigator toolkit”; D1 will want input from the community.

To learn more… Repository: http://datadryad.org Project info: http://datadryad.org/wiki Source code: http://dryad.googlecode.com