Update at CNI, April 14, 2015 Chip German, APTrust Andrew Diamond, APTrust Nathan Tallman, University of Cincinnati Linda Newman, University of Cincinnati.

Presentation transcript:

Update at CNI, April 14, 2015. Chip German, APTrust; Andrew Diamond, APTrust; Nathan Tallman, University of Cincinnati; Linda Newman, University of Cincinnati; Jamie Little, University of Miami. APTrust Web Site: aptrust.org

Sustaining Members: Columbia University, Indiana University, Johns Hopkins University, North Carolina State University, Penn State, Syracuse University, University of Chicago, University of Cincinnati, University of Connecticut, University of Maryland, University of Miami, University of Michigan, University of North Carolina, University of Notre Dame, University of Virginia, Virginia Tech

Quick description of APTrust: The Academic Preservation Trust (APTrust) is a consortium of higher education institutions committed to providing both a preservation repository for digital content and collaboratively developed services related to that content. The APTrust repository accepts digital materials in all formats from member institutions, and provides redundant storage in the cloud. It is managed and operated by the University of Virginia and is both a deposit location and a replicating node for the Digital Preservation Network (DPN). 39 of the registrants for CNI in Seattle are from APTrust sustaining-member institutions

APTrust: How it works. Andrew Diamond, Senior Preservation & Systems Engineer

Behind the Scenes, Step by Step

What You See, Step by Step

Validation Pending

Validation Succeeded (Detail)

Validation Failed (Detail)

Storage Pending

Record Pending

Record Succeeded

List of Ingested Objects

Object Detail

List of Files

File Detail

Pending Tasks: Restore & Delete

Completed Tasks: Restore & Delete

Detail of Completed Restore

Detail of Completed Delete

A Note on Bag Storage & Restoration

Bag Storage

Bag Restoration

Where’s my data?

●S3: On multiple disks in Northern Virginia. ●Glacier: On hard disk, tape or optical disk in Oregon.

S3 = Simple Storage Service ●% Durability (9 nines) ●Each file stored on multiple devices in multiple data centers ●Fixity check on transfer ●Ongoing data integrity checks & self-healing

Glacier ●Long-term storage with % durability (11 nines) ●Not intended for frequent access ●Takes five hours or so to restore a file

Chances of losing an archived object? ●S3: 1 in 100,000,000,000 ●Glacier: 1 in 10,000,000,000,000 ●Both, to separate events: 1 in 1,000,000,000,000,000,000,000,000 ●Very slim
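The combined figure follows from multiplying the two quoted odds, assuming the S3 and Glacier losses are independent ("separate events"). A minimal check of that arithmetic, treating the slide's odds as per-object loss probabilities (an interpretation, not something the slides state outright):

```python
from fractions import Fraction

# Odds quoted on the slide, read here as per-object loss probabilities.
P_LOSE_S3 = Fraction(1, 100_000_000_000)          # 1 in 10^11
P_LOSE_GLACIER = Fraction(1, 10_000_000_000_000)  # 1 in 10^13


def p_lose_both(p_a: Fraction, p_b: Fraction) -> Fraction:
    """Probability that two independent loss events both occur."""
    return p_a * p_b


combined = p_lose_both(P_LOSE_S3, P_LOSE_GLACIER)
print(f"1 in {combined.denominator:,}")
# prints "1 in 1,000,000,000,000,000,000,000,000"
```

The independence assumption carries the result: a failure that affects both copies at once, such as the unpaid AWS bill mentioned later in the talk, is not covered by this product.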

Protecting Against Failure During Ingest ●Microservices with state ●Resume intelligently after failure
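One way to picture "microservices with state" is a pipeline that checkpoints after each step and skips finished steps on restart. The sketch below is illustrative only and is not APTrust's code: the step names, the JSON state file, and the `run_ingest` function are all assumptions made for the example.

```python
import json
from pathlib import Path

# Hypothetical step names, loosely following the validation/storage/record stages.
STEPS = ["validate", "store", "record"]


def run_ingest(bag_name, state_file, handlers):
    """Run each step once, persisting progress so a rerun skips finished steps."""
    path = Path(state_file)
    state = json.loads(path.read_text()) if path.exists() else {}
    done = state.setdefault(bag_name, [])
    for step in STEPS:
        if step in done:
            continue  # finished before the failure, so do not repeat it
        handlers[step](bag_name)  # may raise; the last checkpoint stays on disk
        done.append(step)
        path.write_text(json.dumps(state))  # checkpoint after every step
    return done
```

If `handlers["store"]` raises partway through, calling `run_ingest` again with the same state file resumes at the storage step rather than validating the bag a second time.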

Protecting Against Failure After Ingest ●Objects & Files ●Metadata & Events

What if...

What if ingest services stop? They restart themselves and pick up where they left off.

What if ingest server dies? We spin up a new one in a few hours.

What if Fedora service stops? We restart it manually.

What if Fedora server dies? We can build a new one from the data on the hot backup in Oregon.

What if we lose all of Glacier in Oregon? We have all files in S3 in Virginia. Ingest services and Fedora are not affected.

What if we lose the whole Virginia data center? ●Ingest Services: We can rebuild services from code and config files in hours. ●Files and Objects (S3): We can restore all files from Glacier in Oregon. ●Metadata and Events (Fedora): We can restore all metadata and events from the Fedora backup in Oregon.

Metadata stored with files

Attached metadata for last-ditch recovery

To lose everything... ●A major catastrophic event in Virginia and Oregon, or... ●Don’t pay the AWS bill

Partner Tools: Bag Management ●Upload ●Download ●List ●Delete ...or use open-source AWS libraries for Java, Python, Ruby, etc.
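As a rough illustration of the "open-source AWS libraries" route, the sketch below uploads a bag file through an S3 client object. It assumes the client is a boto3 S3 client (whose `upload_file(Filename, Bucket, Key)` method is used here); the bucket and file names are placeholders, and `upload_bag` is a hypothetical helper, not one of the partner tools.

```python
import os


def upload_bag(s3_client, bucket: str, bag_path: str) -> str:
    """Upload a bag file to a bucket, using the file name as the object key."""
    key = os.path.basename(bag_path)
    s3_client.upload_file(bag_path, bucket, key)  # boto3 S3 client method
    return key


# Usage with boto3 (not run here; bucket and file names are placeholders):
#   import boto3
#   upload_bag(boto3.client("s3"), "example-receiving-bucket", "example_bag.tar")
```

Passing the client in as a parameter keeps credentials and region configuration outside the function and makes it easy to exercise with a stand-in client.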

Partner Tools Bag Management

Partner Tools: Bag Validation ●Validate a bag before you upload it. ●Get specific details about what to fix.
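To give a sense of what pre-upload validation involves, here is a minimal sketch that checks payload files against a BagIt payload manifest (`manifest-<algorithm>.txt`, one "checksum path" pair per line) and reports specific problems. It is not the APTrust partner tool and covers only the manifest check; tag files, `bagit.txt`, and any APTrust-specific rules are out of scope. The function name is hypothetical.

```python
import hashlib
from pathlib import Path


def validate_manifest(bag_dir, algorithm="md5"):
    """Return a list of problems found by checking files against the manifest."""
    bag = Path(bag_dir)
    manifest = bag / f"manifest-{algorithm}.txt"
    if not manifest.is_file():
        return [f"missing {manifest.name}"]
    problems = []
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        target = bag / rel_path
        if not target.is_file():
            problems.append(f"{rel_path}: listed in manifest but not found")
            continue
        digest = hashlib.new(algorithm)
        with target.open("rb") as fh:
            # Read in 1 MiB chunks so large payload files are not loaded whole.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected.lower():
            problems.append(f"{rel_path}: checksum mismatch")
    return problems
```

An empty list means every file named in the manifest exists and matches its checksum; otherwise each entry names the file and what to fix.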

Partner Tools Bag Validation

Partner Tools Bag Management & Validation Download from the APTrust wiki

APTrust Repository Test Site Wiki Help

Learning to Run University of Cincinnati Libraries path to contributing content to APTrust

Digital Collection Environment ●3 digital repositories o DSpace -- UC Digital Resource Commons o LUNA -- UC Libraries Image and Media Collections o Hydra -- Institutional Repository ●1 website ●1 Storage Area Network file system

Workflow Considerations ●Preservation vs. Recovery o platform-specific bags or platform-agnostic bags? o derivative files ●Bagging level o record level or collection level?

UC Libraries Workflow Choices ●Recovery is as important as preservation o Include platform-specific files and derivatives ●Track platform provenance o use bag naming convention or additional fields in bag-info.txt ●Track collection membership o use bag naming convention or additional fields in bag-info.txt
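The "additional fields in bag-info.txt" option could look something like the sketch below, which renders tag fields in the "Label: value" layout that bag-info.txt uses. `Source-Organization` is a standard BagIt element; `Source-Platform` and `Collection-Name` are hypothetical field names chosen for illustration, not the fields UC Libraries actually adopted.

```python
def format_bag_info(fields):
    """Render tag fields as 'Label: value' lines, the bag-info.txt layout."""
    return "".join(f"{label}: {value}\n" for label, value in fields.items())


info = format_bag_info({
    "Source-Organization": "University of Cincinnati",
    "Source-Platform": "DSpace",  # hypothetical field for platform provenance
    "Collection-Name": "Cincinnati Subway and Street Improvements",  # hypothetical
})
print(info, end="")
```

With bagit-python, a dictionary like this one can be passed as the `bag_info` argument of `bagit.make_bag`, which writes the fields to bag-info.txt when the bag is created.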

Bagging (crawling) 1st iteration ●Start with DSpace, Cincinnati Subway and Street Improvements collection ●Replication Task Suite to produce Bags o one bag per record ●Bash scripts to transform to APTrust Bags ●Bash scripts to upload to test repository ●Worked well, but… Scripts available at

Bagging (walking) 2nd iteration ●Same starting place ●Export collection as Simple Archive Format o ~ 8,800 records, ~ 375 GB ●Create bags using bagit-python o 250 GB is bag-size upper limit

Bagging (walking) 2nd iteration (continued) ●Collection had three series o two series under 250 GB ▪ each series was put into a single bag o one series over 250 GB ▪ used a multi-part bag to send to APTrust; APTrust merges split bags into one ●Bash script to upload bags to production repository
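Splitting a series that exceeds the bag-size limit into parts can be sketched as a simple greedy grouping by file size. This is an assumption about how one might plan the parts, not the script UC Libraries used; the function name and the example sizes are illustrative.

```python
def split_into_parts(file_sizes, limit_bytes):
    """Greedily group (name, size) pairs into parts that stay within limit_bytes."""
    parts, current, current_size = [], [], 0
    for name, size in file_sizes:
        if size > limit_bytes:
            raise ValueError(f"{name} alone exceeds the limit")
        if current and current_size + size > limit_bytes:
            parts.append(current)  # close the current part and start a new one
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        parts.append(current)
    return parts


GB = 10**9
files = [("a.tif", 100 * GB), ("b.tif", 100 * GB), ("c.tif", 100 * GB)]
print(split_into_parts(files, 250 * GB))
# prints [['a.tif', 'b.tif'], ['c.tif']]
```

Files are kept in their original order, so the grouping is predictable but not necessarily the tightest packing possible.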

Bagging (running) 3rd iteration ●Testing of tools created by Jamie Little of University of Miami ●Jamie can tell you more but…

Long-Term Strategy (marathons) ●Bag for preservation and recovery ●For most content o one bag per collection ▪ except for o manual curation and bagging ▪ except for ●Use bag-info.txt to capture platform provenance and collection membership

at

UM Digital Collections Environment ●File system based long-term storage ●CONTENTdm for public access

Creating a workflow ●Began at the command line working with LOC’s BagIt Java Library ●Bash scripts ●Our initial testing used technical criteria (size, structural complexity, file types)

What to put in the bag? ●CONTENTdm metadata ●FITS technical metadata ●We maintained our file system structure in the bags

Beginning APTrust Testing ●Began uploading whole collections as bags ●Early attempts at a user interface using Zenity

UM’s APTrust Automation ●Sinatra based server application ●HTML & JavaScript interface

Starting Screen

Choosing a Repository ID

Choosing Files

Viewing Download Results

Final Upload

Bagging Strategies ●Moved from a bag-per-collection strategy to bag-per-file ●Changed interface to allow selection of individual files

APTrust Benefits for Developers ●Wide range of open source tools available ●Uploading to APTrust is no different than uploading to S3

Where to find us github.com/UMiamiLibraries

Why are we doing this again? ●Libraries’ and Archives’ historic commitment to long-term preservation is the raison d'être, a core justification, for our support of repositories. It’s why our stakeholders trust us and will place their content in our repositories. ●We expect APTrust to be our long-term preservation repository to meet our responsibilities to our stakeholders.

Why are we doing this again? ●We know from some experience, and reports from our peers, that backup procedures, while still essential, are fraught with holes. ●We expect APTrust to be our failover if any near-term restore from backup proves incomplete.

Linda Newman, Head, Digital Collections and Repositories. Nathan Tallman, Digital Content Strategist. UC Libraries, Digital Collections and Repositories.