digital archival storage

Digital archival storage for the University of Michigan Library collections

Project Overview: Project partnership with Google publicly announced in December 2004. Bound print collection, about 7 million volumes, to be scanned over an estimated four to six years. Direct scanning costs are borne by Google.

Project Overview: UM receives a copy of all digital files, including OCR and metadata, which we may use to build services. UM may share files with other research libraries under formal agreements. UM may not redistribute content en masse to other commercial services or the public. All uses are subject to copyright.

Project Scale: At about 320 pages per volume and 2.01 files per page, we’ll have 2.2 billion files. At about 6000 pages per GB or 54.6 MB per volume, we’ll have 380 TB of data. Production at full volume can scan about 35K volumes (1867 GB) per week, which averages to a sustained 3.16 MB per second for four years.
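
The scale figures above can be checked with a few lines of arithmetic. The sketch below is a back-of-envelope calculation, not part of the original slides; it assumes 1 GB = 1024 MB, which is the convention that reproduces the 54.6 MB per volume figure. Multiplying the stated inputs gives about 2.24 billion pages and, at 2.01 files per page, about 4.5 billion files, so the 2.2 billion figure on the slide appears to match the page count rather than the file count. Total size comes out near 373,000 GB, in line with the rounded 380 TB.

```python
# Back-of-envelope check of the slide's scale figures.
# Assumption (not stated on the slide): 1 GB = 1024 MB.

VOLUMES = 7_000_000          # bound print collection
PAGES_PER_VOLUME = 320
FILES_PER_PAGE = 2.01
PAGES_PER_GB = 6000
VOLUMES_PER_WEEK = 35_000    # full production rate
SECONDS_PER_WEEK = 7 * 24 * 3600


def scale_estimates():
    """Derive totals and sustained ingest rate from the slide's inputs."""
    total_pages = VOLUMES * PAGES_PER_VOLUME
    total_files = total_pages * FILES_PER_PAGE
    mb_per_volume = PAGES_PER_VOLUME / PAGES_PER_GB * 1024
    total_gb = total_pages / PAGES_PER_GB
    gb_per_week = VOLUMES_PER_WEEK * mb_per_volume / 1024
    mb_per_second = gb_per_week * 1024 / SECONDS_PER_WEEK
    weeks_to_finish = VOLUMES / VOLUMES_PER_WEEK
    return {
        "total_pages": total_pages,
        "total_files": total_files,
        "mb_per_volume": mb_per_volume,
        "total_gb": total_gb,
        "gb_per_week": gb_per_week,
        "mb_per_second": mb_per_second,
        "years_to_finish": weeks_to_finish / 52,
    }


if __name__ == "__main__":
    for name, value in scale_estimates().items():
        print(f"{name}: {value:,.2f}")
```

At 35K volumes per week the collection takes 200 weeks, a little under four years, which agrees with the slide's "four years" of sustained throughput.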

Not too many libraries do this!

Characteristics of the Data: Data conventions are extremely well defined; image files are TIFF or JPEG 2000, and OCR files and metadata are UTF-8 text. This is a true archival system; indefinite retention requires its own set of best practices. Files are largely static. Much material is in copyright, so security is paramount.

Application Requirements: MBooks (web server farm/NAS); periodic fixity check (checksum validation); full-text search? (how?!); textual analysis or other research? Anything beyond MBooks is likely to be compute-intensive, IO-intensive, or both. This is how you annoy storage vendors!
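
A periodic fixity check amounts to recomputing each file's checksum and comparing it with a stored value. The sketch below illustrates the idea only; the slides do not say which algorithm or manifest format the project uses, so SHA-256, the `manifest` dictionary, and the function names `file_checksum` and `fixity_check` are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks.

    SHA-256 is an assumed default; the source does not name an algorithm.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_check(manifest):
    """Compare files against expected digests.

    `manifest` is a hypothetical dict mapping file path -> expected hex
    digest. Returns a list of (path, reason) tuples, where reason is
    "missing" or "mismatch"; an empty list means every file validated.
    """
    failures = []
    for path, expected in manifest.items():
        if not Path(path).is_file():
            failures.append((str(path), "missing"))
        elif file_checksum(path) != expected:
            failures.append((str(path), "mismatch"))
    return failures
```

Because every byte of every file must be read on each pass, a check like this is IO-intensive at archive scale, which is why it appears as a storage requirement in its own right.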

Overall Approach: Engagement with Office of the Provost from the beginning; a University project housed in the Library. Our Library IT environment has unusual depth due to our mature digital library. Consulting relationship with academic computing and campus storage experts. RFI provided vendor landscape. RFP (very few Yes/No questions!).

Cost Model from RFI Responses: Model includes various ramp-up patterns, hardware replacement periods, starting cost, and rate of cost decrease. Cost per GB from selected RFI responses: average = median = $7. Ramping up too fast means the initial investment is huge, with no benefit from Moore’s Law. Too slow means simultaneous growth and replacement; costs peak at the replacement interval. Four years is plenty fast, thank you!
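
The trade-off between ramp-up speed and replacement timing can be illustrated with a toy model. This is a minimal sketch, not the project's actual cost model: it assumes linear ramp-up, a fixed annual percentage price decline, and like-for-like replacement of hardware after a fixed number of years. Only the $7 per GB starting cost comes from the slide; the function name `yearly_spend` and all other parameter values are hypothetical.

```python
def yearly_spend(total_gb, ramp_years, replace_years,
                 start_cost, annual_decline, horizon):
    """Return a list of hardware spend per year over `horizon` years.

    Assumptions (illustrative only):
      - capacity is bought in equal slices over `ramp_years`
      - price per GB falls by `annual_decline` (e.g. 0.3) each year
      - hardware bought in year y is replaced in year y + replace_years
    """
    purchases = [0.0] * horizon  # GB bought in each year
    for year in range(horizon):
        if year < ramp_years:
            purchases[year] += total_gb / ramp_years
        if year >= replace_years:
            # Replace whatever was bought replace_years ago.
            purchases[year] += purchases[year - replace_years]
    return [
        purchases[year] * start_cost * (1 - annual_decline) ** year
        for year in range(horizon)
    ]


if __name__ == "__main__":
    # Hypothetical inputs: 380,000 GB, $7/GB starting cost, 30% yearly decline.
    for ramp in (1, 4, 8):
        spend = yearly_spend(380_000, ramp, 4, 7.0, 0.30, 10)
        print(ramp, [round(s) for s in spend])
```

With a one-year ramp the whole purchase is made at the starting price; with a ramp longer than the replacement period, growth purchases and replacement purchases land in the same years, which is the overlap the slide warns about.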

Potential Funding Sources: Development of CIC shared digital repository: multiple redundant sites and some staff funded by pay-to-play model. Again, engagement with Office of the Provost from the beginning.

Considerations: “Future-proof” higher-cost investment with proven vendor and incremental upgrades? “Throwaway” lower-cost solution with cutting-edge vendor and forklift upgrade? A temporary solution (Linux NAS server and commodity SCSI/SATA arrays) has allowed the project to proceed and continues to inform the decisions we’ll make.

Best Architecture? Must have simultaneous access from potentially many front-end servers (cluster), so almost certainly a NAS component. NAS? NAS gateway to SAN? NAS/SAN hybrid? Probably most promising in the flexibility department are the clustered NAS systems with SAS or SATA back ends. Keep our options open; the right vendor could make all the difference.

Highlights of the RFP: Does not ask about compliance with exact specifications, but asks for detailed explanations of system architecture: all of the usual, and… recommended upgrade path given our estimated growth pattern and project timeline; description of how load balancing and service are impacted as system is scaled and maintained; how virtualization is implemented; security provisions. Contact me if you’d like to have a copy.

Proposal Evaluation Criteria: Scalability of capacity, performance, and interconnect fabric; proven models/methods for growth; flexibility in application; ease of maintenance.

Near-term Work: RFP responses due (Monday!); space, support, backup; work in CIC on governance and funding model for shared digital repository; continued development of MBooks functionality and integration with existing digital library resources.

Access MBooks: http://www.lib.umich.edu/mdp/ Cory Snavely, csnavely@umich.edu