Digital archival storage for the University of Michigan Library collections
Project Overview
  Project partnership with Google publicly announced in December 2004.
  Bound print collection, about 7 million volumes, to be scanned over an estimated four to six years.
  Direct scanning costs are borne by Google.
Project Overview (continued)
  UM receives a copy of all digital files, including OCR and metadata, which we may use to build services.
  UM may share files with other research libraries under formal agreements.
  UM may not redistribute content en masse to other commercial services or to the public.
  All uses are subject to copyright.
Project Scale
  At about 320 pages per volume and 2.01 files per page, we’ll have 2.2 billion files.
  At about 6000 pages per GB, or 54.6 MB per volume, we’ll have 380 TB of data.
  Production at full volume can scan about 35K volumes (1867 GB) per week, which averages to a sustained 3.16 MB per second for four years.
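The scale figures above can be re-derived with a few lines of arithmetic. The sketch below is only a back-of-the-envelope check: the input numbers come from the slide, while the function name and the use of binary units (1 GB = 1024 MB) are assumptions chosen because they reproduce the quoted 54.6 MB, 1867 GB, and 3.16 MB/s values. Note that 7 million volumes at 320 pages each gives about 2.2 billion pages; at 2.01 files per page the file count would come out closer to 4.5 billion, so the 2.2 billion figure may refer to pages.

```python
# Back-of-the-envelope check of the project scale figures.
# Inputs are taken from the slide; binary units (1 GB = 1024 MB) are an
# assumption that happens to reproduce the quoted numbers.

VOLUMES = 7_000_000
PAGES_PER_VOLUME = 320
FILES_PER_PAGE = 2.01
PAGES_PER_GB = 6000
VOLUMES_PER_WEEK = 35_000
SECONDS_PER_WEEK = 7 * 24 * 3600


def scale_estimates():
    """Return the derived scale figures as a dictionary."""
    pages = VOLUMES * PAGES_PER_VOLUME
    files = pages * FILES_PER_PAGE
    mb_per_volume = PAGES_PER_VOLUME / PAGES_PER_GB * 1024
    total_gb = pages / PAGES_PER_GB
    weekly_gb = VOLUMES_PER_WEEK * mb_per_volume / 1024
    mb_per_second = weekly_gb * 1024 / SECONDS_PER_WEEK
    years = VOLUMES / (VOLUMES_PER_WEEK * 52)
    return {
        "pages": pages,
        "files": files,
        "mb_per_volume": mb_per_volume,
        "total_gb": total_gb,
        "weekly_gb": weekly_gb,
        "mb_per_second": mb_per_second,
        "years": years,
    }


if __name__ == "__main__":
    for name, value in scale_estimates().items():
        print(f"{name}: {value:,.2f}")
```

The total comes out between roughly 365 and 380 TB depending on whether binary or decimal units are used at each step, which is consistent with the rounded 380 TB estimate.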
Not too many libraries do this!
Characteristics of the Data
  Extremely well-defined data conventions: image files are TIFF or JPEG 2000; OCR files and metadata are UTF-8 text.
  A true archival system; indefinite retention requires its own set of best practices.
  Files are largely static.
  Much material is in-copyright (security is paramount).
Application Requirements
  MBooks (web server farm/NAS)
  Periodic fixity check (checksum validation)
  Full-text search? (how?!)
  Textual analysis or other research?
  Anything beyond MBooks is likely to be compute-intensive, IO-intensive, or both.
  This is how you annoy storage vendors!
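A periodic fixity check amounts to recomputing each file's checksum and comparing it with a stored value. The following is a minimal sketch, not a description of the project's actual tooling: the choice of SHA-256, the in-memory manifest format (path mapped to expected hex digest), and the function names `sha256_of` and `check_fixity` are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_fixity(manifest):
    """Compare files against expected digests.

    `manifest` is a hypothetical mapping of file path -> expected hex digest.
    Returns a list of (path, problem) tuples; an empty list means all files
    matched.
    """
    failures = []
    for path, expected in manifest.items():
        if not Path(path).is_file():
            failures.append((path, "missing"))
        elif sha256_of(path) != expected:
            failures.append((path, "checksum mismatch"))
    return failures
```

Because every byte of every file must be read, a full pass over hundreds of terabytes is IO-intensive, which is why this requirement matters when evaluating storage.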
Overall Approach
  Engagement with Office of the Provost from the beginning; a University project housed in the Library
  Our Library IT environment has unusual depth due to our mature digital library.
  Consulting relationship with academic computing and campus storage experts
  RFI provided vendor landscape
  RFP (very few Yes/No questions!)
Cost Model from RFI Responses
  Model includes various ramp-up patterns, hardware replacement periods, starting cost, and rate of cost decrease.
  Cost per GB from selected RFI responses: average = median = $7
  Ramping up too fast means the initial investment is huge, with no benefit from Moore’s Law.
  Ramping up too slowly means simultaneous growth and replacement; costs peak at the replacement interval.
  Four years is plenty fast, thank you!
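A cost model with these four inputs can be sketched in a few lines. This is an illustrative toy, not the model used for the project: only the $7 per GB starting cost comes from the slide, while the linear ramp-up, the 30% annual price decline, the 12-year horizon, and the function name `yearly_costs` are assumptions.

```python
def yearly_costs(total_gb, ramp_years, replace_after,
                 start_cost_per_gb=7.0, annual_decline=0.3, horizon=12):
    """Return a list of hardware spend per year.

    total_gb       -- capacity to reach by the end of the ramp-up
    ramp_years     -- years over which capacity is added (linear ramp assumed)
    replace_after  -- hardware lifetime in years before it is repurchased
    annual_decline -- assumed fractional drop in cost per GB each year
    """
    purchases = {}  # year -> GB bought in that year (growth plus replacement)
    costs = []
    for year in range(horizon):
        growth = total_gb / ramp_years if year < ramp_years else 0.0
        # Hardware bought `replace_after` years ago must be bought again.
        replacement = purchases.get(year - replace_after, 0.0)
        bought = growth + replacement
        purchases[year] = bought
        price = start_cost_per_gb * (1 - annual_decline) ** year
        costs.append(bought * price)
    return costs


if __name__ == "__main__":
    for ramp in (1, 4, 8):
        costs = yearly_costs(380_000, ramp_years=ramp, replace_after=4)
        print(ramp, [round(c) for c in costs])
```

Under these assumptions, a one-year ramp concentrates the whole purchase at the starting price, while an eight-year ramp with a four-year replacement period shows spending rise again in the year growth and replacement first overlap.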
Potential Funding Sources
  Development of CIC shared digital repository: multiple redundant sites and some staff funded by pay-to-play model
  Again, engagement with Office of the Provost from the beginning
Considerations
  “Future-proof” higher-cost investment with proven vendor and incremental upgrades?
  “Throwaway” lower-cost solution with cutting-edge vendor and forklift upgrade?
  A temporary solution (Linux NAS server and commodity SCSI/SATA arrays) has allowed the project to proceed and will further inform the decisions we make.
Best Architecture?
  Must have simultaneous access from potentially many front-end servers (cluster), so almost certainly a NAS component.
  NAS? NAS gateway to SAN? NAS/SAN hybrid?
  Probably the most promising in terms of flexibility are the clustered NAS systems with SAS or SATA back ends.
  Keep our options open; the right vendor could make all the difference.
Highlights of the RFP
  Does not ask about compliance with exact specifications, but asks for detailed explanations of system architecture: all of the usual, and…
    Recommended upgrade path given our estimated growth pattern and project timeline
    Description of how load balancing and service are impacted as system is scaled and maintained
    How virtualization is implemented
    Security provisions
  Contact me if you’d like to have a copy.
Proposal Evaluation Criteria
  Scalability of capacity, performance, and interconnect fabric
  Proven models/methods for growth
  Flexibility in application
  Maintenance ease
Near-term Work
  RFP responses due (Monday!)
  Space, support, backup
  Work in CIC on governance and funding model for shared digital repository
  Continued development of MBooks functionality and integration with existing digital library resources
Access MBooks: http://www.lib.umich.edu/mdp/
Cory Snavely, csnavely@umich.edu