Published by Deshawn Biringer. Modified over 9 years ago.
1
digital archival storage for the University of Michigan Library collections
2
Project Overview
- Project partnership with Google publicly announced in December 2004.
- Bound print collection, about 7 million volumes, to be scanned over estimated four to six years.
- Direct scanning costs are borne by Google.
3
Project Overview
- UM receives a copy of all digital files, including OCR and metadata, which we may use to build services.
- UM may share files with other research libraries under formal agreements.
- UM may not redistribute content en masse to other commercial services or the public.
- All uses are subject to copyright.
4
Project Scale
- At about 320 pages per volume and 2.01 files per page, we’ll have 2.2 billion files.
- At about 6000 pages per GB or 54.6 MB per volume, we’ll have 380 TB of data.
- Production at full volume can scan about 35K volumes (1867 GB) per week, which averages to a sustained 3.16 MB per second for four years.
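The scale figures can be reproduced with a few lines of arithmetic. The sketch below is only a back-of-the-envelope check, not part of the original slides; the constant names are made up for illustration, and it assumes binary units (1 GB = 1024 MB), which is what makes 320 pages at 6000 pages per GB come out to 54.6 MB per volume. Under these inputs the page count is about 2.24 billion, which lines up with the slide's 2.2 billion figure, while multiplying by 2.01 files per page gives roughly 4.5 billion.

```python
# Back-of-the-envelope check of the Project Scale figures.
# Inputs are taken from the slides; the unit convention is an assumption.
VOLUMES = 7_000_000
PAGES_PER_VOLUME = 320
FILES_PER_PAGE = 2.01
PAGES_PER_GB = 6000
VOLUMES_PER_WEEK = 35_000
MB_PER_GB = 1024                    # assumption: binary units
SECONDS_PER_WEEK = 7 * 24 * 3600


def scale_estimates():
    """Return the derived scale figures as a dictionary."""
    pages = VOLUMES * PAGES_PER_VOLUME
    files = pages * FILES_PER_PAGE
    gb_per_volume = PAGES_PER_VOLUME / PAGES_PER_GB
    mb_per_volume = gb_per_volume * MB_PER_GB
    total_gb = pages / PAGES_PER_GB
    gb_per_week = VOLUMES_PER_WEEK * gb_per_volume
    mb_per_second = gb_per_week * MB_PER_GB / SECONDS_PER_WEEK
    weeks_to_finish = VOLUMES / VOLUMES_PER_WEEK
    return {
        "pages": pages,
        "files": files,
        "mb_per_volume": mb_per_volume,
        "total_gb": total_gb,
        "gb_per_week": gb_per_week,
        "mb_per_second": mb_per_second,
        "weeks_to_finish": weeks_to_finish,
    }


if __name__ == "__main__":
    for name, value in scale_estimates().items():
        print(f"{name}: {value:,.2f}")
```

At 35K volumes per week, 7 million volumes take 200 weeks, which is consistent with the four-year figure on the slide.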
5
Not too many libraries do this!
6
Characteristics of the Data
- Extremely well-defined data conventions: image files are TIFF or JPEG 2000, OCR files and metadata are UTF-8 text.
- A true archival system; indefinite retention requires its own set of best practices.
- Files are largely static.
- Much material is in-copyright (security is paramount).
7
Application Requirements
- MBooks (web server farm/NAS)
- Periodic fixity check (checksum validation)
- Full-text search? (how?!)
- Textual analysis or other research?
- Anything beyond MBooks is likely to be either compute- or IO-intensive, or both.
- This is how you annoy storage vendors!
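A periodic fixity check amounts to recomputing each file's checksum and comparing it with a stored value. The sketch below illustrates the idea only; the slides do not say which hash algorithm or manifest format is used, so SHA-256 and the dictionary-based manifest (and the function names `sha256_of` and `fixity_check`) are assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large image files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_check(manifest):
    """Compare files against expected digests.

    manifest: dict mapping file path -> expected SHA-256 hex digest
              (a hypothetical format chosen for this sketch).
    Returns a list of (path, problem) tuples; an empty list means all files passed.
    """
    failures = []
    for path, expected in manifest.items():
        file_path = Path(path)
        if not file_path.is_file():
            failures.append((str(path), "missing"))
        elif sha256_of(file_path) != expected:
            failures.append((str(path), "checksum mismatch"))
    return failures
```

Every pass reads every byte in the archive, which is one reason a check like this counts as IO-intensive at this scale.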
8
Overall Approach
- Engagement with Office of the Provost from the beginning; a University project housed in the Library
- Our Library IT environment has unusual depth due to our mature digital library.
- Consulting relationship with academic computing and campus storage experts
- RFI provided vendor landscape
- RFP (very few Yes/No questions!)
9
Cost Model from RFI Responses
- Model includes various ramp-up patterns, hardware replacement periods, starting cost, and rate of cost decrease.
- Cost per GB from selected RFI responses: average = median = $7
- Too fast means initial investment is huge, no benefit from Moore’s Law.
- Too slow means simultaneous growth and replacement, costs peak at replacement interval.
- Four years is plenty fast, thank you!
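The trade-off between ramp-up speed and replacement can be illustrated with a toy model. This is a minimal sketch, not the actual model behind the slide: it assumes capacity is bought in equal yearly increments, that hardware is replaced a fixed number of years after purchase, and that the price per GB falls by a constant fraction each year. The $7 starting cost comes from the slide; the 30% annual decline, the four-year replacement period, and the function name `annual_costs` are assumptions chosen for illustration.

```python
def annual_costs(total_gb, ramp_years, horizon_years,
                 start_cost_per_gb=7.0, annual_decline=0.30,
                 replacement_years=4):
    """Return the hardware spend for each year of the planning horizon.

    total_gb          -- final capacity, reached after ramp_years
    ramp_years        -- years over which capacity is added in equal steps
    annual_decline    -- assumed yearly drop in cost per GB (hypothetical)
    replacement_years -- assumed hardware lifetime before replacement
    """
    purchased_gb = [0.0] * horizon_years
    costs = []
    for year in range(horizon_years):
        # New capacity bought this year while the ramp-up is still running.
        growth = total_gb / ramp_years if year < ramp_years else 0.0
        # Everything bought replacement_years ago must be bought again.
        replaced = (purchased_gb[year - replacement_years]
                    if year >= replacement_years else 0.0)
        purchased_gb[year] = growth + replaced
        price = start_cost_per_gb * (1 - annual_decline) ** year
        costs.append(purchased_gb[year] * price)
    return costs


if __name__ == "__main__":
    for ramp in (1, 4, 6):
        spend = annual_costs(380_000, ramp_years=ramp, horizon_years=10)
        print(ramp, [round(c) for c in spend])
```

With a one-year ramp the whole purchase is made at the starting price; with a six-year ramp, years four and five pay for both growth and replacement, which is the overlap the slide warns about.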
10
Potential Funding Sources
- Development of CIC shared digital repository: multiple redundant sites and some staff funded by pay-to-play model
- Again, engagement with Office of the Provost from the beginning
11
Considerations
- “Future-proof” higher-cost investment with proven vendor and incremental upgrades?
- “Throwaway” lower-cost solution with cutting-edge vendor and forklift upgrade?
- Temporary solution (Linux NAS server and commodity SCSI/SATA arrays) has allowed project to proceed and further inform us on the decisions we’ll make.
12
Best Architecture?
- Must have simultaneous access from potentially many front-end servers (cluster), so almost certainly a NAS component.
- NAS? NAS gateway to SAN? NAS/SAN hybrid?
- Probably most promising in the flexibility department are the clustered NAS systems with SAS or SATA back ends.
- Keep our options open; the right vendor could make all the difference.
13
Highlights of the RFP
- Does not ask about compliance with exact specifications, but asks for detailed explanations of system architecture: all of the usual, and…
- Recommended upgrade path given our estimated growth pattern and project timeline
- Description of how load balancing and service are impacted as system is scaled and maintained
- How virtualization is implemented
- Security provisions
- Contact me if you’d like to have a copy.
14
Proposal Evaluation Criteria
- Scalability of capacity, performance, and interconnect fabric
- Proven models/methods for growth
- Flexibility in application
- Maintenance ease
15
Near-term Work
- RFP responses due (Monday!)
- Space, support, backup
- Work in CIC on governance and funding model for shared digital repository
- Continued development of MBooks functionality and integration with existing digital library resources
16
Access
- MBooks: http://www.lib.umich.edu/mdp/
- Cory Snavely: csnavely@umich.edu