Beyond the Google Book: the Future of the Digital Library Cory Snavely Library IT Core Services manager University of Michigan April 20, 2010
HathiTrust project profile Launched October member institutions and growing 99% Google-scanned materials 5.6 million volumes, 350 pages average 210 terabytes 2 US sites
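As a rough sanity check on the profile figures, the arithmetic below derives total page count and average storage per volume. It assumes decimal units (1 TB = 10^12 bytes) and that the 210 terabytes describes a single copy of the collection; neither assumption is stated on the slide.

```python
# Figures quoted on the project profile slide.
VOLUMES = 5_600_000
AVG_PAGES_PER_VOLUME = 350
TOTAL_BYTES = 210 * 10**12  # assumption: decimal terabytes, one copy of the data

total_pages = VOLUMES * AVG_PAGES_PER_VOLUME
avg_bytes_per_volume = TOTAL_BYTES // VOLUMES
avg_bytes_per_page = avg_bytes_per_volume / AVG_PAGES_PER_VOLUME

print(f"total pages:      {total_pages:,}")
print(f"bytes per volume: {avg_bytes_per_volume:,}")
print(f"bytes per page:   {avg_bytes_per_page:,.0f}")
```

Under these assumptions the per-volume average comes to 37.5 MB, the same order as the 36MB average Zip size quoted on the Data Characteristics slide.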
Founding principles Long-term digital preservation Access to materials Digital formats allow simultaneous preservation and access Pioneer the concept of a universal, non-commercial, collaborative digital library
Preservation models Open Archival Information System (OAIS) – Provides formal guidelines for object storage and retrieval. – We incorporate these principles. Trusted Repository Audit Checklist (TRAC) – Provides auditing framework for assuring reliability: policy management, documentation, succession. – Our TRAC audit recently conducted. Still evolving!
Preservation architecture content Google: multi-purpose index and advertising engine HathiTrust: preservation-oriented service architecture managed repository API index service
Access services Bibliographic catalog for traditional discovery – Based on open-source VuFind system Full-text search – Based on open-source Solr system Page turner for online reading – Based on our older open-source digital library system – Only public-domain books according to US copyright law APIs for building new services
Search full-text of this item Save item to new or existing collection Image, text, or pdf views
Funding and Governance Major financial support from UM and Indiana University Cost-recovery based on content deposited Executive committee – deans and CIOs of founding institutions, executive director – budget and major initiatives Strategic advisory board – Representatives of member institutions – development priorities, policy development
Staffing and server infrastructure Significant developer staff contributed by UM One sysadmin of three in Core Services funded by HT Basic infrastructure cost: $3.86/GB – 1 420TB Isilon storage cluster per site Linear cost increment for adding new storage – Tape backup – 2 web, 1 database, 4 search servers in both sites – 5 ingest, 2 index-building servers in Michigan – Shared development environment
Material and Data Flow ingest web sync Google or other scanning project network or media delivery catalog rights database web index
Automated Data Ingest Handles per-volume logistics Rigorously validate identifier, object files, object completeness Generate METS (XML) object inventory Determine copyright by date and place of publication 500K+ volumes/month!
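The "generate METS object inventory" step can be illustrated with a minimal sketch. This is not the HathiTrust ingest code: the function name `build_inventory`, the file ID scheme, and the choice of MD5 checksums are assumptions made for illustration, and the output is only a bare `fileSec` fragment, not a schema-valid METS document (which would also need a `structMap`).

```python
import hashlib
import io
import zipfile
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_inventory(zip_bytes: bytes) -> ET.Element:
    """Hypothetical helper: list every member of a volume's Zip archive
    in a METS fileSec, recording its size and MD5 checksum."""
    root = ET.Element(f"{{{METS_NS}}}mets")
    file_sec = ET.SubElement(root, f"{{{METS_NS}}}fileSec")
    group = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp")
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for number, name in enumerate(sorted(archive.namelist()), start=1):
            data = archive.read(name)
            file_el = ET.SubElement(group, f"{{{METS_NS}}}file", {
                "ID": f"FILE{number:08d}",  # assumed ID scheme
                "SIZE": str(len(data)),
                "CHECKSUM": hashlib.md5(data).hexdigest(),
                "CHECKSUMTYPE": "MD5",
            })
            ET.SubElement(file_el, f"{{{METS_NS}}}FLocat", {
                "LOCTYPE": "OTHER",
                "OTHERLOCTYPE": "SYSTEM",
                f"{{{XLINK_NS}}}href": name,
            })
    return root


# Usage: build a tiny in-memory archive and print its inventory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("00000001.jp2", b"\x00\x01")
    archive.writestr("00000001.txt", "page one")
inventory = build_inventory(buffer.getvalue())
print(ET.tostring(inventory, encoding="unicode"))
```

Recording a checksum per file at ingest is what lets later validation passes detect silent corruption by recomputing and comparing.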
Data Characteristics 1 METS (XML) and 1 Zip archive (JPEG2000 and TIFF images and OCR text) per book 36MB average Zip file size (compressed) Layout uses pairtree, an IETF draft RFC developed at California Digital Library pairtree_root/39/01/51/23/45/67/89/
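The pairtree layout maps an identifier to a directory path by splitting it into two-character segments. The sketch below covers only the simple case of plain ASCII alphanumeric identifiers; the full Pairtree draft also defines character escaping, which is omitted here. The function name `pairtree_path` and the sample identifier (reconstructed from the path on the slide) are assumptions for illustration.

```python
def pairtree_path(identifier: str, root: str = "pairtree_root") -> str:
    """Map an identifier to a pairtree directory path.

    Simplified: rejects anything that would need the escaping rules
    defined in the Pairtree draft.
    """
    if not (identifier.isascii() and identifier.isalnum()):
        raise ValueError("this sketch handles ASCII alphanumeric identifiers only")
    # Split into two-character segments; an odd-length identifier
    # leaves a final one-character segment.
    segments = [identifier[i:i + 2] for i in range(0, len(identifier), 2)]
    return "/".join([root, *segments]) + "/"


print(pairtree_path("39015123456789"))
```

Because the path is derived from the identifier alone, any process can locate a volume's files without consulting a database, and no single directory accumulates millions of entries.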
Search infrastructure (diagram: web, search, and storage tiers) 1. Query submission 2. Query distribution 3. Query processing 4. Results combining 5. Results display 6. Object retrieval 7. Object display
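Steps 2 through 4 (query distribution, processing, and results combining) follow a scatter-gather pattern, sketched below with toy in-memory shards. This is not how Solr implements distributed search; `make_shard`, `distributed_search`, and the term-count scoring are hypothetical stand-ins, and the sketch assumes scores are directly comparable across shards.

```python
import heapq
from itertools import islice


def make_shard(docs):
    """Return a toy shard: a function that ranks its own documents for a query."""
    def search(query, k):
        term = query.lower()
        hits = [(text.lower().split().count(term), vol_id)
                for vol_id, text in docs.items()]
        hits = [hit for hit in hits if hit[0] > 0]
        hits.sort(key=lambda hit: hit[0], reverse=True)
        return hits[:k]  # each shard returns only its local top k
    return search


def distributed_search(query, shards, k=10):
    # Query distribution and processing: every shard answers independently.
    per_shard = [shard(query, k) for shard in shards]
    # Results combining: merge the already-sorted lists, keep the global top k.
    merged = heapq.merge(*per_shard, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, k))


shard_a = make_shard({"vol1": "whale whale ship", "vol2": "ship"})
shard_b = make_shard({"vol3": "whale whale whale", "vol4": "whale"})
results = distributed_search("whale", [shard_a, shard_b], k=3)
print(results)
```

Since each shard returns its list pre-sorted, the combining step only has to merge, so its cost grows with the number of shards and k rather than with collection size.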
Optimizing search infrastructure How many books/shard? …shards/server? …memory/server? right size (600K books/shard) good performance (ms)
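Taking the 600K books/shard figure together with the 5.6 million volumes from the project profile gives a rough shard count. The helper name `shards_needed` is hypothetical, and spreading shards evenly over the 4 search servers per site is an assumption, not something the slides state.

```python
def shards_needed(volumes: int, books_per_shard: int) -> int:
    """Ceiling division: the last shard may be only partly full."""
    return -(-volumes // books_per_shard)


TOTAL_VOLUMES = 5_600_000      # from the project profile slide
BOOKS_PER_SHARD = 600_000      # "right size" from this slide
SEARCH_SERVERS_PER_SITE = 4    # from the infrastructure slide

shard_count = shards_needed(TOTAL_VOLUMES, BOOKS_PER_SHARD)
# Assumption: shards are distributed evenly across a site's search servers.
shards_per_server = shard_count / SEARCH_SERVERS_PER_SITE

print(shard_count, shards_per_server)
```

At the quoted ingest rate of 500K+ volumes/month, a new shard would fill in a little over a month, so the shard count is a moving target rather than a fixed design parameter.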
Questions? Cory Snavely