Published by Sara Kelley. Modified over 11 years ago.
1
Beyond the Google Book: the Future of the Digital Library
Cory Snavely, Library IT Core Services manager, University of Michigan
April 20, 2010
2
www.hathitrust.org
HathiTrust project profile
Launched October 2008
26 member institutions and growing
99% Google-scanned materials
5.6 million volumes, 350 pages average
210 terabytes
2 US sites
3
Founding principles
Long-term digital preservation
Access to materials
Digital formats allow simultaneous preservation and access
Pioneer the concept of a universal, non-commercial, collaborative digital library
4
Preservation models
Open Archival Information System (OAIS)
– Provides formal guidelines for object storage and retrieval.
– We incorporate these principles.
Trusted Repository Audit Checklist (TRAC)
– Provides an auditing framework for assuring reliability: policy management, documentation, succession.
– Our TRAC audit was recently conducted.
Still evolving!
5
Preservation architecture
Google: multi-purpose index and advertising engine
HathiTrust: preservation-oriented service architecture
[Diagram labels: content; managed repository; API; index service]
6
Access services
Bibliographic catalog for traditional discovery
– Based on the open-source VuFind system
Full-text search
– Based on the open-source Solr system
Page turner for online reading
– Based on our older open-source digital library system
– Only public-domain books according to US copyright law
APIs for building new services
7
[Screenshot: http://babel.hathitrust.org/cgi/mb]
Search full text of this item
Save item to new or existing collection
Image, text, or PDF views
8
Funding and Governance
Major financial support from UM and Indiana University
Cost recovery based on content deposited
Executive committee
– Deans and CIOs of founding institutions, executive director
– Budget and major initiatives
Strategic advisory board
– Representatives of member institutions
– Development priorities, policy development
9
Staffing and server infrastructure
Significant developer staff contributed by UM
One of the three Core Services sysadmins is funded by HathiTrust
Basic infrastructure cost: $3.86/GB
– One 420 TB Isilon storage cluster per site (linear cost increment for adding new storage)
– Tape backup
– 2 web, 1 database, and 4 search servers at both sites
– 5 ingest and 2 index-building servers in Michigan
– Shared development environment
10
Material and Data Flow
[Diagram labels: Google or other scanning project; network or media delivery; ingest; storage @UM; sync; storage @IU; catalog; rights database; web; index]
11
Automated Data Ingest
Handles per-volume logistics
Rigorously validates identifier, object files, and object completeness
Generates a METS (XML) object inventory
Determines copyright status by date and place of publication
500K+ volumes/month!
12
Data Characteristics
1 METS (XML) file and 1 Zip archive (JPEG2000 and TIFF images and OCR text) per book
36 MB average Zip file size (compressed)
Layout uses pairtree, an IETF Internet-Draft developed at the California Digital Library
– Identifier: 39015123456789
– Path: pairtree_root/39/01/51/23/45/67/89/39015123456789
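The pairtree example on the Data Characteristics slide can be reproduced with a few lines of Python. This is a minimal sketch rather than HathiTrust's ingest code: the function name `pairtree_path` is hypothetical, and it assumes the identifier contains only characters that are safe in file names, so it skips the character-escaping step that the pairtree draft defines.

```python
def pairtree_path(identifier: str, root: str = "pairtree_root") -> str:
    """Map an object identifier to its pairtree directory path.

    Assumption: the identifier is already filesystem-safe, so the
    character cleaning described in the pairtree draft is omitted.
    """
    if not identifier:
        raise ValueError("identifier must not be empty")
    # Split the identifier into two-character segments; an odd-length
    # identifier leaves a final one-character segment.
    segments = [identifier[i:i + 2] for i in range(0, len(identifier), 2)]
    # The last directory repeats the full identifier and holds the
    # object's files (here, the METS file and the Zip archive).
    return "/".join([root, *segments, identifier])


print(pairtree_path("39015123456789"))
```

Because every directory level is derived from the identifier itself, the location of a volume can be computed without a lookup table, and no single directory accumulates millions of entries.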
13
Search infrastructure
[Diagram: web, search (nodes labeled 1 2 3 … 10 11 12), and storage, annotated with the steps below]
1. Query submission
2. Query distribution
3. Query processing
4. Results combining
5. Results display
6. Object retrieval
7. Object display
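Steps 2 to 4 of the search flow (query distribution, query processing, results combining) follow a scatter-gather pattern. The sketch below illustrates that pattern only; it is not the Solr-based implementation the slides refer to. The shards are plain in-memory dictionaries, the scoring is a simple term count, and the names `search_shard` and `distributed_search` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard: dict, query: str) -> list:
    """Step 3: score every book in one shard by occurrences of the query term."""
    term = query.lower()
    hits = []
    for book_id, text in shard.items():
        score = text.lower().split().count(term)
        if score > 0:
            hits.append((score, book_id))
    return hits


def distributed_search(shards: list, query: str, top_n: int = 10) -> list:
    """Steps 2 and 4: send the query to every shard, then merge the results."""
    # Step 2: distribute the query to all shards in parallel.
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        per_shard = list(pool.map(lambda s: search_shard(s, query), shards))
    # Step 4: combine per-shard hits, highest score first, ties broken by id.
    combined = [hit for hits in per_shard for hit in hits]
    return sorted(combined, key=lambda h: (-h[0], h[1]))[:top_n]


# Toy data standing in for three index shards (assumption for illustration).
shards = [
    {"book-a": "whale whale ship"},
    {"book-b": "the whale"},
    {"book-c": "ship and sea"},
]
print(distributed_search(shards, "whale"))
```

Each shard only ever scores its own books, so adding shards spreads the processing work, while the merge step stays cheap because it handles only the hits each shard returns.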
14
Optimizing search infrastructure
How many books/shard?
…shards/server?
…memory/server?
[Chart annotations: right size (600K books/shard); good performance (ms)]
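As a rough worked example of the books-per-shard question, the figures quoted in the slides (5.6 million volumes, 600K books per shard) can be combined to estimate a shard count. This is back-of-the-envelope arithmetic under the assumption that volumes are spread evenly across shards; the function name `shards_needed` is hypothetical, and the slides do not state the actual shard count in production.

```python
def shards_needed(total_books: int, books_per_shard: int) -> int:
    """Return the smallest number of shards that can hold total_books."""
    if books_per_shard <= 0:
        raise ValueError("books_per_shard must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return -(-total_books // books_per_shard)


# Figures taken from the slides: 5.6 million volumes, 600K books per shard.
print(shards_needed(5_600_000, 600_000))
```

At the quoted ingest rate of 500K+ volumes per month, a collection of this size would fill roughly one additional shard of that size every five to six weeks, which is why the shards-per-server and memory-per-server questions matter alongside the shard size itself.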
15
Questions?
Cory Snavely
csnavely@umich.edu
www.hathitrust.org