HATHI TRUST A Shared Digital Repository Columbia University and HathiTrust Collaboration at a new level.


1 HATHI TRUST A Shared Digital Repository Columbia University and HathiTrust Collaboration at a new level

2 Outline About HathiTrust – Mission & Goals Background What we do (services) – Objectives Governance Partnership & Resources Technology Future Directions

3 About

4 What is HathiTrust Universal Digital Library Common Goal Single Entity but Partnership of Many Libraries

5 Goals Reliable and comprehensive archive of materials converted from print…co-owned Ensure the long-term preservation of content Improve access …to meet the needs of the co-owning institutions Coordinate shared storage strategies “public good” …sustaining the historical record Simultaneously …centralized …open

6 Background

7 History Michigan Digitization Project 2004 “…U of M shall have the right to use the U of M Digital Copy, in whole or in part at U of M's sole discretion, as part of services offered in cooperation with partner research libraries such as the institutions in the Digital Library Federation…”

8 History Collective Agreement with CIC Announced in June 2007 – U of Michigan and U of Wisconsin Projects already underway

9 History In 2007, CIC agreed to establish a shared digital repository University of Michigan and Indiana University initial leaders of this effort

10 History CIC Shared Digital Repository HathiTrust

11 The Partners When announced in October 2008, partners included: – University of California system – University of Virginia – CIC (Committee on Institutional Cooperation): University of Chicago, University of Illinois, Indiana University, University of Iowa, University of Michigan, Michigan State University, University of Minnesota, Northwestern University, Ohio State University, Pennsylvania State University, Purdue University, University of Wisconsin-Madison – Columbia University

12 The Name The meaning behind the name – Hathi (hah-tee)--Hindi for elephant – Big, strong – Never forgets, wise – Secure – Trustworthy

13 Content Distribution 5,317,545 - Total 764,103 - Public Domain

14 Content Growth

15 What we do

16 Services Bit-level preservation and migration Long-term preservation Rights database Copyright review Rights management Viewing Redistribution Print disabilities Section 108 Access (within bounds of law and settlement) Collection Builder Publish virtual collections Metadata files Bib API Data API Availability of data Inbound validation Fixity checks Google ingest Temporary catalog Version 1 permanent catalog April 2010 Bibliographic search November 2009 Full-text search UM public domain UM Press Print on Demand

17 Functional Objectives Improved performance PageTurner Right now at UM only Plans to extend Access for users with print disabilities Collection Builder Publish virtual collections Identification at partner institutions Branding Temporary catalog Version 1 permanent catalog April 2010 Public discovery interface Metadata files Bib API Data API Strategies for Openness IA-digitized locally-digitized Non-Google digitized print content Full-PDF download Collection Builder Section 108 (later on) Users with print disabilities (later on) Extending services through Shibboleth Index optimization Ongoing hardware acquisition Improvements to large-scale search Audio pilot Images (maps) Non-book/non-journal content Beginning to investigate ePub as a delivery format Born-digital Isilon software June 2010 Fixity checking Including outstanding areas like disaster recovery Ongoing basis Compliance with TRAC PageTurner Advanced search Search facets Collection Builder Collaborative Development Environment Research Center Data distribution Tools such as SEASR Data mining tools

18 Governance

19 HathiTrust Executive Committee Strategic Advisory Board Budget/Finances Decision-making Policy Planning

20 Executive Committee Paul Courant, University Librarian and Dean of Libraries, UM Laine Farley, Executive Director, CDL John King, Vice Provost for Academic Information, UM Paula Kaufman, University Librarian and Dean of Libraries, UI Brian Schottlaender, University Librarian, UCSD Ed Van Gemert, Director of Libraries, UW - Madison Brenda Johnson, Dean of Libraries, IU Brad Wheeler, Chief Information Officer, IU John Wilkin, Executive Director of HathiTrust and Associate University Librarian, LIT, UM

21 Strategic Advisory Board Ed Van Gemert (Chair), Director of Libraries, UW - Madison John Butler, Associate University Librarian for Information Technology, U Minn Patricia Cruse, Director, Preservation, CDL Bernie Hurley, Director, Library Technologies, UC Berkeley R. Bruce Miller, University Librarian, UC - Merced Sarah Pritchard, University Librarian, Northwestern Paul Soderdahl, Director, LIT, U Iowa John Wilkin, Executive Director, HathiTrust (ex officio)

22 Partnership & Resources

23 Partnership & Resources (1) Funded for an initial 5 years with base funding from partners Budget – separately held within UMich budget system, managed by the Executive Committee Cost Model – Per-GB cost of storage per year, with a one-time fee on new content to build a capital fund Review in 3rd year of each 5-year period
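The cost model on this slide (a per-GB yearly storage charge plus a one-time fee on newly added content) can be written out as simple arithmetic. The sketch below is only an illustration: the function name and both rates are hypothetical placeholders, not figures from the presentation.

```python
def annual_partner_cost(stored_gb, new_gb, rate_per_gb_year, one_time_fee_per_new_gb):
    """Estimate one partner's charge for a year.

    stored_gb: total content the partner has in the repository (GB)
    new_gb: content added during this year (GB), charged the one-time fee
    Both rates are hypothetical inputs; the slide gives no dollar amounts.
    """
    if min(stored_gb, new_gb, rate_per_gb_year, one_time_fee_per_new_gb) < 0:
        raise ValueError("sizes and rates must be non-negative")
    storage_charge = stored_gb * rate_per_gb_year          # recurring, every year
    capital_fund_fee = new_gb * one_time_fee_per_new_gb    # charged once, on new content only
    return storage_charge + capital_fund_fee


if __name__ == "__main__":
    # Placeholder numbers purely to exercise the formula.
    print(annual_partner_cost(stored_gb=1000, new_gb=100,
                              rate_per_gb_year=2.0, one_time_fee_per_new_gb=0.5))
```

Separating the recurring charge from the one-time fee mirrors the slide's distinction between ongoing storage cost and contributions to the capital fund.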

24 Partnership & Resources (2) Staff/Expertise – highly integrated – Project managers, IT and communications staff, copyright experts, administrators (UM, Indiana and UC taking the lead) Working groups UM recently hired a Digital Preservation Librarian Shared development space

25 e-Commerce Print on Demand Content Ingest Transformation Validation Content Access PageTurner Collection Builder Large-scale Search Bibliographic Catalog Research Center APIs Quality Assurance Quality Review Content Certification User Services Usability User support (helpdesk) Outreach Project website Monthly newsletter Papers and presentations Communication with potential partners Surveys, general inquiries Repository evaluation and audit (e.g., DRAMBORA, TRAC) Legal Risk management (use of materials) Partner agreements Advocacy Governance Budget, Finances Decision-making Policy Planning Enterprise Management Communication and Coordination with partner institutions Project management Repository Administration Hardware configuration and maintenance Web and application server configuration and maintenance Security Permissions Logging Repository Administration Data management (content storage, backup, integrity checks, deletion) Hardware selection and replacement Content and Metadata specifications Disaster Recovery Processes for ensuring content integrity Rights Management Copyright determination Copyright review Copyright information management (database) Rightsholder permissions Bibliographic Data Management Entity description (record-level) Object identification (item-level) Data availability Collection Development Digital Expansion beyond books and journals (born-digital, images and maps, audio) Selection of content (for non-Google volume ingest and pilot projects) Print Cloud Library (effect of digital on print) Financial contributions of partners HathiTrust Functional Framework

26 Partnership & Resources (3) Toward a Cloud Library – CLIR, Mellon Foundation – OCLC Research, NYU, HathiTrust, Recap Libraries Objective: Characterize the near-term opportunity for externalizing management of academic research collections leveraging capacity of large-scale shared print and digital repositories* Outcomes: opportunity and risk assessment based on aggregate collection analysis; draft service agreement enabling generic consumer library to selectively outsource preservation and access of low-use research collections to large-scale print and digital repositories *From the RLG Partner Update January 7, 2010

27 Partnership & Resources (4) CRL TRAC Audit – Portico and HathiTrust assessments timely – “Certification will augment CRL’s strategic archiving of print, and support a responsible transition to electronic-only formats where appropriate.” – Work with UC to design shared print journal archiving effort – “With this hybrid strategy CRL hopes to enable its community to accelerate the shift to electronic-only resources in a careful and responsible manner.” * http://www.crl.edu/archiving-preservation/digital-archives/certification-and-assessment-digital-repositories

28 Partnership & Resources (5) New cost model Based on benefits to institutions – Public Domain – In-copyright Volumes “held” Covered by Settlement – Print replacement, users with print disabilities; research corpus Not – Section 108; expand via authentication

29 Partnership & Resources (6) Timeline: – Implement in 2013 – Accept new partners now with costs based on overlap calculations Requirements: – Print holdings database – Update mechanisms – Manual remediation

30 Partnership & Resources (7) Print holdings database will also benefit – De-duplication Compromises user experience, obscures collection development needs – Management of print volumes Information to withdraw volumes (journals) – Legal uses of copyright materials Section 108, 121, ADA uses will depend on knowledge of which institutions own(ed) which mate

31 Technology

32 Technology - OAIS GRIN Internal Data Loading Google [OCA] In-house Conversion MARC record extensions (Aleph) Rights DB Page Turner HathiTrust API OAI GeoIP DB CNRI Handles [Solr] METS/PREMIS object TIFF G4/JPEG2000 OCR MD5 checksums METS object PNG OCR PDF Isilon Site Replication TSM MD5 checksum validation GROOVE (JHOVE)

33 Technology – Architecture Inbound validation, standards-based object storage and related metadata Storage in Ann Arbor and Indianapolis Encrypted backup to 3rd location Rights database for rights metadata Online catalog as source and storage for descriptive metadata

34 Technology - Ingest Automatic validation in GROOVE – Check barcode check digit using Luhn algorithm – Fixity check on JPEG2000, TIFF, UTF-8 using MD5 – Well-formedness and embedded metadata check on JPEG2000, TIFF, UTF-8 using JHOVE
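The first two ingest checks named on this slide, a Luhn check-digit test on the barcode and an MD5 fixity comparison, can be sketched as follows. This is a minimal illustration and not GROOVE's code: the function names are hypothetical, the textbook Luhn rule is assumed (the slide does not say whether a variant is used), and the JHOVE well-formedness step is omitted.

```python
import hashlib


def luhn_valid(barcode: str) -> bool:
    """Return True if an all-digit barcode passes the standard Luhn check."""
    if not (barcode.isascii() and barcode.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for position, char in enumerate(reversed(barcode)):
        digit = int(char)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


def md5_of_file(path, chunk_size=1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path, expected_md5: str) -> bool:
    """Compare a file's MD5 with the checksum recorded for it."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Reading in chunks keeps memory use flat for large page images. MD5 is used here because the slide names it; it detects accidental corruption but is not a defense against deliberate tampering.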

35 Technology - Repository Simple filesystem layout – One directory per volume, zip file and METS file – Use of a namespace allows for conflicting identifiers – Namespaces for institutions and, if needed, types of identifiers within the institution
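The idea of one directory per volume, with a namespace separating identifiers that would otherwise collide, can be sketched as below. Everything concrete here is an assumption made for illustration: the dotted `namespace.local_id` form, the directory nesting, the `.zip` and `.mets.xml` file names, and the `inst_a`/`inst_b` namespaces are hypothetical and not taken from the presentation.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class VolumeId:
    namespace: str  # e.g. one per institution (hypothetical values below)
    local_id: str   # identifier as assigned by that institution

    @classmethod
    def parse(cls, text: str) -> "VolumeId":
        # Assumed textual form: "<namespace>.<local_id>"
        namespace, sep, local_id = text.partition(".")
        if not sep or not namespace or not local_id:
            raise ValueError(f"expected 'namespace.local_id', got {text!r}")
        return cls(namespace, local_id)


def volume_paths(root: str, vid: VolumeId) -> dict:
    """Return the directory, zip file and METS file paths for one volume."""
    directory = PurePosixPath(root) / vid.namespace / vid.local_id
    return {
        "dir": directory,
        "zip": directory / f"{vid.local_id}.zip",
        "mets": directory / f"{vid.local_id}.mets.xml",
    }


if __name__ == "__main__":
    # Two institutions that happen to use the same local identifier.
    a = volume_paths("/repo", VolumeId.parse("inst_a.0001"))
    b = volume_paths("/repo", VolumeId.parse("inst_b.0001"))
    print(a["dir"], b["dir"])  # distinct directories despite the shared local id
```

Because the namespace is part of the path, two partners can keep their own identifiers without a central renumbering step.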

36 Technology – METS Object Why METS? – Can serve as Archival Information Package and a Dissemination Information Package – Designed to record the relationship between pieces of complex digital objects – Can be created automatically as texts are loaded or reloaded – Preservation actions (PREMIS)

37 Technology – METS Object What’s there? – metsHdr with an ID and CREATEDATE – 2 dmdSecs: MARCXML and mdRef – amdSec containing one techMD with PREMIS metadata – fileSec with 4 fileGrps (zip, images, OCR, hOCR) – Physical structMap tying together files with metadata (pg. numbers and features)
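The sections listed on this slide can be counted in a METS file with Python's standard XML parser. The sketch below is only illustrative: the sample document is a bare skeleton invented for the example (not schema-valid, and not an actual HathiTrust METS file), and the ID, date and `USE` values in it are hypothetical.

```python
import xml.etree.ElementTree as ET

METS_NS = {"mets": "http://www.loc.gov/METS/"}

# Hypothetical skeleton mirroring the sections named on the slide.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:metsHdr ID="HDR1" CREATEDATE="2010-01-01T00:00:00"/>
  <mets:dmdSec ID="DMD1"/>
  <mets:dmdSec ID="DMD2"/>
  <mets:amdSec><mets:techMD ID="TMD1"/></mets:amdSec>
  <mets:fileSec>
    <mets:fileGrp USE="zip"/>
    <mets:fileGrp USE="images"/>
    <mets:fileGrp USE="OCR"/>
    <mets:fileGrp USE="hOCR"/>
  </mets:fileSec>
  <mets:structMap TYPE="physical"/>
</mets:mets>"""


def summarize_mets(xml_text: str) -> dict:
    """Report which top-level METS sections are present and how many."""
    root = ET.fromstring(xml_text)
    header = root.find("mets:metsHdr", METS_NS)
    return {
        "header_id": header.get("ID") if header is not None else None,
        "created": header.get("CREATEDATE") if header is not None else None,
        "dmdSec_count": len(root.findall("mets:dmdSec", METS_NS)),
        "techMD_count": len(root.findall("mets:amdSec/mets:techMD", METS_NS)),
        "file_groups": [g.get("USE") for g in root.findall("mets:fileSec/mets:fileGrp", METS_NS)],
        "structMap_types": [s.get("TYPE") for s in root.findall("mets:structMap", METS_NS)],
    }


if __name__ == "__main__":
    print(summarize_mets(SAMPLE_METS))
```

A summary like this could serve as a quick sanity check that a generated METS file has the expected shape before deeper validation.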

38 Future Directions

39 Partner Institutions Usage reporting Partner Institutions Holdings database SAB working group Quality SAB working group De-duplication SAB working group OCLC catalog SAB 3-year review CB Integration Advanced search/facets Index optimization Ongoing hardware acquisition Improvements to Large-scale Search UC and GnuBook Partner Institutions Improvements to PageTurner Research Center Data distribution Tools such as SEASR Data mining tools Wisconsin Ingest reporting University of California New bibliographic management University of California Content validation Full-PDF download Collection Builder Section 108 (later on) Users with print disabilities (later on) Extending services through Shibboleth Audio pilot Images (maps) Non-book/non-journal content Beginning to investigate ePub as a delivery format Born-digital Data API Strategies for Openness IA-digitized locally-digitized Non-Google digitized print content Isilon software June 2010 Fixity checking Including outstanding areas like disaster recovery ongoing areas Compliance with TRAC PageTurner Advanced search Search facets Collection Builder Collaborative Development Environment NSF EAGER Mellon Quality Grant projects

40 Thank You! jjyork@umich.edu http://www.hathitrust.org

