Published by Brian Clarke. Modified over 11 years ago.
1
HATHI TRUST A Shared Digital Repository HathiTrust, Collections, and Collaboration COLD 2011 Spring Meeting Jeremy York May 20, 2011
2
Outline Overview Partnership Mission/Goals Collections Services Collaboration Mission and Goals
3
Overview
4
Current Partners: Arizona State University; Baylor University; California Digital Library; Columbia University; Cornell University; Dartmouth College; Duke University; Emory University; Harvard University Library; Indiana University; Johns Hopkins University; Library of Congress; Massachusetts Institute of Technology; Michigan State University; New York University; New York Public Library; North Carolina Central University; North Carolina State University; Northwestern University; The Ohio State University; The Pennsylvania State University; Princeton University; Purdue University; Stanford University; Texas A&M University; Universidad Complutense de Madrid; University of California (Berkeley, Davis, Irvine, Los Angeles, Merced, Riverside, San Diego, San Francisco, Santa Barbara, Santa Cruz); The University of Chicago; University of Illinois; University of Illinois at Chicago; The University of Iowa; University of Maryland; University of Michigan; University of Minnesota; The University of North Carolina at Chapel Hill; University of Pennsylvania; University of Pittsburgh; University of Utah; University of Virginia; University of Washington; University of Wisconsin-Madison; Utah State University; Yale University Library
5
Mission To contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge
6
Collections and Collaboration Comprehensive collection Preservation…with Access Shared strategies – Collection management, development – Copyright – Preservation – Efficient user services Public Good
7
Collections
8
What is in HathiTrust? 8,725,092 total volumes; 2,367,111 public domain; 4,774,782 book titles; 211,688 serial titles * As of May 20, 2011
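As a quick check on these figures, the public-domain share of the repository can be worked out from the two volume counts on this slide. The sketch below is illustrative arithmetic only; the function name is an assumption introduced here and does not come from the presentation.

```python
# Figures quoted on the slide (as of May 20, 2011).
TOTAL_VOLUMES = 8_725_092
PUBLIC_DOMAIN_VOLUMES = 2_367_111


def share_percent(part, whole):
    """Return `part` as a percentage of `whole`."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


if __name__ == "__main__":
    share = share_percent(PUBLIC_DOMAIN_VOLUMES, TOTAL_VOLUMES)
    print(f"Public domain share of all volumes: {share:.1f}%")
```

By this calculation, a little over a quarter of the volumes were in the public domain at the time of the talk.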
9
Content Sources * As of May 1, 2011
10
Content Distribution * As of May 1, 2011
11
Dates * As of May 1, 2011
12
Breakdown of HathiTrust book corpus by publication date. Source: Bibliographic Indeterminacy and the Scale of Problems and Opportunities of "Rights" in Digital Collection Building – 2/2011
13
Breakdown of HathiTrust book corpus by publication date
14
Language Distribution (1) The top 10 languages make up ~86% of all content * As of May 1, 2011
15
Language Distribution (2) The next 40 languages make up ~13% of total * As of May 1, 2011
16
Content over time * As of May 1, 2011
17
Content Growth
18
Services: Preservation, Access
19
Services (1) Ingest – Book and Journal content Google Internet Archive In-house, other vendor digitization – Images, Audio, Born digital (coming soon…) Two parts – Bibliographic Data – Content
20
Services (2) Long-term preservation – Bit-level, migration – Standard and open formats (ITU G4 TIFF, JPEG2000, JPG, Unicode) – Validation, integrity, redundancy – OAIS How reliable is it? – DRAMBORA, TRAC
21
Technology - OAIS (diagram labels): GRIN Internal Data Loading; Google Internet Archive In-house Conversion; MARC record extensions (Aleph) Rights DB; Page Turner HathiTrust API OAI GeoIP DB CNRI Handles [Solr]; METS/PREMIS object TIFF G4/JPEG2000 OCR MD5 checksums; METS object PNG OCR PDF; Isilon Site Replication TSM MD5 checksum validation; GROOVE (JHOVE)
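The diagram lists MD5 checksums stored with each object and MD5 checksum validation at the storage layer. A minimal sketch of that kind of fixity check follows, assuming a simple manifest that maps file names to recorded MD5 hex digests. The manifest shape, function names, and file names are hypothetical and are not taken from HathiTrust's actual implementation.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_manifest(manifest, root):
    """Return the names whose file is missing or whose MD5 has changed.

    `manifest` is a hypothetical dict of {relative file name: recorded MD5}.
    """
    failures = []
    for name, recorded in manifest.items():
        target = Path(root) / name
        if not target.is_file() or md5_of_file(target) != recorded.lower():
            failures.append(name)
    return failures


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        page = Path(root) / "00000001.txt"
        page.write_bytes(b"sample page text")
        # Record the checksum at "ingest" time.
        manifest = {"00000001.txt": md5_of_file(page)}
        print("failures before change:", validate_manifest(manifest, root))
        # Simulate silent corruption, then re-validate.
        page.write_bytes(b"corrupted page text")
        print("failures after change:", validate_manifest(manifest, root))
```

A check like this only detects change; recovering the object depends on the replicated copies the diagram also mentions.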
22
Quality Partner Digitization Google Digitization Quality work / Volume certification feedback@issues.hathitrust.org
23
Services (3) Preservation…with Access – As part of preservation, service to partners, and as public good – Discovery Bibliographic (temporary catalog, OCLC/HathiTrust catalog) Full-text – Reading Interface optimized for users with print disabilities – Collections
36
Descriptive headings added (hidden from GUI with CSS); info about SSD service and link to accessibility page; images used for style are in CSS, so no alt tags are needed; skip navigation link; access keys for navigating pages with keyboard; labels and descriptive titles added to forms and ToC table
42
Access Matrix
Type of work | Search – Bib and Full text | View | Full-PDF download | Print on Demand | Print disabilities | Section 108 (preservation uses)
Public domain worldwide: World World if no restrictions, Partners if restrictions World Partners worldwide N/A
Public domain in the US: World US US if no restrictions, US partners if restrictions US US Partners N/A
Open Access (+Creative Commons): World World if no restrictions World with permission Partners worldwide if no restrictions N/A
In copyright (and undetermined): World Not available Partners US and worldwide, where applicable
43
Services (4) Rights Management – Rights Database – Copyright review IMLS Grant awarded to University of Michigan 2008 to determine copyright status of books published in US between 1923 and 1963 18 staff members, 4 institutions – Indiana University – University of Michigan – University of Minnesota – University of Wisconsin 125k reviewed through CRMS 67,000 (54%) in public domain
44
Services (5) Data Availability – Tab-delimited inventory files – Bibliographic API – Data API – OAI feed of public domain – SFX target – Summon
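Tab-delimited inventory files of this kind can be read with standard tooling. The sketch below shows one way to load such a file and tally volumes by rights status. The column names, identifiers, and rights codes are placeholders invented for illustration; the real file layout should be taken from HathiTrust's own documentation.

```python
import csv
import io
from collections import Counter

# Hypothetical column layout; the actual inventory files may differ.
FIELDS = ["volume_id", "rights", "title", "pub_date"]

# Made-up sample rows in the assumed layout.
SAMPLE = (
    "vol.0001\tpd\tExample Title A\t1901\n"
    "vol.0002\tic\tExample Title B\t1975\n"
    "vol.0003\tpd\tExample Title C\t1890\n"
)


def read_inventory(stream, fields=FIELDS):
    """Yield one dict per non-empty tab-delimited row."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:
            yield dict(zip(fields, row))


def count_by_rights(records):
    """Tally records by the value in their (assumed) 'rights' column."""
    return Counter(record.get("rights", "") for record in records)


if __name__ == "__main__":
    counts = count_by_rights(read_inventory(io.StringIO(SAMPLE)))
    for code, total in sorted(counts.items()):
        print(code, total)
```

Passing an open file object instead of `io.StringIO(SAMPLE)` applies the same tally to a downloaded file, and streaming row by row keeps memory use flat for a multi-million-row inventory.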
45
Some Examples of Use Catalogs – UM loaded every record – Chicago links to public domain volumes owned in print – TROVE harvesting through OAI – OCLC loads records into OCLC Link Resolves – UC created SFX target Vendors – H.W. Wilson database links to public domain volumes – ProQuest full-text index via Summon
49
Services (6) Collaborative Development Environment – Active repository development Support for Computational Research – Datasets 120,000-volume set Google-digitized public domain – Protocol-based access – Research Center
50
How Different from Google? Preservation Content Collective work Uses of materials Own trajectory Partnership – Not just about digital content or repository – Address challenges – Fulfill mission – Provide services for our communities
51
Collaboration: Print Storage
52
A global change in the library environment June 2010 Median duplication: 31% June 2009 Median duplication: 19% Academic print book collection already substantially duplicated in mass digitized book corpus
53
Continuing growth of overlap … ARL overlap – 31% in June 2010 – 33% in Dec (adjustment: adding little-held works) – ~ 1% per 225,000 vols – 38% in May, 2011; 45% by December, 2011 Oberlin Group overlap – 41% in December, 2010 – Higher rate of overlap per added volume? – Close to 50% in May, 2011
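The rule of thumb on this slide (about 1% of overlap per 225,000 volumes) shows how much growth the projection from 38% to 45% implies. The sketch below assumes the relationship stays linear, which the slide does not state; the function names are introduced here for illustration only.

```python
# Rule of thumb from the slide: ~1 percentage point of ARL overlap
# per 225,000 volumes added to the repository.
VOLUMES_PER_POINT = 225_000


def volumes_needed(start_pct, target_pct, per_point=VOLUMES_PER_POINT):
    """Volumes required to move overlap from start_pct to target_pct,
    assuming the linear rule of thumb holds."""
    return (target_pct - start_pct) * per_point


def projected_overlap(start_pct, added_volumes, per_point=VOLUMES_PER_POINT):
    """Overlap percentage after adding volumes, under the same assumption."""
    return start_pct + added_volumes / per_point


if __name__ == "__main__":
    needed = volumes_needed(38, 45)
    print(f"Volumes implied by the 38% to 45% projection: {needed:,}")
    print(f"Overlap after adding them: {projected_overlap(38, needed):.0f}%")
```

Under that assumption, the projected rise from May to December 2011 corresponds to roughly 1.6 million additional volumes.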
54
Digitized Books in Shared Repositories ~75% of mass digitized corpus is backed up in one or more shared print repositories ~3.5M titles ~2.5M
55
Cost Model Based on overlap with print collections – Public Domain / In-copyright Print Holdings Database – Costs – Lawful uses of materials – Complete picture – Volumes institutions own or have owned OCLC number; Bib record ID; Condition; Holding Status
56
Collaboration: Copyright
57
Copyright status of books published pre-1923 and US works published 1923-1963
58
Public domain, in-copyright, and orphan works, pre-1923 and 1923-1963
59
Breakdown by US/non-US and rights status, pre-1923, 1923- 1963 and 1964-1977
60
Breakdown by US/non-US and rights status for all periods
61
Collaboration: Preservation
62
Technology - OAIS (diagram labels): GRIN Internal Data Loading; Google Internet Archive In-house Conversion; MARC record extensions (Aleph) Rights DB; Page Turner HathiTrust API OAI GeoIP DB CNRI Handles [Solr]; METS/PREMIS object TIFF G4/JPEG2000 OCR MD5 checksums; METS object PNG OCR PDF; Isilon Site Replication TSM MD5 checksum validation; GROOVE (JHOVE)
63
How to find out more Web site About section: http://www.hathitrust.org/about Twitter: http://twitter.com/hathitrust RSS: http://www.hathitrust.org/updates_rss Monthly newsletter: http://www.hathitrust.org/updates Contact us: feedback@issues.hathitrust.org Soon: Facebook, blog
64
Thank you!