HATHITRUST A Shared Digital Repository The HathiTrust Digital Repository: Under the hood SI 625 April 20, 2015 Jeremy York, Assistant Director, HathiTrust.

HATHITRUST A Shared Digital Repository The HathiTrust Digital Repository: Under the hood SI 625 April 20, 2015 Jeremy York, Assistant Director, HathiTrust Unless otherwise noted, these slides and their contents are licensed under a Creative Commons Attribution Unported License.

Outline Introduction Underlying Ideas Repository and Services

Introduction

HathiTrust Members: Allegheny College; American University of Beirut; Arizona State University; Auburn University; Baylor University; Boston College; Boston University; Brandeis University; Brown University; California Digital Library; Carnegie Mellon University; Case Western Reserve; Colby College; Columbia University; Cornell University; Dartmouth College; Duke University; Emory University; Getty Research Institute; Georgetown University; Georgia Tech; Harvard University Library; Indiana University; Iowa State University; Johns Hopkins University; Kansas State University; Lafayette College; Library of Congress; Massachusetts Institute of Technology; McGill University; Michigan State University; Montana State University; Mount Holyoke College; New York Public Library; New York University; North Carolina Central University; North Carolina State University; Northeastern University; Northwestern University; The Ohio State University; Oklahoma State University; Penn State; Princeton University; Purdue University; Rutgers University; Stanford University; State University System of Florida; Swarthmore College; Syracuse University; Temple University; Texas A&M University; Texas Tech; Tufts University; Universidad Complutense de Madrid; University of Alabama; University of Alberta; University of Arizona; University of British Columbia; University of Calgary; University of California (Berkeley, Davis, Irvine, Los Angeles, Merced, Riverside, San Diego, San Francisco, Santa Barbara, Santa Cruz); The University of Chicago; University of Connecticut; University of Delaware; University of Houston; University of Illinois; University of Illinois at Chicago; The University of Iowa; University of Kansas; University of Maine; University of Maryland; University of Massachusetts, Amherst; University of Miami; University of Michigan; University of Minnesota; University of Missouri; University of Nebraska-Lincoln; University of New Mexico; The University of North Carolina at Chapel Hill; University of Notre Dame; University of Oklahoma; University of Pennsylvania; University of Pittsburgh; University of Queensland; University of Tennessee, Knoxville; University of Texas; University of Utah; University of Vermont; University of Virginia; University of Washington; University of Wisconsin-Madison; Utah State University; Vanderbilt University; Virginia Tech; Wake Forest University; Washington University; Yale University Library

Digital Repository Launched 2008 Initial focus on digitized book and journal content – 13.3 million total volumes – 6.7 million book titles – 350,000 serial titles – 5 million public domain (~38%)

The Name The meaning behind the name – Hathi (hah-tee)--Hindi for elephant – Big, strong – Never forgets, wise – Secure – Trustworthy

Mission To contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge

Universal Library Common Goal Single Entity, Many Partners HathiTrust

Collections and Collaboration Comprehensive collection – Preservation…with Access Shared strategies – Copyright – Collection management, development – Preservation – Discovery / Use – Bibliographic Indeterminacy – Efficient user services Public Good

Content

1. Michigan 4,712, California 3,612, Harvard 838, Wisconsin 561, Indiana 529, Cornell 510, Penn State 388, Illinois 329, NYPL 294, Princeton 252, Minnesota 193, Madrid 117, Library of Congress 108, Keio University 90,112

Dates

Language Distribution (1) The top 10 languages make up ~87% of all content

Language Distribution (2) The next 40 languages make up ~12% of total

HathiTrust and other e-databases

Content Distribution

Underlying Ideas

Underlying ideas Community Scale Access and Preservation Openness

Community

OAIS TRAC METS and PREMIS Repository Practices – Content – Reference – Fixity

Scale Mission – To contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge Strategy – “Co-owned and managed”

Preservation and Access We engage in preservation for purposes of access “Light” archive benefits – Access to materials – Checks on integrity – Best chance for content to be used and valued, preserved

Openness Repository centralized...open Formats Software Organizational structure

Experience

What’s Missing? What should be included in the AIP? What should be validated? How should content be identified? How to operate at scale – managing preservation information (PREMIS; access information in rational way at scale)...

Repository Philosophy/Design OAIS/TRAC Consistency Standardization Simplicity (in design, not function) Practicality Sustainability

Source Bibliographic Data Content Package Michigan Indiana Bib Data Data Management Rights Data Storage Access Ingest Catalog Full-text Search PageTurner APIs Collections Holdings Data Datasets TDR

Building the Digital Repository Shared infrastructure – Centralized Administration: Ingest, validation, content integrity Functionality: full-text search, viewing print on demand – Geographically distributed In terms of location, coding, service development, digitization, content preparation

Content Selection of content for digitization and preservation Types of materials Technology – Largely uniform in technical characteristics – 3 formats ITU G4 TIFF JPEG2000 Unicode (with and without coordinates)

Content Package images Source METS text HT METS Zip

Source Bibliographic Data Content Package Ingest Rigorous validation to ensure conformance with specifications: Resolution, image metadata Barcode Fixity Consistency Well-formedness Prepare archival package
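Two of the ingest checks named on this slide, detecting missing files and testing XML well-formedness, can be sketched as follows. This is only an illustration under assumptions: the function name `inspect_package`, the file names, and the shape of the demo package are hypothetical and are not taken from HathiTrust's ingest tools.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def inspect_package(zip_bytes, expected_files):
    """Return a list of problems found in a zipped content package.

    Hypothetical helper: checks only for missing files and for XML
    members that are not well-formed.
    """
    problems = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as package:
        present = set(package.namelist())
        # Completeness: every expected file must be in the package.
        for name in expected_files:
            if name not in present:
                problems.append(f"missing file: {name}")
        # Well-formedness: every XML member must parse.
        for name in sorted(present):
            if name.endswith(".xml"):
                try:
                    ET.fromstring(package.read(name))
                except ET.ParseError as err:
                    problems.append(f"not well-formed: {name} ({err})")
    return problems


def build_demo_package():
    """Build a small in-memory package with two deliberate defects."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("00000001.txt", "page one")
        package.writestr("mets.xml", "<mets><unclosed></mets>")
    return buffer.getvalue()


demo_problems = inspect_package(
    build_demo_package(),
    ["00000001.txt", "00000001.tif", "mets.xml"],
)
```

Returning the problems as a list rather than raising on the first one lets an ingest run report every defect in a package at once.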

Source Bibliographic Data Content Package Ingest More about ingest New Digitization Existing Digitization Ingest checklist: Deposit Forms Bibliographic metadata specifications Ingest tools Tools for validating, remediating, packaging Detailed content specifications Deposit Guidelines Policies Example METS files and METS profile

Bib Data Data Management Rights Data Holdings Data Bibliographic Data Inventory Loading and updating records Duplicate detection and collation Source of information for VuFind catalog, APIs Rights determination (automated and support for manual review)

Bib Data Data Management Rights Data Holdings Data
namespace | id | attr | reason | source | user | time | note
Inu Jhovater :30:23 NULL

Rights Attributes
id | name | type | dscr
1 | pd | copyright | public domain
2 | ic | copyright | in-copyright
3 | opb | copyright | out-of-print and brittle (implies in-copyright)
4 | orph | copyright | copyright-orphaned (implies in-copyright)
5 | und | copyright | undetermined copyright status
6 | umall | access | available to UM affiliates and walk-in patrons (all campuses)
7 | world | access | available to everyone in the world
8 | nobody | access | available to nobody; blocked for all users
9 | pdus | copyright | public domain only when viewed in the US
10 | cc-by | copyright | Creative Commons Attribution
11 | cc-by-nd | copyright | Creative Commons Attribution-NoDerivatives
12 | cc-by-nc-nd | copyright | Creative Commons Attribution-NonCommercial-NoDerivatives
13 | cc-by-nc | copyright | Creative Commons Attribution-NonCommercial
14 | cc-by-nc-sa | copyright | Creative Commons Attribution-NonCommercial-ShareAlike
15 | cc-by-sa | copyright | Creative Commons Attribution-ShareAlike
16 | orphcand | copyright | orphan candidate - in 90-day holding period (implies in-copyright)
17 | cc-zero | copyright | Creative Commons Zero license (implies pd)
18 | und-world | copyright | Undetermined copyright status and permitted as world-viewable by the depositor
19 | Ic-us | copyright | In copyright in the US
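As a rough illustration of how an access layer might consume these attribute names, the sketch below maps a handful of them to a full-view decision. The grouping of attributes and the function name `full_view_allowed` are assumptions made for this example; the slides do not describe the actual access logic, and any attribute not explicitly handled here is conservatively treated as closed.

```python
# Assumed grouping, based only on the attribute descriptions in the table.
OPEN_TO_ALL = {"pd", "world", "und-world"}
US_ONLY = {"pdus"}


def full_view_allowed(attr, user_in_us):
    """Return True if a volume with this rights attribute may be shown in full.

    Hypothetical policy sketch: Creative Commons attributes (cc-*) and the
    OPEN_TO_ALL set are viewable everywhere, pdus only from the US, and
    everything else is treated as not viewable.
    """
    if attr in OPEN_TO_ALL or attr.startswith("cc-"):
        return True
    if attr in US_ONLY:
        return user_in_us
    return False
```

Defaulting to `False` keeps the sketch on the safe side: an attribute that is unknown or not yet modelled never opens a volume by accident.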

Rights Determination Reason Codes
id | name | dscr
1 | bib | bibliographically-derived by automatic processes
2 | ncn | no printed copyright notice
3 | con | contractual agreement with copyright holder on file
4 | ddd | due diligence documentation on file
5 | man | manual access control override; see note for details
6 | pvt | private personal information visible
7 | ren | copyright renewal research was conducted
8 | nfi | needs further investigation (copyright research partially complete; an ambiguous, unclear, or other time-consuming situation was encountered)
9 | cdpp | title page or verso contain copyright date and/or place of publication information not in bib record
10 | ipma | in-print and market availability research was conducted
11 | unp | unpublished work
12 | gfv | Google viewability set at VIEW_FULL
13 | crms | derived from multiple reviews in the Copyright Review Management System (CRMS) via an internal resolution policy; consult CRMS records for details
14 | add | author death date research was conducted or notification was received from authoritative source
15 | exp | expiration of copyright term for non-US work with corporate author
16 | Del | Deleted from repository; see note for details
17 | Gatt | Non-US public domain work restored to in-copyright in the US by GATT

Access Determinations Automated Manual

Automatic Rights Determination Conducted on all works at time of ingest and when records are modified – Public domain worldwide US works published before 1923, US federal government publications, non-US works published prior to 1873 – Public domain in the United States Non-US works published prior to 1923
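The date rules on this slide can be written as a small decision function. This is a sketch under stated assumptions: the function name, its parameters, and the choice to return `None` when no rule applies are hypothetical, and the cutoff years are the ones stated in this 2015 presentation.

```python
def automatic_rights(pub_year, us_work, us_federal_doc=False):
    """Apply the slide's bibliographic rules to one work.

    Returns "pd" (public domain worldwide), "pdus" (public domain in the
    United States), or None when none of the listed rules applies. What the
    real system assigns in that last case is not stated on the slide.
    """
    if us_federal_doc:
        return "pd"  # US federal government publications
    if us_work:
        return "pd" if pub_year < 1923 else None  # US works before 1923
    if pub_year < 1873:
        return "pd"  # non-US works published prior to 1873
    if pub_year < 1923:
        return "pdus"  # non-US works published prior to 1923
    return None
```

For example, a non-US work from 1900 falls between the two non-US cutoffs and comes out as `pdus`, while a US work from the same year comes out as `pd`.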

Manual Rights Determination IMLS-funded CRMS project – CRMS-US 2008: US-published works Staff at 4 partner institutions – CRMS-World 2011: Expanded to non-US works Staff at 16 partner institutions – Double review with additional expert review for conflicts – Compliance with copyright formalities – As of March ,520 reviewed, 270,979 opened Rights Holder Permissions

System of Precedence Rights Database Bibliographic (automatic) Manual

Bib Data Data Management Rights Data Holdings Data Single-part monographs OCLC #; Local system ID; Timestamp; Holding Status; Condition Multi-part monographs Include enumeration and chronology Serials OCLC #; Local system ID; Timestamp; ISSN

Storage Michigan Indiana Reliability – ensure integrity Redundancy – in single and multiple sites Scalability – including ease of management Accessibility – for repository processes and services Platform-independence – for data/object management

Storage Michigan Indiana EMC Isilon storage Disk-based Load-balancing and fail-over Internal redundancy (N+3) Efficient, reliable replication (daily) Scalable (single file system up to 5 petabytes)

Storage Michigan Indiana Object integrity Continual checks on data integrity Detection and repair of corrupt disk sectors Fixity checks on ingest Periodic checks on fixity of all objects
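A fixity check of the kind listed here compares a freshly computed checksum with the one recorded for the object. The sketch below uses MD5, which the event list later in this deck names for content files and zip archives; the function names and the temporary demo file are assumptions for illustration only.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path, recorded_md5):
    """Return True if the file still matches its recorded checksum."""
    return md5_of_file(path) == recorded_md5.lower()


# Demo on a throwaway file: record a checksum, then verify against it.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name
recorded = hashlib.md5(b"hello").hexdigest()
demo_result = fixity_ok(demo_path, recorded)
demo_mismatch = fixity_ok(demo_path, "0" * 32)
os.remove(demo_path)
```

Reading in chunks keeps memory use flat, which matters when the same routine is run periodically over every object in a large repository.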

Architecture & Management images bib data bib data Source METS text HT METS../uc1/pairtree_root/b3/54/34/86/b b zip b mets.xml

Architecture & Management images bib data bib data Source METS text HT METS../uc1/pairtree_root/b3/54/34/86/b b zip b mets.xml Example ids: wu mdp uc2.ark:/1390/t miua.aaj
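The directory layout on this slide follows the Pairtree convention: the part of the identifier after the namespace is cleaned and split into two-character directory segments. The sketch below is a simplified illustration. The function name is hypothetical, only the three single-character substitutions from the Pairtree specification are applied (the specification's hex-encoding of other special characters is omitted), and the id `uc1.b3543486` in the usage note is inferred from the directory segments shown on the slide rather than stated there.

```python
import posixpath

# Pairtree single-character substitutions: "/" -> "=", ":" -> "+", "." -> ","
_PAIRTREE_SUBSTITUTIONS = str.maketrans({"/": "=", ":": "+", ".": ","})


def pairtree_path(htid):
    """Map a 'namespace.local_id' identifier to a relative pairtree path."""
    namespace, _, local_id = htid.partition(".")
    if not namespace or not local_id:
        raise ValueError(f"expected 'namespace.id', got {htid!r}")
    cleaned = local_id.translate(_PAIRTREE_SUBSTITUTIONS)
    # Split the cleaned id into two-character directory names.
    segments = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    # The object directory at the end is named after the whole cleaned id.
    return posixpath.join(namespace, "pairtree_root", *segments, cleaned)
```

Under these assumptions, `pairtree_path("uc1.b3543486")` gives `uc1/pairtree_root/b3/54/34/86/b3543486`, and an ARK-style id such as `dul1.ark:/13960/t13n2vj0t` has its colon and slashes rewritten before it is split.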

Architecture & Management Reference – Ability to locate objects definitively and reliably over time among other objects (Task Force on Archiving of Digital Information, 1996) – Identification of objects – Structure of the repository – Embedding of identifiers – Permanent URLs – Version dates

What is METS? Metadata Encoding and Transmission Standard Administrative (including preservation), Technical, and Structural metadata

Why METS? Can serve as Archival Information Package and a Dissemination Information Package Designed to record the relationship between pieces of complex digital objects Can be created automatically as texts are loaded or reloaded Preservation actions (PREMIS)

Metadata Framework Details and specifications at repository level – Object specifications / Validation criteria – Page-tagging Variations at object level – Files missing – Non-valid files – Incorrect file checksums

PREMIS Metadata
Object Entity
– identifier dul1.ark:/13960/t13n2vj0t
– file count 960
– page count 320
Event Entity
– UUID 9af6a994-f6fe-3a61-ac0e-be793d347edb
– package inspection T20:37:51Z
– Inspection of download package for missing files
– warning files missing islandoradventur00whit_scanfactors.xml
– MARC21 Code MiU Executor
– tool feedd.pl software

event type | description
capture | Initial capture (digitization) of item
file rename | File renaming to HathiTrust conventions
image modification | Replace boilerplate images with blank images
image compression | Conversion of raw scans to compressed TIFF and JPEG2000
image header modification | Modification of image headers to meet HathiTrust conventions
ingestion | Ingestion of object package into the repository
message digest calculation | Calculation of page-level MD5 checksums (refers to checksum calculations performed prior to content submission to HathiTrust when these checksums are available)
validation | Validation of technical characteristics of image and OCR files
ocr split | Detail is package type specific, e.g.: a) Extraction of plain-text OCR from ALTO XML b) Split OCR into one plain text OCR file per page c) Splitting of IA XML OCR into one plain text OCR file and one XML file (with coordinates) per page
package inspection | Inspection of download package for missing files
page feature mapping | Mapping of original page feature tags to HathiTrust tags
fixity check | Validation of MD5 checksums of content files
zip archive creation | Compression of content files and source METS into zip archive
zip file message digest calculation | Calculation of MD5 checksum for zip archive
source mets creation | Creation of source METS file

Provenance Strategies – Original source – Agent of digitization – Administrative metadata (provenance and preservation)

Security Data Integrity – Checksum validation, digital object provenance Physical security – Biometric door systems, locked racks Network security – Firewalling, vulnerability scanning Application security – Developer best practices, input validation Access control

Authentication Shibboleth – Login with organization – Attributes released to Service Provider – Authorize access –

APIs Bibliographic API – Volume and rights information – MARC records – OAI – “Hathifiles” – Data API – Volume and rights information – Page images – OCR –
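A request to the Bibliographic API is an ordinary HTTP GET against a URL built from an identifier type and value. The slide's own link was lost, so the URL pattern below is an assumption based on HathiTrust's public Bib API documentation, and the function name and the restricted set of identifier types are illustrative only. The sketch builds the URL without making a network call.

```python
from urllib.parse import quote

# Assumed base URL; not present in the slide text.
BIB_API_BASE = "https://catalog.hathitrust.org/api/volumes"

# Illustrative subset of identifier types for this sketch.
_ID_TYPES = {"oclc", "lccn", "issn", "isbn"}


def bib_api_url(id_type, value, detail="brief"):
    """Build a Bibliographic API request URL for a single identifier."""
    if detail not in ("brief", "full"):
        raise ValueError("detail must be 'brief' or 'full'")
    if id_type not in _ID_TYPES:
        raise ValueError(f"unsupported id type: {id_type!r}")
    return f"{BIB_API_BASE}/{detail}/{id_type}/{quote(str(value), safe='')}.json"
```

The resulting URL can then be fetched with any HTTP client; the `brief` form is meant for volume and rights information, and `full` additionally for the MARC record.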

Computational Access Distribution of datasets – Non-Google-digitized Dataset (540,000+) – PD, PDUS, Open Access – Signed researcher statement Google-digitized (4.8 million+) – PD, PDUS, Open Access – Agreement between institution and Google – Brief proposal Characterize texts Provide ids (custom sets possible) Research, results, use of results – Signed researcher statement

HTRC HathiTrust Research Center – Developed collaboratively by Indiana University and University of Illinois; launched July 2011 – Enables computational access to public domain and open access materials; working to support in-copyright materials as well – Secure Environment – bring researchers to the data – Build services and tools that facilitate research by digital humanities and informatics communities – Advanced Collaborative Support RFP: Awards:

How to find out more About: Twitter: Facebook: Monthly newsletter: – – RSS Contact us: Blogs: – Large-scale Search – Perspectives from HathiTrust Resources – A Preservation Infrastructure Built to Last: Preservation, Community, and HathiTrust – PREMIS 2.0 Implementation:

Thank you!