The HathiTrust Research Center: Building Shared Computational Resources to Mine the Largest Academic Digital Library Corpus Tweet Us: #HTRC #SESS037 #EDU13.



Robert H. McDonald, Indiana University
Beth Sandore Namachchivaya, University of Illinois
John Unsworth, Brandeis University
Educause Annual Meeting, Anaheim, CA, October 16, 2013

HathiTrust Partnership
Allegheny College; Arizona State University; Baylor University; Boston College; Boston University; California Digital Library; Carnegie Mellon University; Columbia University; Cornell University; Dartmouth College; Duke University; Emory University; Florida State University; Getty Research Institute; Harvard University Library; Indiana University; Iowa State University; Johns Hopkins University; Lafayette College; Library of Congress; Massachusetts Institute of Technology; McGill University; Michigan State University; New York Public Library; New York University; North Carolina Central University; North Carolina State University; Northwestern University; The Ohio State University; The Pennsylvania State University; Princeton University; Purdue University; Stanford University; Syracuse University; Texas A&M University; Tufts University; Universidad Complutense de Madrid; University of Alabama; University of Arizona; University of Calgary; University of California (Berkeley, Davis, Irvine, Los Angeles, Merced, Riverside, San Diego, San Francisco, Santa Barbara, Santa Cruz); The University of Chicago; University of Connecticut; University of Delaware; University of Florida; University of Illinois at Chicago; The University of Iowa; University of Maryland; University of Miami; University of Michigan; University of Minnesota; University of Missouri; University of Nebraska-Lincoln; The University of North Carolina at Chapel Hill; University of Notre Dame; University of Oklahoma; University of Pennsylvania; University of Pittsburgh; University of Utah; University of Virginia; University of Washington; University of Wisconsin-Madison; Utah State University; Virginia Tech; Wake Forest University; Washington University; Yale University Library

HathiTrust Mission
To contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge.

HathiTrust Services
Long-term preservation
  – Bit-level and migration
Bibliographic search
Full-text search
Reading and download capabilities
Print on demand
Collections
Datasets
HathiTrust Research Center

HathiTrust “Wow” Numbers
10,819,596 total volumes
5,672,046 book titles
281,890 serial titles
3,786,858,600 pages
485 terabytes
128 miles
8,791 tons
3,469,225 volumes (~32% of total) in the public domain
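As a quick consistency check on the figures above, the public-domain share and the implied pages per volume can be recomputed from the stated totals. This is only a sketch; the variable names are illustrative and not part of the presentation.

```python
# Figures as stated on the slide.
total_volumes = 10_819_596
public_domain_volumes = 3_469_225
total_pages = 3_786_858_600

# Share of volumes in the public domain (the slide rounds this to ~32%).
pd_share = public_domain_volumes / total_volumes

# Average pages per volume implied by the two totals.
pages_per_volume = total_pages / total_volumes

print(f"public domain share: {pd_share:.1%}")
print(f"pages per volume: {pages_per_volume:.0f}")
```

The pages-per-volume figure comes out to a round number, which suggests the page total may be an estimate derived from an average volume length. That is an inference from the arithmetic, not something the slide states.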

Discovery and Use
Search, collections, online access
APIs and data feeds
  – Data API
  – Bibliographic API
  – “Hathifiles” inventory files
  – OAI
Computational Research
  – Distribution of datasets
  – Protocol-based access
  – Research Center

Research Center in Context

Goals for HTRC
Provide a persistent and sustainable structure to enable scholars to ask and answer new questions.
  – Leverage data storage and computational infrastructure at Indiana & Illinois
  – Stimulate community development of new functionality and tools
  – Use tools to enable discoveries that would not be possible without the HTRC
Enable scholars to fully utilize content of HathiTrust Library while preventing intellectual property misuse within U.S. copyright law.
  – Provide a secure computational and data environment for scholars to perform research using HathiTrust Digital Library.

[Organization chart. Labels: Board of Governors; Executive Committee; Executive Director; HathiTrust; University of Illinois; Indiana University; HathiTrust Research Center; University of Michigan; Data Copy #1; Data Copy #2]

HTRC Governance
Reports to the HathiTrust Board of Governors
HTRC Executive Committee
  – J. Stephen Downie (Co-director), Professor and Associate Dean for Research, University of Illinois GSLIS
  – Beth Plale (Co-director and Chair), Director, Data to Insight Center, and Professor in the School of Informatics and Computing at Indiana University
  – Robert H. McDonald, Associate Dean of Libraries/Deputy Director, Data to Insight Center at Indiana University
  – Beth Sandore Namachchivaya, Associate University Librarian for Information Technology Planning & Policy at the University of Illinois
  – John Unsworth, Vice Provost for Library & Technology Services and Chief Information Officer at Brandeis University
HTRC Advisory Board (see members on the next slide)
Google Public Domain agreement: in place for IU and UIUC

HTRC Advisory Board
Cathy Blake, University of Illinois, Urbana-Champaign
Beth Cate, Indiana University
Greg Crane, Tufts University
Laine Farley, California Digital Library
Brian Geiger, University of California at Riverside
David Greenbaum, University of California at Berkeley
Fotis Jannidis, University of Würzburg, Germany
Matthew Jockers, Stanford University
Jim Neal, Columbia University
Bill Newman, Indiana University
Bethany Nowviskie, University of Virginia
Andrey Rzhetsky, University of Chicago
Pat Steele, University of Maryland
Craig Stewart, Indiana University
David Theo Goldberg, University of California at Irvine
John Towns, National Center for Supercomputing Applications
Madelyn Wessel, University of Virginia

Data Overview

Hathifiles
Tab-delimited inventory files
Aggregated monthly
Daily incremental files
Contain
  – Identifiers
  – Limited bibliographic information
  – Rights, language, gov docs status information
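Because the Hathifiles are tab-delimited, they can be read with a standard delimited-text parser. The sketch below is illustrative only: the slide does not give the actual column layout, so the column names, the sample rows, and the helper functions are all assumptions.

```python
import csv
import io

# Hypothetical column names and sample rows, for illustration only;
# the real Hathifiles layout is not given on the slide.
COLUMNS = ["volume_id", "rights", "title", "language", "us_gov_doc"]

SAMPLE = (
    "vol.0001\tpd\tAn Example Title\teng\t0\n"
    "vol.0002\tic\tAnother Title\tger\t0\n"
    "vol.0003\tpd\tA Federal Report\teng\t1\n"
)

def read_inventory(handle):
    """Yield one dict per tab-delimited inventory row."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield dict(zip(COLUMNS, row))

def public_domain_ids(rows):
    """Return the identifiers of rows whose rights value is 'pd'."""
    return [r["volume_id"] for r in rows if r["rights"] == "pd"]

rows = list(read_inventory(io.StringIO(SAMPLE)))
print(public_domain_ids(rows))
```

For a real monthly or daily file, the `io.StringIO` sample would be replaced by an opened file handle, and `COLUMNS` by the published field list.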

Content Distribution

Content Sources

Dates

Language Distribution
The top 10 languages make up ~86% of all content.

Data Availability

[Diagram. Labels: Source; Bibliographic Data; Content Package; Indiana; Michigan; Bib Data; Data Management; Rights Data; Storage; Access; Ingest; Catalog; Full-text Search; PageTurner; APIs; Collections; Holdings Data; Datasets]

How is it available?
Web interfaces
APIs
  – Data API
  – Bib API
Data feeds and distribution
  – Hathifiles
  – OAI
  – Datasets
Soon: Virtual Machines

Copyright

Copyright
Strongly bound to US copyright issues, with constant vigilance of the international scene
Status determinations via:
  – Bibliographic metadata
  – Automatic and manual rights determination

Automatic Rights Determination
Conducted on all works at time of ingest and when records are modified
  – Public domain worldwide: US works published before 1923, US federal government publications, non-US works published prior to 1872
  – Public domain in the United States: non-US works published prior to 1923
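The date rules above can be written as a small decision function. This is a sketch under stated assumptions, not HathiTrust's actual ingest logic: the function name and parameters are hypothetical, the mapping of "public domain worldwide" to the code `pd` and "public domain in the United States" to `pdus` is assumed from the rights attribute names, and works not covered by these rules return `None` because the slide does not say how they are handled.

```python
from typing import Optional

def automatic_rights(pub_year: int, us_work: bool,
                     us_federal_doc: bool = False) -> Optional[str]:
    """Apply the date-based rules from the slide (hypothetical helper).

    Returns "pd" (public domain worldwide), "pdus" (public domain in the
    United States), or None when the slide's rules do not decide the case.
    """
    if us_federal_doc:
        return "pd"            # US federal government publications
    if us_work:
        # US works published before 1923
        return "pd" if pub_year < 1923 else None
    if pub_year < 1872:
        return "pd"            # non-US works published prior to 1872
    if pub_year < 1923:
        return "pdus"          # non-US works published prior to 1923
    return None                # left to other processes, e.g. manual review

print(automatic_rights(1900, us_work=True))    # pd
print(automatic_rights(1900, us_work=False))   # pdus
print(automatic_rights(1950, us_work=False))   # None
```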

Manual Rights Determination
IMLS-funded CRMS project
  – US-published works
  – Conformance with formalities
  – Expanding to non-US works
  – Double-blind review with expert review for conflicts
  – Staff at 4 HathiTrust partner institutions (15 will take part in non-US)
  – As of February 2012, ~190,000 reviewed, more than 100,000 opened
Rights Holder Permissions
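The double-blind review with expert review for conflicts can be pictured as a simple resolution step. The sketch below is an assumption about the general shape of such a workflow; the slide does not describe CRMS's actual resolution policy, and the function name and status labels are hypothetical.

```python
def resolve_reviews(review_a, review_b, expert_review=None):
    """Combine two independent reviews into a (status, determination) pair.

    Hypothetical logic: agreement is accepted as-is; disagreement is
    escalated and settled by an expert review once one is available.
    """
    if review_a == review_b:
        return ("agreed", review_a)
    if expert_review is not None:
        return ("expert-resolved", expert_review)
    return ("needs-expert-review", None)

print(resolve_reviews("pd", "pd"))
print(resolve_reviews("pd", "ic"))
print(resolve_reviews("pd", "ic", expert_review="ic"))
```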

Rights Attributes
id  name         type       dscr
1   pd           copyright  public domain
2   ic           copyright  in-copyright
3   opb          copyright  out-of-print and brittle (implies in-copyright)
4   orph         copyright  copyright-orphaned (implies in-copyright)
5   und          copyright  undetermined copyright status
6   umall        access     available to UM affiliates and walk-in patrons (all campuses)
7   world        access     available to everyone in the world
8   nobody       access     available to nobody; blocked for all users
9   pdus         copyright  public domain only when viewed in the US
10  cc-by        copyright  Creative Commons Attribution
11  cc-by-nd     copyright  Creative Commons Attribution-NoDerivatives
12  cc-by-nc-nd  copyright  Creative Commons Attribution-NonCommercial-NoDerivatives
13  cc-by-nc     copyright  Creative Commons Attribution-NonCommercial
14  cc-by-nc-sa  copyright  Creative Commons Attribution-NonCommercial-ShareAlike
15  cc-by-sa     copyright  Creative Commons Attribution-ShareAlike
16  orphcand     copyright  orphan candidate - in 90-day holding period (implies in-copyright)
17  cc-zero      copyright  Creative Commons Zero license (implies pd)
18  und-world    copyright  undetermined copyright status and permitted as world-viewable by the depositor
19  ic-us        copyright  in copyright in the US

Rights Determination Reason Codes
id  name  dscr
1   bib   bibliographically-derived by automatic processes
2   ncn   no printed copyright notice
3   con   contractual agreement with copyright holder on file
4   ddd   due diligence documentation on file
5   man   manual access control override; see note for details
6   pvt   private personal information visible
7   ren   copyright renewal research was conducted
8   nfi   needs further investigation (copyright research partially complete; an ambiguous, unclear, or other time-consuming situation was encountered)
9   cdpp  title page or verso contain copyright date and/or place of publication information not in bib record
10  cip   condition review and in-print status research was conducted
11  unp   unpublished work
12  gfv   Google viewability set at VIEW_FULL
13  crms  derived from multiple reviews in the Copyright Review Management System (CRMS) via an internal resolution policy; consult CRMS records for details
14  add   author death date research was conducted or notification was received from authoritative source
15  exp   expiration of copyright term for non-US work with corporate author
16  del   deleted from repository; see note for details
17  gatt  non-US public domain work restored to in-copyright in the US by GATT
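A rights determination pairs one attribute with one reason code, so the two tables above can be treated as lookup tables. The following is a minimal sketch using a handful of the codes listed above; the function name and output format are hypothetical.

```python
# A subset of the codes from the two tables, copied from the slides.
RIGHTS_ATTRIBUTES = {
    "pd": "public domain",
    "ic": "in-copyright",
    "und": "undetermined copyright status",
    "pdus": "public domain only when viewed in the US",
}

REASON_CODES = {
    "bib": "bibliographically-derived by automatic processes",
    "ren": "copyright renewal research was conducted",
    "crms": "derived from multiple reviews in the Copyright Review "
            "Management System (CRMS)",
}

def describe_rights(attribute, reason):
    """Return a readable description for an (attribute, reason) pair."""
    attr_text = RIGHTS_ATTRIBUTES.get(attribute, "unknown attribute")
    reason_text = REASON_CODES.get(reason, "unknown reason")
    return f"{attribute}/{reason}: {attr_text}; {reason_text}"

print(describe_rights("pd", "bib"))
print(describe_rights("pdus", "crms"))
```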

Columns: Type of work | Searchable (bibliographic and full-text) | Viewable* | Full-PDF download (Data API) | Print on Demand | Print disabilities* | Preservation uses (Section 108)*

Public domain worldwide | Worldwide | Partners only if scanned by Google, if not, worldwide | Worldwide | Partners worldwide | N/A
Public domain (US) – Non-US works published between 1872 and | Worldwide | When accessed from within the United States | Partners in the US if scanned by Google, if not, anyone US | Available within the United States | Partners in the US; partners worldwide where similar laws in effect | N/A
Works that rights holders have opened access to in HathiTrust | Worldwide | Worldwide (if digitized by Google, full-PDF only available if opened with CC license) | Worldwide with permission | Partners worldwide | N/A
Works that are in-copyright or of undetermined status | Worldwide | Not available | Partners in the US; partners worldwide where similar laws in effect | Partners in the US; partners worldwide where similar laws in effect
Orphan works | Worldwide | To participating partners | Not available | Partners in the US | Partners in the US; partners worldwide where similar laws in effect

* Note: Access to in-copyright works is subject to conditions on Terms of Access slide.

HTRC Research Paradigm

Bring the COMPUTATION to the DATA!

Web services architecture and protocols
Registry of services and algorithms
Solr full text indexes
NoSQL store as volume store
OpenID authentication
Portal front-end, programmatic access
Data mining algorithms
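To make the role of the Solr full-text index concrete, the sketch below builds a query URL for a Solr select handler using the standard `q`, `rows`, `fl`, and `wt` parameters. The host, core name, and the `ocr` field name are hypothetical placeholders; the presentation does not give the HTRC index schema or endpoint, and no request is actually sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_solr_query(base_url, field, terms, rows=10):
    """Build a Solr select URL matching any of the given terms.

    base_url and field are placeholders supplied by the caller.
    """
    query = " OR ".join(f'{field}:"{term}"' for term in terms)
    params = {"q": query, "rows": rows, "fl": "id", "wt": "json"}
    return f"{base_url}/select?{urlencode(params)}"

# Hypothetical endpoint and field name, for illustration only.
url = build_solr_query(
    "https://solr.example.org/solr/volumes",
    field="ocr",
    terms=["Darwin", "Romanes"],
)
print(url)

# Decode the query string again to show what the server would receive.
print(parse_qs(urlsplit(url).query)["q"][0])
```

The list of volume identifiers returned by such a search is the kind of selection that would then be handed to an analysis algorithm, in keeping with the idea of bringing the computation to the data.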

[HTRC architecture diagram. Labels: Agent framework; Page/volume tree (file system); Volume store (Cassandra); SEASR analytics service; Task deployment; WSO2 registry (services, collections, data capsule images); Solr index; HathiTrust corpus; rsync; HTRC Data API v0.1; NCSA local resources; Programmatic access; WSO2 Identity Server; University of Michigan; Meandre Orchestration; Agent instance; Non-consumptive Data capsules; Big Red II / IU Quarry; Blacklight; NSF XSEDE; Portal]

[Diagram. Labels: HTRC; Complexity hiding interface; All the complexity; Request; Tabular info; Statistical plots; Spatial plots]

[Diagram. Labels: Complexity hiding interface; Other data (dictionaries, wiki data); Subsets of corpus; HTRC Text mining algorithms]

[Data capsule diagram. Labels: VM Image Manager; VM Image Store; VM Image Builder; VM Manager; VM instance; Secure Virtual Cloud; SSH; Non-consumptive Output Storage; Researcher; HTRC Research Access; Request for VM]

1. Select volumes for analysis
2. Select algorithm
3. View/download results
Named Entities
Word frequencies
Topic models
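The three steps above (select volumes, select an algorithm, view results) can be sketched with word frequencies as the chosen algorithm. This is a toy illustration, not HTRC code: the workset is a hypothetical in-memory dictionary of volume identifier to text, and the tokenizer is deliberately crude.

```python
import re
from collections import Counter

def word_frequencies(volumes, top_n=5):
    """Count lowercase word tokens across all volumes in a workset."""
    counts = Counter()
    for text in volumes.values():
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(top_n)

# Step 1: a (hypothetical) selection of volumes.
workset = {
    "vol1": "Natural selection acts on variation.",
    "vol2": "Variation is the raw material of selection.",
}

# Step 2: choose the algorithm; step 3: view the results.
result = word_frequencies(workset, top_n=2)
print(result)
```

Only derived counts leave the function, never the underlying text, which is the general idea behind the non-consumptive model shown on the preceding slides.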

Research Engagements

1,315 volumes selected using a keyword search for 'Darwin', 'Romanes', 'anthropomorphism', and 'comparative psychology'. This set contains many books that are not of particular interest, e.g., books on theology and college course catalogs.
Challenge: find the philosophical arguments in a haystack of sentences.
Colin Allen, Professor, Cognitive Science, Indiana University
Digging into Data

Yearly values of the ratio between two wordlists in three different genres, across 4,275 volumes.
Ted Underwood, Dept. of English, UIUC
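A yearly ratio between two wordlists can be computed by counting, for each publication year, how many tokens fall in each list. The slide does not say how the ratio was defined or which wordlists were used, so the plain count ratio, the toy wordlists, and the sample data below are all assumptions for illustration.

```python
from collections import defaultdict

def yearly_ratio(volumes, list_a, list_b):
    """Return {year: count(list_a tokens) / count(list_b tokens)}.

    volumes is an iterable of (year, tokens) pairs. Years with no
    list_b tokens are skipped to avoid dividing by zero.
    """
    a_counts = defaultdict(int)
    b_counts = defaultdict(int)
    for year, tokens in volumes:
        for token in tokens:
            if token in list_a:
                a_counts[year] += 1
            elif token in list_b:
                b_counts[year] += 1
    years = sorted(set(a_counts) | set(b_counts))
    return {y: a_counts.get(y, 0) / b_counts[y]
            for y in years if b_counts.get(y, 0) > 0}

# Toy wordlists and volumes, purely illustrative.
list_a = {"house", "hand", "old"}
list_b = {"residence", "manual", "ancient"}
sample = [
    (1800, ["ancient", "residence", "house"]),
    (1850, ["old", "house", "hand", "manual"]),
]
ratios = yearly_ratio(sample, list_a, list_b)
print(ratios)
```

Running the same function separately over the volumes of each genre would give one yearly series per genre.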

Phenotypes implemented at the level of genes
General study: understanding of how phenotypes, such as human healthy diversity and maladies, are implemented at the level of genes.
Why HTRC: capture properties of language automatically, for text transformations and information extraction. Generalize grammatical and idiomatic patterns as related to systems biology.
Andrey Rzhetsky, Professor, Department of Medicine, University of Chicago

Other Grants and Proposals involving HTRC
Zdenek Zdrahal, “DiscoveryCORE, Discovering Hidden Relationships in Semantically Connected Resources”, NEH Digging Into Data Challenge.
Matthew Wilkens, Notre Dame, “Literary Geography at Scale”, American Council of Learned Societies (ACLS).
Ichiro Fujinaga, “Single Interface for Music Score Searching and Analysis (SIMSSA)”, to SSHRC, Canada. Pending.
Andrew Piper, “Text Mining the Novel: Establishing the Foundations of a New Discipline”, SSHRC, Canada.
Robert Iliffe, University of Sussex, Textual Genomics Project (TTGP), United Kingdom Arts and Humanities Research Council.
Edie Rasmussen, “From Indexer’s Legacy to Scholar’s Desktop”.
Adam Farquhar, The British Library, IRIS, Arts and Humanities Research Council grant.

Workset Creation for Scholarly Analysis
Funded at $493,000 by the Andrew W. Mellon Foundation; Co-PIs: J. Stephen Downie, Tim Cole, Beth Plale; 1 July June
Goals:
1) enriching the metadata in the HathiTrust corpus
2) augmenting string-based metadata with URIs to leverage discovery and sharing through external services
3) formalizing the notion of collections and worksets in the context of the HathiTrust Research Center
Includes an open, competitive Request for Proposals in November 2013, with the intent to fund four prototyping projects that will build tools for enriching and augmenting metadata for the HathiTrust corpus.

HTRC Sloan Cloud for Secure Text-Mining at Scale
Funded at $606,000 by The Alfred P. Sloan Foundation; Beth Plale, Indiana University, PI; Atul Prakash, University of Michigan, Co-PI; Fall Spring
Goal: Prototype a system that enables secure text mining to be carried out at scale using public cloud resources, including:
1. a software cloud infrastructure based on OpenStack
2. mechanisms for managing a secure virtual machine
We plan for the Sloan Cloud to provide users with dedicated virtual machines that are pre-configured with appropriate tools and provide secure access to remote data that cannot be funneled through the VM to outside filesystems.

Thank You
This presentation was made possible with content provided by many HTRC colleagues: John Unsworth, J. Stephen Downie, Beth Plale, Robert H. McDonald, Beth Sandore, Yiming Sun, Miao Chen, Guangchen Ruan, Loretta Auvil, Kirk Hess, and many others.
The HTRC Non-Consumptive Research Grant is graciously funded by the Alfred P. Sloan Foundation.
IU D2I-PTI is graciously funded by The Lilly Endowment, Inc.
HTRC - IU D2I Center - UIUC GSLIS -

Contact Information
Speakers:
  – Robert H. McDonald, Indiana University
  – Beth Sandore Namachchivaya, University of Illinois
  – John Unsworth, Brandeis University
Requests for assistance: Miao Chen, HTRC Education and Outreach
