HathiTrust And Its Research Center
John Unsworth
University of Virginia
September 27, 2018
“Hathi” means elephant
“Hathi” means elephant, in Hindi. The elephant is a symbol of both massiveness and memory. The motto of the HathiTrust is “There’s an elephant in the library.” This photo is of an actual elephant in an actual library (in mid-20th-century Edinburgh, Scotland, as part of a PR campaign to remind patrons to return their library books).
HathiTrust Mission
The mission of HathiTrust is to contribute to research, scholarship, and the common good by collaboratively collecting, organizing, preserving, communicating, and sharing the record of human knowledge.
Big Data: The HathiTrust
16,744,704 total volumes
8,126,978 book titles
449,098 serial titles
5,860,646,400 pages
751 terabytes
198 miles (shelf-wise)
13,605 tons (book-wise)
6,277,976 volumes (about 37% of total) in the public domain, so about 63% in copyright.
Stats updated daily at: https://www.hathitrust.org/about
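The proportions implied by these counts can be checked with a few lines of arithmetic. This is only a sketch: the figures are copied from the slide, and the variable names are illustrative rather than anything HathiTrust publishes.

```python
# Figures copied from the slide above; variable names are illustrative only.
total_volumes = 16_744_704
public_domain_volumes = 6_277_976
total_pages = 5_860_646_400

# Share of volumes in the public domain, and the remainder.
public_domain_share = public_domain_volumes / total_volumes
remaining_share = 1 - public_domain_share

# Average number of pages per volume implied by the two totals.
pages_per_volume = total_pages / total_volumes

print(f"public domain: {public_domain_share:.1%}")
print(f"not public domain: {remaining_share:.1%}")
print(f"pages per volume: {pages_per_volume:.0f}")
```

The page total divides evenly into 350 pages per volume, which may indicate that it is an estimate based on an assumed average rather than an actual page tally.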
Domains of Knowledge
Call numbers from A-Z
50% in English
A long tail of languages: over 450 different languages, ancient and modern, from Aleut to Zulu
Publication dates from the 16th century to the present
HathiTrust Manages Complex Copyright Conditions For Access
Access categories: Searchable (bibliographic and full-text); Viewable*; Data API; Print on Demand; Print disabilities*; Preservation uses (Section 108)*
Types of work and their access conditions:
Public domain worldwide: Worldwide. Partners only if scanned by Google, if not, worldwide. Partners worldwide. N/A.
Public domain (US) – Non-US works published between 1872 and 1923: When accessed from within the United States. Partners in the US if scanned by Google, if not, anyone in the US. Available within the United States. Partners in the US; partners worldwide where similar laws in effect.
Works that rights holders have opened access to in HathiTrust: Worldwide (if digitized by Google, full-PDF only available if opened with CC license). Worldwide with permission.
Works that are in-copyright or of undetermined status: Not available. Partners in the US; partners worldwide where similar laws in effect.
HathiTrust recognizes and implements a complex set of copyright conditions. For the most part, HathiTrust texts are search-and-snippet only. The exceptions: out-of-copyright works, works provided for students with documented print disabilities, and texts used for non-consumptive computational analysis.
HathiTrust Research Center
The research arm of the HathiTrust, meant to provide computational access to the entire corpus, including copyrighted materials.
A collaborative effort based at the University of Illinois and Indiana University and supported by co-investments from these universities and the University of Michigan, through the HathiTrust.
Initiated in 2008, formally established in 2011.
HTRC operates under a grant from the University of Michigan to Indiana University and the University of Illinois Urbana-Champaign, with significant financial contributions from IU and UIUC.
HathiTrust Research Center Mission
The mission of the HathiTrust Research Center is to provide infrastructure and tools enabling and supporting computational research on the more than 16 million volumes of the HathiTrust collection. HTRC enables scholars to fully utilize the content of HathiTrust under Fair Use, while preventing violations of U.S. copyright law.
Non-Consumptive Research Paradigm
Non-consumptive research is computational analysis of the collection. It is not research in which a researcher reads or displays substantial portions of an in-copyright or rights-restricted volume to understand the expressive content presented within that volume.
Non-consumptive analytics includes such computational tasks as text extraction, textual analysis and information extraction, linguistic analysis, automated translation, image analysis, file manipulation, OCR correction, and indexing and search.
More here: https://www.hathitrust.org/htrc_ncup
Bring the COMPUTATION to the DATA!
HTRC Analytics
Since 2011, the HathiTrust Research Center has been developing services and tools that allow researchers to employ text and data mining methodologies using the HathiTrust collection. To date, this service has been available only on the portion of the collection that is out of copyright.
With the development of a landmark HathiTrust policy and an updated release of HTRC Analytics, HTRC now (as of September 24, 2018) provides access to the text of the complete 16.7-million-item HathiTrust corpus for non-consumptive research, such as data mining and computational analysis, including items protected by copyright.
Three Approaches
HTRC Analytics (for pre-determined web-based analyses, including Bookworm)
Feature Extraction Services (including downloadable data sets)
Secure Data Capsule access
“Features” here are page-level derived statistical data, including unique words per page, number of occurrences of each word per page, part-of-speech information, etc.
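To make the idea of page-level features concrete, here is a minimal sketch of how such statistics could be derived from page text. It is not HTRC's actual extraction pipeline: the function name `extract_page_features`, the regular-expression tokenizer, and the output field names are all assumptions made for illustration, and part-of-speech tagging is left out because it would require an external tagger.

```python
import re
from collections import Counter

def extract_page_features(page_text: str) -> dict:
    """Return derived statistics for one page (hypothetical schema).

    Only counts are returned, never the running text itself.
    """
    # Naive tokenizer (an assumption): lowercase alphabetic words,
    # optionally with one internal apostrophe, e.g. "there's".
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", page_text.lower())
    counts = Counter(tokens)
    return {
        "token_count": len(tokens),      # total words on the page
        "unique_words": len(counts),     # distinct words on the page
        "word_counts": dict(counts),     # occurrences of each word
    }

# Two made-up "pages" standing in for scanned book pages.
pages = [
    "The elephant is a symbol of memory.",
    "There's an elephant in the library; the library is open.",
]
features = [extract_page_features(p) for p in pages]

for number, page_features in enumerate(features, start=1):
    print(number, page_features["token_count"], page_features["unique_words"])
```

Because only per-page counts leave the function, a reader of the output cannot reconstruct the page in its original order, which is the property that makes this kind of data suitable for non-consumptive use.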
HTRC Analytics for All
HTRC Algorithms: web-based, click-and-run tools to perform computational text analysis on shared public worksets or those you have created, including copyrighted items, for ALL USERS.
Extracted Features Dataset: allows non-consumptive analysis on specific features extracted from the full text of the HathiTrust corpus, including copyrighted items, for ALL USERS.
HathiTrust+Bookworm: a tool for visualizing and analyzing word usage trends in the HathiTrust corpus, including copyrighted items, for ALL USERS.
HTRC Analytics for Members
HTRC Data Capsule: a secure computing environment for text analysis on the HathiTrust corpus, using the researcher’s tools of choice. Access to copyrighted items using an HTRC Data Capsule is available ONLY to HathiTrust member-affiliated researchers, because we anticipate significant demand for this service and HTRC has finite resources to support it.
HathiTrust Member Institutions
How Is This Possible?
HathiTrust exists to enable lawful research and educational uses of its collection. In recent years, US courts have recognized that there is a legal basis for non-consumptive research on copyrighted materials. In 2016, HathiTrust established a Non-Consumptive Use Research Policy to ensure the responsible research use of copyrighted items. That policy is now embodied in the HTRC Analytics services, which allow researchers to conduct computational text analysis on copyrighted items, under the fair use provisions of US copyright law.
What’s Next?
In collaboration with HTRC, JSTOR, and Portico staff, I and some of my staff at UVA are exploring distributed text-mining as a way to enable TDM across both HTRC’s book materials and JSTOR and Portico’s journal materials. We’re starting with sample data sets in biology from HTRC and from 10 Portico publishers who agreed to be part of this pilot. We have developed interoperable Extracted Features Datasets as our first proof-of-concept, demonstrating that we have harmonized our metadata (no mean feat).
Distributed Text-Mining
If text-mining services are only available on a per-publisher basis, there will be no real competition on the merits of the service, and no practical way for researchers to work across publisher collections. Distributed text-mining across HTRC-Portico-JSTOR materials could help to establish metadata interchange guidelines, APIs, and text-mining techniques that will make it possible for researchers to work with even broader collections of copyrighted content that can’t be aggregated and indexed in one place.
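One way to picture distributed text-mining is that each collection holder runs the same analysis locally and shares only aggregate results, which a coordinator then merges. The sketch below illustrates that pattern under stated assumptions: the function names, the whitespace tokenizer, and the tiny sample collections are hypothetical, and nothing here reflects an actual HTRC, JSTOR, or Portico API.

```python
from collections import Counter

def local_term_counts(documents, terms):
    """Run inside each collection: count the requested terms locally.

    Only the resulting counts are meant to leave the collection.
    """
    wanted = set(terms)
    counts = Counter()
    for text in documents:
        # Whitespace tokenization is a simplifying assumption.
        for token in text.lower().split():
            if token in wanted:
                counts[token] += 1
    return counts

def merge_counts(per_collection):
    """Run by the coordinator: sum the counts reported by each collection."""
    total = Counter()
    for counts in per_collection.values():
        total.update(counts)
    return total

# Made-up stand-ins for a book collection and a journal collection.
book_collection = ["the cell divides", "cell walls and cell membranes"]
journal_collection = ["enzyme activity in the cell", "enzyme kinetics"]

terms = ["cell", "enzyme"]
per_collection = {
    "books": local_term_counts(book_collection, terms),
    "journals": local_term_counts(journal_collection, terms),
}
combined = merge_counts(per_collection)
print(dict(combined))
```

The merge step only works if every collection reports results in the same shape, which is why harmonized metadata and shared interchange guidelines matter for this approach.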
Acknowledgements
At Indiana University, HTRC is affiliated with and supported by the IU Pervasive Technology Institute, the School of Informatics, Computing, and Engineering, and the IU Bloomington Libraries. Additional financial support comes from the Office of the Vice Provost for Research. Computational resources are provided by the Pervasive Technology Institute.
At the University of Illinois Urbana-Champaign, HTRC is hosted and supported by the School of Information Sciences in collaboration with the University of Illinois Library. Financial support is provided by the Office of the Provost and the Office of the Vice-Chancellor for Research. Additional resources to advance the mission of HTRC are supplied by the National Center for Supercomputing Applications.