Discovering the HathiTrust Digital Library Collection


1 Discovering the HathiTrust Digital Library Collection
Welcome! Angelina Zaytsev Collection Services Librarian

2 Interacting during the webinar
No attendee video today. Attendee microphones will be muted. Use the chat window to ask questions. Slides and a recording will be shared afterwards. As we get started today, we want to give you the protocol for how we’ll be running this session. We’ll be controlling audio and video from our side, and we’ve decided to mute your microphones and cameras today. Over time we’ve learned that this is the best way to ensure that we aren’t distracted by stray noise, echo, or delays. We will, however, take your questions. There is a chat icon that opens up a window for you to type questions, and we’ll periodically pause to read and respond. We also plan to share these slides and the recording after the fact.

3 Agenda What’s in HathiTrust? Who can access content in HathiTrust?
How do I use the HTDL? + more info for members

4 My assumptions today
Everyone watching is a member and has member access privileges. Everyone watching works in a library and is familiar with concepts common to the profession. I have two assumptions today. If these don’t apply to you, you are welcome to continue watching. I wanted to make this clear at the beginning, though, as there are a few slides or moments where I reference things that are specific to members or librarians.

5 What’s in HathiTrust?

6 HathiTrust Collections: March 2017
15 million total items; 7.5 million book titles; 418,000 serial titles; 799,000 US federal government documents; 5.8 million items open (public domain & CC licenses). Here are the current numbers for the repository. We currently have 15 million items. This number represents each unique digital object in the repository. If we’re talking about how many titles or bibliographic records we have, that breaks down into 7.5 million monographs and 418 thousand serials. 6 April 2016

7 The distribution of languages in HT has stayed consistent over the years. English unsurprisingly makes up 50% of the collection, and then the other languages in the top 10 include European languages such as German, French, and Spanish, and then we start to stray into the Asian continent with Russian, Chinese, and Japanese as well. The top 10 languages make up about 87% of the collection. The remaining 13% of the collection is made up of over 451 languages.

8 In terms of when materials were published, the largest portion of the HathiTrust collection was published in the 1900s and later. The bump we see from is actually an artificial bump and is the result of missing or incomplete dates in the bibliographic records

9 Subjects of HathiTrust content, as derived from Library of Congress Classification classes
One way of looking at the subject matter of HathiTrust materials is to use the Library of Congress Classification classes. This may help you determine which of your users may find HathiTrust most useful. Perhaps unsurprisingly, Language and Literature is our most heavily represented area. There is also a lot of content related to Social Sciences and World History. Keep in mind that this data is at the title or record level, so there are far more volumes than there are records. Also, only about 60% of the records we received from contributors include LCC classes. Note: LCC classes are only present in 60% of HathiTrust records.

10 Top 10 Library of Congress Classification Subclasses
Here’s a look at the top 10 LCC subclasses, further highlighting the points I made earlier about literature, language, social sciences and history being heavily represented in the collection. You can dive into the languages, dates and LCC classes further at the statistics page online.

11 Where does content come from?
When thinking about the HT collection, it’s important to think about the origins and provenances of the collection, in order to understand its strengths and limitations.

12 Top 10 Contributors to HathiTrust, March 2017
HathiTrust does not scan any content itself. Instead, our member libraries scan materials and deposit them into our collection. The HTDL mirrors the collections of our libraries. The University of Michigan and the University of California campuses are our most prolific contributors, followed by other institutions that are part of the Big Ten Academic Alliance.

13 Special Collections
Universidad Complutense de Madrid: Latin, Spanish and French documents from the s
Keio University: 92,000+ Japanese and some Chinese language materials
Islamic Manuscripts from University of Michigan: 8th-20th century CE mss., 1,795 titles in Arabic, Persian, and Ottoman Turkish languages, collaborative cataloging project
Benson Latin American Collection, University of Texas at Austin: 460,000 vols related to Latin American culture and history
US Fed Gov Docs: 799,000+ documents and growing!
I wanted to highlight some of the unique collections that our partners have contributed. UCM is one of the oldest universities in the world; it has been around since 1293. Keio is the oldest university in Japan.

14 US Fed Gov Docs Numbers in the HTDL collection:
415,076 bibliographic records 981,335 separate digital objects 389,864 monographs 24,870 serial titles Fed Docs Registry: attempting to id all US fed docs ever created Federal Documents Collection Profile

15 What kind of content formats?
Scanned from “book-like” materials Image formats: TIFFs and JPEG2000s Plain-text OCR generated automatically PDFs are generated on demand (NOT stored in the repository) It’s important to keep this in mind when using HT content. We are continuing to explore other formats as well

16 Who digitized it?
Type: Characteristics
Google: 94.8% of the collection; download restrictions; primarily scanned in black and white with some color pages; large-scale mass digitization = quality can vary
Internet Archive: 3.7% of the collection; no download restrictions; scanned in full color (as a result, file sizes are 2.5 times larger than Google content)
Locally digitized & vendor services: 1.48% of the collection; various restrictions may apply; typically small scale, “boutique” digitization = high quality (with some exceptions)

17 Quality Management Validation upon ingest Updated versions from Google
Proactive work by partners Report errors to us! Blog post: “Quality in HathiTrust” by Jeremy York and Kat Hagedorn I want to make a quick tangent to discuss quality of volumes and the work we do to manage quality. Before content is ingested, it must pass our rigorous validation specifications. These specs address formats, compression, image resolution, metadata, and other technical things that can be captured through automated means. However, these measures are not able to, for example, detect missing or damaged pages (unless that information is included in the metadata supplied by contributors). Specifically related to Google Books: From the very beginning of the Google Books Project, quality has been an important issue. When you digitize millions of books, errors are inevitable. OCR algorithms are not fool-proof, books that have been sitting on shelves for decades are not in the best condition, and people make mistakes. However, considering the vast number of materials that have been digitized through mass digitization projects, it is remarkable how infrequently we find errors. When we do find errors, we have a number of ways to address the quality. Google is continuously reviewing and improving its image processing and analysis tools. When an error is reported, we are often able to retrieve a better version of a volume from Google, and we also manually work with them to address issues. We also recently launched a process where libraries that had scanned their content through Google are able to scan individual pages and upload them to supplement problem pages in Google volumes. Libraries that have contributed scans that they made locally or through other vendors are also inspecting the quality of their volumes, and we have worked with some contributors to re-scan items for the collection. And finally we have a very active community of users who are keenly interested in improving the quality of the collection.
Since April 2011, the User Support Working Group has received and successfully resolved over 2600 quality issues. So please do report errors to us when you find them, and we may be able to fix them. I'd like to recommend this blog post by Jeremy York and Kat Hagedorn which provides more details about the reasons behind common quality errors and the fixes that we can undertake.

18 Questions?

19 Who can access content in HathiTrust?

20 As you may have noticed, you are not able to access all of the content in HathiTrust. Although HathiTrust has nearly 15 million volumes, only 5.9 million of those volumes are available in full view. Limited view = the user can’t view the pages of the book but can search within it. Full view = the user can search within the book and also view the pages. Some are open because they are in the public domain, either in the US alone or to users located worldwide. US federal documents are not protected by copyright and are available in full view. That sliver of 0.27% of items opened with permissions may seem pretty small, but 0.27% of 15 million items comes out to about 40,000 items.

21 Access is determined by several factors
View Copyright status of the item Derived from: Bibliographic metadata (inc for US fed gov docs) Manual copyright review (CRMS) Permissions agreements Geographic location of the user In the United States vs. Outside the United States Download Member affiliation Yes/No Digitization source and/or contributing institution Any restrictions imposed by these entities? By far, the most frequently asked question that we receive in user support is “why can’t I access this thing?” When a user says “access,” however, they are typically talking about one of two things: being able to view an item or being able to download an item. There are a number of factors. Viewability is driven first by copyright status. For most of the items in the collection, we rely on the dates and locations of publication that are listed in the bibliographic record. For a very small subset of volumes, about 600,000, we were able to manually review the copyright status of books through the Copyright Review Management System. These are typically materials that have been published in the United States, Canada, Australia and the United Kingdom within specific time periods. We have also received permission from authors and publishers to make their works available in our collection. Viewability of content can vary depending on whether a user is located within or outside the United States, as detected by the user’s IP address. We do this because, for some content, we are only sure of its copyright status according to US copyright law. The ability to download material is driven primarily by whether a user is affiliated with a member institution and by who digitized or provided the content. We only impose download restrictions where they have been required by the contributor or vendor.

22 Anybody anywhere can... Read public domain and open access works (via web) Search across the full text of the entire collection (via web) Receive a dataset of public domain content (under certain conditions) Mine text and data (via HTRC) Mine text of copyrighted content (in pilot mode via HTRC) To summarize:

23 Members can... Download public domain and open access works
Provide access to their users who are blind or with print disabilities Get access to replacement copies for lost and damaged print copies (under certain conditions)

24 To get access to member benefits, always log in.

25 Access for users with print disabilities
Access to ALL content A mediated access service for members Staff member at university has to pre-register then downloads and distributes content to users Connect with your local office of disability services! One other benefit that our members are able to take advantage of is the print disability service. This is a mediated access service at this time. A staff member has to register with us, and then they access content and distribute it to eligible users on demand. This is a great opportunity, but we haven’t seen as much uptake of this service as we would like. Services for blind and print disabled students are typically located at offices separate from the library, and our main connections are with the libraries at our partner campuses. We would encourage our member libraries to connect with these offices to ensure that they are taking full advantage of this service. When implemented, this service can minimize the amount of time that staff spend scanning full books and running OCR on them.

26 In the public domain (anyone anywhere can view) Digitized by Google
So how do you know what you’re able to do? We’ve left some clues in the interface.
In the public domain (anyone anywhere can view), digitized by Google: user needs to log in as a member to download the book.
Opened with a Creative Commons license (overrides any vendor restrictions): anyone anywhere can download the book.
Public domain for users in the USA, no vendor restrictions: any user in the US can download.
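The view and download rules described in this section can be sketched as a small decision procedure. This is only an illustration of the factors named in the talk (copyright status, user location, member login, digitizer restrictions), not HathiTrust's actual implementation; the names `Rights`, `Item`, `User`, `can_view`, and `can_download` are hypothetical, and the sketch assumes that items opened with permissions or a Creative Commons license are viewable worldwide. It ignores the mediated print disability service.

```python
from dataclasses import dataclass
from enum import Enum


class Rights(Enum):
    # Simplified, hypothetical rights categories for illustration only.
    PUBLIC_DOMAIN_WORLDWIDE = "public domain everywhere"
    PUBLIC_DOMAIN_US = "public domain in the US only"
    OPEN_LICENSE = "Creative Commons license or permission"
    IN_COPYRIGHT = "in copyright (search only)"


@dataclass
class Item:
    rights: Rights
    download_restricted: bool  # restriction required by digitizer/contributor


@dataclass
class User:
    in_us: bool          # location, e.g. as detected from the IP address
    member_login: bool   # logged in through a member institution


def can_view(item: Item, user: User) -> bool:
    """Full view depends on copyright status and the user's location."""
    if item.rights in (Rights.PUBLIC_DOMAIN_WORLDWIDE, Rights.OPEN_LICENSE):
        return True
    if item.rights is Rights.PUBLIC_DOMAIN_US:
        return user.in_us
    return False  # limited view: searchable, but pages are not shown


def can_download(item: Item, user: User) -> bool:
    """Download adds member affiliation and digitizer restrictions."""
    if not can_view(item, user):
        return False
    if item.rights is Rights.OPEN_LICENSE:
        return True  # the license overrides any vendor restrictions
    if item.download_restricted:
        return user.member_login
    return True


# The three interface examples from the slide.
google_pd = Item(Rights.PUBLIC_DOMAIN_WORLDWIDE, download_restricted=True)
cc_item = Item(Rights.OPEN_LICENSE, download_restricted=True)
us_pd = Item(Rights.PUBLIC_DOMAIN_US, download_restricted=False)

guest_abroad = User(in_us=False, member_login=False)
guest_us = User(in_us=True, member_login=False)
member_us = User(in_us=True, member_login=True)
```

With these definitions, a guest can view but not download the Google-digitized public domain book, while a logged-in member can download it; the Creative Commons item is downloadable by anyone; and the US-only public domain item is downloadable by any user in the US but not viewable from abroad.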

27 How do I use the HTDL?

28 Interface overview Searching the catalog Searching the full-text
Interacting with a volume Creating personal collections

29

30 Other ways to retrieve data
Get bib records in bulk: Get datasets: Get content programmatically: Get bib records for known identifiers: Get some data for all HT content:

31 More ways to work with text
HathiTrust Research Center provides a secure environment and services to support data mining and text analysis Portal: HathiTrust+Bookworm: visualize word trends Extracted Features dataset: bits of data about the content Advanced Collaborative Support (ACS): mini-grants where awardees get staff time, not $ IN DEVELOPMENT: HTRC workshops to “train the trainers” how to mine HT data

32 More info for members

33 For best ROI, implement widely
Variety of ways for libraries to market and incorporate HathiTrust in their arsenal of services List and describe in LibGuides Add links to HathiTrust full-view content to your existing content using hathifiles Add records for HathiTrust full-view content to your catalog: via OAI, retrieval from OCLC, discovery services (Summon, Primo, WMS) Add records for all HathiTrust content to your catalog and carefully manage user expectations

34 HathiTrust Shared Print Program
Goal: Link the preservation of HathiTrust digital and print holdings to ensure that corresponding print copies continue to be available Retain volumes that mirror book titles in the HathiTrust Digital Library (currently 7.4 million) Maintain a lendable shared print collection distributed among HathiTrust member collections Reflect support by and provide benefits to all HathiTrust members (not an opt-in subset)

35 HTSP Program Phases Finalize policies & MOU
Identify initial retentions Phase 1: Quick Launch Adopt tools for collection analysis, collection management, resource-sharing Plan next priorities and services (e.g. digitization) Phase 2: Infrastructure Build momentum Build infrastructure 50+ libraries in Phase 1! Program Officer Lizanne Payne will be reporting out to the membership towards the end of Phase 1, so keep an eye out for more info soon. 2018--

36 More information FAQ @ https://www.hathitrust.org/help
Contact Problems with content or records? Click “feedback” Subscribe to newsletters

37 Questions?

