Creating a Registry of U.S. Federal Documents

Slides:



Advertisements
Similar presentations
KAT HAGEDORN HATHITRUST SPECIAL PROJECTS COORDINATOR UNIVERSITY OF MICHIGAN LIBRARIES OCTOBER 9, 2009 Seamless Sharing: NYU, HathiTrust, ReCAP and the.
Advertisements

HathiTrust Sharing a Federal Print Repository: Issues and Opportunities May 25, 2011 Heather Christenson.
What is HathiTrust and How Can it Make a Difference? Sourcing and Scaling brought to the collective collection.
KAT HAGEDORN HATHITRUST SPECIAL PROJECTS COORDINATOR UNIVERSITY OF MICHIGAN LIBRARIES OCTOBER 9, 2009 Seamless Sharing: NYU, HathiTrust, ReCAP and the.
Digital Preservation A Matter of Trust. Context * As of March 5, 2011.
Dr. Clem Guthro, Director of the Colby College Libraries MSCS Project Co-PI Maine Shared Collections Strategy: Print Archive Network Update
The White Rose Collaborative Collection Partnership Brian Clifford University of Leeds.
Authentication of the Federal Register Charley Barth Director, Office of the Federal Register United States Government.
The Library behind the scene How does it work ? The Library behind the scenes 1 JINR / CERN Grid and advanced information systems 2012 Anne Gentil-Beccot.
Moving Shared Print to the Network Level Emily Stambaugh ALA Annual Conference Las Vegas, NV June 27, 2014 “Looking to the Future of Shared Print” Shared.
The FDLP Web Archive Dory Bower Archive-It Partner Meeting November 18, 2014.
Alternate Views: Outside Comments and Reviews of the FDLP ( ) Kate Tallman: University of Colorado-Boulder Janet Fisher: State Library of Arizona.
CERES AND COLORADO STATE UNIVERSITY LIBRARIES. PROJECT CERES Begun in 2013, Project CERES is a Center for Research Libraries Global Resources Agriculture.
OCLC Online Computer Library Center A Global OpenURL Resolver Registry Phil Norman OCLC Dlsr4lib Workshop March 23 rd, 2006 Arlington VA.
Session 7 Selection of Online Resources and Options for Providing Access.
VALIDATION AND RISK MANAGEMENT FOR SHARED PRINT JOURNAL ARCHIVES: UC JSTOR AND WEST CASES CRL Print Archives Network Meeting ALA Midwinter, Chicago January.
The National Digital Newspaper Program (NDNP) An NEH/LC Collaborative Program Enhancing access to historical newspapers Release: September 2006.
Technical Services & Cataloging and Classification Jennifer Anielski and Christina Tracy IS 554 Public Library Management.
Metadata Guidelines for Disclosing Shared Print Commitments Lizanne Payne Shared Print Consultant ALA Midwinter 2013.
HathiTrust – How To By Dr. Rob McGeachin 20 th Annual AgNIC Meeting May 7, 2015.
Digitization of the Federal Depository Library Program Judith C. Russell Superintendent of Documents & Managing Director, Information Dissemination “Electronic.
Digitization Panel August 12, 2010 Christopher C. Brown, coordinator Mike Culbertson, Colorado State U. James Mauldin, GPO.
1 On the Record Report of the Library of Congress Working Group on the Future of Bibliographic Control Diane Boehr Head of Cataloging, NLM
OCLC Online Computer Library Center Kathy Kie December 2007 OCLC Cataloging & Metadata Services an introduction.
Ms. Irene Onyancha ISTD/Library & Information Management Services United Nations Economic Commission for Africa The Second Session of the Committee on.
The Legislative Library of Ontario’s Ontario Documents Repository Road to Partnership.
PROMOTING TRANSPARENCY THROUGH LIBRARY – GOVERNMENT COLLABORATION Presentation to BCLA Government and Legal Information Gathering May 13, 2011.
The Value of Geospatial Metadata Metadata has tremendous value to Individuals within your organization, as well as to individuals outside of your organization.
Challenges and Opportunities for Academic Libraries Collaborative Imperatives to Support Collections, Digital Initiatives, and New Services for a Changing.
HathiTrust’s Past, Present and Future. Short- and Long-term Functional Objectives Short-term Page turner mechanism (and Mobile!) Branding (overall initiative;
1 CS 502: Computing Methods for Digital Libraries Lecture 19 Interoperability Z39.50.
HATHITRUST A Shared Digital Repository The HathiTrust Print Monograph Archive Planning Task Force Print Archive Network Forum ALA 2015 Annual Meeting June.
Digital Accountability: The Line Between Producing and Preserving Digital Government Information Mary Alice Baish Superintendent of Documents Indiana State.
GPO POLICIES AND PLANS FOR SPATIAL INFORMATION DISTRIBUTION GPO POLICIES AND PLANS FOR SPATIAL INFORMATION DISTRIBUTION Judy Russell Superintendent of.
Susan Lyons, Rutgers University Law School Library, Moderator George Barnum, United States Government Printing Office Cathy Nelson Hartman, University.
Warwick Cathro Assistant Director-General Resource Sharing and Innovation National Library of Australia Trove – a service built on collaboration OCLC Asia.
Shared Print Gold Rush Library Content Comparison George Machovec, Executive Director Colorado Alliance of Research Libraries January 8, 2016 Print Archive.
Fire Emissions Network Sept. 4, 2002 A white paper for the development of a NSF Digital Government Program proposal Stefan Falke Washington University.
NOS PUBLICATIONS PROJECT Background NOAA Administrative Order “NOAA Information Access and Dissemination,” issued September 994 Section 6: Responsibilities,
Housing with Care and Support. Workforce challenges and solutions.
The Linking ISSN (ISSN-L): Crossing the Bridge to the Future Crossing the Bridge to the Future The Linking ISSN (ISSN-L): Crossing the Bridge to the Future.
1 United States Agricultural Information Network (USAIN) Judith C. Russell Dean of University Libraries Gainesville, Florida April 25, 2016.
AN ARCHETYPE FOR INFORMATION ORGANIZATION AND CLASSIFICATION OCLC WorldCat.
Zero-Textbook-Cost Degree Grant Program (Z Degrees)
HathiTrust Digital Library Interface and Services
Creating Workflow Efficiencies in the E-book Ecosystem
Headline.
Maine Shared Collections Strategy: Print Archive Network Update
Metadata Editor Introduction
Cleaning up the catalog: getting your data in order
Standing Orders in Alma
Education of a scientist video
Marie Waltz Center for Research Libraries
Headline.
Creating a Registry of U.S. Federal Documents
Old Serials, Dirty Data: Print Retention as an Opportunity
PAN Forum June 27, 2014 ALA Annual Conference Las Vegas, NV
Metadata Guidelines for Disclosing Shared Print Commitments
Metadata to fit your needs... How much is too much?
When to Hold On and When to Let Go: A Distributed Retrospective Library Assessment Conference, December 6, 2018 Jean Blackburn, Collections Librarian,
Overview & Update on Recent Canadiana Activities
Digitization Standards: Issues & Updates
CRKN and Canadiana Update
From The Outside Looking In To The Inside Looking Out
ORCID: ADDING VALUE TO THE GLOBAL RESEARCH COMMUNITY
Sound Preservation: First Steps
The Digital Library for Earth System Education (DLESE):
WEST Program Assessments Past and Present
Mary Miller Director of Collection Management & Preservation
Preserving Access for the Future
Presentation transcript:

Creating a Registry of U.S. Federal Documents Valerie Glenn June 24, 2016 Print Archive Network Forum, ALA Annual Meeting Background Federal documents are challenging for general users to access due to complexities of publication history, cataloging, and format. Federal documents contain an enormous trove of information about US and international history, policy, economics, science, and law, but print documents are not digitally searchable or readable. HathiTrust members’ collections occupy hundreds of miles of shelf space in their stacks at a time when budgets are shrinking

HathiTrust’s Goal Expand and enhance access to U.S. federal publications through the creation of a comprehensive digital corpus of U.S. federal publications including those issued by GPO and other federal agencies.   In 2011, the membership of HathiTrust held a meeting organized to determine the future of the organization and there identified US Federal Documents as an area for strategic investment and cooperative activity.  Members endorsed a resolution to Expand and enhance access to U.S. federal publications through the creation of a comprehensive digital corpus of U.S. federal publications including those issued by GPO and other federal agencies. providing widespread comprehensive access to these materials fits squarely within the goals of the depository library program as well as HathiTrust’s stated goal to create and “sustain a public good while at the same time defining a set of services that benefits member institutions.” 24 June 2016

Opportunities for HathiTrust Reducing costs of collection management and access in libraries Safeguarding digital surrogates within a non-profit context to preserve integrity of these valued resources Realizing the potential for new forms of discovery, access and research Capitalizing on digitization at scale. Carpe diem! HathiTrust has 143 partners, primarily research libraries Including 16 regional depository libraries and 90 selective depository libs We are well positioned to aim for the goal of a comprehensive collection Opportunities for HathiTrust, at scale, stemming from this kind of collection include Reducing costs of collection management and access in libraries Libraries may make collection management decisions based on digital copies in HathiTrust. (And the HT Shared Print program is getting into action) Safeguarding digital surrogates within a non-profit context to preserve integrity of these valued resources Digital preservation in a trusted location. HathiTrust has been certified through CRL’s Trustworthy Repositories Audit & Certification process Realizing the potential for new forms of discovery, access and research Conversion to digital form enables new forms of discovery, access and research, for example full text search, services for print disabled users, and computational research Capitalizing on digitization at scale. Carpe diem! Continuing to leverage HathiTrust partner’s mass digitization projects with Google and the Internet Archive, as well as serving as a shared location for content digitized through other coordinated efforts and locally digitized content. “With programs to convert legacy collections to digital form comes the opportunity to develop a comprehensive, network-accessible digital library of United States Federal publications” 24 June 2016

Building a comprehensive collection: challenges Diversity of federal depository libraries’ cataloging policies and practices over time make it difficult to definitively identify and match specific documents Total number of documents in existence that were produced at federal expense is unknown.  Estimates range from as low as 1.5 million to as many as 3 million. Absence of an inventory also makes it difficult to coordinate a collective effort across a group of institutions In June 2016 HathiTrust held over 700,000 items identified as federal documents, but we know this to be only a fraction of what exists.   Federal depository libraries’ practices over time have been diverse Many federal depository libraries have not cataloged their collections comprehensively records exist at the bibliographic level rather than the volume level many US federal government publications are cataloged as serials but are regarded by users and librarians as monographs It is difficult to definitively identify and match specific documents it has been impossible to estimate the total number of documents in existence that were produced at federal expense.  Estimates range from as low as 1.5 million to as many as 3 million. This poses a major  challenge to creating a “comprehensive digital corpus of U.S. federal publications.” In addition, the absence of an inventory makes impossible tasks like correlating the documents currently in HathiTrust with the total corpus, and coordinating collective effort across a group of institutions. 24 June 2016

HathiTrust US Federal Documents Registry Database of 5.5 million records derived from 25 million source records Contains a “Registry record” for each unique document, with a unique identifier Regularly updated with records from HathiTrust repository and GPO To meet these challenges, A major component of HathiTrust’s program has been the development of the US Federal Documents Registry, envisioned as a reliable inventory of items published at the expense of the US government.  Valerie Glenn, the HathiTrust Government Documents Registry Analyst, has been leading the project since 2013. I wish to acknowledge the great work of Valerie and her team to bring the Registry into existence. This all happened before I arrived at HathiTrust and I am the lucky one who gets to report on it. But it’s Valerie’s accomplishment. She has been the leader of the Registry planning, development, and launch. The Registry is a database of 5.5 million records derived from 25 million source records The Registry is intended to include metadata for the comprehensive corpus of U.S. federal documents. This includes materials produced at U.S. government expense, in all formats, at the item level, from 1789 to the present. It may also include works such as grant-funded or contract work, declassified materials, individual pieces of legislation (bills), administrative publications, and/or numerical data sets. The majority of source records were received in response to a call for records in 2013. In the fall of 2013 HathiTrust held an open call for records for federal documents. The call was open to all - not just partners. more than 25 million records contributed by more than fifty institutions) forms the basis of the Registry, along with records from the HathiTrust The Registry contains a “Registry record” for each unique document, with a unique identifier The initial source records were clustered as duplicate, related, or solo, based on identifiers - OCLC number, LCCN, ISSN, ISBN, and SuDoc (Superintendent of Documents) number. Records in duplicate clusters were assigned confidence scores ranging from 0 to 1, based on the number of common elements (ie, identifiers, title). Records in clusters with confidence scores of .9 or greater were compiled into Registry records. The remaining source records were added as separate Registry records. (If needed: Records are ingested as MARC, typically in XML or JSON format.) New and updated HathiTrust repository records are now added on a daily basis, and the Government Publishing Office’s Catalog of Government Publications records are acquired weekly 24 June 2016

Screenshot of Registry interface Public interface Focus groups to explore use cases We welcome your input https://www.hathitrust.org/usdocs_registry 24 June 2016

Addressing metadata challenges Records lacking enough information to identify De-duplication and relationship-detection amongst records Clustering by identifiers Improving item description “enumeration and chronology” & developing specifications for parsing Experiments with string matching Earlier I mentioned that the biggest challenge to defining the universe of federal documents is that the diversity of Federal depository libraries’ cataloging policies and practices over time make it difficult to definitively identify and match specific documents In addition, we have the complexities of publication history, and format to contend with as well. 5 million of the source records originally contributed have been excluded from the Registry due to lack of enough information to identify them The Registry team has been digging into the rest of the records, “Crunching” the data Repeated passes to improve de-duplication and relationship detection are what has enabled us to condense 20 million source records into 5.5 million registry records As mentioned before, Source records are clustered based on key identifiers. Staff are also developing specifications for parsing item description, enumeration and chronology, for titles such as the Federal Register in an attempt to decrease the amount of duplication in the Registry. In some cases this has meant eliminating information in the enum/chron field that doesn’t describe the item, repeats data from another field, and is preventing the record from being clustered with other source records. An example is the use of the term “ONLINE” in the enum/chron field. Another way in which staff have reduced the amount of duplication is by clustering together records for the same item in different formats, based on information in the MARC 776 field. The Registry record includes all relevant information for the item (ie, title, publisher, OCLC number) and also lists each format (ie Print, Microform, Online). We have also experimented with string matching (Valerie do you have a good example? Is this worth mentioning?)   24 June 2016

Current focus Support HathiTrust US Federal Documents Program near term priorities: Support digitization to build a comprehensive collection* Characterize the HathiTrust digital collection Gap detection Matching Registry to HathiTrust digital collection Matching Registry to HathiTrust member holdings So we have come a long way towards indentifying the universe of documents, and while we continue that work, we have also turned our attention to some actionable pieces that will support the emerging HathiTrust program Support digitization to build a comprehensive collection *HathiTrust collection building is currently focused on material distributed in a print format by the GPO as part of the FDLP. Our current work does not yet address “born-digital” documents. And Characterize the HathiTrust digital collection –what is the current breadth, depth, distribution by partner contribution? And numerous other questions that will enable us to have a better picture to develop the collection We are also working on gap detection –identifying gaps between the registry and the Hathitrust digital collection The HathiTrust member libraries’ holdings For example, several methods have been tested in an attempt to identify items in the HathiTrust repository that are federal documents but not coded as such in the records. -checking records for the presence of “f” and “u” in the marc 008 field -checking for the presence of a SuDoc call number We are also looking at criteria within registry records such as title, agency or source library to identify records that are not in the HathiTrust repository This gives you some flavor of the kind of work that we have going on. 24 June 2016

Thank you! Through our Federal Documents program we’ll aim to keep going towards that comprehensive federal documents collection. The HathiTrust federal documents planning and advisory group has done great work to stake out what this work will entail, and we plan to reactivate this amazing collective expertise to advise as we forge ahead. I’m excited to see this work continue, and for HathiTrust and the library community to realize the value of the Registry in active collection development and management. Thank you!

Your Questions For more information: HathiTrust web site https://www.hathitrust.org/usgovdocs_registry Valerie Glenn: valglenn@umich.edu Government Documents Registry Analyst Heather Christenson: christeh@hathitrust.org Program Officer for Federal Documents & Collections 24 June 2016