From Data to Discovery: Building Automated Cataloguing Tools with Perl / Huw Jones, Cambridge University Library
Cambridge: small city, big University = lots of libraries!
Lots of libraries = lots of books
Bibliographic records University Library: 3.85 M Other libraries: 2.5 M 8 databases
Data problems Quality Duplication
Quality – fullness: of the 2.5 M records in our databases, 1 M are short records
Quality – coding
Duplication
Effects Difficulty in resource discovery Patchy retrieval Lack of authority control Difficulty with standard deduplication Burden on staff time Ties us to multiple database model
Aims Better records Fewer records
Existing Solutions? Manual recataloguing Commercial solutions Universal catalogue Discovery layer Either don’t solve the core problem, or expensive and/or time consuming
Our solution Automated Cataloguing Tools! Short record enrichment Automated MARC correction Deduplication Order is important – full, well-coded records are easier to deduplicate
General principles Retrieve some records from a Voyager database Examine and/or manipulate them If necessary, make changes in the database N.B. Watch indexes and table space!
General tools Perl – holds everything together Perl DBI – connects to databases SQL – retrieves records from database MARC::Record modules (from CPAN) – to examine/manipulate records Pbulkimport/Batchcat – to make changes to the database
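The general pattern – retrieve, examine, change – can be sketched in a few lines of Perl. The DSN, credentials and table layout below are assumptions to adapt locally (Voyager keeps each MARC record as ordered segments in an Oracle BIB_DATA table):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use MARC::Record;

# Assumed connection details -- adjust DSN/credentials for your Voyager Oracle instance
my $dbh = DBI->connect( 'dbi:Oracle:VGER', 'ro_user', 'secret',
    { RaiseError => 1 } );

# Voyager stores each MARC record as ordered segments in BIB_DATA
my $sth = $dbh->prepare(
    'SELECT record_segment FROM bib_data WHERE bib_id = ? ORDER BY seqnum');
my $bib_id = shift @ARGV;
$sth->execute($bib_id);

my $raw = '';
while ( my ($segment) = $sth->fetchrow_array ) {
    $raw .= $segment;    # concatenate segments into one ISO 2709 record
}

# Parse and examine with MARC::Record
my $record = MARC::Record->new_from_usmarc($raw);
printf "%s: %s / %s\n", $bib_id, $record->title, $record->author;

$dbh->disconnect;
```

Changes would then go back via Pbulkimport or Batchcat rather than raw SQL.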
Batchcat vs Pbulkimport Batchcat – installed on PC with Voyager More versatile Can’t be used on server Pbulkimport – limited functionality Needs Bibliographic Detection Profile and Bulk Import Rule (SYSADMIN) Can be used on server
Books Learning Perl / Randal L. Schwartz and Tom Phoenix. 3rd ed. (Sebastopol, Calif. : O’Reilly, 2001). ISBN: Programming the Perl DBI / Alligator Descartes and Tim Bunce. (Sebastopol, Calif. : O’Reilly, 2000). ISBN:
Enriching short records How to get from this …
to this
Basic mechanism Take short record Find a matching full record Overlay short record with full record Need a source of full records In Cambridge - University Library - large database of full, authority controlled records
1. Read file of SHORT RECORD bib ids
2. Connect to LOCAL database; check each is a valid bib id
3. Retrieve SHORT RECORD info from local database
4. Connect to EXTERNAL source; find best FULL RECORD match and score it
5. Compare match score to overlay threshold; if OK, retrieve MARC record for FULL RECORD
6. Correct FULL MARC record: remove inappropriate fields; insert fields to be retained from SHORT RECORD
7. In local database, overlay SHORT RECORD with FULL RECORD
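A skeleton of that loop in Perl; every named routine (get_local_record, match_full_record, clean_full_record, overlay) and the threshold value are hypothetical stand-ins for the real code:

```perl
# Hypothetical skeleton of the enrichment run; each sub is a
# placeholder for a stage of the real pipeline
my $THRESHOLD = 90;    # overlay threshold (illustrative value)

open my $ids, '<', 'short_record_bib_ids.txt' or die "Can't open id file: $!";
while ( my $bib_id = <$ids> ) {
    chomp $bib_id;

    # local database: is this a valid bib id, and what do we know about it?
    my $short = get_local_record( $dbh, $bib_id ) or next;

    # external source: best full-record match, with a confidence score
    my ( $full, $score ) = match_full_record($short);
    next unless defined $score && $score >= $THRESHOLD;

    # strip inappropriate fields from the full record and carry over
    # the fields to be retained from the short record
    clean_full_record( $full, $short );

    # overlay in the local database (via Pbulkimport or Batchcat)
    overlay( $dbh, $bib_id, $full );
}
close $ids;
```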
Output
Interface
Results Service has been running for 1 year (much of which was testing) 18 libraries subscribed to use service 90,000 short records upgraded
MARC checking and correction Bibliographic standard – agreed minimum standard for cataloguing Every week, libraries receive an automatically generated file of MARC coding errors for correction Based on MARC::Lint module with many alterations
Output
Mechanism Connects to database using Perl DBI Retrieves MARC records for records created/edited in the last week Runs them through the MARC check Prints errors to file Sends file to library Over 100,000 errors pointed out so far!
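The checking stage is essentially a loop through MARC::Lint (the real program uses a heavily altered version, and pulls last week's records via DBI; here 'weekly.mrc' is an assumed input file):

```perl
use strict;
use warnings;
use MARC::Lint;
use MARC::File::USMARC;

my $lint = MARC::Lint->new;

# 'weekly.mrc' stands in for the records created/edited in the last
# week, which the real program retrieves from the database via DBI
my $file = MARC::File::USMARC->in('weekly.mrc');

open my $out, '>', 'errors.txt' or die "Can't open output: $!";
while ( my $record = $file->next ) {
    $lint->check_record($record);
    my @warnings = $lint->warnings;
    next unless @warnings;
    print {$out} $record->title, "\n";
    print {$out} "  $_\n" for @warnings;
}
close $out;    # this file is then sent to the library
```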
MARC Correction How to get from this …
=LDR 00472nam\\ \a\4500
=
=
=008 s1985\\\\nyua\\\\\\\\\\001\0\eng\d
=020 \\$a
=100 1\$aBroecker, W.S.,$d1931-
=245 10$aHow to build a habitable planet ;$cBy Wallace S. Broecker.
=260 \\$aNew York ;$bEldigio Press,$cc1985
=300 \\$a291p $bill $c23cm
=504 \\$aIncludes index.
=650 \0$aAstronomy.
=650 \0$aAstrophysics.
to this!
=LDR 00453nam a 4500
=
=
=008 s1985\\\\nyua\\\\\\\\\\001\0\eng\d
=020 \\$a
=100 1\$aBroecker, W. S.,$d1931-
=245 10$aHow to build a habitable planet /$cby Wallace S. Broecker.
=260 \\$aNew York :$bEldigio Press,$cc1985.
=300 \\$a291 p. :$bill. ;$c23 cm.
=504 \\$aIncludes index.
=650 \0$aAstronomy.
=650 \0$aAstrophysics.
MARC Correction Version of module which, where there is no ambiguity, corrects errors Built into short record upgrade program Also offered as a retrospective service to clean up legacy records Possibility of building it into weekly check
Mechanism Connects to database using Perl DBI Retrieves full MARC record Runs against correction module Replaces corrected record in database
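One unambiguous correction of the kind the module makes – normalising pagination in 300 $a ('291p' to '291 p.') – might look like this; the regex and message are illustrative, not the module's actual code:

```perl
# Illustrative single rule: fix pagination in 300 $a where the
# correction is unambiguous ("291p" -> "291 p.")
if ( my $f300 = $record->field('300') ) {
    my $a = $f300->subfield('a');
    if ( defined $a && $a =~ s/(\d)\s*p\b\.?/$1 p./ ) {
        $f300->update( a => $a );
        print "300: UPDATE: Space inserted between digits and p in pagination\n";
    }
}
```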
Output
Bib id: How to build a habitable planet ; By Wallace S. Broecker.
100: UPDATE: Spaces inserted between initials in subfield _a
245: UPDATE: By uncapitalised at start of subfield c
245: UPDATE: Space forward slash inserted before subfield _c
260: UPDATE: Full stop inserted at end of field
260: UPDATE: Space colon inserted before subfield _b
300: UPDATE: Full stop inserted after the p in pagination
300: UPDATE: Full stop inserted at end of field
300: UPDATE: Illustration abbreviation has been corrected
300: UPDATE: Space colon inserted before subfield _b
300: UPDATE: Space inserted between digits and cm
300: UPDATE: Space inserted between digits and p in pagination
300: UPDATE: Space semi-colon inserted before subfield c
Results In testing 70,000 records processed Corrected over 200,000 MARC coding errors May run ALL our existing records through at some stage
Deduplication – in progress! Three stages: Identification of groups of duplicates Identification/construction of ‘best’ record Deletion of other records – relinking of holdings/items/Purchase Orders to ‘best record’
Identification of duplicates Connect to a database with Perl DBI Use SQL to retrieve records For each record, retrieve all available data from tables Use matching algorithm to identify groups of duplicates
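A matching algorithm can be as simple as grouping records on a normalised key. The key below (title + date + ISBN) is a deliberately crude illustration – the real algorithm draws on much more of the retrieved data:

```perl
# Illustrative match key only: the real algorithm uses far more data
sub match_key {
    my ($record) = @_;

    my $title = lc( $record->title || '' );
    $title =~ s/[^a-z0-9]//g;    # normalise punctuation and spacing

    my ($date) = ( $record->subfield( '260', 'c' ) || '' ) =~ /(\d{4})/;
    my $isbn = $record->subfield( '020', 'a' ) || '';
    $isbn =~ s/[^0-9Xx]//g;      # digits (and X) only

    return join '|', $title, ( $date || '' ), $isbn;
}

# records sharing a key form a candidate group of duplicates
my %groups;
push @{ $groups{ match_key($record) } }, $bib_id;
```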
And you’ll end up with something like this:
Identification of best record For each group of duplicates, MARC records are retrieved Passed to scoring algorithm Record with highest score forms basis of ‘best’ record Retains set fields (e.g. subject headings) from ‘other’ records Corrects any MARC coding errors
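A scoring algorithm of this kind rewards fullness of description; the fields checked and the weights below are invented for illustration:

```perl
# Illustrative scoring: weights are made up, not the production values
sub score_record {
    my ($record) = @_;
    my $score = 0;

    $score += 10 if $record->field('1..');             # main entry present
    my @subjects = $record->field('6..');              # subject headings
    $score += 5 * @subjects;
    $score += 5 if $record->subfield( '300', 'a' );    # physical description
    $score += 3 if $record->field('008');              # fixed-length data

    return $score;
}

# highest-scoring record in the group forms the basis of the 'best' record
my ($best) = sort { score_record($b) <=> score_record($a) } @group;
```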
But … No relinking functionality, even in BatchCat No viable workaround for libraries using Acquisitions, or that avoids losing circulation history
In conclusion … Tools for librarians, not replacements! Do the stuff programs do well, allowing humans to concentrate on what humans do well Won’t do all the work, just makes a solution to major data problems feasible
Questions?