From Data to Discovery: Building Automated Cataloguing Tools with Perl / Huw Jones, Cambridge University Library
Cambridge: small city, big University = lots of libraries!
Lots of libraries = lots of books
Bibliographic records University Library: 3.85 M Other libraries: 2.5 M 8 databases
Data problems Quality Duplication
Quality – fullness: of the 2.5 M records in our databases, 1 M are short records
Quality – coding
Duplication
Effects Difficulty in resource discovery Patchy retrieval Lack of authority control Difficulty with standard deduplication Burden on staff time Ties us to multiple database model
Aims Better records Fewer records
Existing Solutions? Manual recataloguing Commercial solutions Universal catalogue Discovery layer Either don’t solve the core problem, or expensive and/or time consuming
Our solution Automated Cataloguing Tools! Short record enrichment Automated MARC correction Deduplication Order is important – full, well-coded records are easier to deduplicate
General principles Retrieve some records from a Voyager database Examine and/or manipulate them If necessary, make changes in the database N.B. Watch indexes and table space!
General tools Perl – holds everything together Perl DBI – connects to databases SQL – retrieves records from database MARC::Record modules (from CPAN) – to examine/manipulate records Pbulkimport/Batchcat – to make changes to the database
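The general pattern – retrieve, examine, change – can be sketched in a few lines of Perl. The DSN, credentials and table layout below are assumptions to adapt locally (Voyager keeps each MARC record as ordered segments in an Oracle BIB_DATA table):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use MARC::Record;

# Assumed connection details -- adjust DSN/credentials for your Voyager Oracle instance
my $dbh = DBI->connect( 'dbi:Oracle:VGER', 'ro_user', 'secret',
    { RaiseError => 1 } );

# Voyager stores each MARC record as ordered segments in BIB_DATA
my $sth = $dbh->prepare(
    'SELECT record_segment FROM bib_data WHERE bib_id = ? ORDER BY seqnum');
my $bib_id = shift @ARGV;
$sth->execute($bib_id);

my $raw = '';
while ( my ($segment) = $sth->fetchrow_array ) {
    $raw .= $segment;    # concatenate segments into one ISO 2709 record
}

# Parse and examine with MARC::Record
my $record = MARC::Record->new_from_usmarc($raw);
printf "%s: %s / %s\n", $bib_id, $record->title, $record->author;

$dbh->disconnect;
```

Changes would then go back via Pbulkimport or Batchcat rather than raw SQL.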
Batchcat vs Pbulkimport Batchcat – installed on PC with Voyager More versatile Can’t be used on server Pbulkimport – limited functionality Needs Bibliographic Detection Profile and Bulk Import Rule (SYSADMIN) Can be used on server
Books Learning Perl / Randal L. Schwartz and Tom Phoenix. 3rd ed. (Sebastopol, Calif. : O’Reilly, 2001). ISBN: Programming the Perl DBI / Alligator Descartes and Tim Bunce. (Sebastopol, Calif. : O’Reilly, 2000). ISBN:
Enriching short records How to get from this …
to this
Basic mechanism Take short record Find a matching full record Overlay short record with full record Need a source of full records In Cambridge - University Library - large database of full, authority controlled records
1. Read file of SHORT RECORD bib ids
2. Connect to LOCAL database; check each is a valid bib id
3. Retrieve SHORT RECORD info from local database
4. Connect to EXTERNAL source; find best FULL RECORD match and score it
5. Compare match score to overlay threshold; if OK, retrieve MARC record for FULL RECORD
6. Correct FULL MARC record: remove inappropriate fields; insert fields to be retained from SHORT RECORD
7. In local database, overlay SHORT RECORD with FULL RECORD
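A skeleton of that loop in Perl; every named routine (get_local_record, match_full_record, clean_full_record, overlay) and the threshold value are hypothetical stand-ins for the real code:

```perl
# Hypothetical skeleton of the enrichment run; each sub is a
# placeholder for a stage of the real pipeline
my $THRESHOLD = 90;    # overlay threshold (illustrative value)

open my $ids, '<', 'short_record_bib_ids.txt' or die "Can't open id file: $!";
while ( my $bib_id = <$ids> ) {
    chomp $bib_id;

    # local database: is this a valid bib id, and what do we know about it?
    my $short = get_local_record( $dbh, $bib_id ) or next;

    # external source: best full-record match, with a confidence score
    my ( $full, $score ) = match_full_record($short);
    next unless defined $score && $score >= $THRESHOLD;

    # strip inappropriate fields from the full record and carry over
    # the fields to be retained from the short record
    clean_full_record( $full, $short );

    # overlay in the local database (via Pbulkimport or Batchcat)
    overlay( $dbh, $bib_id, $full );
}
close $ids;
```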
Output
Interface
Results Service has been running for 1 year (much of which was testing) 18 libraries subscribed to use service 90,000 short records upgraded
MARC checking and correction Bibliographic standard – agreed minimum standard for cataloguing Every week, libraries receive an automatically generated file of MARC coding errors for correction Based on MARC::Lint module with many alterations
Output
Mechanism Connects to database using Perl DBI Retrieves MARC records for records created/edited in the last week Runs them through the MARC check Prints errors to file Sends file to library Over 100,000 errors pointed out so far!
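The checking stage is essentially a loop through MARC::Lint (the real program uses a heavily altered version, and pulls last week's records via DBI; here 'weekly.mrc' is an assumed input file):

```perl
use strict;
use warnings;
use MARC::Lint;
use MARC::File::USMARC;

my $lint = MARC::Lint->new;

# 'weekly.mrc' stands in for the records created/edited in the last
# week, which the real program retrieves from the database via DBI
my $file = MARC::File::USMARC->in('weekly.mrc');

open my $out, '>', 'errors.txt' or die "Can't open output: $!";
while ( my $record = $file->next ) {
    $lint->check_record($record);
    my @warnings = $lint->warnings;
    next unless @warnings;
    print {$out} $record->title, "\n";
    print {$out} "  $_\n" for @warnings;
}
close $out;    # this file is then sent to the library
```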
MARC Correction How to get from this …
=LDR 00472nam\\ \a\4500
=
=
=008 s1985\\\\nyua\\\\\\\\\\001\0\eng\d
=020 \\$a
=100 1\$aBroecker, W.S.,$d1931-
=245 10$aHow to build a habitable planet ;$cBy Wallace S. Broecker.
=260 \\$aNew York ;$bEldigio Press,$cc1985
=300 \\$a291p $bill $c23cm
=504 \\$aIncludes index.
=650 \0$aAstronomy.
=650 \0$aAstrophysics.
to this!
=LDR 00453nam a 4500
=
=
=008 s1985\\\\nyua\\\\\\\\\\001\0\eng\d
=020 \\$a
=100 1\$aBroecker, W. S.,$d1931-
=245 10$aHow to build a habitable planet /$cby Wallace S. Broecker.
=260 \\$aNew York :$bEldigio Press,$cc1985.
=300 \\$a291 p. :$bill. ;$c23 cm.
=504 \\$aIncludes index.
=650 \0$aAstronomy.
=650 \0$aAstrophysics.
MARC Correction Version of module which, where there is no ambiguity, corrects errors Built into short record upgrade program Also offered as a retrospective service to clean up legacy records Possibility of building it into weekly check
Mechanism Connects to database using Perl DBI Retrieves full MARC record Runs against correction module Replaces corrected record in database
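One unambiguous correction of the kind the module makes – normalising pagination in 300 $a ('291p' to '291 p.') – might look like this; the regex and message are illustrative, not the module's actual code:

```perl
# Illustrative single rule: fix pagination in 300 $a where the
# correction is unambiguous ("291p" -> "291 p.")
if ( my $f300 = $record->field('300') ) {
    my $a = $f300->subfield('a');
    if ( defined $a && $a =~ s/(\d)\s*p\b\.?/$1 p./ ) {
        $f300->update( a => $a );
        print "300: UPDATE: Space inserted between digits and p in pagination\n";
    }
}
```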
Output
Bib id: How to build a habitable planet ; By Wallace S. Broecker.
100: UPDATE: Spaces inserted between initials in subfield _a
245: UPDATE: By uncapitalised at start of subfield c
245: UPDATE: Space forward slash inserted before subfield _c
260: UPDATE: Full stop inserted at end of field
260: UPDATE: Space colon inserted before subfield _b
300: UPDATE: Full stop inserted after the p in pagination
300: UPDATE: Full stop inserted at end of field
300: UPDATE: Illustration abbreviation has been corrected
300: UPDATE: Space colon inserted before subfield _b
300: UPDATE: Space inserted between digits and cm
300: UPDATE: Space inserted between digits and p in pagination
300: UPDATE: Space semi-colon inserted before subfield c
Results In testing 70,000 records processed Corrected over 200,000 MARC coding errors May run ALL our existing records through at some stage
Deduplication – in progress! Three stages: Identification of groups of duplicates Identification/construction of ‘best’ record Deletion of other records – relinking of holdings/items/Purchase Orders to ‘best record’
Identification of duplicates Connect to a database with Perl DBI Use SQL to retrieve records For each record, retrieve all available data from tables Use matching algorithm to identify groups of duplicates
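A matching algorithm can be as simple as grouping records on a normalised key. The key below (title + date + ISBN) is a deliberately crude illustration – the real algorithm draws on much more of the retrieved data:

```perl
# Illustrative match key only: the real algorithm uses far more data
sub match_key {
    my ($record) = @_;

    my $title = lc( $record->title || '' );
    $title =~ s/[^a-z0-9]//g;    # normalise punctuation and spacing

    my ($date) = ( $record->subfield( '260', 'c' ) || '' ) =~ /(\d{4})/;
    my $isbn = $record->subfield( '020', 'a' ) || '';
    $isbn =~ s/[^0-9Xx]//g;      # digits (and X) only

    return join '|', $title, ( $date || '' ), $isbn;
}

# records sharing a key form a candidate group of duplicates
my %groups;
push @{ $groups{ match_key($record) } }, $bib_id;
```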
And you’ll end up with something like this:
Identification of best record For each group of duplicates, MARC records are retrieved Passed to scoring algorithm Record with highest score forms basis of ‘best’ record Retains set fields (e.g. subject headings) from ‘other’ records Corrects any MARC coding errors
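A scoring algorithm of this kind rewards fullness of description; the fields checked and the weights below are invented for illustration:

```perl
# Illustrative scoring: weights are made up, not the production values
sub score_record {
    my ($record) = @_;
    my $score = 0;

    $score += 10 if $record->field('1..');             # main entry present
    my @subjects = $record->field('6..');              # subject headings
    $score += 5 * @subjects;
    $score += 5 if $record->subfield( '300', 'a' );    # physical description
    $score += 3 if $record->field('008');              # fixed-length data

    return $score;
}

# highest-scoring record in the group forms the basis of the 'best' record
my ($best) = sort { score_record($b) <=> score_record($a) } @group;
```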
But … No relinking functionality, even in BatchCat No viable workaround for libraries using Acquisitions, or that avoids losing circulation history
In conclusion … Tools for librarians, not replacements! Do the stuff programs do well, allowing humans to concentrate on what humans do well Won’t do all the work, just makes a solution to major data problems feasible
Questions?