Published by Milo McKenzie. Modified over 9 years ago.
1
From Data to Discovery: Building Automated Cataloguing Tools with Perl
Huw Jones, Cambridge University Library
3
Cambridge
Small city, big University = lots of libraries!
8
Lots of libraries = lots of books
9
Bibliographic records
- University Library: 3.85 M
- Other libraries: 2.5 M
- 8 databases
10
Data problems
- Quality
- Duplication
11
Quality – fullness
Of the 2.5 M records in our databases, 1 M are short records
12
Quality – coding
13
Duplication
14
Effects
- Difficulty in resource discovery
- Patchy retrieval
- Lack of authority control
- Difficulty with standard deduplication
- Burden on staff time
- Ties us to multiple database model
15
Aims
- Better records
- Fewer records
16
Existing solutions?
- Manual recataloguing
- Commercial solutions
- Universal catalogue
- Discovery layer
These either don’t solve the core problem, or are expensive and/or time consuming
17
Our solution
Automated Cataloguing Tools!
- Short record enrichment
- Automated MARC correction
- Deduplication
Order is important: full, well coded records are easier to deduplicate
18
General principles
- Retrieve some records from a Voyager database
- Examine and/or manipulate them
- If necessary, make changes in the database
N.B. Watch indexes and table space!
19
General tools
- Perl: holds everything together
- Perl DBI: connects to databases
- SQL: retrieves records from database
- MARC::Record modules (from CPAN): to examine/manipulate records
- Pbulkimport/Batchcat: to make changes to the database
20
Batchcat vs Pbulkimport
Batchcat
- Installed on PC with Voyager
- More versatile
- Can’t be used on server
Pbulkimport
- Limited functionality
- Needs Bibliographic Detection Profile and Bulk Import Rule (SYSADMIN)
- Can be used on server
21
Books
- Learning Perl / Randal L. Schwartz and Tom Phoenix. 3rd ed. (Sebastopol, Calif. : O’Reilly, 2001). ISBN: 0596001320
- Programming the Perl DBI / Alligator Descartes and Tim Bunce. (Sebastopol, Calif. : O’Reilly, 2000). ISBN: 1565926994
22
Enriching short records How to get from this …
23
to this
24
Basic mechanism
- Take short record
- Find a matching full record
- Overlay short record with full record
Need a source of full records. In Cambridge: the University Library, a large database of full, authority controlled records
25
1. Starts from a file of SHORT RECORD bib ids
2. Connects to LOCAL database and checks that each is a valid bib id
3. Retrieves SHORT RECORD info from local database
4. Connects to EXTERNAL source, finds best FULL RECORD match and scores it
5. Compares match score to overlay threshold. If OK, retrieves MARC record for FULL RECORD
6. Corrects FULL MARC record. Removes inappropriate fields. Inserts fields to be retained from SHORT RECORD
7. In local database, overlays SHORT RECORD with FULL RECORD
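The threshold check and field merge in this flow can be sketched as follows. This is a minimal illustration in Python rather than the Perl used in the talk, and it does not touch a Voyager database. The threshold value, the retained tags, and the names `Record` and `build_overlay` are all assumptions made for the example; the slides do not give them.

```python
from dataclasses import dataclass, field

OVERLAY_THRESHOLD = 80          # hypothetical cut-off; the talk gives no value
RETAINED_TAGS = {"852", "949"}  # hypothetical local fields kept from the short record


@dataclass
class Record:
    bib_id: str
    fields: list = field(default_factory=list)  # (tag, value) pairs


def build_overlay(short, full, match_score):
    """Return the record to write over `short`, or None if the match is too weak."""
    if match_score < OVERLAY_THRESHOLD:
        return None
    # Start from the full record, dropping the tags that will be taken
    # from the short record so they are not duplicated.
    merged = [(t, v) for t, v in full.fields if t not in RETAINED_TAGS]
    merged += [(t, v) for t, v in short.fields if t in RETAINED_TAGS]
    merged.sort(key=lambda tv: tv[0])  # keep fields in tag order
    # The overlay keeps the short record's bib id so holdings stay attached.
    return Record(bib_id=short.bib_id, fields=merged)
```

Returning `None` for a weak match leaves the short record untouched, which mirrors the "If OK" branch in the flow above.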
26
Output
27
Interface
28
Results
- Service has been running for 1 year (much of which was testing)
- 18 libraries subscribed to use service
- 90,000 short records upgraded
29
MARC checking and correction
- Bibliographic standard: agreed minimum standard for cataloguing
- Every week, libraries receive an automatically generated file of MARC coding errors for correction
- Based on MARC::Lint module with many alterations
30
Output
31
Mechanism
- Connects to database using Perl DBI
- Retrieves MARC records created/edited in the last week
- Runs them through MARC check
- Prints errors to file
- Emails file to library
Over 100,000 errors pointed out so far!
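The weekly check above can be sketched as a filter plus a set of field checks. The example is in Python rather than Perl and uses plain dictionaries in place of database rows and MARC::Record objects. The two checks shown are simplified stand-ins chosen for illustration, not the rules of MARC::Lint or of the Cambridge module, and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def check_record(fields):
    """Return a list of error strings for one record's (tag, value) pairs."""
    errors = []
    tags = [tag for tag, _ in fields]
    if tags.count("245") != 1:
        errors.append("245: exactly one title field is required")
    for tag, value in fields:
        # Simplified rule: statement of responsibility follows " /".
        if tag == "245" and "$c" in value and " /$c" not in value:
            errors.append("245: subfield c must be preceded by space forward slash")
        # Simplified rule: real practice allows other closing punctuation.
        if tag == "260" and not value.endswith("."):
            errors.append("260: field must end with a full stop")
    return errors


def weekly_report(records, now):
    """Build report lines for records edited in the seven days before `now`."""
    cutoff = now - timedelta(days=7)
    lines = []
    for rec in records:
        if rec["edited"] < cutoff:
            continue  # not touched this week, so not reported
        for err in check_record(rec["fields"]):
            lines.append(f"Bib id {rec['bib_id']}: {err}")
    return lines
```

In a real run the returned lines would be written to a file and emailed to the owning library, as the slide describes.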
32
MARC Correction
How to get from this …
=LDR 00472nam\\2200157\a\4500
=001 662002
=005 20071205064734.0
=008 071129s1985\\\\nyua\\\\\\\\\\001\0\eng\d
=020 \\$a9780961751111
=100 1\$aBroecker, W.S.,$d1931-
=245 10$aHow to build a habitable planet ;$cBy Wallace S. Broecker.
=260 \\$aNew York ;$bEldigio Press,$cc1985
=300 \\$a291p $bill $c23cm
=504 \\$aIncludes index.
=650 \0$aAstronomy.
=650 \0$aAstrophysics.
33
to this!
=LDR 00453nam 2200157 a 4500
=001 662002
=005 20071205064734.0
=008 071129s1985\\\\nyua\\\\\\\\\\001\0\eng\d
=020 \\$a9780961751111
=100 1\$aBroecker, W. S.,$d1931-
=245 10$aHow to build a habitable planet /$cby Wallace S. Broecker.
=260 \\$aNew York :$bEldigio Press,$cc1985.
=300 \\$a291 p. :$bill. ;$c23 cm.
=504 \\$aIncludes index.
=650 \0$aAstronomy.
=650 \0$aAstrophysics.
34
MARC Correction
- A version of the checking module which corrects errors where there is no ambiguity
- Built into short record upgrade program
- Also offered as a retrospective service to clean up legacy records
- Possibility of building it into weekly check
35
Mechanism
- Connects to database using Perl DBI
- Retrieves full MARC record
- Runs against correction module
- Replaces corrected record in database
36
Output
Bib id: 662002
How to build a habitable planet ; By Wallace S. Broecker.
100: UPDATE: Spaces inserted between initials in subfield _a
245: UPDATE: By uncapitalised at start of subfield c
245: UPDATE: Space forward slash inserted before subfield _c
260: UPDATE: Full stop inserted at end of field
260: UPDATE: Space colon inserted before subfield _b
300: UPDATE: Full stop inserted after the p in pagination
300: UPDATE: Full stop inserted at end of field
300: UPDATE: Illustration abbreviation has been corrected
300: UPDATE: Space colon inserted before subfield _b
300: UPDATE: Space inserted between digits and cm
300: UPDATE: Space inserted between digits and p in pagination
300: UPDATE: Space semi-colon inserted before subfield c
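Unambiguous corrections of this kind lend themselves to a table of pattern rules. The sketch below is in Python rather than Perl and works on the field strings shown for record 662002 instead of MARC::Record objects. The regular expressions are assumptions written to reproduce that one example, and the message texts are borrowed from the output above; this is not the actual correction module.

```python
import re

# (tag, pattern, replacement, message): an illustrative subset of rules.
RULES = [
    ("100", r"([A-Z]\.)(?=[A-Z]\.)", r"\1 ", "Spaces inserted between initials in subfield _a"),
    ("245", r"\s*[;:,/]?\s*\$c", r" /$c", "Space forward slash inserted before subfield _c"),
    ("245", r"\$cBy\b", r"$cby", "By uncapitalised at start of subfield c"),
    ("260", r"\s*[;:,]?\s*\$b", r" :$b", "Space colon inserted before subfield _b"),
    ("260", r"(?<!\.)\Z", ".", "Full stop inserted at end of field"),
    ("300", r"(\d)p\b", r"\1 p", "Space inserted between digits and p in pagination"),
    ("300", r"(\d p)\b(?!\.)", r"\1.", "Full stop inserted after the p in pagination"),
    ("300", r"\s*[;:,]?\s*\$b", r" :$b", "Space colon inserted before subfield _b"),
    ("300", r"(\$bill)\b(?!\.)", r"\1.", "Illustration abbreviation has been corrected"),
    ("300", r"\s*[;:,]?\s*\$c", r" ;$c", "Space semi-colon inserted before subfield c"),
    ("300", r"(\d)cm\b", r"\1 cm", "Space inserted between digits and cm"),
    ("300", r"(?<!\.)\Z", ".", "Full stop inserted at end of field"),
]


def correct_field(tag, value):
    """Apply every rule for `tag` in order; return (new_value, messages)."""
    messages = []
    for rule_tag, pattern, replacement, message in RULES:
        if rule_tag != tag:
            continue
        new_value = re.sub(pattern, replacement, value)
        if new_value != value:  # report only rules that changed something
            messages.append(f"{tag}: UPDATE: {message}")
            value = new_value
    return value, messages
```

For example, `correct_field("300", r"\\$a291p $bill $c23cm")` yields the corrected 300 field from the slide together with seven update messages. Because a message is logged only when a rule changes the value, running the function on an already corrected field returns it unchanged with no messages.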
37
Results
- In testing, 70,000 records processed
- Corrected over 200,000 MARC coding errors
- May run ALL our existing records through at some stage
38
Deduplication – in progress!
Three stages:
- Identification of groups of duplicates
- Identification/construction of ‘best’ record
- Deletion of other records, relinking of holdings/items/Purchase Orders to ‘best’ record
39
Identification of duplicates
- Connect to a database with Perl DBI
- Use SQL to retrieve records
- For each record, retrieve all available data from tables
- Use matching algorithm to identify groups of duplicates
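The slides do not describe the matching algorithm, so the following is only one simple possibility: group records by a normalised match key. It is written in Python rather than Perl, and the choice of key (ISBN if present, otherwise normalised title plus year) and the names `match_key` and `group_duplicates` are assumptions for illustration.

```python
import re
from collections import defaultdict


def match_key(record):
    """Build a crude match key: ISBN if present, else normalised title plus year."""
    isbn = re.sub(r"[^0-9Xx]", "", record.get("isbn", "")).upper()
    if isbn:
        return ("isbn", isbn)
    # Lower-case, strip punctuation, and collapse runs of whitespace.
    title = re.sub(r"[^a-z0-9 ]", "", record.get("title", "").lower())
    title = " ".join(title.split())
    return ("title", title, record.get("year", ""))


def group_duplicates(records):
    """Return lists of bib ids that share a match key (groups of two or more)."""
    groups = defaultdict(list)
    for rec in records:
        groups[match_key(rec)].append(rec["bib_id"])
    return [ids for ids in groups.values() if len(ids) > 1]
```

An exact-key approach like this misses near matches such as variant titles or records lacking an ISBN on one side, which is one reason fuller, better coded records are easier to deduplicate.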
40
And you’ll end up with something like this:
41
Identification of best record
- For each group of duplicates, MARC records are retrieved
- Passed to scoring algorithm
- Record with highest score forms basis of ‘best’ record
- Retains set fields (e.g. subject headings) from ‘other’ records
- Corrects any MARC coding errors
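A minimal sketch of this stage, in Python rather than Perl: score each record in a group, take the highest scorer as the base, and copy across the retained fields from the others. The scoring weights are invented for the example since the slides do not describe the scoring algorithm, and `score` and `build_best` are hypothetical names. Only subject headings (650) are retained here, following the example on the slide.

```python
RETAINED_TAGS = {"650"}  # subject headings, per the slide; the real set is not given


def score(record):
    """Hypothetical fullness score: one point per field, bonus for key tags."""
    tags = [tag for tag, _ in record["fields"]]
    bonus = sum(5 for tag in ("100", "260", "300") if tag in tags)
    return len(tags) + bonus


def build_best(group):
    """Pick the highest-scoring record and add retained fields from the rest."""
    best = max(group, key=score)
    fields = list(best["fields"])
    for other in group:
        if other is best:
            continue
        for tag, value in other["fields"]:
            # Copy a retained field only if the best record lacks it.
            if tag in RETAINED_TAGS and (tag, value) not in fields:
                fields.append((tag, value))
    fields.sort(key=lambda tv: tv[0])  # stable sort keeps fields in tag order
    return {"bib_id": best["bib_id"], "fields": fields}
```

The merged record would then go through MARC correction before replacing the base record in the database.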
46
But …
- No relinking functionality, even in BatchCat
- No viable workaround for libraries using Acquisitions, or without losing circulation history
47
In conclusion …
- Tools for librarians, not replacements!
- Do the stuff programs do well, allowing humans to concentrate on what humans do well
- Won’t do all the work, but make a solution to major data problems feasible
48
Questions?