Using OpenRefine in Digital Collections: The Spencer Sheet Music Project
Bruce J. Evans, Cataloging & Metadata Unit Leader/Music and Fine Arts Catalog Librarian, Baylor University
Kara Long, Metadata & Catalog Librarian, Baylor University
Frances G. Spencer Collection of American Sheet Music
Cataloging & Metadata Overview
-Card Catalog
-MARC Record
-Dublin Core Metadata & digital object
OpenRefine
Interactive Data Transformation (IDT) tool
-Interactive like a spreadsheet – but more powerful
-Programmable like a database – but more exploratory
-Open source
-Runs locally in your browser
But what can it do?
-Import and export data
-Facet data
-Transform data
-Reconcile data to outside data sets
Importing the data and creating a new project…
-MARC fields renamed and re-ordered
-Join fields where data is separated
-Separate fields where data is joined
-Re-format dates
-Remove unnecessary punctuation
-Add fields that are required in the digital collection
Renaming Columns
Columns are the primary units of interaction. The drop-down menu of functions at the column level allows us to rename, reorder, or transform columns. Column names must exactly match our CONTENTdm (CDM) field names in order to upload the metadata.
Example: MARC 100 → Composer
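The renaming step can be pictured outside OpenRefine as a simple lookup table. The following is a minimal Python sketch, not the project's actual workflow: only the 100 → Composer pair comes from the slide, while the other mappings, the column headers, and the sample row are invented for illustration.

```python
# Hypothetical mapping from MARC-derived column headers to CONTENTdm field names.
# Only "100" -> "Composer" is taken from the slide; the rest are illustrative.
RENAME_MAP = {
    "100": "Composer",
    "245": "Title",
    "260$c": "Date",
}


def rename_columns(row, rename_map):
    """Return a copy of row with keys renamed; unmapped columns keep their name."""
    return {rename_map.get(name, name): value for name, value in row.items()}


# Invented sample row, standing in for one imported MARC record.
row = {"100": "Example, Composer", "245": "Example song", "500": "Example note"}
renamed = rename_columns(row, RENAME_MAP)
print(renamed)
```

Because the upload requires exact matches, a table like this makes it easy to compare each target name character by character against the collection's field list.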
Re-ordering Columns
Columns must also exactly match the order in which the corresponding fields appear in our CONTENTdm collection. Once all the fields have been renamed, they can be re-ordered under the All columns menu.
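A short Python sketch of the same ordering constraint, under the assumption that the target field order is known in advance. The field list here is hypothetical; the real order comes from the CONTENTdm collection itself.

```python
# Hypothetical target order; the real list comes from the CONTENTdm collection.
CDM_FIELD_ORDER = ["Title", "Composer", "Date"]


def reorder_columns(row, field_order):
    """Return row with its columns in field_order, failing loudly on any mismatch."""
    missing = [name for name in field_order if name not in row]
    extra = [name for name in row if name not in field_order]
    if missing or extra:
        raise ValueError(f"column mismatch: missing={missing}, extra={extra}")
    return {name: row[name] for name in field_order}


# Invented sample row with the columns in the wrong order.
ordered = reorder_columns(
    {"Composer": "Example, Composer", "Date": "1905", "Title": "Example song"},
    CDM_FIELD_ORDER,
)
print(list(ordered))  # ['Title', 'Composer', 'Date']
```

Raising an error on a missing or unexpected column is a design choice for this sketch: it surfaces a naming mistake before upload rather than after.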
Joining Values
Join the 245$a and 245$b to create a Title field. Transform data with the Google Refine Expression Language (GREL). The expression preview shows the expected value before the transformation is applied.
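In GREL, a join of this kind is typically written along the lines of `cells["245$a"].value + " " + cells["245$b"].value`, assuming the two columns carry those names. The Python sketch below shows the same logic with one addition that is an assumption rather than part of the slide: an empty or missing subfield is skipped, so no stray space is left behind. The sample values are invented.

```python
def join_title(subfield_a, subfield_b):
    """Join 245$a and 245$b with a single space, skipping empty or missing parts."""
    parts = [part.strip() for part in (subfield_a, subfield_b) if part and part.strip()]
    return " ".join(parts)


print(join_title("Example song :", "a ballad /"))  # Example song : a ballad /
print(join_title("Example song", None))            # Example song
```

The MARC punctuation at the ends of the subfields is left in place here; removing it is a separate clean-up step.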
Adding a New Column
Add a column based on an existing column. Values from the 260$c populate the new Date Search field, with unnecessary data removed.
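The slide does not say exactly which data counts as unnecessary. As a minimal sketch, the Python below assumes that Date Search should hold a bare four-digit year and that 260$c values look like typical MARC imprint dates; both the rule and the sample values are assumptions.

```python
import re

# Four consecutive digits that are not part of a longer run of digits.
YEAR_PATTERN = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def date_search_value(raw):
    """Return the first four-digit year found in a 260$c value, or '' if none."""
    match = YEAR_PATTERN.search(raw or "")
    return match.group(1) if match else ""


for raw in ["c1905.", "[1898]", "[189-?]"]:
    print(repr(raw), "->", repr(date_search_value(raw)))
```

Uncertain dates such as `[189-?]` yield an empty value in this sketch, which leaves them easy to find with a facet and review by hand.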
Splitting Values
The 246 must be split into two or three separate fields:
-Alternative Title
-First line of verse
-First line of chorus
Splitting Values
The value in the First Line of Verse field always begins with the same phrase, “First line of text.” To create a new column with this portion only, split the value on the semicolon, then filter the resulting values by the leading phrase. The same method will also isolate the First Line of Chorus values. Using “not” as a Boolean operator will isolate the Alternative Title values.
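The split-then-filter approach can be sketched in Python as follows. The verse prefix comes from the slide; the chorus prefix and the sample 246 value are assumptions made for illustration.

```python
VERSE_PREFIX = "First line of text"      # leading phrase named on the slide
CHORUS_PREFIX = "First line of chorus"   # assumed leading phrase for chorus values


def split_246(value):
    """Split a semicolon-joined 246 value and sort the parts by leading phrase."""
    parts = [part.strip() for part in value.split(";") if part.strip()]
    return {
        "Alternative Title": [
            part for part in parts
            if not part.startswith((VERSE_PREFIX, CHORUS_PREFIX))  # the "not" filter
        ],
        "First Line of Verse": [part for part in parts if part.startswith(VERSE_PREFIX)],
        "First Line of Chorus": [part for part in parts if part.startswith(CHORUS_PREFIX)],
    }


# Invented sample value, standing in for one joined 246 cell.
sample = "My old home; First line of text: Far away; First line of chorus: Take me back"
fields = split_246(sample)
for name, values in fields.items():
    print(name, "->", values)
```

Each target field receives a list, since a record may carry no alternative title or more than one.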
Isn’t it all a little tedious?
Extract the Operation History to automate your data transformation and clean-up. Extract and save! Apply it to new data sets that need the same kind of clean-up.
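The idea of saving a sequence of steps and replaying it on a new batch can be illustrated with a small Python analogy. The step format below is made up for this sketch and is not OpenRefine's operation-history JSON; the step names, arguments, and sample rows are all hypothetical.

```python
import json


def rename(row, old, new):
    """Rename one column, leaving the others untouched."""
    return {(new if key == old else key): value for key, value in row.items()}


def strip_trailing(row, column, chars):
    """Remove trailing punctuation characters from one column."""
    cleaned = dict(row)
    cleaned[column] = cleaned[column].rstrip(chars)
    return cleaned


# Registry of step names that a saved recipe is allowed to reference.
STEP_FUNCTIONS = {"rename": rename, "strip_trailing": strip_trailing}

# A saved "recipe" in a hypothetical format (not OpenRefine's own JSON).
RECIPE_JSON = """
[
  {"step": "rename", "args": {"old": "100", "new": "Composer"}},
  {"step": "strip_trailing", "args": {"column": "Composer", "chars": " ,."}}
]
"""


def apply_recipe(rows, recipe_json):
    """Replay every saved step, in order, on each row of a new batch."""
    steps = json.loads(recipe_json)
    result = []
    for row in rows:
        for step in steps:
            row = STEP_FUNCTIONS[step["step"]](row, **step["args"])
        result.append(row)
    return result


new_batch = [{"100": "Example, Composer,"}, {"100": "Sample, Writer."}]
cleaned_batch = apply_recipe(new_batch, RECIPE_JSON)
print(cleaned_batch)
```

As with an extracted history, the recipe only works on a new data set whose columns have the names the steps expect.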
Invaluable resources
Verborgh, Ruben, and Max De Wilde. Using OpenRefine. Birmingham: Packt Publishing.
Van Hooland, Seth, and Ruben Verborgh. Linked Data for Libraries, Archives, and Museums: How to Clean, Link, and Publish Your Metadata. Chicago: Neal-Schuman, 2014.