1
The Life-Changing Magic of OpenRefine
The Open-Source Art of Data Decluttering and Organizing
Maristella Feustle
MOUG - Portland, Oregon
January 30, 2018
2
Today’s data sets
3
Why OpenRefine?
Establishing meaningful relationships within and between datasets requires that all of the data are being read as intended.
Use OpenRefine to:
Clean up data - standardize, correct, rearrange
Automate tedious editing
Match with outside controlled vocabularies
Prep and export data for use in programs like MarcEdit, Tableau, Gephi, Raw, CartoDB, Google Fusion Tables, and more
4
Before and after
5
A brief history
Freebase Gridworks, then Google Refine, then OpenRefine
GREL - General Refine Expression Language
Many versions to choose from. Even “release candidate” versions are useful.
You can do a lot before ever getting to GREL.
6
Creating a project
Data can be as simple as an Excel spreadsheet inventory, or even a text file.
Data “lives” on your computer - security, preservation pros and cons.
MarcEdit can export TSV (tab-separated values) or JSON files to OpenRefine.
Manipulating very large files can cause memory issues.
8
Then what? How you proceed depends on:
The final form you want your data to have
How much intervention your data requires to get there
9
*Disclaimer
Maristella mainly works in archival descriptive standards. She has not been evaluated by the FDA (FRBR’s Dedicated Aficionados) and is not intended to diagnose, treat, cure, or prevent any RDA-related maladies. Consult your cataloging professional for advice.
10
Project 1: Tabular data
Columns are the basis of organization in OpenRefine.
If your data is already in rows and columns (Excel, CSV, TSV, etc.), there is little translation to do in importing a project: “What you see is what you get.”
Check encoding for special characters.
Specify a first row as header, if applicable.
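The two import checks above (encoding and header row) can be illustrated outside OpenRefine. The following is a minimal Python sketch, not OpenRefine's own importer; the sample data, the `read_table` helper, and the placeholder column names are all hypothetical.

```python
import csv
import io

# Hypothetical sample standing in for a small tab-separated inventory file.
SAMPLE_TSV = "Title\tComposer\nPréludes\tDebussy, Claude\nGymnopédies\tSatie, Erik\n"


def read_table(raw_bytes, encoding="utf-8", delimiter="\t", has_header=True):
    """Decode bytes with an explicit encoding and split them into rows."""
    text = raw_bytes.decode(encoding)
    rows = list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))
    if has_header:
        # Treat the first row as column names.
        header, body = rows[0], rows[1:]
    else:
        # No header row: invent placeholder names (hypothetical naming).
        header = [f"Column {i + 1}" for i in range(len(rows[0]))]
        body = rows
    return header, body


raw = SAMPLE_TSV.encode("utf-8")

# Correct encoding: special characters survive.
header, body = read_table(raw)

# Wrong encoding: the same bytes decode without error, but accents are garbled.
wrong_header, wrong_body = read_table(raw, encoding="latin-1")

print(header)
print(body[0][0], "vs.", wrong_body[0][0])
```

The wrong-encoding case raises no error, which is why it is worth checking special characters in the import preview before creating the project.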
11
Basic transformations
Moving things around
Editing data
Demonstration: Column order, sorting, facets, clustering.
Computers will not gloss over differences that human interpretation recognizes as “the same.” OpenRefine provides a way to ensure like items are counted together by finding similarities and offering an easy way to merge/standardize them.
The good news is, the computer will do exactly what you tell it to do.
The bad news is, the computer will do exactly what you tell it to do.
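To show how clustering can find values that a person would read as “the same,” here is a minimal Python sketch loosely modeled on a key-collision (“fingerprint”) approach. It is an illustration under assumptions, not OpenRefine's actual implementation; the `fingerprint` and `cluster` helpers and the sample names are hypothetical.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a key that ignores case, punctuation, accents, and word order."""
    text = value.strip().lower()
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Sort the unique tokens so that word order does not matter.
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values that share a fingerprint; keep only groups with variants."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


names = [
    "Debussy, Claude",
    "Claude Debussy",
    "debussy, claude ",
    "Satie, Erik",
    "Dvořák, Antonín",
    "Dvorak, Antonin",
]

clusters = cluster(names)
for group in clusters:
    print(group)
```

Each printed group is a candidate for merging into one standard form. A person still has to confirm the merge, because the computer will group exactly what the key tells it to group.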
12
Lather, rinse, repeat
You can extract the command history to reuse on other datasets.
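As a rough illustration of reusing an extracted history, the Python sketch below assumes the history is a JSON list of operation objects and repoints it at a differently named column before it is applied to another dataset. The operation names and keys shown (`op`, `columnName`, `oldColumnName`, `newColumnName`, `expression`) are assumptions for illustration and should be checked against a history extracted from your own project; the `retarget` helper is hypothetical.

```python
import json

# Assumed shape of an extracted history: a JSON list of operation objects.
# The keys and operation names here are illustrative assumptions only.
EXTRACTED = """[
  {"op": "core/text-transform", "columnName": "Composer", "expression": "value.trim()"},
  {"op": "core/column-rename", "oldColumnName": "Composer", "newColumnName": "Creator"}
]"""


def retarget(ops, old, new):
    """Return a copy of ops with references to column `old` pointed at `new`."""
    result = []
    for op in ops:
        updated = dict(op)  # shallow copy so the original list is untouched
        for key in ("columnName", "oldColumnName"):
            if updated.get(key) == old:
                updated[key] = new
        result.append(updated)
    return result


original_ops = json.loads(EXTRACTED)
reused_ops = retarget(original_ops, "Composer", "Author")

# The adjusted JSON could then be pasted into another project.
print(json.dumps(reused_ops, indent=2))
```

Editing the JSON this way is only needed when the second dataset uses different column names; otherwise the extracted history can be reused unchanged.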
13
Fun with GREL
Many alternatives exist to writing code from scratch for common transformations:
14
Project 2: MarcEdit Exports
Source:
Use MarcEdit / MarcBreaker to parse raw MARC.
MarcEdit exports to OpenRefine as JSON and TSV.
Use the Facet function to isolate MARC fields.
15
Project 3: Bibframe 1.0
Source:
Thanks to Karen Coyle. Finding any Bibframe example was a tall order.
16
Export TSV and other formats.
17
Other resources:
Terry Reese: MarcEdit and OpenRefine:
OpenRefine.org, book and videos:
Free Your Metadata:
OpenRefine Reconciliation services:
More reconcilable data sources:
Overdue Ideas: “A Worked Example of Fixing Problem MARC Data”:
18
Thank you! If you can read this, you don’t need glasses.