USING OPENREFINE FOR DATA-DRIVEN DECISION-MAKING

Similar presentations
The essentials managers need to know about Excel
Library Staff Training
1 An Introduction to IBM SPSS PSY450 Experimental Psychology Dr. Dwight Hennessy.
Database Design IST 7-10 Presented by Miss Egan and Miss Richards.
Access 2007 ® Use Databases How can Access help you to find and use information?
DAY 14: ACCESS CHAPTER 1 Tazin Afrin October 03,
Mail merge letters are used to send the same or similar documents to many different people. Since they contain the recipient’s name, address, and other.
Microsoft ® Office Excel ® 2003 Training Sorting and Filtering Data MorningStar Education presents:
OCLC Online Computer Library Center Kathy Kie December 2007 OCLC Cataloging & Metadata Services an introduction.
*** CONFIDENTIAL *** © Toshiba Corporation 2008 Confidential Creating Report Templates.
Just as there are many human languages, there are many computer programming languages that can be used to develop software. Some are named after people,
Introduction to Databases Trisha Cummings. What is a database? A database is a tool for collecting and organizing information. Databases can store information.
Milstats IUG 2008 Milstats 102: Beyond the Basics with Milstats Innovative Users Group 2008 Annual Conference Washington, D.C. Corey Seeman Kresge.
Inventory: Taking stock of your collection Inventory: Taking stock of your collection Judy Greenwood Interlibrary Loan Librarian University of Mississippi.
XP. Objectives Sort data and filter data Summarize an Excel table Insert subtotals into a range of data Outline buttons to show or hide details Create.
Google Refine for Data Quality / Integrity. Context BioVeL Data Refinement Workflow Synonym Expansion / Occurrence Retrieval Data Selection Data Quality.
Overview Excel is a spreadsheet, a grid made from columns and rows. It is a software program that can make number manipulation easy and somewhat painless.
Assessing current print periodical usage for collection development Gracemary Smulewitz Distributed Technical Services Rutgers University Libraries.
Inventorying and Shelf Reading the Collection with Voyager Presenters: Doug Frazier, University Librarian & Ann Fuller, Head of Circulation & ILL Armstrong.
An electronic document that stores various types of data.
Using OpenRefine in Digital Collections: the Spencer Sheet Music Project Bruce J. Evans Cataloging & Metadata Unit Leader/Music and Fine Arts Catalog Librarian.
Inventory Projects An opportunity for catalog enhancement Sarah Hess Cohen Florida State University Music OCLC Users Group March 1, 2016.
Collection Evaluation and Management: Decision Center Launch & Reporter Update Amanda Schukle Product Manager.
Microsoft Excel 2007 Noris Bt. Ismail Faculty of Information and Communication Technology Tel : (Ext 8408) BCOMP0101.
Software Development Languages and Environments. Computer Languages Just as there are many human languages, there are many computer programming languages.
Creative Create Lists Elizabeth B. Thomsen Member Services Manager
Microsoft Access 2007 Introduction to Database. What is a Database? Database--A Collection of Tables Usually Associated With a General Topic Database.
Applied Software Project Management SOFTWARE TESTING Applied Software Project Management 1.
SOFTWARE TESTING TRAINING TOOLS SUPPORT FOR SOFTWARE TESTING Chapter 6 immaculateres 1.
Advanced spreadsheet tips & tricks
Advanced Excel Helen Mills OME-RESA.
Microsoft Office Access 2010 Lab 1
Automating Accounts Payable
AP CSP: Cleaning Data & Creating Summary Tables
The FAST Report Scheduler
Agenda Learn about the new TJ Ministry Application & Registration System Mixture of PowerPoint Slides, Demo’s and Hands-on Goals: Learn Concepts Play.
Accessing the Catalog. An Introduction to Discovery: The New Catalog at the Dominican Theological Library.
Collection Management (CM) in small to medium-sized academic libraries
GO! with Microsoft Office 2016
7 ways to clean up the catalog
Computer Fundamentals
Tutorial 5: Working with Excel Tables, PivotTables, and PivotCharts
Industrial Assessment Center Database
GO! with Microsoft Access 2016
Validation Sample Study for a Shared Print Collection: Is It There?
Fearless Transformation: Applying OpenRefine to Digital Collections
Overview and Introduction to Alma Analytics
Gary R. Cocozzoli Lawrence Technological University
Tools and Techniques to Clean Up your Database
Titles subject area and Physical-Electronic overlap analysis
Tweaking an existing workflow
In Search of Useful Collection Metadata
ALEPH Version 22 Beginning Cataloging
Making ESL Rubrics Part 1
Introduction to Smart Search
Lecture 12: Data Wrangling
Tutorial 3 – Querying a Database
Reporting Based on Data in Archivists’ Toolkit
ArchivesSpace Migration
Sirena Hardy HRMS Trainer
Teacher Training Module Three Teacher Tools: Tools & Analysis
Is It Shelf Reading or Inventory?
Using GreenGlass to Support Collection Management
The Life-Changing Magic of OpenRefine
Excel 2003, Volume 2 by Karen J. Jolly
Action research: Meredith College’s carlyle campbell library
Introduction to Excel 2007 Part 3: Bar Graphs and Histograms
Lesson 13 Working with Tables
Presentation transcript:

USING OPENREFINE FOR DATA-DRIVEN DECISION-MAKING
A case study from the University of Michigan Library
Matt Carruthers, Metadata Projects Librarian, University of Michigan

OpenRefine (openrefine.org)
OpenRefine is a great open source tool that is widely used for cleaning messy data.

OpenRefine (openrefine.org)
That's even the tagline for the software. It's often employed in libraries to clean messy bibliographic data.

(He’s lying.) But our bibliographic data at the University of Michigan is flawless (LOL), so we aren't interested in cleaning the data. For one particular project, at least, we weren't interested in cleaning data; instead, we needed to efficiently sort and filter through large sets of bibliographic data in order to make collection management decisions. So over the next ten minutes or so, I'm going to take you through a project we did recently at the library to assess the collections of six library buildings. I'll show you how the project started, how it changed once we implemented OpenRefine, and the impact that change had on our outcome.

Project Summary
- Identify resources in our collection for which we own more than one copy.
- Assess whether or not they are low-use items.
- Withdraw duplicate copies from the collection or send them to remote storage.
At UM Library, we decided to undertake a collection management project to identify resources for which we had more than one copy in our collections, and to assess whether those copies were little used or outdated, based on criteria from our collection managers. Once such items were identified, one or more of the duplicate copies would be withdrawn from the collection or sent to remote storage to free up space on our shelves for new collection growth.

Project Summary
DataMart: the Library's reporting tool for extracting data from our ILS.
- Reports are generated in spreadsheet format.
- Spreadsheets contain a mix of bibliographic information and usage statistics.
Our source of data is DataMart. We select call number ranges, and the resulting spreadsheets run anywhere from 100,000 to 500,000 rows, containing bibliographic information as well as circulation statistics.

Project Summary
Selection criteria: find items which:
- Have more than one copy in the Library.
- Are not part of a series.
- Are at least five years old.
- Have circulated fewer than six times.
- Are not "in process" (not currently on loan, in conservation for repair, labelled as missing, etc.; i.e., they must be on the shelf).
- Are not attached to a provisional cataloging record.
- Are part of the circulating collection (i.e., nothing from special collections or reference resources).
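As a rough sketch (not the actual command set from the project), a few of the numeric checks above could be combined in a single custom facet written in GREL; the column names "Copies", "Year", and "Circ Count" here are hypothetical stand-ins for whatever the DataMart export actually calls those fields:

    and(
      cells["Copies"].value.toNumber() > 1,
      and(
        2016 - cells["Year"].value.toNumber() >= 5,
        cells["Circ Count"].value.toNumber() < 6
      )
    )

Faceting on an expression like this and selecting true isolates the rows that are candidates for review; checks such as the series, "in process", and provisional-record conditions would be handled with separate facets on their own columns.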

In the Beginning
We started out using Excel for processing, relying on sorting, filters, and conditional formatting to identify and isolate duplicates that met the criteria. As a general metric, it took at least two hours of staff time to process 100,000 rows, because we had to manually rerun each command on each spreadsheet. After doing this for a while and seeing how time-consuming, error-prone, and generally not fun this process actually was, we started looking for a more efficient way to process the spreadsheets. So we turned to OpenRefine to essentially fully automate the processing. Using a combination of the facet functions and various data transformation commands, we cut that processing time down to about 15 seconds per 100,000 rows. So how did we do that? I won't take you through the process in OpenRefine step by step, because there are over 20 commands from start to finish. That may sound a little daunting, but it actually took less than an hour to set up the process, and OpenRefine actually does a lot of the heavy lifting for you.

So I'll take you through some of the highlights of the process. First, we start with a typical spreadsheet of about 100,000 rows. We use a combination of the facet feature (which allows you to quickly identify and edit only the rows you are interested in) and custom transformation commands written in the Google Refine Expression Language.

For example, we can filter out anything that is part of a series by using the customized facet command "Facet by blank" on the Description column, since any value in that column indicates something like a volume number. Any row with a value (i.e., not blank) in the Description column can therefore be deleted.
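Under the hood, "Facet by blank" amounts to a one-line GREL expression, so the same filter can also be written by hand as a custom facet on the Description column:

    isBlank(value)

Selecting false in the resulting facet (rows where Description has a value) and then using All → Edit rows → Remove all matching rows deletes the series items in one step.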

Additionally, we can use a custom transformation written in the Google Refine Expression Language (or GREL) to calculate how old a resource is.
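A minimal GREL sketch of such a transformation, assuming a hypothetical column holding a four-digit publication year and hard-coding the year the project ran:

    2016 - value.toNumber()

Applied via Edit cells → Transform…, this turns the year column into an age, which a numeric facet can then restrict to values of five or more.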

You can save command sets to run automatically on additional spreadsheets.
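Saving and reusing command sets happens in OpenRefine's Undo/Redo tab: Extract… serializes the applied operations as JSON, and Apply… replays that JSON on a new project. As an illustrative fragment (based on the hypothetical age transformation above, not the project's actual script), an extracted history looks roughly like this:

    [
      {
        "op": "core/text-transform",
        "description": "Text transform on cells in column Year",
        "engineConfig": { "mode": "row-based", "facets": [] },
        "columnName": "Year",
        "expression": "grel:2016 - value.toNumber()",
        "onError": "keep-original",
        "repeat": false,
        "repeatCount": 10
      }
    ]

Because operations reference columns by name rather than position, the same JSON replays cleanly on every spreadsheet with the same layout, which is what makes batch processing at 15 seconds per 100,000 rows possible.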

Outcomes
- Cut processing time by approximately 99.8%.
- Cut weekly staff time on the project from 34 hours to 7 hours.
- Processed over 2 million items.
- Withdrew 9,000 low-use duplicates from our collections.
What used to take 2 hours per 100,000 items now takes 15 seconds using OpenRefine, cutting processing time by approximately 99.8%. We had been spending 34 hours of staff time a week on this project; now we can cut that down to 7 hours per week and still be much more efficient, freeing up staff to do other things. This allows us to identify little-used duplicate items faster than we have ever been able to, and to make more informed collection management decisions much more quickly. So far we have processed over 2 million items and withdrawn 9,000 duplicate items from our collection. The process requires no programming expertise or special skills, so anyone can run it.

Questions?
OpenRefine command scripts on GitHub: https://github.com/mcarruthers/Operations_dedup
mcarruth@umich.edu
@mattadata2
The project is ongoing, and we are beginning to implement OpenRefine for other projects in similar ways. You can check out my GitHub repo for the command script used in this project; I also have a few other repositories of OpenRefine command scripts for various other projects.