1
Data Cleaning with OpenRefine:
An Introduction to Data Wrangling
David B. Lowe, Data Librarian, Evans Library, Florida Tech
2
To Download This Workshop’s Data Sets:
western-us
Download “ResourceAssessmentSummaryData xlsx” (note the Data Dictionary there, too)
Download “authors-people.csv” (represents the British Library’s comic book collection)
3
Agenda
Data Science Ecosystem
Goals
Challenges
Data Wrangling
Excel: typical task with clean data
OpenRefine: compare functionality with the above task, then demo with dirty data
GREL video OR attendee data
4
Discussion: Diving in…
What’s wrong with this picture?
5
The Ecosystem
Coming to a DSL near you: Intro to Tableau 11/15
From Biewald, Lukas, “The data science ecosystem part 2: Data wrangling,” in Computerworld, April 1, 2015, accessed September 8, 2016 from article/ /the-data-science-ecosystem-part-2-data-wrangling.html.
6
Existential Question: What’s the Goal of Data Science?
Unfamiliar data
Familiar data
Finding patterns*
Identifying trends (pattern vs. expectation)
Alertness is the hidden discipline of familiarity. --David Whyte
*per Black Swan’s Dick Fear: blog/what-exactly-is-the-purpose-of-data-science-part-1/
11
Infographic excerpt from a 2015 survey of data scientists by CrowdFlower:
12
Common Data Problems
Data type mismatches
Duplicate values
Misspelled values
Concatenated fields (to be split)
Split fields (to be concatenated)
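The problems listed above can be made concrete with a small Python sketch. The records, column names, and the `is_int` helper are all invented for illustration and do not come from the workshop data sets; this is one possible way to surface each problem, not the method used in the demo.

```python
from collections import Counter

# Hypothetical records; the column names and values are invented for illustration.
rows = [
    {"name": "Doe, Jane", "state": "FL", "count": "12"},
    {"name": "Doe, Jane", "state": "FL", "count": "12"},        # duplicate row
    {"name": "Roe, Richard", "state": "Flordia", "count": "n/a"},  # misspelling, non-numeric count
]

def is_int(text):
    """Return True when the text parses as an integer (hypothetical helper)."""
    try:
        int(text)
        return True
    except ValueError:
        return False

# Data type mismatches: values in a numeric column that are not numbers.
type_mismatches = [r["count"] for r in rows if not is_int(r["count"])]

# Duplicate values: identical rows that occur more than once.
row_counts = Counter(tuple(sorted(r.items())) for r in rows)
duplicates = [dict(key) for key, n in row_counts.items() if n > 1]

# Misspelled values: a value count per column makes rare variants stand out.
state_counts = Counter(r["state"] for r in rows)

# Concatenated field (to be split): "Last, First" into two fields.
last, first = [part.strip() for part in rows[0]["name"].split(",", 1)]

# Split fields (to be concatenated): rejoin them in a different order.
full_name = f"{first} {last}"
```

Counting the values in a column, as `state_counts` does, is roughly what a text facet shows in OpenRefine: the rare spelling stands out next to the common one.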
13
Common Data Manipulation Needs
String searching and replacing
Limiting by date (not to mention date and other data type conversion)
Pivot tables
Taking logs of numeric values
Filtering to reveal only certain parameter ranges from a field’s set
Excluding certain parameter ranges from a field’s set
Applying matching heuristics of increasing fuzziness to inconsistencies
Macros for frequent, complex transformations
Rolling back/Undoing/Ctrl-z
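Several of these needs can be sketched in a few lines of standard-library Python. The records, field names, date cutoff, and numeric range below are assumptions chosen for illustration only.

```python
import math
import re
from datetime import date, datetime

# Hypothetical records, invented for illustration.
records = [
    {"title": "Comic  Annual", "published": "2014-03-01", "copies": "100"},
    {"title": "comic weekly", "published": "2016-07-15", "copies": "1000"},
    {"title": "Story Paper", "published": "2015-11-30", "copies": "10"},
]

for r in records:
    # String searching and replacing: collapse repeated whitespace.
    r["title"] = re.sub(r"\s+", " ", r["title"]).strip()
    # Data type conversion: text to date and text to integer.
    r["published"] = datetime.strptime(r["published"], "%Y-%m-%d").date()
    r["copies"] = int(r["copies"])
    # Taking logs of numeric values.
    r["log10_copies"] = math.log10(r["copies"])

# Limiting by date: keep records published in 2015 or later.
recent = [r for r in records if r["published"] >= date(2015, 1, 1)]

# Filtering to reveal a parameter range, and excluding that same range.
in_range = [r for r in records if 50 <= r["copies"] <= 500]
out_of_range = [r for r in records if not 50 <= r["copies"] <= 500]
```

A script like this shows only its input and output; seeing every intermediate state, and being able to roll back, is where a tool such as OpenRefine differs.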
14
A Familiar Excel Task Demo with Clean Data
Get data set from: western-us
In Excel:
Sorting columns
Pivot table
Pie chart
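For comparison with the Excel steps above, here is a minimal Python sketch of the same three operations. The rows and the `state`, `resource`, and `amount` columns are hypothetical; the real workshop spreadsheet has its own columns, described in its Data Dictionary.

```python
from collections import defaultdict

# Hypothetical rows; the real workshop spreadsheet has different columns.
rows = [
    {"state": "Utah", "resource": "Oil", "amount": 30.0},
    {"state": "Colorado", "resource": "Gas", "amount": 50.0},
    {"state": "Utah", "resource": "Gas", "amount": 20.0},
]

# Sorting columns: order the rows by state, then by amount (largest first).
sorted_rows = sorted(rows, key=lambda r: (r["state"], -r["amount"]))

# Pivot table: sum the amount for each state.
pivot = defaultdict(float)
for r in rows:
    pivot[r["state"]] += r["amount"]

# Pie chart: each state's share of the grand total (the numbers behind the slices).
total = sum(pivot.values())
shares = {state: amount / total for state, amount in pivot.items()}
```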
15
Details on OpenRefine <openrefine.org>
Formerly Freebase Gridworks, then Google Refine (Google’s support ended in 2012)
Came to the library world’s attention in the Google Books scanning workflow (due to bibliographic metadata processing)
Browser interface, but not a cloud operation: the Java app runs locally, so no cloud security concerns
16
Refine Caveats
OpenRefine’s ceiling maxes out at ~100,000 rows × 10 columns
“Not designed for huge data sets,” per original creator David Huynh
Clunky interface (controls mostly on the left, but one on the right for a step)
Limited column width control (so use short names)
Known bugs, like the reconcile service error dialog box (opportunity there: currently a $250 bounty to fix that; see: reconciliation-was-freebase )
17
Alternatives

OpenRefine vs. Spreadsheets
OpenRefine: Columns and rows are the primary units of interaction. Editing happens one column at a time, across many rows matching some criteria.
Spreadsheets: Cells are the units of interaction. Editing happens one cell at a time.
OpenRefine: Used primarily for exploring and transforming existing data.
Spreadsheets: Used primarily for entering data and performing calculations.

OpenRefine vs. Scripting
OpenRefine: You see the data visually at each intermediate step as you experiment with editing it and transforming it.
Scripting: You only see the original input data and the final output data.
OpenRefine: Used for understanding and then transforming data.
Scripting: Used primarily for transforming data.

OpenRefine vs. Databases
OpenRefine: No schemas required, just like in spreadsheets.
Databases: Serious overhead efforts must be invested in designing schemas.
OpenRefine: Data is always visible, like in spreadsheets.
Databases: Data is mostly out of sight unless programming is done to expose views.

From:
18
Language/Protocol Relationships for Refine
Data manipulation:
  GREL (Google Refine Expression Language): for data operations such as concatenating fields
  Python/Jython
  Clojure (Lisp for Java environment)
Macro functionality:
  In JSON (JavaScript Object Notation): shareable format of the history of commands issued per project; can be copied, forked for reuse, rolled back
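To show what a shareable JSON history can look like, the sketch below parses a small operation list in the style OpenRefine exports. The `core/text-transform` operation and the GREL expressions `value.trim()` and `value.toTitlecase()` are real, but the column name is invented and the entries are trimmed to a few keys; actual exports carry additional settings, so treat this as an approximation of the format.

```python
import json

# An operation history in the style OpenRefine exports.
# The column name is invented; real exports carry more keys per step.
history_json = """
[
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column name using expression value.trim()",
    "columnName": "name",
    "expression": "value.trim()"
  },
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column name using expression value.toTitlecase()",
    "columnName": "name",
    "expression": "value.toTitlecase()"
  }
]
"""

operations = json.loads(history_json)

# List the steps, as one might before reusing the history on another project.
summary = [(step["op"], step["expression"]) for step in operations]

# "Forking" the macro: keep only the first step and re-serialize it for sharing.
forked_json = json.dumps(operations[:1], indent=2)
```

Because the history is plain JSON, it can be kept under version control or pasted into another project with the same column names.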
19
Installing Refine
Verify that you have a Java Runtime Environment (JRE) installed
Download OpenRefine from openrefine.org OR (I’m recommending RC1)
Unzip
Run the .exe (or if that doesn’t work, the .bat)
[At end, to quit the Command screen, press Ctrl-c]
20
Using OpenRefine
In OpenRefine’s browser window, select:
“Create Project”
“Choose” by navigating to your .xls(x), .csv, .tsv, or other file
Then “Next”
If the data appears to have loaded properly, hit the “Create Project” button (on the RIGHT side)
Triangles over each column indicate a successful load
21
Refine Demo with Clean & Dirty Data
Clean (same data as with Excel earlier; apples-to-apples): western-us
Note sorting, pivoting, limiting, editing, nesting ease vs. Excel
Dirty:
Faceting, clustering, assisted editing, undoing, managing JSON macros
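The clustering step in the dirty-data demo can be approximated in Python. This is a simplified sketch in the spirit of a "fingerprint" key-collision method (normalize each value to a key, then group values that share a key); it is not OpenRefine's own implementation, and the sample values are invented.

```python
import string
import unicodedata
from collections import defaultdict

def fingerprint(value):
    """Build a key that ignores case, punctuation, accents, and word order."""
    text = value.strip().lower()
    # Drop accents by decomposing characters and discarding the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove punctuation, then sort the unique whitespace-separated tokens.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))

# Hypothetical dirty values, invented for illustration.
values = ["Doe, Jane", "Jane Doe", "jane doe ", "Jáne Doe", "Jane Do"]

# Facet-style grouping: values that share a fingerprint form one cluster.
clusters = defaultdict(list)
for v in values:
    clusters[fingerprint(v)].append(v)

# Only groups with more than one member are merge candidates.
candidates = {key: group for key, group in clusters.items() if len(group) > 1}
```

Here the first four spellings collapse to the key `jane doe`, while `Jane Do` stays separate; catching that last variant would take a fuzzier heuristic, such as an edit-distance comparison.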
22
[If time permits]: Refine with Attendee Data
Refine with Attendee Data?
Or Refine’s GREL: Quick tour of Regex macros (8:32 mins)
23
Thanks!
Questions…
For me?
For you: What was the muddiest point?
Please share your feedback: