Data Cleaning using OpenRefine

Presentation transcript:

Data Cleaning using OpenRefine
C. Tobin Magle, PhD
Feb. 13, 2018 10:00-11:30 a.m.
Morgan Library Computer Classroom 175
Hi and welcome to Data and Donuts. I’m Tobin Magle, the Data Management specialist at CSU’s Morgan Library. Today’s session is about how to clean up “messy” data using a tool called OpenRefine.
*Inspired by content from Data Carpentry

Data are messy
· Inconsistent labels: misspellings, white space
· Values out of range
· Combined variables
We need tools like OpenRefine because data are messy, especially if they are entered by hand. They can include misspellings, extra spaces, values that don’t make sense, and variables combined into one column.

What is data cleaning?
· Identifying and correcting errors
· Making format consistent
· Keeping track of what you did
Luckily, we have data cleaning software that can help with identifying and correcting errors, making formats consistent, and leaving a paper trail of what you did to the data.

OpenRefine
· Doesn’t modify the original
· Tracks changes
· Batch cleanup
· Easily reversible
As I mentioned before, the tool we will be talking about today is OpenRefine. This software has some very useful features.
· First, it doesn’t alter the raw data, which is good for data integrity.
· It also tracks all of the steps you took and can apply these steps to other data sets.
· It allows you to format entire columns, instead of having to change each cell by hand.
· Finally, if you make a mistake, the changes are easily reversible.

Survey data
· Rows: observations of individual animals
· Columns: variables that describe the animals (species, sex, date, location, etc.)
Messy data
· Misspellings
· White space
· Combined variables
But before we get into using OpenRefine, let’s look at the data we’ll be using in this session. The rodents survey file contains data collected about animals in a field study.
· Each row is an observation of an individual animal.
· Each column contains information about these animals, such as
  o the species and sex of the animal
  o the date and location of the observation
· However, these data are messy:
  o the data contain misspellings, especially in the species name column
  o there are extra spaces in the text fields
  o some columns contain multiple variables

Create a project
File: https://tinyurl.com/jwtqy4w
Let’s get these data into OpenRefine! To get started, you’ll need to create a project from a spreadsheet. There are a couple of ways to do this.
Demo 1:
· Start the OpenRefine application. The interface will open in your web browser.
· Either download the file (https://tinyurl.com/jwtqy4w) and select “This computer” to load it,
· OR select “Web Addresses (URLs)” and paste the link (https://tinyurl.com/jwtqy4w) in the box.
· A preview loads (NOT A SAVED PROJECT YET).
· Select the file format under “Parse data as” (in this case, .csv).
· Rename the project.
· Click “Create project” when the data look how you want them.
· The number of rows is listed above the table.

Faceting
· Select column > Facet
· Select Text facet
· Look at the possible values of the column on the left
· Edit the facets
One of the most powerful features of OpenRefine is faceting. Faceting is a great way to check for errors in your data. Creating a text facet generates a list of all the unique values that have been entered into a column, which makes inconsistencies such as spelling errors easy to spot.
Demo 2: Let’s try faceting the scientificName column.
· Click the blue triangle next to the column name.
· Mouse over “Facet” and select “Text facet”.
· The list will appear to the left of your data.
· You can edit the values here, and they will be changed in the data.
· Look at Ammospermophilis harrisi. There are 3 very similar facets, spelled slightly differently.
· You can fix a misspelled facet by mousing over it, clicking “edit”, and correcting the spelling in the pop-up window.
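For readers who want to see the idea behind a text facet outside the OpenRefine interface, here is a minimal Python sketch. It is not how OpenRefine is implemented; it only counts how often each distinct value occurs, which is the information the facet panel displays. The sample values are invented for illustration and are not taken from the workshop file.

```python
from collections import Counter

# Hypothetical sample of a scientificName column (invented values).
scientific_names = [
    "Ammospermophilus harrisi",
    "Ammospermophilis harrisi",
    "Ammospermophilus harrisi",
    "Dipodomys merriami",
    " Dipodomys merriami",
]


def text_facet(values):
    """Return (value, count) pairs for every distinct value, most common first."""
    return Counter(values).most_common()


facet = text_facet(scientific_names)
for value, count in facet:
    # repr() makes stray leading or trailing spaces visible.
    print(f"{count:>3}  {value!r}")
```

Values that differ only by a typo or a stray space show up as separate entries, which is exactly what makes them easy to notice in a facet list.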

Exercise 1: Using faceting, find out how many years are represented in the census. Which years have the most and least observations?
How many years are represented in the census?
· Facet the year column
· Inspect the list
· See that there are 26 choices
Which years have the most and least observations?
· Sort the facets by count (click on “count”)
· Look at the first and last entries

Clustering
· “finding groups of different values that might be alternative representations of the same thing”
· a.k.a. spell check
· Key collision, metaphone3
OpenRefine also has clustering algorithms that help you find groups of values that might represent the same thing. It’s a more efficient way of doing what we did in the facet example above. I like to think of this as “spell check” rather than editing by hand.
Demo 3:
· Facet the scientificName column.
· Click “Cluster” on the upper right of the facet window.
· A new window will appear with options for clustering methods and keying functions.
· In practice, you can play around with them to see what makes the best clusters for your case.
· For this dataset, the best is the “key collision” method and the “metaphone3” keying function.
· The clusters will appear in the window. If you want to accept a cluster, click the checkbox next to it and make sure the spelling on the right is accurate.
· Click “Merge selected & re-cluster” and repeat until no more relevant clusters are found.
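To make “key collision” more concrete, here is a small Python sketch of the general idea: compute a normalized key for each value and group the values whose keys collide. Note the swap: this sketch uses a simple fingerprint-style key (trim, lowercase, strip punctuation, sort the unique words) instead of the metaphone3 phonetic key used in the demo, so it catches differences in case, spacing and punctuation but not misspellings. The function names and sample values are hypothetical.

```python
import re
from collections import defaultdict


def fingerprint(value):
    """Build a normalized key: trimmed, lowercased, punctuation-free, words sorted."""
    cleaned = re.sub(r"[^\w\s]", "", value.strip().lower())
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def key_collision_clusters(values):
    """Group distinct values that share a key; keep only groups with 2+ members."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Invented sample values for illustration.
names = [
    "Dipodomys merriami",
    "dipodomys merriami ",
    "Dipodomys  merriami.",
    "Chaetodipus baileyi",
]

clusters = key_collision_clusters(names)
for cluster in clusters:
    print(cluster)
```

A phonetic key such as metaphone3 works the same way structurally; only the keying function changes, which is why trying several keying functions in the clustering window can produce different clusters.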

Undo/Redo
· All your steps are saved!
· Click where it says Undo / Redo in the left frame
· Click on the step to revert to
· Result: the data change
As I said previously, OpenRefine keeps track of everything you do. If you want to see your data cleaning history, go to the left-hand frame where we were working with facets and select the Undo/Redo tab. You can revert to an earlier step, or return to a later one, simply by clicking on it. Note that the data on the right change when you do this.

Split
· Edit column > Split
· Put a space as the separator
· Result: new columns
We can also split columns that contain more than one variable into multiple columns. Let’s split the scientific name column into one column for genus and one for species.
Demo 4:
· Click the blue arrow to the left of the scientificName column.
· Select “Edit column”.
· Then select “Split into several columns”.
· Pick a separator, in this case “ ” (a single space).
· Choose whether you want to keep the original column. I like to keep it, because I can use it to check that the split worked as expected, and I can always delete it later.
· Once you hit “OK”, the new columns will be added to the spreadsheet. See the new columns with genus and species names.
  o Why are there 4 columns? (whitespace before the scientific name)
  o Undo the split from the Undo/Redo tab.

OpenRefine contains built-in functions that are commonly used in data cleaning, such as removing extra whitespace. To do this, click on the blue triangle to the left of the column heading you are interested in, mouse over “Edit cells”, then “Common transforms”, then select “Trim leading and trailing whitespace”.
Demo 5: Remove whitespace
· Click on the blue triangle to the left of the scientificName column heading.
· Mouse over “Edit cells”,
· then “Common transforms”,
· then select “Trim leading and trailing whitespace”.
· Redo the split.
· Now that we can see that the split worked, remove the speciesName column by going to “Edit column” > “Remove this column”.
· Rename the genus column by going to “Edit column” > “Rename this column” and entering “genus”.
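The reason trimming has to come before splitting can be seen in a few lines of Python. This is only an analogy for what happens in OpenRefine, and the sample value (with one leading and one trailing space) is invented: every separator produces a break, so stray spaces at the ends create extra, empty pieces.

```python
# Hypothetical cell value with a leading and a trailing space.
raw = " Dipodomys merriami "

# Splitting on a single space turns each stray space into an empty piece.
untrimmed_parts = raw.split(" ")

# Trimming first leaves only the separator between genus and species.
trimmed_parts = raw.strip().split(" ")

print(untrimmed_parts)  # four pieces, two of them empty
print(trimmed_parts)    # two pieces: genus and species
```

In a spreadsheet-style tool, each of those pieces becomes its own column, which is why the untrimmed split yields more columns than expected.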

Exercise 2: Try to change the name of the second new column to “species”. How can you correct the problem you encounter?
· Try to change the name of the second new column to “species”.
  o There is already a column named species.
· How can you correct the problem you encounter?
  o Create more descriptive names.
  o Rename species to “species_abr”.
  o Rename scientificName2 to “species”.

Filtering: by facet
· Facet the species column
· Click on a facet
· Exclude/include
So far, we’ve been working with the entire dataset. But what if you only want to look at part of the data? This process is called filtering, and you can do it in two ways. The first: if you want to select all records in a specific facet, you can click on the facet.
Demo 5:
· Facet the species_abr column.
· Click on the facet you’re interested in.
· The data change to contain only records in that facet.

Filtering: Text filter
· Example: find all records collected in Hawaii
· Unstructured data: many facets contain “Hawaii”
· Text filter = “Hawaii”
But what if the data you want don’t correspond to a facet? For example, think about unstructured text like the locality column. What would you do if you wanted to find all measurements made in Hawaii? For this you can use the “Text filter” option.
Demo 6:
· Select the locality column and click “Text filter”. A box will pop up in the left-hand frame.
· Type in the text you want to search for, in this case “hawaii”.
· Now when you look at the locality facets above, they all say Hawaii somewhere (the filter is not case sensitive).
· Be careful: the filter matches the exact text you typed, so you might miss records with misspellings.
· You can also use regular expressions.
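The behaviour of a text filter, including the caveat about misspellings, can be sketched in Python. This is an illustration of the concept rather than OpenRefine’s own code; the function name and the locality values are hypothetical.

```python
import re


def text_filter(values, query, regex=False):
    """Keep values containing the query, ignoring case.

    With regex=False the query is matched literally; with regex=True it is
    treated as a regular expression.
    """
    pattern = re.compile(query if regex else re.escape(query), re.IGNORECASE)
    return [value for value in values if pattern.search(value)]


# Invented locality values; the third one contains a misspelling.
localities = [
    "Hawaii, Maui",
    "HAWAII: Oahu",
    "Hawai, Kauai",
    "Arizona, Portal",
]

literal_matches = text_filter(localities, "hawaii")
# "hawaii?" makes the final "i" optional, so the misspelled record is kept too.
regex_matches = text_filter(localities, r"hawaii?", regex=True)

print(literal_matches)
print(regex_matches)
```

The literal search silently drops the misspelled record, while a slightly more permissive regular expression recovers it. Cleaning the column first (for example with clustering) is usually the safer route.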

Exercise 3
· Goal: find all years in the 1980s where measurements were taken
· Facet on year
· Create a text filter to get data from the 1980s
· Result: all years from the 1980s have entries (look at the facets)

Sorting
In addition to filtering the data, OpenRefine also allows you to sort the data as text or as numbers.
Demo 7: Sort by month (mo)
· Facet on species_abr.
· Filter on the AH facet.
· Click the blue arrow to the left of “mo”.
· Select “Sort”.
· Pick how you want the cell values sorted. Since the mo column contains numbers, we’ll choose “numbers”. (Note: the results differ depending on this choice.)
· Then click “OK”.
· Now all of the values in month are in order.
· Redo the sort as text. Note that the order changes.
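Why the two sort types give different results is easy to reproduce in Python. The month values below are invented, and the snippet only illustrates the general difference between comparing characters and comparing numeric values.

```python
# Hypothetical month values stored as text, as they would be after import.
months = ["10", "2", "1", "12", "9"]

# Text sort compares character by character, so "10" comes before "2".
sorted_as_text = sorted(months)

# Numeric sort compares the values themselves.
sorted_as_numbers = sorted(months, key=int)

print(sorted_as_text)
print(sorted_as_numbers)
```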

Remove Sorting
Now that a sort has been applied, OpenRefine gives you more options. For one, you can remove the sort. This function is important because it returns the data to their original order, even if they were in no particular order to begin with. In programs like Excel, your only option would be to hit undo immediately after sorting.
Demo 8: Remove sort
· Return to the sort menu and click “Remove sort”.
· Now the data return to their original order.

Exercise 4: Sort by multiple columns
· Sort by year, then month. What order are the entries in?
  o Year takes precedence (months are sorted within years).
  o This is unlike how sorting works in Excel.
· Sort by year, then month, then day.
  o Year takes precedence,
  o months are sorted within years,
  o and days are sorted within months.
· What happens when you remove the sort on the month column?
  o Days are sorted within years, and months are out of order.
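The precedence described in this exercise corresponds to sorting on a tuple of keys. The Python sketch below uses invented records and hypothetical column names (yr, mo, dy) to show both the full three-level sort and what is left when the month level is dropped.

```python
# Invented observation dates; column names are hypothetical.
records = [
    {"yr": 1990, "mo": 7, "dy": 3},
    {"yr": 1989, "mo": 12, "dy": 25},
    {"yr": 1990, "mo": 1, "dy": 16},
    {"yr": 1989, "mo": 12, "dy": 4},
]

# Year first, then month within year, then day within month.
by_year_month_day = sorted(records, key=lambda r: (r["yr"], r["mo"], r["dy"]))

# Without the month level, days are ordered within each year and months
# fall wherever their days happen to put them.
by_year_day = sorted(records, key=lambda r: (r["yr"], r["dy"]))

print([(r["yr"], r["mo"], r["dy"]) for r in by_year_month_day])
print([(r["yr"], r["mo"], r["dy"]) for r in by_year_day])
```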

Numbers in OpenRefine
By default, OpenRefine imports all data as text. However, it does have special functions for numeric data. To use them, you have to tell OpenRefine that a column contains numbers.
Demo 9: Make the year column into a number.
· Select the blue arrow next to record ID.
· Select “Edit cells”.
· Select “Common transforms”.
· Select “To number”.
· You can tell it worked if the numbers turn green.
· Make sure all filters are removed: if not, only the filtered data will be converted.

Exercise 5
· Convert 3 other columns to numbers (include period).
  o Year, month, date and period
· What happens when you try to convert a non-numeric column?
  o Nothing!
  o This is different from R, where values that can’t be coerced to a number turn into NAs.
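The “nothing happens” behaviour for non-numeric cells can be mimicked in Python with a conversion that falls back to the original value. This is a sketch of the behaviour described above, not OpenRefine’s implementation; the helper name to_number is hypothetical.

```python
def to_number(cell):
    """Convert a text cell to int or float; return it unchanged if that fails."""
    text = cell.strip()
    try:
        return int(text)
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        # Leave non-numeric text untouched instead of replacing it with a
        # missing-value marker.
        return cell


# Invented cell values for illustration.
cells = ["1987", "3.5", "abc"]
converted = [to_number(cell) for cell in cells]
print(converted)
```

Keeping the original text means a mistaken conversion loses no information, at the cost of ending up with a column that mixes numbers and text.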

Numeric Facet
Now that we have some columns designated as numbers, we can do some really useful things with numeric facets. To create a numeric facet:
Demo 10:
· Select the blue arrow next to year.
· Select “Facet”.
· Select “Numeric facet”.
· Now, instead of a list of names as facets, you get the range of numbers with slider bars.
· Slide the bars to the range you want to include.

Exercise 6
· In a numeric column, replace one number with text (such as abc) and one with a blank.
· Create a numeric facet for this column.
· How is this different from the numeric facet for “Year”?
  o There are checkboxes for non-numeric and blank data.
  o Check or uncheck the boxes to filter.
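One way to picture the numeric facet and its extra checkboxes is as a three-way grouping of the column followed by a range selection on the numeric group. The Python sketch below is a conceptual illustration with invented values and hypothetical function names; the exact handling of the range endpoints in OpenRefine may differ.

```python
def numeric_facet(cells):
    """Sort cells into numeric, non-numeric and blank groups."""
    groups = {"numeric": [], "non_numeric": [], "blank": []}
    for cell in cells:
        if cell is None or (isinstance(cell, str) and cell.strip() == ""):
            groups["blank"].append(cell)
        elif isinstance(cell, (int, float)) and not isinstance(cell, bool):
            groups["numeric"].append(cell)
        else:
            groups["non_numeric"].append(cell)
    return groups


def in_range(numbers, low, high):
    """Keep numbers between low and high, inclusive (like moving the sliders)."""
    return [n for n in numbers if low <= n <= high]


# Invented year column with one text value and one blank.
year_cells = [1987, 1990, "abc", "", 1979, 2001]

groups = numeric_facet(year_cells)
eighties = in_range(groups["numeric"], 1980, 1989)

print(groups)
print(eighties)
```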

Saving Scripts
· Export the steps for reuse
· In the Undo / Redo section, click Extract
· Select the steps you want to keep
· Save the code as a .txt file using a text editor
Now we’ve done all this work in OpenRefine, but it is stored inside the program. So how do we get it out? First, I’ll show you how to save the steps you’ve done in the Undo/Redo tab.
Demo 11:
· In the Undo/Redo tab, click “Extract”.
· Select the steps you want to keep.
  o This generates JSON code that specifies these steps.
  o Uncheck and recheck some boxes to see the code change.
· To save the code, copy and paste it into a text editor and save it as a .txt file.
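Because the extracted steps are plain JSON, the saved text can be inspected or trimmed with any JSON-aware tool before it is reused. The Python sketch below assumes a small, illustrative stand-in for the saved text: the field names follow the general shape of an OpenRefine operation history, but the exact contents are an assumption and should be compared with your own export. The helper functions are hypothetical.

```python
import json

# Illustrative stand-in for the text saved from the Extract dialog.
# Real exports typically carry more fields per step; treat this as a sketch.
saved_steps = """
[
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column scientificName using expression value.trim()",
    "columnName": "scientificName",
    "expression": "value.trim()"
  },
  {
    "op": "core/column-rename",
    "description": "Rename column scientificName 1 to genus",
    "oldColumnName": "scientificName 1",
    "newColumnName": "genus"
  }
]
"""

steps = json.loads(saved_steps)


def summarize(operations):
    """List each step's operation type and human-readable description."""
    return [(step["op"], step["description"]) for step in operations]


def drop_operation(operations, op_name):
    """Return the steps without those of the given operation type."""
    return [step for step in operations if step["op"] != op_name]


for op, description in summarize(steps):
    print(f"{op}: {description}")

# Serialize a trimmed history back to text, ready to paste elsewhere.
trimmed_text = json.dumps(drop_operation(steps, "core/column-rename"), indent=2)
```

Reading the descriptions this way is a quick check that a saved script contains the steps you expect before running it on a new file.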

Applying Scripts
· Run the same steps on a similar document
· Click Apply
· Paste in the code
Once you have these steps saved, you can apply them to similar files. So if you collect the same types of data over and over again, with the same column headings and the same cleaning steps needed, you can apply these scripts instead of having to point and click through the whole thing every time.
Demo 12:
· Go into the Undo/Redo tab.
· Click “Apply”.
· Paste in the contents of the text file.
· Click “Perform operations”.
· The data should change to reflect those steps, and the steps will appear in the Undo/Redo history.

Saving and exporting a project
· Autosave feature
· Click the “Export” button (top right)
· Select “Export project”
· Result: a compressed file that contains the data and the cleaning steps
We can also export the steps and the data together by exporting a project. You may have noticed that throughout this process I haven’t clicked save at all. This is because OpenRefine autosaves everything you do as you go. But if you want to get your work off of your computer, you need to export the project.
Demo 13:
· Click the “Export” button at the upper right-hand side of the screen.
· Click “Export project”.
· The program will automatically download a compressed file that contains all your data and the cleaning steps.

Importing a project
· Found in the menu where you create/open projects
· Loads data and history
Now that your project has been exported, anyone who has OpenRefine can view it just by importing the project.
Demo 14:
· Click “Open” in the upper right; a new window will open.
· Select “Import Project” on the left-hand side of the window.
· Open the compressed file created by the export, and rename the project if you would like.
· The cleaned data and the history should be loaded.

Exporting data
· Go to “Export” in the top right.
· Click on the file type you want to export the data in: “Tab-separated values” or “Comma-separated values”.
Now let’s talk about exporting your cleaned data. Not everyone is interested in everything you’ve done with your data. Sometimes they only need the final product, so you can also export just your cleaned data from OpenRefine.
Demo 15:
· Click on the “Export” menu on the top right.
· Click on the type of data file you want to export. We suggest .csv or tab-separated text.
· The program will download the data in the selected format automatically.

Need help?
· Email: tobin.magle@colostate.edu
· Data Management Services website: http://lib.colostate.edu/services/data-management
· Data Carpentry: http://www.datacarpentry.org/
· OpenRefine Lesson: http://www.datacarpentry.org/OpenRefine-ecology-lesson/
Thanks for listening. If you need any help with these exercises, don’t hesitate to email me at tobin.magle@colostate.edu. You can also visit our Data Management Services website to see what else we do with regard to data management. And if you want to see the lessons these were based on, visit the Data Carpentry website and view the lessons in a bit more detail. Thanks!