Data Cleaning, Validation and Enhancement iDigBio Wet Collections Digitization Workshop March 4 – 6, 2013 KU Biodiversity Institute, University of Kansas.

Presentation transcript:

Data Cleaning, Validation and Enhancement iDigBio Wet Collections Digitization Workshop March 4 – 6, 2013 KU Biodiversity Institute, University of Kansas – Lawrence Deborah Paul

Pre & Post-Digitization
  Exposing Data to Outside Curation – Yippee! Feedback
    Data Discovery
      dupes, grey literature, more complete records, annotations of many kinds, georeferenced records
    Filtered PUSH Project
    Scatter, Gather, Reconcile – Specify
    iDigBio
  Planning for Ingestion of Feedback – Policy Decisions
    re-determinations & the annotation dilemma
    to re-image or not to re-image
      “annotated after imaged”
    to attach a physical annotation label to the specimen from a digital annotation or not

Data curation / Data management
  querying dataset to find / fix errors
  kinds of errors
    filename errors
    typos
    georeferencing errors
    taxonomic errors
    identifier and GUID errors
    format errors (dates)
    mapping
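As one illustration of "querying a dataset to find errors", here is a minimal Python sketch that flags date values not matching an expected format. The field names (catalogNumber, eventDate), the ISO-style format, and the sample rows are assumptions made up for this example, not part of the workshop data.

```python
from datetime import datetime


def find_bad_dates(records, field="eventDate", fmt="%Y-%m-%d"):
    """Return (row_index, value) pairs whose date does not parse with fmt."""
    bad = []
    for i, rec in enumerate(records):
        value = (rec.get(field) or "").strip()
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            # Wrong layout, impossible date, or empty value
            bad.append((i, value))
    return bad


# Hypothetical rows for illustration only
records = [
    {"catalogNumber": "A-1", "eventDate": "2013-03-04"},
    {"catalogNumber": "A-2", "eventDate": "03/05/2013"},  # wrong layout
    {"catalogNumber": "A-3", "eventDate": "2013-02-30"},  # impossible date
    {"catalogNumber": "A-4", "eventDate": ""},            # missing
]

for index, value in find_bad_dates(records):
    print(f"row {index}: suspicious date {value!r}")
```

The same idea carries over to the other error kinds in the list: write a check per field, report the rows that fail, then fix them in bulk.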

Clean & Enhance Data with Tools
  Query / Report / Update features of Databases
    Learn how to query your databases effectively
    Learn SQL (MySQL, it’s not hard – really!)
  Using new tools
    Kepler Kurator – Data Cleaning, Data Enhancement
    OpenRefine, desktop app
      from messy to marvelous
      remove leading / trailing white spaces
      standardize values
      call services for more data
        just what is a “service” anyway?
      the magic of undo
    Google Fusion Tables
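The two simplest cleanups named above, removing stray white space and standardizing values, can be sketched in a few lines of Python. This is not how OpenRefine implements them; the synonym table and sample values are invented for illustration, and the sketch also collapses repeated internal spaces, which goes slightly beyond trimming the ends.

```python
def clean_value(value, synonyms=None):
    """Trim and collapse white space, then map known variants to one spelling."""
    trimmed = " ".join(value.split())  # strips both ends, collapses inner runs
    if synonyms:
        # Look up case-insensitively; unknown values pass through unchanged
        return synonyms.get(trimmed.lower(), trimmed)
    return trimmed


# Hypothetical variants of one state name
STATE_SYNONYMS = {"fl": "Florida", "fla.": "Florida", "florida": "Florida"}

raw = ["  FL", "Fla. ", "florida", " Kansas  "]
cleaned = [clean_value(v, STATE_SYNONYMS) for v in raw]
print(cleaned)
```

Keeping the original column and writing the cleaned values to a new one is a cheap substitute for "the magic of undo" when working outside a tool that tracks history.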

OpenRefine
  A power tool for working with messy data.
  Got Data in a Spreadsheet,…?
  TSV, CSV, *SV, Excel (.xls and .xlsx), JSON, XML, RDF as XML, Wiki markup, and Google Data documents are all supported.
  the software tool formerly known as Google Refine

Install

Enhance Data
  Call “web services”
  GeoLocate example
    your data has locality, county, state, country fields
    limit data to a given state, county
    build query
      eolocatesvcv2/glcwrap.aspx? &Locality="+escape(value,'url')
      " eolocatesvcv2/glcwrap.aspx?Country=USA&state=fl&fmt=json&Locality="+escape(value,'url')
    service returns JSON output
    latitude, longitude values now in your dataset.
  Google Fusion Tables
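The "build query" step amounts to URL-encoding the locality text and appending it to a fixed query string. A minimal Python sketch follows, assuming the parameter names shown on the slide (Country, state, fmt, Locality). The service address on the slide is truncated, so BASE_URL below is a placeholder and not the real endpoint.

```python
from urllib.parse import quote, urlencode

# Placeholder only: the slide's service address is cut off in the transcript
BASE_URL = "https://example.org/glcwrap.aspx"


def build_query(locality, country="USA", state="fl", fmt="json"):
    """Build one request URL per record, percent-encoding the locality string."""
    params = {
        "Country": country,
        "state": state,
        "fmt": fmt,
        "Locality": locality,
    }
    # quote_via=quote encodes spaces as %20 rather than '+'
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


print(build_query("2 mi E of Tallahassee"))
```

Fixing Country and state in the query is what "limit data to a given state, county" buys you: the service has far fewer candidate places to match against.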

Parsing json
  How do we get our longitude and latitude out of the JSON?
  Parsing (it’s not hard – don’t panic)!

Parsing json
Copy and paste the text below into
{
  "engineVersion" : "GLC:4.40|U: |eng:1.0",
  "numResults" : 2,
  "executionTimems" : ,
  "resultSet" : {
    "type": "FeatureCollection",
    "features": [
      {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [ , ]},
        "properties": {
          "parsePattern" : "Miles East of TALLAHASSEE",
          "precision" : "Low",
          "score" : 36,
          "uncertaintyRadiusMeters" : 20330,
          "uncertaintyPolygon" : "Unavailable",
          "displacedDistanceMiles" : 2,
          "displacedHeadingDegrees" : 90,
          "debug" : ":GazPartMatch=False|:inAdm=False|:Adm=LEON|:orig_d=2 MI|:NPExtent=29301|:NP=TALLAHASSEE|:KFID=FL:ppl:4006|TALLAHASSEE"
        }
      },
      {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [ , ]},
        "properties": {
          "parsePattern" : "Miles East of %LEON COUNTY%",
          "precision" : "Low",
          "score" : 31,
          "uncertaintyRadiusMeters" : 17244,
          "uncertaintyPolygon" : "Unavailable",
          "displacedDistanceMiles" : 2,
          "displacedHeadingDegrees" : 90,
          "debug" : ":GazPartMatch=False|:inAdm=False|:Adm=LEON|:orig_d=2 MI|:NPExtent=24140|:NP=LEON COUNTY|:KFID=|LEON COUNTY"
        }
      }
    ],
    "crs": { "type" : "EPSG", "properties" : { "code" : 4326 }}
  }
}

Copy the JSON output from the spreadsheet and paste it here. Click the process button (lower right of this screen).

Parsing json

Parsing latitude

Parsing longitude
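The latitude and longitude steps can also be sketched outside OpenRefine. The Python below walks the same path through the response (resultSet, features, geometry, coordinates) and keeps the highest-scoring match. Two assumptions to note: the coordinate values in SAMPLE are made up, because the slide's numbers did not survive in this transcript, and the pair is read in GeoJSON order, longitude first. In OpenRefine itself the equivalent would presumably be a GREL expression built on parseJson().

```python
import json

# Trimmed response with the structure shown earlier; coordinates are
# illustrative placeholders, not values returned by the service.
SAMPLE = """
{
  "numResults": 2,
  "resultSet": {
    "type": "FeatureCollection",
    "features": [
      {"type": "Feature",
       "geometry": {"type": "Point", "coordinates": [-84.25, 30.44]},
       "properties": {"parsePattern": "Miles East of TALLAHASSEE", "score": 36}},
      {"type": "Feature",
       "geometry": {"type": "Point", "coordinates": [-84.24, 30.46]},
       "properties": {"parsePattern": "Miles East of %LEON COUNTY%", "score": 31}}
    ]
  }
}
"""


def best_coordinates(response_text):
    """Return (latitude, longitude) of the highest-scoring result, or None."""
    data = json.loads(response_text)
    features = data["resultSet"]["features"]
    if not features:
        return None
    best = max(features, key=lambda f: f["properties"]["score"])
    # Assumed GeoJSON order: [longitude, latitude]
    longitude, latitude = best["geometry"]["coordinates"]
    return latitude, longitude


print(best_coordinates(SAMPLE))
```

Checking uncertaintyRadiusMeters and precision alongside the score is worth doing before accepting a result, since both matches in the slide's example are flagged "Low".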

The Results!

How to begin?
  This PowerPoint and accompanying CSV
  OpenRefine videos and tutorials
  Join Google+ Open Refine Community
  Google Fusion Tables
  Coming iDigBio from the GWG
  Teach others about these power tools
  Pay-it-forward!
  Data that is “fit-for-research-use” & fun

Have fun with the data no matter where you find it!

Thanks for coming! Special thank you to Katja Seltmann, John Wieczorek, Nelson Rios, Guillaume Jimenez, Casey MacLaughlin, and Kevin Love for light and illumination, for teaching, mentoring, and helping me to empower others to get the most and very best out of the data – and have some fun at the same time!

iDigBio is funded by a grant from the National Science Foundation's Advancing Digitization of Biodiversity Collections Program (#EF ). Views and opinions expressed are those of the author and not necessarily those of the NSF.