Data Wrangling and Interoperability Andrea Denton Research and Data Services Manager Claude Moore Health Sciences Library Ricky Patterson.

Presentation transcript:

Data Wrangling and Interoperability
Andrea Denton, Research and Data Services Manager, Claude Moore Health Sciences Library
Ricky Patterson, Data Management Consulting Group, University of Virginia Library
© 2013 by the Rector and Visitors of the University of Virginia. This work is made available under the terms of the Creative Commons Attribution-ShareAlike 4.0 International license.

Goals for the workshop
– Learn about challenges of interoperability
– Understand differences between software that appears to be the same (Excel, Google Spreadsheets, etc.)
– Learn about Open Refine as a tool to fix messy data
– Gain peer and expert feedback

Challenges of interoperability
– Data sets can be isolated, fragmented
– Challenge of combining disparate data sets
  – Formats (proprietary and open)
  – Data definitions (units, time steps, etc.)
  – Missing or poorly formed data

Interoperability with spreadsheets
– Show sample Excel files
– Show sample Google Spreadsheet files
– Demonstrate format save issues
– Show import/export examples and process

Interoperability with spreadsheets
– Excel
  – Excel 2003 (.xls format)
  – Excel 2007 (.xlsx format begins)
  – Excel 2010 (Windows)
  – Excel 2011 (Macintosh)
– “Save As” Options

Interoperability with spreadsheets

Activity 1: Excel import/export problems
– Give participants some files that have problems
– Ask them to take our CSV file and upload it
– Plant some specific traps for them: misaligned fields, nulls, and similar problems
– Take files saved in old versions and have them try to upload those
– Have students evaluate export choices and evaluate interoperability
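The traps planted in this activity (misaligned fields, nulls) can also be checked for in a script before a file is uploaded anywhere. The following is a minimal sketch using only Python's standard `csv` module. The sample data, the `audit_csv` helper, and the list of null markers are all assumptions made for illustration; they are not part of the workshop materials.

```python
import csv
import io

# Hypothetical sample file: line 3 has an extra field, line 4 is short,
# and lines 5 and 6 use two different spellings of "missing".
SAMPLE = """site,date,temp_c
A,2013-01-01,4.2
B,2013-01-01,3.9,extra
C,2013-01-02
D,2013-01-02,NA
E,2013-01-03,
"""

# Assumed set of markers to treat as missing values (compared in lowercase).
NULL_MARKERS = {"", "na", "n/a", "null", "none", "-999"}


def audit_csv(text):
    """Return (misaligned, nulls).

    misaligned: line numbers whose field count differs from the header.
    nulls: (line number, column name) pairs holding a null-like value.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    misaligned, nulls = [], []
    for row in reader:
        line_no = reader.line_num
        if len(row) != len(header):
            misaligned.append(line_no)
            continue  # column names cannot be trusted for this row
        for name, value in zip(header, row):
            if value.strip().lower() in NULL_MARKERS:
                nulls.append((line_no, name))
    return misaligned, nulls


misaligned, nulls = audit_csv(SAMPLE)
print("misaligned lines:", misaligned)
print("null-like cells:", nulls)
```

Running the same audit on a file before and after a round trip through a spreadsheet program is one way to see what an export changed.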

Lessons from Activity 1 Highlight different points and lessons

Activity 2: Google Spreadsheets import/export problems
– Give participants some files that have problems
– Ask them to take our CSV file and upload it
– Plant some specific traps for them: misaligned fields, nulls, and similar problems
– Take files saved in old versions and have them try to upload those
– Have students evaluate export choices and evaluate interoperability

Wrangling messy data
– Cleaning a single data set can be tough, but merging two disparate data sets can be much harder
– Thinking about basic organizational best practices makes all of it easier

Messy issues
– Identification of relationship in data
– Different units across data sets
– Missing data
– Inequivalent time steps, resolution, grid size, dimensions, scale
– Transformation
– Assumptions, judgment, accuracy from data collection
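Two of the issues listed above, different units and inequivalent time steps, can be illustrated with a small sketch. It assumes one source reports sub-daily temperatures in degrees Fahrenheit and another reports daily means in degrees Celsius; both data sets and the helper names are hypothetical. Aggregating with a simple mean is itself a judgment call of the kind the slide warns about.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sub-daily readings in degrees Fahrenheit.
readings_f = [
    ("2013-06-01T00:00", 50.0),
    ("2013-06-01T12:00", 68.0),
    ("2013-06-02T00:00", 59.0),
    ("2013-06-02T12:00", 77.0),
]

# Hypothetical daily means in degrees Celsius from a second source.
daily_c = {"2013-06-01": 15.5, "2013-06-02": 19.0}


def f_to_c(deg_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (deg_f - 32.0) * 5.0 / 9.0


def daily_mean_c(readings):
    """Convert each reading to Celsius, then average per calendar day."""
    buckets = defaultdict(list)
    for stamp, deg_f in readings:
        day = datetime.strptime(stamp, "%Y-%m-%dT%H:%M").date().isoformat()
        buckets[day].append(f_to_c(deg_f))
    return {day: round(sum(vals) / len(vals), 2) for day, vals in buckets.items()}


converted = daily_mean_c(readings_f)

# Both sources now share a unit and a time step, so they can be compared.
for day in sorted(converted):
    print(day, converted[day], daily_c.get(day))
```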

Open Refine
– Formerly Google Refine
– A Java application with a browser-based interface, but it runs completely locally
– Nothing is in the cloud, so it is safe for sensitive data
– No longer a Google project, now just called Open Refine

Open Refine
– Open Refine is used to clean up files, such as spreadsheets, not to create them
  – Take existing Excel files, and refine them
– Uses JSON for scripting
  – Clean up one spreadsheet
  – Apply same actions to a different spreadsheet
– Can select a subset of these actions
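The idea of recording cleaning steps as JSON and replaying them on another spreadsheet can be sketched in a few lines. The recipe format, the operation names, and the `apply_recipe` helper below are invented for illustration and are NOT Open Refine's actual JSON schema; the sketch only shows the general pattern of "clean once, replay elsewhere, optionally with a subset of the steps".

```python
import json

# Hypothetical recipe format, for illustration only.
# This is not the JSON that Open Refine itself produces.
RECIPE = json.loads("""
[
  {"op": "trim", "column": "name"},
  {"op": "lowercase", "column": "name"},
  {"op": "replace", "column": "state", "find": "Virginia", "with": "VA"}
]
""")

KNOWN_OPS = {"trim", "lowercase", "replace"}


def apply_recipe(rows, recipe, only=None):
    """Apply each recipe step to a copy of rows.

    `only` is an optional set of operation names; steps whose "op" is not
    in it are skipped, which mimics selecting a subset of the actions.
    """
    out = [dict(row) for row in rows]  # leave the input untouched
    for step in recipe:
        op = step["op"]
        if op not in KNOWN_OPS:
            raise ValueError(f"unknown operation: {op}")
        if only is not None and op not in only:
            continue
        col = step["column"]
        for row in out:
            if op == "trim":
                row[col] = row[col].strip()
            elif op == "lowercase":
                row[col] = row[col].lower()
            else:  # "replace"
                row[col] = row[col].replace(step["find"], step["with"])
    return out


sheet_a = [{"name": "  Alice ", "state": "Virginia"}]
sheet_b = [{"name": "BOB", "state": "VA"}]

clean_a = apply_recipe(sheet_a, RECIPE)                             # all steps
clean_b = apply_recipe(sheet_b, RECIPE, only={"trim", "lowercase"})  # a subset
print(clean_a)
print(clean_b)
```

Keeping the steps as data rather than as manual edits is what makes the cleanup repeatable on the next messy file.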

Open Refine Two exercises: – –

Activity 3: use Open Refine
– Load 2 or more messy files (suggest specific locations to find messy files)
– Clean the files
– Merge the files [FIGURE OUT WHERE]
– Export a clean version at the end
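The clean-then-merge sequence in this activity can also be tried outside Open Refine. The sketch below uses only Python's standard library; the two sample files, their column names, and the `load` helper are hypothetical. It normalizes the join key first (stray spaces, leading zeros, inconsistent header case) and then reports which keys failed to match, since silent mismatches are a common source of lost rows.

```python
import csv
import io

# Hypothetical file 1: IDs padded with zeros, one with a stray space.
SITES = """id,name
 001,North Field
002,South Field
003,East Field
"""

# Hypothetical file 2: same IDs as plain integers, different header case.
COUNTS = """ID,count
1,4
2,7
4,1
"""


def load(text, key):
    """Read CSV text into a dict keyed by the integer value of column `key`."""
    rows = {}
    for row in csv.DictReader(io.StringIO(text)):
        rows[int(row[key].strip())] = row  # " 001" and "1" both become 1
    return rows


left = load(SITES, "id")
right = load(COUNTS, "ID")

# Inner join on the normalized key.
merged = [
    {"id": k, "name": left[k]["name"], "count": int(right[k]["count"])}
    for k in sorted(left.keys() & right.keys())
]

# Keys present in only one file deserve a look before exporting.
only_left = sorted(left.keys() - right.keys())
only_right = sorted(right.keys() - left.keys())

print(merged)
print("unmatched in sites:", only_left)
print("unmatched in counts:", only_right)
```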

Discuss lessons learned
– Open discussion time to talk about issues
– Ask questions
– Talk about other experiences of messy data
– Stress why best practice principles matter

Best Practices Creating Data
1. Use Consistent Data Organization
2. Use Standardized Formats
3. Assign Descriptive File Names
4. Perform Basic Quality Assurance / Quality Control
5. Preserve Information - Use Scripted Languages
6. Define Contents of Data Files; Create Metadata
7. Use Consistent, Stable and Open File Formats
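As one concrete instance of "Use Standardized Formats", dates entered in several styles can be rewritten to ISO 8601 (YYYY-MM-DD). This is a minimal sketch, not a complete solution: the list of accepted input formats is an assumption, and ambiguous values such as 03/04/2013 are read month-first here, a choice that has to be confirmed against each data source. Month names are matched in English under Python's default locale.

```python
from datetime import datetime

# Assumed input formats, tried in order. Slash dates are read as
# month/day/year; confirm this against the source before relying on it.
FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d %b %Y", "%B %d, %Y"]


def to_iso(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    cleaned = text.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    return None


raw_dates = ["2013-04-19", "4/19/2013", "19 Apr 2013", "April 19, 2013", "sometime in April"]
for raw in raw_dates:
    print(repr(raw), "->", to_iso(raw))
```

Values that come back as `None` are left for a person to review rather than guessed at, which keeps the script consistent with the quality assurance step in the list above.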

Mailing List Subscription
Please check the box on our sign-in sheet to receive occasional emails to keep up with our services, training, and news. Please encourage others to subscribe:

More Research Data Services in the Library
Offering expert data assistance at every stage of the research process.
PLANNING
Need a data management plan? We can assist you with developing a data management plan that meets increasingly stringent criteria from funding agencies, including:
– Implementation of procedures, tools, and workflows for managing data sets
– Designing a strong study that yields reliable statistics
FINDING & COLLECTING
Need help finding data or collecting your own? We have thousands of sources with the data you seek and experts who will help you:
– Locate, evaluate and format data
– Design metadata and data documentation protocols for new data collection
– Capture data with the appropriate technology tools for your needs
SHARING
Ready to share or archive your data? We can consult with you on strategies to help others discover or access your research by:
– Adhering to data sharing policies and norms
– Selecting a data-sharing repository
– Making your data easier to discover and link
ANALYZING
Want help uncovering unique and compelling insights? Get expert assistance from statistical, spatial, or media specialists to analyze your data and convey your research message:
– Learn how to use cutting-edge tools and methods
– Experiment with high-resolution visualization technologies
– Develop graphical representations that bring impact to your analysis
Workshops, 1:1 Consultations, Class Presentations
Contact me at to find out more.

QUESTIONS?
Ricky Patterson, Data Management Consulting Group, University of Virginia Library