Download presentation
Presentation is loading. Please wait.
1
Module 2: Types, Formats, and Stages of Data
Mahria Lebow, Regional Technology Coordinator, NN/LM PNR Jenny Muilenburg, Data Services Curriculum and Communications Librarian Joanne Rich Information Management Librarian, Health Sciences Library University of Washington Libraries Original content by Jen Ferguson, Northeastern University
2
Module 1 Recap Does anyone have any questions about last week’s overview module? Last week Jenny led us in an overview of research data management. We talked about ( ) and ended the hour with the data management plan. Before we get started on today’s topic, does anyone have any questions about last week’s material? Module 2: Data Types, Stages & Formats
3
Module 2: Data Types, Stages & Formats
Learner Objectives 1. Understand differences between data and file types 2. Identify stages of research data 3. Identify common storage formats for data 4. Understand importance of file naming and versioning By the end of today’s session, you will be able to: 1. Understand differences between data and file types 2. Identify stages of research data 3. Identify common storage formats for data 4. Understand importance of file naming and versioning Module 2: Data Types, Stages & Formats
4
Module 2: Data Types, Stages & Formats
Data Management Plan What stages of the data management plan are we addressing today? 1. Types of data 2. Contextual details Last week, in your readings, you might have taken a look at the Example Data Management Plan for an NSF grant. And in our session we looked at the orthopedic surgery study and went through the different parts in an NSF DMP. Todays session will cover in detail the kind of informaiton you’ll need to think about for types of data and some aspects of contextual details, the rest of the nitty gritty about contextual detaisl will be covered next week. Module 2: Data Types, Stages & Formats
5
Research Data Lifecycle
Let’s talk about the life that data lives. Within the frame of a research project data is: collected or created (this is the raw data), processed (cleaned, everything that neeeds to happen so that the data can be analyzed. making the raw data useful and able to be analyzed), it then moves into actual analysis, where the data is interpreted. After the data is analyzed, results are written up, papers and posters can be published. This has traditionally, in many labs been seen as the end point. Maybe the data files that were used, were all moved onto an external storage drive and then put on a shelf in a room where all past study data was kept. Now we are seeing important emphasis being placed on what happens to the data after the study for which it was created is completed. This includes formalizing the process of preserving or archiving the data, making it available for others to access, which in turn makes it possible for a different researcher to use the data, bringing us back tothe beginning of the lifecycle. Jenny also referenced this graphic and cycle last week. I wanted to bring it up again, because it is important to keep in mind. Often in the live action life of research this cycle is so seemless and happens so fast that the big picture view is lost, but it needs to be considered before hand. Remember again last weeks exercise where we looked at the ortho study and needed to retrospectively create a plan - and how many problems you identified? Yeah, that’s what happens when you don’t do a data management plan in advance. Beyond them often times being required now, they do prevent a lot of grief later on. Credit: UK Data Archive
6
Research Data Stages Let’s talk about the life that data lives. Within the frame of a research project data is: collected or created (this is the raw data), processed (cleaned, everything that neeeds to happen so that the data can be analyzed. making the raw data useful and able to be analyzed), it then moves into actual analysis, where the data is interpreted. After the data is analyzed, results are written up, papers and posters can be published. This has traditionally, in many labs been seen as the end point. Maybe the data files that were used, were all moved onto an external storage drive and then put on a shelf in a room where all past study data was kept. Now we are seeing important emphasis being placed on what happens to the data after the study for which it was created is completed. This includes formalizing the process of preserving or archiving the data, making it available for others to access, which in turn makes it possible for a different researcher to use the data, bringing us back tothe beginning of the lifecycle. Jenny also referenced this graphic and cycle last week. I wanted to bring it up again, because it is important to keep in mind. Often in the live action life of research this cycle is so seemless and happens so fast that the big picture view is lost, but it needs to be considered before hand. Remember again last weeks exercise where we looked at the ortho study and needed to retrospectively create a plan - and how many problems you identified? Yeah, that’s what happens when you don’t do a data management plan in advance. Beyond them often times being required now, they do prevent a lot of grief later on. Credit: UK Data Archive
7
Stages of Data Related to Research Data Life Cycle
Sample hypothesis: Water temperatures in Lake Superior are now significantly warmer than in previous years. The evidence lends support to global warming. That’s a bit abstract – let’s try illustrating those stages with a sample research hypothesis: Water temperatures in Lake Superior are now significantly warmer than in previous years. This evidence lends support to global warming. This is much simpler than last week. We haven’t done the study yet and we just have our hypothesis, we are interested in only looking at one variable - water temperature. Raw data What is being measured or observed? This is the data that is being generated during the research project. An example of raw data in our hypothetical research question might be daily measurements of temperature in Lake Superior. Raw data coincides with the ‘creating data’ section of the research data life cycle. Processed data How can the raw data be made useful/manipulable? To continue with the example above, lake temperature data may become processed once researchers remove clearly erroneous temperature measurements from the data set, and enter the remaining temperatures into a spreadsheet for manipulation and analysis. Processed data coincides with the ‘processing data’ section of the research data life cycle. Analyzed data What does the data tell us? Is it significant? How so? For example, we could analyze daily lake temperature data by finding average temperatures, looking at seasonal fluctuations, and generating graphs that demonstrate these changes. Analyzed data coincides with the ‘analyzing data’ section of the research data lifecycle. Lake Superior, 1907 Module 2: Data Types, Stages & Formats
8
Types of Research Data Measurements from sensors and laboratory instruments Computer modeling Observations and/or field studies Specimens Medical images Survey responses Focus group and individual interviews Opinion polling Mapping/GIS data “Table of data collected on T-1 Station buoy at McCrie Shoal” Data is often thought of in quantitative terms. Much research data is indeed quantitative, but ‘number’ and ‘data’ are not synonymous. Data is typically considered in two broad categories: quantitative and qualitative. Quantitative data – which has to do with quantities and measurements of things Qualitative data – which has to do with qualitity and descriptions of things While many types of data are generated in the course of research, existing data can also be used in research projects. Examples include social scientists using census data for their studies, or scientists comparing published gene sequences from various organisms. Some types of research data and data files are fairly ubiquitous and can be found distributed across disciplines. Images and video are data types that are widely generated and used by researchers regardless of discipline. Mapping/GIS data is also used across a wide range of disciplines. Similarly, researchers in many disciplines make use of spreadsheets to store, manipulate, and visualize data. Do you work with a kind of data that isn’t listed here? Module 2: Data Types, Stages & Formats
9
Digital data, files galore
Many types of data, a lot of different files Common Uncommon Same data, different file type So, that’s a lot of different kinds of data. Many of the kinds of data files collected during the course of research are familiar to our everyday or school lives - text files, spreadsheet files. But there are also uncommon files, many that are discipline specific, data type specific and maybe produced/read by online one proprietary program. It is also important to remember and consider that data often has many iterations, as we saw in the life cycle. Data may come out of an digital colleciton tool as a CSV or comma separated value file and then we may save it as an excel file to process the data and the analysis may be done in and result in an SPSS file. Module 2: Data Types, Stages & Formats
10
Format Types for Long-Term Access to Data
Data formats for long-term access are both: Nonproprietary, and Unencrypted and uncompressed Maybe you’ve had the experience of feeling that you wrote this really cool story or essay in 1993 in WordPerfect and its saved on this floppy disk. You go on ebay and buy a floppy disk drive. By some kind of miracle the disk hasn’t degraded, you see your file, but then you can’t get it open - nothing will read it. Or perhaps you’ve already had this happen in a lab environment, where your lab uses a specialized piece of software for processing data on the previous grant and your supervisor just had this realization in the shower about the data, it should be looked at in this new way and asks you to do it and when you pull up those files, you don’t recongize the extension, windows just gives you that error message saying select the program you’d like to use for this and nothing works. So you go online and find that a lot of other people are in a similar boat, posting in forums desperate to open these files, but the company has gone out of business. You’ve probably seen this OSX error message, or the windows equivilant “windows cannot open this file: to open it windows needs to know what program to use” The best way to avoid this is to plan for the future and preserve your research data in a format that isn’t linked to a particular piece of software, isn’t proprietary. We don’t know if, say we’ll still be able to open excel files in 10 years. It seems unlikely that we wouldn’t be able to, but who knows. Plan for software obsolescence Non-proprietary or Open Formats are Readable by more than just the equipment and/or program that generated it. Sometimes proprietary file formats are unavoidable. However, proprietary formats can often be converted to open formats. Unencrypted / uncompressed Formats: Offer the best prospects for long-term access. If files are encrypted and/or compressed, the method used will need to be both discoverable and usable for file access in the future. Module 2: Data Types, Stages & Formats
11
Converting to Preferable Formats
To mitigate the risk of lost information: Note conversion steps taken If possible, keep the original file as well as the converted one In file format conversion, information can be lost In file format conversion, information can be lost. In one fairly common example, converting large image files from TIFF format to JPEG (or JPG) format results in a much smaller file size, but at the expense of information lost from the file. This lost information may be visible as a decrease in the converted file’s quality (as is the case when TIFF images are converted to JPEG). Format conversions may also result in lost contextual information such as creation date, creator, system, or instrument. You may have heard the term ‘metadata’ before; this contextual information is a type of metadata. Carefully note the steps taken during the conversion/migration A best practice, whenever possible, is to keep the original file as well as the converted one Data may need to be converted from the original format to a preferred data preservation format in preparation for long-term storage, or to deposit them with the UK Data Archive. Conversion is best done by the researcher familiar with the data, to ensure data integrity during conversion. When data are converted from one format to another - through export or by using data translation software - certain changes may occur to the data: for data held in statistical packages, spreadsheets or databases, some data or internal metadata such as missing value definitions, decimal numbers, formulae or variable labels may be lost during conversion to another format, or data may be truncated for textual data, editing such as highlighting, bold text or headers/footers may be lost. After conversion data should be checked for errors or changes. Module 2: Data Types, Stages & Formats
12
Module 2: Data Types, Stages & Formats
Preferred Formats Examples of preferred formats for various data types include: Moving Images: MOV, MPEG-4 Audio: WAVE, MP3 Numbers/statistics: CSV Images: TIFF, JPG 2000 Text: PDF/A, txt, OpenDocument Text by scorche Here are some examples of preferred formats for data. There are different opinions out there on how long term you should plan and what formats will still be around in 10 years time. Some say that yes, .xls files are fine, clearly microsoft will still exist in 10 years, but again if we think of word perfect or coral draw type files and how at the time we didn’t think we would have trouble opening those now. This is a great category to ask a librarian for support with in choosing files for long term storage. We’ll also talk more about this in subsequent sessions (7?) For these and more, see: Module 2: Data Types, Stages & Formats
13
Module 2: Data Types, Stages & Formats
File Management File naming conventions Version control A fundamental part of good data management is good file management. This includes using file naming convetions, version control and most importantly consistency and following the convetions across and between collaborators. Module 2: Data Types, Stages & Formats
14
File Naming Conventions
Consistency Across files Between people Over time Descriptive Capture “aboutness” in file name Write it down! Why are file naming conventions important. I know this seems like a really boring topic and maybe even a no brainer. And, in some ways it is until it isn’t. It will quickly become more of an issue the more individuals you have collaborating. Help you find your data Help others find your data Help track which version of a file is most current Establish conventions for file naming and folder naming. Be descriptive enough that someone other than the creator of the file can guess what the file contains. When possible include in the file name contextual details: like something that links it to the particular project (grant name, project name), unique condition or variable, run number if applicable and the date. General file naming best practices apply here to, no spaces in the file name or symbols, because these can be misinterpreted by some operating systems and software. Once you have a convention established, write it down! Module 2: Data Types, Stages & Formats
15
Module 2: Data Types, Stages & Formats
File versioning Basic File name includes date: file_yyyynndd.extension File name includes version number: file_yyyymmdd_v1.extension document, document, document Complex Tools to assist Git Mercurial Bazaar Subversion “Readme” version control table Tracks version number, owner, date changed Depending on the complexity of your study (number of processing steps, number of collaborators), versioning can sometimes be handled by good file naming conventions. (what does v1 mean? when does it reach v2?) If you find that you have many iterations, related files, and collaborators a more complex system may be necessary. (say something about software programs) There are software programs that may be able to help you. These programs were primarly developed to assist in version control on open source software development projects, to allow for distributed networks of people to build on the same thing. Depending on who you talk to they are either incredibly simple to use to intimidating. Unless you have someone in your lab who is familiar, there might be a bit of a learning curve here. Another option is creating a tracking system. This would involve having a spreadsheet in each folder with study related files in it that keeps track of each “base file’s” creator, maintainer, related files and most importantly versions. Have any of you edited a wiki before, you know how you can look at the edit history and it tells you who the last editor was, what their change was and when? Yeah, that’s helpful. This is similar, let’s look at an example. Module 2: Data Types, Stages & Formats
16
File versioning From the UK Data Archive
Title: Vision screening tests in Essex nurseries File Name: VisionScreenResults_00_05 Description: Results data of 120 Vision Screen Tests carried out in 5 nurseries in Essex during June 2007 Created By: Chris Wilkinson Maintained By: Sally Watsley Created: 04/07/2007 Last Modified: 25/11/2007 Based on: VisionScreenDatabaseDesign_02_00 Version Responsible Notes Last amended 00_05 Version 00_03 and 00_04 compared and merged by SW 00_04 Vani Yussu Entries checked by VY, independent from SK 17/10/2007 00_03 Steve Knight Entries checked by SK 29/07/2007 00_02 Karin Mills Test results entered 05/07/2007 00_01 Test results 1-80 entered File versioning From the UK Data Archive Depending on the complexity of your study (number of processing steps, number of collaborators), versioning can sometimes be handled by good file naming conventions.
17
Module 2: Data Types, Stages & Formats
Exercise! 5 minutes to read “Module 2 Activity: Identifying Types and Stages of Data 15 minutes to answer questions 10 minute report back / share with another group? Module 2: Data Types, Stages & Formats
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.