1
Preservation and Curation of Complex Digital Objects DATA ORGANIZATION
ETD Research Data and Complex Digital Objects: Data Organization. What is an ETD, ultimately? It is a training ground for new scholars and researchers, so we need to make sure the ETD prepares them for a career. The evidence record is key to all research.
2
Welcome and Workshop Background
Instructor: Gabrielle V. Michalek, Director of Connected Scholarship.

Purpose: Provide you with resources and tools to help you address the challenges and opportunities that "data organization" methods pose and provide for you as a researcher, particularly regarding your research outputs.

Electronic Theses and Dissertations used to be PDFs. Increasingly, students report that non-PDF files are just as important as, or more important than, the PDF as research outputs and evidence. How do you make sure that your research outputs are organized in ways that ensure their later accessibility and potential use? How do you make sure that your research outputs are described adequately for you (or others) to make sense of your variables and other characteristics?
3
Learning Objectives: Students
Understand options for data management and data organization. Gain exposure to techniques and resources you may use to ensure your data will be readable and understandable in the future. Understand where to look for field-specific analysis methods, services, tools, and repositories.

As you develop your research, you will have to consider how to manage and organize the data you're gathering. The type of research you are doing and the standards of your field(s) will affect your data organization and analysis practices. When you prepare and submit your final work as a thesis or dissertation, earlier decisions you have made about how the data is structured and organized will have implications for how readily others will be able to access and make use (or sense) of your data.

Each field has specific methods of analysis, as well as software tools that have been developed to help researchers accomplish that analysis. As you begin to gather, clean, and organize the data, consider not just how you will need to use the data today, but how to make sure it will be readable and understandable in the future.

Whether your data is organized in lists, arrays, hash sets, dictionaries, queues, trees, heaps, or relational databases, it is important to be aware of disciplinary norms, as well as institutional and funder requirements, which will make its deposit, preservation, and long-term support more likely. Increasingly, the path for long-term support involves taking steps to make sure your data is deposited alongside data collected by others in your field or discipline.
4
Workshop and Guidance Briefs - Topics
Copyright
Data Organization
File Formats
Metadata
Storage
Version Control

This module is part of a larger series of ETD Research Data and Complex Digital Object curation and management resources. This slide gives you a sense of the "big picture" of the components you will encounter as you create and manage your thesis or dissertation and the research files that support or comprise it. We encourage you to attend other workshops, either in person or online, to learn about the other curation and preservation tasks that are essential to digital research management. Also, there are "Guidance Briefs" available on each of these topics: easy-to-use briefs that point you to valuable resources, references, and tools that will help you, not just as you prepare your ETDs, but throughout your careers.
5
Key Takeaway The decisions you make about how you organize and structure your data today will have implications for how you and others can access and make use (or sense!) of that data in the future.
6
Why is data hard to deal with?
Data without documentation (e.g., a data dictionary) is often impossible to understand. Without access to specific (often expensive) software, a data file may be impossible to view or use. IRB and funder requirements may affect the way you need to structure your data. As data usage increases, data often needs to be interoperable in order to enable sharing and reuse. The costs of not doing data management can be very high!

Key point: if you are not actively managing your data, you are highly likely to lose access to, or context for, that data. Think about the key research file you're likely to produce during graduate school. That might be your thesis or dissertation; it may also be a dataset. Now, think about where and how you'll store that file. Let's say your most important research output is an SPSS file. What are the threats to its usefulness in 10 years?

1) You may misinterpret a variable, or fail to recall the full details of the file and its relationship(s) to other files, if you do not include a data dictionary or "readme" file describing all of the elements of your dataset and any relationships that exist between the dataset and other files.

2) The file itself cannot be rendered or opened by you. That might be because you no longer have the software package you need; SPSS is expensive, after all, and if you're not on the tenure track at a research institution, you very well may not have access to it. The file might also be unopenable because of backwards-compatibility issues: many files become obsolete over time, even when the software package itself is still available, because of all the changes that happen to the software and operating system.

3) The file may be corrupted or altered. Particularly if you don't have sole custody of the file (if it's a shared dataset produced in a lab, for example), you may not know whether the file is accurate.

These are just a few of the scenarios that can happen.
Today, we’ll talk a bit about how to make sure you have the tools you need today to ensure these issues don’t paralyze you down the road.
7
Questions to ask…repeatedly!
What are the data organization standards for your field? What are the data export options in the software you are using? What forms of the data will be needed for future access?

The following questions should be considered for any project that gathers data: first at the planning stage, again as data is being gathered and stored, and once more prior to final deposit into a digital archive or repository.

What are the data organization standards for your field? For example, there are often standards for labeling data fields that will make your data machine-readable. There may also be specific variables and coding guidelines that you can use that will make your work interoperable with other datasets. Lastly, there may be accepted hierarchies and directory structures in your discipline that you can build upon.

What are the data export options in the software you are using? If you are using proprietary and/or highly specialized software to analyze large data sets, export the data in a format that is likely to be supported in the future, and that will be accessible from other software programs. This usually means choosing an open format that is not proprietary. Remember that you may not have access to the same software in the future, and not all software upgrades can read old file types.

What forms of the data will be needed for future access? Consider the various forms the data may take, and the scale of the data involved. You may need to preserve not only the underlying raw data, but also the resulting analyses you have created from it.
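To make the export advice above concrete, here is a minimal sketch (with invented field names and values) of writing the same small dataset to two open, non-proprietary formats, CSV and JSON, using only the Python standard library:

```python
# Invented example data: exporting to open formats (CSV and JSON)
# so the data stays accessible without the original analysis software.
import csv
import json

rows = [
    {"participant_id": 1, "condition": "control", "score": 42},
    {"participant_id": 2, "condition": "treatment", "score": 57},
]
fields = ["participant_id", "condition", "score"]

# CSV: plain text, readable by virtually any statistics package or spreadsheet.
with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)

# JSON: also openly specified; preserves numeric types and nesting.
with open("results.json", "w") as f:
    json.dump(rows, f, indent=2)
```

Either file can later be opened in free tools, which is the point: the format, not the software, carries the data forward.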
8
Structuring your data well enables you to:
Reproduce results
Reuse it in the future
Share it with others
Gain and retain credibility
Comply with IRB/funder requirements
9
Providing Context for Your Data
Always document the following about any data you create (ideally, in a readme text file to be stored along with the data):
- The data's purpose
- A list of the files in your data package
- A data dictionary listing and describing all variables
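As a sketch of what such documentation might look like, the following snippet writes a hypothetical readme file covering the three items above (purpose, file list, data dictionary); every file name and variable in it is invented for illustration:

```python
# Hypothetical readme documenting a dataset: purpose, file list,
# and a data dictionary. All names and values are invented.
readme = """Purpose: survey of weekly study habits for a hypothetical thesis project.

Files in this data package:
  survey.csv         one row per participant, one variable per column
  survey_readme.txt  this file

Data dictionary for survey.csv:
  participant_id  integer; unique key for each participant
  hours_studied   integer; self-reported study hours per week
  gpa             float; grade point average on a 0.0-4.0 scale
"""

with open("survey_readme.txt", "w") as f:
    f.write(readme)
```

A plain text file like this needs no special software to read, so it is likely to outlive whatever package produced the dataset.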
10
Data Organization Principles
Use one variable per column. Make one observation per row. Use human-readable column names. Include one table per tab. Indicate relationships between tables using a key.

Some fundamentals of data organization include the following. Use one and ONLY one variable per column, and one and ONLY one observation per row. Embedding multiple variables in the same column or row will compromise your ability to understand, let alone parse, the data. This becomes particularly challenging after time has passed and your memory fades (and it will!). Use human-readable column names. This ensures that if the data dictionary is lost or unable to be rendered in the future, you can still understand the basic features of the dataset by looking at the dataset itself.

Movie Title  Director                         Distributor          Running Time  Budget   Released
Peter Pan    Herbert Brenon                   Paramount Pictures   105 minutes   40,030   Dec
Girl Shy     Fred C. Newmeyer and Sam Taylor  Pathe Exchange       82 minutes    400,000  Apr
Greed        Eric Von Stroheim                Metro-Goldwyn-Mayer  140 minutes   665,603  Dec
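To illustrate the one-variable-per-column, one-observation-per-row layout and the use of a key to relate tables, here is a small Python sketch using the films from the example table; the movie_id key and the split into two tables are invented for illustration:

```python
# One observation per row, one variable per column, and a key
# (movie_id, invented here) relating two tables.
movies = [
    {"movie_id": 1, "title": "Peter Pan", "running_time_min": 105},
    {"movie_id": 2, "title": "Girl Shy", "running_time_min": 82},
    {"movie_id": 3, "title": "Greed", "running_time_min": 140},
]
directors = [
    {"movie_id": 1, "director": "Herbert Brenon"},
    {"movie_id": 2, "director": "Fred C. Newmeyer and Sam Taylor"},
    {"movie_id": 3, "director": "Eric Von Stroheim"},
]

# Because each table carries the key, the tables can be recombined
# on demand instead of cramming several variables into one column.
by_id = {d["movie_id"]: d["director"] for d in directors}
joined = [{**m, "director": by_id[m["movie_id"]]} for m in movies]
print(joined[0])
```

The same join works in any spreadsheet (VLOOKUP) or database (JOIN) precisely because each row holds one observation and the key appears in both tables.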
11
Additional Principles
Do:
- Consider what your NULL values are and how they are represented
- Consider what contextual documentation is required
- Use standard data representations (e.g., YYYYMMDD for dates)

Do Not:
- Use formatting to convey information
- Place comments in cells
- Use special characters in field names
- Use blank spaces or symbols in column names

Some additional do's and don'ts: Use standard representations for things like geographical locations and dates. This is crucial for interoperability between datasets, and one dependable feature of datasets is that demand for interoperability will grow over time, so the utility of any given dataset will be limited or enhanced by the standards to which it has complied. Do not use special characters in field names. These tend to "break" over time, particularly when datasets are migrated or exported, and they make it much more challenging to have your data ready for interoperability with other datasets.
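Two of the "do" rules above, standard date representations and clean field names, can be sketched in Python as follows; the helper names and the input strings are invented for illustration:

```python
# Sketch of two cleanup rules: strip special characters/spaces from
# field names, and standardize dates to YYYYMMDD. Inputs are invented.
import re
from datetime import datetime

def clean_field_name(name):
    # Collapse any run of spaces or symbols into an underscore.
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()

def to_yyyymmdd(text, fmt="%m/%d/%Y"):
    # Convert e.g. "3/14/2015" into the standard "20150314" form.
    return datetime.strptime(text, fmt).strftime("%Y%m%d")

print(clean_field_name("Running Time (min)"))  # running_time_min
print(to_yyyymmdd("3/14/2015"))                # 20150314
```

Applying transformations like these once, before deposit, is far easier than untangling inconsistent names and date formats years later.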
12
Discipline-based data repositories:
Social Sciences: ICPSR
Genomics: GenBank
Earth Sciences: NASA's Earthdata
Archaeology: tDAR
Oceanography: NODC
BioSciences: Dryad

These are just a few of the discipline-based data repositories available today. Your faculty advisors should be able to help you identify appropriate repositories for your datasets and databases. (Note to instructor: a quick tour of ICPSR's site and deposit process provides a great illustration of such discipline-specific dataset repositories.)
13
Carnegie Mellon data repository and Tools:
KiltHub – Carnegie Mellon's institutional repository
Open Science Framework – a free and open-source tool that can be used for managing projects and collaborations in any discipline
14
Version control is all about PROCESS.
Version control is all about process. This process can be manual or software-assisted, but it requires organizing the way you work. The image here is an example: far from perfect, but it does show a real-world example of version control. Key here is having a process that stays consistent and that yields file names that can stand alone and still make sense.

Here, you see two layers of version control in action: a set of nested folders. Where it gets interesting is at the last folder stage; note that those folders are dated and the revisions are marked. So "revisions" is, presumably, a folder full of revisions of Guidance Documents produced on a given date. Can anyone tell me why the folders are dated year-first rather than in the month-day-year order most common in the US? (Answer: year-month-day works best because the year is the most stable figure (and thus comes first), the month is the second most stable figure (and thus comes second), and the day fluctuates the most (and thus comes last). That ensures the resulting string will sort naturally by year first: all the 2015s will be together, all of the 2001s will be together, and so on. Think about how this would look if these dates were instead in month-day-year format: the revision folders would stack very differently, and once you get into multi-year projects, the years would scramble because they come last in the numerical string.)
15
Version Control

OK:
image1_v1.jpg
image1_v2.jpg
image2_v1.jpg
image2_v2.jpg
...

OOPS:
image1_v1.jpg
image1_v10.jpg
image1_v2.jpg
...

Better:
image1_YYYY-MM-DD.jpg
...

A simple method to designate a revision is to note it at the end of the file name. This way, files can be grouped by their name and sorted by version number. For example:

image1_v1.jpg
image1_v2.jpg
image2_v1.jpg
image2_v2.jpg
...

If you use version numbers, one issue that can arise is that computers will sort files based on the position of the characters. This can lead to strange, unhelpful results: for example, image1_v10.jpg will sort before image1_v2.jpg.

A good practice that can help you to avoid these problems is to use dates to designate version numbers. If you choose this strategy, format dates as year-month-day (YYYY-MM-DD). Using this order will help avoid confusion when collaborating with other researchers or systems that use a day-month-year or month-day-year convention, and it will help your computer sort versions in chronological order. For example, file names of the form image1_YYYY-MM-DD.jpg, with the date of each revision in place of YYYY-MM-DD.
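The sorting behavior described above can be demonstrated in a few lines of Python; the file names and dates are invented for illustration:

```python
# Character-by-character sorting puts v10 before v2, while
# year-first (ISO-style) dates sort chronologically.
versioned = ["image1_v1.jpg", "image1_v2.jpg", "image1_v10.jpg"]
dated = ["image1_2016-01-05.jpg", "image1_2015-12-31.jpg", "image1_2015-03-14.jpg"]

print(sorted(versioned))  # v10 lands between v1 and v2
print(sorted(dated))      # chronological, because the year comes first
```

Zero-padding version numbers (v01, v02, ..., v10) also fixes the first problem, but dated names additionally record when each revision was made.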
16
Version Control – Collaborative Documents
dataset1_YYYYMMDD_KES
dataset1_YYYYMMDD_WTC
dataset1_YYYYMMDD_GSC
…

If the files you are using are created or edited collaboratively, you may want to incorporate names or initials into your file naming conventions so that you know which versions contain updates by each individual on your team, as in the examples above (with the date of each revision in place of YYYYMMDD).
17
Resources

MATRIX at Michigan State University gives file naming advice.
Udacity offers a free online course on using Git and GitHub.
Hello World offers another helpful GitHub guide.
Version Control with Subversion is a free book authored by Subversion software developers.
18
Source - Guidance Briefs: Managing Your ETD Research Files
Data Organization

Structuring your data well enables you to:
- Reproduce results
- Reuse it in the future
- Share it with others
- Gain and retain credibility
- Comply with IRB/funder requirements

The decisions you make about how you organize and structure your data today will have implications for how you and others can access and make use (or sense!) of that data in the future.

Context and Data Documentation. Include the following in a readme text file:
- The data's purpose
- A list of the files in your data package
- A data dictionary listing and describing all variables

Data Organization Principles:
- Use one variable per column
- Make one observation per row
- Use human-readable column names
- Include one table per tab
- Include an ID or key to indicate any relationship between tables

Whether your data is organized in lists, arrays, hash sets, dictionaries, queues, trees, heaps, or relational databases, it is important to be aware of disciplinary norms, as well as both institutional and funder requirements, that will make its deposit, storage, and long-term support more likely. Increasingly, the path for long-term support involves taking steps to make sure your data is deposited alongside data collected by others in your field or discipline.

Questions to consider for any data project:
- What are the data organization standards for your field?
- What are the data export options for your software?
- What forms of the data will be needed for future access?

The DataONE Best Practices database provides recommendations on how to work effectively with your data through all stages of the data lifecycle.
Do:
- Consider what your NULL values are and how they are represented
- Consider what data documentation is required
- Use standard data representations (e.g., YYYYMMDD for dates)

Do Not:
- Use formatting to convey information
- Place comments in cells
- Use special characters in field names
- Use blank spaces or symbols in column names

Discipline-based data repository examples:
- Social Sciences: ICPSR
- Genomics: GenBank
- Earth Sciences: NASA's Earthdata
- Archaeology: tDAR
- Oceanography: NODC
- BioSciences: Dryad

Three main issues you want to get across:
1) Decisions made today about how to organize and structure data will have implications for how it may be used in the future.
2) Different disciplines have different norms, both for structuring data and for depositing data.
3) Every dataset needs a readme text file that describes the data's purpose, lists the data files, and includes a data dictionary that lists and describes all variables used.

NOTE TO INSTRUCTOR: this can be printed as a handout.
19
Source - Guidance Briefs: Managing Your ETD Research Files
Version Control

Version Control: the process of managing changes to your files over time (aka revision control or source control).

Manual Version Control

A simple method is to note the current revision at the end of the file name. This way, files can be grouped by their names and sorted by version number:
•filename-v01.jpg
•filename-v02.jpg
•…

You can also use dates to designate version numbers, using year-month-day (YYYY-MM-DD) to help your computer sort versions in chronological order:
•filename-YYYY-MM-DD.jpg

If the files you are using are created or edited collaboratively, incorporate names or initials so you know who updated which version:
•filename-YYYYMMDD-KES.jpg
•filename-YYYYMMDD-WTC.jpg

Software-Assisted Version Control

There are also software tools that can help you version your content. These tools store your content in such a way that they can remember its state from revision to revision. Usually, they also allow you to "check in" and "check out" your content, ensuring that revisions never happen simultaneously in two different locations (e.g., if collaborating researchers both attempt to revise the same file at the same time, or a researcher unwittingly tries to revise the same file on two different machines). Key differences between these software-assisted methods and the manual methods include:
•You can only view and edit the working version of a file.
•When you change a file, you can save a revision and attach a short summary of your changes.

Research is active and iterative. You will edit and re-edit your research materials many times before finishing your thesis or dissertation. How will you know that you are working with the most current revision of your materials?
Resources (for more information)

The digital humanities center MATRIX (Michigan State University) provides advice on structuring file names, based on oral history projects, that is broadly applicable.
Udacity offers a free online course on how to use Git and GitHub, with interactive exercises to familiarize you with the tools.
Another helpful GitHub guide is available from Hello World.
The Subversion community provides free access to the book Version Control with Subversion.

Three main issues you should be familiar with at this point:
1) Your research files will change over time; you need to ensure you will know what each file is and that you establish a clear historical record of your research creation process.
2) Instead of saving over a file, save versions of the file as you make changes to it. An easy way to do this is to establish a standard way of naming and/or storing your files that makes it easy to tell what each is (e.g., with version numbers or dates at the end of a filename).
3) You may also rely on software tools that are made to "version" your content (e.g., Subversion); these are especially useful when multiple authors/researchers have access to the same file and/or may make changes to a file over time.

NOTE TO INSTRUCTOR: this can be printed as a handout.
20
Activity

1) Choose one spreadsheet you are using for a current data-gathering project.
2) Use the "Data Organization Principles" and check to see if your file meets those requirements.
3) Create a data dictionary for the spreadsheet that describes the meaning of each column header.

Note: The Data Organization Principles are documented under "Structure" in the Guidance Briefs and in slide 10: Data Organization Principles.
21
Questions?