Data Rescue! Data Management Series: Workshop 1 HUMANS RESEARCH DATA SERVICE
Introductions!
Research Data Service (RDS)
The Research Data Service provides the Illinois research community with expertise, tools, and infrastructure to manage and steward research data.
◦ Knowledge around data policies, resources, archiving, & preservation
◦ Consultation for data management planning & implementation
◦ Workshops on data management, documentation, and data publishing
◦ Data Management Plan reviews and DOI minting services
◦ Solutions for public access to research data
◦ Centralized, private storage for active (“working”) data (with NCSA)
visit: researchdataservice.illinois.edu or
What do we do?
Expertise
◦ Knowledge around data policies, tools, resources, archiving, and preservation
◦ Consultation and workshops for data management planning and implementation
Tools
◦ Data Management Plan creation wizard (DMPTool.org)
◦ Tools for data citation (DOI minting)
Infrastructure
◦ Illinois Data Bank (self-deposit institutional data repository)
Workshop goals
Know
◦ what you have
◦ where it lives
◦ how to access it
Learn
◦ Organization strategies
◦ File naming pointers
◦ Types of documentation
Practice
◦ Identifying and grouping your data
This is not a digitization workshop, but you may find some of this useful when thinking about digitization planning.
Everyone should have
◦ Handout
◦ 1 x red post-it: to write down questions or things that were confusing
◦ 1 x blue post-it: to write down things that were helpful, and to indicate that you’re done with an activity
◦ 1 blank pad of post-it notes
◦ Pen or pencil
What data do you have?
Research data…
◦ Research projects
◦ Administrative
◦ Students
◦ Recordings
◦ Specimens
◦ Classes
Activity 1: Mini survey
Please take a few minutes to complete a mini assessment survey. You will not be turning this survey in to us, so feel free to be as honest as you’d like. Reword the questions as needed to better fit your data or research objects.
Activity 2: Data inventory
You came here to rescue something, right? Or to get a fresh start for new projects. Take a moment to think of those things.
Step 1: Identify the data/projects you work with, e.g. specimens, government databases, instrument data, images, etc. You may choose to answer in terms of all the data your lab has, just the data you work with, or other files your projects depend on. Write the name of each data set or project in the rows under the Step 1 column.
Activity 2: Data inventory
Steps 2 and 3
◦ What type of project is this data used for? Examples: class, grant, personal, student projects, etc.
◦ What type of data is this? e.g. file formats, content areas, etc.
◦ Where is the data located? e.g. file path, cabinet location, shelves, office, etc.
◦ How do you access this data? e.g. specific software, hardware, physical access, etc.
◦ Is the data backed up and how? e.g. cloud, external hard drive, flash drive, or not at all
Optional: Scratch out any question that doesn’t make sense for your data and add one of your own.
Activity 2: Data inventory
Step 4: Inventory assessment
Turn your sheet back over and look at your initial survey answers. Do you want to change anything?
◦ Do you have more data than you expected?
◦ Is your data more independent or more dependent than you expected?
Before we go on…
Before we move on to the last part of our activity, let’s take a moment to discuss some essential elements of organizing data files or projects on a digital platform.
Consistent File and Folder Naming
To make files and folders quick to find and sort, names should be consistent but unique. Avoid special characters.
Elements to consider including in a name:
◦ project name/acronym
◦ experiment/instrument type
◦ site location information (if applicable)
◦ researcher initials
◦ date (consistently formatted, e.g. YYYYMMDD)
◦ version number (w/ leading zeros)
General theme: scale ruins all informality. Think ahead.
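One way to keep names consistent is to build them from the same elements every time rather than typing them by hand. The sketch below is only an illustration: the function name `build_name`, the underscore separator, the `v` prefix on the version, and the sample values (initials "JS", the date) are all assumptions, not part of the workshop materials.

```python
import re
from datetime import date

def build_name(project, experiment, initials, when, version, site=None, ext="txt"):
    """Join naming elements with underscores (all names here are illustrative)."""
    parts = [project, experiment]
    if site:  # site/location information is optional
        parts.append(site)
    # Date as YYYYMMDD, version zero-padded to two digits
    parts += [initials, when.strftime("%Y%m%d"), f"v{version:02d}"]
    # Replace runs of anything other than letters, digits, or hyphens with a hyphen
    cleaned = [re.sub(r"[^A-Za-z0-9-]+", "-", p).strip("-") for p in parts]
    return "_".join(cleaned) + "." + ext

example = build_name("BAM", "Co-Exp", "JS", date(2014, 9, 4), 3)
print(example)  # BAM_Co-Exp_JS_20140904_v03.txt
```

Generating names this way means the order of elements, the date format, and the zero padding never vary between files.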
Date Tip
BAM Co-Exp Run txt
BAM Co-Exp Run txt
BAM Co-Exp Run txt
vs.
Run 1 B anth meth Sept 4.txt
BAM Rxn _09_04.txt
_meth_3.txt
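A short sketch of why a consistently formatted YYYYMMDD date helps: plain alphabetical sorting then matches chronological order. The file names below are hypothetical and chosen only to show the effect.

```python
# Hypothetical file names: the same three runs, named two different ways.
iso_names = ["run_20141102.txt", "run_20140904.txt", "run_20150115.txt"]
informal_names = ["run_Nov2_2014.txt", "run_Sept4_2014.txt", "run_Jan15_2015.txt"]

# YYYYMMDD dates sort chronologically with an ordinary alphabetical sort.
print(sorted(iso_names))

# Month names sort alphabetically (Jan, Nov, Sept), which is not chronological here.
print(sorted(informal_names))
```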
Controlled Vocabulary
Take the guesswork out of choosing between:
◦ a preferred spelling: behavior vs behaviour
◦ a scientific or popular term: pig vs porcine vs Sus scrofa domesticus
◦ which synonym to use: record vs entry
◦ which abbreviation to use (if you have to): USA vs US
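A controlled vocabulary can be as simple as a lookup table that maps every accepted variant to one preferred term. The sketch below is a minimal illustration; which term counts as "preferred" in each pair is an arbitrary choice made here, and the names `PREFERRED`, `LOOKUP`, and `normalize` are hypothetical.

```python
# Preferred terms and their accepted variants (the choices here are illustrative).
PREFERRED = {
    "behavior": {"behaviour"},
    "Sus scrofa domesticus": {"pig", "porcine"},
    "record": {"entry"},
    "USA": {"US"},
}

# Reverse lookup: lower-cased variant -> preferred term
LOOKUP = {}
for preferred, variants in PREFERRED.items():
    LOOKUP[preferred.lower()] = preferred
    for variant in variants:
        LOOKUP[variant.lower()] = preferred

def normalize(term):
    """Return the preferred term, or raise if the term is not in the vocabulary."""
    key = term.strip().lower()
    if key not in LOOKUP:
        raise ValueError(f"'{term}' is not in the controlled vocabulary")
    return LOOKUP[key]

print(normalize("Behaviour"))  # behavior
print(normalize("pig"))        # Sus scrofa domesticus
```

Raising an error on unknown terms, rather than passing them through, forces a decision about each new term when it first appears.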
Some examples
Putting this all together, we can look at an example project.
Noble’s (2009) bioinformatics project structure
Noble WS (2009) A Quick Guide to Organizing Computational Biology Projects. PLoS Comput Biol 5(7): e doi: /journal.pcbi
An example data project
JeopardyHTMLPlayers/
    Player-1.html
    Player-2.html
    Player-3.html
    etc…
playerDataFiles/
    Player-1.csv
    Player-2.csv
    Player-3.csv
playerdata.csv
jeopardy_scrape.ipynb
jeopardy_dataprep.ipynb
jeopardy_analysis.rmd
jeopardy_analysis.html
readme.txt
visualizations/
    bystate.png
    byregion.png
    kenjennings.png
    etc…
Notes on this structure:
◦ Code to produce these graphs is stored & documented in the rmd file.
◦ Store one distinct entity type per file.
◦ The semantic link between the contents of these files is encoded in the numeric suffixes of the file names (Player-1, Player-2, …), which are the unique entity IDs. Document the meaning of these ID numbers.
◦ Separate folders mean I don’t have to filter from a giant list of files. Create folders when large numbers of similar files accumulate; they are not always required.
◦ Separate scripts by purpose to keep code from being cluttered.
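Because Player-1.html and Player-1.csv share the same ID in their names, related files can be matched up automatically. The following is a rough sketch under the assumption that every per-player file is named exactly `Player-<number>` plus an extension; the function name `pair_by_id` and the pattern are illustrative, not taken from the project itself.

```python
import re
from pathlib import PurePosixPath

ID_PATTERN = re.compile(r"^Player-(\d+)$")  # assumed naming pattern

def pair_by_id(paths):
    """Group paths by the numeric ID found in each file's stem."""
    groups = {}
    for path in map(PurePosixPath, paths):
        match = ID_PATTERN.match(path.stem)
        if match:  # files without an ID (e.g. playerdata.csv) are skipped
            groups.setdefault(int(match.group(1)), []).append(str(path))
    return groups

paths = [
    "JeopardyHTMLPlayers/Player-1.html",
    "JeopardyHTMLPlayers/Player-2.html",
    "playerDataFiles/Player-1.csv",
    "playerDataFiles/Player-2.csv",
    "playerdata.csv",
]
pairs = pair_by_id(paths)
print(pairs[1])  # ['JeopardyHTMLPlayers/Player-1.html', 'playerDataFiles/Player-1.csv']
```

This only works if the ID convention is applied consistently, which is one more reason to document what the ID numbers mean.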
Basic Documentation
Types of Documentation:
◦ Descriptive (e.g., creator, title, keywords)
◦ Structural (e.g., relation to other files)
◦ Administrative (e.g., software & hardware requirements, rights information)
If you know the data will be deposited in a repository, understand the documentation requirements early in the process.
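The three documentation types can be captured in a plain-text readme with one section each. This is a minimal sketch, not a required format: the function name `render_readme`, the field names, and every sample value are placeholders invented for illustration.

```python
def render_readme(descriptive, structural, administrative):
    """Render the three documentation types as sections of a plain-text readme."""
    sections = [
        ("DESCRIPTIVE", descriptive),
        ("STRUCTURAL", structural),
        ("ADMINISTRATIVE", administrative),
    ]
    lines = []
    for heading, fields in sections:
        lines.append(heading)
        lines.extend(f"  {key}: {value}" for key, value in fields.items())
        lines.append("")  # blank line between sections
    return "\n".join(lines)

# All values below are placeholders, not real project metadata.
readme = render_readme(
    descriptive={"title": "Example project", "creator": "A. Researcher", "keywords": "example, demo"},
    structural={"related files": "analysis script reads the combined data file"},
    administrative={"software": "spreadsheet or text editor", "rights": "to be determined"},
)
print(readme)
```

The resulting text could be saved as readme.txt alongside the data it describes.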
Data Documentation Continuum
Informal ReadMe: Low-Barrier, Fast, Easy, Irregular, Incomplete, Low-Quality
Formal Schema: High-Barrier, Slow, Skilled, Standardized, Rich, High-Quality
Activity 3: Arranging your local data catalog
Step 1: Write the name of each data/project from Activity 2 onto a post-it note.
We’re now going to group these items in a few different ways. Use your worksheet to take notes on how effective each strategy is.
◦ Group 1: type of project
◦ Group 2: type of data
◦ Group 3: method of access
Can you think of other combinations?
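If the inventory from Activity 2 is later typed up, the same regrouping can be tried quickly in code. The sketch below assumes a small hand-entered inventory; the entries, the field names (`project`, `data_type`, `access`), and the function `group_by` are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical inventory rows, one per post-it note.
inventory = [
    {"name": "Field photos", "project": "grant", "data_type": "images", "access": "file server"},
    {"name": "Plant specimens", "project": "grant", "data_type": "specimens", "access": "physical access"},
    {"name": "Lab exercises", "project": "class", "data_type": "images", "access": "file server"},
]

def group_by(items, key):
    """Group item names by the value of one inventory column."""
    groups = defaultdict(list)
    for item in items:
        groups[item[key]].append(item["name"])
    return dict(groups)

# Try each grouping strategy in turn and compare the results.
for key in ("project", "data_type", "access"):
    print(key, group_by(inventory, key))
```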
Activity 4: Workflow mapping
Think of your normal analysis workflow or pick one that you commonly perform.
◦ Determine some of the core steps or actions that you take.
◦ Write each step on a post-it note and place them in order. Start very general and add detail as needed.
◦ When one of these actions involves data: draw a line in the middle of the post-it, then write down where that data is located and backed up.
◦ Draw a diagram of your workflow on your worksheet.
Activity discussion/wrap-up
What did we learn from this?
◦ Which grouping made the most sense?
◦ How do these groups compare to how things are currently stored at your home institution? Are any better or worse?
Homework: Take 5 minutes to sketch out a new structure for organizing your data files. Is it possible to implement? Is it possible to maintain over time?