
1 Data Workflow Management, Data Preservation and Stewardship
Greg Hughes and Peter Fox Data Science – ITEC/CSCI/ERTH-4350/6350 Module 9, Week 11, November 15, 2016

2 Contents
- Reading – none from last week
- Scientific Data Workflows
- Data Stewardship
- Summary
- Next class(es)
- Projects

3 Scientific Data Workflows
- What are they?
- Why would you use them?
- Some more detail in the context of Kepler
- Some pointers to other workflow systems

4 What is a workflow?
General definition: a series of tasks performed to produce a final outcome, e.g. following a recipe to bake a pie.
Scientific workflow – a "data analysis pipeline" that:
- automates tedious jobs that scientists traditionally performed by hand for each dataset
- processes large volumes of data faster than scientists could by hand
(See the sketch below.)
Adapted from slides by Bill Howe, UW, 2006
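To make the "data analysis pipeline" idea concrete, here is a minimal sketch in Python; the stage names (load_raw, calibrate, detect_features) and their parameters are hypothetical, invented for illustration rather than taken from any real workflow system:

# A minimal "data analysis pipeline": each task is a plain function,
# and the workflow is their composition over a dataset.

def load_raw(path):
    # Read raw measurements, one floating-point value per line.
    with open(path) as f:
        return [float(line) for line in f]

def calibrate(samples, offset=0.5):
    # Apply a (made-up) instrument offset correction to every sample.
    return [s - offset for s in samples]

def detect_features(samples, threshold=1.0):
    # Keep only values above a threshold: the "areas of interest".
    return [s for s in samples if s > threshold]

def run_pipeline(path):
    # The workflow: the output of each task feeds the next one.
    return detect_features(calibrate(load_raw(path)))

A workflow system adds what this bare script lacks: a visual representation, scheduling, provenance capture, and reuse across datasets.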

5 Background: Business Workflows
Example: planning a trip. You need to perform a series of tasks: book a flight, reserve a hotel room, arrange for a rental car, etc. Each task may depend on the outcome of a previous task:
- the days you reserve the hotel depend on the days of the flight
- if the hotel has shuttle service, you may not need to rent a car
E.g. tripit.com
Adapted from slides by Bill Howe, UW, 2006

6 What about scientific workflows?
Perform a set of transformations/operations on a scientific dataset. Examples (cf. data mining examples):
- generating images from raw data
- identifying areas of interest in a large dataset
- classifying a set of objects
- querying a web service for more information on a set of objects
- many others…
Adapted from slides by Bill Howe, UW, 2006

7 More on Scientific Workflows
Formal models of the flow of data among processing components. May be simple and linear, or more complex. Can process many data types:
- archived data
- streaming sensor data
- images (e.g., medical or satellite)
- simulation output
- observational data
Adapted from slides by Bill Howe, UW, 2006

8 Challenges
Questions:
- What may be some challenges for scientists implementing scientific workflows?
- What are some challenges to executing these workflows?
Adapted from slides by Bill Howe, UW, 2006

9 Challenges
- Mastering a programming language
- Visualizing the workflow, end-to-end
- Sharing/exchanging that workflow, and updating it
- Dealing with data formatting
- Locating datasets, services, or functions (inputs and outputs, metadata…)
Adapted from slides by Bill Howe, UW, 2006

10 E.g. Kepler Scientific Workflow Management System
- Graphical interface for developing and executing scientific workflows
- Scientists create workflows by dragging and dropping components
- Automates low-level data processing tasks
- Provides access to data repositories, compute resources, and workflow libraries
Adapted from slides by Bill Howe, UW, 2006

11 Additional Benefits of Scientific Workflows
- Documentation of aspects of the analysis
- Visual communication of analytical steps
- Ease of testing/debugging
- Reproducibility
- Reuse of part or all of a workflow in a different project
Adapted from slides by Bill Howe, UW, 2006

12 Additional Benefits
- Integration of multiple computing (deployment) environments
- Automated access to distributed resources via web services and Cloud technologies
- System functionality to assist with integration of heterogeneous components
Adapted from slides by Bill Howe, UW, 2006

13 Why not just use a script?
- Scripts do not (usually) specify low-level task scheduling and communication
- Scripts are often very platform-dependent, so they can't easily be reused
- Scripts may not have sufficient documentation to be adapted for another purpose
Based on Howe (2006)

14 Why is a GUI useful?
- No need to learn a programming language
- Visual representation of what the workflow does
- Allows you to monitor workflow execution
- Enables user interaction
- Facilitates sharing of workflows
Adapted from Howe (2006)

15 The Kepler Project – Goals
- Produce an open-source scientific workflow system that enables scientists to design and execute scientific workflows
- Support scientists in a variety of disciplines, e.g., biology, ecology, astronomy
- Important features:
  - access to scientific data
  - flexible means for executing complex analyses
  - enable use of Grid-based approaches to distributed computation
  - semantic models of scientific tasks
  - effective UI for workflow design
Slides from Matt Jones' presentation

16 Distributed execution
Opportunities for parallel execution:
- fine-grained parallelism
- coarse-grained parallelism
'Trivially parallel' problems have few or no cycles and limited dependencies among components; many science problems fit this mold, e.g. parameter sweeps and iteration of stochastic models (see the sketch below).
Current 'plumbing' approaches to distributed execution: the workflow acts as a controller, stages data resources, writes job description files, and controls execution of jobs on nodes; this requires expert understanding of the Grid system. Scientists need to focus on just the computations, and avoid the plumbing as much as possible.
Slides from Matt Jones' presentation
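A sketch of the 'trivially parallel' parameter-sweep pattern mentioned above, using only the Python standard library; simulate() is a hypothetical stand-in for one run of a stochastic model:

# Each parameter setting is independent (no cycles, no dependencies),
# so a process pool can execute the sweep concurrently with no
# communication between tasks.
from multiprocessing import Pool
import random

def simulate(growth_rate):
    # Toy stochastic model: average of noisy outcomes for one setting.
    rng = random.Random()  # independently seeded in each worker
    runs = [growth_rate * (1 + rng.gauss(0, 0.1)) for _ in range(1000)]
    return growth_rate, sum(runs) / len(runs)

if __name__ == "__main__":
    sweep = [0.1 * i for i in range(1, 11)]  # the parameter grid
    with Pool() as pool:  # one worker per CPU core by default
        for rate, mean in pool.map(simulate, sweep):
            print(f"growth_rate={rate:.1f} -> mean outcome {mean:.3f}")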

17 Distributed Kepler
- Higher-order component for executing a model on one or more remote nodes
- Main and client controllers handle setup and communication among nodes, and establish data channels
- Extremely easy for a scientist to utilize: requires no knowledge of grid computing systems
(Diagram: IN and OUT data channels between the main and client controllers.)
Slides from Matt Jones' presentation

18 Managing Data Heterogeneity?
Data comes from heterogeneous sources:
- real-world observations
- spatial-temporal contexts
- collection/measurement protocols and procedures
There are many representations for the same information (count, area, density), and heterogeneity at the data, syntax, schema, and semantic levels. Discovery and "synthesis" (integration) are performed manually, with discovery often based on an intuitive notion of "what is out there". Synthesis of data is very time-consuming, and that limits use.

19 Scientific workflow systems support data analysis
(Screenshot: the Kepler workflow environment.)

20 A simple Kepler workflow
(Screenshot: a composite component, i.e. a sub-workflow.)
Loops are often used in scientific workflows, e.g., in genomics and bioinformatics (collections of data, nested data, statistical regressions, ...). (T. McPhillips)

21 A simple Kepler workflow
- Lists Nexus files to process (project)
- Reads text files
- Parses Nexus format
- Draws phylogenetic trees
PhylipPars infers trees from discrete, multi-state characters. The workflow runs PhylipPars iteratively to discover all of the most parsimonious trees; UniqueTrees discards redundant trees in each collection. (T. McPhillips)

22 A simple Kepler workflow
An example workflow run, executed as a Dataflow Process Network
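A rough sketch of what "Dataflow Process Network" means here (this illustrates the execution model behind Kepler's PN director, not Kepler's actual API): each actor runs independently and blocks on its input queue, so it fires as soon as tokens arrive on its input channel.

import threading, queue

DONE = object()  # end-of-stream token

def source(out_q):
    # Emits a fixed stream of tokens, then signals completion.
    for token in [3, 1, 4, 1, 5]:
        out_q.put(token)
    out_q.put(DONE)

def doubler(in_q, out_q):
    # Transforms each token as it arrives; order is preserved.
    while (token := in_q.get()) is not DONE:
        out_q.put(token * 2)
    out_q.put(DONE)

def sink(in_q):
    while (token := in_q.get()) is not DONE:
        print("result:", token)

q1, q2 = queue.Queue(), queue.Queue()  # the "wires" between actors
actors = [threading.Thread(target=source, args=(q1,)),
          threading.Thread(target=doubler, args=(q1, q2)),
          threading.Thread(target=sink, args=(q2,))]
for t in actors: t.start()
for t in actors: t.join()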

23 Workflow validation in Kepler
- Statically perform semantic and structural type checking
- Navigate errors and warnings within the workflow
- Search for and insert "adapters" to fix (structural and semantic) errors, and likewise to "beautify" the workflow

24 + Provenance Framework
Tracks origin and derivation information about scientific workflows, their runs, and derived information (datasets, metadata…).
Types of provenance information:
- Data provenance: intermediate and end results, including files and database references
- Process (= workflow instance) provenance: keep the workflow definition with the data and parameters used in the run
- Error and execution logs
- Workflow-design provenance
3 slides from Ilkay Altintas
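A sketch of what such records might contain, covering the data, process, and error/execution categories above; the field names are illustrative, not Kepler's actual provenance schema:

import datetime, getpass, hashlib, json

def file_checksum(path):
    # Fingerprint a file so a result can later be matched to its inputs.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def provenance_record(workflow, parameters, inputs, outputs, log):
    return {
        "user": getpass.getuser(),
        "date": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "workflow": workflow,                               # process provenance
        "parameters": parameters,                           # run configuration
        "inputs": {p: file_checksum(p) for p in inputs},    # data provenance
        "outputs": {p: file_checksum(p) for p in outputs},
        "execution_log": log,                               # errors and events
    }

# Example (hypothetical run):
# rec = provenance_record("tree_inference", {"iterations": 100},
#                         ["data.nex"], ["trees.out"], ["completed ok"])
# print(json.dumps(rec, indent=2))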

25 Kepler Provenance Recording Utility
- Parametric and customizable
- Different report formats
- Variable levels of detail: verbose-all, verbose-some, medium, on-error
- Multiple cache destinations
- Saves information on user name, date, run, etc.

26 Some other workflow systems
- SCIRun
- Sciflo
- Triana
- Taverna
- Pegasus
Some commercial tools: Windows Workflow Foundation, Mac OS X Automator. See the reading for this week.

27 Workflows are important for your last assignment…

28 Now… Data Stewardship
Combining multiple data life-cycle and management aspects together. Keep these ideas in mind as you complete your assignments:
- why it is important
- some examples

29 Why it is important
You need the ability to read the underlying sources, e.g. the data formats, metadata formats, knowledge formats, etc., and the ability to know the inter-relations, assumptions, and missing information. We'll look at a (data) use case for this shortly, but first we will look at the what, how, and who of the full life cycle.

30 What to collect?
- Documentation
- Metadata
- Provenance
- Ancillary information
- Knowledge

31 Who does this? Roles:
- Data creator
- Data analyst
- Data manager
- Data curator

32 How it is done
- Opening and examining Archive Information Packages! Yes, people look at them.
- Reviewing data management plans and documentation! Yes, people look at them.
- Talking (!) to the people: the data creator, data analyst, data manager, and data curator
- Sometimes, reading the data and the code

33 Data-Information-Knowledge Ecosystem
(Diagram: producers and consumers in a cycle from Data through Information to Knowledge, via creation, gathering, organization, integration, presentation, and conversation, all within context and experience.)

34 Remember - Acquisition
- Learn/read what you can about the developer of the means of acquisition; documents may not be easy to find
- Remember bias!!!
- Document things as you go
- Have a checklist (see the Data Management list) and review it often

35 As soon as you even think about semantics and knowledge encoding/representation, knowledge is everywhere… (Fox, VSTO et al.)

36 Curation
- Consider the "changes" in the organization and presentation of the data
- Document what has been (and has not been) done
- Consider and address the provenance of the data to date; you are now THE next person in the workflow…
- Be as technology-neutral as possible
- Look to add information and meta-information

37 Preservation
Usually refers to the latter part of the data life cycle; archiving is only one component. The intent is that 'you can open it any time in the future' and that 'it will be there'. This involves steps that may not be conventionally thought of. Think 10, 20, 50, 200 years (or 1 hour!)… looking historically gives some guide to future considerations.

38 Some examples and experience
NASA, NOAA, and the library community. Note: mostly in relation to publications, books, etc., but some for data. Note that knowledge is in publications, but the structural form is meant for humans, not computers, despite advances in text analysis and NLP. Very little exists for the type of knowledge -> data we are considering, i.e. in machine-accessible form.
NOAA – National Oceanic and Atmospheric Administration
NASA – National Aeronautics and Space Administration

39 HDF4 Format "Maps" for Long-Term Readability
Use case: a real live one; deals mostly with structure and (some) content.
C. Lynnes, GES DISC; R. Duerr and J. Crider, NSIDC; M. Yang and P. Cao, The HDF Group. Courtesy Chris Lynnes and Ruth Duerr.
HDF = Hierarchical Data Format
NSIDC = National Snow and Ice Data Center
GES DISC = Goddard Earth Sciences Data and Information Services Center

40 In the year …
A user of HDF-4 data will run into the following likely hurdles:
- The HDF-4 API and utilities are no longer supported... now that we are at HDF-7
- The archived API binary does not work on today's OS's... like Android 9.1
- The source does not compile on the current OS... or is it the compiler version, gcc v. 12.x?
- The HDF spec is too complex to write a simple read program... without re-creating much of the API
What to do?
Courtesy Chris Lynnes and Ruth Duerr
HDF-4 is version 4 of the HDF format, and a very substantial amount of NASA data is stored in HDF-4 format. HDF is both a file format and an API. API = Application Programming Interface.

41 HDF Mapping Files
Concept: create text-based "maps" (cf. Mealy!!) of the HDF-4 file layouts while we still have a viable HDF-4 API (i.e., now). The maps are XML, stored separately from, but close to, the data files. Each map includes:
- internal metadata
- variable info
- chunk-level info
- byte offsets and lengths
- linked blocks
- compression information
Task funded by the ESDIS project: The HDF Group, NSIDC, and GES DISC. In XML -> ASCII format!
Courtesy Chris Lynnes and Ruth Duerr

42 Map sample (extract)

<hdf4:SDS objName="TotalCounts_A" objPath="/ascending/Data Fields" objID="xid-DFTAG_NDG-5">
  <hdf4:Attribute name="_FillValue" ntDesc="16-bit signed integer">
    0 0
  </hdf4:Attribute>
  <hdf4:Datatype dtypeClass="INT" dtypeSize="2" byteOrder="BE" />
  <hdf4:Dataspace ndims="2">
  </hdf4:Dataspace>
  <hdf4:Datablock nblocks="1">
    <hdf4:Block offset=" " nbytes="20582" compression="coder_type=DEFLATE" />
  </hdf4:Datablock>
</hdf4:SDS>

Courtesy Chris Lynnes and Ruth Duerr
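A sketch of why such a map is enough for long-term readability: given the byte offset, length, datatype, and compression recorded in the map, the data block can be decoded with no HDF-4 library at all. The values below mirror the extract above; note the offset attribute is blank in this transcript, so a caller must supply it.

import struct, zlib

def read_sds_block(hdf_path, offset, nbytes):
    # Per the map: dtypeClass="INT", dtypeSize="2", byteOrder="BE",
    # compression coder_type=DEFLATE -> zlib-compressed big-endian int16.
    with open(hdf_path, "rb") as f:
        f.seek(offset)          # <hdf4:Block offset=...>
        raw = f.read(nbytes)    # <hdf4:Block nbytes="20582">
    data = zlib.decompress(raw)
    count = len(data) // 2      # dtypeSize="2"
    return struct.unpack(f">{count}h", data)  # ">h" = big-endian int16

# values = read_sds_block("granule.hdf", offset=..., nbytes=20582)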

43 Their plans were… (alas, no funds)
Status:
- Map creation utility (part of HDF)
- Prototype read programs in C and Perl
- Paper in a TGRS special issue (a journal)
- Inventory of HDF-4 data products within EOSDIS (the NASA data system)
Future steps:
- Revise the XML schema
- Revise the map utility and add it to the HDF baseline
- Implement map creation and storage operationally, e.g., add to ECS or S4PA metadata files (conventions)
Courtesy Chris Lynnes and Ruth Duerr
ECS = EOSDIS Core System; S4PA = Simple, Scalable, Script-based Science Processing Archive; EOSDIS = Earth Observing System Data and Information System; TGRS = IEEE Transactions on Geoscience and Remote Sensing

44 But to really preserve? CONTEXT!

45 NASA/MODIS Contextual Info
Slides 45-55: 7th Joint ESDSWG meeting, October 22, Philadelphia, PA; Data Lifecycle Workshop sponsored by the Technology Infusion Working Group. Presented by R. Duerr at the Summer Institute on Data Curation, June 2-5, 2008, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign.

46 Instrument/sensor characteristics
Only one significant property - the scientific integrity of the data.

47 Processing Algorithms & Scientific Basis
Only one significant property - the scientific integrity of the data.

48 Ancillary Data
Only one significant property - the scientific integrity of the data.

49 Processing History including Source Code
Only one significant property - the scientific integrity of the data.

50 Quality Assessment Information
Do the results make sense? How likely is it that someone recorded either the input or the output incorrectly? That the machine wasn't functioning properly, wasn't configured correctly, etc.?

51 Validation Information
Are there standard samples that can be used?

52 Other Factors that can Influence the Record
For example, how does ambient humidity affect the calorimetry results?

53 Bibliography

54 Contextual Information:
- Instrument/sensor characteristics, including pre-flight or pre-operational performance measurements (e.g., spectral response, noise characteristics, etc.)
- Instrument/sensor calibration data and method
- Processing algorithms and their scientific basis, including a complete description of any sampling or mapping algorithm used in creation of the product (e.g., contained in peer-reviewed papers, in some cases supplemented by thematic information introducing the data set or derived product)
- Complete information on any ancillary data or other data sets used in generation or calibration of the data set or derived product
Only one significant property - the scientific integrity of the data.
Courtesy: Ruth Duerr

55 Contextual Information (continued):
- Processing history, including versions of processing source code corresponding to versions of the data set or derived product held in the archive
- Quality assessment information
- Validation record, including identification of validation data sets
- Data structure and format, with definitions of all parameters and fields
- In the case of earth-based data, station location and any changes in location, instrumentation, controlling agency, surrounding land use, and other factors which could influence the long-term record
- A bibliography of pertinent Technical Notes and articles, including refereed publications reporting on research using the data set
- Information received back from users of the data set or product
Only one significant property - the scientific integrity of the data.
Courtesy: Ruth Duerr

56 However…
Even groups like NASA do not have a governance model for this work.
Governance is the activity of governing. It relates to decisions that define expectations, grant power, or verify performance. It consists either of a separate process or of a specific part of management or leadership processes. Sometimes people set up a government to administer these processes and systems. (Wikipedia)

57 Who cares… Stakeholders:
- NASA, for the integrity of their data holdings (is it their responsibility?)
- The public, for value for and return on investment
- Scientists, for future use (intended and unintended)
- Historians

58 Library community – recording!
- OAIS – Open Archival Information System
- OAI (PMH and ORE) – Open Archives Initiative (Protocol for Metadata Harvesting, and Object Reuse and Exchange)
Do some reading on your own for this.

59 Back to why you need to…
E-science uses data, and the data needs to be around when what you create goes into service and you move on to something else. That's why someone on the team must address the life cycle (of data, information, and knowledge) and work with other team members to implement organizational, social, and technical solutions to the requirements. And, you ask, how do we know what is what?

60 (Digital) Object Identifiers
'Object' is used here so as not to pre-empt an implementation, e.g. resource, sample, data, catalog.
- DOI, e.g. /s – visit crossref.org and see where this leads you (see the sketch below)
- URI, e.g. XRI (from OASIS)
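A sketch of "see where this leads you" in code: a DOI resolves through doi.org, and HTTP content negotiation can return machine-readable citation metadata for it. Standard library only; the DOI below is a placeholder, not the (truncated) one from the slide.

import json
import urllib.request

def resolve_doi(doi):
    # Ask the DOI resolver for CSL JSON citation metadata instead of
    # following the redirect to the publisher's landing page.
    req = urllib.request.Request(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/vnd.citationstyles.csl+json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# meta = resolve_doi("10.1234/example-doi")  # placeholder DOI
# print(meta["title"])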

61 And not least… Versioning
Versioning is a key enabler of good preservation, and a tricky trap for those who do not conform to written guidelines for versioning.
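One possible sketch (not from the slides) of making versions explicit and checkable: a manifest that pins a version identifier to the exact checksums of the files it covers, so a later reader can verify they hold the version the guidelines name.

import hashlib, json, pathlib

def make_manifest(version, paths):
    # Bind a human-readable version tag to exact file contents.
    return {
        "version": version,  # e.g. "v2.1.0", per your written guidelines
        "files": {str(p): hashlib.sha256(pathlib.Path(p).read_bytes()).hexdigest()
                  for p in paths},
    }

# with open("MANIFEST.json", "w") as out:
#     json.dump(make_manifest("v2.1.0", ["data.csv"]), out, indent=2)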

62 Summary
The progression toward more formal encoding of science workflows (in our context, data-science workflows, i.e. dataflows) is substantially improving data management. Awareness of preservation and stewardship for valuable data and information resources is receiving renewed attention in the digital age. Workflows are a potential solution to the data stewardship challenge.

63 Final assignment
The final assignment is now on the website (10% of grade).

64 What is next
- Next week – Thanksgiving Tuesday (no lecture; come in and work with your teams on the project!)
- Week after – Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration
- Reading for this week – see the web site

