1 Peter Fox Data Science – ITEC/CSCI/ERTH-6961-01 Week 11, November 15, 2011 Data Workflow Management, Data Stewardship.


Contents
Scientific Data Workflows
Data Stewardship
Summary
Next class(es)

Scientific Data Workflow
What it is
Why you would use it
Some more detail in the context of Kepler
– Some pointers to other workflow systems

What is a workflow?
General definition: a series of tasks performed to produce a final outcome
Scientific workflow: a “data analysis pipeline”
– Automates tedious jobs that scientists traditionally performed by hand for each dataset
– Processes large volumes of data faster than scientists could by hand

Background: Business Workflows
Example: planning a trip
Need to perform a series of tasks: book a flight, reserve a hotel room, arrange for a rental car, etc.
Each task may depend on the outcome of a previous task
– The days you reserve the hotel depend on the days of the flight
– If the hotel has a shuttle service, you may not need to rent a car

What about scientific workflows?
Perform a set of transformations/operations on a scientific dataset
Examples
– Generating images from raw data
– Identifying areas of interest in a large dataset
– Classifying a set of objects
– Querying a web service for more information on a set of objects
– Many others…

More on Scientific Workflows
Formal models of the flow of data among processing components
May be simple and linear or more complex
Can process many data types:
– Archived data
– Streaming sensor data
– Images (e.g., medical or satellite)
– Simulation output
– Observational data
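The idea of a workflow as a formal model of data flowing among processing components can be sketched in a few lines of Python. This is only an illustration and not how Kepler represents workflows: the step names (`read_raw`, `calibrate`, `make_image`, `summarize`), the toy data, and the `run_workflow` helper are all hypothetical, and the standard-library `graphlib.TopologicalSorter` (Python 3.9+) stands in for a real workflow engine's scheduler.

```python
from graphlib import TopologicalSorter

# Hypothetical processing components; each one consumes the outputs
# of its upstream steps and returns a new data product.
def read_raw():
    return [3.0, 4.5, 5.5]

def calibrate(raw):
    return [x * 2.0 for x in raw]

def make_image(cal):
    # Crude "image": one character per calibrated value.
    return "".join("#" if x > 8.0 else "." for x in cal)

def summarize(cal):
    return sum(cal) / len(cal)

# The workflow itself: step name -> (function, names of upstream steps).
WORKFLOW = {
    "read_raw": (read_raw, []),
    "calibrate": (calibrate, ["read_raw"]),
    "make_image": (make_image, ["calibrate"]),
    "summarize": (summarize, ["calibrate"]),
}

def run_workflow(workflow):
    """Run every step after the steps it depends on and collect all results."""
    graph = {name: set(deps) for name, (_, deps) in workflow.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = workflow[name]
        results[name] = func(*(results[d] for d in deps))
    return results

results = run_workflow(WORKFLOW)
```

Because the dependencies are declared as data rather than buried in control flow, the same description can be drawn, shared, or rerun, which is the property the slides attribute to workflow systems.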

Challenges
Questions:
– What are some challenges for scientists implementing scientific workflows?
– What are some challenges to executing these workflows?
– What are limitations of writing a program?

Challenges
Mastering a programming language
Visualizing workflow
Sharing/exchanging workflow
Formatting issues
Locating datasets, services, or functions

Kepler Scientific Workflow Management System
Graphical interface for developing and executing scientific workflows
Scientists can create workflows by dragging and dropping
Automates low-level data processing tasks
Provides access to data repositories, compute resources, workflow libraries

Benefits of Scientific Workflows
Documentation of aspects of analysis
Visual communication of analytical steps
Ease of testing/debugging
Reproducibility
Reuse of part or all of workflow in a different project

Additional Benefits
Integration of multiple computing environments
Automated access to distributed resources via web services and Grid technologies
System functionality to assist with integration of heterogeneous components

Why not just use a script?
Script does not specify low-level task scheduling and communication
May be platform-dependent
Can’t be easily reused
May not have sufficient documentation to be adapted for another purpose

Why is a GUI useful?
No need to learn a programming language
Visual representation of what workflow does
Allows you to monitor workflow execution
Enables user interaction
Facilitates sharing of workflows

The Kepler Project
Goals
– Produce an open-source scientific workflow system
  o enable scientists to design scientific workflows and execute them
– Support scientists in a variety of disciplines
  o e.g., biology, ecology, astronomy
– Important features
  o access to scientific data
  o flexible means for executing complex analyses
  o enable use of Grid-based approaches to distributed computation
  o semantic models of scientific tasks
  o effective UI for workflow design

Usage statistics
Source code access
– 154 people accessed source code
– 30 members have write permission
– Projects using Kepler: SEEK (ecology); SciDAC (molecular bio, ...); CPES (plasma simulation); GEON (geosciences); CiPRes (phylogenetics); CalIT2; ROADnet (real-time data); LOOKING (oceanography); CAMERA (metagenomics); Resurgence (computational chemistry); NORIA (ocean observing CI); NEON (ecology observing CI); ChIP-chip (genomics); COMET (environmental science); Cheshire Digital Library (archival); Digital preservation (DIGARCH); Cell Biology (Scripps); DART (X-Ray crystallography); Ocean Life; Assembling the Tree of Life project; Processing Phylodata (pPOD); FermiLab (particle physics)
Kepler downloads
– Total = 9204
– Beta = 6675
– Chart legend: red = Windows, blue = Macintosh

Distributed execution
Opportunities for parallel execution
– Fine-grained parallelism
– Coarse-grained parallelism
  o Few or no cycles
  o Limited dependencies among components
  o ‘Trivially parallel’
  o Many science problems fit this mold: parameter sweep, iteration of stochastic models
Current ‘plumbing’ approaches to distributed execution
– workflow acts as a controller
  o stages data resources
  o writes job description files
  o controls execution of jobs on nodes
– requires expert understanding of the Grid system
Scientists need to focus on just the computations
– try to avoid plumbing as much as possible
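A parameter sweep is the simplest case of the ‘trivially parallel’ pattern described above: every run is independent, so the runs can be handed to a pool of workers without any communication between them. The sketch below is illustrative only; `run_model` is a made-up toy stochastic model, and a local thread pool stands in for the Grid nodes a real system would use (for CPU-bound models a process pool or a cluster scheduler would be the more realistic choice).

```python
import random
from concurrent.futures import ThreadPoolExecutor

def run_model(growth_rate, seed=0, steps=50):
    """Hypothetical toy model: noisy exponential population growth."""
    rng = random.Random(seed)  # private generator, so runs share no state
    population = 100.0
    for _ in range(steps):
        population *= 1.0 + growth_rate + rng.uniform(-0.01, 0.01)
    return population

def sweep(rates, max_workers=4):
    """Run the model once per parameter value, in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as its inputs.
        return dict(zip(rates, pool.map(run_model, rates)))

rates = [0.00, 0.01, 0.02, 0.05]
results = sweep(rates)
```

The scientist writes only `run_model`; the pool handles the ‘plumbing’ of dispatching jobs and collecting results, which is the division of labour the slide argues for.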

Distributed Kepler
– Higher-order component for executing a model on one or more remote nodes
– Master and slave controllers handle setup and communication among nodes, and establish data channels
– Extremely easy for scientist to utilize: requires no knowledge of grid computing systems
[Diagram labels: IN, OUT, Master, Slave, Controller]

Data Management
Need for integrated management of external data
– EarthGrid access is partial, needs refactoring
– Include other data sources, such as JDBC, OPeNDAP, etc.
– Data needs to be a first-class object in Kepler, not just represented as an actor
– Need support for data versioning to support provenance
Need to pass data by reference
– workflows contain large data tokens (100’s of megabytes)
– intelligent handling of unique identifiers (e.g., LSID)
[Diagram labels: actors A and B; Token {1,5,2}; Token ref-276 {1,5,2}]
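Passing data by reference can be illustrated with a small sketch: instead of handing a multi-megabyte payload from one component to the next, the producer stores it once and passes along a short identifier. Everything here is hypothetical (the `DataStore` class, the `ref-` identifier format, the in-memory dictionary); it is not Kepler's token mechanism or the LSID scheme, only the general idea.

```python
import hashlib

class DataStore:
    """In-memory stand-in for an external data repository (hypothetical)."""

    def __init__(self):
        self._blobs = {}

    def put(self, payload: bytes) -> str:
        # The identifier is derived from the content, so changed data
        # always receives a new reference, which helps versioning.
        ref = "ref-" + hashlib.sha256(payload).hexdigest()[:12]
        self._blobs[ref] = payload
        return ref

    def get(self, ref: str) -> bytes:
        return self._blobs[ref]

def producer(store):
    """Upstream component: stores a large payload, emits only a reference."""
    payload = bytes(range(256)) * 4096  # 1 MiB of sample data
    return store.put(payload)

def consumer(store, ref):
    """Downstream component: resolves the reference only when it needs the data."""
    return len(store.get(ref))

store = DataStore()
token = producer(store)        # a 16-character string travels through the workflow
size = consumer(store, token)  # the payload itself never leaves the store
```

Intermediate components that only route the token never touch the payload, and the reference doubles as a stable name that provenance records can cite.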

Science Environment for Ecological Knowledge
SEEK is an NSF-funded, multidisciplinary research project to facilitate …
Access to distributed ecological, environmental, and biodiversity data
– Enable data sharing & reuse
– Enhance data discovery at global scales
Scalable analysis and synthesis
– Taxonomic, spatial, temporal, conceptual integration of data, addressing data heterogeneity issues
– Enable communication and collaboration for analysis
– Enable reuse of analytical components
– Support scientific workflow design and modeling

SEEK data access, analysis, mediation
Data Access (EcoGrid)
– Distributed data network for environmental, ecological, and systematics data
– Interoperate diverse environmental data systems
Workflow Tools (Kepler)
– Problem-solving environment for scientific data analysis and visualization → “scientific workflows”
Semantic Mediation (SMS)
– Leverage ontologies for “smart” data/component discovery and integration

Managing Data Heterogeneity
Data comes from heterogeneous sources
– Real-world observations
– Spatial-temporal contexts
– Collection/measurement protocols and procedures
– Many representations for the same information (count, area, density)
– Data, Syntax, Schema, Semantic heterogeneity
Discovery and “synthesis” (integration) performed manually
– Discovery often based on intuitive notion of “what is out there”
– Synthesis of data is very time consuming, and limits use

Scientific workflow systems support data analysis KEPLER

A simple Kepler workflow (T. McPhillips)
Composite Component (Sub-workflow)
Loops often used in SWFs; e.g., in genomics and bioinformatics (collections of data, nested data, statistical regressions, ...)

A simple Kepler workflow (T. McPhillips)
Figure callouts:
– Workflow runs PhylipPars iteratively to discover all of the most parsimonious trees.
– UniqueTrees discards redundant trees in each collection.
– Lists Nexus files to process (project)
– Reads text files
– Parses Nexus format
– Draws phylogenetic trees
– PhylipPars infers trees from discrete, multi-state characters.

A simple Kepler workflow
An example workflow run, executed as a Dataflow Process Network

SMS motivation
Scientific Workflow Life-cycle
– Resource Discovery
  o discover relevant datasets
  o discover relevant actors or workflow templates
– Workflow Design and Configuration
  o data ↔ actor (data binding)
  o data ↔ data (data integration / merging / interlinking)
  o actor ↔ actor (actor / workflow composition)
Challenge: do all this in the presence of …
– 100’s of workflows and templates
– 1000’s of actors (e.g. actors for web services, data analytics, …)
– 10,000’s of datasets
– 1,000,000’s of data items
– … highly complex, heterogeneous data
– price to pay for these resources: $$$ (lots)
– scientist’s time wasted: priceless!

Some other workflow systems
SCIRun
Sciflo
Triana
Taverna
Pegasus
Some commercial tools:
– Windows Workflow Foundation
– Mac OS X Automator
Survey.pdf
See reading for this week

Data Stewardship
Putting a number of data life cycle, management aspects together
Keep the ideas in mind as you complete your assignments
Why it is important
Some examples

Why it is important
1976 NASA Viking mission to Mars (A. Hesseldahl, Saving Dying Data, Sep. 12, 2002, Forbes. [Online]. Available: )
BBC Digital Domesday (A. Jesdanun, “Digital memory threatened as file formats evolve,” Houston Chronicle, Jan. 16, [Online]. Available: )
R. Duerr, M. A. Parsons, R. Weaver, and J. Beitler, “The international polar year: Making data available for the long-term,” in Proc. Fall AGU Conf., San Francisco, CA, Dec [Online]. Available: ftp://sidads.colorado.edu/pub/ppp/conf_ppp/Duerr/The_International_Polar_Year:_Making_Data_and_Information_Available_for_the_Long_Term.ppt

At the heart of it
Inability to read the underlying sources, e.g. the data formats, metadata formats, knowledge formats, etc.
Inability to know the inter-relations, assumptions and missing information
We’ll look at a (data) use case for this shortly
But first we will look at what, how and who in terms of the full life cycle

What to collect?
Documentation
– Metadata
– Provenance
Ancillary Information
Knowledge

Who does this?
Roles:
– Data creator
– Data analyst
– Data manager
– Data curator

How it is done
Opening and examining Archive Information Packages
Reviewing data management plans and documentation
Talking (!) to the people:
– Data creator
– Data analyst
– Data manager
– Data curator
Sometimes, reading the data and code

Data-Information-Knowledge Ecosystem
[Diagram labels: Data, Information, Knowledge; Producers, Consumers; Context, Presentation, Organization, Integration, Conversation, Creation, Gathering, Experience]

Acquisition
Learn / read what you can about the developer of the means of acquisition
– Documents may not be easy to find
– Remember bias!!!
Document things as you go
Have a checklist (see the Data Management list) and review it often

Fox VSTO et al. 37

Curation (partial)
Consider the organization and presentation of the data
Document what has been (and has not been) done
Consider and address the provenance of the data to date; you are now THE next person
Be as technology-neutral as possible
Look to add information and metainformation

Preservation
Usually refers to the full life cycle
Archiving is a component
Stewardship is the act of preservation
Intent is that ‘you can open it any time in the future’ and that ‘it will be there’
This involves steps that may not be conventionally thought of
Think 10, 20, 50, 200 years…. looking historically gives some guide to future considerations

Some examples and experience
NASA, NOAA
http://wiki.esipfed.org/index.php/Preservation_and_Stewardship
Library community
Note:
– Mostly in relation to publications, books, etc., but some for data
– Knowledge is in publications, but their structure is meant for humans, not computers, despite advances in text analysis
– Very little for the type of knowledge we are considering: in machine-accessible form

Back in the day...
NASA SEEDS Working Group on Data Lifecycle
Second Workshop Report
o Many LTA recommendations
Earth Sciences Data Lifecycle Report
o Many lessons learned from USGS experience, plus some recommendations
SEEDS Final Report (2003) - Section 4
o Final recommendations vis a vis data lifecycle
MODIS Pilot Project
o GES DISC, MODAPS, NOAA/CLASS, ESDIS effort
o Transferred some MODIS Level 0 data to CLASS

Mostly Technical Issues
Data Preservation
o Bit-level integrity
o Data readability
Documentation
Metadata
Semantics
Persistent Identifiers
Virtual Data Products
Lineage Persistence
Required ancillary data
Applicable standards
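Bit-level integrity is commonly checked by recording a checksum for each file at ingest and recomputing it later (often called a fixity check). The sketch below shows the idea with SHA-256 from Python's standard library. The helper names (`sha256_of`, `build_manifest`, `verify`), the file name, and the dictionary manifest are assumptions made for illustration, not a format prescribed by any archive named in these slides.

```python
import hashlib
import os
import tempfile

def sha256_of(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large granules need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(paths):
    """Record the checksum of every file at ingest time."""
    return {p: sha256_of(p) for p in paths}

def verify(manifest):
    """Return the paths whose current checksum differs from the manifest."""
    return [p for p, expected in manifest.items() if sha256_of(p) != expected]

# Demonstration on a throwaway file (hypothetical name).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "granule.dat")
    with open(path, "wb") as f:
        f.write(b"\x00\x01\x02\x03" * 1000)
    manifest = build_manifest([path])
    intact = verify(manifest)          # nothing has changed yet

    with open(path, "r+b") as f:       # simulate silent corruption of one byte
        f.seek(10)
        f.write(b"\xff")
    damaged = verify(manifest)         # the altered file is now reported
```

A check like this only detects damage; recovering from it still depends on keeping independent copies, and it says nothing about the second bullet, whether the bits can still be interpreted.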

Mostly Non-Technical Issues
Policy (constrained by money…)
Front end of the lifecycle
o Long-term planning, data formats, documentation...
Governance and policy
Legal requirements
Archive to archive transitions
Money (intertwined with policy)
Cost-benefit trades
Long-term needs of NASA Science Programs
User input
o Identifying likely users
Levels of service
Funding source and mechanism

HDF4 Format "Maps" for Long Term Readability
C. Lynnes, GES DISC
R. Duerr and J. Crider, NSIDC
M. Yang and P. Cao, The HDF Group
Use case: a real live one; deals mostly with structure and (some) content
HDF = Hierarchical Data Format
NSIDC = National Snow and Ice Data Center
GES = Goddard Earth Science
DISC = Data and Information Service Center

In the year
A user of HDF-4 data will run into the following likely hurdles:
The HDF-4 API and utilities are no longer supported...
o ...now that we are at HDF-7
The archived API binary does not work on today's OS's
o ...like Android 3.1
The source does not compile on the current OS
o ...or is it the compiler version, gcc v. 7.x?
The HDF spec is too complex to write a simple read program...
o ...without re-creating much of the API
What to do?

HDF Mapping Files
Concept: create text-based "maps" of the HDF-4 file layouts while we still have a viable HDF-4 API (i.e., now)
XML
Stored separately from, but close to the data files
Includes
o internal metadata
o variable info
o chunk-level info
  – byte offsets and length
  – linked blocks
  – compression information
Task funded by ESDIS project
The HDF Group, NSIDC and GES DISC
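The point of a map is that a future reader needs only the map, a byte-level file reader, and a decompressor, with no HDF library at all. The sketch below shows that idea under loud assumptions: the XML element and attribute names (`map`, `variable`, `chunk`, `offset`, `length`, `compression`, `byteorder`) are invented for this example and are not the actual HDF-4 map schema, the "data file" is a byte string built on the spot, and only 32-bit floats with optional deflate compression are handled.

```python
import struct
import xml.etree.ElementTree as ET
import zlib

# Hypothetical map: where the variable's bytes live and how they are encoded.
MAP_XML = """
<map file="example.bin">
  <variable name="temperature" dtype="float32" byteorder="big">
    <chunk offset="16" length="{length}" compression="deflate"/>
  </variable>
</map>
"""

def read_variable(file_bytes, map_xml, name):
    """Recover one variable using only the map, byte slicing, and zlib."""
    root = ET.fromstring(map_xml)
    var = next(v for v in root.iter("variable") if v.get("name") == name)
    raw = b""
    for chunk in var.iter("chunk"):
        start = int(chunk.get("offset"))
        blob = file_bytes[start:start + int(chunk.get("length"))]
        if chunk.get("compression") == "deflate":
            blob = zlib.decompress(blob)
        raw += blob
    order = ">" if var.get("byteorder") == "big" else "<"
    count = len(raw) // 4  # float32 values are 4 bytes each
    return list(struct.unpack(order + "f" * count, raw))

# Build a stand-in "data file": 16 header bytes, one compressed chunk, a trailer.
values = [271.5, 272.0, 273.25]
compressed = zlib.compress(struct.pack(">3f", *values))
file_bytes = b"\x00" * 16 + compressed + b"TRAILER"

map_xml = MAP_XML.format(length=len(compressed))
decoded = read_variable(file_bytes, map_xml, "temperature")
```

The reader never interprets the header or trailer; the offsets in the map let it skip straight to the chunk, which is why the map has to be created while a working API can still report those offsets.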

Map sample (extract)

Status and Future
Status:
Map creation utility (part of HDF)
Prototype read programs
o C
o Perl
Paper in TGRS special issue
Inventory of HDF-4 data products within EOSDIS
Possible Future Steps:
Revise XML schema
Revise map utility and add to HDF baseline
Implement map creation and storage operationally
o e.g., add to ECS or S4PA metadata files

NASA/MODIS Contextual Info
7th Joint ESDSWG meeting, October 22, Philadelphia, PA. Data Lifecycle Workshop sponsored by the Technology Infusion Working Group

Instrument/sensor characteristics
Presented by R. Duerr at the Summer Institute on Data Curation, June 2-5, 2008, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign

Processing Algorithms & Scientific Basis

Ancillary Data

Processing History including Source Code

Quality Assessment Information

Validation Information

Other Factors that can Influence the Record

Bibliography

Contextual Information:
Instrument/sensor characteristics including pre-flight or pre-operational performance measurements (e.g., spectral response, noise characteristics, etc.)
Instrument/sensor calibration data and method
Processing algorithms and their scientific basis, including complete description of any sampling or mapping algorithm used in creation of the product (e.g., contained in peer-reviewed papers, in some cases supplemented by thematic information introducing the data set or derived product)
Complete information on any ancillary data or other data sets used in generation or calibration of the data set or derived product

Contextual Information (continued):
Processing history including versions of processing source code corresponding to versions of the data set or derived product held in the archive
Quality assessment information
Validation record, including identification of validation data sets
Data structure and format, with definition of all parameters and fields
In the case of earth-based data, station location and any changes in location, instrumentation, controlling agency, surrounding land use and other factors which could influence the long-term record
A bibliography of pertinent Technical Notes and articles, including refereed publications reporting on research using the data set
Information received back from users of the data set or product

However…
Even groups like NASA do not have a governance model for this work
Governance is the activity of governing. It relates to decisions that define expectations, grant power, or verify performance. It consists either of a separate process or of a specific part of management or leadership processes. Sometimes people set up a government to administer these processes and systems. (Wikipedia)

Who cares…
Stakeholders:
– NASA for integrity of their data holdings (is it their responsibility?)
– Public for value for and return on investment
– Scientists for future use (intended and unintended)
– Historians

Library community
OAIS – Open Archival Information System, …ormation_System
OAI (PMH and ORE) – Open Archives Initiative (Protocol for Metadata Harvesting and Object Reuse and Exchange)
Do some reading on your own for this

Metadata Standards - PREMIS
Provide a core preservation metadata set with broad applicability across the digital preservation community
Developed by an OCLC and RLG sponsored international working group
– Representatives from libraries, museums, archives, government, and the private sector
Based on the OAIS reference model
Preservation Metadata Interchange Std.

Metadata Standards - PREMIS
Maintained by the Library of Congress
Editorial board with international membership
User community consulted on changes through the PREMIS Implementers Group
Version 1 was released in June 2005
Version 2 was just released

PREMIS - Entity-Relationship Diagram
Intellectual Entities: “a coherent set of content that is reasonably described as a unit”. For example, a web site, data set or collection of data sets
Objects: “a discrete unit of information in digital form”. For example, a data file
Rights: “assertions of one or more rights or permissions pertaining to an object or an agent”, e.g., copyright notice, legal statute, deposit agreement
Events: “an action that involves at least one object or agent known to the preservation repository”, e.g., created, archived, migrated
Agents: “a person, organization, or software program associated with preservation events in the life of an object”, e.g., Dr. Spock donated it

PREMIS - Types of Objects
Representation - “the set of files needed for a complete and reasonable rendition of an Intellectual Entity”
File
Bitstream - “contiguous or non-contiguous data within a file that has meaningful common properties for preservation purposes”
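The PREMIS entities and their relationships can be made concrete with a small data-structure sketch. This is a loose illustration of how intellectual entities, file objects, events and agents link together, not the PREMIS schema itself: the class and field names, the sample identifiers, and the `history` helper are all hypothetical, and Rights and Bitstreams are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Agent:
    """A person, organization, or software program."""
    name: str

@dataclass
class Event:
    """An action involving an object, carried out by an agent."""
    event_type: str  # e.g. "created", "archived", "migrated"
    date: str
    agent: Agent

@dataclass
class FileObject:
    """A discrete unit of information in digital form, with its event history."""
    identifier: str
    checksum: str
    events: List[Event] = field(default_factory=list)

@dataclass
class IntellectualEntity:
    """A coherent set of content described as a unit, e.g. a data set."""
    title: str
    files: List[FileObject] = field(default_factory=list)

def history(entity):
    """Flatten an entity into (file, event, agent) rows for a provenance report."""
    return [(f.identifier, e.event_type, e.agent.name)
            for f in entity.files for e in f.events]

# Hypothetical sample records.
granule = FileObject("granule-001.hdf", "sha256:placeholder")
granule.events.append(Event("created", "2008-06-02", Agent("instrument team")))
granule.events.append(Event("archived", "2008-06-05", Agent("archive ingest service")))
dataset = IntellectualEntity("Example data set", [granule])
timeline = history(dataset)
```

Keeping events as first-class records, rather than overwriting a single "status" field, is what lets a later curator reconstruct who did what to each file.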

Information from users
Data errors found
Quality updates
Things that need further explanation
Metadata updates/additions?
Community contributed metadata????

Back to why you need to…
E-science uses data, and that data needs to be around when what you create goes into service and you go on to something else
That’s why someone on the team must address life-cycle (data, information and knowledge) and work with other team members to implement organizational, social and technical solutions to the requirements

(Digital) Object Identifiers
Object is used here so as not to pre-empt an implementation, e.g. resource, sample, data, catalog
– DOI = e.g /s – visit crossref.org and see where this leads you.
– URI, …entifier, e.g. …338n85/fulltext.pdf
– XRI (from OASIS), …open.org/committees/xri

Versioning
Is a key enabler of good preservation
Is a tricky trap for those that do not conform to written guidelines for versioning
http://en.wikipedia.org/wiki/Revision_control

Summary
The progression toward more formal encoding of science workflow, and in our context data-science workflow (dataflow), is substantially improving data management
Preservation and stewardship of valuable data and information resources are receiving renewed attention in the digital age
Workflows are a potential solution to the data stewardship challenge
Which brings us to the final assignment

Final assignment
See web (10% of grade).

What is next
Final assignment due in two weeks
Next week – written part of group project due
Next week - Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration
Reading for this week – see wiki
Last class is week 13, Nov. 20 – project presentations (and final assignment due)