1 Peter Fox Xinformatics – ITEC 6961/CSCI 6960/ERTH-6963-01 Week 11, April 20, 2010 Information management and workflow.



Contents
Review of last class, reading
Information life-cycle
Information visualization
Checking in for project definitions?
Discussion of reading
Next class

Management
Creation of logical collections
–The primary goal of a management system is to abstract the physical collection into logical collections. The resulting view is a uniform, homogeneous library collection.
Physical handling
–This layer maps between the physical and logical views. Here you find items like replication, backup, caching, etc.

Management
Interoperability support
–Normally the data do not all reside in the same place, or various collections (like catalogues) must be brought together into the same logical collection.
Security support
–Access authorization and change verification: the basis of trusting your information.
Ownership
–Defines who is responsible for quality and meaning.

Management
Metadata collection, management and access
–Metadata are data about data; metainformation is information about information.
Persistence
–Definition of lifetime; deployment of mechanisms to counteract technology obsolescence.
Knowledge and information discovery
–Ability to identify useful relations and information inside the collection.

Management
Dissemination and publication
–Mechanisms to make interested parties aware of changes and additions to the collections.

Logical Collections
Identifying naming conventions and organization
Aligning cataloguing and naming to facilitate search, access and use
Provision of contextual information

Physical Handling
Where does the data come from, and from whom?
How is it transferred into a physical form?
Backup, archiving, and caching...
Formats
Naming conventions

Interoperability Support
Bit/byte and platform/wire-neutral encodings
Programming or application interface access
Structure and vocabulary (metadata) conventions and standards

Security
What mechanisms exist for securing?
Who performs this task?
Change and versioning (yes, the information may change): who does this, and how?
Who has access? How are access methods controlled and audited?
Who and what – authentication and authorization
Encryption and integrity
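Change verification and integrity checks of this kind are commonly implemented with cryptographic digests. A minimal sketch using Python's standard hashlib (the sample record is illustrative, not from any real collection):

```python
import hashlib

def digest(data: bytes) -> str:
    """Return a SHA-256 hex digest used as an integrity fingerprint."""
    return hashlib.sha256(data).hexdigest()

# Record a fingerprint when the item is ingested into the collection...
original = b"station=A42, temp=17.3"
stored_digest = digest(original)

# ...and verify it on later access to detect silent change or tampering.
assert digest(original) == stored_digest                    # unchanged: matches
assert digest(b"station=A42, temp=99.9") != stored_digest   # altered: differs
```

A real system would also sign or separately secure the stored digests, since an attacker who can alter the data can usually alter a co-located checksum too.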

Ownership
Rights and policies – definition and enforcement
Limitations on access and use
Requirements for acknowledgement and use
Who defines and ensures quality, and how?
To whom may ownership migrate?
How to address replication?
How to address revised/derivative products?

Metadata
How to know what conventions, standards and best practices exist?
How to use them, and with what tools?
Understanding the costs of incomplete and inconsistent metadata
Understanding the line between metadata and data, and when it is blurred
Knowing where and how to manage metadata and where to store it (and where not to)

Persistence
Where will you put your information so that someone else (e.g. one of your class members) can access it?
What happens after the class, the semester, after you graduate?
What other factors are there to consider?

Discovery
If you choose (see ownership and security), how does someone find your information?
How would you provide discovery of collections, versus files, versus ‘bits’?
How to enable the narrowest/broadest discovery?

Dissemination
Who should do this?
How, and what needs to be put in place?
How to advertise?
How to inform about updates?
How to track use and significance?

Formats
ASCII, UTF-8, ISO
Self-describing formats
Table-driven
Markup languages and other web-based formats
Databases
Graphs
Unstructured
Discussion…
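The idea behind a self-describing format is that the payload travels together with its own structural and semantic description. A toy sketch in JSON (the field names, units, and variable are made up for illustration):

```python
import json

# A self-describing record: the data value is accompanied by a
# human-readable description and machine-readable variable metadata.
record = {
    "description": "sea surface temperature reading",
    "variables": {"sst": {"units": "degC", "type": "float"}},
    "data": {"sst": 17.3},
}

encoded = json.dumps(record)    # plain UTF-8 text on the wire
decoded = json.loads(encoded)

# A consumer that has never seen this schema can still discover the units.
print(decoded["variables"]["sst"]["units"])  # degC
```

Formats such as netCDF and HDF apply the same principle at scale, embedding dimensions, units, and attributes alongside the arrays themselves.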

Metadata
Dublin Core (dc.x)
METS
ISO in general, e.g. ISO/IEC
Geospatial: ISO, FGDC
Time: ISO 8601, xsd:datetime
Z39.50/ISO

Summary of Management
Creation of logical collections
Physical handling
Interoperability support
Security support
Ownership
Metadata collection, management and access
Persistence
Knowledge and information discovery
Dissemination and publication

Information Workflow
What is it?
Why would you use it?
Some more detail in the context of Kepler
Some pointers to other workflow systems

What is a workflow?
General definition: a series of tasks performed to produce a final outcome
Information workflow – “analysis pipeline”
–Automate tedious jobs that users traditionally performed by hand for each dataset
–Process large volumes of data/information faster than one could by hand
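The "series of tasks" idea can be sketched as a plain function pipeline, where each step's output feeds the next (the step names and toy data are illustrative, not from any particular system):

```python
# A workflow as a pipeline: each step consumes the previous step's output.
def acquire():        return [3, 1, 4, 1, 5]            # stand-in for a dataset
def clean(xs):        return sorted(set(xs))            # dedupe and order
def analyze(xs):      return sum(xs) / len(xs)          # a toy "analysis"
def report(result):   return f"mean = {result:.2f}"     # final outcome

steps = [clean, analyze, report]
data = acquire()
for step in steps:    # the pipeline automates what was once done by hand
    data = step(data)
print(data)  # mean = 3.25
```

Running the same pipeline over many datasets is then a loop over `acquire` calls rather than a manual, per-dataset procedure.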

Background: Business Workflows
Example: planning a trip
Need to perform a series of tasks: book a flight, reserve a hotel room, arrange for a rental car, etc.
Each task may depend on the outcome of a previous task
–Days you reserve the hotel depend on the days of the flight
–If the hotel has a shuttle service, you may not need to rent a car

What about information workflows?
Perform a set of transformations/operations on a data or information source
Examples:
–Generating images from raw data
–Identifying areas of interest in a large dataset
–Classifying a set of objects
–Querying a web service for more information on a set of objects
–Many others…

More on Workflows
Formal models of the flow of data/information among processing components
May be simple and linear, or more complex
Can process many data/information types:
–Archives
–Web pages
–Streaming/real time
–Images (e.g., medical or satellite)
–Simulation output
–Observational data

Challenges
Questions:
–What are some challenges for users in implementing workflows?
–What are some challenges to executing these workflows?
–What are the limitations of writing a program?

Challenges
Mastering a programming language
Visualizing the workflow
Sharing/exchanging workflows
Formatting issues
Locating datasets, services, or functions

Kepler Workflow Management System
Graphical interface for developing and executing scientific workflows
Users can create workflows by dragging and dropping
Automates low-level processing tasks
Provides access to repositories, compute resources, and workflow libraries

Benefits of Workflows
Documentation of aspects of the analysis
Visual communication of analytical steps
Ease of testing/debugging
Reproducibility
Reuse of part or all of a workflow in a different project

Additional Benefits
Integration of multiple computing environments
Automated access to distributed resources via other architectural components, e.g. web services and Grid technologies
System functionality to assist with integration of heterogeneous components

Why not just use a script?
A script does not specify low-level task scheduling and communication
It may be platform-dependent
It can’t be easily reused
It may not have sufficient documentation to be adapted for another purpose

Why is a GUI useful?
No need to learn a programming language
Visual representation of what the workflow does
Allows you to monitor workflow execution
Enables user interaction
Facilitates sharing of workflows

The Kepler Project
Goals
–Produce an open-source workflow system that enables scientists to design scientific workflows and execute them
–Support scientists in a variety of disciplines, e.g. biology, ecology, astronomy
–Important features:
access to scientific data
flexible means for executing complex analyses
enable use of Grid-based approaches to distributed computation
semantic models of scientific tasks
effective UI for workflow design

Usage statistics
Source code access
–154 people have accessed the source code
–30 members have write permission
Projects using Kepler: SEEK (ecology), SciDAC (molecular biology, …), CPES (plasma simulation), GEON (geosciences), CiPRes (phylogenetics), CalIT2, ROADnet (real-time data), LOOKING (oceanography), CAMERA (metagenomics), Resurgence (computational chemistry), NORIA (ocean observing CI), NEON (ecology observing CI), ChIP-chip (genomics), COMET (environmental science), Cheshire Digital Library (archival), digital preservation (DIGARCH), cell biology (Scripps), DART (X-ray crystallography), Ocean Life, Assembling the Tree of Life, Processing Phylodata (pPOD), FermiLab (particle physics)
Kepler downloads: total = 9204, beta = 6675 [chart: downloads by platform, Windows (red) vs. Macintosh (blue)]

Distributed execution
Opportunities for parallel execution
–Fine-grained parallelism
–Coarse-grained parallelism: few or no cycles, limited dependencies among components, ‘trivially parallel’
Many science problems fit this mold
–parameter sweeps, iteration of stochastic models
Current ‘plumbing’ approaches to distributed execution
–the workflow acts as a controller: it stages data resources, writes job description files, and controls execution of jobs on nodes
–requires expert understanding of the Grid system
Scientists need to focus on just the computations
–try to avoid plumbing as much as possible
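A parameter sweep is 'trivially parallel' because each run is independent. A minimal local sketch using Python's standard concurrent.futures, with a thread pool standing in for remote nodes (a real Grid system would additionally stage data and schedule jobs; the model function here is a made-up placeholder):

```python
from concurrent.futures import ThreadPoolExecutor

def run_model(param: float) -> float:
    """Stand-in for one independent model run (no cycles, no shared state)."""
    return param ** 2

sweep = [0.5, 1.0, 1.5, 2.0]          # the parameter sweep: trivially parallel
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() farms each parameter out to a worker; result order matches input order.
    results = list(pool.map(run_model, sweep))
print(results)  # [0.25, 1.0, 2.25, 4.0]
```

The point of workflow systems like distributed Kepler is that the scientist writes only `run_model` and the sweep; the staging and scheduling 'plumbing' is handled by the system.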

Distributed Kepler
–Higher-order component for executing a model on one or more remote nodes
–Master and slave controllers handle setup and communication among the nodes, and establish data channels
–Extremely easy for a scientist to use: requires no knowledge of grid computing systems
[diagram: master and slave controllers connected by IN/OUT data channels]

Management
Need for integrated management of external data
–EarthGrid access is partial; needs refactoring
–Include other data sources, such as JDBC, OPeNDAP, etc.
–Data need to be first-class objects in Kepler, not just represented as actors
–Need support for data versioning to support provenance
Need to pass data by reference
–workflows contain large data tokens (100s of megabytes)
–intelligent handling of unique identifiers (e.g., LSID)
[diagram: a token {1,5,2} passed by value vs. a reference token (ref-276) passed between actors A and B]
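Passing by reference can be sketched as a registry of identifiers: actors exchange small tokens while the large payload stays in a shared store. This is a toy illustration, not Kepler's actual mechanism; the `ref-N` identifiers stand in for real unique identifiers such as LSIDs:

```python
# Pass-by-reference sketch: actors exchange small identifier tokens,
# while the (potentially huge) payload lives once in a shared store.
store = {}

def put(data):
    ref = f"ref-{len(store)}"   # illustrative unique identifier
    store[ref] = data
    return ref                  # only this small token flows in the workflow

def get(ref):
    return store[ref]           # downstream actor dereferences on demand

token = put([1, 5, 2])          # upstream actor registers the payload once
assert get(token) == [1, 5, 2]  # downstream actor retrieves it by reference
```

With 100s-of-megabyte tokens, this avoids copying the payload across every channel between actors.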

Science Environment for Ecological Knowledge
SEEK is an NSF-funded, multidisciplinary research project to facilitate:
Access to distributed ecological, environmental, and biodiversity data
–Enable data sharing & reuse
–Enhance data discovery at global scales
Scalable analysis and synthesis
–Taxonomic, spatial, temporal, and conceptual integration of data, addressing data heterogeneity issues
–Enable communication and collaboration for analysis
–Enable reuse of analytical components
–Support scientific workflow design and modeling

SEEK data access, analysis, mediation
Data Access (EcoGrid)
–Distributed data network for environmental, ecological, and systematics data
–Interoperate diverse environmental data systems
Workflow Tools (Kepler)
–Problem-solving environment for scientific data analysis and visualization → “scientific workflows”
Semantic Mediation (SMS)
–Leverage ontologies for “smart” data/component discovery and integration

Managing Heterogeneity
Data come from heterogeneous sources
–Real-world observations
–Spatial-temporal contexts
–Collection/measurement protocols and procedures
–Many representations of the same information (count, area, density)
–Data, syntax, schema, and semantic heterogeneity
Discovery and “synthesis” (integration) are performed manually
–Discovery is often based on an intuitive notion of “what is out there”
–Synthesis of data is very time-consuming, which limits use

Scientific workflow systems support data analysis [Kepler screenshot]

A simple Kepler workflow (T. McPhillips)
Composite component (sub-workflow)
Loops are often used in scientific workflows, e.g. in genomics and bioinformatics (collections of data, nested data, statistical regressions, …)

A simple Kepler workflow (T. McPhillips)
Lists Nexus files to process (project) → reads text files → parses Nexus format → PhylipPars infers trees from discrete, multi-state characters → draws phylogenetic trees
The workflow runs PhylipPars iteratively to discover all of the most parsimonious trees; UniqueTrees discards redundant trees in each collection.

A simple Kepler workflow
An example workflow run, executed as a Dataflow Process Network
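The dataflow process network idea — actors connected by streams, each firing as tokens arrive — can be sketched with Python generators (actor names here are illustrative, not Kepler actors):

```python
# Actors as generators connected by streams: each actor consumes tokens
# from its upstream actor and emits tokens downstream, one at a time.
def source():
    """Producer actor: emits a stream of tokens."""
    yield from [4, 9, 16]

def sqrt_actor(stream):
    """Transforming actor: fires once per input token."""
    for token in stream:
        yield token ** 0.5

def collect(stream):
    """Sink actor: drains the stream into a result list."""
    return list(stream)

# Wiring the network is function composition over streams.
print(collect(sqrt_actor(source())))  # [2.0, 3.0, 4.0]
```

Because each generator only advances when its consumer asks for the next token, tokens flow through the network without any actor materialising the whole stream, which mirrors how a process network handles large or unbounded data.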

SMS motivation
Scientific workflow life-cycle
–Resource discovery: discover relevant datasets; discover relevant actors or workflow templates
–Workflow design and configuration: data → actor (data binding); data → data (data integration/merging/interlinking); actor → actor (actor/workflow composition)
Challenge: do all this in the presence of…
–100s of workflows and templates
–1000s of actors (e.g. actors for web services, data analytics, …)
–10,000s of datasets
–1,000,000s of data items
–…highly complex, heterogeneous data
Price to pay for these resources: $$$ (lots). Scientist’s time wasted: priceless!

Some other workflow systems
SCIRun
Sciflo
Triana
Taverna
Pegasus
Some commercial tools:
–Windows Workflow Foundation
–Mac OS X Automator
See reading for this week

Summary
The progression toward more formal encoding of science workflows, and in our context data-science workflows (dataflow), is substantially improving data management.
Awareness of preservation and stewardship for valuable data and information resources is receiving renewed attention in the digital age.
Workflows are a potential solution to the data stewardship challenge.

Discussion
About management? Workflow?

Reading for this week
Is retrospective

Check in for Project Assignment
Analysis of existing information system content and architecture; critique, redesign, and prototype redeployment

What is next
Week 12 – Information Discovery, Information Integration, review of all course material and check on learning objectives (next week)
Break on May 4, no class
Week 13 – Project presentations (May 11, i.e. in 3 weeks)
Note: IDEA surveys will be sent out soon