Kepler, Opal and Gemstone Amarnath Gupta University of California San Diego.


2 ISSGC06 – Ischia, Italy Changing Needs for Scientific Process
Traditional scientific process (before computers): Observe → Hypothesize → Conduct experiment → Analyze data → Compare results and conclude → Predict
Yesterday… at least for some of us!

3 What’s different in today’s science?
Today’s scientific process: Observe → Hypothesize → Conduct experiment → Analyze data → Compare results and conclude → Predict
More to add to this picture: network, Grid, portals, +++
Observing / Data: microscopes, telescopes, particle accelerators, X-rays, MRIs, microarrays, satellite-based sensors, sensor networks, field studies…
Analysis, Prediction / Models and model execution: potentially large computation and visualization

4 A Brief Recap
What are scientific workflow systems trying to achieve?
– Creation of a problem-solving environment over distributed and mostly autonomous platforms
– Seamless access to resources and services
– Service composition and reuse
– Scalability
– Detached execution and yet, user interaction
– Reliability and fault tolerance
– “Smart” re-runnability
– Reproducibility of execution
– Information discovery as an aid to workflow design

5 What is Kepler?
Derived from an earlier scientific data-flow system called Ptolemy II, which is
– Designed to model heterogeneous, concurrent systems for engineering applications
– An actor-based workflow paradigm
Kepler adds to Ptolemy II
– New components for scientific workflows
– Structural and semantic type management
– Semantic annotation and annotation propagation mechanisms
– Distributed execution capabilities
  Execution in a grid framework
– …

6 ISSGC06 – Ischia, Italy Promoter Identification Workflow Source: Matt Coleman (LLNL)

Promoter Identification Workflow


10 ISSGC06 – Ischia, Italy Enter initial inputs, Run and Display results

11 ISSGC06 – Ischia, Italy Custom Output Visualizer

12 ISSGC06 – Ischia, Italy Kepler System Architecture Authentication GUI Vergil SMS Kepler Core Extensions Ptolemy …Kepler GUI Extensions… Actor&Data SEARCH Type System Ext Provenance Framework Kepler Object Manager Documentation Smart Re-run / Failure Recovery

13 What is an Actor-based Workflow?
An actor-based workflow is a graph with three components
– Actors: passive (parameterized) programs specified by their input and output signatures
  Ports: an actor has a set of input and output ports, each specified by the signature of the data tokens passing through that port
  No call semantics
  Attributes
– Dataflow connections: a connectivity specification that designates the flow of data from one actor to another
  Relation: an intermediate data holding station
– Director: an execution control model that coordinates the execution behavior of a workflow
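The three components on this slide can be sketched in a few lines of Python. This is only an illustration, not Kepler or Ptolemy II code: the class names `Actor`, `Workflow` and `SimpleDirector`, and the firing rule used by the director, are assumptions made for the example.

```python
from collections import deque


class Actor:
    """A parameterized program known only by its input and output port names."""

    def __init__(self, name, inputs, outputs, func, **attributes):
        self.name = name
        self.inputs = inputs          # input port names
        self.outputs = outputs        # output port names
        self.func = func              # returns a tuple, one token per output port
        self.attributes = attributes  # parameters passed to func on every firing


class Workflow:
    """A graph of actors joined by relations (token queues on input ports)."""

    def __init__(self):
        self.actors = {}
        self.relations = {}  # (actor name, input port) -> deque of tokens
        self.links = {}      # (actor name, output port) -> [(actor name, input port)]

    def add(self, actor):
        self.actors[actor.name] = actor
        for port in actor.inputs:
            self.relations[(actor.name, port)] = deque()
        return actor

    def connect(self, src, out_port, dst, in_port):
        self.links.setdefault((src.name, out_port), []).append((dst.name, in_port))

    def put(self, actor, port, token):
        self.relations[(actor.name, port)].append(token)


class SimpleDirector:
    """Hypothetical director: fire any actor whose input ports all hold a token."""

    def run(self, wf):
        results = {}
        fired = True
        while fired:
            fired = False
            for actor in wf.actors.values():
                queues = [wf.relations[(actor.name, p)] for p in actor.inputs]
                if queues and all(queues):
                    args = [q.popleft() for q in queues]
                    outs = actor.func(*args, **actor.attributes)
                    for port, token in zip(actor.outputs, outs):
                        targets = wf.links.get((actor.name, port), [])
                        for target in targets:
                            wf.relations[target].append(token)
                        if not targets:  # unconnected output: a workflow result
                            results[(actor.name, port)] = token
                    fired = True
        return results


wf = Workflow()
scale = wf.add(Actor("scale", ["x"], ["y"], lambda x, factor: (x * factor,), factor=3))
inc = wf.add(Actor("inc", ["x"], ["y"], lambda x: (x + 1,)))
wf.connect(scale, "y", inc, "x")
wf.put(scale, "x", 5)
result = SimpleDirector().run(wf)
```

The actors never call each other; only the director decides when an actor fires, which is the "no call semantics" point on the slide.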

14 Composite Actors
Composite actor A_W
– A pair (W, Σ_W) comprising a subworkflow W and a set of distinguished ports Σ_W ⊆ freeports(W), the i/o-signature of W
– The i/o-signatures of the subworkflow W and of the composite actor A_W containing W match, i.e., Σ_W = ports(A_W)
– An actor can be “refined” by treating it as a workflow and adding other restrictions around it
Workflow abstraction
– One can substitute a subworkflow as a single actor
– The subworkflow may have a different director than the higher-level workflow

15 ISSGC06 – Ischia, Italy Mineral Classification Workflow

16 PointInPolygon algorithm

17 Execution Model
Actors
– Asynchronous: many actors can be ready to fire simultaneously
  Execution ("firing") of a node starts when (matching) data is available at a node's input ports.
– Locally controlled events
  Events correspond to the “firing” of an actor
– Actor:
  A single instruction
  A sequence of instructions
– Actors fire when all the inputs are available
Directors are the WF engines that
– Implement different computational models
– Define the semantics of execution of actors and workflows, and of interactions between actors
Process Network (PN) Director
– Each actor executes as a separate thread or process
– Data connections represent queues of unbounded size. Actors can always write to output ports, but may get suspended (blocked) on input ports without a sufficient number of data tokens.
– Performs buffer management, deadlock detection, allows data forks and merges
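The process-network behavior described above (one thread per actor, unbounded queues, blocking reads) can be imitated with Python's standard `threading` and `queue` modules. This is a sketch of the idea only, not the PN director itself: the actor functions and the `STOP` sentinel are assumptions for the example, and it does no deadlock detection or buffer management.

```python
import queue
import threading

STOP = object()  # hypothetical end-of-stream marker


def source(out_q, values):
    # Writes never block because the queue is unbounded.
    for value in values:
        out_q.put(value)
    out_q.put(STOP)


def square(in_q, out_q):
    while True:
        token = in_q.get()  # blocks (suspends the actor) while the input is empty
        if token is STOP:
            out_q.put(STOP)
            return
        out_q.put(token * token)


def collect(in_q, results):
    while True:
        token = in_q.get()
        if token is STOP:
            return
        results.append(token)


def run_process_network(values):
    q1, q2 = queue.Queue(), queue.Queue()  # default maxsize=0 means unbounded
    results = []
    threads = [
        threading.Thread(target=source, args=(q1, values)),
        threading.Thread(target=square, args=(q1, q2)),
        threading.Thread(target=collect, args=(q2, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because the pipeline is linear and each queue is first-in, first-out, `run_process_network([1, 2, 3])` yields the squares in input order even though the three actors run concurrently.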

18 The Director
Execution phases
– pre-initialize method of all actors
  Run once per workflow execution
  Are the data types of all actor ports known? Are transport protocols known?
– type-check
  Are connected actors type compatible?
– run*
  initialize
  – Executed per run
  – Are all the external services (e.g., web services) working?
  – Replace dead services with live ones…
  iteration*
  – pre-fire
    » Are all data in place?
  – fire*
  – post-fire
    » Any updates for local state management?
– wrap-up
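The phase ordering can be made concrete with a small driver loop. This is a sketch under assumptions: `TracingActor` and `execute` are hypothetical names, the type check is reduced to a log entry, and wrap-up is placed once after all runs, which the flattened slide does not state explicitly.

```python
class TracingActor:
    """Hypothetical actor that records each lifecycle call in a shared log."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def _note(self, phase):
        self.log.append(f"{self.name}.{phase}")

    def preinitialize(self):
        self._note("preinitialize")

    def initialize(self):
        self._note("initialize")

    def prefire(self):
        self._note("prefire")
        return True  # "are all data in place?" is always yes in this sketch

    def fire(self):
        self._note("fire")

    def postfire(self):
        self._note("postfire")

    def wrapup(self):
        self._note("wrapup")


def execute(actors, log, runs=1, iterations=1):
    """Drive the actors through the phases in the order listed on the slide."""
    for actor in actors:
        actor.preinitialize()          # once per workflow execution
    log.append("type-check")           # placeholder for the real compatibility check
    for _ in range(runs):              # run*
        for actor in actors:
            actor.initialize()         # executed per run
        for _ in range(iterations):    # iteration*
            for actor in actors:
                if actor.prefire():
                    actor.fire()
                    actor.postfire()
    for actor in actors:
        actor.wrapup()                 # assumed placement: after all runs
    return log
```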

19 Polymorphic Actors: Components Working Across Data Types and Domains
Actor data polymorphism:
– Add numbers (int, float, double, Complex)
– Add strings (concatenation)
– Add complex types (arrays, records, matrices)
– Add user-defined types
Actor behavioral polymorphism:
– In dataflow, add when all connected inputs have data
– In a time-triggered model, add when the clock ticks
– In discrete-event, add when any connected input has data, and add in zero time
– In process networks, execute an infinite loop in a thread that blocks when reading empty inputs
– In CSP, execute an infinite loop that performs rendezvous on input or output
– In push/pull, ports are push or pull (declared or inferred) and behave accordingly
– In real-time CORBA*, priorities are associated with ports and a dispatcher determines when to add
By not choosing among these when defining the component, we get a huge increment in component reusability. But how do we ensure that the component will work in all these circumstances?
Source: Edward Lee et al.
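Both kinds of polymorphism can be illustrated with a toy "add" component. The function names below are assumptions for the example, not Ptolemy II or Kepler APIs: `add_tokens` shows one body working across data types, and the two `ready_*` predicates show how the same actor could be triggered differently under two of the domains listed above.

```python
def add_tokens(a, b):
    """Data polymorphism: one 'add' for numbers, strings, arrays and records."""
    if isinstance(a, dict) and isinstance(b, dict):   # record: add field by field
        if a.keys() != b.keys():
            raise TypeError("records must have the same fields")
        return {key: add_tokens(a[key], b[key]) for key in a}
    if isinstance(a, list) and isinstance(b, list):   # array: add element-wise
        if len(a) != len(b):
            raise TypeError("arrays must have the same length")
        return [add_tokens(x, y) for x, y in zip(a, b)]
    return a + b  # numbers are summed, strings are concatenated


def ready_dataflow(input_queues):
    """Dataflow rule: add when all connected inputs have data."""
    return all(len(q) > 0 for q in input_queues)


def ready_discrete_event(input_queues):
    """Discrete-event rule: add when any connected input has data."""
    return any(len(q) > 0 for q in input_queues)
```

The adding logic is written once; only the readiness rule, which belongs to the director's domain, changes.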

20 GEON: Geosciences Network
Multi-institution collaboration between IT and Earth Science researchers
Funded by NSF “large” ITR program
GEON Cyberinfrastructure provides:
– Authenticated access to data and Web services
– Registration of data sets and tools, with metadata
– Search for data, tools, and services, using ontologies
– Scientific workflow environment
– Data and map integration capability
– Visualization and GIS mapping

21 ISSGC06 – Ischia, Italy R. Haugerud, U.S.G.S D. Harding, NASA Point Cloud x, y, z n, … Survey Process & Classify Analyze / “Do Science” Interpolate / Grid LiDAR Introduction

22 LiDAR Difficulties
Massive volumes of data
– 1000s of ASCII files
– Hard to subset
– Hard to distribute and interpolate
Analysis requires high performance computing
Traditionally: Popularity > Resources

23 A Three-Tier Architecture
GOAL: Efficient LiDAR interpolation and analysis using GEON infrastructure and tools
– GEON Portal
– Kepler Scientific Workflow System
– GEON Grid
Use scientific workflows to glue/combine different tools and the infrastructure
Portal Grid

24 ISSGC06 – Ischia, Italy Lidar Workflow Process Configuration phase Subset: DB2 query on DataStar Portal Grid Subset Analyze moveprocess Visualize moverenderdisplay Interpolate: Grass RST, Grass IDW, GMT… Visualize: Global Mapper, FlederMaus, ArcIMS Scheduling/ Output Processing Monitoring/ Translation

25 ISSGC06 – Ischia, Italy Lidar Processing Workflow (using Fledermaus ) Subset Analyze moveprocess Visualize moverenderdisplay Arizona Cluster NFS Mounted Disk IBM DB2 Datastar NFS Mounted Disk d1 d2 (grid file) d2 d1 iView3D/ Browser Create Scene file Fledermaus sd

26 Lidar Workflow Portlet
1. User selections from GUI
   1. Translated into a query and a parameter file
   2. Uploaded to remote machine
2. Workflow description created on the fly
3. Workflow response redirected back to portlet

27 ISSGC06 – Ischia, Italy Render Map DB2 Spatial query Client/ GEON Portal NFS Mounted Disk ArcInfo Compute Cluster x,y,z and attribute raw data process output KEPLER WORKFLOW Map Parameters Grass Functions submit Parameter xml Create Workflow description ArcSDE ArcIMS Map onto the grid (Pegasus) Grass surfacing algorithms: Spline IDW block mean … Download data Binary grid ASCII grid Text file Tiff/Jpeg/Gif ASCII grid LIDAR POST-PROCESSING WORKFLOW PORTLET

28 ISSGC06 – Ischia, Italy Portlet User Interface - Main Page

29 ISSGC06 – Ischia, Italy Portlet User Interface - Parameter Entry 1

30 ISSGC06 – Ischia, Italy Portlet User Interface - Parameter Entry 2

31 ISSGC06 – Ischia, Italy Portlet User Interface - Parameter Entry 3

32 ISSGC06 – Ischia, Italy Behind the Scenes: Workflow Template

33 ISSGC06 – Ischia, Italy Filled Template

34 ISSGC06 – Ischia, Italy Example Outputs

35 ISSGC06 – Ischia, Italy With Additional Algorithms

36 ISSGC06 – Ischia, Italy Kepler System Architecture Authentication GUI Vergil SMS Kepler Core Extensions Ptolemy …Kepler GUI Extensions… Actor&Data SEARCH Type System Ext Provenance Framework Kepler Object Manager Documentation Smart Re-run / Failure Recovery

37 The Hybrid Type System
Every port of an actor has a type signature
– Structural types
  Any type system admitted by the actor
  – DBMS data types, XML schema, Hindley-Milner type system …
– Semantic types
  An expression in a logical language to specify what a data object means
  In the SEEK project, such a statement is expressed in a description logic (DL) over an ontology
  – MEASUREMENT  ITEM_MEASURED.SPECIES_OCCURRENCE
A workflow is well-typed if, for every pair of connected ports,
– The structural type of the output port is a subtype of that of the input port
– The semantic type of the output port is logically subsumed by that of the input port
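The well-typedness condition can be sketched as a check over connected port pairs. Everything concrete here is an assumption for illustration: the two small hierarchies, the concept names, and the use of a plain parent-chain walk in place of real description-logic subsumption, which would need a reasoner.

```python
# Hypothetical structural subtype hierarchy: child -> parent.
STRUCT_PARENT = {"int": "double", "double": "general", "string": "general"}

# Hypothetical ontology fragment: concept -> more general concept.
CONCEPT_PARENT = {"SpeciesOccurrence": "Measurement", "Measurement": "Observation"}


def is_a(child, ancestor, parent):
    """True if `ancestor` is reachable from `child` by following parent links."""
    while child is not None:
        if child == ancestor:
            return True
        child = parent.get(child)
    return False


def well_typed(connections):
    """connections: list of (output_port, input_port) pairs.

    Each port is a (structural_type, semantic_type) pair. The workflow is
    well-typed if every output port is a structural subtype of, and is
    semantically subsumed by, the input port it feeds.
    """
    return all(
        is_a(out[0], inp[0], STRUCT_PARENT) and is_a(out[1], inp[1], CONCEPT_PARENT)
        for out, inp in connections
    )


ok = well_typed([(("int", "SpeciesOccurrence"), ("double", "Measurement"))])
bad = well_typed([(("double", "Measurement"), ("int", "SpeciesOccurrence"))])
```

The second connection fails on both counts: a `double` output cannot feed an `int` input, and a general measurement is not guaranteed to be a species occurrence.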

38 Hybridization Constraints
A hybridization constraint
– A logical expression connecting instances of a structural type with instances of the corresponding semantic type for a port
– For a relational type r(site, day, spp, occ)
I/O constraint
– A constraint relating the input and output port signatures of an actor
Propagating hybridization constraints
Having a tuple in r implies that there is a measurement y of the type speciesoccurrence corresponding to x_occ


40 How can my (grid) application become a Kepler actor?
By making it a web service
– For applications that have a command line interface
– Opal can convert the application into a web service
What is Opal?
– A Web services wrapper toolkit
  Pros: generic, rapid deployment of new services
  Cons: less flexible implementation, weak data typing due to use of generic XML schemas

41 ISSGC06 – Ischia, Italy Condor poolSGE Cluster PBS Cluster Globus Application Services Security Services (GAMA) State Mgmt GemstonePMV/VisionKepler Opal is an Application Wrapping Service

42 The Opal Toolkit: Overview
Enables rapid deployment of scientific applications as Web services (< 2 hours)
Steps
– Application writers create configuration file(s) for a scientific application
– Deploy the application as a Web service using Opal’s simple deployment mechanism (via Apache Ant)
– Users can now access this application as a Web service via a unique URL

43 ISSGC06 – Ischia, Italy Opal Architecture Tomcat Container Axis Engine Opal WS Cluster/Grid Resources Container Properties Service Config Scheduler, Security, Database Setups Binary, Metadata, Arguments

44 Service Operations
– Get application metadata: returns metadata specified inside the application configuration
– Launch job: accepts a list of arguments and input files (Base64 encoded), launches the job, and returns a jobID
– Query job status: returns the status of a running job using the jobID
– Get job outputs: returns the locations of job outputs using the jobID
– Get output as Base64: returns an output file in Base64 encoded form
– Destroy job: uses the jobID to destroy a running job
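To show how the six operations fit together, here is a local, in-process stand-in that wraps a command-line program. It is not the Opal API: the class `MockAppService`, its method names and the status strings are all hypothetical, and a real deployment would expose these operations as Web service calls rather than Python methods.

```python
import base64
import subprocess
import sys
import tempfile
import uuid
from pathlib import Path


class MockAppService:
    """Hypothetical local stand-in for a wrapped command-line application."""

    def __init__(self, metadata, command):
        self.metadata = metadata
        self.command = command  # base command line, e.g. [sys.executable, "-c", "..."]
        self.jobs = {}          # job id -> (process, working directory)

    def get_app_metadata(self):
        return self.metadata

    def launch_job(self, args, input_files=None):
        job_id = uuid.uuid4().hex
        workdir = Path(tempfile.mkdtemp(prefix="job-"))
        for name, encoded in (input_files or {}).items():
            (workdir / name).write_bytes(base64.b64decode(encoded))
        with open(workdir / "stdout.txt", "wb") as stdout:
            proc = subprocess.Popen(self.command + list(args), cwd=workdir, stdout=stdout)
        self.jobs[job_id] = (proc, workdir)
        return job_id

    def query_status(self, job_id):
        proc, _ = self.jobs[job_id]
        code = proc.poll()
        if code is None:
            return "RUNNING"
        return "DONE" if code == 0 else "FAILED"

    def get_outputs(self, job_id):
        _, workdir = self.jobs[job_id]
        return sorted(str(path) for path in workdir.iterdir())

    def get_output_as_base64(self, job_id, name):
        _, workdir = self.jobs[job_id]
        return base64.b64encode((workdir / name).read_bytes()).decode("ascii")

    def destroy(self, job_id):
        proc, _ = self.jobs[job_id]
        if proc.poll() is None:
            proc.kill()
        proc.wait()
```

A client would typically call `launch_job`, poll `query_status` with the returned id until the job leaves the running state, and then fetch results with `get_outputs` or `get_output_as_base64`.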

45 ISSGC06 – Ischia, Italy MEME+MAST Workflow using Kepler

46 ISSGC06 – Ischia, Italy Kepler Opal Web Services Actor

47 ISSGC06 – Ischia, Italy Opal and Gemstone

48 ISSGC06 – Ischia, Italy Opal Summary Opal enables rapidly exposing legacy applications as Web services –Provides features like Job management, Scheduling, Security, and Persistence More information, downloads, documentation: –

49 ISSGC06 – Ischia, Italy Kepler System Architecture Authentication GUI Vergil SMS Kepler Core Extensions Ptolemy …Kepler GUI Extensions… Actor&Data SEARCH Type System Ext Provenance Framework Kepler Object Manager Documentation Smart Re-run / Failure Recovery

50 Joint Authentication Framework
Requirements:
– Coordinating between the different security architectures
  GEON uses GAMA, which requires a single certificate authority.
  SEEK uses LDAP, which has a centralized certificate authority with distributed subordinate CAs.
  – To connect LDAP with GAMA
  – Coordinating between 2 different GAMA servers
– Single sign-on/authentication at the initialize step of the run for multiple actors that use authentication
  This raises issues of a single GAMA repository vs. multiple repositories, and requires users to have accounts on all servers.
Kepler needs to be able to handle expired certificates for long-running workflows and/or for users who use it for a long time.
A trust relation between the different GAMA servers must be established in order to allow for single authentication.

51 Functional Prototype Completed
APIs and test cases in place
More work required on certificate renewal and multiple server access

52 ISSGC06 – Ischia, Italy Vergil is the GUI for Kepler Actor ontology and semantic search for actors Search -> Drag and drop -> Link via ports Metadata-based search for datasets Actor Search Data Search

53 ISSGC06 – Ischia, Italy Back to Kepler - Actor Search Kepler Actor Ontology Used in searching actors and creating conceptual views (= folders) Currently 160 Kepler actors added!

54 ISSGC06 – Ischia, Italy Data Search and Usage of Results Kepler DataGrid – Discovery of data resources through local and remote services SRB, Grid and Web Services, Db connections – Registry of datasets on the fly using workflows

55 ISSGC06 – Ischia, Italy Vergil Updates To make it more useful to the user –Updated actor icons –Menu redesign Improve readability Develop cohesive visual language Follow standard HCI principles Improve organization Composite DB Query Computation or Operation Transformation Filter File Operation Web Service

56 Kepler Archives
Purpose: encapsulate WF data and actors in an archive file
– … inlined or by reference
– … version control
More robust workflow exchange
Easy management of semantic annotations
Plug-in architecture (drop in and use)
Easy documentation updates
A jar-like archive file (.kar) including a manifest
All entities have unique ids (LSID)
Custom object manager and class loader
UI and API to create, define, search and load .kar files
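Since a jar is a zip file with a manifest, the idea of a jar-like archive whose entries carry unique ids can be sketched with Python's `zipfile` module. The manifest layout below is an assumption: the `lsid` attribute name and the sample id are made up for the example, the real .kar manifest may differ, and the sketch ignores jar details such as line-length limits.

```python
import io
import zipfile

MANIFEST_PATH = "META-INF/MANIFEST.MF"  # conventional jar manifest location


def build_kar(entries):
    """entries: {path in archive: (lsid, content bytes)} -> archive bytes."""
    manifest = ["Manifest-Version: 1.0", ""]
    for path, (lsid, _) in entries.items():
        manifest += [f"Name: {path}", f"lsid: {lsid}", ""]  # 'lsid' key is hypothetical
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(MANIFEST_PATH, "\n".join(manifest))
        for path, (_, data) in entries.items():
            archive.writestr(path, data)
    return buffer.getvalue()


def find_by_lsid(kar_bytes, lsid):
    """Return the content of the entry registered under `lsid`, or None."""
    with zipfile.ZipFile(io.BytesIO(kar_bytes)) as archive:
        name = None
        for line in archive.read(MANIFEST_PATH).decode("utf-8").splitlines():
            if line.startswith("Name: "):
                name = line[len("Name: "):]
            elif name and line == f"lsid: {lsid}":
                return archive.read(name)
    return None


kar = build_kar({"actors/Add.xml": ("urn:lsid:example.org:actor:1:1", b"<entity/>")})
```

Looking entries up by id rather than by path is what lets an object manager resolve the same entity from a local archive or a remote repository.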

57 ISSGC06 – Ischia, Italy KAR File Example

58 Kepler Object Manager
Designed to access local and distributed objects
Objects: data, metadata, annotations, actor classes, supporting libraries, native libraries, etc. archived in kar files
Advantages:
– Reduce the size of Kepler distribution
  Only ship the core set of generic actors and domains
– Easy exchange of full or partial workflows for collaborations
– Publish full workflows with their bound data
  Becomes a provenance system for derived data objects
=> Separate workflow repository and distributions easily

59 Initial Work on Provenance Framework
Provenance
– Track origin and derivation information about scientific workflows, their runs and derived information (datasets, metadata…)
Need for provenance
– Association of process and results
– Reproduce results
– “Explain & debug” results (via lineage tracing, parameter settings, …)
– Optimize: “Smart Re-Runs”
Types of provenance information:
– Data provenance
  Intermediate and end results including files and db references
– Process (= workflow instance) provenance
  Keep the wf definition with data and parameters used in the run
– Error and execution logs
– Workflow design provenance (quite different)
  WF design is a (little supported) process (art, magic, …)
  For free via cvs: edit history
  Need more “structure” (e.g. templates) for individual & collaborative workflow design
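One way to see how process provenance enables "smart re-runs" is to key each recorded step by a hash of its name, parameters and inputs, and reuse the stored output when the same key comes up again. This is a sketch of that idea only, not Kepler's provenance framework: `ProvenanceStore`, `run_step` and the record layout are hypothetical, and the inputs are assumed to be JSON-serializable.

```python
import hashlib
import json


class ProvenanceStore:
    """Hypothetical in-memory record of which step produced which output."""

    def __init__(self):
        self.records = {}

    @staticmethod
    def key(step, params, inputs):
        payload = json.dumps(
            {"step": step, "params": params, "inputs": inputs}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_step(store, step, func, params, inputs):
    """Run `func` unless an identical earlier run is on record.

    Returns (output, reused), where `reused` says whether the step was skipped.
    """
    key = store.key(step, params, inputs)
    if key in store.records:
        return store.records[key]["output"], True
    output = func(inputs, **params)
    store.records[key] = {
        "step": step, "params": params, "inputs": inputs, "output": output,
    }
    return output, False
```

Changing a parameter or an input changes the key, so only the affected steps execute again; the stored records double as a trace of which parameters produced which results.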

60 Kepler Provenance Recording Utility
Parametric and customizable
– Different report formats
– Variable levels of detail
  Verbose-all, verbose-some, medium, on error
– Multiple cache destinations
Saves information on
– User name, date, run, etc…

61 ISSGC06 – Ischia, Italy Provenance: Possible Next Steps Kepler Provenance –Deciding on terms and definitions –.kar file generation, registration and search for provenance information –Possible data/metadata formats –Automatic report generation from accumulated data –A GUI to keep track of the changes –Adding provenance repositories –A relational schema for the provenance info in addition to the existing XML

62 What other system functions does provenance relate to?
Failure recovery, smart re-runs, semantic extensions, Kepler Data Grid, reporting and documentation, authentication, data registration
Re-run only the updated/failed parts
Guided documentation generation and updates

63 Where Kepler Meets the Grid
Abstract Grid workflow actors
– Stage-execute-fetch (sub-)workflows
  Copy files from one resource to computation node
  Perform execution, possibly through a grid job scheduler
  Get the result files back and continue with the rest of the workflow
– Actors
  Authenticate actor
  – over Globus Grid, SRB and databases
  Copy actor
  – for both stage and fetch
  Job executor actor
  – special wrappers for ssh-based execution, web service clients, Grid job runner proxies, and actors for Nimrod- and APST-based submissions
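The stage-execute-fetch pattern can be tried out locally by letting a temporary directory play the role of the computation node. This is a simplified sketch, not the Kepler actors: the function names are hypothetical, copying stands in for grid file transfer, and a direct subprocess call stands in for submission through a job scheduler.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def stage(local_file, compute_dir):
    """Copy an input file to the (simulated) computation node."""
    return Path(shutil.copy(local_file, compute_dir))


def execute(command, compute_dir):
    """Run the job in the compute directory; raises if it exits with an error."""
    subprocess.run(command, cwd=compute_dir, check=True)


def fetch(compute_dir, name, result_dir):
    """Copy a result file back from the (simulated) computation node."""
    return Path(shutil.copy(Path(compute_dir) / name, result_dir))


def stage_execute_fetch(local_file, command, output_name, result_dir):
    with tempfile.TemporaryDirectory() as compute_dir:
        stage(local_file, compute_dir)
        execute(command, compute_dir)
        # Fetch before the temporary "node" directory is cleaned up.
        return fetch(compute_dir, output_name, result_dir)
```

Keeping the three steps as separate functions mirrors the slide's split into copy and job-executor actors, so either side could be swapped for a remote implementation without touching the other.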

64 Where Kepler Meets the Grid
Monitoring actor
– Light monitoring: user notified of actor failure (e.g. NIMROD) upon completion
– Medium monitoring: same with immediate notification
– Heavy monitoring: notifies every communication including immediate actor failure
Filter actor
– Filtering and subsetting remote data of different formats
Data Discovery actor
Service Discovery actor
Storage actor
Transformation and Query actors
– Shim generation
– Querying of databases and mediators

65 ISSGC06 – Ischia, Italy Hot Topics in Kepler

66 To Sum Up
…
– is an open-source system and collaboration
– is a ~3 year-old project
– grows by application pull from contributors
– most topics are designed jointly
– is developed by multiple developers under different projects in different countries
– is now being used in actual scientific research
The screen shots were results of initial success!
There is a lot more to cover and work on…
– New foci at SDSC-Kepler around provenance and distributed computing

67 ISSGC06 – Ischia, Italy Amarnath Gupta +1 (858) Questions… Thanks!