10 Sep 2005 NVO Summer School
Managing VO data and process flows
Matthew J. Graham
CACR/Caltech
THE US NATIONAL VIRTUAL OBSERVATORY

Overview
Astronomical data
VOStore/VOSpace
Workflows
Astrogrid workflow
CEA

The importance of data
Data is the raison d'être of the VO
LSST is the data source nonpareil
– data rates of 540 MB/s, ~16 TB in 8 hrs
– final archive > 3 PB of data
[Figure: VO Wheel]
Well-established ways of handling distributed data:
– SRB
– PVFS
– OGSA-DAI

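The "~16 TB in 8 hrs" figure follows directly from the quoted data rate. The short check below is only a back-of-envelope sketch; it assumes decimal units (1 TB = 10^6 MB) and a sustained rate over the whole night, neither of which the slide states.

```python
# Back-of-envelope check of the nightly data volume quoted on the slide.
RATE_MB_PER_S = 540   # data rate from the slide
NIGHT_HOURS = 8       # observing time from the slide


def nightly_volume_tb(rate_mb_per_s: float, hours: float) -> float:
    """Return the data volume in TB, assuming 1 TB = 10**6 MB."""
    seconds = hours * 3600
    return rate_mb_per_s * seconds / 1e6


volume_tb = nightly_volume_tb(RATE_MB_PER_S, NIGHT_HOURS)
print(f"{volume_tb:.2f} TB per night")  # 15.55 TB, i.e. roughly 16 TB
```
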
Data use cases
Client has data:
– stored locally: transfers it to service
– stored locally: service retrieves it
– stored elsewhere: service retrieves it
Service generates data:
– stores it locally: notifies client of location
– transfers it to the client's local store
– transfers it to a client-designated store

VOStore
Provides a uniform interface to existing or new data storage locations (Facade pattern)
Structured/unstructured data both first level
Methods:
– get
– put
– list / listAll
– importInit
– importData (sync/async)
– exportInit
– exportData (sync/async)
– delete
– rename

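To illustrate the Facade idea, here is a minimal sketch in Python. It is not the VOStore interface definition: only the method names get, put, list, delete and rename come from the slide, while the `Backend` and `MemoryBackend` classes, the argument lists and the byte-string payloads are assumptions made for this example.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class Backend(ABC):
    """Hypothetical storage backend hidden behind the facade."""

    @abstractmethod
    def read(self, key: str) -> bytes: ...

    @abstractmethod
    def write(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def keys(self) -> List[str]: ...

    @abstractmethod
    def remove(self, key: str) -> None: ...


class MemoryBackend(Backend):
    """Toy in-memory backend; a real one might wrap a file system or SRB."""

    def __init__(self) -> None:
        self._items: Dict[str, bytes] = {}

    def read(self, key: str) -> bytes:
        return self._items[key]

    def write(self, key: str, data: bytes) -> None:
        self._items[key] = data

    def keys(self) -> List[str]:
        return list(self._items)

    def remove(self, key: str) -> None:
        del self._items[key]


class VOStoreFacade:
    """Uniform interface over any Backend (signatures are assumptions)."""

    def __init__(self, backend: Backend) -> None:
        self._backend = backend

    def get(self, name: str) -> bytes:
        return self._backend.read(name)

    def put(self, name: str, data: bytes) -> None:
        self._backend.write(name, data)

    def delete(self, name: str) -> None:
        self._backend.remove(name)

    def rename(self, old: str, new: str) -> None:
        # Copy under the new name first, then drop the old entry.
        self._backend.write(new, self._backend.read(old))
        self._backend.remove(old)

    def list(self):
        return sorted(self._backend.keys())


store = VOStoreFacade(MemoryBackend())
store.put("image.fits", b"pixels")
store.rename("image.fits", "m31.fits")
print(store.list())  # ['m31.fits']
```

Client code only ever sees `VOStoreFacade`, so swapping `MemoryBackend` for another backend would not change the calling code.
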
VOSpace
Orchestrates VOStores:
– data collections: directories, user-defined
– authorisation: user groups
– processing efficiency: where is the nearest copy?
move copy identifiers

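The "where is the nearest copy?" question can be sketched as a lookup over known replicas of an identifier. Everything below is hypothetical: the `Replica` record, the numeric `cost` used as a stand-in for network distance, and the `register`, `copy`, `move` and `nearest` signatures are assumptions for illustration, not the VOSpace design.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Replica:
    store: str    # hypothetical name of the VOStore holding the copy
    cost: float   # hypothetical transfer cost to the client; lower is nearer


class VOSpaceSketch:
    """Tracks which stores hold a copy of each identifier."""

    def __init__(self) -> None:
        self._replicas: Dict[str, List[Replica]] = {}

    def register(self, identifier: str, replica: Replica) -> None:
        self._replicas.setdefault(identifier, []).append(replica)

    def copy(self, identifier: str, target: Replica) -> None:
        # A copy adds a replica; existing replicas stay in place.
        self.register(identifier, target)

    def move(self, identifier: str, source_store: str, target: Replica) -> None:
        # A move drops the replica held at the source store, then adds the target.
        kept = [r for r in self._replicas[identifier] if r.store != source_store]
        self._replicas[identifier] = kept + [target]

    def nearest(self, identifier: str) -> Replica:
        return min(self._replicas[identifier], key=lambda r: r.cost)


space = VOSpaceSketch()
space.register("ivo://example/m31", Replica("store-a", cost=40.0))
space.copy("ivo://example/m31", Replica("store-b", cost=5.0))
print(space.nearest("ivo://example/m31").store)  # store-b
```
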
A virtual super-peer data network?

How to manage the flows?
Way of describing a flow:
– processes/steps, inputs/outputs, serial/parallel execution, control logic, variables, inline scripting
– preferably XML (verbose but rigorous)
Way of controlling a flow: engine
e-Science vs. e-Business:
– open-ended vs. closed
– verification and publication
– static vs. dynamic workflows
– volume and type of data
– meta-transactions
– customer, manager and user vs. scientist

Workflow patterns
[Diagrams lost in extraction; pattern names recovered from the slide:]
– Sequence
– Parallel split (AND)
– Synchronisation (AND)
– Exclusive choice (XOR)
– Simple merge (XOR)
– Multi choice
– Multi merge
– Synchronizing merge
– Discriminator
– Deferred choice
– Multiple instances with/without synchronisation
– Implicit termination
– Interleaved parallel routing
– Milestone

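Three of the simpler patterns (sequence, parallel split followed by synchronisation, and exclusive choice) can be sketched with plain Python callables. This is only an illustration of the control-flow ideas; the function names are made up here and do not correspond to any workflow language or engine named in these slides.

```python
from concurrent.futures import ThreadPoolExecutor


def sequence(steps, value):
    """Sequence: each step consumes the output of the previous one."""
    for step in steps:
        value = step(value)
    return value


def parallel_split_and_sync(branches, value):
    """Parallel split (AND) then synchronisation: run every branch, wait for all."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(branch, value) for branch in branches]
        # Collecting results in submission order blocks until every branch is done.
        return [future.result() for future in futures]


def exclusive_choice(condition, if_true, if_false, value):
    """Exclusive choice (XOR): exactly one branch runs."""
    branch = if_true if condition(value) else if_false
    return branch(value)


def add_one(x):
    return x + 1


def double(x):
    return x * 2


print(sequence([add_one, double], 3))                 # 8
print(parallel_split_and_sync([add_one, double], 3))  # [4, 6]
print(exclusive_choice(lambda x: x > 0, add_one, double, -2))  # -4
```
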
Workflow kerfuffle
Workflow languages: BPEL (BPEL4WS, WSBPEL, WSFL, XLANG), BPML, WS-CDL (WSCL, WSCI), XPDL, BPSS, PSL, AGWL, DGL, DPML, GJobDL, GSFL, GFDL, GWorkflowDL, MoML, SWFL, YAWL, SCUFL/Xscufl, WPDL, PIF, PSL, OWL-S, xWFL, XPL, INCA
Workflow engines: Taverna, Kepler, Pegasus, DiscoveryNet, Triana, SPA, Geodise, ICENI, Askalon, GridNexus, BioPipe, BizTalk, BPWS4J, DAGMan, GridAnt, GJH, GRMS, GWFE, GWES, ITIEE, JIGSA, Karajan, ScyFLOW, SDSC Matrix, SHOP2, wftk, YAWL Engine, WFEE

Astrogrid workflow components
JES (Job Execution System)
– Astrogrid workflow engine
– Manages control flow
– Runs steps in a controlled asynchronous fashion
CEC (Common Execution Controller)
– Manages step execution
– Manages data flow
CEA (Common Execution Architecture) apps
– datacenters: support complex queries against archives
– processing: consume data files and reduce them

Astrogrid workflow schematic
[Diagram lost in extraction; surviving labels:]
Components: Portal, Registry, MySpace, Command Line, CEA Datacenter, CEA, JES, Client library, CEC
Interactions: Save/load workflow, Save/load data, Resolve application, Application list, Submit workflow

Astrogrid workflow language
[XML listing; markup lost in extraction. Surviving text:]
description of the workflow 21 ${dec} ftp://aServer/myResults … …

CEA
Create a uniform interface and model for an application and its parameters
Provides higher level description than WSDL:
– Restrict how interfaces can be expressed
– Provide specific semantics for astronomical quantities
– Extra information, such as default values, GUI labels
VOResource extensions for a general application
Provide asynchronous operation:
– callback, polling and job identification
Allow separate data and control flows

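The three ingredients of asynchronous operation (job identification, polling and callback) can be sketched with a thread-based runner. This is an illustrative sketch only: `AsyncRunner`, its method names and the status strings are assumptions for this example and are not taken from the CEA interfaces.

```python
import threading
import uuid


class AsyncRunner:
    """Runs callables in background threads and tracks them by job identifier."""

    def __init__(self):
        self._lock = threading.Lock()
        self._status = {}
        self._results = {}
        self._threads = {}

    def submit(self, func, args=(), callback=None):
        """Start a job and return its identifier immediately."""
        job_id = uuid.uuid4().hex
        with self._lock:
            self._status[job_id] = "RUNNING"   # hypothetical status name

        def work():
            try:
                result = func(*args)
            except Exception as exc:
                with self._lock:
                    self._results[job_id] = exc
                    self._status[job_id] = "ERROR"
                return
            with self._lock:
                self._results[job_id] = result
                self._status[job_id] = "COMPLETED"
            if callback is not None:
                callback(job_id, result)       # callback style notification

        thread = threading.Thread(target=work)
        self._threads[job_id] = thread
        thread.start()
        return job_id

    def poll(self, job_id):
        """Polling style: ask for the current status of a job."""
        with self._lock:
            return self._status[job_id]

    def wait(self, job_id):
        """Block until the job finishes, then return its result."""
        self._threads[job_id].join()
        with self._lock:
            return self._results[job_id]


runner = AsyncRunner()
notifications = []
job = runner.submit(sum, ([1, 2, 3],), callback=lambda j, r: notifications.append((j, r)))
print(runner.wait(job))   # 6
print(runner.poll(job))   # COMPLETED
```

The client holds only the job identifier, so control messages (submit, poll) stay separate from wherever the application reads and writes its data.
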
Minimum CEA compliance
Must implement CommonExecutionConnector interface
Must send a message to services implementing ResultsListener interface
Should send messages to services implementing JobMonitor interface
Should perform basic type checking on all parameter types during init phase

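The compliance rules above can be pictured as a small controller that checks parameter types at init time and then notifies listeners. The interface names ResultsListener and JobMonitor come from the slide; the method names `results_ready` and `status_changed`, the status strings, and the `SketchController` class itself are assumptions invented for this sketch, not the real CommonExecutionConnector contract.

```python
class ResultsListener:
    """Stand-in for a service implementing ResultsListener (method name assumed)."""

    def __init__(self):
        self.received = []

    def results_ready(self, job_id, results):
        self.received.append((job_id, results))


class JobMonitor:
    """Stand-in for a service implementing JobMonitor (method name assumed)."""

    def __init__(self):
        self.events = []

    def status_changed(self, job_id, status):
        self.events.append((job_id, status))


class SketchController:
    def __init__(self, parameter_types, results_listeners, job_monitors=()):
        self._parameter_types = dict(parameter_types)
        self._results_listeners = list(results_listeners)
        self._job_monitors = list(job_monitors)
        self._jobs = {}

    def _notify_monitors(self, job_id, status):
        for monitor in self._job_monitors:
            monitor.status_changed(job_id, status)

    def init(self, job_id, parameters):
        # Basic type checking during the init phase.
        for name, value in parameters.items():
            if name not in self._parameter_types:
                raise KeyError(f"unknown parameter: {name}")
            expected = self._parameter_types[name]
            if not isinstance(value, expected):
                raise TypeError(f"{name} must be {expected.__name__}")
        self._jobs[job_id] = dict(parameters)
        self._notify_monitors(job_id, "INITIALIZED")

    def execute(self, job_id, application):
        self._notify_monitors(job_id, "RUNNING")
        results = application(**self._jobs[job_id])
        self._notify_monitors(job_id, "COMPLETED")
        # Results listeners are always told about the outcome.
        for listener in self._results_listeners:
            listener.results_ready(job_id, results)
        return results


listener = ResultsListener()
monitor = JobMonitor()
controller = SketchController({"ra": float, "dec": float}, [listener], [monitor])
controller.init("job-1", {"ra": 10.0, "dec": -5.0})
controller.execute("job-1", lambda ra, dec: {"sum": ra + dec})
print(listener.received)  # [('job-1', {'sum': 5.0})]
```
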