WS-VLAM: Towards a Scalable Workflow System on the Grid
V. Korkhov, D. Vasyunin, A. Wibisono, V. Guevara-Masis, A. Belloum
Institute of Informatics, Faculty of Science, University of Amsterdam
Outline
- Introduction: what is WS-VLAM?
- Architecture of WS-VLAM
- Large-scale workflow support:
  - Distributed workflow engine and multi-cluster execution support
  - Hierarchical resource management and workload balancing
  - Workflow farming
  - Semantic workflow support
- Conclusions
Introduction
WS-VLAM (Virtual Lab AMsterdam) concepts:
- Data-driven workflow system
- Data streaming between workflow components running on the Grid
- Components: input and output ports for data exchange; parameters for control (also during runtime); graphical output (X11) supported
- GUI and engine decoupled, interfaced using WS-RF
Engine (RTS, Run Time System):
- Implemented as a GT4 WS-RF service
- Uses GT4 features (delegation service, GSI, notifications, etc.)
WS-VLAM architecture
Large-scale distributed workflow support
- Multi-cluster distributed experiments: distributed workflow engine
- Heterogeneous resources: workload balancing and resource management
- Complex workflows with parameter sweeps and iterative processing: workflow farming
- Semantic support
Distributed workflow engine
[Figure: the WS-VLAM GUI and the Resource Manager connected to two clusters (Cluster 1, Cluster 2). Labels per cluster: GT4 Service Container, WS-RTSM Factory, WS-RTSM Instance (distributed RTSM), EPR, GUI proxy, Data proxy, GRAM, worker nodes, workflow components.]
Hierarchical resource management and workload balancing
- Task level: adaptive workload balancing for parallel applications (MPI) on heterogeneous resources
- Job level: inter-task workload distribution and balancing for multi-task applications (DIANE user-level scheduling environment)
- Workflow level: workflow farming
Workload balancing strategy (parallel and multi-task applications)
- A divisible workload is distributed between tasks based on application characteristics (communication-to-computation ratio) and resource characteristics (CPU, memory, bandwidth)
- Weights are assigned to all resources that execute tasks, according to their capacities
- A fast heuristic algorithm gives an approximate weighting of the resources processing the workload
- Iterative processing of similar data: execution performance is measured for each iteration, and the weights (and thus the workload distribution) are adapted on the fly
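The weighting and on-the-fly adaptation described on this slide can be illustrated with a minimal sketch. It is not the heuristic used in WS-VLAM, which the slides do not specify; it assumes a plain proportional split of work units and an exponential blend of old weights with measured throughput. The names `split_workload`, `adapt_weights` and the `smoothing` parameter are hypothetical.

```python
def normalize(weights):
    """Scale weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def split_workload(total_units, weights):
    """Split an integer number of work units proportionally to the weights.

    Uses largest-remainder rounding so the parts always sum to total_units.
    """
    shares = [total_units * w for w in normalize(weights)]
    counts = [int(s) for s in shares]
    leftover = total_units - sum(counts)
    # Hand the remaining units to the resources with the largest fractions.
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


def adapt_weights(weights, units_done, seconds, smoothing=0.5):
    """Blend the old weights with the throughput measured in the last iteration.

    Assumption: throughput (units per second) stands in for resource capacity;
    every entry of `seconds` must be positive.
    """
    measured = normalize([u / t for u, t in zip(units_done, seconds)])
    old = normalize(weights)
    return [(1 - smoothing) * o + smoothing * m for o, m in zip(old, measured)]


if __name__ == "__main__":
    weights = [1.0, 1.0]                      # start with equal weights
    counts = split_workload(100, weights)     # first iteration: even split
    # Pretend the second resource finished its share twice as fast.
    weights = adapt_weights(weights, counts, [10.0, 5.0])
    print(split_workload(100, weights))       # next iteration favours resource 2
```

A smoothing factor below 1 keeps a single noisy measurement from swinging the distribution too far; a value of 1 would trust only the latest iteration.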
Workflow farming: adaptive data distribution
- Each farmed workflow first gets a single data element to process, to assess its performance.
- The processing speed is evaluated, and the future workload distribution is determined from this information.
- Weights reflecting the performance are assigned to the workflows.
- Iterative processing: independent data or parameters
[Figure: a workflow WF is farmed as WF 1 and WF 2, fed by an Estimator and a Distributor. WF 1 is twice as slow, so the weights are W=1 for WF 1 and W=2 for WF 2.]
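The estimator and distributor roles on this slide can be sketched roughly as follows. This is an illustration under stated assumptions, not the WS-VLAM implementation: farmed workflows are modelled as plain Python callables, the probe runs sequentially rather than on separate clusters, and the names `probe`, `estimate_weights` and `distribute` are hypothetical.

```python
import time


def probe(workflows, items):
    """Give each workflow one data element and time how long it takes.

    `workflows` maps a name to a callable standing in for a farmed workflow.
    """
    seconds = {}
    for (name, run), item in zip(workflows.items(), items):
        start = time.perf_counter()
        run(item)
        # Guard against a zero reading on very coarse timers.
        seconds[name] = max(time.perf_counter() - start, 1e-9)
    return seconds


def estimate_weights(probe_seconds):
    """Estimator: weight is inversely proportional to the probe time.

    Scaled so that the slowest workflow has weight 1.
    """
    slowest = max(probe_seconds.values())
    return {name: slowest / t for name, t in probe_seconds.items()}


def distribute(items, weights):
    """Distributor: give each item to the workflow with the least weighted load."""
    queues = {name: [] for name in weights}
    for item in items:
        target = min(queues, key=lambda n: (len(queues[n]) + 1) / weights[n])
        queues[target].append(item)
    return queues


if __name__ == "__main__":
    # The slide's example: WF 1 needs twice as long per element as WF 2.
    weights = estimate_weights({"WF1": 2.0, "WF2": 1.0})
    queues = distribute(range(9), weights)
    print(weights)
    print({name: len(q) for name, q in queues.items()})
```

With the slide's example, WF 1 receives weight 1 and WF 2 weight 2, so WF 2 is handed about twice as many of the remaining elements.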
Workflow farming: WF service
- Workflows WF 1, WF 2 and WF 3 are running with a WS interface, ready to process data from the Resource Manager (RM) "on demand".
[Figure: the GUI passes the XML topology and the data to farm to the Resource Manager; the RTSM Factory provides a list of WS-RTSM EPRs; WF 1, WF 2 and WF 3 run under WS-RTSM 1, WS-RTSM 2 and WS-RTSM 3 and return performance data.]
Semantic workflow support
Conclusions
WS-VLAM features supporting large-scale data-driven workflows:
- Multi-cluster support for a single workflow, with data exchange between internal nodes of different clusters
- Adaptive workload balancing for parallel applications (workflow components) on heterogeneous resources
- Workload balancing at the workflow level: parameter/data sweep for workflows
- Semantic support for workflow composition