Experiments with SmartGridSolve: Achieving Higher Performance by Improving the GridRPC Model
Thomas Brady, Michele Guidolin, Alexey Lastovetsky
Heterogeneous Computing Laboratory, School of Computer Science and Informatics, University College Dublin
Introduction
  GridSolve
    Programming system for distributed computing
    Based on RPC
  SmartGridSolve
    An extension of GridSolve
    Aims to achieve higher performance
Motivation
  Core aspects of GridSolve affecting application performance
    Mapping: how tasks are assigned to servers
    Execution model: how tasks are executed in the distributed environment
Motivation: GridSolve Overview
  Maps tasks individually on a star network
    Mapping: each task is mapped individually
    Execution model: client-server model (star network)
Motivation: SmartGridSolve Overview
  Maps a group of tasks on a fully connected network
    Mapping: a group of tasks is mapped together
    Execution model: fully connected network
Motivation: Performance Increase
  Sources of the performance increase
    Improved load balancing of computation
    Reduced volume of communication
    Improved load balancing of communication
Motivation: Performance Increase
  GridSolve
    Tasks are mapped individually
    Slow servers may be assigned large tasks
  SmartGridSolve: improves load balancing of computation
    Each task in the group is known prior to mapping
    The workload of the group of tasks is balanced across servers
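The difference between the two mapping styles can be illustrated with a small simulation. This is a minimal sketch and not GridSolve or SmartGridSolve code: the function names, the greedy earliest-finish rule used for individual mapping, and the exhaustive search used for group mapping are all assumptions made for illustration. Communication cost is ignored here.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

#define LB_MAX_SERVERS 8
#define LB_MAX_TASKS 8

/* Individual mapping (assumed rule): tasks are handled in call order and
 * each one goes to the server that would finish it earliest, given the
 * work already assigned. Returns the makespan in seconds. */
double lb_map_individually(const double *task_flops, size_t n_tasks,
                           const double *server_flops_per_s, size_t n_servers,
                           size_t *assignment)
{
    double busy_until[LB_MAX_SERVERS] = {0};
    double makespan = 0.0;
    assert(n_servers > 0 && n_servers <= LB_MAX_SERVERS);
    for (size_t t = 0; t < n_tasks; t++) {
        size_t best = 0;
        double best_finish = DBL_MAX;
        for (size_t s = 0; s < n_servers; s++) {
            double finish = busy_until[s] + task_flops[t] / server_flops_per_s[s];
            if (finish < best_finish) { best_finish = finish; best = s; }
        }
        busy_until[best] = best_finish;
        assignment[t] = best;
        if (best_finish > makespan) makespan = best_finish;
    }
    return makespan;
}

/* Group mapping (assumed rule): all tasks are known up front, so every
 * possible assignment is tried and the one with the smallest makespan
 * is kept. Only practical for very small groups. */
double lb_map_group(const double *task_flops, size_t n_tasks,
                    const double *server_flops_per_s, size_t n_servers,
                    size_t *assignment)
{
    size_t trial[LB_MAX_TASKS] = {0};
    double best_makespan = DBL_MAX;
    assert(n_tasks > 0 && n_tasks <= LB_MAX_TASKS);
    assert(n_servers > 0 && n_servers <= LB_MAX_SERVERS);
    for (;;) {
        double busy[LB_MAX_SERVERS] = {0};
        double makespan = 0.0;
        for (size_t t = 0; t < n_tasks; t++)
            busy[trial[t]] += task_flops[t] / server_flops_per_s[trial[t]];
        for (size_t s = 0; s < n_servers; s++)
            if (busy[s] > makespan) makespan = busy[s];
        if (makespan < best_makespan) {
            best_makespan = makespan;
            for (size_t t = 0; t < n_tasks; t++) assignment[t] = trial[t];
        }
        /* Advance to the next assignment, odometer style. */
        size_t pos = 0;
        while (pos < n_tasks && ++trial[pos] == n_servers) { trial[pos] = 0; pos++; }
        if (pos == n_tasks) break;
    }
    return best_makespan;
}
```

With two tasks of 40 and 100 flops and two servers of 10 and 8 flops/s, the individual rule places the small task on the faster server first and then sends the large task to the slower one, while the group search swaps them and finishes earlier.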
Motivation: Performance Increase
  GridSolve
    Data transfers are mapped to client links only
    Unnecessary data transfers
    Client links are heavily loaded
  SmartGridSolve: reduces the volume of communication
    Data dependencies are known
    Reduces data transfers
    Eliminates bridge communication
      Server-server communication
      Caching
  SmartGridSolve: improves load balancing of communication
    Volumes of data transfers are known
    More even distribution of the communication load
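The effect on communication volume can be sketched by counting bytes moved for each data dependency between tasks. This is an illustrative model only: the `comm_dependency` type and both function names are hypothetical, and it assumes that intermediate results are not needed by the client itself.

```c
#include <assert.h>
#include <stddef.h>

/* One data dependency: a buffer produced by a task on one server and
 * consumed by a later task on another (or the same) server. */
typedef struct {
    size_t producer_server;
    size_t consumer_server;
    size_t bytes;
} comm_dependency;

/* Star network: every dependency is bridged through the client, so the
 * data crosses a client link twice (server to client, client to server),
 * even when producer and consumer are the same server. */
size_t comm_volume_star(const comm_dependency *deps, size_t n)
{
    size_t total = 0;
    assert(deps != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        total += 2 * deps[i].bytes;
    return total;
}

/* Fully connected network: data moves directly between servers, and a
 * dependency that stays on one server is served from its cache. */
size_t comm_volume_connected(const comm_dependency *deps, size_t n)
{
    size_t total = 0;
    assert(deps != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (deps[i].producer_server != deps[i].consumer_server)
            total += deps[i].bytes;
    return total;
}
```

For one 100-byte dependency between two different servers and one 50-byte dependency that stays on a single server, the star model moves 300 bytes and the fully connected model moves 100 bytes.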
Design: GridSolve Mapping
  [Diagram. Inputs: Star Network Discovery; Task Discovery (time, flops). Process: Individual Task Mapping Heuristic. Output: Mapping]
Design: GridSolve Execution Model
  [Diagram labels: Client; Mapping; send input args; execute task; recv output args]
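The three steps of the client-server execution model (send input args, execute task, recv output args) suggest a simple cost estimate for one call. The following is a minimal sketch under stated assumptions: the `star_server` type and both functions are hypothetical, each server is described by one compute speed and one client-server bandwidth, and latency is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed per-server view of the star network. */
typedef struct {
    double flops_per_s;  /* compute speed of the server */
    double bytes_per_s;  /* bandwidth of its link to the client */
} star_server;

/* Estimated time of one call: send inputs, execute, receive outputs. */
double star_call_time(const star_server *s, double flops,
                      double in_bytes, double out_bytes)
{
    assert(s != NULL && s->flops_per_s > 0.0 && s->bytes_per_s > 0.0);
    return in_bytes / s->bytes_per_s
         + flops / s->flops_per_s
         + out_bytes / s->bytes_per_s;
}

/* Pick the server with the smallest estimated call time. */
size_t star_pick_server(const star_server *servers, size_t n,
                        double flops, double in_bytes, double out_bytes)
{
    size_t best = 0;
    assert(servers != NULL && n > 0);
    for (size_t i = 1; i < n; i++)
        if (star_call_time(&servers[i], flops, in_bytes, out_bytes) <
            star_call_time(&servers[best], flops, in_bytes, out_bytes))
            best = i;
    return best;
}
```

Under this model a server with a slower processor can still be the better choice when its link to the client is much faster.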
Design: SmartGridSolve Mapping a Group of Tasks
  [Diagram labels: Performance Model of Fully Connected Network; Task Graph; Mapping Heuristics (Mapping Heuristic 1, Mapping Heuristic 2, ..., Mapping Heuristic N); Mapping]
Design: SmartGridSolve Executing a Group of Tasks
  [Diagram labels: Client; Mapping; caching; server comm]
Design: SmartGridSolve Extensions
  Processes for mapping individual tasks on a star network
    Star Network Discovery
    Individual Task Discovery
    Mapping Heuristic
    Executing on a Client-Server Execution Model
  Processes for mapping a group of tasks on a fully connected network
    Fully Connected Network Discovery
    Group of Tasks Discovery
    Mapping Heuristic
    Executing on a Server Comm. Enabled Execution Model
Design: Network Discovery
  Star Network Discovery
    Static performance (LINPACK benchmark)
    Dynamic performance (CPU load)
    Dynamic client-server bandwidth
  Fully-Connected Network Discovery
    Dynamic server-server bandwidth
Design: GridSolve Task Discovery
  Static parameters
    Number of arguments
    Argument direction (input, output)
    Argument kind (scalar, non-scalar)
    Argument object types (matrix, vector, etc.)
    Argument data types (int, double, etc.)
    Function for complexity (flops)
  Run-time parameters
    Dimensions of arguments
    Variables of the complexity function
Design: GridSolve Task Discovery
  Static parameters: discovered when the server is registered
  Run-time parameters: discovered when the task is called
  [Diagram labels: complexity; input arg size; output arg size]
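The split between static and run-time parameters can be sketched as two records: one filled in once per task type, one computed at call time. This is an illustration only and not the GridSolve data model: all `td_` names are hypothetical, every non-scalar argument is assumed to depend on a single problem size `n`, and the matrix multiply descriptor is an invented example task.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { TD_INPUT, TD_OUTPUT } td_direction;
typedef enum { TD_SCALAR, TD_VECTOR, TD_MATRIX } td_object;
typedef enum { TD_INT, TD_DOUBLE } td_datatype;

typedef struct {
    td_direction direction;
    td_object object;
    td_datatype datatype;
} td_argument;

/* Static parameters: known when the server registers the task. */
typedef struct {
    const char *name;
    size_t n_args;
    const td_argument *args;
    double (*complexity)(size_t n);  /* flops as a function of problem size */
} td_static_params;

/* Run-time parameters: known only when the task is called. */
typedef struct {
    double flops;
    size_t input_bytes;
    size_t output_bytes;
} td_runtime_params;

static size_t td_argument_bytes(const td_argument *arg, size_t n)
{
    size_t elem = (arg->datatype == TD_INT) ? sizeof(int) : sizeof(double);
    switch (arg->object) {
    case TD_VECTOR: return elem * n;
    case TD_MATRIX: return elem * n * n;
    case TD_SCALAR:
    default:        return elem;
    }
}

/* Evaluate the complexity function and argument sizes for one call. */
td_runtime_params td_discover_at_call(const td_static_params *task, size_t n)
{
    td_runtime_params rt = {0.0, 0, 0};
    assert(task != NULL && task->complexity != NULL);
    rt.flops = task->complexity(n);
    for (size_t i = 0; i < task->n_args; i++) {
        size_t bytes = td_argument_bytes(&task->args[i], n);
        if (task->args[i].direction == TD_INPUT) rt.input_bytes += bytes;
        else rt.output_bytes += bytes;
    }
    return rt;
}

/* Hypothetical example task: dense n-by-n matrix multiply C = A * B. */
double td_matmul_flops(size_t n)
{
    return 2.0 * (double)n * (double)n * (double)n;
}

const td_argument td_matmul_args[3] = {
    {TD_INPUT,  TD_MATRIX, TD_DOUBLE},
    {TD_INPUT,  TD_MATRIX, TD_DOUBLE},
    {TD_OUTPUT, TD_MATRIX, TD_DOUBLE},
};

const td_static_params td_matmul = {"matmul", 3, td_matmul_args, td_matmul_flops};
```

Calling `td_discover_at_call(&td_matmul, n)` yields the three quantities shown in the diagram: complexity, total input argument size and total output argument size.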
Design: SmartGridSolve Group of Tasks Discovery
  Build a task graph before any of the tasks are called
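One way to picture building a task graph ahead of execution is to record each call's input and output buffers and then link every reader to the most recent earlier writer of the same buffer. The sketch below is an assumption about how such a graph could be derived, not the SmartGridSolve implementation; the `tg_` names and the fixed-size limits are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TG_MAX_ARGS 4
#define TG_MAX_TASKS 16

/* A recorded (not yet executed) remote call and the buffers it touches. */
typedef struct {
    const char *name;
    const void *inputs[TG_MAX_ARGS];
    size_t n_inputs;
    const void *outputs[TG_MAX_ARGS];
    size_t n_outputs;
} tg_call;

static int tg_writes(const tg_call *call, const void *buffer)
{
    for (size_t k = 0; k < call->n_outputs; k++)
        if (call->outputs[k] == buffer) return 1;
    return 0;
}

/* Fill edge[i][j] = 1 when call j reads a buffer whose most recent
 * earlier writer is call i. Returns the number of edges found. */
size_t tg_build(const tg_call *calls, size_t n_calls,
                unsigned char edge[TG_MAX_TASKS][TG_MAX_TASKS])
{
    size_t n_edges = 0;
    assert(n_calls <= TG_MAX_TASKS);
    memset(edge, 0, sizeof(unsigned char) * TG_MAX_TASKS * TG_MAX_TASKS);
    for (size_t j = 0; j < n_calls; j++) {
        for (size_t a = 0; a < calls[j].n_inputs; a++) {
            /* Walk backwards from call j-1 to call 0. */
            for (size_t i = j; i-- > 0; ) {
                if (tg_writes(&calls[i], calls[j].inputs[a])) {
                    if (!edge[i][j]) { edge[i][j] = 1; n_edges++; }
                    break;
                }
            }
        }
    }
    return n_edges;
}
```

Buffers that no earlier call has written produce no edge; in this model they would be supplied by the client.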
Design: SmartGridSolve Group of Tasks Discovery
  GridSolve
    Discovery, mapping and execution are one atomic operation
  SmartGridSolve
    To map a group of tasks, discovery and mapping are separated from the execution of tasks
  Addition to the GridSolve API
    gs_smart_map("ex_map") {
        // group of tasks
    }
Application
  Real-world application: Hydropad
  An astrophysics application that simulates the clustering of galaxies from the Big Bang to the present
Internal Structure
  Hydropad consists of four parts
    Initialisation
    Gravitation (FFT)
    Dark Matter (N-Body)
    Baryonic Matter (PPM)
Individual Mapping
    for (i = 0; i < nb_evolutions; i++) {
        ...
        grpc_call(grav_hndl, phiold, ...);
        grpc_call_async(dark_hndl, &sid_dark, x3dm, ...);
        grpc_call_async(bary_hndl, &sid_bary, v3bm, ...);
        /* wait for non blocking calls to finish */
        grpc_wait(sid_dark);
        grpc_wait(sid_bary);
        ...
    }
  Animation steps, in order: Discover, Map, Execute; Discover, Map, Execute; Discover, Map, Execute; Execute; Execute
Group Mapping
    gs_smart_map("ex_map") {
        for (i = 0; i < nb_evolutions; i++) {
            ...
            grpc_call(grav_hndl, phiold, ...);
            grpc_call_async(dark_hndl, &sid_dark, x3dm, ...);
            grpc_call_async(bary_hndl, &sid_bary, v3bm, ...);
            /* wait for non blocking calls to finish */
            grpc_wait(sid_dark);
            grpc_wait(sid_bary);
            ...
        }
    }
  Animation steps, in order: Start Discovery; Discover; Map Group of Tasks; Start Execution; Execute
  [Figures shown between Discover and Map Group of Tasks: Task Graph; Network Performance Model]
GridSolve Mapping of Evolution Stage
  [Figure. Task labels: barmatter 7, fields 5, darkmatter 1. Server labels: hcl08 3, hcl06 1. Annotations: poor load balancing; forced bridge communication]
SmartGridSolve Mapping of Evolution Stage
  [Figure. Task labels: barmatter 7, fields 5, darkmatter 1. Server labels: hcl08 3, hcl06 1. Annotations: improved load balancing; reduced communication volume; caching; server comm]
Results
  Published in paper: ~2 times speedup (client and servers on local network)
Results
  New enhancements
    Client and server broadcast
    Asynchronous communication
  ~5 times speedup (client and servers on local network)
  ~13 times speedup (client and servers on remote networks)
Conclusion
  SmartGridSolve improves performance
    Improved load balancing of computation
    Reduced volume of communication
    Improved load balancing of communication
Future Work
  Fault tolerance
  Improved synchronisation of tasks
  Functional performance model
  ADL language and compiler