Porting VisIt to BG/P
Brad Whitlock
October 14, 2009

Overview
Objectives
Building 3rd party libraries
Building VisIt
Running VisIt on BG/P
Improvements
Impact
Future work

Objectives
Port VisIt to IBM’s BlueGene/P platform so VisIt can run on LLNL’s Dawn and eventually Sequoia
  – Dawn is a 500-teraflop IBM BG/P system with 36,864 nodes and 147,456 CPU cores
  – Four 850 MHz PowerPC cores per node, 4 GB of memory per node
  – Compute nodes run the CNK OS
  – Code must be cross-compiled for CNK
Identify weaknesses in VisIt that prevent it from scaling to tens or hundreds of thousands of processors
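
The node and core counts above can be checked against each other, and they suggest a rough per-core memory budget. A quick sketch, assuming a node's memory is split evenly among its four cores (an assumption; the slide does not say how memory is partitioned):

```python
# Figures taken from the slide.
NODES = 36_864
CORES_PER_NODE = 4
MEM_PER_NODE_GB = 4

total_cores = NODES * CORES_PER_NODE

# Assumption: memory is divided evenly among the cores of a node.
mem_per_core_gb = MEM_PER_NODE_GB / CORES_PER_NODE

print(total_cores)      # 147456
print(mem_per_core_gb)  # 1.0
```

Under that assumption each core starts with about 1 GB, before the operating system and the application itself take their share.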

Building 3rd party libraries
Built all libraries on login nodes for the regular Linux PowerPC version of VisIt
  – Ran into runtime problems with the xlC compiler, so reverted to g++ for the time being
Cross-compiled all libraries for CNK
VisIt’s 3rd party libraries have no support for this platform, so special builds were required
Mesa was built unmangled and without X11
VTK was tricky to build
  – No OpenGL, so VTK was built with Mesa as its OpenGL
  – No X11, so created a custom render window
  – Used a CMake toolchain file

Building VisIt
No X11, so graphical components can’t be built for CNK (don’t build the gui)
Added a new --enable-engine-only build mode to VisIt’s build system that builds only the compute engine and its plugins
VisIt always used to require mangled Mesa
  – This support had to become conditional on VTK having mangled Mesa support

Running VisIt on Dawn
Dawn uses mpirun to start VisIt on compute nodes
  – Minor differences: environment variables had to be exported via the mpirun command, which could be handled via a host profile in VisIt
VisIt ran at 1k, 2k, 4k, 8k, and 16k nodes
VisIt ran with 1 and 4 trillion zone datasets (June 2009)
Encountered scaling problems early
  – Launch time was slow because each processor read the plugin directory to obtain plugin information
  – VisIt commands were sent from rank 0 to the other ranks 1 KB at a time until the whole message had been sent
  – The non-spinning bcast substitute used for sending commands relied on point-to-point communication that performed poorly at scale
  – Certain metadata consumed too much memory (each processor has only ~700 MB)
  – The synchronization step for SR mode used slow point-to-point communication
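
To see why sending commands 1 KB at a time was costly, it helps to count transfers. The sketch below is a hypothetical model, not VisIt code: the function `transfers_needed` and the 64 KB command size are illustrative assumptions, since the slides do not give real command sizes.

```python
import math

CHUNK_BYTES = 1024  # 1 KB chunk size, as stated on the slide


def transfers_needed(message_bytes: int, chunk_bytes: int) -> int:
    """Number of sends needed to move a message in fixed-size chunks."""
    if message_bytes <= 0:
        return 0
    return math.ceil(message_bytes / chunk_bytes)


# Hypothetical 64 KB command; actual sizes are not given in the slides.
command_bytes = 64 * 1024

chunked = transfers_needed(command_bytes, CHUNK_BYTES)
single = transfers_needed(command_bytes, command_bytes)

print(chunked, single)
```

Each transfer carries a fixed latency cost across every participating rank, so the number of transfers matters more as the rank count grows.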

Improvements
Broadcast plugin information from rank 0 to the other ranks, speeding up plugin loading 9x
Broadcast VisIt commands from rank 0 in a single chunk instead of 1 KB at a time
Use the standard bcast in the engine main loop instead of the poorly performing non-spinning substitute geared towards shared nodes
Switched to an alternate metadata representation to free up most available memory for calculations
Mark Miller replaced the SR mode synchronization step with a much faster version that reduced its time from 20 minutes to 2 seconds
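
The benefit of replacing a point-to-point loop with a collective can be sketched by counting communication rounds. This is an idealized model under two assumptions that the slides do not confirm: that the substitute had rank 0 send to each other rank in turn, and that the collective behaves like a binomial tree. The MPI collectives on BG/P may work differently. The function names are hypothetical.

```python
def linear_rounds(ranks: int) -> int:
    """Rank 0 sends to every other rank, one after another."""
    return max(ranks - 1, 0)


def tree_rounds(ranks: int) -> int:
    """Binomial tree: the set of ranks holding the data doubles each round.

    For ranks >= 1 this equals ceil(log2(ranks)), computed with integers.
    """
    return (ranks - 1).bit_length() if ranks > 1 else 0


for ranks in (1024, 4096, 16384, 65536):
    print(ranks, linear_rounds(ranks), tree_rounds(ranks))
```

In this model the linear scheme grows in step with the rank count, while the tree adds only one round each time the rank count doubles.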

Impact
So far this project’s impact on customers has been small
  – They do not yet run on Dawn
  – They might not notice small improvements at today’s everyday processor counts (<2k)
At higher processor counts (>4k), the optimizations added by this work prevent bottlenecks in the compute engine, improving scalability

Future work
Resolve load problems with the xlC compiler so we can use the best optimizations, including BG/P’s dual FPUs
Improve the 3rd party library build process for BG/P by adding support in the build_visit script
Continue profiling plots and improving performance
Reduce memory usage where possible
Investigate I/O patterns and attempt optimizations