HPC Forum, September 8-10, 2009
Steve Rowan
srowan at conveycomputer.com

Convey Hybrid-Core Computing
An x86 processor is combined with a coprocessor that implements highly parallel instructions
[Diagram: Applications; Convey Compilers; x86-64 Instructions and Coprocessor Instructions; Intel® Processor and Coprocessor; Application-Specific Personalities (Oil & Gas, Financial, Custom, CAE, Sciences); Cache-coherent shared virtual memory]
Copyright /10/09

Using Personalities
Program using ANSI standard C/C++ and Fortran
User specifies personality at compile time
OS demand loads personalities at runtime
[Diagram: C/C++ and Fortran sources; Convey Software Development Suite; Hybrid-Core Executable (x86-64 and Coprocessor Instructions); Convey HC-1 (Intel x86, Coprocessor); Personalities. Annotations: description file specifies available instructions; personality loaded at runtime by OS]

Language/Library Support for Massive Parallelism
Apply massive amounts of logic for a single thread of execution
  – Do that via specialization
  – Have hardware adapt to the application rather than the application adapting to the hardware
  – Parallelizing at the instruction level, not the core level
C/C++ and Fortran programming
  – No special languages
  – Code can run on x86 servers without coprocessors
[Diagram: four function pipes, each with Crossbar and Dispatch. Annotations: multiple function pipes for data parallelism; multiple units in each pipe for instruction-level parallelism; instructions can be very complex]

Development Tools
Program in ANSI standard C/C++ and Fortran
Unified compiler generates x86 & coprocessor instructions
Seamless debugging environment for Intel & coprocessor code
Executable can run on x86_64 nodes or on Convey Hybrid-Core nodes
[Diagram: C/C++ and Fortran95 front ends; Common Optimizer; Intel® 64 Optimizer & Code Generator; Convey Vectorizer & Code Generator; Procedural Personality Interface; Linker (with other objects); executable containing Intel® 64 code and Coprocessor code]

Multi Mode Compilation
Original code:
    for (j=0; j<N; j++)
        a[j] = b[j]+scalar*c[j];
Generated code:
    if(CP available) {
        coprocessor instructions
    } else {
        x86 instructions
    }
Convey systems are inherently heterogeneous
Can select from a set of architectures
Required architectures are dynamically loaded at runtime
Higher level parallelism supported via MPI or threads
[Diagram: Convey Multi Mode Compiler; Personality Definition Files; Convey backend; x86-64 backend]
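
The generated-code shape above can be sketched in plain C. This is only an illustration of the dispatch idea, not Convey compiler output: `coprocessor_available`, `triad_coprocessor`, and `triad_x86` are hypothetical names, the element type `double` is an assumption (the slide does not give one), and the coprocessor path is stubbed with ordinary C because real coprocessor instructions cannot be written here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical runtime check; NOT a Convey API. This sketch always
 * reports "no coprocessor", so the x86 path below is the one that runs. */
static bool coprocessor_available(void) {
    return false;
}

/* Hypothetical stand-in for the coprocessor code path. A multi-mode
 * compiler would emit coprocessor instructions here; the stub uses
 * the same scalar loop so the sketch stays runnable anywhere. */
static void triad_coprocessor(double *a, const double *b, const double *c,
                              double scalar, size_t n) {
    for (size_t j = 0; j < n; j++)
        a[j] = b[j] + scalar * c[j];
}

/* Host (x86-64) code path: the original loop from the slide. */
static void triad_x86(double *a, const double *b, const double *c,
                      double scalar, size_t n) {
    for (size_t j = 0; j < n; j++)
        a[j] = b[j] + scalar * c[j];
}

/* One entry point, two code paths selected at run time. */
void triad(double *a, const double *b, const double *c,
           double scalar, size_t n) {
    assert(a != NULL && b != NULL && c != NULL);
    if (coprocessor_available())
        triad_coprocessor(a, b, c, scalar, n);
    else
        triad_x86(a, b, c, scalar, n);
}
```

Calling `triad(a, b, c, 2.0, 4)` with `b = {1, 2, 3, 4}` and `c = {10, 20, 30, 40}` fills `a` through whichever path the check selects; the caller's source is the same in both cases, which is the point of the single-executable model described on the slide.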

Convey Runtime
Shared library launched by OS
If dlopen of shared library fails, Intel 64 code is executed
Personalities are demand loaded by OS at runtime
gdb debugging on HW & simulator
SPAT performance simulator
[Diagram: executable (Intel® 64 code, Coprocessor code, cny_runtime.o); Convey Shared Libraries; Convey Simulator; coprocessor hardware; x86-64 hardware; labels Custom, FAP, DP, SP]
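
The dlopen fallback mentioned above follows a common POSIX pattern, sketched here under stated assumptions. The library path and the symbol name `hypothetical_triad` are invented for illustration and are not the names used by `cny_runtime.o`. Only `dlopen`, `dlsym`, and `dlclose` from `<dlfcn.h>` are real interfaces. On some systems the program must be linked with `-ldl`.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Signature assumed for both the host kernel and the (hypothetical)
 * accelerated kernel exported by the shared library. */
typedef void (*triad_fn)(double *, const double *, const double *,
                         double, size_t);

/* Host fallback: plain C version of the triad loop. */
static void triad_host(double *a, const double *b, const double *c,
                       double scalar, size_t n) {
    for (size_t j = 0; j < n; j++)
        a[j] = b[j] + scalar * c[j];
}

/* Try the shared library first and fall back to host code if the
 * library or the symbol cannot be loaded.
 * Returns 1 if the library kernel ran, 0 if the host fallback ran. */
int run_triad(const char *lib_path, double *a, const double *b,
              const double *c, double scalar, size_t n) {
    assert(lib_path != NULL);
    void *handle = dlopen(lib_path, RTLD_NOW);
    if (handle != NULL) {
        /* "hypothetical_triad" is an illustrative symbol name. */
        triad_fn fn = (triad_fn)dlsym(handle, "hypothetical_triad");
        if (fn != NULL) {
            fn(a, b, c, scalar, n);
            dlclose(handle);
            return 1;
        }
        dlclose(handle);
    }
    triad_host(a, b, c, scalar, n);  /* dlopen or dlsym failed */
    return 0;
}
```

Passing a path that does not exist makes `dlopen` return `NULL`, so `run_triad` returns 0 and the result comes from the host loop. That mirrors the behavior the slide describes for a node without the Convey shared libraries.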

Debugging Hybrid-Core Applications
    (gdb) run
    Starting program: /home/guest/Desktop/DEMOS/compiler_demo/vec_auto.exe

    Breakpoint 1, main (argc=1, argv=0x7fffa9111ee8) at vec_main.c:19
    19 for (i=0; i<n; i++) {
    (gdb) disass
    Dump of assembler code for function main:
    0x : push %rbp
    0x : mov %rsp,%rbp
    0x c : add $0xffffffffffffffb0,%rsp
    0x : add $0xfffffffffffffff8,%rsp
    0x : fnstcw (%rsp)
    0x : andw $0xfcff,(%rsp)
    0x d : orw $0x300,(%rsp)
    (gdb) cont
    Continuing.

    Breakpoint 4, 0x a0 in __cny_region_triad0 ()
    (gdb) disass 0x8000a0 0x8000c0
    Dump of assembler code from 0x8000a0 to 0x8000c0:
    0x a0 : mov %a11,%VL
    0x a8 : ld.dw $0x0(%a10),%v0r
    0x ac : or %a11,$0,%a13
    0x b0 : ld.dw $0x0(%a9),%v1r
    0x b4 : add.sq %a12,%a13,%a12
    0x b8 : fma.fs %v0r,%s1,%v1r,%v0r
    0x c0 : st.dw %v0r,$0x0(%a8)
    End of assembler dump.
    (gdb)

Third Party Libraries
Third-party libraries run unmodified on the x86
Key kernels have been optimized by Convey
Third-party libraries can call Convey-optimized routines
  – BLAS
  – LAPACK
  – etc.