1
What hardware accelerators are you using/evaluating? Cells in a Roadrunner configuration
◦ 8-way SPE threads w/ local memory; DMA & vector-unit programming issues, but tremendous flexibility
◦ Fast (25.6 GB/s) & large memory (4 GB or larger)
◦ Augmented C language; also C++ & now Fortran; GNU & XL variants; OpenMP is new; OpenCL is being prototyped (see the vector sketch below)
◦ Opterons can run the bulk of code not needing acceleration; Cell-only clusters possible
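For a flavor of the "augmented C" vector programming on the SPEs, a minimal sketch using the Cell SDK's spu_intrinsics.h; the function name and array layout are illustrative assumptions, not code from the talk:

```c
#include <spu_intrinsics.h>

/* saxpy on the SPE vector unit: y = a*x + y, four floats per operation.
   Assumes n is a multiple of 4 and the arrays are 16-byte aligned. */
void saxpy_spu(float a, vector float *x, vector float *y, int n)
{
    vector float va = spu_splats(a);         /* replicate scalar a into all 4 lanes */
    for (int i = 0; i < n / 4; i++)
        y[i] = spu_madd(va, x[i], y[i]);     /* fused multiply-add on a 4-wide vector */
}
```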
2
What hardware accelerators are you using/evaluating? Several years ago…
◦ GPUs (pre CUDA & Tesla)
  Brook & Scout (LANL data-parallel language)
  No 32-bit at the time; limited memory; everything is a data-parallel problem
  No ECC memory; insufficient parity/ECC protection of data paths and logic
  Others at LANL still working in this area (including Tesla & CUDA)
◦ ClearSpeed (several years ago)
  Earliest ClearSpeeds, before the Advance families
  Augmented C language; 96 SIMD PEs
  Everything is done as long SIMD data parallel and in synch
  Low power
◦ FPGAs (HDL, several years ago)
  Programming is hard -- very hard
  Logic space limited the number of 64-bit ops
  Fast SRAM but small; external DRAM modest size but no faster than CPUs
  One algorithm at a time, so significant impact to use for multi-physics
  Low power
3
Describe the applications that you are porting to accelerators.
◦ MD (materials), laser-plasma PIC, IMC X-ray (particle) transport, GROMACS, n-body universe & galaxies, DNS turbulence & supernovae, HIV genealogy, nanowire long-time-scale MD
◦ Ocean circulation, wildfires, discrete social simulations, clouds & rain, influenza spread, plasma turbulence, plasma sheaths, fluid instabilities
My personal observations:
◦ Particle methods are generally easiest
◦ Codes with good characteristics:
  A few computationally intense "algorithms"
  Pre-existing or obvious "fine-grain" parallel work units (see the sketch below)
  C language versus Fortran or highly OO C++
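As an illustration of what an "obvious fine-grain parallel work unit" looks like in a particle code, a hypothetical C sketch; the struct, function name, and block size are invented for this example:

```c
#define BLOCK 1024   /* particles per work unit (hypothetical size) */

typedef struct { float x, y, z, vx, vy, vz; } particle_t;

/* Push one block of particles through one timestep.
   No data is shared between blocks, so each block can be handed
   to a separate thread, SPE, or other accelerator independently. */
void push_block(particle_t *p, int count, float dt)
{
    for (int i = 0; i < count; i++) {
        p[i].x += p[i].vx * dt;
        p[i].y += p[i].vy * dt;
        p[i].z += p[i].vz * dt;
    }
}
```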
4
Describe the kinds of speed-ups you are seeing (provide the basis for the comparison).
◦ 5x to 10x over a single Opteron core for codes that are memory-BW intensive and run at 5%-10% of peak
◦ 10x to 25x on particle methods, searches, etc.
How does it compare to scaling out (i.e., just using more x86 processors)? What are the bottlenecks to further performance improvements?
◦ Scale out via more sockets is better – BUT!
  Scaling efficiencies are already a problem for several LANL applications running at 4,000 to 10,000 cores; scaling out LANL-sized machines means $$$ for HW, space, & power
  Scaling out by multi-core is not a clear winner
◦ Memory BW and cache architectures often limit performance, which Cells mostly get around
◦ Memory BW per core is decreasing at an "inverse Moore's law" rate!
5
Describe the programming effort required to make use of the accelerator.
◦ ½ to 1 man-year to "convert" a code, mostly dealing with data structures and threaded parallelism designs
◦ The lack of debugging & similar tools is like the earliest days of parallel computing (LANL was a leader then as well – remember the early PVM Ethernet workstation "carpet" clusters in the mid-'80s, before MPPs)
◦ We like to see 1-2 programming experts (PhD-level or equiv.) assigned to forefront-science code projects, which have 1 to 4+ physics experts (PhD-level)
Amortization
◦ Ready for the future – codes and skilled programmers. We expect our dual-level (MPI+threads) & SIMD-vectorization techniques used for Roadrunner to pay off on future multi-core and many-core chips as well (see the sketch below).
◦ It's not just about running codes this year. Others will have to work through new forms of parallelism soon.
◦ We can do science now that isn't possible with most other machines
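A minimal sketch of the dual-level (MPI + threads) pattern on a multi-core chip, here using OpenMP for the threaded level as a stand-in for the Cell's SPE threads; the arrays and reduction are hypothetical placeholders for real physics work:

```c
#include <mpi.h>
#include <stdio.h>

enum { N = 1 << 20 };
static float a[N], b[N], c[N];

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    float local_sum = 0.0f, global_sum = 0.0f;

    /* Inner level: threads within the node; the unit-stride loop body
       is simple enough for the compiler to SIMD-vectorize. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < N; i++) {
        c[i] = a[i] * b[i] + c[i];
        local_sum += c[i];
    }

    /* Outer level: MPI across nodes (a reduction here, standing in for
       the usual halo exchanges). */
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %g\n", nranks, global_sum);

    MPI_Finalize();
    return 0;
}
```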
6
Compare accelerator cost to scaling-out cost
◦ Commodity-processor-only machines would have cost 2x what Roadrunner did in 2006-2007 (~$80M more)
◦ Used 2x or more power (~$1M per MW)
◦ Significantly larger node counts cause scaling & reliability issues
◦ Accelerators or heterogeneous chips should be greener
Ease-of-use issues
◦ Newer Cell programming techniques (ALF, OpenMP) could make this easier
◦ A Cell cluster would be easier, but the PPE is really, really slow for non-SPU-accelerated code segments
◦ Not for the faint of heart, but Top20 machines never are
7
What is the future direction of hardware-based accelerators?
◦ Domain-specific libraries can make them far more useful in those specific areas
◦ Some may appear on Intel QPI or AMD HT
◦ Specialized cores will show up within commodity microprocessors – ignore them or use them
◦ GPU-based systems will have to adopt ECC & parity protection
◦ Convey appears to have the most viable FPGA approach (FPGA as a compiler-managed co-processor)
Software futures?
◦ OpenCL looks promising but doesn't address programming the specialized accelerator devices themselves
◦ The uber-auto-wizard-compiler will never come
◦ Heterogeneous compilers may come
◦ Debuggers & tools may come
What are your thoughts on what the vendors need to do to ensure wider acceptance of accelerators?
◦ Create next-generation versions and sell them as mainstream products
8
Compile & run on the PowerPC PPE
Identify & isolate the algorithm & data to run parallel on 8 "remote" SPEs
Compile a scalar version of the algorithm on the SPE
◦ Add SPE thread process control
◦ Add DMAs
  Use "blocking" DMAs at this stage, just for functionality (see the sketch after this list)
  Worry about data alignments
◦ First on a single SPE, then on 8 SPEs
Optimize the SPE code
◦ SIMD, branch merges
◦ Add asynch double/triple buffering of DMAs
For Roadrunner, connect to the rest of the code on the Opteron via DaCS and a "message relay"
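A minimal SPE-side sketch of the "blocking DMA just for functionality" stage, using the Cell SDK's spu_mfcio.h interface; the chunk size, buffer layout, and doubling computation are hypothetical stand-ins for a real kernel:

```c
#include <spu_intrinsics.h>
#include <spu_mfcio.h>

#define CHUNK 4096                              /* bytes per DMA, multiple of 16 */

volatile float in[CHUNK / sizeof(float)]  __attribute__((aligned(128)));
volatile float out[CHUNK / sizeof(float)] __attribute__((aligned(128)));

int main(unsigned long long speid, unsigned long long argp, unsigned long long envp)
{
    const int tag = 1;

    /* Blocking DMA in: pull one chunk from Cell memory at effective address argp. */
    mfc_get(in, argp, CHUNK, tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();                  /* wait for the transfer to complete */

    /* Scalar compute stage -- SIMD and double buffering come later. */
    for (unsigned i = 0; i < CHUNK / sizeof(float); i++)
        out[i] = in[i] * 2.0f;

    /* Blocking DMA out: push results back to Cell memory (here, over the input). */
    mfc_put(out, argp, CHUNK, tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();

    return 0;
}
```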
9
Roadrunner is more than a petascale supercomputer for today's use
◦ provides a balanced platform to explore new algorithm designs and programming models, and to refresh developer skills
LANL has been an early adopter of transformational technology*:
◦ 1970s: HPC is scalar → LANL adopts vector (Cray 1 w/ no OS)
◦ 1980s: HPC is vector → LANL adopts data parallel (big CM-2)
◦ 2000s: HPC is multi-core clusters → LANL adopts hybrid (Roadrunner)
*Credit to Scott Pakin, CCS-1, for this list idea
10
Roadrunner hybrid work flow (swim lanes: Node/Opteron | Cell PPC | Cell SPE, 8-way parallel):
(1) Host (Opteron) launches Cell code over the PCIe link via DaCS
(2) Host data pushed/pulled to Cell memory
(3) Cell spawns parallel threads on the SPEs (sketched below)
(4) Until done: each SPE DMA multi-buffers Cell data into local memory, computes within its local memory buffers, and DMA multi-buffers data back to Cell memory
(5a) Parallel threads completed; updated data pushed/pulled back to the Host (PCIe link, DaCS)
(5b) Host exchanges data with the cluster via MPI
(6) Cell code completed
Simultaneously, the node (Opteron) runs non-accelerated code, may need to push/pull more data to/from the Cell & to/from the cluster, or could be available for concurrent work during this time.
Hardware view: Node (Opteron) w/ Node Memory -- PCIe link (DaCS) -- Serial PPC Processor w/ Cell Memory -- DMA -- Parallel SPE Processors w/ Local Memories.
How much can be automated in compilers or languages?
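For step (3), a hedged PPE-side sketch of spawning the 8 SPE threads using the standard libspe2 interface with pthreads; the embedded SPE program spe_kernel and the per-SPE data blocks are hypothetical:

```c
#include <libspe2.h>
#include <pthread.h>

#define NUM_SPES 8

extern spe_program_handle_t spe_kernel;   /* hypothetical embedded SPE program */

static void *run_spe(void *arg)
{
    spe_context_ptr_t ctx = spe_context_create(0, NULL);
    unsigned int entry = SPE_DEFAULT_ENTRY;

    spe_program_load(ctx, &spe_kernel);
    /* argp carries the effective address of this thread's data block (step 4). */
    spe_context_run(ctx, &entry, 0, arg, NULL, NULL);
    spe_context_destroy(ctx);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_SPES];
    void *data_blocks[NUM_SPES] = { 0 };   /* per-SPE regions of Cell memory (not shown) */

    for (int i = 0; i < NUM_SPES; i++)     /* step (3): spawn 8 SPE threads */
        pthread_create(&threads[i], NULL, run_spe, data_blocks[i]);
    for (int i = 0; i < NUM_SPES; i++)     /* step (5a): wait for completion */
        pthread_join(threads[i], NULL);
    return 0;
}
```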