Published by Jason Rogers. Modified over 9 years ago.
1
System Architecture: Near, Medium, and Long-term
Scalable Architectures Panel Discussion Presentation
Sandia CSRI Workshop on Next-generation Scalable Applications: When MPI-only is not enough
June 4, 2008
Kevin Pedretti
Scalable System Software Dept.
Sandia National Laboratories
ktpedre@sandia.gov
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
2
Near Term
Odds are good, but goods are odd...
– Multi-core, many-core, mega-core
– Heterogeneous ISAs, cores, systems
– Accelerators: GPU, Cell, ClearSpeed, FPGA, etc.
– Embedded: Tilera, SPI, Ambric (336-core), Tensilica
Scalable Architectures
– Peak FLOPS not the bottleneck
– Improving per-socket efficiency on real applications is “low-hanging fruit”
– Decreasing memory size & bandwidth per core
– Symbiosis of architecture and system software
3
Near Term (Cont.)
Adapting MPI implementations for architecture
– Shared memory copies vs. NIC
– Cache pollution, injection
– Leverage hierarchy / intra-node locality
Adapting MPI applications for architecture
– MPI + shared memory: LIBSM
– MPI + something else for intra-node: OpenMP, Threading Building Blocks, ALF Streaming, CUDA, RapidMind, PeakStream/Google, etc.
– All incompatible, some similar concepts
Adapting architecture for MPI?
Leveraging interconnect capabilities for PGAS
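To illustrate "leverage hierarchy / intra-node locality", here is a minimal sketch that simulates a two-level sum reduction in plain Python. Ranks on the same node combine their values first (standing in for shared-memory copies), and only one leader per node sends over the network. The function names, the rank-to-node mapping, and the message counting are assumptions made for illustration; this is not how any particular MPI implementation works.

```python
from collections import defaultdict


def hierarchical_sum(values, node_of_rank):
    """Simulate a two-level sum reduction rooted at rank 0.

    values[r] is the contribution of rank r; node_of_rank[r] is the
    (hypothetical) id of the node hosting rank r.
    Returns (total, number_of_network_messages).
    """
    # Stage 1: combine contributions inside each node (no NIC traffic).
    per_node = defaultdict(int)
    for rank, value in enumerate(values):
        per_node[node_of_rank[rank]] += value

    # Stage 2: one leader per node sends its partial sum to the root's node.
    root_node = node_of_rank[0]
    network_messages = sum(1 for node in per_node if node != root_node)
    return sum(per_node.values()), network_messages


def flat_sum(values, node_of_rank):
    """Baseline: every rank sends its value directly to rank 0."""
    root_node = node_of_rank[0]
    network_messages = sum(
        1 for rank in range(1, len(values)) if node_of_rank[rank] != root_node
    )
    return sum(values), network_messages


# Eight ranks spread over two quad-core nodes (assumed layout).
values = [1, 2, 3, 4, 5, 6, 7, 8]
node_of_rank = [0, 0, 0, 0, 1, 1, 1, 1]
print(hierarchical_sum(values, node_of_rank))
print(flat_sum(values, node_of_rank))
```

Both variants produce the same total. In this simplified model the hierarchical one sends one network message per remote node rather than one per remote rank.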
4
OS Scalability
At 8192 nodes, CNL (2.0.44) is 49% worse than Catamount on this Partisn problem.
Doesn’t appear to be a bandwidth issue.
5
Task and Memory Placement
No standard mechanisms; most punt and hope for the best
Explicit vs. implicit mechanisms
More important than node placement?
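As one example of an explicit mechanism, the sketch below pins the calling process to a single CPU using Python's `os.sched_getaffinity` and `os.sched_setaffinity`, which wrap the Linux affinity system calls. The helper name `pin_to_one_cpu` and the choice of the lowest-numbered allowed CPU are assumptions for illustration. The sketch covers task placement only and does not address memory placement.

```python
import os


def pin_to_one_cpu():
    """Pin the calling process to one CPU it is already allowed to use.

    Returns the chosen CPU id, or None on platforms where the
    Linux-only affinity calls are unavailable.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # no explicit mechanism exposed on this platform
    allowed = os.sched_getaffinity(0)  # pid 0 means "the calling process"
    cpu = min(allowed)                 # arbitrary choice for this sketch
    os.sched_setaffinity(0, {cpu})     # explicit placement request
    return cpu


chosen = pin_to_one_cpu()
print("pinned to CPU:", chosen)
```

Without such a call the process may run on any CPU in its allowed set, which is the implicit "hope for the best" case.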
6
Intra-node MPI
7
Virtual Memory Nice, but Gets in Way
Dashed Line = Small pages
Solid Line = Large pages
(Dual-core Opteron)
Open Shapes = Existing Logarithmic Algorithm (Gibson/Bruck)
Solid Shapes = New Constant-Time Algorithm (Slepoy, Thompson, Plimpton)
TLB misses increased with large pages, but time to service a miss decreased dramatically (10x).
Page table fits in L1! (vs. 2MB per GB with small pages)
Unexpected Behavior Due to TLB
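The "2MB per GB with small pages" figure can be reproduced with a short calculation. The sketch below assumes x86-64 conventions that the slide does not state: 4 KiB small pages, 2 MiB large pages, and 8-byte page-table entries. It counts only the last-level entries, and the helper name `leaf_table_bytes` is hypothetical.

```python
GIB = 1 << 30
PTE_BYTES = 8  # assumed: 8-byte page-table entries, as on x86-64


def leaf_table_bytes(mapped_bytes, page_bytes, entry_bytes=PTE_BYTES):
    """Bytes of last-level page-table entries needed to map mapped_bytes."""
    entries = -(-mapped_bytes // page_bytes)  # ceiling division
    return entries * entry_bytes


small = leaf_table_bytes(GIB, 4 * 1024)         # 4 KiB pages
large = leaf_table_bytes(GIB, 2 * 1024 * 1024)  # 2 MiB pages
print(small // 1024, "KiB vs", large // 1024, "KiB")
```

Under these assumptions, mapping 1 GiB takes 262,144 entries (2 MiB) with small pages but only 512 entries (4 KiB) with large pages. A 4 KiB table is small enough to stay resident in an L1 data cache, which is consistent with the cheaper miss handling reported on the slide.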
8
So, Answer is Large Pages?
DRAM bank conflicts can be considerable depending on data alignment
OS-level and hardware mitigation strategies
9
Affects SpMV Also (28 Node HPCCG Run)
10
Medium Term
More accelerators, normalization
– Attractive power and memory efficiency
– Commodity processors will integrate GPUs on-chip
– HPC-centric off-chip accelerators
General-purpose cores not getting much faster
Leverage architecture for specific app domains
– Some common mechanism will/must emerge for dealing with data-parallel accelerators
General-purpose cores become more light-weight, better match for light-weight system software
– Chip stacking
– Off-chip optics
11
Long Term
MPP-on-a-chip
On and off-chip optics
More intelligent memory systems
Application-driven architectures