System Architecture: Near, Medium, and Long-term Scalable Architectures Panel Discussion Presentation Sandia CSRI Workshop on Next-generation Scalable.

System Architecture: Near, Medium, and Long-term
Scalable Architectures Panel Discussion Presentation
Sandia CSRI Workshop on Next-generation Scalable Applications: When MPI-only is not enough
June 4, 2008
Kevin Pedretti
Scalable System Software Dept.
Sandia National Laboratories
ktpedre@sandia.gov
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Near Term
Odds are good, but goods are odd...
– Multi-core, many-core, mega-core
– Heterogeneous ISAs, cores, systems
– Accelerators: GPU, Cell, ClearSpeed, FPGA, etc.
– Embedded: Tilera, SPI, Ambric (336-core), Tensilica
Scalable Architectures
– Peak FLOPS not bottleneck
– Improving per-socket efficiency on real applications is “low-hanging fruit”
– Decreasing memory size & bandwidth per core
– Symbiosis of architecture and system software

Near Term (Cont.)
Adapting MPI implementations for architecture
– Shared memory copies vs. NIC
– Cache pollution, injection
– Leverage hierarchy / intra-node locality
Adapting MPI applications for architecture
– MPI + shared memory: LIBSM
– MPI + something else for intra-node: OpenMP, Threading Building Blocks, ALF Streaming, CUDA, RapidMind, PeakStream/Google, etc.
– All incompatible, some similar concepts
Adapting architecture for MPI?
Leveraging interconnect capabilities for PGAS
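The "shared memory copies vs. NIC" and "leverage hierarchy / intra-node locality" bullets come down to a per-message transport decision. A minimal sketch of that decision follows. It assumes a block rank-to-node layout, and the function names (`node_of`, `choose_path`, `intra_node_peers`) are hypothetical illustrations, not part of any MPI implementation.

```python
# Hypothetical sketch: choose a transport for a message based on locality.
# Assumes a block rank-to-node layout; a real MPI library would discover
# locality at startup rather than derive it from the rank number.

def node_of(rank, ranks_per_node):
    """Return the node index hosting `rank` under a block layout."""
    return rank // ranks_per_node


def choose_path(src, dst, ranks_per_node):
    """Pick 'self', 'shared-memory', or 'nic' for a src -> dst message."""
    if src == dst:
        return "self"
    if node_of(src, ranks_per_node) == node_of(dst, ranks_per_node):
        return "shared-memory"
    return "nic"


def intra_node_peers(rank, n_ranks, ranks_per_node):
    """List the other ranks that share a node with `rank`."""
    home = node_of(rank, ranks_per_node)
    return [r for r in range(n_ranks)
            if r != rank and node_of(r, ranks_per_node) == home]


if __name__ == "__main__":
    # 8 ranks, 4 per node: ranks 0-3 on node 0, ranks 4-7 on node 1.
    print(choose_path(0, 1, 4))
    print(choose_path(3, 4, 4))
    print(intra_node_peers(5, 8, 4))
```

With more cores per node, a growing share of each rank's peers falls into the shared-memory case, which is why the copy strategy and its cache effects matter.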

OS Scalability
At 8192 nodes, CNL (2.0.44) is 49% worse than Catamount on this Partisn problem. Doesn’t appear to be a bandwidth issue.

Task and Memory Placement
– No standard mechanisms, most punt and hope for best
– Explicit vs. implicit mechanisms
– More important than node placement?
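To make "explicit mechanisms" concrete, here is a small sketch of two explicit task-placement policies, block and cyclic, over the sockets of one node. The function names and the (socket, core) encoding are assumptions made for illustration. `os.sched_setaffinity` is a real Linux-only Python call, shown only as one way such a plan might be applied.

```python
import os

# Hypothetical sketch of explicit task placement within a node.
# Each task is assigned a (socket, core_within_socket) pair.

def block_placement(n_tasks, n_sockets, cores_per_socket):
    """Fill socket 0 completely before moving on to socket 1, and so on."""
    if n_tasks > n_sockets * cores_per_socket:
        raise ValueError("more tasks than cores")
    return [(t // cores_per_socket, t % cores_per_socket)
            for t in range(n_tasks)]


def cyclic_placement(n_tasks, n_sockets, cores_per_socket):
    """Deal tasks round-robin across sockets to spread memory traffic."""
    if n_tasks > n_sockets * cores_per_socket:
        raise ValueError("more tasks than cores")
    return [(t % n_sockets, t // n_sockets) for t in range(n_tasks)]


def pin_current_process(socket, core, cores_per_socket):
    """Pin the calling process to one core (Linux only; no-op elsewhere).

    Assumes cores are numbered socket by socket, which is not guaranteed
    on real systems.
    """
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {socket * cores_per_socket + core})


if __name__ == "__main__":
    print(block_placement(4, 2, 4))
    print(cyclic_placement(4, 2, 4))
```

Block placement keeps neighboring tasks on one socket, while cyclic placement gives each task a larger share of per-socket memory bandwidth. Which is better depends on the application, which is one reason an implicit default can disappoint.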

Intra-node MPI

Virtual Memory Nice, but Gets in Way
[Figure: Unexpected Behavior Due to TLB]
Dashed Line = Small pages
Solid Line = Large pages (Dual-core Opteron)
Open Shapes = Existing Logarithmic Algorithm (Gibson/Bruck)
Solid Shapes = New Constant-Time Algorithm (Slepoy, Thompson, Plimpton)
TLB misses increased with large pages, but time to service miss decreased dramatically (10x).
Page table fits in L1! (vs. 2 MB per GB with small pages)
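The "2 MB per GB" figure can be checked with a short calculation. The sketch below assumes 8-byte page table entries, 4 KiB small pages, and 2 MiB large pages (typical x86-64 values, not stated on the slide), and it counts only the lowest-level entries that map the data.

```python
# Worked arithmetic for page table size per GiB of mapped memory.
# Assumptions (not from the slide): 8-byte entries, 4 KiB small pages,
# 2 MiB large pages; upper levels of the page table are ignored.

PTE_BYTES = 8
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def leaf_page_table_bytes(mapped_bytes, page_bytes, pte_bytes=PTE_BYTES):
    """Bytes of lowest-level page table entries needed to map a region."""
    pages = -(-mapped_bytes // page_bytes)  # ceiling division
    return pages * pte_bytes


small = leaf_page_table_bytes(GIB, 4 * KIB)
large = leaf_page_table_bytes(GIB, 2 * MIB)

if __name__ == "__main__":
    print(small // MIB, "MiB of entries per GiB with small pages")
    print(large // KIB, "KiB of entries per GiB with large pages")
```

Under these assumptions, small pages need 2 MiB of entries per GiB, while large pages need 4 KiB per GiB, small enough to stay resident in an L1 data cache. That is consistent with the slide's observation that a miss becomes much cheaper to service even when misses are more frequent.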

So, Answer is Large Pages?
– DRAM bank conflicts can be considerable depending on data alignment
– OS-level and hardware mitigation strategies
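The alignment effect can be illustrated with a toy address-to-bank mapping. Everything in this sketch is an assumption for illustration: 8 banks, 8 KiB rows, and consecutive rows interleaved across banks. Real memory controllers use different, often hashed, mappings. The sketch streams two arrays in lockstep and counts how often both accesses hit the same bank but different rows.

```python
# Toy model of DRAM bank conflicts (hypothetical mapping, see lead-in).

ROW_BYTES = 8 * 1024   # assumed bytes per DRAM row
N_BANKS = 8            # assumed number of banks


def bank_and_row(addr):
    """Map a physical address to (bank, row) under the toy mapping."""
    row_index = addr // ROW_BYTES
    return row_index % N_BANKS, row_index // N_BANKS


def conflict_fraction(base_a, base_b, n_elems, elem_bytes=8):
    """Fraction of lockstep accesses a[i], b[i] that hit the same bank
    but a different row, which forces a row to be closed and reopened."""
    conflicts = 0
    for i in range(n_elems):
        bank_a, row_a = bank_and_row(base_a + i * elem_bytes)
        bank_b, row_b = bank_and_row(base_b + i * elem_bytes)
        if bank_a == bank_b and row_a != row_b:
            conflicts += 1
    return conflicts / n_elems


LARGE_PAGE = 2 * 1024 * 1024

# Both arrays start on a large-page boundary: every access pair conflicts.
aligned = conflict_fraction(0, LARGE_PAGE, 4096)
# Padding the second array by one row shifts it to a neighboring bank.
padded = conflict_fraction(0, LARGE_PAGE + ROW_BYTES, 4096)

if __name__ == "__main__":
    print(aligned, padded)
```

Large pages are physically contiguous, so arrays that start at the same offset within different large pages can line up on the same banks. With small pages, scattered physical placement tends to break up this pattern. Padding or offsetting allocations is one example of an OS-level or library-level mitigation.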

Affects SpMV Also (28-Node HPCCG Run)
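For readers unfamiliar with SpMV (sparse matrix-vector multiply), the sketch below shows the kernel in compressed sparse row (CSR) form. It is a generic illustration, not HPCCG's actual code, and the function name `csr_spmv` is an assumption. The indirect read `x[col_idx[k]]` is the memory-bound access pattern that makes the kernel sensitive to TLB and DRAM behavior.

```python
# Generic CSR sparse matrix-vector multiply, y = A * x (illustrative only).

def csr_spmv(row_ptr, col_idx, values, x):
    """Multiply a CSR matrix by a dense vector.

    row_ptr[r] .. row_ptr[r + 1] delimits the nonzeros of row r;
    col_idx and values hold their column indices and values.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]  # indirect, irregular read of x
        y[row] = acc
    return y


# 3x3 tridiagonal example: [[4, -1, 0], [-1, 4, -1], [0, -1, 4]]
row_ptr = [0, 2, 5, 7]
col_idx = [0, 1, 0, 1, 2, 1, 2]
values = [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0]
y = csr_spmv(row_ptr, col_idx, values, [1.0, 2.0, 3.0])

if __name__ == "__main__":
    print(y)
```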

Medium Term
More accelerators, normalization
– Attractive power and memory efficiency
– Commodity processors will integrate GPUs on-chip
– HPC-centric off-chip accelerators
General-purpose cores not getting much faster
Leverage architecture for specific app domains
– Some common mechanism will/must emerge for dealing with data-parallel accelerators
General-purpose cores become more light-weight, better match for light-weight system software
– Chip stacking
– Off-chip optics

Long Term
– MPP-on-a-chip
– On and off-chip optics
– More intelligent memory systems
– Application driven architectures
