LLNL-PRES-653431 This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344.

Lawrence Livermore National Security, LLC. Salishan, 4/24/2014

Slide 2
- Tuning large, complex applications for each hardware generation is impractical
[Figure labels: Performance, Productivity, Code Base Size]
Solutions must be general, adaptable to the future, and maintainable

Slide 3: What drives these designs?
- Handheld mobile device weight and battery life
- Exascale power goals
- Power cost
Lower power reduces performance and reliability
[Figure: Power vs. frequency for Intel Ivy Bridge]

Slide 4: Chips operating near threshold voltage encounter
- More transient errors
- More hard errors
Checkpoint restart is our current reliability mechanism

Slide 5: Complex power saving features
- SIMD and SIMT
- Multi-level memory systems
- Heterogeneous systems
[Diagram labels: In-Package Memory, Memory, Processing, Multi-Core CPU, GPU, NVRAM]
Exploiting these features is difficult

Slide 6
- No production GPU or Xeon Phi code
  - GPU and Xeon Phi optimizations are different
- No production codes explicitly manage on-node data motion
- Less than 10% of our FLOPs use SIMD units, even with the best compilers
  - Architecture-dependent data layouts may hinder the compiler
Mechanisms are needed to isolate architecture-specific code
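One way to read the point about isolating architecture-dependent data layouts is sketched below: the loop body goes through accessors, so switching between an array-of-structures and a structure-of-arrays layout does not touch the application loop. The names `CoordsAoS`, `CoordsSoA`, and `translate` are hypothetical and chosen for illustration; this is not taken from any LLNL code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-of-structures layout: x, y, z of one node are adjacent in memory.
struct NodeAoS { double x, y, z; };

struct CoordsAoS {
    std::vector<NodeAoS> nodes;
    explicit CoordsAoS(std::size_t n) : nodes(n) {}
    std::size_t size() const { return nodes.size(); }
    double& x(std::size_t i) { assert(i < nodes.size()); return nodes[i].x; }
    double& y(std::size_t i) { assert(i < nodes.size()); return nodes[i].y; }
    double& z(std::size_t i) { assert(i < nodes.size()); return nodes[i].z; }
};

// Structure-of-arrays layout: each field is contiguous, which is often
// easier for a compiler to vectorize with unit-stride loads.
struct CoordsSoA {
    std::vector<double> xs, ys, zs;
    explicit CoordsSoA(std::size_t n) : xs(n), ys(n), zs(n) {}
    std::size_t size() const { return xs.size(); }
    double& x(std::size_t i) { assert(i < xs.size()); return xs[i]; }
    double& y(std::size_t i) { assert(i < ys.size()); return ys[i]; }
    double& z(std::size_t i) { assert(i < zs.size()); return zs[i]; }
};

// Application loop written once against the accessor interface; the
// layout choice is isolated in the Coords type.
template <typename Coords>
void translate(Coords& c, double dx, double dy, double dz) {
    for (std::size_t i = 0; i < c.size(); ++i) {
        c.x(i) += dx;
        c.y(i) += dy;
        c.z(i) += dz;
    }
}
```

Calling `translate` on a `CoordsAoS` or a `CoordsSoA` gives the same values; only the memory layout differs, so the choice can be made per architecture.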

Slide 7
- We add directives to existing codes where portable
- Multi-level memory handled by OS, runtime or used as a cache
- We continue to get little SIMD and probably a bit better SIMT parallelism
Overall performance improvement is incremental at best

Slide 8
- Are our algorithms well suited for future machines?
- Can we rewrite our data structures to match future machines?
We will address these questions in the next few slides

Slide 9
- Loop fusion
  - Make each operator a single sweep over a mesh
- Data structure reorganization
- Reduce mallocs or use better libraries
[Figure: LULESH, BG/Q]
However, better implementations only get us 2-3x
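The loop fusion bullet can be illustrated with a small sketch. The arrays `p`, `q`, `e` and the update formula are made up for illustration and are not taken from LULESH; the point is only that the fused version makes one sweep and keeps the intermediate in a register instead of writing it to memory and reading it back.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused: two sweeps over the mesh. `tmp` is stored by the first loop
// and reloaded by the second, costing extra memory traffic.
void update_unfused(const std::vector<double>& p, const std::vector<double>& q,
                    std::vector<double>& tmp, std::vector<double>& e) {
    const std::size_t n = e.size();
    assert(p.size() == n && q.size() == n && tmp.size() == n);
    for (std::size_t i = 0; i < n; ++i) tmp[i] = p[i] + q[i];
    for (std::size_t i = 0; i < n; ++i) e[i] -= 0.5 * tmp[i];
}

// Fused: a single sweep. The intermediate is a local scalar, so the
// temporary array and its loads and stores disappear.
void update_fused(const std::vector<double>& p, const std::vector<double>& q,
                  std::vector<double>& e) {
    const std::size_t n = e.size();
    assert(p.size() == n && q.size() == n);
    for (std::size_t i = 0; i < n; ++i) {
        const double tmp = p[i] + q[i];
        e[i] -= 0.5 * tmp;
    }
}
```

Both functions produce the same `e`; the fused form also removes the allocation of `tmp`, which ties in with the "reduce mallocs" bullet.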

Slide 10
- Throughput-optimized processors execute serial sections slowly
- Design codes with limited serial sections
- Better runtime support is needed to reduce serial overhead
  - OpenMP
  - Malloc
  - Libraries
Use latency-optimized processor for what remains
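The cost of slow serial sections can be estimated with an Amdahl-style model. The sketch below is an assumption for illustration, not a model from the talk: run time on a latency-optimized core is normalized to 1, the serial part runs `serial_slowdown` times slower on a throughput-optimized core, and the parallel part runs `parallel_speedup` times faster. All function and parameter names are hypothetical.

```cpp
#include <cassert>

// Run time relative to 1.0 on a single latency-optimized core.
// serial_fraction:  share of the original run time that is serial (0..1)
// serial_slowdown:  how much slower the serial part runs on the target core
// parallel_speedup: how much faster the parallel part runs on the target
double relative_time(double serial_fraction, double serial_slowdown,
                     double parallel_speedup) {
    assert(serial_fraction >= 0.0 && serial_fraction <= 1.0);
    assert(serial_slowdown > 0.0 && parallel_speedup > 0.0);
    return serial_fraction * serial_slowdown +
           (1.0 - serial_fraction) / parallel_speedup;
}

// Overall speedup is the inverse of the relative run time.
double overall_speedup(double serial_fraction, double serial_slowdown,
                       double parallel_speedup) {
    return 1.0 / relative_time(serial_fraction, serial_slowdown,
                               parallel_speedup);
}
```

With illustrative inputs of a 5% serial fraction and a 20x parallel speedup, running the serial part 4x slower gives an overall speedup of roughly 4x, while running it at full speed on a latency-optimized core gives roughly 10x.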

Slide 11
- More parallelism exists in current algorithms than we exploit today
- Code changes are required to express parallelism more clearly
- SIMT or SIMD with HW gather/scatter are easier to exploit
[Figure: LULESH, Sandy Bridge]
Bandwidth constraints will eventually limit us

Slide 12
Many of today’s apps need 0.5-2 bytes for every FLOP performed.
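To make the bytes-per-FLOP figure concrete, the sketch below computes the ratio for a kernel and the bandwidth-limited bound it implies. The function names and the machine numbers in the usage note are hypothetical; the bound is the usual roofline-style estimate, swapped in here for illustration rather than taken from the slides.

```cpp
#include <algorithm>
#include <cassert>

// Bytes moved to or from memory per floating-point operation.
double bytes_per_flop(double bytes_moved, double flops) {
    assert(flops > 0.0);
    return bytes_moved / flops;
}

// Attainable FLOP rate: the lower of the machine peak and what memory
// bandwidth can feed at the application's bytes-per-FLOP requirement.
// Use consistent units, e.g. GFLOP/s and GB/s.
double attainable_flops(double peak_flops, double bandwidth_bytes,
                        double app_bytes_per_flop) {
    assert(app_bytes_per_flop > 0.0);
    return std::min(peak_flops, bandwidth_bytes / app_bytes_per_flop);
}
```

For example, on a made-up node with a 1000 GFLOP/s peak and 100 GB/s of memory bandwidth, `attainable_flops(1000.0, 100.0, 0.5)` gives 200 and `attainable_flops(1000.0, 100.0, 2.0)` gives 50, so an application in the 0.5-2 bytes per FLOP range would be bandwidth bound there. As a reference point, `y[i] = a * x[i] + y[i]` on 8-byte doubles with no cache reuse moves 24 bytes for 2 FLOPs, so `bytes_per_flop(24.0, 2.0)` is 12.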

Slide 13
Excess FLOPs

Slide 14
- More FLOPs per byte
- Small dense operations
- More accurate
- Potentially more robust and better symmetry preservation
[Figure: B to F (bytes per FLOP) requirement vs. algorithmic order]

Slide 15
- How do you use the FLOPs efficiently?
- What does high-order accuracy mean when there is a shock?
- Can you couple all the physics we need at high order?
We are working to answer these, but whether we use new algorithms or our current ones, there is a pervasive challenge…


Slide 17
Mechanisms are needed to isolate non-portable optimizations

Slide 18
- RAJA, Kokkos and Thrust allow portable abstractions in today’s codes
[Figure labels: Charm++, Liszt]
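The general idea behind such portable abstractions can be sketched as a loop template that takes an execution policy, so the loop body is written once and the non-portable part lives in the policy overloads. The sketch below is a hypothetical, simplified illustration: `seq_policy`, `omp_policy`, `forall`, and `scale` are names invented here and are not the actual RAJA, Kokkos, or Thrust API.

```cpp
#include <cassert>
#include <vector>

// Hypothetical policy tags (not the real RAJA/Kokkos/Thrust types).
struct seq_policy {};
struct omp_policy {};

// Sequential execution of the loop body over [begin, end).
template <typename Body>
void forall(seq_policy, int begin, int end, Body body) {
    assert(begin <= end);
    for (int i = begin; i < end; ++i) body(i);
}

// OpenMP execution when the compiler enables it; otherwise this falls
// back to a plain sequential loop. Iterations must be independent.
template <typename Body>
void forall(omp_policy, int begin, int end, Body body) {
    assert(begin <= end);
#if defined(_OPENMP)
#pragma omp parallel for
#endif
    for (int i = begin; i < end; ++i) body(i);
}

// Application code: the body is written once, and the policy argument
// selects how it runs on a given architecture.
template <typename Policy>
void scale(Policy policy, std::vector<double>& v, double a) {
    forall(policy, 0, static_cast<int>(v.size()), [&](int i) { v[i] *= a; });
}
```

Calling `scale(seq_policy{}, v, 2.0)` or `scale(omp_policy{}, v, 2.0)` produces the same result; a GPU policy would be a further overload in the same style, leaving the application loop unchanged.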

Slide 19
[Diagram labels: Algorithms, Programming Models, Architectures, RAJA, Today’s, High Order]
