LLNL-PRES-653431 This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344.

Presentation transcript:

1 LLNL-PRES-653431 This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC Salishan 4/24/2014

2
• Tuning large complex applications for each hardware generation is impractical
[Figure labels: Performance, Productivity, Code Base Size]
Solutions must be general, adaptable to the future, and maintainable

3 What drives these designs?
• Handheld mobile device weight and battery life
• Exascale power goals
• Power cost
Lower power reduces performance and reliability
[Figure: Power vs. frequency for Intel Ivy Bridge]

4 Chips operating near threshold voltage encounter
• More transient errors
• More hard errors
Checkpoint restart is our current reliability mechanism

5 Complex power saving features
• SIMD and SIMT
• Multi-level memory systems
• Heterogeneous systems
[Diagram of a heterogeneous node; labels include Multi-Core CPU, GPU, In-Package Memory, Memory, Processing, NVRAM]
Exploiting these features is difficult

6
• No production GPU or Xeon Phi code
  - GPU and Xeon Phi optimizations are different
• No production codes explicitly manage on-node data motion
• Less than 10% of our FLOPs use SIMD units, even with the best compilers
  - Architecture-dependent data layouts may hinder the compiler
Mechanisms are needed to isolate architecture-specific code

7
• We add directives to existing codes where portable
• Multi-level memory is handled by the OS or runtime, or used as a cache
• We continue to get little SIMD parallelism and probably slightly better SIMT parallelism
Overall performance improvement is incremental at best

8
• Are our algorithms well suited for future machines?
• Can we rewrite our data structures to match future machines?
We will address these questions in the next few slides

9
• Loop fusion
  - Make each operator a single sweep over a mesh
• Data structure reorganization
• Reduce mallocs or use better libraries
[Figure: LULESH on BG/Q]
However, better implementations only get us 2-3x
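
The loop fusion bullet above can be illustrated with a minimal C++ sketch. The field names (`pressure`, `volume`, `energy`) and the arithmetic are hypothetical and are not taken from LULESH; the point is only that the fused version makes one sweep over the mesh and keeps the intermediate value in a register instead of writing and re-reading a temporary array.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused: two sweeps over the mesh, with a temporary array written by the
// first loop and read back by the second. (Hypothetical operator.)
std::vector<double> update_unfused(const std::vector<double>& pressure,
                                   const std::vector<double>& volume) {
    assert(pressure.size() == volume.size());
    const std::size_t n = pressure.size();
    std::vector<double> work(n);
    std::vector<double> energy(n);
    for (std::size_t i = 0; i < n; ++i) {
        work[i] = pressure[i] * volume[i];
    }
    for (std::size_t i = 0; i < n; ++i) {
        energy[i] = 0.5 * work[i] + 1.0;
    }
    return energy;
}

// Fused: a single sweep; the intermediate value never touches memory.
std::vector<double> update_fused(const std::vector<double>& pressure,
                                 const std::vector<double>& volume) {
    assert(pressure.size() == volume.size());
    const std::size_t n = pressure.size();
    std::vector<double> energy(n);
    for (std::size_t i = 0; i < n; ++i) {
        const double work = pressure[i] * volume[i];
        energy[i] = 0.5 * work + 1.0;
    }
    return energy;
}
```

Both functions compute the same result; the fused form simply moves less data per element and allocates one array fewer, which also touches the "reduce mallocs" bullet.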

10
• Throughput-optimized processors execute serial sections slowly
• Design codes with limited serial sections
• Better runtime support is needed to reduce serial overhead
  - OpenMP, Malloc, Libraries
Use a latency-optimized processor for what remains
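
Why serial sections matter so much on throughput-optimized hardware can be seen with an Amdahl's-law style estimate. The sketch below is an illustration only: `serial_slowdown` is an assumed parameter (how much slower a throughput core runs serial code than a latency-optimized core), and none of the numbers come from the slides.

```cpp
#include <cassert>

// Speedup relative to one latency-optimized core, using Amdahl's law with an
// extra (assumed) penalty for running the serial part on a slow core.
//   serial_fraction : fraction of the original run time that is serial, in [0, 1]
//   parallel_width  : how many times faster the parallel part runs
//   serial_slowdown : how many times slower the serial part runs (1.0 = no penalty)
double relative_speedup(double serial_fraction,
                        double parallel_width,
                        double serial_slowdown) {
    assert(serial_fraction >= 0.0 && serial_fraction <= 1.0);
    assert(parallel_width >= 1.0 && serial_slowdown >= 1.0);
    const double serial_time = serial_fraction * serial_slowdown;
    const double parallel_time = (1.0 - serial_fraction) / parallel_width;
    return 1.0 / (serial_time + parallel_time);
}
```

For example, with a 5% serial fraction and a parallel width of 64, `relative_speedup(0.05, 64.0, 1.0)` is about 15.4, while a 4x serial penalty, `relative_speedup(0.05, 64.0, 4.0)`, drops it to about 4.7. That gap is the argument for keeping a latency-optimized processor available for what remains serial.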

11
• More parallelism exists in current algorithms than we exploit today
• Code changes are required to express parallelism more clearly
• SIMT or SIMD with HW Gather/Scatter are easier to exploit
[Figure: LULESH on Sandy Bridge]
Bandwidth constraints will eventually limit us

12 Many of today’s apps need 0.5-2 bytes for every FLOP performed.
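
To make the bytes-per-FLOP figure concrete, here is a small sketch of the usual bandwidth-bound estimate: sustained FLOP rate is capped by memory bandwidth divided by the application's bytes-per-FLOP requirement. The function names and the machine numbers in the usage note are assumptions for illustration, not values from the talk.

```cpp
#include <algorithm>
#include <cassert>

// Bytes moved to or from memory per floating-point operation.
double bytes_per_flop(double bytes_moved, double flops) {
    assert(flops > 0.0);
    return bytes_moved / flops;
}

// Upper bound on sustained GFLOP/s: limited either by the peak arithmetic
// rate or by how fast memory can feed the required bytes.
double attainable_gflops(double peak_gflops,
                         double bandwidth_gb_per_s,
                         double app_bytes_per_flop) {
    assert(app_bytes_per_flop > 0.0);
    return std::min(peak_gflops, bandwidth_gb_per_s / app_bytes_per_flop);
}
```

On a hypothetical node with 1000 GFLOP/s peak and 100 GB/s of memory bandwidth, an application that needs 1 byte per FLOP is capped at `attainable_gflops(1000.0, 100.0, 1.0)`, that is 100 GFLOP/s or 10% of peak. The remaining arithmetic capacity is unused unless the algorithm does more FLOPs per byte.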

13 Excess FLOPs

14
• More FLOPs per byte
• Small dense operations
• More accurate
• Potentially more robust and better symmetry preservation
[Figure: B to F Requirement vs. Algorithmic Order]

15
• How do you use the FLOPs efficiently?
• What does high-order accuracy mean when there is a shock?
• Can you couple all the physics we need at high order?
We are working to answer these, but whether we use new algorithms or our current ones there is a pervasive challenge…

16 [No text transcribed for this slide]

17 Mechanisms are needed to isolate non-portable optimizations

18
• RAJA, Kokkos and Thrust allow portable abstractions in today’s codes
[Also listed on the slide: Charm++, Liszt]
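
The idea behind such abstractions can be sketched in a few lines of C++. This is not the RAJA, Kokkos or Thrust API; `seq_exec`, `chunked_exec`, `forall` and `scale` are hypothetical names chosen for illustration. The loop body is written once, and the traversal strategy (the part that tends to be architecture specific) is selected by a policy type.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical execution policies. A real library would map these to
// OpenMP, CUDA, SIMD and so on; here both run on the host.
struct seq_exec {};
template <std::size_t Chunk>
struct chunked_exec {};

// Plain sequential traversal.
template <typename Body>
void forall(seq_exec, std::size_t begin, std::size_t end, Body&& body) {
    for (std::size_t i = begin; i < end; ++i) {
        body(i);
    }
}

// Chunked (tiled) traversal: a stand-in for an architecture-specific schedule.
template <std::size_t Chunk, typename Body>
void forall(chunked_exec<Chunk>, std::size_t begin, std::size_t end, Body&& body) {
    static_assert(Chunk > 0, "chunk size must be positive");
    for (std::size_t base = begin; base < end; base += Chunk) {
        const std::size_t stop = (end - base < Chunk) ? end : base + Chunk;
        for (std::size_t i = base; i < stop; ++i) {
            body(i);
        }
    }
}

// Application code: the loop body is independent of the policy.
template <typename Policy>
std::vector<double> scale(const std::vector<double>& x, double a) {
    std::vector<double> y(x.size());
    forall(Policy{}, std::size_t{0}, x.size(),
           [&](std::size_t i) { y[i] = a * x[i]; });
    return y;
}
```

Calling `scale<seq_exec>(x, 2.0)` and `scale<chunked_exec<4>>(x, 2.0)` gives the same result; only the policy type changes, so a non-portable optimization stays confined to one `forall` overload rather than spreading through the application.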

19 [Diagram labels: Algorithms, Programming Models, Architectures, RAJA, Today’s, High Order]

