0 Arun Rodrigues, Scott Hemmert, Dave Resnick: Sandia National Lab (ABQ) Keren Bergman: Columbia University Bruce Jacob: U. Maryland John Shalf, Paul Hargrove: Lawrence Berkeley National Laboratory Gilbert Hendry: Sandia National Laboratory Dan Quinlan, Chunhua Liao: Lawrence Livermore National Lab Sudhakar Yalamanchili: Georgia Tech Data Movement Dominates (DMD) and CoDEx: CoDesign for Exascale
Codesign Tools Recap Architectural Simulation to Accelerate CoDesign SST System level models ACE Node level emulation ROSE Application Analysis ROSE Compiler: Enables deep analysis of application requirements, semi-automatic generation of skeleton applications, and code generation for ACE and SST. ACE Node Emulation: Rapid design synthesis and FPGA-accelerated emulation for rapid prototyping cycle accurate models of manycore node designs. SST Macro System Simulation: Enables system- scale simulation through capture of application communication traces and simulation of large- scale interconnects. SST Micro Software Simulators: Software simulation for node-level simulation
Codesign Tools Recap Architectural Simulation to Accelerate CoDesign SST System level models ACE Node level emulation ROSE Application Analysis ROSE Compiler: Enables deep analysis of application requirements, semi-automatic generation of skeleton applications, and code generation for ACE and SST. ACE Node Emulation: Rapid design synthesis and FPGA-accelerated emulation for rapid prototyping cycle accurate models of manycore node designs. SST Macro System Simulation: Enables system- scale simulation through capture of application communication traces and simulation of large- scale interconnects. SST Micro Software Simulators: Software simulation for node-level simulation CoDEx: CoDesign For Exascale ASCR-funded Simulation Infrastructure Project CoDEx: CoDesign For Exascale ASCR-funded Simulation Infrastructure Project SST: Structure Simulation Toolkit NNSA-funded Simulation Tools (ASC Program) SST: Structure Simulation Toolkit NNSA-funded Simulation Tools (ASC Program)
Codesign Tools Recap Architectural Simulation to Accelerate CoDesign SST System level models ACE Node level emulation ROSE Application Analysis ROSE Compiler: Enables deep analysis of application requirements, semi-automatic generation of skeleton applications, and code generation for ACE and SST. ACE Node Emulation: Rapid design synthesis and FPGA-accelerated emulation for rapid prototyping cycle accurate models of manycore node designs. SST Macro System Simulation: Enables system- scale simulation through capture of application communication traces and simulation of large- scale interconnects. SST Micro Software Simulators: Software simulation for node-level simulation CoDEx: CoDesign For Exascale ASCR-funded Simulation Infrastructure Project CoDEx: CoDesign For Exascale ASCR-funded Simulation Infrastructure Project SST: Structure Simulation Toolkit NNSA-funded Simulation Tools (ASC Program) SST: Structure Simulation Toolkit NNSA-funded Simulation Tools (ASC Program) CAL: (Sandia/LBL) Computer Architecture Laboratory CAL: (Sandia/LBL) Computer Architecture Laboratory
Fidelity vs. Scope for Architectural Simulation Methods 4
ROSE Compiler Full Program Understanding through Deep Source-Code Analysis 5
Can automatically predict performance for many input codes and software optimizations Predict performance under different architectural scenarios Much faster than hardware simulation and manual modeling ExaSAT: Exascale Static Analysis Tool Compiler-Automated Performance Model Extraction 6 Combustion Codes Compiler Analysis Performance Prediction Spreadsheet Dependency Graph Optimization User Parameter s User Parameter s Performanc e Model Machine Parameter s
SST/macro: Coarse-Grained Simulation 7 An application code with minor modifications SST/Macro Impl. of interfaces (MPI), which simulate execution and communication
SST/micro: Cycle-Accurate Framework Has a general simulation framework for integrating models Simulation backend is parallel Plenty of people involved 8
Some Models Currently Integrated 9 Gem5 is a well-known architectural simulator with models for processors, caches, busses, and network components. MacSim provides a model of GPU/CPU cores or geterogenous computing nodes, which can be driven from x86 or PTX (CUDA) traces. IRIS provides a pipelined, cycle- accurate router model capable of modeling a variety of Network-on- Chip (NoC) and inter- node interconnection architectures. PhoenixSim models photonic networks.
Leveraging Embedded Design Automation For Design Space Exploration This stuff is essential!
Embedded Design Automation (Using FPGA emulation to do rapid prototyping) RAMP FPGA-accelerated Emulation of ASIC Or “tape out” To FPGA
Data Movement Dominates (Sandia, Micron, Columbia, LBL) Understand the Potential of Intelligent, Stacked DRAM Technology Data movement are projected to account for over 75% of power budget for an exascale platform Work to reduce that via –Optical interconnect(s) –3D stacking (logic + memory + optics) –New memory protocols Research Questions –What is the performance potential of stacked memory (power & speed) –How much intelligence to put into logic layer Atomics, gather/scatter, checksums, full-processor-in-memory –What is the memory consistency model for intelligent DRAM –How to program it if we put embed more intelligence into DRAM
The Cost of Moving Data
Locality Management is Key What are the best combination of software and hardware mechanisms to maximize data movement efficiency Vertical Locality Management Horizontal Locality Management 14 Sun Microsystems Temporal Topological
Why Study Chip Stacking (TSVs)? Energy = (V 2 ∗ C) ∗ Overhead + Ecomm DRAM Cells Efficient DRAM cells require < 1 pJ to access Current DRAM architectures are not power efficient Long distances ➔ high power We pay for more than we get at every level –Cache: throw away 75-80% –DRAM Row: Charge 1024B for each 64B access –DIMM: Charge 8-9 chips/access –~800 pJ/byte total DRAM design driven by packaging constraints –~50% of DRAM chip cost is packaging, mainly in pins –DIMMs use multiple chips with a few data pins to achieve high BW TSVs Reduce Costs TSVs orders of magnitude less energy –250 fJ/bit for reading DRAM –5 fJ/bit for TSV –250 fJ/bit for mem. controller –~0.5 pJ/bit (compared to 30pJ for conventional DIMM) –Don’t have to access more data than needed Enables.... –Lower Capacitance: Narrower – Lower Overhead: Smarter –In-Memory computation Requires –...changes to how we view the machine & the memory 15
Why Photonics? ELECTRONICS: Buffer, receive, and re-transmit at every router. Space Parallelism: Each bus lane routed independently (P N LANES ). Off-chip BW requires much more power than on-chip BW.ELECTRONICS: Buffer, receive, and re-transmit at every router. Space Parallelism: Each bus lane routed independently (P N LANES ). Off-chip BW requires much more power than on-chip BW. Photonics changes the rules for Bandwidth-per-Watt. PHOTONICS: Modulate/receive data stream once per communication event. Wavelength Parallelism: Broadband switch routes entire multi-wavelength stream. Off-chip BW ≈ on-chip BW for nearly same power. PHOTONICS: Modulate/receive data stream once per communication event. Wavelength Parallelism: Broadband switch routes entire multi-wavelength stream. Off-chip BW ≈ on-chip BW for nearly same power.
HBDRAM Large Pin-out Complex wiring Low bandwidth density Distance constrained by electrical limitations High power dissipation All-optical link, no electronic bus to drive Bit-rate transparent link High bandwidth density, less pins Distance immunity at computer scale Low power dissipation Optical Link Traditional MemoryOptically-Connected Memory Why Optically-Connected Memory? Will not scale to meet power and bandwidth requirements of future high- performance computing systems Enables scaling of high-performance computing through increased memory capacity and bandwidth CPU HBDRAM CPU HBDRAM Electronic Bus
18
Mixed Model Simulation cycle accurate and energy-accurate models SST/macro skeleton app (C, C++, Fortran) skeleton app (C, C++, Fortran) (C++) NoC Model (PhoenixSim) Memory Model (DRAMSim2, FLASHsim, NVRAM) Memory Model (DRAMSim2, FLASHsim, NVRAM) Address Translation Processor Model (SST/micro & Tensilica) Processor Model (SST/micro & Tensilica) Workload Translation kernels SystemC Fault Injection Checkpoint/r estart MPI Traces (DUMPI) MPI Traces (DUMPI)
Simulator Infrastructure: Interconnects cycle accurate and energy-accurate models Developed by Sandia Collaborators CoDEx project
Simulator Infrastructure: Memory cycle accurate and energy-accurate models Validated against Micron DRAM HMC model coming this summer
Simulator Infrastructure cycle accurate and energy-accurate models Rewrote Columbia PhoenixSim summer 2011 Orion-2 energy model Validated against Cornell test parts
Simulator Infrastructure cycle accurate and energy-accurate models Full Gate-level RTL model of processor Well characterized energy model