Energy-Efficient GPGPU Architectures via Collaborative Compilation and Memristive Memory-Based Computing
Abbas Rahimi†, A. Ghofrani‡, M. A. Montano‡, K-T Cheng‡, L. Benini*, R. K. Gupta†
†UCSD, ‡UCSB, ETHZ*, UNIBO*
Micrel.deis.unibo.it /MultiTherman
Variability.org
Energy-Efficient GPGPU
- Thousands of deep and wide pipelines make GPGPUs power-hungry parts.
- Near-threshold (NT) operation and voltage overscaling (VOS) achieve energy efficiency at the cost of:
  - performance loss
  - increased timing sensitivity in the presence of variations
- ✓ SIMD, × conservative guardbands: loss of operational efficiency.
- Total delay: corner + 3σ stochastic delay (Kakoee et al, TCAS-II’12; figure label: guardband).
Variability is about Cost and Scale
- Eliminating the guardband leads to timing errors (Bowman et al, JSSC’09).
- Error recovery is costly for SIMD:
  - Wide lanes: error rate × wider width.
  - Deep pipes: recovery cycles increase linearly with pipeline length.
  - Combined, recovery becomes quadratically expensive.
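The scaling argument on this slide can be made concrete with a rough back-of-the-envelope sketch. It assumes (this is not stated in the slides) that lanes fail independently with the same per-lane error probability, and that one error in any lane stalls the whole lock-step SIMD unit for a number of cycles equal to the pipeline depth. The function names are hypothetical.

```python
def simd_error_probability(p_lane: float, width: int) -> float:
    """Probability that at least one of `width` lanes sees a timing error.

    Assumption: lanes fail independently with probability `p_lane`.
    """
    return 1.0 - (1.0 - p_lane) ** width


def expected_recovery_cycles(p_lane: float, width: int, depth: int) -> float:
    """Expected recovery cycles per SIMD instruction.

    Assumption: one erroneous lane forces a lock-step recovery that costs
    `depth` cycles (recovery grows linearly with pipeline length).
    """
    return simd_error_probability(p_lane, width) * depth


if __name__ == "__main__":
    base = expected_recovery_cycles(0.001, width=16, depth=8)
    scaled = expected_recovery_cycles(0.001, width=32, depth=16)
    # Doubling both width and depth roughly quadruples the expected cost.
    print(f"base={base:.4f} scaled={scaled:.4f} ratio={scaled / base:.2f}")
```

For small error rates the expected cost is approximately `p_lane * width * depth`, which is the "quadratic" growth the slide refers to when both dimensions scale together.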
Taxonomy of SIMD Variability-Tolerance
Guardband
- Adaptive (no timing error): predict & prevent
  - Hierarchically focused guardbanding and uniform instruction assignment (Rahimi et al, DATE’13; Rahimi et al, DAC’13)
- Eliminating (timing error): error recovery
  - Detect-then-correct
  - Memoization: recalling recent context of error-free execution, approximately or exactly (Rahimi et al, TCAS’13; Rahimi et al, DATE’14)
  - Independent recovery: lane decoupling through private queues (Pawlowski et al, ISSCC’12; Krimer et al, ISCA’12)
Further diagram labels: Exact / approximate computing; Exact computing
Contributions
Efficient spatiotemporal reuse of computation in GPGPUs through collaborative:
- Micro-architectural design
  - An associative memristive memory (AMM) module is integrated with the FPUs and represents part of their functionality.
- Compiler profiling
  - Fine-grained partitioning of values (searching the space of possible inputs)
  - Pre-storing highly frequent sets of values in the AMM modules
  - Ensuring their resiliency under voltage overscaling
Target: Evergreen GPGPUs
Collaborative compilation framework and memristive-based computing
1) Profiling (one-off activity): the profiler takes the OpenCL kernel and training datasets and identifies highly frequent computations.
2) Code generation: a customized clCreateBuffer inserts the AMM contents; kernel launching, kernel programming.
3) Runtime: FPU and AMM, with a match check (=?).
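The profiling step can be sketched as a frequency count over a training trace. This is a minimal illustration, not the authors' profiler: the trace format `(op, operands, result)`, the function name, and the selection by raw frequency are all assumptions. The default of 32 entries follows the 32-entry AMMs mentioned in the conclusion.

```python
from collections import Counter


def profile_frequent_computations(trace, entries=32):
    """Pick the most frequent operand tuples per operation.

    trace:   iterable of (op, operands, result), e.g. ("+", (1.0, 2.0), 3.0).
             This trace format is hypothetical.
    entries: number of AMM entries available per operation.
    Returns: dict mapping op -> list of (operands, result), most frequent first.
    """
    counts = {}   # op -> Counter of operand tuples
    results = {}  # (op, operands) -> result observed during profiling
    for op, operands, result in trace:
        counts.setdefault(op, Counter())[operands] += 1
        results[(op, operands)] = result
    return {
        op: [(operands, results[(op, operands)])
             for operands, _ in counter.most_common(entries)]
        for op, counter in counts.items()
    }


if __name__ == "__main__":
    training_trace = [
        ("+", (1.0, 2.0), 3.0),
        ("+", (1.0, 2.0), 3.0),
        ("+", (0.5, 0.5), 1.0),
        ("sqrt", (4.0,), 2.0),
    ]
    print(profile_frequent_computations(training_trace, entries=1))
```

The returned table stands in for what the slide calls the AMM contents. How those contents are actually passed through the customized clCreateBuffer is not described in the slides and is not modelled here.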
AMM with FPU
[Figure: AMM coupled with the FPU. Labels: Operands, Search, Return pre-stored result, Error, No, Recovery]
AMM:
- Software programmable
- Mimics partial functionality of the FPU
- Two pipelined stages: search operands, then return the pre-stored result
Components:
- Ternary content addressable memory (TCAM): a self-referenced sensing scheme†, 2-bit encoding, 15% positive slack at 45nm
- Crossbar-based 1T-1R memristive memory block: avoids read disturbance
†Li et al, JSSC’14
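A software model can illustrate the search-then-return behaviour of an AMM next to an FPU. The sketch below is an assumption-laden illustration, not the hardware design: the class and function names are hypothetical, operands are treated as 32-bit floats, and the `care_mask` parameter only shows what "ternary" matching means in a TCAM (masked-out bits are don't-care). The slides do not say how the authors use don't-care bits.

```python
import struct


def f32_bits(x: float) -> int:
    """Return the IEEE-754 single-precision bit pattern of x as an integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


class AMM:
    """Toy model of an associative memory: TCAM search plus a result store."""

    def __init__(self, capacity=32):
        self.capacity = capacity
        self.entries = []  # list of (operand bit patterns, care mask, result)

    def program(self, operands, result, care_mask=0xFFFFFFFF):
        """Store one operands -> result mapping (software programmable)."""
        if len(self.entries) >= self.capacity:
            raise ValueError("AMM is full")
        keys = tuple(f32_bits(x) for x in operands)
        self.entries.append((keys, care_mask, result))

    def search(self, operands):
        """Stage 1: search operands. Stage 2: return the pre-stored result."""
        bits = tuple(f32_bits(x) for x in operands)
        for keys, mask, result in self.entries:
            if len(keys) == len(bits) and all(
                ((b ^ k) & mask) == 0 for b, k in zip(bits, keys)
            ):
                return True, result
        return False, None


def execute(amm, fpu_op, operands):
    """Use the AMM result on a hit, otherwise fall back to the FPU."""
    hit, result = amm.search(operands)
    if hit:
        return result, "amm"
    return fpu_op(*operands), "fpu"


if __name__ == "__main__":
    amm_add = AMM()
    amm_add.program((1.0, 2.0), 3.0)
    print(execute(amm_add, lambda a, b: a + b, (1.0, 2.0)))
    print(execute(amm_add, lambda a, b: a + b, (1.5, 2.0)))
```

In hardware the TCAM compares all entries in parallel; the sequential loop here is only for readability.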
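AMM Hit Rates
Kernel: OpenCL Sobel
- Offline: the profiler processes the train input and extracts, per operation, mappings from operands to results:
  +: {a, b} → {q}
  *: {a, b} → {q}
  √ : {a} → {q}
  …
- Programming before launching the kernel: each set of mappings is loaded into the AMM paired with the matching FPU (FPU+ / AMM+, FPU* / AMM*, FPU√ / AMM√, …).
- Runtime: test1, test2, test3, test4.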
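Measuring a hit rate on inputs that were not used for profiling can be sketched as follows. This is an illustrative model only: the trace format, the per-operation key sets, and the function name are assumptions, and matching is done on exact operand tuples.

```python
def amm_hit_rates(programmed, test_trace):
    """Compute the per-operation AMM hit rate for a test trace.

    programmed: dict mapping op -> set of operand tuples stored in that op's AMM.
    test_trace: iterable of (op, operands) observed at runtime.
    Returns:    dict mapping op -> fraction of executions served by the AMM.
    """
    hits = {}
    totals = {}
    for op, operands in test_trace:
        totals[op] = totals.get(op, 0) + 1
        if operands in programmed.get(op, set()):
            hits[op] = hits.get(op, 0) + 1
    return {op: hits.get(op, 0) / total for op, total in totals.items()}


if __name__ == "__main__":
    # Contents chosen offline from a training input (hypothetical values).
    programmed = {"+": {(1.0, 2.0)}, "sqrt": {(4.0,)}}
    # Operations observed while running a different, unseen test input.
    test_trace = [
        ("+", (1.0, 2.0)),
        ("+", (3.0, 2.0)),
        ("sqrt", (4.0,)),
        ("*", (1.0, 2.0)),
    ]
    print(amm_hit_rates(programmed, test_trace))
```

Keeping the rates separate per operation mirrors the slide's pairing of one AMM with each FPU type.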
Efficiency under Voltage Overscaling
[Chart data labels (axes and kernel names not recoverable): 33%, 30%, 36%, 19%, 17%, 33%, 28%, 32%, 39%, 29%, 37%, 28%]
- Timing errors are reduced from 38% to 24%.
- At 1.0V, without any timing error: 36% average energy saving (7 kernels).
- At 0.88V: 39% energy saving on average.
Conclusion
- Static compiler analysis and coordinated microarchitectural design enable efficient reuse of computations in GPGPUs.
- Emerging associative memristive modules are coupled with the FPUs for fast spatial and temporal reuse.
- GPGPU kernels exhibit low entropy, yielding an average energy saving of 36% with 32-entry AMMs.