A Massively Parallel, Hybrid Dataflow/von Neumann Architecture
Yoav Etsion
November 11, 2015
Massively Parallel Computing
CUDA/OpenCL are gaining traction in high-performance computing (HPC)
– Same code; different data
GPUs deliver better FLOPS per Watt
– Available in mobile systems and supercomputers
But… GPGPUs still suffer from von-Neumann inefficiencies
von-Neumann inefficiencies
Fetch/Decode/Issue each instruction
– Even though most instructions come from loops
Explicit storage needed for communicating values between instructions
– Register file; stack
– Data travels between execution units and storage
Component: Inst. fetch, Pipeline registers, Data cache, Register file, Control, ALU
Power [%]: 33%, 22%, 19%, 10%, 6%
[Understanding Sources of Inefficiency in General-Purpose Chips, Hameed et al., ISCA’10]
Quantifying inefficiencies: instruction pipeline
Every instruction fetched, decoded and issued
Very wasteful
Most of the execution time is spent in (tight) loops
Avg. pipeline power consumption:
– NVIDIA Tesla >10% of processor power [Hong and Kim. ISCA’10]
– NVIDIA Fermi ~15% of processor power [Leng et al. ISCA’13]
Quantifying Inefficiencies: Register File
Communication via bulletin board
– 40% of values only read once [Gebhart et al. ISCA’11]
Avg. register file power consumption:
– NVIDIA Tesla 5-10% of processor power [Hong and Kim. ISCA’10]
– NVIDIA Fermi >15% of processor power [Leng et al. ISCA’13]
Alternatives to von-Neumann: Dataflow/spatial computing
Processor is a grid of functional units
Computation graph is mapped to the grid
– Statically, at compile time
No energy wasted on pipeline
– Instructions are statically mapped to nodes
No energy wasted on RF and data transfers
– No centralized register file needed
– Save static power and area (128KB on Fermi)
Spatial/Dataflow Computing
    int temp1 = a[threadId] * b[threadId];
    int temp2 = 5 * temp1;
    if (temp2 > 255) {
        temp2 = temp2 >> 3;
        result[threadId] = temp2;
    } else
        result[threadId] = temp2;
[Figure: dataflow graph of the kernel. Nodes: entry, threadIdx, a, b, result, IMM_5, IMM_3, IMM_256, S_LOAD1, S_LOAD2, ALU1_mul, ALU2_mul, ALU3_icmp, ALU4_ashl, JOIN1, if_then, if_else, S_STORE3, S_STORE4]
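To make the mapping from kernel to graph concrete, here is a minimal Python sketch that evaluates the kernel above as a list of graph nodes, one thread at a time. The encoding (`GRAPH`, `run_thread`, the `SELECT` node, the `mem` dictionary) is hypothetical and not taken from the slides: the branch is modeled as a select over both candidate values, whereas the figure uses separate if_then/if_else paths with their own store nodes. The sketch follows the kernel source (`> 255`, `>> 3`) rather than the figure's IMM_256 and ALU4_ashl labels.

```python
# Hypothetical graph encoding: (node name, operation, input names),
# listed in topological order. Node names loosely mirror the figure.
GRAPH = [
    ("S_LOAD1",    lambda mem, tid: mem["a"][tid], ("mem", "tid")),
    ("S_LOAD2",    lambda mem, tid: mem["b"][tid], ("mem", "tid")),
    ("ALU1_mul",   lambda x, y: x * y,             ("S_LOAD1", "S_LOAD2")),
    ("ALU2_mul",   lambda x: 5 * x,                ("ALU1_mul",)),
    ("ALU3_icmp",  lambda x: x > 255,              ("ALU2_mul",)),
    ("ALU4_shift", lambda x: x >> 3,               ("ALU2_mul",)),
    # Assumption: the if/else is modeled as a select between two values.
    ("SELECT",     lambda c, t, e: t if c else e,
     ("ALU3_icmp", "ALU4_shift", "ALU2_mul")),
]

def run_thread(mem, tid):
    """Push one thread (flow) through the graph and store its result."""
    values = {"mem": mem, "tid": tid}
    for name, op, inputs in GRAPH:
        # A node fires once all of its input values are available.
        values[name] = op(*(values[i] for i in inputs))
    mem["result"][tid] = values["SELECT"]
    return values["SELECT"]

mem = {"a": [1, 10, 100], "b": [2, 20, 3], "result": [0, 0, 0]}
outputs = [run_thread(mem, tid) for tid in range(3)]
print(outputs)  # [10, 125, 187]
```

Each node only names its producers, so no register file appears in the model: values travel directly from one node to the next.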
SGMF: A Massively Multithreaded Dataflow Architecture
Every thread is a flow through the dataflow graph
Many threads execute (flow) in parallel
Execution Overview: Dynamic Dataflow
Each flow/thread is associated with a token
Execute the operation when tokens match
Parallelism is determined by the number of tokens in the system
[Figure: token matching; OoO LD/ST units]
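Token matching can be illustrated with a small Python model of a single two-input node. This is a sketch under assumptions, not the SGMF hardware design: the class name `MatchingNode`, the port numbering, and the unbounded `waiting` dictionary are all hypothetical (a real token buffer is finite).

```python
class MatchingNode:
    """Two-input node that fires when both operands of one thread arrive."""

    def __init__(self, op):
        self.op = op
        self.waiting = {}  # thread id -> {port: value}, operands still unmatched
        self.fired = []    # (thread id, result) in firing order

    def receive(self, tid, port, value):
        slots = self.waiting.setdefault(tid, {})
        slots[port] = value
        if len(slots) == 2:        # both tokens of this thread are present
            del self.waiting[tid]
            out = self.op(slots[0], slots[1])
            self.fired.append((tid, out))
            return out
        return None                # still waiting for the matching token

node = MatchingNode(lambda x, y: x * y)
node.receive(0, 0, 2)   # thread 0, left operand: no match yet
node.receive(1, 0, 3)   # thread 1, left operand: no match yet
node.receive(1, 1, 4)   # thread 1 matches first and fires
node.receive(0, 1, 5)   # thread 0 fires afterwards
print(node.fired)       # [(1, 12), (0, 10)]
```

Threads fire in the order their operands become complete, not in thread-id order, which is the sense in which the number of tokens in flight bounds the available parallelism.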
DESIGN ISSUES
A Massively Multithreaded Dataflow Processor
Multithreading Design Issues: Preventing Deadlocks
Imbalanced out-of-order memory responses may trigger deadlocks
Solution: load-store units limit bypassing to the size of the token buffer
[Figure: deadlock due to limited buffer space; OoO LD/ST units]
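One possible reading of the bypass limit is a sliding window over thread ids: a memory response may overtake older threads only if it stays within one token buffer's worth of the oldest thread still outstanding. The Python sketch below models that reading. The function `forward_order`, the window rule, and the assumption that thread ids run from 0 upward are illustrative guesses, not details given in the slides.

```python
def forward_order(arrivals, buffer_size):
    """Order in which responses are forwarded to the token buffer.

    arrivals: thread ids in the order memory returns them (assumed to be
    a permutation of 0..n-1).
    A held response is forwarded only while its thread id is below
    oldest + buffer_size, where oldest is the oldest thread not yet forwarded.
    """
    forwarded, held, order = set(), set(), []
    oldest = 0
    for tid in arrivals:
        held.add(tid)
        while True:
            eligible = [t for t in sorted(held) if t < oldest + buffer_size]
            if not eligible:
                break              # everything held is too far ahead
            t = eligible[0]
            held.remove(t)
            forwarded.add(t)
            order.append(t)
            while oldest in forwarded:
                oldest += 1        # slide the window forward
    return order

print(forward_order([3, 2, 1, 0], buffer_size=2))  # [1, 0, 2, 3]
print(forward_order([3, 2, 1, 0], buffer_size=4))  # [3, 2, 1, 0]
```

With a window of 2, the responses for threads 3 and 2 are held back so they cannot occupy the buffer slots that threads 0 and 1 need; with a window of 4, every response passes through as it arrives.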
Design issues: Variable path lengths
Short paths must wait for long paths
Solution: equalize paths’ lengths
[Figure: graph with inputs a, b, c and x / + nodes, showing a bubble]
Design issues: Variable path lengths
Solution: inject buffers to equalize path lengths
Done in two phases:
– Before mapping & NoC configuration: all the routes between every two connected nodes U and V are equalized by inserting buffers
– After mapping & NoC configuration: the path lengths may have changed, so the buffer lengths need recalibration
[Figure: graph with inputs a, b, c; B marks a buffer]
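The first phase (before mapping) can be sketched as a longest-path computation over the dataflow graph. The Python below is a minimal illustration, assuming every node takes one cycle and every edge costs nothing until buffers are added. The function `buffers_per_edge` and the node names are hypothetical, and the second phase (recalibration after mapping and NoC configuration) is not modeled.

```python
from collections import defaultdict

def buffers_per_edge(edges):
    """Return {(u, v): buffers} so all paths into each node are equally long.

    edges: list of (u, v) pairs forming a DAG.
    """
    succs = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v in edges:
        succs[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))

    # Longest distance from any input, computed in topological order.
    depth = {n: 0 for n in nodes}
    ready = [n for n in nodes if indeg[n] == 0]
    while ready:
        u = ready.pop()
        for v in succs[u]:
            depth[v] = max(depth[v], depth[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)

    # An edge that skips levels needs one buffer per skipped level.
    return {(u, v): depth[v] - depth[u] - 1 for u, v in edges}

edges = [("a", "mul"), ("b", "mul"), ("mul", "add"), ("c", "add")]
print(buffers_per_edge(edges))
```

For the graph (a * b) + c, the edge from c to the adder is the only one that skips a level, so it receives a single buffer, while the other three edges receive none.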
ARCHITECTURE
A Massively Multithreaded Dataflow Processor
Architecture overview
Heterogeneous grid of tiles
1. Compute tiles: very similar to CUDA cores
2. LD/ST tiles: buffer and throttle data
3. Control tiles: pipeline buffering and join ops.
4. Special tiles: deal with non-pipelined operations
Reference point:
– A single grid is the equivalent of a single NVIDIA Streaming Multiprocessor (SM)
– Total buffering capacity in SGMF is less than 30% of that of an NVIDIA Fermi register file
Architecture overview
Interconnect
Switches are connected using a folded cube [Properties and performance of folded hypercubes, El-Amawy et al., IEEE TPDS 1991]
8 “almost-NN”
Static switching
– Determined at compile time
EVALUATION
A Massively Multithreaded Dataflow Processor
Methodology
The main HW blocks were implemented in Verilog
Synthesized to a 65nm process
– Validate timing and connectivity
– Estimate area and power consumption
– The size of one SGMF core synthesized with a 65nm process is 54.3mm²
– When scaled down to 40nm, each SGMF core would occupy 21.18mm²
– NVIDIA Fermi GTX480 card (40nm) occupies 529mm²
Cycle-accurate simulations based on GPGPUSim
– We integrated synthesis results into the GPGPUSim/Wattch power model
Benchmarks from the Rodinia suite
– CUDA kernels, compiled for SGMF
Single core system
SGMF vs. Fermi – Performance
Single core system
Energy savings
core system
SGMF vs. Fermi – Performance
core system
Energy savings
Conclusions
von-Neumann engines have inherent inefficiencies
– Throughput computing can benefit from dataflow/spatial computing
SGMF can potentially achieve much better performance/power than current GPGPUs
– Almost 2x speedup (average) and 50% energy saving
– Need to tune the memory system
Greatly motivates further research
– Compilation, place & route, connectivity, …
Thank you!
Questions?