Stream Register Files with Indexed Access
Nuwan Jayasena, Mattan Erez, Jung Ho Ahn, William J. Dally
Scaling Trends
ILP increasingly harder and more expensive to extract (CPU data courtesy of Francois Labonte, Stanford University)
Graphics processors exploit data parallelism
[Chart: performance scaling of CPUs vs. graphics processors, NV10 through NV35]
Renewed Interest in Data Parallelism
Data parallel application classes
–Media, signal, network processing, scientific simulations, encryption, etc.
High-end vector machines
–Have always been data parallel
Academic research
–Stanford Imagine, Berkeley V-IRAM, programming GPUs, etc.
“Main-stream” industry
–Sony Emotion Engine, Tarantula, etc.
Storage Hierarchy
Bandwidth taper
Only supports sequential streams/vectors
But many data parallel apps with
–Data reorderings
–Irregular data structures
–Conditional accesses
[Figure: bandwidth taper from DRAM through cache and stream/vector storage to the compute units]
Sequential Streams/Vectors Inefficient
Evaluate arbitrary order access to streams
[Figure: a matrix produced in row-major order must be reordered through memory before it can be consumed in column-major order]
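To make the cost concrete, here is a minimal behavioral sketch (not from the slides; all names are illustrative): with sequential-only streams, reading a row-major matrix in column-major order forces an explicit reordering pass, i.e. an extra store-and-reload of the whole matrix over the memory system.

#include <cstddef>
#include <vector>

// Behavioral model: produce the column-major ordering of an n x n row-major
// matrix by materializing it in a second memory buffer, as a machine limited
// to sequential streams effectively must do before the compute kernel runs.
std::vector<float> reorder_through_memory(const std::vector<float>& row_major,
                                          std::size_t n) {
    std::vector<float> col_major(n * n);
    for (std::size_t j = 0; j < n; ++j)       // for each column of the source
        for (std::size_t i = 0; i < n; ++i)   // copy it out as a contiguous run
            col_major[j * n + i] = row_major[i * n + j];
    return col_major;  // roughly 2 * n * n words of extra memory traffic
}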
Outline: Stream processing overview, Applications, Implementation, Results, Conclusion
Stream Programming
Streams of records passing through compute kernels
Parallelism
–Across stream elements
–Across kernels
Locality
–Within kernels
–Between kernels
[Figure: FFT_stage kernel reading input streams in1 and in2 and writing an output stream]
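A minimal software model of this programming style (illustrative only; it mimics but is not the StreamC/KernelC API): a kernel is a function applied element-wise to streams of records, so parallelism exists across elements and locality exists inside each kernel and between producer and consumer kernels.

#include <cstddef>
#include <vector>

struct Record { float re, im; };                 // a stream is a sequence of records

// One kernel: reads its input streams element by element, keeps intermediate
// state in local variables (kernel locality), and writes an output stream
// that the next kernel consumes directly (producer-consumer locality).
std::vector<Record> fft_stage(const std::vector<Record>& in1,
                              const std::vector<Record>& in2) {
    std::vector<Record> out(in1.size());
    for (std::size_t i = 0; i < in1.size(); ++i) {   // independent iterations: data parallelism
        Record a = in1[i], b = in2[i];
        out[i] = { a.re + b.re, a.im + b.im };       // stand-in for the real butterfly
    }
    return out;
}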
Bandwidth Hierarchy
Stream programming is well matched to the bandwidth hierarchy
[Figure: FFT_stage working set held in the stream register file (SRF) between memory and the compute units over time]
Stream Processors
Several lanes
–Execute in SIMD
–Operate on records
[Figure: lane 0 through lane N-1, each pairing an SRF bank with a compute cluster; lanes connected by an inter-cluster network and through a memory switch to the memory system]
Outline: Stream processing overview, Applications, Implementation, Results, Conclusion
Stream-Level Data Reuse
Sequential streams only capture in-order reuse
Arbitrary access patterns in the SRF capture more of the available temporal locality
Stream data reuse:
–Sequential (in-order) reuse, e.g. linear streams
–Non-sequential reuse
  –Reordered reuse, e.g. 2-D, 3-D accesses, multi-grid
  –Intra-stream reuse, e.g. irregular neighborhoods, table lookups
Reordered Reuse
Indexed SRF access eliminates reordering through memory
[Figure: the reordering pass through memory from the earlier transpose example is removed; the consuming kernel reads the SRF directly in the new order]
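A behavioral sketch of the same transpose example once indexed SRF access is available (illustrative; the SRF bank is modeled as a plain array): the matrix stays in the SRF in the order it was produced, and the consuming kernel simply computes column-major indices, so no reordering pass through memory is needed.

#include <cstddef>
#include <vector>

// Behavioral model: the producer left an n x n matrix in the SRF in row-major
// order; the consumer reads it column-major by computing indices, instead of
// round-tripping a reordered copy through DRAM.
float consume_column_major(const std::vector<float>& srf_bank, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            acc += srf_bank[i * n + j];   // indexed SRF read at a computed address
    return acc;
}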
Intra-stream Reuse
Indexed SRF access eliminates
–Replication in the SRF
–Redundant memory transfers
[Figure: with a sequential SRF, repeatedly referenced records (A, B, C, D, ...) are replicated in the gathered stream and re-fetched from memory; with indexed access, a single copy in the SRF serves all references]
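A sketch of the intra-stream reuse case (illustrative model, not the hardware API): with sequential streams, every reference to a shared record, for example a table entry or an irregular neighbor, must be replicated into the gathered stream, whereas an indexed SRF keeps one copy and the kernel looks it up by index.

#include <cstddef>
#include <vector>

// Sequential-stream version: the gathered stream carries one copy of the
// record per reference, so shared records are replicated in the SRF and in
// memory traffic.
float sum_gathered(const std::vector<float>& gathered_copies) {
    float acc = 0.0f;
    for (float v : gathered_copies) acc += v;
    return acc;
}

// Indexed-SRF version: one copy of each record lives in the SRF bank; the
// kernel follows an index stream, so reuse stays on chip with no replication.
float sum_indexed(const std::vector<float>& srf_bank,
                  const std::vector<std::size_t>& index_stream) {
    float acc = 0.0f;
    for (std::size_t idx : index_stream) acc += srf_bank[idx];
    return acc;
}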
Conditional Accesses
Fine-grain conditional accesses
–Expensive in SIMD architectures
–Translate to conditional address computation
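A sketch of how a fine-grain conditional becomes conditional address computation (illustrative model; the helper name and buffer layout are assumptions): instead of branching, each element's condition only affects which SRF location is written next, so all SIMD lanes keep executing the same instruction stream.

#include <cstddef>
#include <vector>

// Behavioral model: conditionally "append" records to an output region of the
// SRF. The store is unconditional; the condition only advances the write
// pointer, which is plain address arithmetic. srf_out must have capacity for
// at least in.size() words.
std::size_t compact_if_positive(const std::vector<float>& in,
                                std::vector<float>& srf_out) {
    std::size_t write_ptr = 0;
    for (float v : in) {
        bool keep = (v > 0.0f);        // the fine-grain condition
        srf_out[write_ptr] = v;        // store to a computed address
        write_ptr += keep ? 1 : 0;     // unkept elements are overwritten later
    }
    return write_ptr;                  // number of records actually kept
}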
Outline: Stream processing overview, Applications, Implementation, Results, Conclusion
Base Architecture
Each SRF bank accesses a block of b contiguous words
[Figure: SRF bank 0 through SRF bank N-1, each feeding its compute cluster through a b*W-wide access; clusters connected by the inter-cluster network]
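To picture the baseline access pattern, here is a small sketch under assumptions: the block-round-robin striping of a sequential stream across lanes shown below is an illustrative layout, not taken from the slides. Each bank delivers b contiguous words per access to its own cluster.

#include <cstddef>

// Hypothetical mapping of sequential-stream element k onto the base machine,
// assuming b-word blocks are dealt round-robin across num_lanes SRF banks.
struct SrfLocation { std::size_t lane; std::size_t offset; };

SrfLocation locate(std::size_t k, std::size_t b, std::size_t num_lanes) {
    std::size_t block = k / b;                       // which b-word block of the stream
    return { block % num_lanes,                      // lane / SRF bank holding the block
             (block / num_lanes) * b + (k % b) };    // word offset within that bank
}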
Indexed SRF Architecture
Address path from clusters
Lower indexed access bandwidth
[Figure: as the base architecture, with address FIFOs added between each compute cluster and its SRF bank]
Base SRF Bank
Several SRAM sub-arrays
Each access is to one sub-array
[Figure: an SRF bank built from sub-arrays 0 through 3 with local word-line drivers, feeding the compute cluster]
Indexed SRF Bank
Extra 8:1 mux at each sub-array output
–Allows 4 independent 1-word accesses per bank (one per sub-array)
[Figure: sub-arrays 0 through 3, each with its own pre-decode and row decoder, and an 8:1 output mux]
Cross-lane Indexed SRF
Address switch added (SRF address network)
Inter-cluster network used for cross-lane SRF data
[Figure: compute clusters issue addresses through address FIFOs onto the SRF address network to any SRF bank; data returns over the inter-cluster network]
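A sketch of what cross-lane indexing adds (illustrative; the modulo bank-selection rule is an assumption, not taken from the slides): the index now also selects which lane's bank holds the word, the address travels over the SRF address network to that bank, and the data returns to the requesting cluster over the inter-cluster network.

#include <cstddef>
#include <vector>

// Behavioral model of a cross-lane indexed read: the global SRF index is split
// into a bank (lane) number and a word offset within that bank.
float cross_lane_read(const std::vector<std::vector<float>>& srf_banks,
                      std::size_t global_index) {
    std::size_t num_lanes = srf_banks.size();
    std::size_t bank   = global_index % num_lanes;   // routed over the SRF address network
    std::size_t offset = global_index / num_lanes;   // bank-local word address
    return srf_banks[bank][offset];                  // data returns over the inter-cluster network
}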
Overhead - Area
In-lane indexing overheads: 11% over sequential SRF
–Per-sub-array independent addressing
Cross-lane indexing overheads: 22% over sequential SRF
–Address switch
Overall: 1.5% to 3% increase in die area (Imagine processor)
Overhead - Energy
0.1 nJ (0.13 µm) per indexed SRF access
–~4x a sequential SRF access
–More than an order of magnitude lower than a DRAM access
0.25 nJ per cache access
Each indexed access replaces many SRF and DRAM/cache accesses
Outline: Stream processing overview, Applications, Implementation, Results, Conclusion
Benchmarks
64x64 2D FFT
–2D accesses
Rijndael (AES)
–Table lookups
Merge-sort
–Fine-grain conditionals
5x5 convolution filter
–Regular neighborhood
Irregular graph
–Irregular neighborhood access
–Parameterized (IG_SML/DMS/DCS/SCL): Sparse/Dense graph, Memory/Compute-limited, Short/Long strips
Machine Organizations
Base (sequential SRF): DRAM, memory switch, SRF banks, compute clusters, inter-cluster network
Base + cache: adds a cache between DRAM and the memory switch
Indexed SRF: adds an SRF address network
[Figure: block diagrams of the three organizations]
Machine Parameters
Common to all configurations: 0.13 µm technology, 1 GHz; 8 compute clusters, 32 GFLOPS peak; 9.14 GB/s DRAM bandwidth
Base: 128 KB SRF, 128 GB/s sequential
Base + cache: 128 KB SRF, 128 GB/s sequential; 128 KB cache, 16 GB/s
Indexed SRF: 128 KB SRF, 128 GB/s sequential, 128 GB/s in-lane indexed, 32 GB/s cross-lane indexed
Off-chip Memory Bandwidth
[Results charts; not reproduced in this transcript]
Execution Time
[Results chart; not reproduced in this transcript]
Outline: Stream processing overview, Applications, Implementation, Results, Conclusion
Conclusions
Data parallelism increasingly important
Current data parallel architectures inefficient for some application classes
–Irregular accesses
Indexed SRF accesses
–Reduce memory traffic
–Reduce SRF data replication
–Efficiently support complex/conditional stream accesses
Performance improvements
–3% to 410% for target application classes
Low implementation overhead
–1.5% to 3% die area
Backups
Indexed Access Instruction Overhead
Excludes address issue instructions
[Chart; not reproduced in this transcript]
Kernel C API

while (!eos(in)) {
  in >> a;         // read the next input record
  LUT[a] >> b;     // indexed SRF access, written as a single operation
  c = foo(a, b);
  out << c;
}

An indexed access is really 2 separate instructions:
–Address issue: LUT.index << a;
–Data read: LUT >> b;
with independent instructions scheduled between them.
Address-data separation
–May require loop unrolling, software pipelining, etc.
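A behavioral sketch of the scheduling issue this slide raises (illustrative only; it models the address FIFO with a plain variable rather than using the real Kernel C API): software-pipelining the loop by one stage lets the address issue for element i+1 overlap the data read and the independent work for element i.

#include <cstddef>
#include <vector>

// Model: a table lookup with the address issue and the data read separated,
// software-pipelined by one iteration so independent work covers the
// indexed-SRF access latency.
float lookup_pipelined(const std::vector<int>& in, const std::vector<float>& lut) {
    if (in.empty()) return 0.0f;
    float acc = 0.0f;
    int pending_addr = in[0];                 // prologue: issue the first address
    for (std::size_t i = 0; i + 1 < in.size(); ++i) {
        int next_addr = in[i + 1];            // independent work for this iteration
        float b = lut[pending_addr];          // data read for element i arrives now
        acc += b;
        pending_addr = next_addr;             // issue the address for element i+1
    }
    acc += lut[pending_addr];                 // epilogue: drain the last lookup
    return acc;
}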
Sensitivity to SRF Access Latency (1) and (2)
[Charts; not reproduced in this transcript]
Why Graphics Hardware?
Pentium 4 SSE theoretical*: 3 GHz * 4 wide * 0.5 inst/cycle = 6 GFLOPS
GeForce FX 5900 (NV35) fragment shader observed: MULR R0, R0, R0: 20 GFLOPS
–Equivalent to a 10 GHz Pentium 4
–And getting faster: 3x improvement over NV30 (6 months)
*from Intel P4 Optimization Manual
[Chart: Pentium 4, NV30, NV35]
Slide from Ian Buck, Stanford University
NVIDIA Graphics Growth (225%/yr)
(1) Dual textured; (2) Programmable
Essentially Moore's Law cubed.
Season | Product | Process | # Trans | Gflops | 32-bit AA Fill | Mpolys | Notes
2H97 | Riva | ? | ? | 5 | 20M | 3M | Integrated 2D/3D
1H98 | Riva ZX | .25 | 5M | 7 | 31M | 3M | AGP2x
2H98 | Riva TNT | .25 | 7M | 10 | 50M | 6M | 32-bit
1H99 | TNT2 | .22 | 9M | 15 | 75M | 9M | AGP4x
2H99 | GeForce | .22 | 23M | 25 | 120M | 15M | HW T&L
1H00 | GF2 GTS | .18 | 25M | 35 | 200M (1) | 25M | Per-Pixel Shading
2H00 | GF2 Ultra | .18 | 25M | 45 | 250M (1) | 31M | 230 MHz DDR
1H01 | GeForce3 | .15 | 57M | 80 | 500M (1) | 30M (2) | Programmable
Slide from Pat Hanrahan, Kurt Akeley
NVIDIA Historicals
[Table: Season, Product, MT/s, Yr rate, MF/s, Yr rate for products from Riva (2H97) through GeForce4; numeric values not recoverable from this transcript]
Slide from Pat Hanrahan, Kurt Akeley
Base Architecture
Stream buffers match SRF bandwidth to compute needs
[Figure: lanes 0 through 7; each SRF bank connects to its stream buffers over a 128b path, and the stream buffers feed the compute cluster 32b at a time; clusters connected by the inter-cluster network]

Indexed SRF Architecture
Address path from clusters
Lower indexed access bandwidth
[Figure: as above, with address FIFOs from each compute cluster to its SRF bank]

Base SRF Bank
Several SRAM sub-arrays
[Figure: SRF bank with sub-arrays 0 through 3 and local word-line drivers, feeding the compute cluster]

Indexed SRF Bank
Extra 8:1 mux at each sub-array output
–Allows 4 independent 1-word accesses per bank
[Figure: sub-arrays 0 through 3, each with its own pre-decode and row decoder, and an 8:1 output mux]

Cross-lane Indexed SRF
Address switch added
Inter-cluster network used for cross-lane SRF data
[Figure: stream buffers and address FIFOs per lane; the SRF address network routes addresses to any bank, and data returns over the inter-cluster network]