Slide 1: Stream Register Files with Indexed Access
Nuwan Jayasena, Mattan Erez, Jung Ho Ahn, William J. Dally
Slide 2: Scaling Trends
ILP increasingly harder and more expensive to extract
Graphics processors exploit data parallelism
[Chart: CPU vs. GPU scaling (NV10, NV35); CPU data courtesy of Francois Labonte, Stanford University]
Slide 3: Renewed Interest in Data Parallelism
Data parallel application classes
– Media, signal, network processing, scientific simulations, encryption, etc.
High-end vector machines
– Have always been data parallel
Academic research
– Stanford Imagine, Berkeley V-IRAM, programming GPUs, etc.
“Main-stream” industry
– Sony Emotion Engine, Tarantula, etc.
Slide 4: Storage Hierarchy
Bandwidth taper
Only supports sequential streams/vectors
But many data parallel apps with
– Data reorderings
– Irregular data structures
– Conditional accesses
[Figure: bandwidth taper from DRAM through cache and stream/vector storage to the compute (+, x) units]
Slide 5: Sequential Streams/Vectors Inefficient
Evaluate arbitrary order access to streams
[Figure: a 4x4 matrix example over time; data written in row-major order must pass back through memory/cache for a Reorder step before the stream/vector storage can feed it to the compute units in column-major order]
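The contrast on this slide can be restated in ordinary code. The sketch below is illustrative C++ only, not the stream-processor programming model, and the 4x4 size and function names are assumed for the example: with sequential-only streams, consuming a row-major matrix in column-major order means first materializing a reordered copy through memory, whereas arbitrary-order (indexed) access lets the consumer simply compute the column-major index against data already resident on chip.

  // Illustrative C++ only (not the stream-processor programming model); the
  // 4x4 size and function names are assumed for the example.
  #include <array>
  #include <cstddef>
  #include <cstdio>

  constexpr std::size_t N = 4;
  using Matrix = std::array<float, N * N>;   // row-major: element (i, j) at i*N + j

  // Sequential-stream model: column-major consumption requires first writing a
  // reordered copy out through memory, then streaming it back in order.
  void consume_via_reorder(const Matrix& a) {
      Matrix reordered{};                    // extra memory traffic both ways
      for (std::size_t j = 0; j < N; ++j)
          for (std::size_t i = 0; i < N; ++i)
              reordered[j * N + i] = a[i * N + j];
      for (float v : reordered) std::printf("%g ", v);
      std::printf("\n");
  }

  // Indexed-access model: compute the column-major index directly against the
  // data already resident on chip; no reordered copy is made.
  void consume_via_index(const Matrix& a) {
      for (std::size_t k = 0; k < N * N; ++k) {
          std::size_t i = k % N, j = k / N;  // walk the matrix column by column
          std::printf("%g ", a[i * N + j]);
      }
      std::printf("\n");
  }

  int main() {
      Matrix a{};
      for (std::size_t k = 0; k < N * N; ++k) a[k] = static_cast<float>(k);
      consume_via_reorder(a);                // both calls print 0 4 8 12 1 5 ...
      consume_via_index(a);
  }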
Slide 6: Outline
Stream processing overview
Applications
Implementation
Results
Conclusion
Slide 7: Stream Programming
Streams of records passing through compute kernels
Parallelism
– Across stream elements
– Across kernels
Locality
– Within kernels
– Between kernels
[Figure: FFT_stage kernels reading input streams in1 and in2 and producing an output stream]
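As a concrete, if simplified, illustration of the model, the sketch below expresses a two-stage pipeline in plain C++. It is not the Imagine StreamC/KernelC API; the record type and the body of fft_stage are placeholders chosen only to show where the element-level parallelism and the producer-consumer locality between kernels come from.

  // Illustrative C++ only, not the Imagine StreamC/KernelC API; the record
  // type and the body of fft_stage are placeholders.
  #include <complex>
  #include <cstddef>
  #include <vector>

  using Record = std::complex<float>;
  using Stream = std::vector<Record>;

  // One kernel: combines two input streams record by record. Each record is
  // processed independently, which is the parallelism across stream elements.
  Stream fft_stage(const Stream& in1, const Stream& in2) {
      Stream out;
      out.reserve(in1.size());
      for (std::size_t i = 0; i < in1.size(); ++i)
          out.push_back(in1[i] + in2[i]);    // stand-in for a butterfly step
      return out;
  }

  int main() {
      Stream a(64, Record{1.0f, 0.0f});
      Stream b(64, Record{0.0f, 1.0f});
      Stream stage1 = fft_stage(a, b);       // producer-consumer locality:
      Stream stage2 = fft_stage(stage1, b);  // stage1 is consumed immediately,
      (void)stage2;                          // so it need not go back to memory
  }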
Slide 8: Bandwidth Hierarchy
Stream programming is well matched to the bandwidth hierarchy
[Figure: FFT_stage kernels over time, with stream transfers between memory, the stream register file (SRF), and the compute units]
Slide 9: Stream Processors
Several lanes
– Execute in SIMD
– Operate on records
[Figure: lanes 0 through N-1, each pairing an SRF bank with a compute cluster; the clusters share an inter-cluster network, and the SRF banks reach the memory system through a memory switch]
Slide 10: Outline
Stream processing overview
Applications
Implementation
Results
Conclusion
Slide 11: Stream-Level Data Reuse
Sequential streams only capture in-order reuse
Arbitrary access patterns in the SRF capture more of the available temporal locality
Stream data reuse:
– Sequential (in-order) reuse, e.g. linear streams
– Non-sequential reuse
   – Reordered reuse, e.g. 2-D, 3-D accesses, multi-grid
   – Intra-stream reuse, e.g. irregular neighborhoods, table lookups
Slides 12–13: Reordered Reuse
Indexed SRF access eliminates reordering through memory
[Figure, animated across the two slides: with sequential streams the reordered stream is produced by a Reorder pass through memory; with indexed SRF access the data is read from the SRF directly in the new order]
Slides 14–15: Intra-stream Reuse
Indexed SRF access eliminates
– Replication in the SRF
– Redundant memory transfers
[Figure, animated across the two slides: records such as A, B, and D that are referenced repeatedly by the compute step must be replicated in the SRF and re-fetched from memory/cache under sequential streaming; with indexed access, a single copy of each record in the SRF serves all references]
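A rough sketch of the same contrast in code follows (illustrative C++ only; Node, the reference list, and the helper names are invented for the example): with sequential streams, every reference to a record becomes its own stream element, so repeated records are replicated in the SRF and re-fetched from memory, whereas with indexed access each needed record is loaded once and later references are on-chip index operations.

  // Illustrative C++ only; Node, the reference list, and the helper names are
  // invented for the example.
  #include <cstddef>
  #include <vector>

  struct Node { float value; };

  // Sequential-stream model: every reference becomes its own stream element
  // (A B D B, A D B C, ...), so repeated records are replicated and re-fetched.
  std::vector<Node> gather_stream(const std::vector<Node>& memory,
                                  const std::vector<std::size_t>& references) {
      std::vector<Node> stream;
      stream.reserve(references.size());
      for (std::size_t r : references)
          stream.push_back(memory[r]);       // memory traffic per reference
      return stream;
  }

  // Indexed-SRF model: one copy of each needed record sits in on-chip storage
  // and every reference is an on-chip indexed access.
  float sum_via_index(const std::vector<Node>& srf,
                      const std::vector<std::size_t>& references) {
      float sum = 0.0f;
      for (std::size_t r : references)
          sum += srf[r].value;
      return sum;
  }

  int main() {
      std::vector<Node> memory = {{1.0f}, {2.0f}, {3.0f}, {4.0f}};   // A B C D
      std::vector<std::size_t> refs = {0, 1, 3, 1, 0, 3, 1, 2};      // A B D B A D B C
      std::vector<Node> replicated = gather_stream(memory, refs);    // 8 records fetched
      float s = sum_via_index(memory, refs);                         // 4 records fetched once
      (void)replicated; (void)s;
  }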
Slide 16: Conditional Accesses
Fine-grain conditional accesses
– Expensive in SIMD architectures
– Translate to conditional address computation
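One common way to read "translate to conditional address computation" is the stream-compaction pattern sketched below (illustrative C++, an assumed example rather than the paper's code): every SIMD lane executes the same write, and the data-dependent predicate only decides whether the output address advances, so no divergent branch is needed.

  // Illustrative C++ only (assumed example, not the paper's code).
  #include <cstddef>
  #include <vector>

  // Keep only the elements above a threshold. The inner step is branch-free:
  // every iteration performs the same write, and the predicate only decides
  // whether the output address advances.
  std::size_t conditional_append(const std::vector<float>& in,
                                 std::vector<float>& out, float threshold) {
      out.resize(in.size());
      std::size_t addr = 0;                  // conditionally advanced output address
      for (float v : in) {
          out[addr] = v;                     // unconditional write
          addr += (v > threshold) ? 1 : 0;   // conditional address computation
      }
      out.resize(addr);                      // keep only the accepted records
      return addr;
  }

  int main() {
      std::vector<float> in = {0.5f, 2.0f, 1.5f, 0.1f}, out;
      return static_cast<int>(conditional_append(in, out, 1.0f));   // returns 2
  }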
Slide 17: Outline
Stream processing overview
Applications
Implementation
Results
Conclusion
Slide 18: Base Architecture
Each SRF bank accesses a block of b contiguous words
[Figure: N lanes, each with an SRF bank feeding its compute cluster over a b*W-wide port; the clusters share the inter-cluster network]
Slide 19: Indexed SRF Architecture
Address path from clusters
Lower indexed access bandwidth
[Figure: the base organization with address FIFOs added between each compute cluster and its SRF bank]
Slide 20: Base SRF Bank
Several SRAM sub-arrays
Each access is to one sub-array
[Figure: an SRF bank of four sub-arrays (0–3) with local word-line drivers, attached to its compute cluster]
Slide 21: Indexed SRF Bank
Extra 8:1 mux at the sub-array output
– Allows 4x 1-word accesses
[Figure: the four sub-arrays gain per-sub-array pre-decode and row decoders and an 8:1 output mux]
Slide 22: Cross-lane Indexed SRF
Address switch added
Inter-cluster network used for cross-lane SRF data
[Figure: an SRF address network carries cluster-generated addresses to the address FIFOs of remote SRF banks; the returned data crosses lanes over the inter-cluster network]
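For intuition only, the fragment below sketches how a flat SRF word index might split into a lane (bank) number and a bank-local address, assuming a simple word-interleaved layout across the eight lanes; the actual mapping used in the paper may differ. An access whose owning lane matches the requesting cluster can be served in-lane, while any other access needs the address switch on the way out and the inter-cluster network on the way back.

  // Illustrative C++ only; assumes a simple word-interleaved layout across the
  // eight lanes, which may not match the paper's actual SRF address mapping.
  #include <cstddef>
  #include <cstdio>

  constexpr std::size_t kNumLanes = 8;       // matches the 8-cluster machine

  struct SrfAddress {
      std::size_t lane;                      // which SRF bank owns the word
      std::size_t local;                     // word offset within that bank
  };

  // Split a flat SRF word index into (owning lane, bank-local address).
  SrfAddress decompose(std::size_t word_index) {
      return {word_index % kNumLanes, word_index / kNumLanes};
  }

  // An access is in-lane when the owning bank belongs to the requesting
  // cluster; otherwise the address must cross the address switch and the data
  // must return over the inter-cluster network.
  bool is_in_lane(std::size_t cluster_id, std::size_t word_index) {
      return decompose(word_index).lane == cluster_id;
  }

  int main() {
      std::size_t idx = 42;
      SrfAddress a = decompose(idx);
      std::printf("word %zu -> lane %zu, offset %zu, in-lane for cluster 2: %d\n",
                  idx, a.lane, a.local, static_cast<int>(is_in_lane(2, idx)));
  }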
Slide 23: Overhead - Area
In-lane indexing overheads
– 11% over sequential SRF
– Per-sub-array independent addressing overheads
Cross-lane indexing overheads
– 22% over sequential SRF
– Address switch
1.5% to 3% increase in die area (Imagine processor)
Slide 24: Overhead - Energy
0.1 nJ per indexed SRF access (0.13 µm)
– ~4x a sequential SRF access
– More than an order of magnitude lower than a DRAM access
0.25 nJ per cache access
Each indexed access replaces many SRF and DRAM/cache accesses
Slide 25: Outline
Stream processing overview
Applications
Implementation
Results
Conclusion
Slide 26: Benchmarks
64x64 2D FFT
– 2-D accesses
Rijndael (AES)
– Table lookups
Merge-sort
– Fine-grain conditionals
5x5 convolution filter
– Regular neighborhood
Irregular graph
– Irregular neighborhood access
– Parameterized (IG_SML/DMS/DCS/SCL): Sparse/Dense graph, Memory/Compute-limited, Short/Long strips
Slide 27: Machine Organizations
Base (sequential SRF): compute clusters and SRF banks over the inter-cluster net, with a memory switch to DRAM
Base + Cache: the same organization with a cache added on the path between the memory switch and DRAM
Indexed SRF: the base organization with the SRF address net added
Slide 28: Machine Parameters
Technology: 0.13 µm, 1 GHz (all configurations)
Compute: 8 compute clusters, 32 GFLOPs peak (all configurations)
SRF:
– Base: 128KB, 128GB/s sequential
– Base + cache: 128KB, 128GB/s sequential
– Indexed SRF: 128KB, 128GB/s sequential, plus 128GB/s in-lane and 32GB/s cross-lane indexed
Cache (Base + cache only): 128KB, 16GB/s
DRAM: 9.14GB/s (all configurations)
Slides 29–30: Off-chip Memory Bandwidth
[Charts not reproduced]
Slide 31: Execution Time
[Chart not reproduced]
Slide 32: Outline
Stream processing overview
Applications
Implementation
Results
Conclusion
Slide 33: Conclusions
Data parallelism increasingly important
Current data parallel architectures inefficient for some application classes
– Irregular accesses
Indexed SRF accesses
– Reduce memory traffic
– Reduce SRF data replication
– Efficiently support complex/conditional stream accesses
Performance improvements
– 3% to 410% for target application classes
Low implementation overhead
– 1.5% to 3% die area
Slide 34: Backup Slides
Slide 35: Indexed Access Instruction Overhead
Excludes address issue instructions
[Chart not reproduced]
Slide 36: Kernel C API

  while (!eos(in)) {
    in >> a;
    LUT[a] >> b;
    c = foo(a, b);
    out << c;
  }

The indexed read is 2 separate instructions
– Address issue:  LUT.index << a;
– Data read:      LUT >> b;
– Independent instructions can be scheduled between the two
Address-data separation
– May require loop unrolling, software pipelining, etc.
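To make the last point concrete, here is a sketch of what software-pipelining such a lookup loop looks like in plain C++ (not KernelC; foo, the key-to-index mapping, and the data types are placeholders): the table read for element i+1 is started while the result for element i is consumed, which is how the separated address-issue and data-read slots get filled with useful work.

  // Illustrative C++ only, not KernelC; foo, the key-to-index mapping, and the
  // data types are placeholders. Assumes a non-empty table and non-negative keys.
  #include <cstddef>
  #include <vector>

  static float foo(float a, float b) { return a * b; }   // stand-in kernel math

  // Software-pipelined form of: for each a in 'in', b = LUT[index(a)]; out = foo(a, b).
  // The lookup for element i+1 is issued while the result for element i is
  // consumed, mirroring the address-issue / data-read split on the slide.
  std::vector<float> lookup_kernel(const std::vector<float>& in,
                                   const std::vector<float>& lut) {
      std::vector<float> out;
      if (in.empty()) return out;
      out.reserve(in.size());

      auto index_of = [&](float a) { return static_cast<std::size_t>(a) % lut.size(); };

      float pending = lut[index_of(in[0])];          // prologue: first address issue
      for (std::size_t i = 0; i + 1 < in.size(); ++i) {
          float next = lut[index_of(in[i + 1])];     // issue the lookup for i+1 ...
          out.push_back(foo(in[i], pending));        // ... while using the result for i
          pending = next;
      }
      out.push_back(foo(in.back(), pending));        // epilogue
      return out;
  }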
Slides 37–38: Sensitivity to SRF Access Latency
[Charts not reproduced]
Slide 39: Why Graphics Hardware?
Pentium 4 SSE, theoretical*: 3 GHz x 4 wide x 0.5 inst/cycle = 6 GFLOPS
GeForce FX 5900 (NV35) fragment shader, observed (MULR R0, R0, R0): 20 GFLOPS
– Equivalent to a 10 GHz P4, and getting faster: 3x improvement over NV30 (6 months)
* From the Intel P4 Optimization Manual
[Chart: Pentium 4, NV30, NV35]
(Slide from Ian Buck, Stanford University)
Slide 40: NVIDIA Graphics Growth (225%/yr)
Essentially Moore's Law cubed.

Season  Product     Process  # Trans  Gflops  32-bit AA Fill  Mpolys   Notes
2H97    Riva 128    .35      3M       5       20M             3M       Integrated 2D/3D
1H98    Riva ZX     .25      5M       7       31M             3M       AGP2x
2H98    Riva TNT    .25      7M       10      50M             6M       32-bit
1H99    TNT2        .22      9M       15      75M             9M       AGP4x
2H99    GeForce     .22      23M      25      120M            15M      HW T&L
1H00    GF2 GTS     .18      25M      35      200M (1)        25M      Per-Pixel Shading
2H00    GF2 Ultra   .18      25M      45      250M (1)        31M      230 Mhz DDR
1H01    GeForce3    .15      57M      80      500M (1)        30M (2)  Programmable

(1) Dual textured   (2) Programmable
(Slide from Pat Hanrahan, Kurt Akeley)
Slide 41: NVIDIA Historicals

Season  Product          MT/s  Yr rate  MF/s   Yr rate
2H97    Riva 128         5     -        100    -
1H98    Riva ZX          5     1.0      100    1.0
2H98    Riva TNT         5     1.0      180    3.2
1H99    Riva TNT2        8     1.0      333    3.4
2H99    GeForce          15    3.5      480    2.1
1H00    GeForce2 GTS     25    2.8      666    1.9
2H00    GeForce2 Ultra   31    1.5      1000   2.3
1H01    GeForce3         40    1.7      3200   10.2
1H02    GeForce4         65    1.6      4800   1.5
Overall rate                   1.8             2.4

(Slide from Pat Hanrahan, Kurt Akeley)
Slide 42: Base Architecture (detail)
Stream buffers match SRF bandwidth to compute needs
[Figure: in each of the 8 lanes, a 128b path runs from the SRF bank into the stream buffers and a 32b path from the stream buffers into the compute cluster; the clusters share the inter-cluster network]
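The rate-matching role of the stream buffers can be modeled in a few lines (illustrative C++ only, not the hardware design; the class and method names are invented): the SRF bank side delivers 128-bit blocks of four 32-bit words per access, the compute side consumes one word at a time, and a small FIFO absorbs the difference.

  // Illustrative C++ model only, not the hardware design; the class and method
  // names are invented. The SRF bank delivers 128-bit blocks (four 32-bit
  // words) per access; the compute cluster consumes one 32-bit word at a time.
  #include <cstddef>
  #include <cstdint>
  #include <deque>
  #include <vector>

  constexpr std::size_t kWordsPerBlock = 4;  // 128b bank port / 32b words

  class StreamBuffer {
  public:
      explicit StreamBuffer(const std::vector<std::uint32_t>& srf_bank)
          : bank_(srf_bank) {}

      // Compute-cluster side: pop one word, refilling a whole block when empty.
      std::uint32_t read_word() {
          if (fifo_.empty()) refill();
          if (fifo_.empty()) return 0;       // bank exhausted (simplified)
          std::uint32_t w = fifo_.front();
          fifo_.pop_front();
          return w;
      }

  private:
      // SRF side: one bank access brings in kWordsPerBlock contiguous words.
      void refill() {
          for (std::size_t i = 0; i < kWordsPerBlock && next_ < bank_.size(); ++i)
              fifo_.push_back(bank_[next_++]);
      }

      const std::vector<std::uint32_t>& bank_;
      std::deque<std::uint32_t> fifo_;
      std::size_t next_ = 0;
  };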
Slide 43: Indexed SRF Architecture (detail)
Address path from clusters
Lower indexed access bandwidth
[Figure: the organization of slide 42 with address FIFOs added between each compute cluster and its SRF bank]
Slide 44: Base SRF Bank (detail)
Several SRAM sub-arrays
[Figure: an SRF bank of four sub-arrays (labeled 256 x 128) with local word-line drivers, attached to its compute cluster]
Slide 45: Indexed SRF Bank (detail)
Extra 8:1 mux at the sub-array output
– Allows 4x 1-word accesses
[Figure: the sub-arrays of slide 44 with per-sub-array pre-decode and row decoders and an 8:1 output mux]
Slide 46: Cross-lane Indexed SRF (detail)
Address switch added
Inter-cluster network used for cross-lane SRF data
[Figure: the SRF address network distributes cluster-generated addresses to the address FIFOs of all 8 SRF banks; cross-lane data travels over the inter-cluster network]