VEGAS: Soft Vector Processor with Scratchpad Memory
Christopher Han-Yu Chou, Aaron Severance, Alex D. Brant, Zhiduo Liu, Saurabh Sant, Guy Lemieux
University of British Columbia

Motivation
- Embedded processing on FPGAs: high-performance, computationally intensive workloads
- Soft processors (e.g., Nios, MicroBlaze) are too slow
- How to deliver high performance?
  - Multiprocessor on FPGA
  - Custom hardware accelerators (Verilog RTL)
  - Synthesized accelerators (C-to-FPGA)

Motivation (continued)
- Soft vector processors to the rescue
- Previous work has demonstrated the soft vector processor as a viable option that provides:
  - Scalable performance and area
  - Purely software-based acceleration
  - Decoupled hardware/software development
- Key performance bottlenecks:
  - Memory access latency
  - On-chip data storage efficiency

Contribution
- Key features of the VEGAS architecture:
  - Cacheless scratchpad memory
  - Fracturable ALUs
  - Concurrent memory access via DMA
- Advantages:
  - Eliminates on-chip data replication; also allows a huge number of vectors and long vector lengths
  - More parallel ALUs
  - Fewer memory loads/stores

VEGAS Architecture
- Scalar core: Nios II/f @ 200 MHz
- DMA engine and external DDR2
- Vector core: VEGAS @ 120 MHz
- Concurrent execution, synchronized through FIFOs

Scratchpad Memory in Action
[Diagram: the vector scratchpad memory feeds Vector Lanes 0-3; each operation reads srcA and srcB from the scratchpad and writes Dest back into it.]

Scratchpad Memory in Action (continued)
[Diagram: the srcA and Dest vectors within the scratchpad.]

Scratchpad Advantage
- Performance
  - Huge working set (256 kB and up)
  - Explicitly managed by software
  - Async load/store via concurrent DMA (double-buffering sketch below)
- Efficient data storage
  - Double-clocked memory (a traditional vector register file needs 2x copies)
  - 8b data stays as 8b (traditional RF: 4x copies)
  - No cache (traditional RF: +1 copy)

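The concurrent-DMA bullet above is what enables double buffering: while the vector core works on a tile already in the scratchpad, the DMA engine streams the next tile in from DDR2. The C sketch below only models that overlap; the primitives vegas_dma_read, vegas_dma_wait, and process_tile are hypothetical stand-ins (stubbed out so the sketch compiles), not the real VEGAS API.

    /* Double-buffering sketch for "async load/store via concurrent DMA".
     * The primitives below are hypothetical stand-ins, not the real VEGAS
     * API; only the overlap pattern is the point. */
    #include <stddef.h>
    #include <stdint.h>

    #define TILE_BYTES 4096

    /* No-op stubs so the sketch compiles standalone. On hardware these would
     * start an asynchronous DDR2-to-scratchpad transfer, wait for outstanding
     * DMA, and run a vector kernel on a tile already in the scratchpad. */
    static void vegas_dma_read(void *dst, const void *src, size_t n) { (void)dst; (void)src; (void)n; }
    static void vegas_dma_wait(void) { }
    static void process_tile(void *tile, size_t n) { (void)tile; (void)n; }

    void stream_tiles(uint8_t *scratch, const uint8_t *ddr, size_t n_tiles)
    {
        uint8_t *buf[2] = { scratch, scratch + TILE_BYTES };  /* two scratchpad halves */

        vegas_dma_read(buf[0], ddr, TILE_BYTES);              /* prefetch tile 0 */
        for (size_t t = 0; t < n_tiles; t++) {
            vegas_dma_wait();                                 /* tile t is now resident */
            if (t + 1 < n_tiles)                              /* overlap: fetch tile t+1 ... */
                vegas_dma_read(buf[(t + 1) & 1], ddr + (t + 1) * TILE_BYTES, TILE_BYTES);
            process_tile(buf[t & 1], TILE_BYTES);             /* ... while computing on tile t */
        }
    }
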
Scratchpad Advantage (continued)
- Vectors are accessed through address registers
- Huge number of vectors fit in the scratchpad
  - VEGAS uses only 8 vector address registers (V0..V7)
  - Modify their contents to access different vectors
  - Auto-increment lessens the need to change V0..V7 (see the sketch below)
- Long vector lengths: a single vector can fill the entire scratchpad

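To make the address-register model concrete, here is a plain C sketch (not the real VEGAS ISA): the scratchpad is a flat on-chip array, V0..V7 hold offsets into it, and "auto-increment" simply bumps a register to the next vector, so the same instruction sequence walks through many vectors without any vector loads or stores.

    /* C model of scratchpad vectors selected by address registers.
     * Illustration only; offsets, VL, and the vadd_w helper are invented. */
    #include <stdint.h>
    #include <stdio.h>

    #define SCRATCH_WORDS (256 * 1024 / 4)   /* 256 kB scratchpad, 32b words */
    #define VL            64                 /* current vector length (elements) */

    static int32_t  scratch[SCRATCH_WORDS];  /* the on-chip scratchpad */
    static uint32_t V[8];                    /* the 8 vector address registers */

    /* Element-wise add: operands are just offsets into the scratchpad. */
    static void vadd_w(uint32_t dst, uint32_t srcA, uint32_t srcB)
    {
        for (int i = 0; i < VL; i++)
            scratch[dst + i] = scratch[srcA + i] + scratch[srcB + i];
    }

    int main(void)
    {
        V[0] = 0;        /* matrix A: 16 row vectors starting at offset 0 */
        V[1] = 16 * VL;  /* matrix B */
        V[2] = 32 * VL;  /* destination matrix */

        for (int row = 0; row < 16; row++) {
            vadd_w(V[2], V[0], V[1]);
            V[0] += VL; V[1] += VL; V[2] += VL;  /* "auto-increment" to the next row */
        }
        printf("added 16 row vectors entirely inside the scratchpad\n");
        return 0;
    }
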
Scratchpad Advantage: Median Filter
- Vector address registers are easier than unrolling
- Traditional vector median filter (a C sketch of the in-scratchpad version follows):

    for j = 0..12
      for i = j..24
        V1 = vector[i]           ; vector load
        V2 = vector[j]           ; vector load
        CompareAndSwap(V1, V2)
        vector[j] = V2           ; vector store
        vector[i] = V1           ; vector store

- One vector load and one vector store can be optimized away using a temporary
- Total of 222 loads and 222 stores

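With the scratchpad, the 25 neighborhood vectors stay resident on chip, so the compare-and-swap network runs in place and the per-iteration vector loads and stores disappear. The sketch below models that in plain C; the scratchpad is just an array here and VL is an arbitrary vector length, so this is an illustration of the idea rather than actual VEGAS code. The partial sort leaves the median of each pixel's 5x5 neighborhood in vec[12].

    /* In-place compare-and-swap median filter over 25 vectors that are
     * assumed to live in the scratchpad (modeled as a plain array). */
    #include <stdint.h>

    #define VL 64                      /* one vector = VL pixels processed together */

    static int32_t vec[25][VL];        /* 25 neighborhood vectors, resident on chip */

    /* Element-wise compare-and-swap: afterwards lo[k] <= hi[k] in every lane. */
    static void cas(int32_t *lo, int32_t *hi)
    {
        for (int k = 0; k < VL; k++) {
            int32_t a = lo[k], b = hi[k];
            lo[k] = a < b ? a : b;     /* element-wise min */
            hi[k] = a < b ? b : a;     /* element-wise max */
        }
    }

    /* Partial selection sort: after 13 passes vec[0..12] hold the 13 smallest
     * values per lane, so vec[12] is the median of each 25-element neighborhood. */
    void median25(void)
    {
        for (int j = 0; j <= 12; j++)
            for (int i = j + 1; i < 25; i++)
                cas(vec[j], vec[i]);
    }
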
Scratchpad Advantage: Median Filter
[Figure]

Fracturable ALUs
- Multiplier: built from 4 x 16b multipliers; also performs shifts and rotates
- Adder: built from 4 x 8b adders
(A software illustration of the fracturing idea follows.)

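The fracturing idea can be shown in software with the classic carry-masking trick: a single 32-bit addition behaves as four independent 8-bit additions (or two 16-bit additions) if carries are kept from crossing lane boundaries. This is a conceptual sketch of what the fracturable adder does, not the actual VEGAS datapath; the multiplier is fractured analogously at the DSP-block level.

    /* One 32-bit add acting as four 8-bit (or two 16-bit) lane-wise adds by
     * masking off cross-lane carries. Software illustration only. */
    #include <stdint.h>
    #include <stdio.h>

    /* Add the low 7 bits of each byte lane (cannot carry across lanes),
     * then restore each lane's top bit with an XOR. */
    static uint32_t add_4x8(uint32_t a, uint32_t b)
    {
        uint32_t low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
        return low ^ ((a ^ b) & 0x80808080u);
    }

    /* Same trick with two 16-bit lanes. */
    static uint32_t add_2x16(uint32_t a, uint32_t b)
    {
        uint32_t low = (a & 0x7FFF7FFFu) + (b & 0x7FFF7FFFu);
        return low ^ ((a ^ b) & 0x80008000u);
    }

    int main(void)
    {
        /* Byte lanes wrap independently: FF+01=00, 7F+01=80, 03+01=04, 10+01=11 */
        printf("4 x 8b : %08x\n", add_4x8(0xFF7F0310u, 0x01010101u));   /* 00800411 */
        printf("2 x 16b: %08x\n", add_2x16(0xFFFF0001u, 0x00020003u));  /* 00010004 */
        return 0;
    }
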
Fracturable ALUs Advantage
- Increased processing power: a 4-lane VEGAS performs
  - 4 x 32b operations per cycle
  - 8 x 16b operations per cycle
  - 16 x 8b operations per cycle
- Median filter example:
  - 32b data: 184 cycles/pixel
  - 16b data: 93 cycles/pixel
  - 8b data: 47 cycles/pixel

Area and Frequency

    Lanes    ALMs    DSPs   M9Ks   Fmax (MHz)
    1        3831      8     40      131
    2        4881     12     40      131
    4        6976     20     40      130
    8       11824     36     40      125
    16      19843     68     40      122
    32      36611    132     40      116

ALM Usage
[Chart]

Performance

    Benchmark    NiosII/f   VEGAS V1   VEGAS V32   NiosII/V32 speedup
    fir            509919      85549       4693        108x
    motest        1668869      82515      24717         67x
    median           1388        185          7        208x
    autocor        124338      45027       2822         44x
    conven          48988       3462       1897         25x
    imgblend      1231172     175890      35485         34x
    filt3x3       6556592     813471      75349         87x

Area-Delay Product
- Area x delay: lower is better; its inverse measures "throughput per mm²"
- Compared to earlier soft vector processors, VEGAS offers 2-3x better throughput per unit area (worked example of the metric below)

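As a worked illustration of the metric (with invented numbers, not measurements from this work): area-delay is simply area multiplied by execution time, so halving either factor doubles throughput per unit area.

    /* Area-delay product: lower is better. The numbers are made up purely to
     * illustrate the arithmetic; they are not results from this work. */
    #include <stdio.h>

    static double area_delay(double area_alms, double runtime_s)
    {
        return area_alms * runtime_s;
    }

    int main(void)
    {
        double ad_ref   = area_delay(20000.0, 1.00);  /* hypothetical reference design */
        double ad_other = area_delay(16000.0, 0.50);  /* hypothetical improved design  */
        printf("%.1fx better throughput per unit area\n", ad_ref / ad_other);
        return 0;
    }
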
Integer Matrix Multiply
- 4096 x 4096 integers (64 MB data set)
- Intel Core 2 (65 nm), 2.5 GHz, 16 GB DDR2:
  - Vanilla IJK: 474 s
  - Vanilla KIJ: 134 s
  - Tiled IJK: 93 s
  - Tiled KIJ: 68 s
- VEGAS (65 nm Altera Stratix III):
  - Vector: 44 s (Nios only: 5407 s)
  - 256 kB scratchpad, 32 lanes (about 50% of the chip)
  - 200 MHz Nios, 100 MHz vector core, 1 GB DDR2 SODIMM
(A sketch of the IJK vs. tiled KIJ loop orders follows.)

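To make the loop-order labels concrete, here is a generic C sketch of "vanilla IJK" versus "tiled KIJ" for C = A x B. It illustrates the terminology only; it is ordinary CPU code, not the VEGAS vector implementation, and N and the tile size T are placeholders rather than the experiment's 4096 x 4096 setup.

    /* "Vanilla IJK" vs. "tiled KIJ" integer matrix multiply, C = A * B. */
    #include <stdint.h>

    #define N 512   /* kept small so the sketch runs quickly; the slide used 4096 */
    #define T 64    /* tile size, chosen so a block fits in fast on-chip memory  */

    static int32_t A[N][N], B[N][N], C[N][N];

    /* Vanilla IJK: the innermost loop strides down a column of B (poor locality). */
    void matmul_ijk(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                int32_t acc = 0;
                for (int k = 0; k < N; k++)
                    acc += A[i][k] * B[k][j];
                C[i][j] = acc;
            }
    }

    /* Tiled KIJ: process T x T blocks so each block of B is reused while it is
     * still resident in fast memory (the scratchpad, in VEGAS's case); the
     * innermost j loop is unit-stride and vectorizes naturally. C must start
     * zeroed, which static storage guarantees here. */
    void matmul_kij_tiled(void)
    {
        for (int kk = 0; kk < N; kk += T)
            for (int ii = 0; ii < N; ii += T)
                for (int jj = 0; jj < N; jj += T)
                    for (int k = kk; k < kk + T; k++)
                        for (int i = ii; i < ii + T; i++) {
                            int32_t a = A[i][k];
                            for (int j = jj; j < jj + T; j++)
                                C[i][j] += a * B[k][j];
                        }
    }
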
Conclusions
- Vector processor: purely software-based acceleration
  - No hardware design or RTL recompile needed; just write the program
  - Faster chip design: the vector processor can be built before the software algorithms are finalized
  - Simple programming model
- Maps well to FPGAs (many small memories and multiplier blocks)
- Should map well to ASIC as well

Conclusions (continued)
- Key features
  - Scratchpad memory
    - Enhances performance with fewer loads/stores
    - No on-chip data replication; efficient storage
    - Double-clocked to hide memory latency
  - Fracturable ALUs: operate efficiently on 8b, 16b, and 32b data
  - A single vector core accelerates many applications
- Results
  - 2-3x better area-delay product than VIPERS/VESPA
  - Outperforms an Intel Core 2 at integer matrix multiply

Issues / Future Work
- No floating-point support yet
  - Adding "complex function" support to cover floating-point and similar operations
- Algorithms with only short vectors
  - Split the vector processor into 2, 4, or 8 pieces and run multiple instances of the algorithm
- Multiple vector processors
  - Connect them to work cooperatively
  - Goals: increase throughput, exploit task-level parallelism (i.e., chaining or pipelining)