VEGAS: Soft Vector Processor with Scratchpad Memory
Christopher Han-Yu Chou, Aaron Severance, Alex D. Brant, Zhiduo Liu, Saurabh Sant, Guy Lemieux
University of British Columbia

Motivation
- Embedded processing on FPGAs: high-performance, computationally intensive workloads
- Soft processors (e.g., Nios, MicroBlaze) are too slow
- How to deliver high performance?
  - Multiprocessor on FPGA
  - Custom hardware accelerators (Verilog RTL)
  - Synthesized accelerators (C-to-FPGA)

Motivation (continued)
- Soft vector processors to the rescue
- Previous work has demonstrated the soft vector processor as a viable option that provides:
  - Scalable performance and area
  - Purely software-based acceleration
  - Decoupled hardware/software development
- Key performance bottlenecks:
  - Memory access latency
  - On-chip data storage efficiency

Contribution
- Key features of the VEGAS architecture:
  - Cacheless scratchpad memory
  - Fracturable ALUs
  - Concurrent memory access via DMA
- Advantages:
  - Eliminates on-chip data replication; also allows a huge number of vectors and long vector lengths
  - More parallel ALUs
  - Fewer memory loads/stores

VEGAS Architecture
- Scalar core: Nios II/f @ 200 MHz
- DMA engine and external DDR2
- Vector core: VEGAS @ 120 MHz
- Concurrent execution, synchronized through FIFOs

Scratchpad Memory in Action
[Diagram: the vector scratchpad memory feeds Vector Lanes 0-3; each operation reads srcA and srcB from the scratchpad and writes Dest back into it.]

Scratchpad Memory in Action (continued)
[Diagram: the srcA and Dest vectors within the scratchpad.]

Scratchpad Advantage
- Performance
  - Huge working set (256 kB and up)
  - Explicitly managed by software
  - Async load/store via concurrent DMA (double-buffering sketch below)
- Efficient data storage
  - Double-clocked memory (a traditional vector register file needs 2x copies)
  - 8b data stays as 8b (traditional RF: 4x copies)
  - No cache (traditional RF: +1 copy)

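The concurrent-DMA bullet above is what enables double buffering: while the vector core works on a tile already in the scratchpad, the DMA engine streams the next tile in from DDR2. The C sketch below only models that overlap; the primitives vegas_dma_read, vegas_dma_wait, and process_tile are hypothetical stand-ins (stubbed out so the sketch compiles), not the real VEGAS API.

    /* Double-buffering sketch for "async load/store via concurrent DMA".
     * The primitives below are hypothetical stand-ins, not the real VEGAS
     * API; only the overlap pattern is the point. */
    #include <stddef.h>
    #include <stdint.h>

    #define TILE_BYTES 4096

    /* No-op stubs so the sketch compiles standalone. On hardware these would
     * start an asynchronous DDR2-to-scratchpad transfer, wait for outstanding
     * DMA, and run a vector kernel on a tile already in the scratchpad. */
    static void vegas_dma_read(void *dst, const void *src, size_t n) { (void)dst; (void)src; (void)n; }
    static void vegas_dma_wait(void) { }
    static void process_tile(void *tile, size_t n) { (void)tile; (void)n; }

    void stream_tiles(uint8_t *scratch, const uint8_t *ddr, size_t n_tiles)
    {
        uint8_t *buf[2] = { scratch, scratch + TILE_BYTES };  /* two scratchpad halves */

        vegas_dma_read(buf[0], ddr, TILE_BYTES);              /* prefetch tile 0 */
        for (size_t t = 0; t < n_tiles; t++) {
            vegas_dma_wait();                                 /* tile t is now resident */
            if (t + 1 < n_tiles)                              /* overlap: fetch tile t+1 ... */
                vegas_dma_read(buf[(t + 1) & 1], ddr + (t + 1) * TILE_BYTES, TILE_BYTES);
            process_tile(buf[t & 1], TILE_BYTES);             /* ... while computing on tile t */
        }
    }
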
Scratchpad Advantage (continued)
- Vectors are accessed through address registers
- Huge number of vectors fit in the scratchpad
  - VEGAS uses only 8 vector address registers (V0..V7)
  - Modify their contents to access different vectors
  - Auto-increment lessens the need to change V0..V7 (see the sketch below)
- Long vector lengths: a single vector can fill the entire scratchpad

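To make the address-register model concrete, here is a plain C sketch (not the real VEGAS ISA): the scratchpad is a flat on-chip array, V0..V7 hold offsets into it, and "auto-increment" simply bumps a register to the next vector, so the same instruction sequence walks through many vectors without any vector loads or stores.

    /* C model of scratchpad vectors selected by address registers.
     * Illustration only; offsets, VL, and the vadd_w helper are invented. */
    #include <stdint.h>
    #include <stdio.h>

    #define SCRATCH_WORDS (256 * 1024 / 4)   /* 256 kB scratchpad, 32b words */
    #define VL            64                 /* current vector length (elements) */

    static int32_t  scratch[SCRATCH_WORDS];  /* the on-chip scratchpad */
    static uint32_t V[8];                    /* the 8 vector address registers */

    /* Element-wise add: operands are just offsets into the scratchpad. */
    static void vadd_w(uint32_t dst, uint32_t srcA, uint32_t srcB)
    {
        for (int i = 0; i < VL; i++)
            scratch[dst + i] = scratch[srcA + i] + scratch[srcB + i];
    }

    int main(void)
    {
        V[0] = 0;        /* matrix A: 16 row vectors starting at offset 0 */
        V[1] = 16 * VL;  /* matrix B */
        V[2] = 32 * VL;  /* destination matrix */

        for (int row = 0; row < 16; row++) {
            vadd_w(V[2], V[0], V[1]);
            V[0] += VL; V[1] += VL; V[2] += VL;  /* "auto-increment" to the next row */
        }
        printf("added 16 row vectors entirely inside the scratchpad\n");
        return 0;
    }
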
Scratchpad Advantage: Median Filter
- Vector address registers are easier than unrolling
- Traditional vector median filter (a C sketch of the in-scratchpad version follows):

    for j = 0..12
      for i = j..24
        V1 = vector[i]           ; vector load
        V2 = vector[j]           ; vector load
        CompareAndSwap(V1, V2)
        vector[j] = V2           ; vector store
        vector[i] = V1           ; vector store

- One vector load and one vector store can be optimized away using a temporary
- Total of 222 loads and 222 stores

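With the scratchpad, the 25 neighborhood vectors stay resident on chip, so the compare-and-swap network runs in place and the per-iteration vector loads and stores disappear. The sketch below models that in plain C; the scratchpad is just an array here and VL is an arbitrary vector length, so this is an illustration of the idea rather than actual VEGAS code. The partial sort leaves the median of each pixel's 5x5 neighborhood in vec[12].

    /* In-place compare-and-swap median filter over 25 vectors that are
     * assumed to live in the scratchpad (modeled as a plain array). */
    #include <stdint.h>

    #define VL 64                      /* one vector = VL pixels processed together */

    static int32_t vec[25][VL];        /* 25 neighborhood vectors, resident on chip */

    /* Element-wise compare-and-swap: afterwards lo[k] <= hi[k] in every lane. */
    static void cas(int32_t *lo, int32_t *hi)
    {
        for (int k = 0; k < VL; k++) {
            int32_t a = lo[k], b = hi[k];
            lo[k] = a < b ? a : b;     /* element-wise min */
            hi[k] = a < b ? b : a;     /* element-wise max */
        }
    }

    /* Partial selection sort: after 13 passes vec[0..12] hold the 13 smallest
     * values per lane, so vec[12] is the median of each 25-element neighborhood. */
    void median25(void)
    {
        for (int j = 0; j <= 12; j++)
            for (int i = j + 1; i < 25; i++)
                cas(vec[j], vec[i]);
    }
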
Scratchpad Advantage: Median Filter
[Figure]

Fracturable ALUs
- Multiplier: built from 4 x 16b multipliers; also performs shifts and rotates
- Adder: built from 4 x 8b adders
(A software illustration of the fracturing idea follows.)

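The fracturing idea can be shown in software with the classic carry-masking trick: a single 32-bit addition behaves as four independent 8-bit additions (or two 16-bit additions) if carries are kept from crossing lane boundaries. This is a conceptual sketch of what the fracturable adder does, not the actual VEGAS datapath; the multiplier is fractured analogously at the DSP-block level.

    /* One 32-bit add acting as four 8-bit (or two 16-bit) lane-wise adds by
     * masking off cross-lane carries. Software illustration only. */
    #include <stdint.h>
    #include <stdio.h>

    /* Add the low 7 bits of each byte lane (cannot carry across lanes),
     * then restore each lane's top bit with an XOR. */
    static uint32_t add_4x8(uint32_t a, uint32_t b)
    {
        uint32_t low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
        return low ^ ((a ^ b) & 0x80808080u);
    }

    /* Same trick with two 16-bit lanes. */
    static uint32_t add_2x16(uint32_t a, uint32_t b)
    {
        uint32_t low = (a & 0x7FFF7FFFu) + (b & 0x7FFF7FFFu);
        return low ^ ((a ^ b) & 0x80008000u);
    }

    int main(void)
    {
        /* Byte lanes wrap independently: FF+01=00, 7F+01=80, 03+01=04, 10+01=11 */
        printf("4 x 8b : %08x\n", add_4x8(0xFF7F0310u, 0x01010101u));   /* 00800411 */
        printf("2 x 16b: %08x\n", add_2x16(0xFFFF0001u, 0x00020003u));  /* 00010004 */
        return 0;
    }
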
Fracturable ALUs Advantage
- Increased processing power: a 4-lane VEGAS performs
  - 4 x 32b operations per cycle
  - 8 x 16b operations per cycle
  - 16 x 8b operations per cycle
- Median filter example:
  - 32b data: 184 cycles/pixel
  - 16b data: 93 cycles/pixel
  - 8b data: 47 cycles/pixel

Area and Frequency

    Lanes    ALMs    DSPs   M9Ks   Fmax (MHz)
    1        3831      8     40      131
    2        4881     12     40      131
    4        6976     20     40      130
    8       11824     36     40      125
    16      19843     68     40      122
    32      36611    132     40      116

ALM Usage
[Chart]

Performance

    Benchmark    NiosII/f   VEGAS V1   VEGAS V32   NiosII/V32 speedup
    fir            509919      85549       4693        108x
    motest        1668869      82515      24717         67x
    median           1388        185          7        208x
    autocor        124338      45027       2822         44x
    conven          48988       3462       1897         25x
    imgblend      1231172     175890      35485         34x
    filt3x3       6556592     813471      75349         87x

Area-Delay Product
- Area x delay: lower is better; its inverse measures "throughput per mm²"
- Compared to earlier soft vector processors, VEGAS offers 2-3x better throughput per unit area (worked example of the metric below)

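As a worked illustration of the metric (with invented numbers, not measurements from this work): area-delay is simply area multiplied by execution time, so halving either factor doubles throughput per unit area.

    /* Area-delay product: lower is better. The numbers are made up purely to
     * illustrate the arithmetic; they are not results from this work. */
    #include <stdio.h>

    static double area_delay(double area_alms, double runtime_s)
    {
        return area_alms * runtime_s;
    }

    int main(void)
    {
        double ad_ref   = area_delay(20000.0, 1.00);  /* hypothetical reference design */
        double ad_other = area_delay(16000.0, 0.50);  /* hypothetical improved design  */
        printf("%.1fx better throughput per unit area\n", ad_ref / ad_other);
        return 0;
    }
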
Integer Matrix Multiply
- 4096 x 4096 integers (64 MB data set)
- Intel Core 2 (65 nm), 2.5 GHz, 16 GB DDR2:
  - Vanilla IJK: 474 s
  - Vanilla KIJ: 134 s
  - Tiled IJK: 93 s
  - Tiled KIJ: 68 s
- VEGAS (65 nm Altera Stratix III):
  - Vector: 44 s (Nios only: 5407 s)
  - 256 kB scratchpad, 32 lanes (about 50% of the chip)
  - 200 MHz Nios, 100 MHz vector core, 1 GB DDR2 SODIMM
(A sketch of the IJK vs. tiled KIJ loop orders follows.)

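To make the loop-order labels concrete, here is a generic C sketch of "vanilla IJK" versus "tiled KIJ" for C = A x B. It illustrates the terminology only; it is ordinary CPU code, not the VEGAS vector implementation, and N and the tile size T are placeholders rather than the experiment's 4096 x 4096 setup.

    /* "Vanilla IJK" vs. "tiled KIJ" integer matrix multiply, C = A * B. */
    #include <stdint.h>

    #define N 512   /* kept small so the sketch runs quickly; the slide used 4096 */
    #define T 64    /* tile size, chosen so a block fits in fast on-chip memory  */

    static int32_t A[N][N], B[N][N], C[N][N];

    /* Vanilla IJK: the innermost loop strides down a column of B (poor locality). */
    void matmul_ijk(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                int32_t acc = 0;
                for (int k = 0; k < N; k++)
                    acc += A[i][k] * B[k][j];
                C[i][j] = acc;
            }
    }

    /* Tiled KIJ: process T x T blocks so each block of B is reused while it is
     * still resident in fast memory (the scratchpad, in VEGAS's case); the
     * innermost j loop is unit-stride and vectorizes naturally. C must start
     * zeroed, which static storage guarantees here. */
    void matmul_kij_tiled(void)
    {
        for (int kk = 0; kk < N; kk += T)
            for (int ii = 0; ii < N; ii += T)
                for (int jj = 0; jj < N; jj += T)
                    for (int k = kk; k < kk + T; k++)
                        for (int i = ii; i < ii + T; i++) {
                            int32_t a = A[i][k];
                            for (int j = jj; j < jj + T; j++)
                                C[i][j] += a * B[k][j];
                        }
    }
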
Conclusions
- Vector processor: purely software-based acceleration
  - No hardware design or RTL recompile needed; just write the program
  - Faster chip design: the vector processor can be built before the software algorithms are finalized
  - Simple programming model
- Maps well to FPGAs (many small memories and multiplier blocks)
- Should map well to ASIC as well

Conclusions (continued)
- Key features
  - Scratchpad memory
    - Enhances performance with fewer loads/stores
    - No on-chip data replication; efficient storage
    - Double-clocked to hide memory latency
  - Fracturable ALUs: operate efficiently on 8b, 16b, and 32b data
  - A single vector core accelerates many applications
- Results
  - 2-3x better area-delay product than VIPERS/VESPA
  - Outperforms an Intel Core 2 at integer matrix multiply

Issues / Future Work
- No floating-point support yet
  - Adding "complex function" support to cover floating-point and similar operations
- Algorithms with only short vectors
  - Split the vector processor into 2, 4, or 8 pieces and run multiple instances of the algorithm
- Multiple vector processors
  - Connect them to work cooperatively
  - Goals: increase throughput, exploit task-level parallelism (i.e., chaining or pipelining)