1  Vector Processing as a Soft-core CPU Accelerator

Jason Yu, Guy Lemieux, Chris Eagleston
{jasony, lemieux, ceaglest}@ece.ubc.ca
University of British Columbia
Prepared for FPGA 2008, Altera, and Xilinx, February 26-28, 2008
2  Motivation
- FPGAs for embedded processing: high performance, computationally intensive
- Growing use of embedded processors on FPGA, but Nios/MicroBlaze too slow
- Options for faster performance:
  - Faster Nios/MicroBlaze
  - Multiprocessor-on-FPGA
  - Custom hardware accelerator
  - Synthesized accelerator
3  Problems
- Faster Nios/MicroBlaze is not feasible:
  - A 2- or 4-way superscalar/VLIW register file maps inefficiently to FPGA
  - Superscalar requires complex dependency checking
- Multiprocessor-on-FPGA complexity:
  - Parallel programming and debugging
  - System design: cache coherence, memory consistency
- Custom hardware accelerator cost:
  - Needs a hardware engineer
  - Time-consuming to design and debug
  - One hardware accelerator per function
4  Possible Solutions
- Automatically synthesized hardware accelerators:
  - Change software -> regenerate and recompile RTL
  - Altera C2H, Xilinx CHiMPS, Mitrion Virtual Processor, CriticalBlue Cascade
- Soft vector processor:
  - Change software -> same RTL, just recompile software
  - Purely software-based
  - Decouples hardware and software development teams
5  Advantages of Vector Processing
- Simple programming model
- Short-to-long vector data parallelism: regular, easy to accelerate
- Purely software-based: one hardware accelerator supports many applications
- Scalable performance and area
6  Contributions
- Configurable soft vector processor:
  - Selectable performance/resource tradeoff
  - Area customization
- FPGA-specific enhancements:
  - Partitioned register file
  - Vector reductions using MAC chain
  - Local vector datapath memory
7  Overview of Vector Processing
8  Acceleration with Vector Processing
- Organize data as long vectors: data-level parallelism
- Vector instruction execution: multiple vector lanes (SIMD)
- The SIMD operation is repeated over the length of the vector
- Example: the loop
    for (i = 0; i < NELEM; i++)
        a[i] = b[i] * c[i];
  becomes a single vector instruction: vmult a, b, c
(Figure: source vector registers -> vector lanes -> destination vector register)
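The lane-based execution above can be modeled in plain C: a sketch of how a vmult over VL elements is carried out NLANE elements at a time. NLANE and the function shape are illustrative assumptions, not the actual hardware interface.

```c
#include <assert.h>

#define NLANE 4  /* number of vector lanes (hypothetical configuration) */

/* Software model of "vmult a, b, c": each step applies the same multiply
 * across NLANE lanes, repeating until all vl elements are covered. */
void vmult(int *a, const int *b, const int *c, int vl) {
    for (int i = 0; i < vl; i += NLANE)          /* one SIMD step per group */
        for (int lane = 0; lane < NLANE && i + lane < vl; lane++)
            a[i + lane] = b[i + lane] * c[i + lane];
}
```

The outer loop corresponds to the repeated SIMD operation; the inner loop is what all lanes do in the same cycle.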
9  Compared to CPUs with SIMD Extensions
- Intel SSE2, PowerPC AltiVec, etc.
- Short, fixed-length vectors (e.g., 4 elements)
- Single cycle per instruction
- Many data pack/unpack instructions
(Figure: source SIMD registers -> SIMD unit -> destination SIMD register)
10  Hybrid Vector-SIMD
- Consider the code sequence:
    for (i = 0; i < NELEM; i++) {
        C[i] = A[i] + B[i];
        E[i] = C[i] * D[i];
    }
(Figure: loop-iteration schedules under traditional vector, hybrid vector-SIMD, and SIMD execution)
11  Hybrid Vector-SIMD vs Traditional Vector

For the loop
    for (i = 0; i < NELEM; i++) {
        C[i] = A[i] + B[i];
        E[i] = C[i] * D[i];
    }
- Traditional vector processing: each vector instruction completes over the whole vector before the next begins
- Hybrid vector-SIMD processing: each group of elements flows through both operations before the next group starts
(Figure: execution order of the C and E computations across elements 0-7)
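The scheduling difference can be sketched in C. Both orders compute the same result; the hybrid order pushes each lane-sized group through both operations before moving on. NLANE and the function names are illustrative assumptions, not the processor's actual interface.

```c
#include <assert.h>

#define NLANE 4  /* hypothetical lane count */

/* Traditional vector order: finish each instruction over the whole vector. */
void traditional(int *C, int *E, const int *A, const int *B,
                 const int *D, int n) {
    for (int i = 0; i < n; i++) C[i] = A[i] + B[i];  /* vadd, all elements */
    for (int i = 0; i < n; i++) E[i] = C[i] * D[i];  /* vmult, all elements */
}

/* Hybrid vector-SIMD order: one NLANE-wide group runs through both
 * operations before the next group starts. */
void hybrid(int *C, int *E, const int *A, const int *B,
            const int *D, int n) {
    for (int i = 0; i < n; i += NLANE)
        for (int j = i; j < i + NLANE && j < n; j++) {
            C[j] = A[j] + B[j];
            E[j] = C[j] * D[j];
        }
}
```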
12  Vector ISA Features
- Vector length (VL) register
- Conditional execution via vector flag registers
- Vector addressing modes:
  - Unit stride
  - Constant stride
  - Indexed offset
(Figure: vector merge operation — source registers, flag register, destination register)
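The merge operation in the figure can be modeled in a few lines of C: the destination takes one source where the flag bit is set and the other where it is clear. The function name and argument order are illustrative, not the actual ISA encoding.

```c
#include <assert.h>

/* Software model of a vector merge: dest[i] takes src1[i] where the flag
 * bit is set, src0[i] otherwise; only the first vl elements are touched. */
void vmerge(int *dest, const int *src0, const int *src1,
            const unsigned char *flag, int vl) {
    for (int i = 0; i < vl; i++)
        dest[i] = flag[i] ? src1[i] : src0[i];
}
```

Combined with a vector compare that sets the flag register, this is how conditional execution avoids per-element branches.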
13  Example: Simple 5x5 Median Filtering
- Pseudocode (partial bubble sort — the median is the 13th smallest of 25):
    Load the 25 pixel vectors P[0..24]
    for i = 0 to 12 {
        minimum = P[i]
        for j = i to 24 {
            if (P[j] < minimum)
                swap(minimum, P[j])
        }
    }
- Slide the window over by one pixel after each median
- Repeated over the entire image: many windows, one output pixel each
14  Example: Simple 5x5 Median Filtering (vectorized)
- Same bubble sort, but on vector registers
- 25 rows -> 25 vector registers, "VL" pixels each
- A vector flag register masks execution of the conditional swap
- Produces "VL" median results at once!
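A scalar C model of the per-pixel computation that each vector lane performs: it follows the slides' partial sort, swapping in place so that after pass i, p[i] holds the (i+1)-th smallest value. After 13 passes, p[12] is the median. The function name is ours.

```c
#include <assert.h>

/* Median of a 5x5 window: 13 selection passes of the slides' sort.
 * After the loop, p[0..12] hold the 13 smallest values in order,
 * so p[12] is the median of the 25 pixels. */
int median25(int p[25]) {
    for (int i = 0; i <= 12; i++)
        for (int j = i + 1; j < 25; j++)
            if (p[j] < p[i]) {                 /* swap(minimum, P[j]) */
                int t = p[i]; p[i] = p[j]; p[j] = t;
            }
    return p[12];
}
```

The vector version runs the same control flow once, with each comparison and masked swap operating on "VL" windows in parallel.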
15  Soft Vector Processor Architecture
16  (Figure: system block diagram)
- Nios II core
- Shared instruction memory (scalar and vector instructions)
- Shared scalar/vector memory interface
- Distributed vector register file
- Overlapped scalar/vector execution
- Configurable memory width
- Configurable number of lanes
17  Distributed Vector Register File
(Figure: one vector register, e.g. v0, with its elements distributed across the per-lane register files)
18  (Figure: local vector datapath memory and MAC chain; the chain result returns to vector lane 0)
19  Vector Sum Reduction with MAC
- Sum reduction: R = sum of A[i] * B[i], or R = sum of A[i] (using B[i] = 1)
- Reduces the VL elements of a vector register to a single number
- Two-instruction sequence:
  - vmac: multiply and accumulate into the accumulators
  - vcczacc: compress-copy and zero the accumulators
- Side effect: can only reduce 18-bit inputs
(Figure: accumulate chain)
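The two-instruction reduction can be modeled in C: per-position accumulators stand in for the MAC chain, and the compress step sums and clears them. MACL = 4 and the 16-bit short inputs are illustrative choices (the hardware limit is 18-bit inputs, which C has no native type for).

```c
#include <assert.h>

#define MACL 4  /* accumulators in the MAC chain (hypothetical length) */

/* vmac model: multiply element pairs and accumulate into MACL partial
 * sums, striding across the vector one chain position at a time. */
void vmac(long long acc[MACL], const short *a, const short *b, int vl) {
    for (int i = 0; i < vl; i++)
        acc[i % MACL] += (long long)a[i] * b[i];
}

/* vcczacc model: compress the partial sums into one result and zero
 * the accumulators (the documented side effect). */
long long vcczacc(long long acc[MACL]) {
    long long r = 0;
    for (int i = 0; i < MACL; i++) { r += acc[i]; acc[i] = 0; }
    return r;
}
```

Setting every b[i] = 1 turns the multiply-accumulate into a plain sum reduction, as the slide notes.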
20  Configurable Parameters
- Some configurable features:
  - Number of vector lanes
  - Vector ALU width
  - Vector memory access granularity (8, 16, 32 bits)
  - Local memory size (or none)
- These strongly affect performance and area
21  Partial List of Configurable Parameters

Primary parameters (three soft vector processor configurations):

  Parameter    Description                              Typical       V4    V8    V16M32
  NLane        Number of vector lanes                   4-128         4     8     16
  MVL          Maximum vector length                    16-512        16    32    64
  VPUW         Processor data width (bits)              8, 16, 32     32    32    32
  MemMinWidth  Minimum accessible data width in memory  8, 16, 32     8     8     32

Parameters for optional features:

  MultW        Multiplier width (bits, 0 is off)        0, 8, 16, 32  16    16    16
  MACL         MAC chain length (0 is no MAC)           0, 1, 2, 4    1     2     0
  LMemN        Local memory number of words             0-1024        256   256   0
  LMemShare    Shared local memory address space        On/Off        Off   Off   Off
               within lane
22  Performance Results
23  Benchmarking
- Three sample application kernels:
  - 5x5 median filter
  - Motion estimation (full-search block matching)
  - 128-bit AES encryption (MiBench)
- C code, three versions:
  - Nios II
  - Nios II with inline vector assembly
  - Nios II with C2H accelerator
24  Methodology and Assumptions
- Compile C code with nios2-gcc
- Run time = instructions * cycles-per-instruction / Fmax
- Nios II:
  - Instruction: 1 cycle
  - Memory load: 1 cycle
- Nios II with vectors:
  - Vector instruction: (VL / NLane) cycles
  - Vector load: 2 * (VL / NLane) + 2 cycles
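The cycle model above is simple to encode. These helpers assume the division rounds up when VL is not a multiple of NLane, which the slide leaves implicit; the function names are ours.

```c
#include <assert.h>

/* Cycle model from the methodology slide:
 * vector arithmetic instruction: ceil(VL / NLane) cycles
 * vector load:                   2 * ceil(VL / NLane) + 2 cycles */
int vector_op_cycles(int vl, int nlane) {
    return (vl + nlane - 1) / nlane;           /* ceiling division */
}

int vector_load_cycles(int vl, int nlane) {
    return 2 * ((vl + nlane - 1) / nlane) + 2;
}
```

For example, with VL = 64 on a 16-lane configuration, an arithmetic instruction takes 4 cycles and a load takes 10.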
25  Altera C2H Compiler
- Nios II with C2H accelerator
- Synthesizes a hardware accelerator from a C function
- Each C memory reference becomes a master port to that memory
- Current limitations:
  - No automatic loop unrolling
  - Up to the user to efficiently partition memory
(Figure: memory, arbiter, Avalon fabric)
26  C2H Methodology
- Compile application kernels with the C2H compiler
  - Automatic pipelining and scheduling
- Manually unroll loops
- Manually "vectorize" the C code
- Nios II with C2H accelerator:
  - C2H compiler reports the number of clock cycles
  - Includes memory arbitration overhead
27  C2H Example: AES Encryption Round
- Shift four 32-bit words (by different amounts)
- Four table lookups
- XOR the results, XOR with the key
- Acceleration steps:
  1. Process multiple blocks in parallel (increase array sizes)
  2. Manually create four on-chip memories for the four lookup tables
(Figure: 32-bit word datapath)
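The lookup-and-XOR structure of the round can be sketched generically. Note this is NOT real AES — t0..t3 are placeholder tables and no S-box constants appear — it only shows the dataflow the slide describes, which is why splitting the tables across four on-chip memories lets all four lookups happen in the same cycle.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative T-table style round step: extract one byte from each of
 * the four shifted words, look each up in its own table, and XOR the
 * results together with the round key. Tables are caller-supplied. */
uint32_t round_step(const uint32_t t0[256], const uint32_t t1[256],
                    const uint32_t t2[256], const uint32_t t3[256],
                    uint32_t w0, uint32_t w1, uint32_t w2, uint32_t w3,
                    uint32_t key) {
    return t0[w0 >> 24]
         ^ t1[(w1 >> 16) & 0xff]
         ^ t2[(w2 >> 8)  & 0xff]
         ^ t3[w3 & 0xff]
         ^ key;
}
```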
28  Synthesize system, place and route
29  Resource Utilization
- The biggest Stratix III has 7x more resources
- Note: these vector processors include a large local memory in each vector lane (an optional feature), hence the high M9K utilization. Removing it would save 60% of the M9Ks in V16.
30  Resource Utilization Estimates

                              ALM    DSP Elements  M9K   Fmax (MHz)
  Smallest Stratix III        19000  216           108   -
  Nios II/s                   489    8             4     153
  + C2H Median filtering      825    8             4*    147
  + C2H Motion estimation     977    10            4*    135
  + C2H AES encryption        2480   8             6*    119
  UTIIe                       324    0             3     193
  + V4                        5215   21            32    115
  + V8                        7011   34            53    114
  + V16                       10266  58            95    113

  * C2H results are obtained from compiling to Stratix II; uses M4K memories
31  Results: Clock Cycles
(Figure: clock-cycle comparison chart)
32  Speedup vs Resource Utilization Summary
(Figure: speedup versus area for Nios II/s, the C2H accelerators, and the V16/V32 vector configurations across median filtering, AES encryption, and motion estimation)
33  Summary of Effort
- C2H accelerators:
  1. "Vectorize" code for C2H: 1 day
  2. Extra-effort optimization: 1 day
  3. Place-and-route waiting: 1 hour
  - Each iteration = 1 day + P&R
- Soft vector processor:
  1. Vector algorithm, write vector assembly: 2 days
  2. Revise vector algorithm: 0.5 day
  - Each iteration = 0.5 day + software compile only
34  Lessons from Vector Processor Design
- Register files:
  - 2-read, 1-write memory is very common for CPUs
  - Multiple write ports are needed for wide-issue processing
- Wide, flexible vector memory interface is very costly:
  - Memory crossbars: several multi-bit multiplexers
  - ~1/3 the resources of the soft vector processor (128-bit, byte access)
- Stratix III specific:
  - DSP shift chain can no longer dynamically select its input
  - MAC chain is useful; would like a 32-bit MAC chain
35  Current Progress
- Development toolchain integration:
  - Packaged as an SOPC Builder component
  - No built-in debug core; uses a real Nios II processor to download code onto the system
  - Inline vector assembly in the Nios II IDE
- Future work:
  - Compiler
  - Floating-point
36  Conclusion
- Vector processing maps well to FPGAs: many small memories, DSP blocks
- Simple programming model
- Soft vector processor:
  - Purely software-based acceleration: no hardware design or RTL recompile needed, just program
  - One hardware accelerator supports many applications
- Scalable performance and area:
  - More vector lanes -> more performance for more area
  - Soft-core parameters/features -> area customization
37  Conclusion (continued)
- FPGA-specific enhancements:
  - Partitioned register file reduces resource utilization
  - MAC chain for efficient vector reduction
  - Local vector datapath memory for table lookup operations
- Download the processor now! http://www.ece.ubc.ca/~jasony/