Published by Jeffery Barton. Modified over 9 years ago.
1
Lithographic Aerial Image Simulation with FPGA-Based Hardware Acceleration
Jason Cong and Yi Zou
UCLA Computer Science Department
2
Lithography Simulation (Application)
- Simulation of the optical imaging process
  - Computationally intensive and quite slow for full-chip simulation
3
XtremeData Inc.'s XD1000™ Coprocessor System (Platform)
- Socket-compatible: replaces one Opteron CPU with the XD1000 coprocessor
- The module connects to the CPU's HyperTransport bus and motherboard DIMMs while using the existing power supply and heat sink solution for the CPU
- Dedicated DIMM for the FPGA (not shared with the CPU)
- The coprocessor communicates with the CPU via the HyperTransport link and behaves much like a PCI device
4
Approach: Use of C-to-RTL Tools
- Used two tools in our work
  - CoDeveloper (Impulse C) by Impulse Accelerated Technologies
  - AutoPilot by AutoESL Design Technologies
- Advantages
  - Maintain the design at the C level
  - Shorten the development cycle
- Perform several tuning and refinement steps at the C level
  - Loop interchange, loop unrolling, and loop pipelining
  - Data distribution and memory partitioning
  - Data prefetching / overlapping computation and communication
5
Imaging Equations
[Equation image not preserved; the surviving legend follows. Greek symbols were lost in extraction.]
- I(x,y): image intensity at (x,y)
- k-th kernel, a function of (x,y)
- k-th eigenvector, a function of (x,y)
- (x1,y1), (x2,y2), (x1,y2), (x2,y1): layout corners
- mask transmittance
Pseudo code of the imaging equation:
  Loop over different rectangles
    Loop over kernels
      Loop over pixels
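The equation on this slide did not survive extraction. One form that is consistent with the legend, and with the + + - - corner signs shown on the slide about the inner loop, is the sum-of-kernels expression sketched below. The symbols \lambda_k (weight of the k-th kernel), \psi_k (k-th kernel), \tau_n (mask transmittance of rectangle n), and the counts K and N are assumptions introduced for illustration; they are not recovered from the slide.

```latex
I(x,y) \;=\; \sum_{k=1}^{K} \lambda_k
\left|\, \sum_{n=1}^{N} \tau_n \Big[
    \psi_k\big(x - x_1^{(n)},\, y - y_1^{(n)}\big)
  + \psi_k\big(x - x_2^{(n)},\, y - y_2^{(n)}\big)
  - \psi_k\big(x - x_1^{(n)},\, y - y_2^{(n)}\big)
  - \psi_k\big(x - x_2^{(n)},\, y - y_1^{(n)}\big)
\Big] \right|^2
```

Read this way, each rectangle contributes four shifted copies of the same kernel, one per layout corner, which is what makes the later loop restructuring possible.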
6
Loop Interchange
Before:
  Loop over pixels
    Loop over kernels
      Loop over layout corners
After:
  Loop over kernels
    Loop over layout corners
      Loop over pixels
- Different kernels have little correlation, so the kernel loop is placed outermost
- Fixing one specific layout corner and looping over pixels gives more regular data access
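The two loop orders can be compared with a small C sketch. It assumes a squared sum-of-kernels form for the image and uses toy sizes; the names `Corner`, `shifted`, `image_original`, and `image_interchanged` are hypothetical and do not come from the authors' code.

```c
#include <assert.h>

enum { NK = 2, NC = 4, W = 4, H = 4, KW = 2 * W, KH = 2 * H };  /* toy sizes */

typedef struct { int x, y, sign; } Corner;  /* sign is +1 or -1 */

/* Kernel sampled at pixel (x, y) shifted by corner c. The +W / +H offset
   keeps the index non-negative, since x - c.x lies in (-W, W). */
static double shifted(double kern[KH][KW], Corner c, int x, int y) {
    assert(c.x >= 0 && c.x < W && c.y >= 0 && c.y < H);
    return c.sign * kern[y - c.y + H][x - c.x + W];
}

/* Original order: pixels -> kernels -> layout corners. */
void image_original(double kern[NK][KH][KW], double weight[NK],
                    Corner corner[NC], double img[H][W]) {
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x) {
            img[y][x] = 0.0;
            for (int k = 0; k < NK; ++k) {
                double s = 0.0;
                for (int n = 0; n < NC; ++n)
                    s += shifted(kern[k], corner[n], x, y);
                img[y][x] += weight[k] * s * s;
            }
        }
}

/* Interchanged order: kernels -> layout corners -> pixels. A per-kernel
   partial-sum image is needed because the square is taken after the
   corner loop has finished. */
void image_interchanged(double kern[NK][KH][KW], double weight[NK],
                        Corner corner[NC], double img[H][W]) {
    double partial[H][W];
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x)
            img[y][x] = 0.0;
    for (int k = 0; k < NK; ++k) {
        for (int y = 0; y < H; ++y)
            for (int x = 0; x < W; ++x)
                partial[y][x] = 0.0;
        for (int n = 0; n < NC; ++n)         /* fix one layout corner ... */
            for (int y = 0; y < H; ++y)      /* ... then sweep all pixels */
                for (int x = 0; x < W; ++x)
                    partial[y][x] += shifted(kern[k], corner[n], x, y);
        for (int y = 0; y < H; ++y)
            for (int x = 0; x < W; ++x)
                img[y][x] += weight[k] * partial[y][x] * partial[y][x];
    }
}
```

In the interchanged version the innermost loop reads a contiguous, shifted window of one kernel, which is the regular access pattern the slide points to.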
7
Interpretation of Inner Loop after Loop Interchange
- Imaging equation: [equation image not preserved]
- The loop over different layout corners and pixels: [equation image not preserved]
- The partial image computed by the inner sum is a weighted sum of shifted kernels; the amount of each shift is determined by the layout corners
[Figure: the kernel array is shifted to each of the four layout corners of the object (one rectangle) and accumulated into the image (partial sum) with signs +, +, -, -]
8
Loop Unrolling
- Loop unrolling is one option for expressing parallelism in these tools
- The improvement from loop unrolling is limited by port conflicts
  - Accesses to the same array cannot be scheduled in the same cycle because of port conflicts
  - May increase the initiation interval when loop pipelining and loop unrolling are used together
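A hand-unrolled loop makes the port conflict visible. The sketch below is plain C, not tool output; `sum_rolled` and `sum_unrolled` are hypothetical names, and the dual-port RAM mentioned in the comment is an assumption about a typical on-chip memory.

```c
#include <assert.h>

enum { N = 16, UNROLL = 4 };

/* Rolled loop: one read of kernel[] per iteration. */
int sum_rolled(const int kernel[N]) {
    int sum = 0;
    for (int i = 0; i < N; ++i)
        sum += kernel[i];
    return sum;
}

/* Unrolled by a factor of 4. All four reads in the body target the same
   array, so with (for example) a dual-port RAM only two of them can be
   issued per cycle. That is the port conflict that limits the gain. */
int sum_unrolled(const int kernel[N]) {
    assert(N % UNROLL == 0);
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < N; i += UNROLL) {
        s0 += kernel[i];
        s1 += kernel[i + 1];
        s2 += kernel[i + 2];
        s3 += kernel[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```

Both functions return the same value; only the number of same-array reads per iteration differs.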
9
Further Parallelization Needs Memory Partitioning
- Unrolling did not solve the problem completely because of port conflicts
- Need a multi-port (on-chip) memory with a large number of ports
  - Implement the multi-port memory via memory partitioning
- Computing tasks can be done in parallel once multiple data items can be fetched in parallel
  - Each PE is responsible for computing one partition of the image
  - Each PE is composed of one partition of the kernel and one partition of the image partial sum
  - Multiplexing logic gets the data from different partitions of the kernel and provides the data for each PE
  - To compute one partition of the image, a PE might also need the kernel data in other partitions
[Figure: 4-PE example. Each PE pairs a computing element with one kernel partition (1 to 4) and one image partial sum partition (1 to 4); multiplexing logic sits between the kernel partitions and the computing elements]
10
Choosing Partitioning Schemes
- A less optimal partitioning design (a 2 x 2 example is shown here)
  - Block scheduling avoids data access contention (at any time, each PE accesses a different kernel partition)
  - Might face a load balancing problem if the required kernel data lie mostly in some partitions
  - Computing tasks are partitioned into blocks/stages

Schedule (each entry is the kernel partition used in that stage; time advances from stage 1 to stage 4):

  PE     Computes            Stage 1   Stage 2   Stage 3   Stage 4
  PE 1   Image Partition 1   1         2         3         4
  PE 2   Image Partition 2   2         3         4         1
  PE 3   Image Partition 3   3         4         1         2
  PE 4   Image Partition 4   4         1         2         3
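The block schedule on this slide is a rotation: in each stage every PE moves on to the next kernel partition. A minimal C sketch, using 0-based indices and the hypothetical helper names `kernel_partition_for` and `stage_is_conflict_free`:

```c
#include <assert.h>

enum { NUM_PE = 4 };

/* Kernel partition (0-based) that PE `pe` reads during stage `stage`. */
int kernel_partition_for(int pe, int stage) {
    assert(pe >= 0 && pe < NUM_PE && stage >= 0 && stage < NUM_PE);
    return (pe + stage) % NUM_PE;
}

/* Returns 1 if no two PEs read the same kernel partition in `stage`. */
int stage_is_conflict_free(int stage) {
    int used[NUM_PE] = {0};
    for (int pe = 0; pe < NUM_PE; ++pe) {
        int p = kernel_partition_for(pe, stage);
        if (used[p])
            return 0;
        used[p] = 1;
    }
    return 1;
}
```

Because the mapping is a rotation, every stage is conflict-free by construction. The load-balancing concern remains: a stage takes as long as its busiest PE, so a kernel partition holding most of the needed data slows every stage it appears in.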
11
Choosing Partitioning Schemes (Cont.)
- Data partitioning for load balancing
  - Memory banking using the lower address bits
[Figure: the kernel array and the image partial sum array, each interleaved across partitions 1 to 4; different colors denote different partitions]
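Banking on the lower address bits can be sketched as follows for a 2 x 2 partitioning. The `BankedAddr` type and `to_banked` function are hypothetical names; the slide only states that the lower bits select the bank.

```c
#include <assert.h>

enum { LOG2_P = 1, P = 1 << LOG2_P };  /* 2 x 2 partitioning */

typedef struct { int bank, row, col; } BankedAddr;

/* Map a 2D array index to (bank, address inside the bank). The low bits of
   x and y pick the bank; the remaining high bits address within the bank. */
BankedAddr to_banked(int x, int y) {
    assert(x >= 0 && y >= 0);
    BankedAddr a;
    a.bank = (y & (P - 1)) * P + (x & (P - 1));
    a.row = y >> LOG2_P;
    a.col = x >> LOG2_P;
    return a;
}
```

With this interleaving, any P x P window of neighbouring elements touches each bank exactly once, so the work spreads evenly over the partitions regardless of where in the array the required data lies.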
12
Address Generation and Data Multiplexing
- Address generation logic is needed to provide the addresses for the kernel data and the image partial sum, because the memory is partitioned
- Data multiplexing logic is needed to deliver the data from multiple memory blocks to the correct place
  - Implemented as 2D ring-based shifting (better than a naïve mux for larger partitionings)

Example (registers arranged as Reg_1 Reg_2 / Reg_3 Reg_4):
  Start from: Reg_1=array_a[..] Reg_2=array_b[..] Reg_3=array_c[..] Reg_4=array_d[..]
  Wanted:     Reg_1=array_c[..] Reg_2=array_d[..] Reg_3=array_a[..] Reg_4=array_b[..]
  Shift 1 step in the Y direction
  Shift 0 steps in the X direction
[Figure: configurations 1 to 4 showing how partitions a, b, c, d line up with registers 1 to 4]
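The register example on this slide can be reproduced with a small ring-shift sketch. The function names and the shift direction (each row takes the row below it) are assumptions chosen to match the Start/Wanted pair; the hardware details are not given in the slides.

```c
#include <assert.h>

enum { P = 2 };  /* 2 x 2 register grid, as in the slide's example */

/* One ring step in Y: each row takes the contents of the row below it,
   and the last row takes the old first row. */
static void ring_step_y(int reg[P][P]) {
    int first[P];
    for (int x = 0; x < P; ++x)
        first[x] = reg[0][x];
    for (int y = 0; y + 1 < P; ++y)
        for (int x = 0; x < P; ++x)
            reg[y][x] = reg[y + 1][x];
    for (int x = 0; x < P; ++x)
        reg[P - 1][x] = first[x];
}

/* One ring step in X: each column takes the column to its right. */
static void ring_step_x(int reg[P][P]) {
    for (int y = 0; y < P; ++y) {
        int first = reg[y][0];
        for (int x = 0; x + 1 < P; ++x)
            reg[y][x] = reg[y][x + 1];
        reg[y][P - 1] = first;
    }
}

/* Align the data read from the banks with the registers that need it. */
void ring_align(int reg[P][P], int steps_x, int steps_y) {
    assert(steps_x >= 0 && steps_x < P && steps_y >= 0 && steps_y < P);
    for (int s = 0; s < steps_x; ++s)
        ring_step_x(reg);
    for (int s = 0; s < steps_y; ++s)
        ring_step_y(reg);
}
```

Starting from a, b / c, d, calling `ring_align(reg, 0, 1)` yields c, d / a, b, matching the slide. Each register only needs a connection to its neighbour, whereas a full mux needs a P*P-to-1 selector per register; that is one plausible reason the ring scales better for larger partitionings.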
13
Loop Pipelining and Loop Unrolling
- Loop pipelining can still be applied to the code after memory partitioning
  - Can speed up the code by a factor of 10X
- Loop unrolling can be used to compact the code via a multi-dimensional array
  - One way to represent the memory partitioning

Before partitioning:
  kernel[size];
  Loop body with unrolling pragma and pipelining pragma {
    …. += kernel[…] …   // computation
  }

After partitioning:
  kernel[4][4][size/16];
  Loop body with unrolling pragma and pipelining pragma {
    …. += kernel[i][j][…] …   // if some indices are constant
  }
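The multi-dimensional array idea can be sketched in plain C. Pragma spellings differ between tools, so they appear only as comments here; `partition`, `sum_partitioned`, and the element-to-bank mapping are hypothetical choices for illustration.

```c
#include <assert.h>

enum { P = 4, SIZE = 64, DEPTH = SIZE / (P * P) };

/* Distribute a flat array over P x P banks: element a goes to bank
   (i, j) = ((a / P) % P, a % P) at depth a / (P * P). */
void partition(const int flat[SIZE], int kernel[P][P][DEPTH]) {
    assert(SIZE % (P * P) == 0);
    for (int a = 0; a < SIZE; ++a)
        kernel[(a / P) % P][a % P][a / (P * P)] = flat[a];
}

/* Sum all entries using the partitioned layout. Once the i and j loops are
   fully unrolled, i and j are constants in every copy of the body, so each
   copy reads its own bank and the 16 reads no longer share a port. */
long sum_partitioned(int kernel[P][P][DEPTH]) {
    long acc[P][P] = {{0}};
    for (int t = 0; t < DEPTH; ++t)        /* pipelining pragma would go here */
        for (int i = 0; i < P; ++i)        /* unrolling pragma would go here */
            for (int j = 0; j < P; ++j)    /* unrolling pragma would go here */
                acc[i][j] += kernel[i][j][t];
    long total = 0;
    for (int i = 0; i < P; ++i)
        for (int j = 0; j < P; ++j)
            total += acc[i][j];
    return total;
}
```

Writing the banks as the leading array dimensions keeps the source compact: one loop body describes all 16 processing elements instead of 16 hand-copied bodies.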
14
Overlapping Computation and Communication
- Use ping-pong buffers at input and output
- Two ways of implementation
  - Function / block pipelining (AutoPilot) or inter-process communication (Impulse C)
[Figure: timeline comparing the sequential flow (reading input data, computation, writing output data, repeated per block) with the overlapped flow, in which stages DI1, DI2, Comp, DO2, and DO1 of successive blocks run concurrently across software (SW) and hardware (HW)]
- DI1: transferring input from software to SRAM
- DI2: transferring input from SRAM to FPGA
- DO2: transferring output from FPGA to SRAM
- DO1: transferring output from SRAM to software
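The buffer alternation behind ping-pong buffering can be modelled in sequential C. This is only a sketch: `compute` is a stand-in computation and `run_ping_pong` is a hypothetical name; real overlap comes from the hardware running the three steps concurrently, which plain C cannot express.

```c
#include <assert.h>
#include <string.h>

enum { BLOCK = 4 };

/* Stand-in computation (hypothetical): doubles every element. */
static void compute(const int in[BLOCK], int out[BLOCK]) {
    for (int i = 0; i < BLOCK; ++i)
        out[i] = 2 * in[i];
}

/* Process nblocks blocks of BLOCK ints. In step b the three actions touch
   different buffers, so hardware could run them at the same time:
   load block b+1, compute block b, and write back block b-1. */
void run_ping_pong(const int *src, int *dst, int nblocks) {
    assert(nblocks > 0);
    int in_buf[2][BLOCK], out_buf[2][BLOCK];
    memcpy(in_buf[0], src, sizeof in_buf[0]);            /* prologue: block 0 */
    for (int b = 0; b < nblocks; ++b) {
        int cur = b & 1, nxt = cur ^ 1;
        if (b + 1 < nblocks)                             /* prefetch next input */
            memcpy(in_buf[nxt], src + (b + 1) * BLOCK, sizeof in_buf[nxt]);
        compute(in_buf[cur], out_buf[cur]);              /* work on current block */
        if (b > 0)                                       /* drain previous output */
            memcpy(dst + (b - 1) * BLOCK, out_buf[nxt], sizeof out_buf[nxt]);
    }
    memcpy(dst + (nblocks - 1) * BLOCK,                  /* epilogue: last output */
           out_buf[(nblocks - 1) & 1], sizeof out_buf[0]);
}
```

Two buffers per direction are enough because a buffer is reused only after the stage that last wrote it has been consumed.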
15
Implementation Flow
- Original code has a nested loop
- Loop interchange (manual code refinement)
- Multi-PE implementation: add memory partitioning, address generation, and data multiplexing logic (manual code refinement)
- Enable loop pipelining for the refined code by specifying pragmas
- Use Impulse C and AutoPilot to compile the refined code
- Use the vendor tool to compile the RTL to a bitstream
- Run the program on the target system
16
Experiment Results
- 15X speedup using a 5 by 5 partitioning over an Opteron at 2.2 GHz with 4 GB of RAM
- Logic utilization is around 25K ALUTs (8K of which are used in the interface framework rather than in the design)
- Power consumption is less than 15 W in the FPGA, compared with 86 W in the Opteron 248
- Close to 100X (5.8 x 15) improvement in energy efficiency
  - Assuming similar performance
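The energy figure follows from energy = power x time. A short worked version, where the symbols E, P, and t are introduced here for illustration and the 5.8 power ratio is taken from the slide:

```latex
\frac{E_{\mathrm{CPU}}}{E_{\mathrm{FPGA}}}
  = \frac{P_{\mathrm{CPU}}\, t_{\mathrm{CPU}}}{P_{\mathrm{FPGA}}\, t_{\mathrm{FPGA}}}
  = \underbrace{\frac{P_{\mathrm{CPU}}}{P_{\mathrm{FPGA}}}}_{\approx 5.8}
    \times
    \underbrace{\frac{t_{\mathrm{CPU}}}{t_{\mathrm{FPGA}}}}_{15}
  \approx 87
```

A ratio of 5.8 against 86 W corresponds to an FPGA power of roughly 86 / 5.8, or about 14.8 W, which is consistent with the "less than 15 W" figure. The product of about 87 is what the slide rounds to "close to 100X".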
17
Experience with the Two Commercial Tools
- Impulse C
  - Strong platform customization support
  - Hardware/software co-design
  - Smaller subset of C
- AutoPilot
  - Support for C, C++, and SystemC
  - Larger synthesizable subset
  - Platform customization
18
Discussions
- The performance without the different optimizations
  - Roughly 2~3X worse if we do not do memory partitioning
- Polygon-based versus image-based approach
  - Image-based is 2D FFT
  - Which one is faster depends on the actual layout
- Implementation on GPU
  - The nested loop itself is already data parallel
  - G80 has very fast shared memory for thread blocks, but the size is only 16 KB
  - We had to put the kernel array in texture memory with caching
19
Acknowledgments
- Financial support from
  - GRC
  - GSRC (FCRP)
  - NSF
- Industrial support and collaboration from
  - Altera-AMD-SUN-XDI consortium
  - Altera, Magma, and Xilinx under the UC MICRO program
- Valuable discussion and comments from
  - Alfred Wong (Magma)
  - Zhiru Zhang (AutoESL)
20
Q/A