Lithographic Aerial Image Simulation with FPGA-Based Hardware Acceleration
Jason Cong and Yi Zou, UCLA Computer Science Department

2 Lithography Simulation (Application)
- Simulation of the optical imaging process
  - Computationally intensive and quite slow for full-chip simulation

3 XtremeData Inc.'s XD1000™ Coprocessor System (Platform)
- Socket-compatible: replaces one Opteron CPU with the XD1000 coprocessor
- The module connects to the CPU's HyperTransport bus and the motherboard DIMMs while reusing the existing power supply and heat sink solution for the CPU
- Dedicated DIMM for the FPGA (not shared with the CPU)
- The coprocessor communicates with the CPU via the HyperTransport link and behaves much like a PCI device

4 Approach: Use of C-to-RTL Tools
- We used two tools in our work
  - CoDeveloper (Impulse C) by Impulse Accelerated Technologies
  - AutoPilot by AutoESL Design Technologies
- Advantages
  - Maintain the design at the C level
  - Shorten the development cycle
- Several tuning and refinement steps are performed at the C level
  - Loop interchange, loop unrolling, and loop pipelining
  - Data distribution and memory partitioning
  - Data prefetching / overlapping computation and communication

5 Imaging Equations
Notation:
- I(x,y): image intensity at (x,y)
- φ_k(x,y): the kth kernel (the kth eigenvector of the optical system)
- (x1,y1), (x2,y2), (x1,y2), (x2,y1): layout corners of a rectangle
- Ω: mask transmittance
Pseudocode of the imaging equation (see the reconstructed formula below):
  Loop over rectangles
    Loop over kernels
      Loop over pixels
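For reference, a hedged reconstruction of the imaging equation in the standard SOCS (sum of coherent systems) form that this loop nest evaluates; the weights λ_k and the exact decomposition are assumptions, since the slide's formula did not survive transcription:

    I(x,y) \approx \sum_{k=1}^{K} \lambda_k \, \bigl| (\Omega \otimes \phi_k)(x,y) \bigr|^{2}

Here ⊗ is 2D convolution; the corner decomposition of the layout rectangles turns each convolution into a sum of shifted kernel evaluations, one term per corner.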

6 Loop Interchange
Before interchange:            After interchange:
  Loop over pixels               Loop over kernels
  Loop over kernels              Loop over layout corners
  Loop over layout corners       Loop over pixels
- Different kernels do not have much correlation, so the kernel loop is moved to the outermost position
- Fixing one specific layout corner and then looping over the pixels gives more regular data access (see the sketch below)
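A minimal C sketch of the interchange; the array names (image, kern, w), the index helper ofs(), and the single multiply-accumulate body are illustrative stand-ins for the real imaging computation:

    #define NPIX 1024   /* pixels         */
    #define NKER 16     /* kernels        */
    #define NCOR 4      /* layout corners */

    static int ofs(int p, int c) { return p + c; }  /* placeholder corner shift */

    /* Before: pixel loop outermost, so kernel data is revisited per pixel. */
    void before(float image[NPIX], const float kern[NKER][NPIX + NCOR], const float w[NCOR]) {
        for (int p = 0; p < NPIX; p++)
            for (int k = 0; k < NKER; k++)
                for (int c = 0; c < NCOR; c++)
                    image[p] += w[c] * kern[k][ofs(p, c)];
    }

    /* After: kernels outermost (little correlation between kernels); for a
       fixed corner, the inner pixel loop sweeps kern[k] with regular stride. */
    void after(float image[NPIX], const float kern[NKER][NPIX + NCOR], const float w[NCOR]) {
        for (int k = 0; k < NKER; k++)
            for (int c = 0; c < NCOR; c++)
                for (int p = 0; p < NPIX; p++)
                    image[p] += w[c] * kern[k][ofs(p, c)];
    }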

7 Interpretation of the Inner Loop after Loop Interchange
[Figure: one rectangle object with its layout corners, the kernel array, and the image partial sum]
- Imaging equation: the two inner loops run over the layout corners and the pixels
- The partial image computed by the inner sum is a weighted sum of shifted copies of the kernel, where the shift amount is determined by the layout corners

8 Loop Unrolling
- Loop unrolling is one way to express parallelism in these tools
- The improvement from loop unrolling is limited by port conflicts (see the sketch below)
  - Accesses to the same array cannot be scheduled in the same cycle because the memory has only a limited number of ports
  - Unrolling may also increase the initiation interval when loop pipelining and loop unrolling are used together
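A sketch of the port-conflict limit; the pragma spelling is tool-specific (Impulse C and AutoPilot each use their own syntax) and is shown here only as a comment:

    #define N 4096
    float kernel[N];

    float unrolled_sum(void) {
        float acc = 0.0f;
        /* #pragma unroll factor=4 -- tool-specific spelling */
        for (int p = 0; p < N; p++)
            acc += kernel[p];  /* after 4x unrolling, four reads of kernel[]
                                  compete for its (typically two) BRAM ports,
                                  so the scheduler serializes them */
        return acc;
    }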

9 Further Parallelization Needs Memory Partitioning
- Unrolling did not solve the problem completely because of the port conflicts
- We need a multi-port (on-chip) memory with a large number of ports
  - Implement the multi-port memory via memory partitioning (see the sketch below)
- Computing tasks can be done in parallel once we can fetch the multiple data items in parallel
  - Each PE is responsible for computing one partition of the image
  - Each PE is composed of one partition of the kernel and one partition of the image partial sum
  - Multiplexing logic fetches the data from the different kernel partitions and provides the data to each PE
  - To compute one partition of the image, a PE might also need kernel data that lives in other partitions
[Figure: 4-PE example - four kernel partitions and four image partial-sum partitions, one pair per computing element, connected through the multiplexing logic]
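A minimal sketch of 4-way partitioning, assuming the names NPE, kernel_part, and image_part; each partition maps to its own on-chip memory, so the four accumulations below can be scheduled in the same cycle:

    #define NPE  4
    #define PART 1024
    float kernel_part[NPE][PART];  /* one kernel partition per PE            */
    float image_part[NPE][PART];   /* one image partial-sum partition per PE */

    void pe_step(int idx, float w) {
        for (int pe = 0; pe < NPE; pe++)   /* fully unrolled in hardware */
            image_part[pe][idx] += w * kernel_part[pe][idx];
    }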

10 Choosing Partitioning Schemes
- A less optimal partitioning design (a 2 x 2 example, i.e. four partitions, is shown below)
  - Block scheduling avoids data access contention (at any time each PE accesses a different kernel partition)
  - Might face a load-balancing problem if the required kernel data lie mostly in a few partitions
  - The computing task is partitioned into blocks/stages
Schedule (time runs left to right; Kn = kernel partition n, In = image partition n):
  PE 1:  K1→I1   K2→I1   K3→I1   K4→I1
  PE 2:  K2→I2   K3→I2   K4→I2   K1→I2
  PE 3:  K3→I3   K4→I3   K1→I3   K2→I3
  PE 4:  K4→I4   K1→I4   K2→I4   K3→I4
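The rotating schedule above can be expressed compactly: at stage s, PE p reads kernel partition (p + s) mod NPE while always updating image partition p. A sketch with the accumulation body simplified to a single addition:

    #define NPE 4

    void block_schedule(float image[NPE], const float kernel[NPE]) {
        for (int s = 0; s < NPE; s++)          /* stages, sequential          */
            for (int pe = 0; pe < NPE; pe++)   /* PEs, concurrent in hardware */
                image[pe] += kernel[(pe + s) % NPE];
    }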

11 Choosing Partitioning Schemes (Cont.)
- Data partitioning for load balancing
  - In the figure, different colors denote different partitions
  - Memory banking using the lower address bits (see the sketch below)
[Figure: kernel array and image partial-sum array, each interleaved across partitions 1-4]
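A sketch of lower-bit banking with four banks; consecutive logical addresses are spread round-robin across the partitions, which balances the load when accesses cluster spatially (the names are illustrative):

    #define NBANK 4  /* must be a power of two for the mask below */

    static inline unsigned bank_of(unsigned addr)   { return addr & (NBANK - 1); } /* low 2 bits */
    static inline unsigned offset_of(unsigned addr) { return addr / NBANK; }       /* upper bits */
    /* logical element kernel[a] is stored at kernel_part[bank_of(a)][offset_of(a)] */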

12 Address Generation and Data Multiplexing
- Address generation logic is needed to provide the addresses for the kernel data and the image partial sums once the memory is partitioned
- Data multiplexing logic is needed to deliver the data from the multiple memory blocks to the correct place
  - Implemented as 2D ring-based shifting, which scales better than a naive mux for larger partitionings (see the sketch below)
Example (2 x 2): starting from
  Reg_1 = array_a[..]   Reg_2 = array_b[..]
  Reg_3 = array_c[..]   Reg_4 = array_d[..]
the wanted assignment
  Reg_1 = array_c[..]   Reg_2 = array_d[..]
  Reg_3 = array_a[..]   Reg_4 = array_b[..]
is one of the four routing configurations: shift 1 step in the Y direction and 0 steps in the X direction.
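A behavioral sketch of the 2D ring shift on a 2 x 2 register grid; dx and dy select the routing configuration, and in hardware each step is a fixed nearest-neighbor connection, so no global multiplexer is needed:

    /* Shift the 2x2 grid by (dx, dy) with wrap-around. For the example
       above, ring_shift(reg, 0, 1) moves row {a,b} below row {c,d}. */
    void ring_shift(float reg[2][2], int dx, int dy) {
        float tmp[2][2];
        for (int y = 0; y < 2; y++)
            for (int x = 0; x < 2; x++)
                tmp[(y + dy) % 2][(x + dx) % 2] = reg[y][x];
        for (int y = 0; y < 2; y++)
            for (int x = 0; x < 2; x++)
                reg[y][x] = tmp[y][x];
    }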

13 Loop Pipelining and Loop Unrolling
- Loop pipelining can still be applied to the code after memory partitioning
  - Can speed up the code by a factor of about 10X
- Loop unrolling can be used to compact the code via multi-dimensional arrays
  - One way to represent the memory partitioning (see the sketch below):

  Before:  kernel[SIZE];
           loop body with unrolling and pipelining pragmas {
             ... += kernel[...] ...   // computation
           }
  After:   kernel[4][4][SIZE/16];
           loop body with unrolling and pipelining pragmas {
             ... += kernel[i][j][...] ...   // some indices become constants after unrolling
           }
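A compilable sketch of the 4 x 4 multi-dimensional representation; SIZE and the array names are illustrative, and the pragma spellings (shown as comments) differ between Impulse C and AutoPilot. After the i and j loops are fully unrolled, kernel[i][j] and psum[i][j] become 16 independent memories, one pair per PE:

    #define SIZE 4096
    float kernel[4][4][SIZE / 16];
    float psum[4][4][SIZE / 16];

    void compute(int idx, float w) {
        /* #pragma pipeline -- on the surrounding loop over idx */
        for (int i = 0; i < 4; i++)        /* #pragma unroll */
            for (int j = 0; j < 4; j++)    /* #pragma unroll */
                psum[i][j][idx] += w * kernel[i][j][idx];
    }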

14 Overlapping Computation and Communication
- Use ping-pong buffers at the input and the output (see the sketch below)
- Two ways of implementing it
  - Function/block pipelining (AutoPilot) or inter-process communication (Impulse C)
[Figure: pipelined schedule in which reading input data, computation, and writing output data for successive blocks overlap across the SW and HW sides]
Legend:
  DI1: transferring input from software to SRAM
  DI2: transferring input from SRAM to FPGA
  DO2: transferring output from FPGA to SRAM
  DO1: transferring output from SRAM to software
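A minimal ping-pong sketch; dma_read and compute_block are assumed stand-ins for the platform's asynchronous transfer primitive and the compute kernel. While block n is processed out of one buffer, block n+1 is fetched into the other, hiding the transfer time behind the computation:

    #define BLK 1024
    extern void dma_read(float *dst, int block);        /* assumed async transfer */
    extern void compute_block(const float *src, int n); /* assumed compute kernel */

    void process_stream(float buf[2][BLK], int nblocks) {
        int cur = 0;
        dma_read(buf[cur], 0);                  /* prefetch the first block */
        for (int n = 0; n < nblocks; n++) {
            if (n + 1 < nblocks)
                dma_read(buf[1 - cur], n + 1);  /* fetch the next block     */
            compute_block(buf[cur], n);         /* overlaps with the fetch  */
            cur = 1 - cur;                      /* swap ping and pong       */
        }
    }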

15 Implementation Flow
- The original code has a nested loop
- Loop interchange (manual code refinement)
- Multi-PE implementation: add memory partitioning, address generation, and data multiplexing logic (manual code refinement)
- Enable loop pipelining for the refined code by specifying pragmas
- Use Impulse C and AutoPilot to compile the refined code to RTL
- Use the vendor tool to compile the RTL to a bitstream
- Run the program on the target system

16 Experimental Results
- 15X speedup with a 5 x 5 partitioning over an Opteron at 2.2 GHz with 4 GB RAM
- Logic utilization is around 25K ALUTs (8K of which are used by the interface framework rather than the design)
- Power is less than 15 W for the FPGA, compared with 86 W for the Opteron 248
- Close to 100X (5.8 x 15) improvement in energy efficiency
  - Assuming similar performance

17 Experience with the Two Commercial Tools
- Impulse C
  - Strong platform customization support
  - Hardware/software co-design
  - Smaller synthesizable subset of C
- AutoPilot
  - Supports C, C++, and SystemC
  - Larger synthesizable subset
  - Platform customization

18 Discussions
- Performance without the individual optimizations
  - Roughly 2-3X worse if we do not do memory partitioning
- Polygon-based versus image-based approach
  - The image-based approach is a 2D FFT
  - Which one is faster depends on the actual layout
- Implementation on a GPU
  - The nested loop itself is already data parallel
  - The G80 has very fast shared memory for thread blocks, but its size is only 16 KB
  - We had to put the kernel array in texture memory with caching

19 Acknowledgments
- Financial support from
  - GRC
  - GSRC (FCRP)
  - NSF
- Industrial support and collaboration from
  - the Altera-AMD-SUN-XDI consortium
  - Altera, Magma, and Xilinx under the UC MICRO program
- Valuable discussions and comments from
  - Alfred Wong (Magma)
  - Zhiru Zhang (AutoESL)

20 Q/A