IEEE Globecom-2006, NXG-02: Broadband Access ©Copyright All Rights Reserved
FPGA-based Acceleration of Linear Algebra Computations
B.Y. Vinay Kumar, Siddharth Joshi, Sumedh Attarde, Prof. Sachin Patkar, Prof. H. Narayanan

Outline
Double Precision Dense Matrix-Matrix Multiplication
- Motivation
- Related Work
- Algorithm
- Design
- Results
- Conclusions
Double Precision Sparse Matrix-Vector Multiplication
- Introduction
- Prasanna
- DeLorimier
- David Gregg et al.
- What can we do?

FPGA-based Double Precision Dense Matrix-Matrix Multiplication

Motivation
- FPGAs have been making inroads into high-performance computing (HiPC).
- BLAS Level 3 routines can be accelerated by accelerating matrix multiplication.
- Modern FPGAs provide an abundance of resources, and we must capitalise on them.

Related Work (1/2)
- The two main prior works are by Dou and by Prasanna.
- Both are based on linear arrays, both use memory switching, and both sustain their peak performance.
- Dou:
  - Optimised for a large Xilinx Virtex-II Pro device.
  - Created his own MAC (not fully compliant).
  - Sub-block dimensions must be powers of 2.
  - Optimised for low I/O bandwidth.

Related Work (2/2)
- Prasanna:
  - Scaling from 2 PEs to 20 PEs results in a speed degradation of about 35%.
  - 2.1 GFLOPS on a Cray XD1 with Virtex-II Pro devices (XC2VP50).
  - For the design alone (XC2VP125), they report a 15% clock degradation when going from 2 to 24 PEs.
  - They state that they have not made any platform-specific optimisations for the implemented design.

Algorithm
1. Broadcast 'A'; keep a unique 'B' per PE.
2. Multiply, feeding the operands into the multiplier pipeline.
3. The multiplier output is fed directly to the adder + RAM (accumulator).
4. When the updated C is ready, read it out.

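The four steps above can be sketched in software as follows. This is a minimal illustrative model, not the authors' hardware design: the function name `rank_one_matmul`, the `num_pe` parameter, and the choice to assign one column of B and C to each PE are assumptions made for the example. The inner loop over PEs runs sequentially here, whereas in hardware all PEs would operate in parallel.

```python
def rank_one_matmul(A, B, num_pe):
    """Compute C = A x B with rank-one updates, modelling num_pe processing elements.

    Assumption for this sketch: each PE owns one column of B and of C at a time.
    """
    n, m = len(A), len(B)      # A is n x m
    p = len(B[0])              # B is m x p
    C = [[0.0] * p for _ in range(n)]

    # Columns are processed in groups of num_pe, one column per PE.
    for col0 in range(0, p, num_pe):
        cols = range(col0, min(col0 + num_pe, p))
        for k in range(m):
            # Step 1: every PE latches its own (unique) element of row k of B.
            b_local = {j: B[k][j] for j in cols}
            for i in range(n):
                a = A[i][k]    # Step 1: the same element of A is broadcast to all PEs
                for j in cols: # sequential here; parallel across PEs in hardware
                    # Steps 2-3: multiply, then accumulate into the PE's local C store.
                    C[i][j] += a * b_local[j]
    # Step 4: the accumulated C is read out.
    return C


print(rank_one_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], num_pe=2))
# prints [[19.0, 22.0], [43.0, 50.0]]
```

For a fixed k, the loop over i and j adds the outer product of column k of A and row k of B to C, which is the rank-one update the algorithm is built on.
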
Design-I

Design-II

FPGA Synthesis/PAR data (1/2)
Table: Resource Utilisation for SX95T and SX240T (post PAR). Columns: PE, DSP48Es, FIFO, BRAM, Slice Reg, Slice LUT. [table values missing]
Table: Clock Speed in MHz for the overall design for different numbers of PEs. Rows (Device/PE): SX95T, SX240T. [table values missing]

FPGA Synthesis/PAR data (2/2)
Table: Resource Utilisation for Virtex-II Pro XC2VP100 (post PAR)

             15 PE          20 PE
MULT18x18    240 (54%)      304 (68%)
RAMB16s      90 (20%)       114 (26%)
Slices       30218 (68%)    37023 (83%)
Speed        [missing] MHz  [missing] MHz

Conclusions
- We propose a variation of the rank-one update algorithm for matrix multiplication.
- We introduce a scalable processing element for this algorithm, targeted at a Virtex-5 SX240T FPGA.
- The two designs clearly show the effect of local storage on I/O bandwidth.
- The design reaches a clock speed of 373 MHz with 40 PEs, giving a sustained performance of 29.8 GFLOPS on a single FPGA.
- We also achieve 5.3 GFLOPS on an XC2VP100.

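As a quick sanity check of the Virtex-5 figure, the short calculation below multiplies clock rate, PE count, and operations per cycle. It assumes that each PE completes one multiplication and one addition per clock cycle (2 FLOPs per PE per cycle); that rate is an assumption of this sketch, consistent with the quoted numbers but not stated on the slide, and the function name `peak_gflops` is hypothetical.

```python
def peak_gflops(clock_mhz, num_pe, flops_per_pe_per_cycle=2):
    """Throughput in GFLOPS if every PE is busy on every clock cycle.

    flops_per_pe_per_cycle=2 assumes one multiply and one add per PE per
    cycle; this is an assumption of the sketch, not a figure from the slides.
    """
    return clock_mhz * 1e6 * num_pe * flops_per_pe_per_cycle / 1e9


print(f"{peak_gflops(373, 40):.2f} GFLOPS")
# prints 29.84 GFLOPS
```

Under that assumption the result agrees with the reported 29.8 GFLOPS, which fits a design that sustains its peak rate.
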
FPGA-based Double Precision Sparse Matrix-Vector Multiplication

Introduction
There are three main papers we will be looking at:
- Viktor Prasanna: hybrid method using HLL + S/W + HDL.
- Michael DeLorimier: maximum performance, but unrealistic.
- David Gregg et al.: most realistic assumptions with respect to DRAM.

Prasanna
- Uses pre-existing IP cores, specifically for an iterative solver (CG).
- A 4-input reduction circuit performs the dot product, producing partial sums as output.
- An adder loop with an array sums the dot-product partial sums; it is created using an HLL.
- A reduction circuit at the end uses a B-Tree to create the final value.
- IPs are available.
- DRAM is considered, but not realistically.
- The order of the matrices is small.
- DRAM is the bottleneck.
- With their IPs they have a good architecture; however, the IP could be changed and the datapath modified, e.g. with Dou's MAC.

DeLorimier
- Uses BRAMs for everything.
- Used for an iterative solver, specifically CG.
- The MAC requires interleaving.
- Load balancing is done in their partitioner, which requires a communication stage and is very matrix/partitioner dependent.
- Communication is the bottleneck.
- Performance: 750 MFLOPS / processor, on 16 Virtex II 6000s, each with 5 PEs + 1 CE.

David Gregg et al. (SPAR)
- They only report the use of the SPAR architecture for FPGAs.
- They use very pessimistic DRAM access times.
- The emphasis is on cache-miss removal.
- Their Block RAMs are not used well; maybe something interesting can be done here.
- 128 MFLOPS for 3 parallel SPAR units; with cache misses removed, the peak is 570 MFLOPS.

What can we do?
- Both use CSR, which is not required; why not modify the representation?
- Two approaches (we can try both simultaneously):
  - Prasanna: split across dot products (same row, many PEs).
  - DeLorimier: split across rows (many rows, one PE).
- Using data from SPAR is a viable approach: both do zero multiplies, whereas we get away with one zero multiply per column.
- Minimise communication or overlap it: we can use interleaving for this, so that while one stage computes, the previous one communicates.

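To make the two partitioning approaches concrete, the sketch below computes y = A x for a matrix stored in CSR form, once with rows dealt out to PEs and once with the nonzeros of each row dealt out to PEs. It is an illustrative software model only: the function names, the round-robin assignment, and the `num_pe` parameter are assumptions of this example and are not taken from either paper.

```python
def csr_spmv_row_split(values, col_idx, row_ptr, x, num_pe):
    """Row split: each PE computes the full dot product for the rows it owns.

    Rows are assigned round-robin, which is an assumption of this sketch.
    """
    n = len(row_ptr) - 1
    y = [0.0] * n
    for pe in range(num_pe):              # PEs would run in parallel in hardware
        for r in range(pe, n, num_pe):    # rows owned by this PE
            acc = 0.0
            for k in range(row_ptr[r], row_ptr[r + 1]):
                acc += values[k] * x[col_idx[k]]
            y[r] = acc
    return y


def csr_spmv_dot_split(values, col_idx, row_ptr, x, num_pe):
    """Dot-product split: one row's nonzeros are shared by all PEs.

    Each PE keeps a partial sum; the partial sums are reduced at the end of the row.
    """
    n = len(row_ptr) - 1
    y = [0.0] * n
    for r in range(n):
        partial = [0.0] * num_pe
        for k in range(row_ptr[r], row_ptr[r + 1]):
            pe = (k - row_ptr[r]) % num_pe  # round-robin over PEs (assumption)
            partial[pe] += values[k] * x[col_idx[k]]
        y[r] = sum(partial)                 # reduction of the per-PE partial sums
    return y


# CSR form of [[1, 0, 2], [0, 3, 0], [4, 0, 5]]
values, col_idx, row_ptr = [1, 2, 3, 4, 5], [0, 2, 1, 0, 2], [0, 2, 3, 5]
x = [1, 2, 3]
print(csr_spmv_row_split(values, col_idx, row_ptr, x, num_pe=2))
print(csr_spmv_dot_split(values, col_idx, row_ptr, x, num_pe=2))
```

The row split needs no reduction step but can leave PEs unevenly loaded when row lengths differ, while the dot-product split balances work within a row at the cost of a reduction per row.
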
Questions?

Thank You