Latte: Locality Aware Transformation for High Level Synthesis

Presentation transcript:

Latte: Locality Aware Transformation for High Level Synthesis
Speaker: Jie Wang. Student: Peipei Zhou, Cody Hao Yu, Peng Wei. Faculty: Jason Cong.
Hi, good morning everyone. My name is XXX, a PhD student from Prof. Jason Cong's lab at UCLA. I am very happy to introduce our work today, "Latte: Locality Aware Transformation for High Level Synthesis".

Common Practice Accelerator Template
For HLS-based FPGA designs, a common accelerator design template is shown here. We list just a few recent publications that have used this template to successfully accelerate a wide range of applications, and the list should continue to grow. The accelerator has an on-chip input buffer local_in and an on-chip output buffer local_out, and it works in batches (iterations). Let's first focus on the first batch, shown in the yellow blocks: in each batch, the accelerator reads data from off-chip into the on-chip buffers through the AXI interface, processes it in NumPE kernels in parallel, and then writes the results from the on-chip buffers back off-chip. As you may already have noticed, since double buffering is applied, the off-chip communication in buffer_load and buffer_store can be overlapped with the computation of different batches (batch 0, batch 1, batch 2, and batch 3 are shown in different colors). The number of processing elements (PEs) controls the parallelism and determines both the total resource area and the throughput of the accelerator.
Jason Cong, Peng Wei, Cody Hao Yu, and Peipei Zhou. Bandwidth Optimization Through On-Chip Buffer Restructuring for HLS. DAC 2017.
Jason Cong, Zhenman Fang, Michael Lo, Hanrui Wang, Jingxian Xu, and Shaochong Zhang. Understanding Performance Differences of FPGAs and GPUs. FCCM 2018.
Jason Cong, Peng Wei, Cody Hao Yu, and Peng Zhang. Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture. DAC 2018.
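To make the template concrete, here is a minimal HLS C sketch of the double-buffered structure described above (a reconstruction under stated assumptions, not code from the cited papers; NUM_PE, TILE, buffer_load, compute, and buffer_store are illustrative placeholders):

    #include <stddef.h>

    #define NUM_PE 16   // number of processing elements (assumed)
    #define TILE   512  // words per PE per batch (assumed)

    void buffer_load(float local_in[NUM_PE][TILE], const float *ddr_in);
    void compute(const float local_in[NUM_PE][TILE], float local_out[NUM_PE][TILE]);
    void buffer_store(float *ddr_out, const float local_out[NUM_PE][TILE]);

    void accel(const float *ddr_in, float *ddr_out, int num_batches) {
    #pragma HLS INTERFACE m_axi port=ddr_in  bundle=gmem0
    #pragma HLS INTERFACE m_axi port=ddr_out bundle=gmem1
        // Ping-pong (double) buffers: while batch b is being loaded,
        // batch b-1 is being computed and stored.
        static float local_in [2][NUM_PE][TILE];
        static float local_out[2][NUM_PE][TILE];

        for (int b = 0; b <= num_batches; b++) {
            int p = b & 1;            // which half of the double buffer to fill
            if (b < num_batches)      // prefetch batch b
                buffer_load(local_in[p], ddr_in + (size_t)b * NUM_PE * TILE);
            if (b > 0) {              // batch b-1 is ready in the other half
                compute(local_in[1 - p], local_out[1 - p]);
                buffer_store(ddr_out + (size_t)(b - 1) * NUM_PE * TILE,
                             local_out[1 - p]);
            }
        }
    }

Because the load of batch b and the compute/store of batch b-1 touch different halves of the buffers, an HLS tool can schedule them to overlap, which is exactly the double-buffering effect shown in the slide.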

Area Increases as Design Scales Out
(Chart: FF/LUT/DSP resource usage vs. #PE for a matrix multiplier. #PE increases, what about f?)
For example, as demonstrated by this FCCM 2016 paper on a matrix multiplication accelerator, when #PE increases from 1 to 2 and all the way up to 64, the numbers of FFs, LUTs, and DSPs all increase. This means more chip resources are utilized and more parallelism is achieved. However, the throughput of an accelerator is also determined by its frequency after place & route. So one key question is: when #PE increases, what happens to the frequency?
Peipei Zhou, Hyunseok Park, Zhenman Fang, Jason Cong, and Andre DeHon. Energy Efficiency of Full Pipelining: A Case Study for Matrix Multiplication. FCCM 2016, Washington, DC, May 1-3, 2016.
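The tension behind this question can be made explicit with a back-of-the-envelope throughput model (our simplification, not a formula from the slide):

    throughput ≈ #PE × (operations per PE per cycle) × f

so the extra parallelism from a larger #PE only pays off if the post-route frequency f does not drop by a comparable factor.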

Frequency Degradation when Design Scales Out in HLS-based Accelerators
(Plot: frequency in MHz vs. area percentage on a Xilinx Virtex-7 XC7VX690T for NW, KMP, Viterbi, SPMV, AES, Stencil, MM, and FFT; trendline: 30% -> 200 MHz, 90% -> 130 MHz; FFT: 88% -> 57 MHz.)
In fact, we observe a severe frequency degradation problem when HLS-based accelerator designs scale out. We plot the chip area percentage against the frequency after implementation for eight applications, and for every one of them, the frequency drops significantly as the area increases. The trendline shows that, on average, HLS-generated accelerators sustain 200 MHz at 30% resource usage but drop to only 130 MHz when usage rises to 90%. One extreme case is the FFT design, which runs at only 57 MHz at 88% usage. Why does this happen? Where is the critical path? Is there a common explanation across these different applications? The answer is yes!

Why Frequency Degradation?
(Slide: the scatter pattern in buffer_load and the gather pattern in buffer_store, illustrated on batch 0.)
We investigated the critical paths carefully; let me explain. If we look again at buffer_load in the accelerator template, we can see that data is moved from the AXI read port to each BRAM bank. This means one AXI port needs to be routed to all BRAM banks; we call this the one-to-all scatter pattern. A similar situation occurs in buffer_store, where all output BRAM banks need to be routed to the AXI port through a big MUX; we call this the all-to-one gather pattern. The critical paths lie in the long wires of the scatter and gather patterns. You may wonder what they look like in the real chip layout. Let me show you!
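As a rough illustration (our sketch, not the paper's code, reusing the NUM_PE and TILE placeholders from the earlier sketch), the scatter in buffer_load and the gather in buffer_store look like this in HLS C; note how a single AXI-side word touches every bank of the partitioned buffer:

    // One-to-all scatter: each word read from the single AXI port may be
    // written into any of the NUM_PE BRAM banks, so the read-data wire
    // must be routed to all banks.
    void buffer_load(float local_in[NUM_PE][TILE], const float *ddr_in) {
    #pragma HLS ARRAY_PARTITION variable=local_in dim=1 complete
        for (int i = 0; i < NUM_PE * TILE; i++) {
    #pragma HLS PIPELINE II=1
            local_in[i / TILE][i % TILE] = ddr_in[i];
        }
    }

    // All-to-one gather: every bank can drive the single AXI write port,
    // which synthesizes into a wide NUM_PE-to-1 multiplexer.
    void buffer_store(float *ddr_out, const float local_out[NUM_PE][TILE]) {
    #pragma HLS ARRAY_PARTITION variable=local_out dim=1 complete
        for (int i = 0; i < NUM_PE * TILE; i++) {
    #pragma HLS PIPELINE II=1
            ddr_out[i] = local_out[i / TILE][i % TILE];
        }
    }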

Achilles’ Heel: Scatter, Gather, Broadcast, Reduce
(*broadcast and reduce are similar to the scatter and gather patterns)
Here you are. The yellow highlighted areas are the scattered on-chip BRAM blocks, and in the center of the picture is the data port in the AXI logic. It is not hard to imagine that the length of these wires scales up as the design scales out. In fact, there are two other patterns, broadcast and reduce, that are similar to scatter and gather. We call these patterns the Achilles’ heel of HLS, because they occur in most, if not all, accelerators: in 6 of the 8 designs we studied, a critical path lies in one of the four patterns.
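For completeness, here is an equally hedged sketch of the other two patterns (illustrative names again): broadcast fans one value out to every PE, and reduce folds every PE's result into one accumulator:

    // One-to-all broadcast: a single scalar fans out to all PEs.
    void broadcast(float pe_in[NUM_PE], float value) {
        for (int p = 0; p < NUM_PE; p++) {
    #pragma HLS UNROLL
            pe_in[p] = value;   // 'value' must be wired to every PE
        }
    }

    // All-to-one reduce: every PE's output routes into one accumulator
    // (or adder tree), mirroring the gather MUX.
    float reduce(const float pe_out[NUM_PE]) {
        float sum = 0.0f;
        for (int p = 0; p < NUM_PE; p++) {
    #pragma HLS UNROLL
            sum += pe_out[p];
        }
        return sum;
    }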

Latte: Locality Aware Transformation for high-level synThEsis
Latte microarchitecture: pipelined transfer controllers (PTCs) to reduce critical paths. PTCs are chained through FIFOs; each PTC connects to a local set of buffers. Frequency improves substantially: 1.50x with 3.2% LUT, 5.1% FF overhead on average; 2.66x with 2.7% LUT, 5.1% FF for FFT. 10 LOC: Latte pragma.
To alleviate the pain HLS designs suffer, we propose Latte; the name comes from Locality Aware Transformation for high-level synthesis. The core of the Latte microarchitecture is the PTC, a pipelined transfer controller inserted into the long wires of the four patterns. PTCs are chained through FIFOs to cap the growth of wire delay, and each PTC connects to only a local set of buffers, so the wire delays on the critical paths of the four patterns are reduced. In short, Latte boosts frequency considerably, for example by 1.5x with only 3.2% LUT area overhead on average. If you want to know how we achieve this with only 10 lines of Latte pragmas, and to learn more details of Latte, please come to the poster session.
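The poster has the real design, but conceptually one PTC stage for the scatter pattern might look like the following sketch (our reconstruction under stated assumptions, not the paper's implementation; GROUP, the number of banks local to one PTC, is a placeholder): each PTC writes the words destined for its own banks and forwards the rest downstream through a FIFO, so no single wire has to span all NUM_PE banks.

    #include "hls_stream.h"

    #define GROUP 4  // banks handled locally by each PTC (assumed)

    // One PTC stage in the chain. Stage 'id' receives, in bank order, all
    // words destined for banks id*GROUP .. NUM_PE-1, keeps the first
    // GROUP banks' worth locally, and forwards the remainder to the next
    // PTC through a FIFO (the last stage never writes to 'out').
    void ptc_stage(hls::stream<float> &in, hls::stream<float> &out,
                   float local[GROUP][TILE], int id) {
        int remaining = (NUM_PE - id * GROUP) * TILE;
        for (int i = 0; i < remaining; i++) {
    #pragma HLS PIPELINE II=1
            float v = in.read();
            if (i / TILE < GROUP)
                local[i / TILE][i % TILE] = v;  // short, local wire
            else
                out.write(v);                   // one hop down the chain
        }
    }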

Welcome to Poster Session P3, 10:10-11:15
This is an overview of the Latte poster; I hope you will enjoy it. Thank you!