Efficient Communication Between Hardware Accelerators and PS

ECE 699: Lecture 7 Efficient Communication Between Hardware Accelerators and PS

Recommended Videos & Slides
M.S. Sadri, ZYNQ Training
Lesson 12 – AXI Memory Mapped Interfaces and Hardware Debugging
Lesson 7 – AXI Stream Interface In Detail (RTL Flow)
Lesson 9 – Software Development for ZYNQ Using Xilinx SDK (Transfer Data from ZYNQ PL to PS)
Xilinx Advanced Embedded System Design on Zynq
Memory Interfacing (see Resources on Piazza)

Recommended Paper & Slides M. Sadri, C. Weis, N. Wehn, and L. Benini, “Energy and Performance Exploration of Accelerator Coherency Port Using Xilinx ZYNQ,” Proc. 10th FPGAworld Conference, Stockholm 2013, available at http://www.googoolia.com/wp/2014/03/07/my-cv/

Mapping of an Embedded SoC Hardware Architecture to Zynq Source: Xilinx White Paper: Extensible Processing Platform

Simple Custom Peripheral Source: M.S. Sadri, Zynq Training

Simple Custom Accelerator Source: M.S. Sadri, Zynq Training

Example of a Custom Accelerator Source: M.S. Sadri, Zynq Training

Block Diagram of the Pattern Counter Source: M.S. Sadri, Zynq Training

Ways of Implementing AXI4 Slave Units Source: M.S. Sadri, Zynq Training

Pixel Processing Engine Source: M.S. Sadri, Zynq Training

PS-PL Interfaces and Interconnects Source: The Zynq Book

General-Purpose Port Summary
GP ports are designed for maximum flexibility
Allow register access from PS to PL or PL to PS
Good for synchronization
Prefer the ACP or an HP port for data transport
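As a minimal illustration (a sketch, not part of the original slides): assuming a custom AXI4-Lite slave in the PL reached through M_AXI_GP0 at a hypothetical base address, with hypothetical control and status registers at offsets 0x00 and 0x04, register access from the PS over a GP port reduces to plain memory-mapped reads and writes.

#include "xil_io.h"

/* Hypothetical register map of a custom AXI4-Lite slave behind M_AXI_GP0;
   in a real design take the base address from xparameters.h */
#define MY_ACCEL_BASEADDR    0x43C00000
#define MY_ACCEL_CTRL_REG    0x00
#define MY_ACCEL_STATUS_REG  0x04

/* Start the accelerator (PS -> PL write) and poll its status (PL -> PS read) */
static void start_and_wait(void)
{
    Xil_Out32(MY_ACCEL_BASEADDR + MY_ACCEL_CTRL_REG, 0x1);
    while ((Xil_In32(MY_ACCEL_BASEADDR + MY_ACCEL_STATUS_REG) & 0x1) == 0)
        ;   /* busy-wait until the accelerator reports done */
}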

High-Performance Port Summary
HP ports are designed for maximum-bandwidth access to external memory and OCM
When combined, they can saturate the external memory and OCM bandwidth:
– HP ports: 4 * 64 bits * 150 MHz * 2 = 9.6 GByte/sec
– External DDR: 1 * 32 bits * 533 MHz * 2 = 4.3 GByte/sec
– OCM: 64 bits * 222 MHz * 2 = 3.5 GByte/sec
Optimized for large burst lengths and many outstanding transactions
Large data buffers to amortize access latency
Efficient upsizing/downsizing for 32-bit accesses
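The three peak figures above can be recomputed directly; a small standalone sketch (the factor of 2 counts read plus write for the HP ports and OCM, and the double data rate of DDR3-1066 running from a 533 MHz clock):

#include <stdio.h>

int main(void)
{
    /* Peak theoretical bandwidths quoted on the slide, in bytes/second */
    double hp  = 4 * (64 / 8.0) * 150e6 * 2;   /* 4 HP ports, 64-bit, 150 MHz, read + write */
    double ddr = 1 * (32 / 8.0) * 533e6 * 2;   /* 32-bit DDR3-1066 (533 MHz clock, DDR)     */
    double ocm = 1 * (64 / 8.0) * 222e6 * 2;   /* 64-bit OCM at 222 MHz, read + write       */

    /* Prints roughly 9.60, 4.26, and 3.55 GB/s */
    printf("HP: %.2f GB/s, DDR: %.2f GB/s, OCM: %.2f GB/s\n",
           hp / 1e9, ddr / 1e9, ocm / 1e9);
    return 0;
}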

Using Central DMA Source: M.S. Sadri, Zynq Training

Central DMA
High-bandwidth Direct Memory Access (DMA) between a memory-mapped source address and a memory-mapped destination address
Optional Scatter-Gather (SG) mode
Initialization, status, and control registers are accessed through an AXI4-Lite slave interface
Source: Xilinx Advanced Embedded System Design on Zynq
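A minimal polling-mode sketch of a CDMA memory-to-memory copy using the standalone XAxiCdma driver (driver call names as found in the Xilinx BSP; cache maintenance for the source and destination buffers is omitted, and the device ID XPAR_AXICDMA_0_DEVICE_ID is an assumed name from xparameters.h):

#include "xaxicdma.h"
#include "xparameters.h"

static XAxiCdma Cdma;

int cdma_copy(u32 SrcAddr, u32 DstAddr, int Length)
{
    /* Look up the configuration of the AXI CDMA instance and initialize the driver */
    XAxiCdma_Config *Cfg = XAxiCdma_LookupConfig(XPAR_AXICDMA_0_DEVICE_ID);
    if (Cfg == NULL ||
        XAxiCdma_CfgInitialize(&Cdma, Cfg, Cfg->BaseAddress) != XST_SUCCESS)
        return XST_FAILURE;

    /* Launch a simple (non scatter-gather) memory-to-memory transfer in polling mode */
    if (XAxiCdma_SimpleTransfer(&Cdma, SrcAddr, DstAddr, Length, NULL, NULL)
            != XST_SUCCESS)
        return XST_FAILURE;

    while (XAxiCdma_IsBusy(&Cdma))
        ;   /* wait for the transfer to complete */

    return XST_SUCCESS;
}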

Using Central DMA in the Scatter-Gather Mode Source: M.S. Sadri, Zynq Training

Scatter Gather DMA Mode Source: Symbian OS Internals/13. Peripheral Support
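Conceptually, scatter-gather replaces a single (source address, length) register pair with a chain of buffer descriptors that the DMA engine walks on its own. The sketch below is a software model of that idea with a hypothetical descriptor layout (not the actual AXI DMA buffer-descriptor format):

#include <string.h>
#include "xil_types.h"

/* Hypothetical scatter-gather descriptor: one fragment of a logically contiguous transfer */
typedef struct sg_descriptor {
    u8  *buffer;                  /* fragment of the data, scattered in memory      */
    u32  length_bytes;            /* number of bytes in this fragment               */
    struct sg_descriptor *next;   /* next descriptor, or NULL for the last fragment */
} sg_descriptor;

/* Software model of what the SG engine does: walk the chain and gather the scattered
   fragments into one contiguous destination buffer. Returns the total byte count. */
static u32 sg_gather(const sg_descriptor *d, u8 *dst)
{
    u32 total = 0;
    while (d != NULL) {
        memcpy(dst + total, d->buffer, d->length_bytes);
        total += d->length_bytes;
        d = d->next;
    }
    return total;
}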

Custom Accelerator with the Master AXI4 Interface Source: M.S. Sadri, Zynq Training

Ways of Implementing AXI4 Master Units Source: M.S. Sadri, Zynq Training

AXI4-Full Source: M.S. Sadri, Zynq Training

Image Rotation Unit Source: M.S. Sadri, Zynq Training

FFT Unit Source: M.S. Sadri, Zynq Training

Sample Generator Source: M.S. Sadri, Zynq Training

PL-PS Interfaces Source: M.S. Sadri, Zynq Training

Accelerator Architecture with DMA Source: Building Zynq Accelerators with Vivado HLS, FPL 2013 Tutorial

AXI DMA-based Accelerator Communication
Write to accelerator:
processor allocates buffer
processor writes data into buffer
processor flushes cache for buffer
processor initiates DMA transfer
Read from accelerator:
processor waits for DMA to complete
processor invalidates cache for buffer
processor reads data from buffer

Flushing and Invalidating Cache

/* Flush the SrcBuffer before the DMA transfer */
Xil_DCacheFlushRange((u32)TxBufferPtr, BYTES_TO_SEND);

. . . . . . . .

/* Invalidate the DstBuffer after the DMA transfer */
Xil_DCacheInvalidateRange((u32)RxBufferPtr, BYTES_TO_RCV);

Simple DMA Transfer: Programming Sequence for the MM2S Channel
Start the MM2S channel running by setting the run/stop bit to 1 (MM2S_DMACR.RS = 1).
If desired, enable interrupts by writing a 1 to MM2S_DMACR.IOC_IrqEn and MM2S_DMACR.Err_IrqEn.
Write a valid source address to the MM2S_SA register.
Write the number of bytes to transfer in the MM2S_LENGTH register.
The MM2S_LENGTH register must be written last; all other MM2S registers can be written in any order.

Simple DMA Transfer: Programming Sequence for the S2MM Channel
Start the S2MM channel running by setting the run/stop bit to 1 (S2MM_DMACR.RS = 1).
If desired, enable interrupts by writing a 1 to S2MM_DMACR.IOC_IrqEn and S2MM_DMACR.Err_IrqEn.
Write a valid destination address to the S2MM_DA register.
Write the length in bytes of the receive buffer in the S2MM_LENGTH register.
The S2MM_LENGTH register must be written last; all other S2MM registers can be written in any order.

Transmitting and Receiving a Packet Using High-Level Functions

/* Transmit a packet */
Status = XAxiDma_SimpleTransfer(&AxiDma, (u32) TxBufferPtr,
                                BYTES_TO_SEND, XAXIDMA_DMA_TO_DEVICE);
if (Status != XST_SUCCESS) {
    return XST_FAILURE;
}
while (!TxDone);   /* wait for the TX interrupt handler to set TxDone */

. . . . . .

/* Receive a packet */
Status = XAxiDma_SimpleTransfer(&AxiDma, (u32) RxBufferPtr,
                                BYTES_TO_RCV, XAXIDMA_DEVICE_TO_DMA);
if (Status != XST_SUCCESS) {
    return XST_FAILURE;
}
while (!RxDone);   /* wait for the RX interrupt handler to set RxDone */

Transmitting a Packet Using Lower-Level Functions

/* Transmit a packet */
/* Program the source address of the buffer to transmit */
Xil_Out32(AxiDma.TxBdRing.ChanBase + XAXIDMA_SRCADDR_OFFSET,
          (u32) TxBufferPtr);
/* Set the run/stop bit in the MM2S control register */
Xil_Out32(AxiDma.TxBdRing.ChanBase + XAXIDMA_CR_OFFSET,
          Xil_In32(AxiDma.TxBdRing.ChanBase + XAXIDMA_CR_OFFSET) |
          XAXIDMA_CR_RUNSTOP_MASK);
/* Writing the transfer length (last) starts the DMA */
Xil_Out32(AxiDma.TxBdRing.ChanBase + XAXIDMA_BUFFLEN_OFFSET,
          BYTES_TO_SEND);
while (TxDone == 0);   /* wait for the TX interrupt handler to set TxDone */

Receiving a Packet Using Lower-Level Functions

/* Receive a packet */
/* Program the destination address of the receive buffer */
Xil_Out32(AxiDma.RxBdRing.ChanBase + XAXIDMA_DESTADDR_OFFSET,
          (u32) RxBufferPtr);
/* Set the run/stop bit in the S2MM control register */
Xil_Out32(AxiDma.RxBdRing.ChanBase + XAXIDMA_CR_OFFSET,
          Xil_In32(AxiDma.RxBdRing.ChanBase + XAXIDMA_CR_OFFSET) |
          XAXIDMA_CR_RUNSTOP_MASK);
/* Writing the receive-buffer length (last) arms the channel */
Xil_Out32(AxiDma.RxBdRing.ChanBase + XAXIDMA_BUFFLEN_OFFSET,
          BYTES_TO_RCV);
while (RxDone == 0);   /* wait for the RX interrupt handler to set RxDone */

PL-PS Interfaces Source: M.S. Sadri, Zynq Training

Accelerator Architecture with Coherent DMA Source: Building Zynq Accelerators with Vivado HLS, FPL 2013 Tutorial

Coherent AXI DMA-based Accelerator Communication
Write to accelerator:
processor allocates buffer
processor writes data into buffer
processor initiates DMA transfer (no cache flush needed; the ACP snoops the processor caches)
Read from accelerator:
processor waits for DMA to complete
processor reads data from buffer (no cache invalidation needed)
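With the DMA attached to the ACP and its transactions marked as coherent, the software sequence is the one from the earlier AXI DMA slides minus the cache-maintenance calls; a sketch reusing the same XAxiDma calls and variables shown earlier in this lecture:

/* Write to accelerator: no Xil_DCacheFlushRange needed, the ACP snoops L1/L2 */
Status = XAxiDma_SimpleTransfer(&AxiDma, (u32) TxBufferPtr,
                                BYTES_TO_SEND, XAXIDMA_DMA_TO_DEVICE);
if (Status != XST_SUCCESS) {
    return XST_FAILURE;
}

/* Read from accelerator: no Xil_DCacheInvalidateRange needed before using the data */
Status = XAxiDma_SimpleTransfer(&AxiDma, (u32) RxBufferPtr,
                                BYTES_TO_RCV, XAXIDMA_DEVICE_TO_DMA);
if (Status != XST_SUCCESS) {
    return XST_FAILURE;
}
while (!TxDone || !RxDone);   /* wait for both interrupt handlers to signal completion */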

Accelerator Coherency Port (ACP) Summary
ACP allows limited support for hardware coherency:
– Allows a PL accelerator to access the caches of the Cortex-A9 processors
– PL has access through the same path as the CPUs, including caches, OCM, DDR, and peripherals
– Access is low latency (assuming the data is in the processor caches); no switches in the path
ACP does not allow full coherency:
– PL is not notified of changes in the processor caches
– Use a write to a PL register for synchronization
ACP is a compromise between bandwidth and latency:
– Optimized for cache-line-length transfers
– Low latency for L1/L2 hits
– Minimal buffering to hide external memory latency
– One shared 64-bit interface, limit of 8 masters

AXI-based DMA Services
Four AXI-based DMA services are provided:
Central DMA (CDMA) – memory-to-memory operations
DMA – memory to/from AXI-Stream peripherals
FIFO Memory Mapped to Streaming – streaming AXI interface alternative to traditional DMA
Video DMA – optimized for streaming video applications to/from memory
Source: Xilinx Advanced Embedded System Design on Zynq

Streaming FIFO Source: Xilinx Advanced Embedded System Design on Zynq

Streaming FIFO
The general AXI interconnect has no support for the AXI-Stream interface; axi_fifo_mm_s provides this facility (FIFO included)
Added from the IP Catalog, like all other types of IP
Features:
AXI4/AXI4-Lite slave interface
Independent internal 512 B – 128 KB TX and RX data FIFOs
Full-duplex operation
Source: Xilinx Advanced Embedded System Design on Zynq

Streaming FIFO
Slave AXI connection
RX/TX FIFOs
Interrupt controller
Control registers
Three user-side AXI-Stream interfaces:
TX data
RX data
TX control
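A polling-mode sketch of pushing and pulling one packet through axi_fifo_mm_s with the Xilinx XLlFifo standalone driver (function names as used in the driver's polled example; initialization of Fifo with XLlFifo_CfgInitialize is assumed to have been done elsewhere, and the buffer sizes are illustrative):

#include "xllfifo.h"

#define WORDS_TO_SEND 64

extern XLlFifo Fifo;   /* assumed to be initialized with XLlFifo_CfgInitialize() */
static u32 TxBuffer[WORDS_TO_SEND], RxBuffer[WORDS_TO_SEND];

static void fifo_loopback(void)
{
    /* Transmit: push words into the TX FIFO, then write the length to start the stream */
    for (int i = 0; i < WORDS_TO_SEND; i++) {
        while (XLlFifo_iTxVacancy(&Fifo) == 0)
            ;                                   /* wait for room in the TX FIFO */
        XLlFifo_TxPutWord(&Fifo, TxBuffer[i]);
    }
    XLlFifo_TxSetLen(&Fifo, WORDS_TO_SEND * 4); /* length in bytes starts transmission */

    /* Receive: wait for a packet, read its length, then pop the words */
    while (XLlFifo_iRxOccupancy(&Fifo) == 0)
        ;
    u32 RxBytes = XLlFifo_RxGetLen(&Fifo);
    for (u32 i = 0; i < RxBytes / 4; i++)
        RxBuffer[i] = XLlFifo_RxGetWord(&Fifo);
}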

AXI Video DMA Controller Source: Xilinx Advanced Embedded System Design on Zynq

Design Goal
A hardware accelerator capable of working for arbitrary values of the parameters lm, ln, and lp, defined in software, with the only limitations imposed by the total size and the word size of the internal memories.

Passing Parameters to an Accelerator
Option 1: Parameters (e.g., lm, ln, lp) are passed using AXI_Lite
Option 2: Parameters (e.g., lm, ln, lp) are passed in the header of the input data
Option 3: Parameters are inferred from the size of the transmitted input data (not possible in the general case of matrix multiplication)
Input size: (2^(lm+ln) + 2^(lp+lm)) * 8
Output size: 2^(lp+ln) * 32 (for lm ≤ 16)
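For Option 1, the parameters are simply written to the accelerator's AXI_Lite register file before the data transfer starts; a minimal sketch with a hypothetical base address and register map (lm, ln, lp at offsets 0x00, 0x04, 0x08 and a start flag at 0x0C):

#include "xil_io.h"

/* Hypothetical AXI_Lite register map of the matrix-multiplication accelerator;
   in a real design take the base address from xparameters.h */
#define MATMUL_BASEADDR  0x43C10000
#define MATMUL_REG_LM    0x00
#define MATMUL_REG_LN    0x04
#define MATMUL_REG_LP    0x08
#define MATMUL_REG_START 0x0C

/* Option 1: pass lm, ln, lp over AXI_Lite, then start the accelerator */
static void matmul_configure(u32 lm, u32 ln, u32 lp)
{
    Xil_Out32(MATMUL_BASEADDR + MATMUL_REG_LM, lm);
    Xil_Out32(MATMUL_BASEADDR + MATMUL_REG_LN, ln);
    Xil_Out32(MATMUL_BASEADDR + MATMUL_REG_LP, lp);
    Xil_Out32(MATMUL_BASEADDR + MATMUL_REG_START, 1);  /* kick off processing */
}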

Choosing Optimal Parameters Source: M.S. Sadri, Zynq Training

Energy and Performance Exploration of Accelerator Coherency Port Using Xilinx ZYNQ
Mohammadsadegh Sadri, Christian Weis, Norbert Wehn, and Luca Benini
Department of Electrical, Electronic and Information Engineering (DEI), University of Bologna, Italy
Microelectronic Systems Design Research Group, University of Kaiserslautern, Germany
{mohammadsadegh.sadr2,luca.benini}@unibo.it, {weis,wehn}@eit.uni-kl.de

Processing Task Definition
We define a processing task and different methods to accomplish it, and measure the execution time and energy of each.
A FIR filter reads a source image (image_size bytes, at the source address), processes it through a 128 KB FIFO, and writes the result image (image_size bytes, at the destination address); the task is repeated in a loop N times and the execution interval is measured.
Buffers are allocated by kmalloc or dma_alloc_coherent, depending on the memory-sharing method.
Selection of packets (addressing): normal or bit-reversed.
Image sizes: 4 KB, 16 KB, 64 KB, 128 KB, 256 KB, 1 MB, 2 MB.

Memory Sharing Methods
ACP Only: Accelerator → ACP → SCU → L2 → DRAM (HP Only is similar, but there is no SCU or L2 in the path)
CPU Only (with and without cache)
CPU + ACP: the CPU and the accelerator cooperate on the data, with the accelerator going through the ACP, SCU, and L2 (CPU + HP is similar)

Speed Comparison (image sizes 4 KB – 1 MB)
ACP loses! CPU + OCM falls between CPU + ACP and CPU + HP (throughputs on the order of 298 MB/s and 239 MB/s).

Energy Comparison
CPU-only methods are the worst case.
CPU + OCM always falls between CPU + ACP and CPU + HP.
CPU + ACP always has better energy than CPU + HP0; as the image size grows, CPU + ACP converges to CPU + HP0.

Lessons Learned & Conclusion
If a specific task should be done by the accelerator only:
For small arrays, ACP Only and OCM Only can be used
For large arrays (larger than the L2 cache), HP Only always acts better
If a specific task should be done by the cooperation of the CPU and the accelerator:
CPU + ACP and CPU + OCM are always better than CPU + HP in terms of energy
If other applications that depend heavily on the caches are running, CPU + OCM and then CPU + HP are preferred