The World Leader in High Performance Signal Processing Solutions Multi-core programming frameworks for embedded systems Kaushal Sanghai and Rick Gentile.

Slides:

Advertisements

Similar presentations

Computer Architecture

Advertisements

Parallelizing Video Transcoding With Load Balancing On Cloud Computing Song Lin, Xinfeng Zhang, Qin Y, Siwei Ma Circuits and Systems, 2013 IEEE.

Accelerators for HPC: Programming Models Accelerators for HPC: StreamIt on GPU High Performance Applications on Heterogeneous Windows Clusters

Philips Research ICS 252 class, February 3, The Trimedia CPU64 VLIW Media Processor Kees Vissers Philips Research Visiting Industrial Fellow

Yaron Doweck Yael Einziger Supervisor: Mike Sumszyk Spring 2011 Semester Project.

A NOVEL APPROACH TO SOLVING LARGE-SCALE LINEAR SYSTEMS Ken Habgood, Itamar Arel Department of Electrical Engineering & Computer Science GABRIEL CRAMER.

Data Marshaling for Multi-Core Architectures M. Aater Suleman Onur Mutlu Jose A. Joao Khubaib Yale N. Patt.

Reporter :LYWang We propose a multimedia SoC platform with a crossbar on-chip bus which can reduce the bottleneck of on-chip communication.

INTERNATIONAL CONFERENCE ON TELECOMMUNICATIONS, ICT '09. TAREK OUNI WALID AYEDI MOHAMED ABID NATIONAL ENGINEERING SCHOOL OF SFAX New Low Complexity.

ARM-DSP Multicore Considerations CT Scan Example.

Development of Parallel Simulator for Wireless WCDMA Network Hong Zhang Communication lab of HUT.

VIPER DSPS 1998 Slide 1 A DSP Solution to Error Concealment in Digital Video Eduardo Asbun and Edward J. Delp Video and Image Processing Laboratory (VIPER)

Software Architecture of High Efficiency Video Coding for Many-Core Systems with Power- Efficient Workload Balancing Muhammad Usman Karim Khan, Muhammad.

10.2 Characteristics of Computer Memory RAM provides random access Most RAM is volatile.

TigerSHARC and Blackfin Different Applications. Introduction Quick overview of TigerSHARC Quick overview of Blackfin low power processor Case Study: Blackfin.

1 Adaptive slice-level parallelism for H.264/AVC encoding using pre macroblock mode selection Bongsoo Jung, Byeungwoo Jeon Journal of Visual Communication.

Evaluation of Data-Parallel Splitting Approaches for H.264 Decoding

Efficient Moving Object Segmentation Algorithm Using Background Registration Technique Shao-Yi Chien, Shyh-Yih Ma, and Liang-Gee Chen, Fellow, IEEE Hsin-Hua.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 Structuring Parallel Algorithms.

Video on DSP and FPGA John Johansson April 12, 2004.

1 Slice-Balancing H.264 Video Encoding for Improved Scalability of Multicore Decoding Michael Roitzsch Technische Universität Dresden ACM & IEEE international.

A Performance and Energy Comparison of FPGAs, GPUs, and Multicores for Sliding-Window Applications From J. Fowers, G. Brown, P. Cooke, and G. Stitt, University.

H.264/AVC for Wireless Applications Thomas Stockhammer, and Thomas Wiegand Institute for Communications Engineering, Munich University of Technology, Germany.

1 Presenter: Chien-Chih Chen Proceedings of the 2002 workshop on Memory system performance.

Multicore Design Considerations. Multicore: The Forefront of Computing Technology “We’re not going to have faster processors. Instead, making software.

EEL 6935 Embedded Systems Long Presentation 2 Group Member: Qin Chen, Xiang Mao 4/2/20101.

1. 1. Problem Statement 2. Overview of H.264/AVC Scalable Extension I. Temporal Scalability II. Spatial Scalability III. Complexity Reduction 3. Previous.

The hybird approach to programming clusters of multi-core architetures.

1 Motivation Video Communication over Heterogeneous Networks –Diverse client devices –Various network connection bandwidths Limitations of Scalable Video.

ECE 526 – Network Processing Systems Design Network Processor Architecture and Scalability Chapter 13,14: D. E. Comer.

1 Miodrag Bolic ARCHITECTURES FOR EFFICIENT IMPLEMENTATION OF PARTICLE FILTERS Department of Electrical and Computer Engineering Stony Brook University.

Microcomputer Systems Project By Shriram Kunchanapalli.

Architectures for mobile and wireless systems Ese 566 Report 1 Hui Zhang Preethi Karthik.

Software Pipelining for Stream Programs on Resource Constrained Multi-core Architectures IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEM 2012 Authors:

Multi-core architectures. Single-core computer Single-core CPU chip.

A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT Nikolaos Vassiliadis N. Kavvadias, G. Theodoridis, S. Nikolaidis Section.

A Flexible Multi-Core Platform For Multi-Standard Video Applications Soo-Ik Chae Center for SoC Design Technology Seoul National University MPSoC 2009.

Multi-Core Architectures

Real-Time HD Harmonic Inc. Real Time, Single Chip High Definition Video Encoder! December 22, 2004.

Advanced SW/HW Optimization Techniques for Application Specific MCSoC m Yumiko Kimezawa Supervised by Prof. Ben Abderazek Graduate School of Computer.

Towards the Design of Heterogeneous Real-Time Multicore System m Yumiko Kimezawa February 1, 20131MT2012.

Adaptive Multi-path Prediction for Error Resilient H.264 Coding Xiaosong Zhou, C.-C. Jay Kuo University of Southern California Multimedia Signal Processing.

Developing Power-Aware Strategies for the Blackfin Processor Steven VanderSanden Giuseppe Olivadoti David Kaeli Richard Gentile Northeastern University.

By: Hitesh Yadav Supervising Professor: Dr. K. R. Rao Department of Electrical Engineering The University of Texas at Arlington Optimization of the Deblocking.

Rate-distortion Optimized Mode Selection Based on Multi-channel Realizations Markus Gärtner Davide Bertozzi Classroom Presentation 13 th March 2001.

VLSI Algorithmic Design Automation Lab. THE TI OMAP PLATFORM APPROACH TO SOC.

UNDER THE GUIDANCE DR. K. R. RAO SUBMITTED BY SHAHEER AHMED ID : Encoding H.264 by Thread Level Parallelism.

-BY KUSHAL KUNIGAL UNDER GUIDANCE OF DR. K.R.RAO. SPRING 2011, ELECTRICAL ENGINEERING DEPARTMENT, UNIVERSITY OF TEXAS AT ARLINGTON FPGA Implementation.

CloudStream: delivering high-quality streaming videos through a cloud-based SVC proxy Authors: Zixia Huang1, Chao Mei1, Li Erran Li2, Thomas Woo2 1Department.

Flexible Filters for High Performance Embedded Computing Rebecca Collins and Luca Carloni Department of Computer Science Columbia University.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Spring 2010 Programming Massively Parallel.

Advanced SW/HW Optimization Techniques for Application Specific MCSoC m Yumiko Kimezawa Supervised by Prof. Ben Abderazek Graduate School of Computer.

Contemporary DRAM memories and optimization of their usage Nebojša Milenković and Vladimir Stanković, Faculty of Electronic Engineering, Niš.

Hierarchical Systolic Array Design for Full-Search Block Matching Motion Estimation Noam Gur Arie,August 2005.

1 Hierarchical Parallelization of an H.264/AVC Video Encoder A. Rodriguez, A. Gonzalez, and M.P. Malumbres IEEE PARELEC 2006.

PRESENTED BY: MOHAMAD HAMMAM ALSAFRJALANI UFL ECE Dept. 3/31/2010 UFL ECE Dept 1 CACHE OPTIMIZATION FOR AN EMBEDDED MPEG-4 VIDEO DECODER.

Progressive transmission of spatial data Prof. Wenwen Li School of Geographical Sciences and Urban Planning 5644 Coor Hall

Chapter 7 Input/Output and Storage Systems. 2 Chapter 7 Objectives Understand how I/O systems work, including I/O methods and architectures. Become familiar.

Olivier Bockenbach1, Ian Wainwright1, Murtaza Ali2, Mark Nadeski2.

Hiba Tariq School of Engineering

Parallel Data Laboratory, Carnegie Mellon University

Cache Memory Presentation I

Sum of Absolute Differences Hardware Accelerator

Study and Optimization of the Deblocking Filter in H

Standards Presentation ECE 8873 – Data Compression and Modeling

Numerical Algorithms Quiz questions

Bongsoo Jung, Byeungwoo Jeon

Kyoungwoo Lee, Minyoung Kim, Nikil Dutt, and Nalini Venkatasubramanian

HALO-FREE DESIGN FOR RETINEX BASED REAL-TIME VIDEO ENHANCEMENT SYSTEM

Memory Considerations

Presentation transcript:

The World Leader in High Performance Signal Processing Solutions Multi-core programming frameworks for embedded systems Kaushal Sanghai and Rick Gentile Analog Devices Inc., Norwood, MA

Outline  Multi-core programming challenge  Framework requirements  Framework Methodology  Multimedia data-flow analysis  BF561 dual-core architecture analysis  Framework models  Combining Frameworks  Results  Conclusion

Multi-core Programming Challenge  To meet the growing processing demands placed by embedded applications, multi-core architectures have emerged as a promising solution  Embedded developers strive to take advantage of extra core(s) without a corresponding increase in programming complexity  Ideally, the performance increase should approach “N” times where “N” is the number of cores  Managing shared-memory and inter-core communications makes the difference!  Developing a framework to manage code and data will help to speed development time and ensure optimal performance  We target some compute intensive and high bandwidth applications on an embedded dual-core processor

Framework requirements  Scalable across multiple cores  Equal load balancing between all cores  A core data item request is always met at the L1 memory level  Minimum possible data memory footprint

Framework methodology  Understanding the parallel data-flow of the application with respect to spatial and temporal locality  Efficiently mapping the data-flow to the private and shared resources of the architecture

Multimedia Data-flow Analysis

ADSP-BF561 Dual-core Architecture Analysis  Dual-Core architecture  Private L1 code and Data memory  Shared L2 and external memory  4 Memory DMA channels  Shared peripheral interface L2 shared and Unified Code and Data (128KB ) SDRAM (L3) 4x(16 – 128 MB) SRAM SRAM/Cache L1 Code (32KB) C o re A Core B L1 Data (64KB) 9 cclk 8-10 sclk 1 cclk

Framework models  Slice/Line processing  Macro-block processing  Frame processing  GOP processing

Framework design  Data moved directly from the peripheral DMA to the lowest (Level 1 or Level 2) possible memory level based on the data access granularity  DMA is used for all data management across memory levels, saving essential core cycles in managing data  Multiple Data buffers are used to avoid core and DMA contention  Semaphores are used for inter-core communication

Line processing framework  No L2 or L3 accesses made, thereby saving external memory bandwidth and DMA resources  Only DMA channels used to manage data  Applicable examples - color conversion, histogram equalization, filtering, sampling etc. Video In Rx_Line0 Rx_Line2 Tx_Line0 Tx_Line2 Core A Internal L1 memory Rx_Line1 Rx_Line3 Tx_Line1 Tx_Line3 Core B Internal L1 memory Video Out

Macro-block processing framework  No L3 accesses  Applicable examples - edge detection, JPEG/MJPEG encoding/decoding algorithms, convolution encoding etc Video In Rx0 Rx1 Core A Internal L1 memory Core B Internal L1 memory PPI1 Tx0 Tx1 Rx0 Rx1 Tx0 Tx1 L2 Rx Buffers L2 Tx Buffers

Frame processing framework  Applicable example - motion detection Rx0 Rx1 Core A Internal L1 memory Core B Internal L1 memory Tx0 Tx1 Rx0 Rx1 Tx0 Tx1 Rx0 Rx1 Rx0 Rx1 L3 Frame Buffers

GOP processing framework  Applicable examples - encoding/decoding algorithms such as MPEG-2/MPEG-4 Rx0 Rx1 Core A Internal L1 memory Tx0 Tx1 Rx0 Rx1 Rx0 Rx1 Rx0 Rx1 Core B Internal L1 memory Tx0 Tx1 Rx0 Rx1 Rx0 Rx1 L3 Frame Buffers GOP = 4

Results TemplateCore cycles/pi xel*(appr ox.) single core Core cycles/pixe l*(approx.) - two cores L1 data memory required( bytes) L2 data memory required (bytes) Comments Line Processing 4280 (line size)*2; for ITU *2 double buffering in L1 Macro- block Processing 3672 (Macro-block size(nxm))* 2 Slice of a frame; (macroblock height *line size)*4 double buffering in L1 and L2 Frame processing 3570 (size of sub- processing block)*(num ber of dependent blocks) Only L1 or L2 cannot be used double buffering in L1 or L2

Using the Templates Identify the following items for an application  The granularity of the sub-processing block in the image processing algorithm  The available L1 and L2 data memory, as required by the specific templates.  The estimate of the computation cycles required per sub- processing block  The spatial and temporal dependencies between the sub- processing blocks. If dependencies exist, then the templates needs modification to account for data dependencies

Conclusion  Understanding the data access pattern of an application is key to efficient programming model for embedded systems  The frameworks combine techniques to efficiently manage the shared resources and exploit the known data access pattern in multimedia applications to achieve a 2X speed-up  The memory footprint is equal to the smallest data access granularity of the application  The frameworks can be combined to integrate multiple algorithms with different data access pattern within an application