Multicore Design Considerations. Multicore: The Forefront of Computing Technology “We’re not going to have faster processors. Instead, making software.

Slides:

Advertisements

Similar presentations

Computer Architecture

Advertisements

System Integration and Performance

Categories of I/O Devices

Introduction to H.264 / AVC Video Coding Standard Multimedia Systems Sharif University of Technology November 2008.

Very Large Fast DFT (VL FFT) Implementation on KeyStone Multicore Applications.

DSPs Vs General Purpose Microprocessors

Systems and Technology Group © 2006 IBM Corporation Cell Programming Tutorial - JHD24 May 2006 Cell Programming Tutorial Jeff Derby, Senior Technical Staff.

Yaron Doweck Yael Einziger Supervisor: Mike Sumszyk Spring 2011 Semester Project.

Computer Abstractions and Technology

Implementation of 2-D FFT on the Cell Broadband Engine Architecture William Lundgren Gedae), Kerry Barnes (Gedae), James Steed (Gedae)

ARM-DSP Multicore Considerations CT Scan Example.

Khaled A. Al-Utaibi  Computers are Every Where  What is Computer Engineering?  Design Levels  Computer Engineering Fields  What.

Vector Processing. Vector Processors Combine vector operands (inputs) element by element to produce an output vector. Typical array-oriented operations.

Parallell Processing Systems1 Chapter 4 Vector Processors.

Development of Parallel Simulator for Wireless WCDMA Network Hong Zhang Communication lab of HUT.

 Understanding the Sources of Inefficiency in General-Purpose Chips.

KeyStone Training Multicore Navigator Overview. Overview Agenda What is Navigator? – Definition – Architecture – Queue Manager Sub-System (QMSS) – Packet.

VIPER DSPS 1998 Slide 1 A DSP Solution to Error Concealment in Digital Video Eduardo Asbun and Edward J. Delp Video and Image Processing Laboratory (VIPER)

Real-Time Video Analysis on an Embedded Smart Camera for Trafﬁc Surveillance Presenter: Yu-Wei Fan.

H.264/AVC Baseline Profile Decoder Complexity Analysis Michael Horowitz, Anthony Joch, Faouzi Kossentini, and Antti Hallapuro IEEE TRANSACTIONS ON CIRCUITS.

Evaluation of Data-Parallel Splitting Approaches for H.264 Decoding

A Parallel Computational Model for Heterogeneous Clusters Jose Luis Bosque, Luis Pastor, IEEE TRASACTION ON PARALLEL AND DISTRIBUTED SYSTEM, VOL. 17, NO.

Reference: Message Passing Fundamentals.

Chapter 5: Computer Systems Organization Invitation to Computer Science, Java Version, Third Edition.

1: Operating Systems Overview

Page 1 CS Department Parallel Design of JPEG2000 Image Compression Xiuzhen Huang CS Department UC Santa Barbara April 30th, 2003.

A Performance and Energy Comparison of FPGAs, GPUs, and Multicores for Sliding-Window Applications From J. Fowers, G. Brown, P. Cooke, and G. Stitt, University.

GCSE Computing - The CPU

PRASHANTHI NARAYAN NETTEM.

Multicore Design Considerations KeyStone Training Multicore Applications Literature Number: SPRP811.

GallagherP188/MAPLD20041 Accelerating DSP Algorithms Using FPGAs Sean Gallagher DSP Specialist Xilinx Inc.

Image Compression - JPEG. Video Compression MPEG –Audio compression Lossy / perceptually lossless / lossless 3 layers Models based on speech generation.

Students: Oleg Korenev Eugene Reznik Supervisor: Rolf Hilgendorf

Coding techniques for digital cinema Andreja Samčović University of Belgrade Faculty of Transport and Traffic Engineering.

Input/Output. Input/Output Problems Wide variety of peripherals —Delivering different amounts of data —At different speeds —In different formats All slower.

H.264 Deblocking Filter Irfan Ullah Department of Information and Communication Engineering Myongji university, Yongin, South Korea Copyright © solarlits.com.

LOGO OPERATING SYSTEM Dalia AL-Dabbagh

Operating System Review September 10, 2012Introduction to Computer Security ©2004 Matt Bishop Slide #1-1.

Parallel Programming Models Jihad El-Sana These slides are based on the book: Introduction to Parallel Computing, Blaise Barney, Lawrence Livermore National.

Low-Power Wireless Sensor Networks

Real-Time HD Harmonic Inc. Real Time, Single Chip High Definition Video Encoder! December 22, 2004.

Data Compression. Compression? Compression refers to the ways in which the amount of data needed to store an image or other file can be reduced. This.

HW-Accelerated HD video playback under Linux Zou Nan hai Open Source Technology Center.

TI DSPS FEST 1999 Implementation of Channel Estimation and Multiuser Detection Algorithms for W-CDMA on Digital Signal Processors Sridhar Rajagopal Gang.

Challenges in KeyStone Workshop Getting Ready for Hawking, Moonshot and Edison.

Pipelined and Parallel Computing Data Dependency Analysis for 1 Hongtao Du AICIP Research Mar 9, 2006.

Multicore Design Considerations. Multicore: The Forefront of Computing Technology “We’re not going to have faster processors. Instead, making software.

Low-Power Wireless Video System Advisor: Professor Alex Doboli Students: Christian Austin Artur Kasperek Edward Safo.

Operating System Principles And Multitasking

UNDER THE GUIDANCE DR. K. R. RAO SUBMITTED BY SHAHEER AHMED ID : Encoding H.264 by Thread Level Parallelism.

Copyright © 2004, Dillon Engineering Inc. All Rights Reserved. An Efficient Architecture for Ultra Long FFTs in FPGAs and ASICs  Architecture optimized.

Overview von Neumann Architecture Computer component Computer function

1 Modular Refinement of H.264 Kermin Fleming. 2 What is H.264? Mobile Devices Low bit-rate Video Decoder –Follow on to MPEG-2 and H.26x Operates on pixel.

The World Leader in High Performance Signal Processing Solutions Multi-core programming frameworks for embedded systems Kaushal Sanghai and Rick Gentile.

TI Information – Selective Disclosure Implementation of Linear Algebra Libraries for Embedded Architectures Using BLIS September 28, 2015 Devangi Parikh.

Chapter 1 Basic Concepts of Operating Systems Introduction Software A program is a sequence of instructions that enables the computer to carry.

Parallel processing

3/12/2013Computer Engg, IIT(BHU)1 INTRODUCTION-1.

DDRIII BASED GENERAL PURPOSE FIFO ON VIRTEX-6 FPGA ML605 BOARD PART B PRESENTATION STUDENTS: OLEG KORENEV EUGENE REZNIK SUPERVISOR: ROLF HILGENDORF 1 Semester:

3/12/2013Computer Engg, IIT(BHU)1 CONCEPTS-1. Pipelining Pipelining is used to increase the speed of processing It uses temporal parallelism In pipelining,

Hierarchical Systolic Array Design for Full-Search Block Matching Motion Estimation Noam Gur Arie,August 2005.

Silberschatz and Galvin  Operating System Concepts Module 1: Introduction What is an operating system? Simple Batch Systems Multiprogramming.

PRESENTED BY: MOHAMAD HAMMAM ALSAFRJALANI UFL ECE Dept. 3/31/2010 UFL ECE Dept 1 CACHE OPTIMIZATION FOR AN EMBEDDED MPEG-4 VIDEO DECODER.

Introduction to H.264 / AVC Video Coding Standard Multimedia Systems Sharif University of Technology November 2008.

Parallelizing an Image Compression Toolbox

Steven Ge, Xinmin Tian, and Yen-Kuang Chen

Vector Processing => Multimedia

ENEE 631 Project Video Codec and Shot Segmentation

Standards Presentation ECE 8873 – Data Compression and Modeling

Presentation transcript:

Multicore Design Considerations

Multicore: The Forefront of Computing Technology “We’re not going to have faster processors. Instead, making software run faster in the future will mean using parallel programming techniques. This will be a huge shift.” -- Katherine Yelick, Lawrence Berkeley National Laboratory from The Economist: Parallel BarsThe Economist: Parallel Bars Multicore is a term associated with parallel processing, which refers to the use of simultaneous processors to execute an application or multiple computational threads. Parallel programming/processing can be implemented on TI’s KeyStone multicore architecture.

Parallel Processing Parallel processing divides big applications into smaller applications and distributes tasks across multiple cores. The goal is to speed up processing of a computationally- intensive applications. Characteristics of computationally-intensive applications: –Large amount of data to process –Complex algorithms require many computations Goals of task partitioning –Computational load balancing evenly divides effort among all available cores –Minimizes contention of system resources Memory (DDR, shared L2) Transport (Teranet, peripherals)

Parallel Processing: Use Cases Network gateway, speech/voice processing Typically hundreds or thousands of channels Each channel consumes about 30 MIPS Large, complex, floating point FFT (1M) Multiple-size, short FFTs Video processing Slice-based encoder Video transcoder (low quality) High-quality decoder

Parallel Processing: Use Cases Medical imaging Filtering > reconstruction > post filtering Edge detection LTE channel excluding turbo decoder/encoder Two cores uplink Two cores downlink LTE channel including turbo decoder Equal to the performance of 30 cores Each core works on a package of bits Scientific processing Large complex matrix manipulations Use Case: Oil exploration

Parallel Processing: Control Models Master Slave Model –Multiple speech processing –Variable-size, short FFT –Video encoder slice processing –VLFFT Data Flow Model –High quality video encoder –Video decoder –Video transcoder –LTE physical layer Core 2 Core 1 Core 0 Master Slave

Parallel Processing: Partitioning Considerations Function driven –Large tasks are divided into function blocks –Function blocks are assigned to each core –The output of one core is the input of the next core –Use cases: H.264 high quality encoding and decoding, LTE Data driven –Large data sets are divided into smaller data sets –All cores perform the same process on different blocks of data –Use cases: image processing, multi-channel speech processing, sliced-based encoder

Parallel Processing: System Recommendations Ability to perform many operations –Fixed-point AND floating-point processing –SIMD instruction, multicore architecture Ability to communicate with the external world –Fast two-way peripherals that support high bit-rate traffic –Fast response to external events Ability to address large external memory –Fast and efficient save and retrieve methods –Transparent resource sharing between cores Efficient communication between cores –Synchronization –Messaging –Data sharing

Parallel Processing: Recommended Tools Easy-to-use IDE (Integrated Development Environment) –Advanced debug features (system trace, CP tracer) –Simultaneous, core-specific debug monitoring Real-time operating system (e.g., SYS/BIOS) Multicore software development kit –Standard APIs simplifies programming –Layered abstraction hides physical details from the application System optimized capabilities –Full-featured compiler, optimizer, linker –Third-party support

Example: High Def 1080i60 Video H264 Encoder A short introduction to video encoding Pixel format Macroblocks Performance numbers and limitations Motion estimation Encoding Entropy encoder Reconstruction Data in and out of the system DDR bandwidth Synchronization, data movement System architecture

Macroblock and Pixel Data RGB and YUV 4:4:4 and 4:2:0 format Typically 8-bit values (10, 12, 14) Macroblock = 16x16 pixels

Video Encoder Flow (per Macroblock) CoderWidthHeightFrames/SecondMCycles/Second D1(NTSC) D1 (PAL) P i (1088)60 fields3450 ModulePercentage Approximate MIPS (1080i)/Second Number of Cores Motion Estimation~50%17502 IP, MC, Transform, Quantization ~12.5% Entropy Encoder~25%8751 IT, IQ and Reconstruction ~12.5%

Video Coding Algorithm Limitations Motion estimation –Depends on the reconstruction of previous (and future) frames –Shortcuts can be performed (e.g., first row of frame N does not need last row of frame N-1). Intra-prediction –Depends on the macroblock above and to the left –Must be done consecutively or encoding efficiency is lost (i.e., lower quality for the same number of bits) Entropy encoding (CABAC, CAVLC) –Must be processed in the macroblock order –Each frame is independent of other frames.

How Many Channels Can One C6678 Process? Looks like two channels; Each one uses four cores. –Two cores for motion estimation –One core for entropy encoding –One core for everything else What other resources are needed? –Streaming data in and out of the system –Store and load data to and from DDR –Internal bus bandwidth –DMA availability –Synchronization between cores, especially if trying to minimize delay

What are the System Input Requirements? Stream data in and out of the system: –Raw data: 1920 * 1080 * 1.5 = 3,110,400 bytes per frame = bits per frame (~25M bits per frame) –At 30 frames per second, the input is 750 Mbps –NOTE: The order of raw data for a frame is Y component first, followed by U and V 750 Mbps input requires one of the following: –One SRIO lane (5 Gbps raw, about 3.5 Gbps of payload), –One PCIe lane (5 Gbps raw) –NOTE: KeyStone devices provide four SRIO lanes and two PCIe lanes Compressed data (e.g., 10 to 20 Mbps) can use SGMII (10M/100M/1G) or SRIO or PCIe.

How Many Accesses to the DDR? For purposes of this example, only consider frame-size accesses. All other accesses (ME vectors, parameters, compressed data, etc.) are negligible. Requirements for processing a single frame: –Retrieving data from peripheral to DDR - 25M bits = 3.125MB –Motion estimation phase reads the current frame (only Y) and older Y component of reconstruction frame(s). A good ME algorithm may read up to 6x older frame(s). 7 * 1920 * 1088 = ~ 15M Bytes –Encoding phase reads the current frame and one old frame. The total size is about 6.25 MB. –Reconstruction phase reads one frame and writes one frame. So the total bandwidth is 6.25 MB. –Frame compression before or after the entropy encoder is negligible. –Total DDR access for a single frame is less than 32 MB.

How Does This Access Avoid Contention? Total DDR access for a single frame is less than 32 MB. The total DDR access for 30 frames per second (60 fields) is less than 32 * 30 = 960 MBps. The DDR3 raw bandwidth is more than 10 GBps (1333 MHz clock and 64 bits). 10% utilization reduces contention possibilities. DDR3 DMA uses TeraNet with clock/3 and 128 bits. TeraNet bandwidth is 400 MHz * 16B = 6.4 GBps.

KeyStone SoC Architecture Resources 10 EDMA transfer controllers with 144 EDMA channels and 1152 PaRAM (parameter blocks) –The EDMA scheme must be designed by the user. –The LLD provides easy EDMA usage. In addition, Navigator has its own PKTDMA for each master. Data in and out of the system (SRIO, PCIe or SGMII) is done using the Navigator. All synchronization between cores and moving pointers to data between cores is done using the Navigator. IPC provides easy access to the Navigator.

Conclusion Two H264 high-quality 1080i encoders can be processed on a single TMS320C6678

System Architecture