Performance Analysis of a JPEG Encoder Mapped To a Virtual MPSoC-NoC Architecture Using TLM 2.0.1 林孟諭 Dept. of Electrical Engineering National Cheng Kung.

Presentation transcript:

Performance Analysis of a JPEG Encoder Mapped To a Virtual MPSoC-NoC Architecture Using TLM
林孟諭
Dept. of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.

Outline
- Abstract
- Introduction
- NoC Architecture
- Encoder Task Graph
- Task Profiling
- App. Perform on NoC
- App. Mapping on Processors
- Results Analysis
- Conclusion

Abstract

Introduction
Complexity, scalability and portability have become essential concerns in the design of digital systems.
- Advances in fabrication technology allow embedded platforms to integrate a large amount of hardware resources, while the technology that interconnects them has been moving from traditional hierarchical bus connections to network-based solutions called Networks on Chip (NoC).
- In many-core architectures, interconnecting the cores through a network is one way to ease and optimize the exchange of information.
Designing NoCs also raises challenges on both the HW and SW sides:
- HW: decisions about topology, router architecture and network interface structure can lead to considerably different results.
- SW: the main obstacle is defining the programming model for the NoC-based system, as both the shared and the distributed memory approaches have drawbacks.
This paper found the distributed memory model more suitable for a network-based architecture and uses it with a message-passing structure, the Message Passing Interface (MPI).
- The MPI approach allows several mappings to be performed with little programming effort.

NoC Architecture (1/2)
The core of the NoC is composed of routers and network interface cards (NICs):
- Routers are in charge of delivering the information, in the form of packets (flits), from source to destination.
- NICs receive transactions from the end modules, translate them into flits and send them to the router network for distribution.
The router model is defined with the following structure:
1) Switching technique: wormhole, packet-based.
2) Routing algorithm: XY, West-First or North-Last.
3) Flow control: handshaking with ACK/NACK signals.
4) Virtual channels (VCs): four at each input, one per output port; variable depth.
5) Link width: 32 bits.
6) Output arbitration: round-robin.
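Two of the router parameters above, XY routing and round-robin output arbitration, can be illustrated with a minimal Python sketch. It is not the authors' TLM model: the port names, the coordinate convention (x grows to the east, y grows to the north) and the class and function names are assumptions made only for this example.

```python
# Illustrative sketch only; port names and coordinate convention are assumed.
STEP = {"EAST": (1, 0), "WEST": (-1, 0), "NORTH": (0, 1), "SOUTH": (0, -1)}


def xy_next_port(current, destination):
    """Dimension-ordered XY routing: correct the X offset first, then Y."""
    cx, cy = current
    dx, dy = destination
    if dx != cx:
        return "EAST" if dx > cx else "WEST"
    if dy != cy:
        return "NORTH" if dy > cy else "SOUTH"
    return "LOCAL"  # the flit has reached its destination router


def xy_path(source, destination):
    """List every router visited from source to destination on a mesh."""
    path = [source]
    node = source
    while True:
        port = xy_next_port(node, destination)
        if port == "LOCAL":
            return path
        sx, sy = STEP[port]
        node = (node[0] + sx, node[1] + sy)
        path.append(node)


class RoundRobinArbiter:
    """Grants one requesting input per call, rotating the priority."""

    def __init__(self, n_inputs):
        self.n_inputs = n_inputs
        self.last = n_inputs - 1  # input 0 gets the highest priority first

    def grant(self, requests):
        """requests: one bool per input. Returns the granted index or None."""
        for offset in range(1, self.n_inputs + 1):
            idx = (self.last + offset) % self.n_inputs
            if requests[idx]:
                self.last = idx
                return idx
        return None
```

For instance, `xy_path((0, 0), (2, 1))` first moves along X and then along Y. West-First and North-Last routing would replace `xy_next_port` with a partially adaptive choice and are not covered by this sketch.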

NoC Architecture (2/2)
Because the application has to be written in MPI, every call to mpi_send() on one core must match one mpi_receive() on another. End-to-end flow control is handled as follows:
1) Call to mpi_send(): the core notifies the NIC to start packing the data and keep it in a local buffer, ready to be sent.
2) Call to mpi_receive(): the core asks the NIC to send a data-request message (1 flit long) to the corresponding address so that the transfer starts.
- A timer is set to re-send the request if no data is received after a while.
Fig. 1. Proposed parameterizable NoC architecture.
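The request-driven handshake described above can be sketched as a toy Python model. The `Nic` class, the `on_request` method, the `max_attempts` limit and the `on_timeout` hook are hypothetical names introduced for this illustration; the timer is modelled as a simple retry loop rather than simulated time, and no network is modelled.

```python
class Nic:
    """Toy NIC model (hypothetical): buffers packed data until requested."""

    def __init__(self, address):
        self.address = address
        self.send_buffer = {}  # destination address -> packed payload

    def mpi_send(self, dest, payload):
        # Step 1: pack the data and keep it in the local buffer.
        self.send_buffer[dest] = payload

    def on_request(self, requester):
        # A 1-flit data request arrived: release the data if it is ready.
        return self.send_buffer.pop(requester, None)


def mpi_receive(receiver, sender, max_attempts=3, on_timeout=None):
    """Step 2: send a data request and re-send it each time the timer expires.

    Returns (data, attempts). data is None if nothing arrived in time.
    """
    for attempt in range(1, max_attempts + 1):
        data = sender.on_request(receiver.address)
        if data is not None:
            return data, attempt
        if on_timeout is not None:
            on_timeout()  # stands in for the time that passes before a retry
    return None, max_attempts
```

If the sender has already called `mpi_send()`, the first request is answered immediately. If the receiver asks first, the request is repeated until the data has been buffered or the attempt limit (an assumption of this sketch) is reached.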

Encoder Task Graph
To obtain a detailed and optimized functional partitioning, a task graph was created to identify parallelism and temporal dependencies.
Fig. 2. JPEG Encoding Algorithm Task Graph.

Task Profiling
Some criteria are needed before each task is mapped to the NoC platform; therefore, each task is profiled to identify heavy computations and algorithm bottlenecks. Costs were assigned to operations to measure processor time:
- 1 time unit for sums, loads, stores and logical operations
- 2 time units for multiplications and divisions
- For fixed tasks such as RGB-to-YUV conversion, DCT and quantization, the number of operations can be estimated.
- Encoding and bit-stream writing are block-dependent operations, and their computing cost depends on the amount of redundant information in the image.
Table 1. Average cost of the JPEG encoding algorithm (per iteration).
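The cost model above amounts to a weighted operation count, sketched below in Python. The unit costs come from the slide; the per-pixel operation counts used for the RGB-to-YUV example are an assumption (a plain 3x3 matrix multiply) and are not taken from Table 1.

```python
# Unit costs from the slide: 1 for sums, loads, stores and logical operations,
# 2 for multiplications and divisions.
UNIT_COST = {"add": 1, "load": 1, "store": 1, "logic": 1, "mul": 2, "div": 2}


def task_cost(op_counts):
    """Total cost in time units for a dict of operation counts."""
    return sum(UNIT_COST[op] * count for op, count in op_counts.items())


# Hypothetical counts for converting ONE pixel with a 3x3 matrix multiply:
# load R, G, B; 9 multiplications; 6 additions; store Y, U, V.
rgb_to_yuv_pixel = {"load": 3, "mul": 9, "add": 6, "store": 3}

pixel_cost = task_cost(rgb_to_yuv_pixel)
block_cost = 64 * pixel_cost  # one 8x8 block of pixels
```

Under these assumed counts one pixel costs 30 time units and one 8x8 block costs 1920. The same function applies to DCT and quantization once their operation counts are known, whereas the block-dependent tasks need measured averages.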

App. Perform on NoC
Based on the task graph and the profiling, this work performs different mappings of the JPEG encoding application onto the NoC to analyze its performance. Each of the listed tasks was manually assigned to a processing unit according to its cost.

App. Mapping on Processors (1/4)
Tests were carried out with 4, 6 and 8 processors. Each processor computes one of the tasks shown in Fig. 2 for specific image components.
Fig. 3. JPEG Encoder Evaluated Mappings.

App. Mapping on Processors (2/4)
Changes in router parameters, such as routing algorithm, topology and VC depth, do not yield significant improvements.
Fig. 4. JPEG encoder performance on mesh NoCs with XY routing, 2 flits/VC and a network speed equal to half the processor speed.

App. Mapping on Processors (3/4)
To analyse the impact of the synthesis technology on the NoC, the speed of the routers and NICs was lowered to -3X and -4X, where X is the processor speed. According to Fig. 5, the mapping improves the encoding in the following cases:
- with 6 and 8 processors when the network is 3X slower
- with 8 processors when the network is 4X slower
Fig. 5. JPEG relative performance for network speeds -3X and -4X (X is the processor speed). Image size was 512x512 pixels.

App. Mapping on Processors (4/4)
To generalize the results, a final simulation was performed with different image sizes (see Fig. 6). For the proposed task partitioning and mapping, the gain is around 24-25% with 4 processors, 45-46% with 6 and 49-50% with 8, irrespective of the image size.
Fig. 6. Application performance for different image sizes.

Results Analysis
One behavior is consistent across the previous results: performance improves (execution time decreases) as the number of cores increases. From Fig. 4:
- The gain obtained by going from 1 to 4 and from 4 to 6 processors is around 25~27% each.
- The improvement from 6 to 8 cores is only 8~9%, while the area cost is very high.
Even though an attempt was made to cover the most significant simulation aspects at a high level, it is not clear which criterion should be considered the most important:
- latency, execution time, computation/communication rate, traffic distribution, area consumption, etc.
There is no single criterion that resolves this dilemma:
- Only design restrictions and specifications can guide the designer to a satisfactory answer.
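The step-by-step gains quoted above can be related to cumulative gains with a short calculation, sketched here in Python. It assumes that "gain" means the relative reduction in execution time with respect to the single-processor run, and it uses 25%, 45% and 50% as representative points inside the ranges reported for 4, 6 and 8 processors; both are assumptions of this example, not statements from the slides.

```python
def incremental_gain(prev_gain, new_gain):
    """Relative execution-time reduction when moving between two setups.

    Both arguments are cumulative gains versus one processor (0.25 = 25%).
    """
    t_prev = 1.0 - prev_gain  # normalized execution time before
    t_new = 1.0 - new_gain    # normalized execution time after
    return (t_prev - t_new) / t_prev


def speedup(gain):
    """Speedup versus one processor implied by a cumulative gain."""
    return 1.0 / (1.0 - gain)


# Representative cumulative gains (assumed points within the reported ranges).
cumulative = {1: 0.0, 4: 0.25, 6: 0.45, 8: 0.50}

steps = {
    "1 -> 4": incremental_gain(cumulative[1], cumulative[4]),
    "4 -> 6": incremental_gain(cumulative[4], cumulative[6]),
    "6 -> 8": incremental_gain(cumulative[6], cumulative[8]),
}
```

With these values the steps come out at about 25%, 27% and 9%, in line with the 25~27% and 8~9% figures, and a 50% cumulative gain corresponds to a speedup of 2 on 8 processors.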

Conclusion
Validation was carried out correctly at the functional and architectural levels.
- Several simulations were executed in a short time, which allowed numerous analyses to be performed.
The previous results give the designer an overview of the number of variables that have to be taken into account when dealing with multi-processor platforms on NoC structures.