Presenter: Cheng-Ta Wu. Antti Rasmus, Ari Kulmala, Erno Salminen, and Timo D. Hämäläinen, Tampere University of Technology, Institute of Digital and Computer Systems.

Outline
- Abstract
- What's the problem
- Case study
- Experiment
- Results
- Conclusion

Current system-on-chip implementations integrate IP blocks from different vendors. Typical problems are incompatibility and integration overheads. This paper presents a case study of integrating two black-box hardware accelerators into a highly scalable and modular multiprocessor system-on-chip architecture. The integration was implemented by creating two wrapper components that adapt the interfaces of the hardware accelerators to the architecture and on-chip network used. The benefit of the accelerators was measured in three different configurations, and in particular the execution time overheads caused by software, data delivery, and shared-resource contention were extracted and analyzed in an MPEG-4 encoder. The overheads increase the function runtime by up to 20x compared to ideal acceleration. In addition, the accelerator that seemed to be more efficient performed worse in practice. In conclusion, integration induces a large execution time overhead, rendering a-few-clock-cycle optimizations within the accelerator meaningless.

Most papers do not concentrate on a quantitative analysis of the associated performance overheads:
- software overheads, data delivery delays, and shared-resource contention.
This paper presents a case study of integrating two hardware accelerators to reduce processor load and improve the performance of an existing embedded MPEG-4 video encoder application on an FPGA.
- The integration overhead components are separated and analyzed.

Simulation:
- The ideal reference situation, where the hardware accelerator is attached to a simulation test bench and data are always available to the accelerator.
- Measures the pure hardware accelerator execution time.
Simple FPGA test:
- The CPU runs a test program dedicated to running the accelerators consecutively. All resources are available when required, since there is no parallel processing in the system.
- Measures the data delivery time.
Video encoder:
- The full encoder with two encoding slave CPUs, both hardware accelerators, the RM, and the SDRAM controller.
- Measures the contention for shared resources such as the bus and SDRAM.
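The three configurations above can be seen as a subtractive measurement method: each step adds one overhead source, so differencing the cycle counts isolates each component. A minimal sketch of that decomposition, using hypothetical cycle counts rather than the paper's measured values:

```python
# Sketch of the three-configuration methodology: each measurement adds one
# overhead source, so differences isolate the components.
# All cycle counts here are hypothetical placeholders, not measured values.

def overhead_breakdown(t_sim, t_fpga, t_encoder):
    """t_sim:     pure accelerator time (ideal simulation test bench)
       t_fpga:    simple FPGA test (adds SW + data-delivery overhead)
       t_encoder: full encoder run (adds shared-resource contention)"""
    data_delivery = t_fpga - t_sim       # software, wrapper, communication cost
    contention = t_encoder - t_fpga      # bus / SDRAM / accelerator contention
    total_overhead = t_encoder - t_sim
    return {
        "data_delivery": data_delivery,
        "contention": contention,
        "overhead_fraction": total_overhead / t_encoder,
    }

# Hypothetical example: 1000 cycles ideal, 1500 in the FPGA test, 3000 in encoder
b = overhead_breakdown(1000, 1500, 3000)
print(b)  # overhead_fraction ≈ 0.67
```

The same arithmetic, applied to the paper's measurements, is what yields the 96% and 55% overhead proportions reported later.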

The case with both accelerators and 1+2 CPUs is faster than 1+3 CPUs without acceleration. DCT-Q-IDCT provides a 20-21% speed-up in frame rate, ME 15-17%, and both together %.
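Note that per-accelerator frame-rate gains do not simply add: each accelerator only speeds up the fraction of total encoding time spent in its function, as Amdahl's law describes. A small sketch with illustrative fractions and speed-up factors (not taken from the paper):

```python
# Sketch: combining per-function accelerations via Amdahl's law.
# If fraction f of total runtime is sped up by factor s, the overall
# speed-up is 1 / ((1 - f) + f / s). Numbers below are illustrative.

def overall_speedup(parts):
    """parts: list of (time_fraction, local_speedup) per accelerated function."""
    remaining = 1.0 - sum(f for f, _ in parts)
    return 1.0 / (remaining + sum(f / s for f, s in parts))

dct = (0.30, 3.0)  # hypothetical: DCT-Q-IDCT is 30% of runtime, 3x faster
me = (0.40, 1.7)   # hypothetical: ME is 40% of runtime, 1.7x faster
print(overall_speedup([dct]))      # -> 1.25
print(overall_speedup([dct, me]))  # ≈ 1.57, less than the sum of the gains
```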

Firstly, the difference in cycle count between the simulation and the simple FPGA test is 50%, due to the software overhead, the wrapper component, and the introduced HIBI communication. Secondly, the difference between the simple FPGA test and the hardware in the encoder is 49%. Here, the increase is due to communication contention and RM access, which also includes the waiting time for access to the accelerator if it is in use at the time of the query.

Compared to the simulation, the execution time shows an 890% increase in the simple FPGA test and a 2160% increase in the encoder. In the FPGA test, the ME hardware has to wait for data, since it has a 128-bit input while the SDRAM is only 32 bits wide, and the wrapper also adds extra latency. In the encoder, a large amount of data is transferred to and from the SDRAM controller over the on-chip network, and bus utilization grows when multiple parallel operations are running in the system.
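The width mismatch alone accounts for a multi-cycle stall per input word: a 128-bit word must be assembled from four 32-bit bus beats. A minimal sketch of that arithmetic (per-beat and setup latencies are illustrative assumptions):

```python
# Sketch: data-delivery stall from the mismatch between the ME accelerator's
# 128-bit input port and the 32-bit SDRAM interface.
# Latency parameters are illustrative assumptions, not measured values.

def transfer_cycles(payload_bits, bus_width_bits, per_beat_latency=1, setup_latency=0):
    """Cycles to deliver one payload over a bus of the given width."""
    beats = -(-payload_bits // bus_width_bits)  # ceiling division
    return setup_latency + beats * per_beat_latency

# Feeding one 128-bit ME input word over a 32-bit bus takes 4 beats:
print(transfer_cycles(128, 32))   # -> 4
# A width-matched 128-bit path would need only 1:
print(transfer_cycles(128, 128))  # -> 1
```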

The proportions of integration overhead are 96% and 55% for the ME and DCT-Q-IDCT, respectively. The contention is relatively larger for the ME because it has longer data transfers and, instead of only competing for the bus like the DCT-Q-IDCT, it also has to wait for access to the SDRAM. The ME has a clearly shorter pure hardware execution time; however, the overall time needed for the ME during encoding is larger due to the bigger overheads. Similarly, running the accelerator at a higher frequency does not offer a notable speed-up, since the overheads remain the same.
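The futility of a faster accelerator clock falls out directly from the overhead split: only the pure hardware portion scales with frequency. A sketch using the ME's 96% overhead proportion (time units are arbitrary):

```python
# Sketch: why clocking the accelerator faster barely helps once integration
# overhead dominates. Time units are arbitrary; the 96/4 split mirrors the
# ME case, where 96% of the function runtime is integration overhead.

def function_runtime(t_overhead, t_hw, freq_scale=1.0):
    """Overhead (SW, data delivery, contention) is frequency-independent;
       only the pure hardware part shrinks with a faster accelerator clock."""
    return t_overhead + t_hw / freq_scale

base = function_runtime(96.0, 4.0)        # 100.0
doubled = function_runtime(96.0, 4.0, 2.0)  # 98.0
print(f"speed-up from 2x clock: {base / doubled:.3f}")  # ≈ 1.020
```

Doubling the accelerator clock yields only about a 2% gain, which is the paper's point: a-few-clock-cycle optimizations inside the accelerator are lost in the integration overhead.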

In our case, the proportions of integration overhead in the execution time were 96% and 55%. The actual performance gain of the accelerators differs from what the background data indicate: the accelerator that seemed to be more efficient performed worse in our system. Such high variation in execution times complicates SoC design and should not be ignored at design time, especially in real-time systems.