The Effect of Data-Reuse Transformations on Multimedia Applications for Different Processing Platforms N. Vassiliadis, A. Chormoviti, N. Kavvadias, S.

Presentation transcript:

The Effect of Data-Reuse Transformations on Multimedia Applications for Different Processing Platforms
N. Vassiliadis, A. Chormoviti, N. Kavvadias, S. Nikolaidis
Section of Electronics and Computers, Department of Physics, Aristotle University of Thessaloniki, Thessaloniki, Greece
Aristotle University of Thessaloniki

Scope
- A methodology for the implementation of an ASIP, from a hardware-software perspective, is followed
- An ASIP for multimedia applications is designed
- The effect of data-reuse transformations, in terms of energy and performance, on a multimedia application executed on an ASIP and a GPP is studied

Motive
- Popularity of portable multimedia applications
- Great need for power optimization strategies, especially at higher design levels
- Code transformations aimed at the memory hierarchy provide significant power savings
- While ASICs lack flexibility and GPPs are prohibitively expensive in terms of energy-performance, the embedded systems industry has an increasing interest in ASIPs

Data Reuse Transformations
- In data-dominated applications, significant power savings can be achieved by developing a custom memory organization
- The two-dimensional Three-Step Search (TSS) algorithm is used as the benchmark
- The custom memory organization for this benchmark is designed
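
The idea of a data-reuse copy can be sketched in C as follows. This is only an illustration under stated assumptions, not the memory organization designed in the presentation: the function name copy_window, the zero padding outside the frame, and the single 30x30 buffer (block size 16 plus a search range of 7 on each side) are hypothetical choices. The point is that pixels read many times during the search are first copied into a small buffer, which a smaller and cheaper memory layer could hold.

```c
#include <assert.h>
#include <stdint.h>

/* Block size and search range; WIN is the side of the search window. */
enum { B = 16, P = 7, WIN = B + 2 * P };

/*
 * Copy the search window around block (bx, by) from the frame into a
 * small buffer. Pixels that fall outside the frame are set to zero
 * (an assumption made for this sketch).
 */
void copy_window(const uint8_t *frame, int width, int height,
                 int bx, int by, uint8_t win[WIN][WIN])
{
    assert(frame != 0 && width > 0 && height > 0);
    for (int y = 0; y < WIN; y++) {
        for (int x = 0; x < WIN; x++) {
            int fy = by * B - P + y;   /* row in the full frame */
            int fx = bx * B - P + x;   /* column in the full frame */
            int inside = fy >= 0 && fy < height && fx >= 0 && fx < width;
            win[y][x] = inside ? frame[fy * width + fx] : 0;
        }
    }
}
```

After the copy, every candidate position of the search reads from win instead of the full frame, so the large frame memory is accessed once per window pixel rather than once per comparison.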

ASIP Design Flow - Architecture Template
- A RISC, MIPS-like machine is used as the base processor
- The target is extensions to the instruction set of the processor that are beneficial in terms of performance and power consumption

ASIP Design Flow - Front-End Compilation
- The TSS algorithm versions with the different data-reuse transformations, described in the C programming language, were compiled
- GNU GCC for embedded architectures, configured as a cross-compiler for the MIPS architecture, was used

ASIP Design Flow - Dynamic Profiling
- Dynamic profiling was performed with the GNU tools (gcc, binutils, gdb) configured for the MIPS processor
- Heavily executed portions of the code were identified
  - Loop iteration overhead is 24% of the total execution cycles
  - Address generation instructions are 62% of the total execution cycles
  - Only 14% of the execution time is consumed on pure computational micro-operations
- New candidate instructions from which the application can benefit were revealed

ASIP Design Flow - Instruction Set Extensions

Description              | Additional Hardware Requirements | Penalty
Inc+Branch_Rs_Rd_Target  | Control Logic + Incrementer Unit | Area + Delay
Add+SW_L#_Rs_Rt_Rd       | Control Logic                    | Area
Add+LW_L#_Rs_Rt_Rd       | Control Logic                    | Area

- L# is the desired level of the custom memory hierarchy
- "Increment and Branch" instruction to reduce loop iteration overhead
- Store/Load Word with addition for address calculation (one cycle using the pipeline)
- Direct support of the custom memory hierarchy
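
The slides give only the mnemonics of the three extensions, so the exact semantics are not known. The C model below is a sketch under assumptions: Add+LW is taken to load from level L# at address Rs + Rt into Rd, Add+SW to store Rd at that address, and Inc+Branch to increment Rs and branch while it differs from Rd. The Machine type, the sizes, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { LEVELS = 2, WORDS = 64 };

/* Toy machine state: a register file and one array per memory level. */
typedef struct {
    int32_t reg[32];
    int32_t mem[LEVELS][WORDS];
} Machine;

/* Assumed Add+LW: address addition and load fused into one operation. */
void add_lw(Machine *m, int level, int rs, int rt, int rd)
{
    int32_t addr = m->reg[rs] + m->reg[rt];
    assert(level >= 0 && level < LEVELS && addr >= 0 && addr < WORDS);
    m->reg[rd] = m->mem[level][addr];
}

/* Assumed Add+SW: address addition and store fused into one operation. */
void add_sw(Machine *m, int level, int rs, int rt, int rd)
{
    int32_t addr = m->reg[rs] + m->reg[rt];
    assert(level >= 0 && level < LEVELS && addr >= 0 && addr < WORDS);
    m->mem[level][addr] = m->reg[rd];
}

/* Assumed Inc+Branch: increment rs, report whether the branch is taken. */
int inc_branch(Machine *m, int rs, int rd)
{
    m->reg[rs] += 1;
    return m->reg[rs] != m->reg[rd];
}

/* Sum the first n words of one memory level using only the fused ops. */
int32_t sum_level(Machine *m, int level, int n)
{
    enum { BASE = 1, IDX = 2, END = 3, TMP = 4 };
    int32_t sum = 0;
    if (n <= 0) return 0;
    m->reg[BASE] = 0;
    m->reg[IDX] = 0;
    m->reg[END] = n;
    do {
        add_lw(m, level, BASE, IDX, TMP);   /* replaces add + lw */
        sum += m->reg[TMP];
    } while (inc_branch(m, IDX, END));      /* replaces addi + bne */
    return sum;
}
```

In this model each loop iteration needs one fused load and one fused loop-control operation, where a plain MIPS-like sequence would need separate add, load, increment, and branch instructions. That is the overhead the profiling step attributes to address generation and loop iteration.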

ASIP Design Flow - Code Re-Generation
- The original code is parsed and the MOPs are reordered to construct the instruction-extension patterns
- Patterns are substituted by the newly defined instructions
- MOPs are reordered to keep the pipeline as full as possible

ASIP Design Flow - Cycle Accurate and Hardware Models
- A cycle-accurate simulation model was constructed in SystemC
- A hardware model was designed in VHDL
- The execution frequency of instructions and the accesses to crucial hardware components were collected
- Specifications were determined by synthesis on a popular standard cell technology

Experimental Results
- The different versions of the TSS application code were compiled for the ARM9TDMI and the ASIP cores
- Cycle-accurate simulations were performed on the ARMulator and the SystemC simulator, respectively
- The TSS was executed on digital pictures of MxN = 144x176 pixels. The block size B was set to 16 and the search window [-p, p] was set to [-7, 7].
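
For reference, a textbook form of the Three-Step Search with the parameters above (B = 16, p = 7) can be sketched in C. This is a generic version, not the benchmark code used in the presentation: the Motion type, the function names, the sum-of-absolute-differences cost, and the rule that out-of-frame candidates are rejected are assumptions of the sketch. The step sizes are 4, 2, and 1, so the largest reachable displacement is 7.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

enum { B = 16, P = 7 };   /* block size and search range */

typedef struct {
    int dx, dy;      /* best displacement found */
    long sad;        /* cost of the best displacement */
    int evaluated;   /* number of candidate positions visited */
} Motion;

/* Sum of absolute differences between the current block at (x0, y0)
 * and the reference block displaced by (dx, dy). Candidates that leave
 * the frame get the worst possible cost. */
static long block_sad(const uint8_t *cur, const uint8_t *ref,
                      int width, int height,
                      int x0, int y0, int dx, int dy)
{
    int rx = x0 + dx, ry = y0 + dy;
    if (rx < 0 || ry < 0 || rx + B > width || ry + B > height)
        return LONG_MAX;
    long sad = 0;
    for (int y = 0; y < B; y++)
        for (int x = 0; x < B; x++)
            sad += abs((int)cur[(y0 + y) * width + x0 + x] -
                       (int)ref[(ry + y) * width + rx + x]);
    return sad;
}

Motion three_step_search(const uint8_t *cur, const uint8_t *ref,
                         int width, int height, int x0, int y0)
{
    assert(x0 >= 0 && y0 >= 0 && x0 + B <= width && y0 + B <= height);
    Motion best = { 0, 0,
                    block_sad(cur, ref, width, height, x0, y0, 0, 0), 1 };
    for (int step = (P + 1) / 2; step >= 1; step /= 2) {
        int cx = best.dx, cy = best.dy;   /* centre of this step */
        for (int j = -1; j <= 1; j++) {
            for (int i = -1; i <= 1; i++) {
                if (i == 0 && j == 0)
                    continue;             /* centre is already known */
                int dx = cx + i * step, dy = cy + j * step;
                long sad = block_sad(cur, ref, width, height,
                                     x0, y0, dx, dy);
                best.evaluated++;
                if (sad < best.sad) {
                    best.dx = dx;
                    best.dy = dy;
                    best.sad = sad;
                }
            }
        }
    }
    return best;
}
```

With this variant each block visits 1 + 3 x 8 = 25 candidate positions, compared with 15 x 15 = 225 for an exhaustive search over [-7, 7], and a 144x176 picture contains 9 x 11 = 99 blocks of 16x16 pixels. Every candidate re-reads overlapping reference pixels, which is the reuse the transformations exploit.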

Performance Results
- Performance gain of 29% (ARM9TDMI) and 54% (ASIP) for P4 compared to the original TSS
- The ASIP is capable of delivering a 54% performance gain compared to the ARM9TDMI core
- The ASIP reaches 250 MHz in STM 0.18 µm technology
- The ARM9TDMI implemented in the same technology process reaches 200 MHz

Energy Results
- SRAM memories of appropriate size were used for each layer of the data memory
- ROM was used for the instruction memory. Because the ASIP code is 50% smaller than the ARM code, ROM sizes of 4 KB (ARM) and 2 KB (ASIP) were used.

Energy Results
- Energy consumption is dominated by accesses to the instruction memory
- P4 delivers 32% (ARM) and 54% (ASIP) energy savings compared with the original TSS
- 42% energy savings can be achieved by using the P4 transformation on the ASIP compared to the ARM9TDMI
- These energy savings result from the smaller number of accesses to the instruction memory and from the smaller instruction memory size of the ASIP

Conclusions
- Both solutions, namely ASIP and GPP, can benefit in terms of performance and energy consumption from selecting the appropriate custom data memory hierarchy
- The ASIP can achieve higher performance and a larger energy reduction through this hierarchy than a GPP

Performance Results

Energy Results