Lu Hao Profiling-Based Hardware/Software Co-Exploration for the Design of Video Coding Architectures Heiko Hübert and Benno Stabernack



Contents
1. Background
2. MEMTRACE profiler
3. Software/Hardware Optimization
4. Conclusion

Background -- profiling
- Profiling is used to understand the run-time behavior of applications.

Efficient profiling approaches
- Software profiling
  - Sampling, instrumentation
  - Flexible, but with high overhead
- Hardware profiling
  - Performance counters
  - Inexpensive, but more rigid and not universally available
- Hybrid
  - Combinations of the above
  - Hold great potential, since they aim to combine the advantages of both while limiting their drawbacks
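
To make the instrumentation approach concrete, here is a minimal sketch in Python. It is not taken from the presentation: the `instrument` decorator, the `stats` table, and the `sum_abs_diff` workload are hypothetical names chosen for illustration. Every call to an instrumented function updates a call count and an accumulated run time.

```python
import time
from collections import defaultdict
from functools import wraps

# Per-function statistics gathered by instrumentation (hypothetical layout).
stats = defaultdict(lambda: {"calls": 0, "seconds": 0.0})

def instrument(func):
    """Wrap func so that every call updates its entry in stats."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs even if func raises, so the call is always counted.
            entry = stats[func.__name__]
            entry["calls"] += 1
            entry["seconds"] += time.perf_counter() - start
    return wrapper

@instrument
def sum_abs_diff(a, b):
    # Hypothetical workload: sum of absolute differences of two pixel rows.
    return sum(abs(x - y) for x, y in zip(a, b))

for _ in range(3):
    sum_abs_diff([1, 5, 9], [2, 3, 9])
```

The wrapper runs on every call, which is where the overhead of software profiling comes from: the measurement code itself consumes time in the profiled program.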

An example of hardware profiling
- PC – Performance Counter

Background – system analysis
- Why do we need profiling?
  - Adapting the system to the application is essential for finding an efficient solution.
- Video coding


MEMTRACE profiler
- MEMTRACE delivers cycle-accurate profiling results at the C function level.
- The results include clock cycles, various memory access statistics, and, optionally, an energy consumption estimate for reduced instruction set computer (RISC)-based processors.
- The focus is on memory access analysis, because for data-intensive applications this aspect has high potential for increasing system efficiency.

MEMTRACE profiling toolflow

MEMTRACE -- Initialization

MEMTRACE – Performance Analysis

MEMTRACE – Post Processing

MEMTRACE backend

MEMTRACE -- Profiling data acquisition

- init()
  - Initializes the profiler.
  - Creates a list of all functions and global variables.
- nextInstruction()
  - Checks whether program execution has changed from one function to another.
  - If so, the cycle count of the previous function is recalculated and the call count of the new function is incremented.
- memoryAccess()
  - Determines whether a load or a store access was performed and which bit width (8, 16, or 32 bit) was used.

MEMTRACE -- Profiling data acquisition
- busActivity()
  - Identifies the bus status (idle cycle, core access, or DMA access) and increments the appropriate counter of the current function.
- cacheMiss()
  - Is called each time a cache miss occurs.
- finish()
  - Is called when the instruction set simulator (ISS) terminates the simulation.
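
The callbacks listed on these slides can be sketched as a small Python class. This is an illustration only, not the MEMTRACE backend: the class name `ProfilerSketch`, all method signatures, the counter names, and the idea that the simulator passes the current cycle number are assumptions made for this example.

```python
from collections import Counter

class ProfilerSketch:
    """Hypothetical per-function statistics collector driven by simulator callbacks."""

    def init(self, function_names):
        # One statistics record per known function.
        self.stats = {name: Counter() for name in function_names}
        self.current = None
        self.entered_at = 0

    def next_instruction(self, function_name, cycle):
        # Act only when execution moves from one function to another.
        if function_name != self.current:
            self._close(cycle)
            self.current = function_name
            self.entered_at = cycle
            # Simplification: a return into a caller is counted like a call.
            self.stats[function_name]["calls"] += 1

    def memory_access(self, is_store, bits):
        # Record load/store and access width (8, 16, or 32 bit).
        kind = "store" if is_store else "load"
        self.stats[self.current][f"{kind}{bits}"] += 1

    def bus_activity(self, status):
        # status is assumed to be one of "idle", "core", "dma".
        self.stats[self.current][f"bus_{status}"] += 1

    def cache_miss(self):
        self.stats[self.current]["cache_misses"] += 1

    def finish(self, cycle):
        # Called when the simulation ends: close the last function.
        self._close(cycle)
        return self.stats

    def _close(self, cycle):
        # Charge the cycles since entry to the function being left.
        if self.current is not None:
            self.stats[self.current]["cycles"] += cycle - self.entered_at

# Hypothetical event stream, as a simulator might deliver it.
p = ProfilerSketch()
p.init(["main", "dct"])
p.next_instruction("main", 0)
p.next_instruction("dct", 2)
p.memory_access(False, 32)
p.cache_miss()
p.bus_activity("core")
p.next_instruction("main", 10)
result = p.finish(12)
```

A real profiler would have to distinguish calls from returns, for example by tracking a call stack; the sketch only shows how per-function counters can be kept current from instruction-level events.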

Processor model generator

Interconnection

What can we do with the results of the MEMTRACE profiler?


System partitioning
- Computationally intensive functions are well suited for hardware acceleration in a coprocessor.
- Control-intensive functions are better suited for software implementation on ASIPs (application-specific instruction set processors).

Software Optimization
- Loop unrolling
- For computationally intensive parts, arithmetic optimizations or SIMD instructions can be applied if the processor provides such instructions.
- Video applications
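
Loop unrolling can be illustrated with a sum of absolute differences (SAD), a typical inner loop in video coding. The sketch below is an assumption-laden illustration, not code from the presentation: the function names and the unroll factor of 4 are chosen for the example, and Python is used only to show the transformation. The actual gain appears in compiled code, where fewer loop-control instructions are executed per element.

```python
def sad_rolled(a, b):
    """Reference version: one element per loop iteration."""
    total = 0
    for i in range(len(a)):
        total += abs(a[i] - b[i])
    return total

def sad_unrolled4(a, b):
    """Same computation, unrolled by a factor of 4 (hypothetical choice)."""
    n = len(a)
    total = 0
    i = 0
    # Main loop: four elements per iteration, one loop test per four elements.
    while i + 4 <= n:
        total += (abs(a[i] - b[i]) + abs(a[i + 1] - b[i + 1])
                  + abs(a[i + 2] - b[i + 2]) + abs(a[i + 3] - b[i + 3]))
        i += 4
    # Remainder loop: handles lengths that are not a multiple of 4.
    while i < n:
        total += abs(a[i] - b[i])
        i += 1
    return total

row_a = [10, 20, 30, 40, 50, 60, 70]
row_b = [12, 18, 30, 45, 50, 61, 64]
```

Both functions return the same value for any pair of equal-length rows; the remainder loop is what keeps the unrolled version correct when the row length is not a multiple of the unroll factor.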

Hardware Optimization
- Memory Subsystem Optimizations
  - External memory
  - Cache (cache misses)
    - The data areas with the most cache misses and the smallest size should be stored in on-chip memory.
  - SRAM
- Instruction Set Architecture Optimizations
  - Frequently used instructions should be considered as targets for optimization during processor architecture development.
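
The rule for on-chip memory placement can be sketched as a simple greedy selection. The slide only states the criterion (many cache misses, small size); ranking by cache misses per byte and filling a fixed on-chip budget is one possible reading, assumed here for illustration. The function name `choose_on_chip`, the data area names, and all numbers are hypothetical.

```python
def choose_on_chip(areas, capacity_bytes):
    """Pick data areas for on-chip memory, highest cache misses per byte first."""
    ranked = sorted(areas, key=lambda a: a["misses"] / a["size"], reverse=True)
    chosen, used = [], 0
    for area in ranked:
        # Take an area only if it still fits into the remaining budget.
        if used + area["size"] <= capacity_bytes:
            chosen.append(area["name"])
            used += area["size"]
    return chosen, used

# Hypothetical per-area profiling results (size in bytes, cache miss count).
areas = [
    {"name": "frame_buffer", "size": 152064, "misses": 90000},
    {"name": "coeff_table", "size": 1024, "misses": 40000},
    {"name": "search_window", "size": 4096, "misses": 60000},
    {"name": "bitstream_buf", "size": 8192, "misses": 2000},
]

chosen, used = choose_on_chip(areas, capacity_bytes=8192)
```

With these made-up numbers the small, frequently missing tables are selected, while the large frame buffer stays in external memory even though it has the highest absolute miss count.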

Conclusion
- Profiling and system analysis
- MEMTRACE architecture
  - Initialization
  - Performance analysis
  - Post processing
- Hardware/Software optimization
  - Software
  - Hardware

Lu Hao
Any questions?

References
[1] H. Hübert and B. Stabernack, "Profiling-based hardware/software co-exploration for the design of video coding architectures," IEEE Transactions on Circuits and Systems for Video Technology, 2009, Pages:
[2] ST Microelectronics: Nomadik STn8820 Mobile Multimedia Application Processor (2008, Feb.). Data brief. [Online]. Available:
[3] Broadcom: BCM2820 Low Power, High Performance Application Processor (2006, Sep.). Product brief. [Online]. Available:
[4] G. de Micheli and L. Benini, Networks on Chips. San Francisco, CA: Morgan Kaufmann,
[5] H. Hübert, "MEMTRACE: A memory, performance and energy profiler targeting RISC-based embedded systems for data-intensive applications," Ph.D. dissertation, Dept. Elect. Eng. Comput. Sci., Tech. Univ. Berlin, Germany, [Online]. Available: