Evaluating Performance and Power of Object-oriented vs. Procedural Programming in Embedded Processors A. Chatzigeorgiou, G. Stephanides Department of Applied Informatics University of Macedonia, Greece

Motivation

Widespread application of embedded systems
Low-power requirements for portable systems:
- Battery lifetime
- Integration scale
- Cooling/reliability issues
Challenge: increased performance → increased power
Existing low-level tools for energy estimation: processor power, memory power

Does Software Affect Power Consumption?

Until recently, power reduction was the goal of hardware optimizations (transistor sizing, supply voltage reduction, etc.)
Tiwari (1994, 1996) showed that software has a significant impact on the energy consumption of the underlying hardware, and that this impact can be measured
Software addresses higher levels of the design hierarchy, where the potential energy savings are larger
Moreover, for software there is no tradeoff between performance and power: fewer instructions lead to reduced power

Sources of Power Consumption

Power dissipation in digital systems is due to the charging/discharging of node capacitances.
Dynamic power:

P_dyn = α · C_L · V_dd² · f

where α is the switching activity, C_L the switched node capacitance, V_dd the supply voltage, and f the clock frequency.

Sources of Power Consumption

Sources of power consumption in an embedded system:
- Instruction-level power consumption (power consumed during processor operation)
- Instruction and data memories (power consumed when accessing the memories)
- Interconnect switching (power consumed when bus lines change state)

Instruction Level Power Models

Instruction energy = base cost + overhead cost
- Base cost: energy consumed by an instruction executing on its own (e.g. ADD R2, R0, #1)
- Overhead cost: additional energy due to the change of circuit state between consecutive instructions (e.g. the pair ADD R2, R0, #1 / CMP R2, #0)
Energy consumption of a program = sum of base costs plus inter-instruction overheads (Tiwari et al.)
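In this model, a program's energy is the sum of per-instruction base costs plus the inter-instruction overheads of each consecutive pair. A minimal sketch of the bookkeeping follows; the cost tables are made-up placeholders, not ARM7 data:

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Sketch of the Tiwari instruction-level model:
//   E_program = sum(Base_i) over executed instructions
//             + sum(Overhead_{i,j}) over consecutive pairs (i, j)
double programEnergy(const std::vector<std::string>& trace,
                     const std::map<std::string, double>& base,
                     const std::map<std::pair<std::string, std::string>, double>& overhead) {
    double e = 0.0;
    for (size_t k = 0; k < trace.size(); ++k) {
        e += base.at(trace[k]);  // base cost of each executed instruction
        if (k > 0) {
            // circuit-state overhead of this instruction pair, if tabulated
            auto it = overhead.find({trace[k - 1], trace[k]});
            if (it != overhead.end()) e += it->second;
        }
    }
    return e;
}
```

With a two-instruction trace such as ADD followed by CMP, the result is simply base(ADD) + base(CMP) + overhead(ADD, CMP).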

Processor Energy Consumption (figure; chart label: 6-8 %)

Instruction Level Power Models (figure)

Memory Power Consumption

Energy cost of a memory access >> instruction energy
Depends on:
- number of accesses (directly proportional)
- size of memory (between linear and logarithmic)
- number of ports, power supply, technology
Instruction memory power depends on:
- code size → required memory size
- number of executed instructions → number of accesses
Data memory power depends on:
- amount of data being processed → memory size
- whether the application is data-intensive → number of accesses
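A first-order sketch of such a model: total energy proportional to the access count, with a per-access cost that grows sub-linearly with memory size. The coefficients and the square-root size term here are illustrative stand-ins for the "between linear and logarithmic" dependence, not a calibrated memory model:

```cpp
#include <cmath>

// Per-access energy as a function of memory size (illustrative):
// a constant term plus a sub-linear growth term.
double perAccessEnergy(double sizeKB, double c0, double c1) {
    return c0 + c1 * std::sqrt(sizeKB);
}

// Total memory energy: directly proportional to the number of accesses.
double memoryEnergy(long long accesses, double sizeKB) {
    const double c0 = 0.5, c1 = 0.1;  // made-up coefficients
    return static_cast<double>(accesses) * perAccessEnergy(sizeKB, c0, c1);
}
```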

OOPACK Benchmarks

Small suite of kernels that compares the relative performance of object-oriented programming in C++ versus plain C-style code.

Max: computes the maximum over a vector
Aim: to measure how well a compiler inlines a function within a conditional
C-style: performs the comparison between two elements explicitly
OOP: performs the comparison by calling an inline function
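The two styles might look as follows (an illustrative reconstruction, not the exact OOPACK source). With a good optimizer, both loops should compile to identical code:

```cpp
// C-style: the comparison is written out explicitly.
double maxC(const double* a, int n) {
    double m = a[0];
    for (int i = 1; i < n; ++i)
        if (a[i] > m) m = a[i];
    return m;
}

// OOP-style: the comparison goes through an inline helper, which the
// compiler is expected to inline away.
inline const double& maxOf(const double& x, const double& y) {
    return x > y ? x : y;
}

double maxOOP(const double* a, int n) {
    double m = a[0];
    for (int i = 1; i < n; ++i)
        m = maxOf(m, a[i]);
    return m;
}
```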

OOPACK Benchmarks

Matrix: multiplies two matrices containing real numbers
Aim: to measure how well a compiler hoists simple invariants
C-style: the term L*i, for example, is constant for each iteration of k and should be computed as an invariant outside the k loop
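A sketch of the hand-hoisted C-style kernel over flattened L×L arrays (illustrative, not the exact benchmark code); the row offset L*i is computed once, outside the inner loops:

```cpp
// C = A * B for L x L matrices stored row-major in flat arrays.
void matmul(const double* A, const double* B, double* C, int L) {
    for (int i = 0; i < L; ++i) {
        const int rowA = L * i;  // loop-invariant row offset, hoisted by hand
        for (int j = 0; j < L; ++j) {
            double sum = 0.0;
            for (int k = 0; k < L; ++k)
                sum += A[rowA + k] * B[L * k + j];
            C[rowA + j] = sum;
        }
    }
}
```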

OOPACK Benchmarks

OOP: performs the multiplication employing member functions and operator overloading to access an element, given the row and the column.
Modern C compilers are good enough at this sort of optimization for scalars. In OOP style, however, invariants often concern members of objects; optimizers that do not peer into objects miss these opportunities.

OOPACK Benchmarks

Iterator: computes a dot product
Aim: to measure how well a compiler inlines short-lived small objects (a short-lived object should never reside in main memory; its entire lifetime should be spent in registers)
C-style: uses a common single index:
  for (int i = 0; i < N; i++) sum += A[i]*B[i];
OOP: employs iterators. Iterators are a common abstraction in OOP; although they are usually called "light-weight" objects, they may incur a high cost if compiled inefficiently. All methods of the iterator are inline and in principle correspond exactly to the C-style code.
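A minimal iterator-based version might look like this (the VecIter class is a hypothetical sketch, not the OOPACK source). Because every method is inline, a good compiler should keep the iterator entirely in registers and emit the same code as the indexed loop:

```cpp
// Tiny forward iterator over a raw double array; all methods inline.
class VecIter {
    const double* p_;
public:
    explicit VecIter(const double* p) : p_(p) {}
    double operator*() const { return *p_; }
    VecIter& operator++() { ++p_; return *this; }
};

// OOP-style dot product via iterators.
double dotOOP(const double* A, const double* B, int n) {
    double sum = 0.0;
    VecIter a(A), b(B);
    for (int i = 0; i < n; ++i, ++a, ++b)
        sum += (*a) * (*b);
    return sum;
}
```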

OOPACK Benchmarks

Complex: multiplies the elements of two arrays containing complex numbers
Aim: to measure how well a compiler eliminates temporaries
C-style: the calculation is performed by explicitly writing out the real and imaginary parts
OOP: complex addition and multiplication are done using overloaded operators
Complex numbers are a common abstraction in scientific programming. The complex arithmetic is all inlined in the OOP style, so in principle the code should run as fast as the version using explicit real and imaginary parts.

OOPACK Benchmarks

SAXPY operation: Y = Y + c*X (c is a scalar, X and Y are vectors)

Calculation employing temporaries:
  tmp1.re = c.re * X[k].re - c.im * X[k].im;
  tmp1.im = c.re * X[k].im + c.im * X[k].re;
  tmp2.re = Y[k].re + tmp1.re;
  tmp2.im = Y[k].im + tmp1.im;
  Y[k] = tmp2;

Temporaries eliminated:
  Y[k].re = Y[k].re + c.re*X[k].re - c.im*X[k].im;
  Y[k].im = Y[k].im + c.re*X[k].im + c.im*X[k].re;

Dynamically allocating and deleting temporaries causes severe performance loss for small vectors.
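The OOP version can be sketched with overloaded operators (illustrative struct, not the exact benchmark source); each `*` and `+` below yields a temporary that the compiler must eliminate to match the hand-written real/imaginary form:

```cpp
// Minimal complex value type with inline overloaded operators.
struct Cx {
    double re, im;
};

inline Cx operator+(Cx a, Cx b) { return {a.re + b.re, a.im + b.im}; }
inline Cx operator*(Cx a, Cx b) {
    return {a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re};
}

// OOP-style SAXPY: Y[k] = Y[k] + c * X[k].
// The '*' and '+' each return a temporary Cx object.
void saxpy(Cx* Y, const Cx* X, Cx c, int n) {
    for (int k = 0; k < n; ++k)
        Y[k] = Y[k] + c * X[k];
}
```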

Target Architecture

- Processing unit: ARM7 TDMI integer processor core (3-stage pipeline)
- Dedicated instruction memory (on-chip ROM)
- On-chip data memory

(block diagram: ARM7 core connected through a bus interface to ROM and RAM controllers for the instruction and data memories; memory interface signals A[31:0], D[31:0]; memories inside the chip boundary)

Experimental flow:
OOPACK benchmark → ARM STD 2.50 → ARM Debugger → trace file → profiler → code size, RAM requirements, #instructions, #memory accesses → processor energy + instruction memory energy + data memory energy (via the memory model) → total power

Results – Performance Comparison (figure)

Results – Memory Comparison (figure)

Results – Energy Comparison, in mJ (figure)

OOPACK1 – Energy Distribution, in mJ (figure)

Conclusions

- Power consumption should be taken into account in the design of an embedded system.
- OOP can result in a significant increase of both execution time and power consumption.
- If a compiler cannot optimize code to reach the level of procedural-programming performance, the number of executed instructions increases, increasing the instruction-level power consumption proportionally.
- Especially in large programs, data abstraction can lead to a large code-size increase, resulting in higher power consumption of the instruction memory.

Future Work

- Currently building an accurate energy profiler (considering cache layers, pipeline stalls)
- Compare large programs implemented following the object-oriented and the procedural programming paradigm
- Perform the comparisons for other compilers
- Identify energy-consuming programming structures and automatically convert them to energy-efficient ones