Program design and analysis: optimizing for execution time and for energy/power
© 2005 ECNU SEI, Principles of Embedded Computing System Design

Presentation transcript:

Slide 1: Program design and analysis
- Optimizing for execution time.
- Optimizing for energy/power.
- Optimizing for program size.

Slide 2: Motivation (P.186)
- Embedded systems must often meet deadlines.
  - Faster may not be fast enough.
- Need to be able to analyze execution time.
  - Worst-case, not typical.
- Need techniques for reliably improving execution time.

Slide 3: Run times will vary (P.186)
- Program execution times depend on several factors:
  - Input data values.
  - State of the instruction and data caches.
  - Pipelining effects.

Slide 4: Measuring program speed
- CPU simulator.
  - I/O may be hard.
  - May not be totally accurate.
- Hardware timer.
  - Requires board, instrumented program.
- Logic analyzer.
  - Limited logic analyzer memory size.

Slide 5: Program performance metrics
- Average-case:
  - Execution time for typical data values, whatever they are.
- Worst-case:
  - Longest execution time over any possible input set.
- Best-case:
  - Shortest execution time over any possible input set.
- Too-fast programs may cause critical races at system level.

Slide 6: What data values?
- What values create worst/average/best case behavior?
  - analysis;
  - experimentation.
- Concerns:
  - operations;
  - program paths.

Slide 7: Performance analysis (P.187)
- Elements of program performance:
  - execution time = program path + instruction timing
  - Path depends on data values. Choose which case you are interested in.
  - Instruction timing depends on pipelining, cache behavior.
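
One way to read "execution time = program path + instruction timing" is as a sum over the blocks on the chosen path of (times executed) x (cycles per execution). The sketch below assumes exactly that simplified model; `block_cost` and `estimate_cycles` are hypothetical names, and the fixed per-block cycle cost ignores the pipeline and cache interactions the slides warn about.

```c
#include <stddef.h>

/* Hypothetical record: one basic block on the chosen program path. */
struct block_cost {
    unsigned long exec_count;  /* how many times the path executes the block */
    unsigned long cycles;      /* assumed fixed cycle cost per execution */
};

/* Estimate total cycles as the sum of exec_count * cycles over the path.
   Assumes block timings are independent, which real pipelines and caches
   do not guarantee. */
unsigned long estimate_cycles(const struct block_cost *blocks, size_t n)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++)
        total += blocks[i].exec_count * blocks[i].cycles;
    return total;
}
```

The path supplies the execution counts and the instruction timing supplies the cycle costs, so changing the input data case changes only the counts.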

Slide 8: Programs and performance analysis
- Best results come from analyzing optimized instructions, not high-level language code:
  - non-obvious translations of HLL statements into instructions;
  - code may move;
  - cache effects are hard to predict.

Slide 9: Program paths (P.188)
- Consider for loop:

    for (i=0, f=0; i<N; i++)
        f = f + c[i]*x[i];

- Loop initiation block executed once.
- Loop test executed N+1 times.
- Loop body and variable update executed N times.
[Figure: flowchart of the loop with four blocks: initialization (i=0; f=0;), test (i<N, with Y and N branches), body (f = f + c[i]*x[i];), update (i = i+1;).]
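
The execution counts stated on this slide can be checked by instrumenting the loop. The following is a minimal sketch, not the textbook's code: `loop_counts` and `counted_dot` are hypothetical names, and the for loop is rewritten as a while loop only so that each part can be counted separately.

```c
/* Hypothetical counters recording how often each part of the loop runs. */
struct loop_counts {
    int init;    /* loop initiation block */
    int test;    /* loop test (i < n) */
    int body;    /* loop body */
    int update;  /* variable update (i = i + 1) */
};

/* Same computation as the slide's loop, with a counter per loop part. */
int counted_dot(const int *c, const int *x, int n, struct loop_counts *k)
{
    int i, f;
    k->init = k->test = k->body = k->update = 0;

    i = 0; f = 0;                    /* initialization */
    k->init++;
    while (k->test++, i < n) {       /* count the test, then evaluate it */
        f = f + c[i] * x[i];         /* body */
        k->body++;
        i = i + 1;                   /* update */
        k->update++;
    }
    return f;
}
```

For n = 3 the counters end at init = 1, test = 4, body = 3, update = 3, matching the "once, N+1, N" pattern above.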

Slide 10: Instruction timing (P.189)
- Not all instructions take the same amount of time.
  - Hard to get execution time data for instructions.
- Instruction execution times are not independent.
- Execution time may depend on operand values.

Slide 11: Trace-driven performance analysis (P.189)
- Trace: a record of the execution path of a program.
- Trace gives execution path for performance analysis.
- A useful trace:
  - requires proper input values;
  - is large (gigabytes).
Reference: Rotenberg, E.; Jacobson, Q.; Sazeides, Y.; Smith, J., "Trace processors," Proceedings of the Thirtieth Annual IEEE/ACM International Symposium on Microarchitecture, 1-3 Dec 1997, Page(s):

Slide 12: Trace generation (P.190)
- Hardware capture:
  - logic analyzer;
  - hardware assist in CPU.
- Software:
  - PC sampling.
  - Instrumentation instructions.
  - Simulation.

Slide 13: Trace scheduling
Trace scheduling: the most likely execution path (the trace) is found, and its basic blocks are merged into one block for scheduling. Bookkeeping code is required on the other paths to ensure correctness.

Slide 14: Loop optimizations (P.191)
- Loops are good targets for optimization.
- Basic loop optimizations:
  - code motion;
  - induction-variable elimination;
  - strength reduction (e.g., x*2 becomes x<<1).
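
The strength-reduction example on this slide can be written out as two equivalent functions. This is only an illustration: the function names are hypothetical, unsigned operands are an assumption made so that both forms wrap identically on overflow, and many compilers apply this rewrite on their own.

```c
/* Original form: multiply by two. */
unsigned times_two_mul(unsigned x)
{
    return x * 2u;
}

/* Strength-reduced form: a left shift by one replaces the multiply. */
unsigned times_two_shift(unsigned x)
{
    return x << 1;
}
```

Both functions return the same value for every unsigned input, so the cheaper shift can stand in for the multiply.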

Slide 15: Code motion

    for (i=0; i<N*M; i++)
        z[i] = a[i] + b[i];

[Figure: two loop flowcharts. Before: i=0; test i<N*M (Y/N); body z[i] = a[i] + b[i]; update i = i+1. After: i=0; X = N*M computed once before the loop; test i<X.]
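
The before/after flowcharts on this slide correspond roughly to the two functions below. This is a sketch under assumptions: the function names and the `limit` variable (the slide calls it X) are illustrative, and an optimizing compiler may well hoist the product by itself.

```c
/* Before: the bound n*m sits in the loop test, so straightforwardly
   translated code recomputes it on every iteration. */
void add_arrays_before(int *z, const int *a, const int *b, int n, int m)
{
    for (int i = 0; i < n * m; i++)
        z[i] = a[i] + b[i];
}

/* After code motion: the loop-invariant product is computed once,
   outside the loop, and the test compares against the saved value. */
void add_arrays_after(int *z, const int *a, const int *b, int n, int m)
{
    int limit = n * m;   /* X = N*M in the slide's flowchart */
    for (int i = 0; i < limit; i++)
        z[i] = a[i] + b[i];
}
```

The move is legal because neither n nor m changes inside the loop.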

Slide 16: Induction variable elimination
- Induction variable: loop index.
- Consider loop:

    for (i=0; i<N; i++)
        for (j=0; j<M; j++)
            z[i][j] = b[i][j];

- Rather than recompute i*M+j for each array in each iteration, share induction variable between arrays, increment at end of loop body.
Cf. P.192
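
To make the address arithmetic visible, the sketch below stores both arrays as flat row-major buffers, which is an assumption of this example rather than something the slide states. `copy_before`, `copy_after`, and the shared index `zb` are hypothetical names.

```c
/* Before: the offset i*m + j is recomputed for each array on every
   iteration of the inner loop. */
void copy_before(int *z, const int *b, int n, int m)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            z[i * m + j] = b[i * m + j];
}

/* After: one induction variable is shared by both arrays and simply
   incremented at the end of the loop body. */
void copy_after(int *z, const int *b, int n, int m)
{
    int zb = 0;                      /* shared induction variable */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++) {
            z[zb] = b[zb];
            zb++;                    /* replaces recomputing i*m + j */
        }
}
```

The multiply and add per access disappear because zb always equals i*m + j at the point of use.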

Slide 17: Cache analysis
- Loop nest: set of loops, one inside the other.
  - Rewrite the loop nest to change the order of array accesses.
- Perfect loop nest: no conditionals in nest.
- Because loops use large quantities of data, cache conflicts are common.
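
One common way to change the access order of a loop nest is loop interchange, shown here as an illustration; the slide does not name a specific transformation. The array dimensions and function names are hypothetical. C stores two-dimensional arrays row by row, so the second version walks memory sequentially.

```c
#define ROWS 4
#define COLS 8

/* Column-by-column traversal: consecutive accesses are COLS elements
   apart in memory, which tends to use the cache poorly. */
long sum_column_order(int a[ROWS][COLS])
{
    long sum = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += a[i][j];
    return sum;
}

/* Interchanged nest: row-by-row traversal touches adjacent elements,
   matching C's row-major layout. */
long sum_row_order(int a[ROWS][COLS])
{
    long sum = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += a[i][j];
    return sum;
}
```

Both functions compute the same sum; only the order in which memory is visited differs, and any speed difference depends on the target's cache.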

Slide 18: Array conflicts in cache (P.194)
[Figure: a[0][0] and b[0][0] in main memory and their mapping into the cache; labels: main memory, cache, pad.]

Slide 19: Array conflicts, cont'd.
- Array elements conflict because they are in the same line, even if not mapped to the same location.
- Solutions:
  - move one array;
  - pad array.
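
The padding solution can be sketched with a simple direct-mapped cache model. Everything numeric here is an assumption for illustration: 4-byte elements, a 32-byte line, and 128 lines (a 4 KB cache). The struct and function names are hypothetical, and a real pad size must be chosen from the target's actual cache geometry.

```c
#include <stddef.h>
#include <stdint.h>

#define ELEMS    1024   /* hypothetical array length: 4096 bytes per array */
#define PAD_INTS 8      /* hypothetical pad: 32 bytes, one assumed cache line */

/* Without padding, b starts exactly one cache size after a. */
struct unpadded {
    int32_t a[ELEMS];
    int32_t b[ELEMS];
};

/* With padding, b is shifted by one line relative to a. */
struct padded {
    int32_t a[ELEMS];
    int32_t pad[PAD_INTS];
    int32_t b[ELEMS];
};

/* Line index that a byte offset maps to in a direct-mapped cache
   (assumed model: index = (offset / line size) mod number of lines). */
size_t cache_index(size_t byte_offset, size_t line_bytes, size_t num_lines)
{
    return (byte_offset / line_bytes) % num_lines;
}
```

Under this model a[0] and b[0] of the unpadded layout map to the same line index, while in the padded layout they map to different ones.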

Slide 20: Performance optimization hints
- Use registers efficiently.
- Use page mode memory accesses.
- Analyze cache behavior:
  - instruction conflicts can be handled by rewriting code, rescheduling;
  - conflicting scalar data can easily be moved;
  - conflicting array data can be moved, padded.

Slide 21: Energy/power optimization (P.195)
- Energy: ability to do work.
  - Most important in battery-powered systems.
- Power: energy per unit time.
  - Important even in wall-plug systems: power becomes heat.

Slide 22: Measuring energy consumption
- Execute a small loop, measure current:

    while (TRUE)
        a();

[Figure: the current I drawn by the CPU is measured.]

Slide 23: Sources of energy consumption
- Relative energy per operation (Catthoor et al.):
  - memory transfer: 33
  - external I/O: 10
  - SRAM write: 9
  - SRAM read: 4.4
  - multiply: 3.6
  - add: 1
Cf. Fig. 5-26, P.196
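
The relative figures above can be combined with operation counts to compare code variants. The sketch below assumes a purely additive model (total = sum of count x relative energy), which is a simplification; the enum, table, and function names are hypothetical, and only the six weights come from the slide.

```c
/* Operation kinds, in the order listed on the slide. */
enum op_kind {
    OP_MEM_TRANSFER,
    OP_EXT_IO,
    OP_SRAM_WRITE,
    OP_SRAM_READ,
    OP_MULTIPLY,
    OP_ADD,
    OP_KIND_COUNT
};

/* Relative energy per operation, with add = 1 as the unit. */
static const double relative_energy[OP_KIND_COUNT] = {
    33.0, 10.0, 9.0, 4.4, 3.6, 1.0
};

/* Weighted sum of operation counts; the result is in "add" units,
   not joules. */
double estimate_relative_energy(const unsigned long counts[OP_KIND_COUNT])
{
    double total = 0.0;
    for (int k = 0; k < OP_KIND_COUNT; k++)
        total += (double)counts[k] * relative_energy[k];
    return total;
}
```

In this model one memory transfer costs as much as 33 adds, which is why reducing memory traffic usually matters more than trimming arithmetic.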

Slide 24: Cache behavior is important
- Energy consumption has a sweet spot as cache size changes:
  - cache too small: program thrashes, burning energy on external memory accesses;
  - cache too large: cache itself burns too much power.
Cf. Fig. 5-27, P.197
[Figure: two plots, energy versus cache size and execution time versus cache size.]

Slide 25: Optimizing for energy (P.198)
- First-order optimization:
  - high performance = low energy.
- Not many instructions trade speed for energy.

Slide 26: Optimizing for energy, cont'd.
- Use registers efficiently.
- Identify and eliminate cache conflicts.
- Use page mode memory accesses.
- Moderate loop unrolling eliminates some loop overhead instructions.
- Eliminate pipeline stalls.
- Inlining procedures may help: reduces linkage, but may increase cache thrashing.
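
Moderate loop unrolling can be illustrated on the dot-product loop used earlier in the deck. This is a sketch with an assumed unroll factor of 4 and hypothetical function names; the best factor depends on the target, since larger bodies increase code size and may hurt the instruction cache.

```c
/* Rolled loop: one test and one index update per element. */
int dot_rolled(const int *c, const int *x, int n)
{
    int f = 0;
    for (int i = 0; i < n; i++)
        f += c[i] * x[i];
    return f;
}

/* Unrolled by 4: one test and one index update per four elements,
   followed by a cleanup loop for the leftover 0 to 3 elements. */
int dot_unrolled4(const int *c, const int *x, int n)
{
    int f = 0;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        f += c[i]     * x[i];
        f += c[i + 1] * x[i + 1];
        f += c[i + 2] * x[i + 2];
        f += c[i + 3] * x[i + 3];
    }
    for (; i < n; i++)               /* remainder iterations */
        f += c[i] * x[i];
    return f;
}
```

The unrolled version executes roughly a quarter of the loop tests and updates while producing the same result.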

Slide 27: Optimizing for program size
- Goal:
  - reduce hardware cost of memory;
  - reduce power consumption of memory units.
- Two opportunities:
  - data;
  - instructions.

Slide 28: Data size minimization
- Reuse constants, variables, data buffers in different parts of code.
  - Requires careful verification of correctness.
  - Eliminates copies of data.
- Generate data using instructions.
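
"Generate data using instructions" can be illustrated by replacing a small constant table with a computation. The bit-mask table below is a hypothetical example chosen for this sketch, not one taken from the slides, and whether the computed form is actually smaller depends on the target's instruction encoding.

```c
#include <stdint.h>

/* Stored version: eight bytes of constant data kept in memory. */
static const uint8_t bit_mask_table[8] = {
    0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80
};

/* Look the mask up in the table; bit must be in the range 0..7. */
uint8_t mask_from_table(unsigned bit)
{
    return bit_mask_table[bit];
}

/* Generated version: the same value is produced by a shift, so no
   table needs to be stored; bit must be in the range 0..7. */
uint8_t mask_from_shift(unsigned bit)
{
    return (uint8_t)(1u << bit);
}
```

The two functions agree for every valid input, so the table can be dropped wherever the shift is cheap enough.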

Slide 29: Reducing code size
- Avoid function inlining.
- Choose CPU with compact instructions.
  - ARM Thumb
  - MIPS-16
  - Variable-length instructions
- Use specialized instructions where possible.
  - RPTS/RPTB
- Code compression.
(Slide annotation: contradiction?)

Slide 30: Code compression (P.199)
- Use statistical compression to reduce code size, decompress on-the-fly.
[Figure: block diagram with CPU, decompressor (with table), cache, and main memory; example instruction LDR r0,[r4].]