© 2000 Morgan Kaufman Overheads for Computers as Components

Program design and analysis
- Optimizing for execution time.
- Optimizing for energy/power.
- Optimizing for program size.

Motivation
- Embedded systems must often meet deadlines.
  - Faster may not be fast enough.
- Need to be able to analyze execution time.
  - Worst-case, not typical.
- Need techniques for reliably improving execution time.

Run times will vary
- Program execution times depend on several factors:
  - Input data values.
  - State of the instruction and data caches.
  - Pipelining effects.

Measuring program speed
- CPU simulator.
  - I/O may be hard.
  - May not be totally accurate.
- Hardware profiler/timer.
  - Requires a board and an instrumented program.
- Logic analyzer.
  - Limited logic analyzer memory depth.

Program performance metrics
- Average-case: execution time for typical data values, whatever they are.
- Worst-case: the longest execution time over any possible input set.
- Best-case: the shortest execution time over any possible input set.
- Too-fast programs may cause critical races at the system level.

Performance analysis
- Elements of program performance (Shaw):
  execution time = program path + instruction timing
- The path depends on data values; choose which case you are interested in.
- Instruction timing depends on pipelining and cache behavior.

Programs and performance analysis
- Best results come from analyzing optimized instructions, not high-level language code:
  - non-obvious translations of HLL statements into instructions;
  - code may move;
  - cache effects are hard to predict.

Program paths
- Consider the for loop:

      for (i = 0, f = 0; i < N; i++)
          f = f + c[i] * x[i];

- Loop initiation block executed once.
- Loop test executed N+1 times.
- Loop body and variable update executed N times.
- [Figure: CDFG of the loop, with initiation block (i=0; f=0), test (i<N), and body (f = f + c[i]*x[i]; i = i+1).]
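The counts above combine into a simple execution-time estimate. A minimal sketch in C, assuming per-block times t_init, t_test, and t_body (in cycles) obtained from instruction timing; the function and parameter names are illustrative, not from the slides:

    /* Hedged sketch: path-based execution-time estimate for the loop above.
       t_init, t_test, t_body are assumed per-block times in cycles. */
    long loop_time(long N, long t_init, long t_test, long t_body)
    {
        return t_init              /* initiation block: executed once */
             + (N + 1) * t_test    /* loop test: executed N+1 times   */
             + N * t_body;         /* body + update: executed N times */
    }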

Instruction timing
- Not all instructions take the same amount of time.
- Instruction execution times are not independent.
- Execution time may depend on operand values.

Trace-driven performance analysis
- Trace: a record of the execution path of a program.
- The trace gives the execution path for performance analysis.
- A useful trace:
  - requires proper input values;
  - is large (gigabytes).

Trace generation
- Hardware capture:
  - logic analyzer;
  - hardware assist in CPU.
- Software:
  - PC sampling.
  - Instrumentation instructions.
  - Simulation.
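To make the instrumentation option concrete, a hedged sketch, assuming each basic block is given an ID and logs it as it executes; trace_log, TRACE, and the block IDs are hypothetical names, not part of the slides:

    /* Hedged sketch: trace collection with instrumentation instructions.
       The logged sequence of block IDs is the execution path. */
    #include <stdio.h>

    static FILE *trace_log;

    #define TRACE(block_id) fprintf(trace_log, "%d\n", (block_id))

    void trace_open(const char *path) { trace_log = fopen(path, "w"); }
    void trace_close(void)            { fclose(trace_log); }

    /* Usage inside instrumented code:
       TRACE(0); for (i = 0; i < N; i++) { TRACE(1); body(); } TRACE(2); */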

Loop optimizations
- Loops are good targets for optimization.
- Basic loop optimizations:
  - code motion;
  - induction-variable elimination;
  - strength reduction (x*2 -> x<<1).
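A minimal sketch of the strength-reduction example, assuming x is unsigned so the multiply and the shift agree; the function names are illustrative:

    /* Strength reduction: replace a multiply by a power of two with a shift. */
    unsigned times_two_mul(unsigned x)   { return x * 2; }
    unsigned times_two_shift(unsigned x) { return x << 1; }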

Code motion

      for (i = 0; i < N*M; i++)
          z[i] = a[i] + b[i];

[Figure: CDFGs before and after code motion. Before, the loop test recomputes i < N*M each iteration; after, X = N*M is computed once in the initiation block and the test becomes i < X.]
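A hedged source-level sketch of the same transformation (a compiler would normally do this automatically; the wrapper function and parameter names are illustrative):

    /* Code motion: hoist the loop-invariant product N*M out of the loop. */
    void vector_add(int *z, const int *a, const int *b, int N, int M)
    {
        /* Before: for (int i = 0; i < N * M; i++) z[i] = a[i] + b[i];
           recomputes N*M in every loop test. */

        int X = N * M;               /* computed once */
        for (int i = 0; i < X; i++)
            z[i] = a[i] + b[i];
    }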

Induction variable elimination
- Induction variable: the loop index.
- Consider the loop:

      for (i = 0; i < N; i++)
          for (j = 0; j < M; j++)
              z[i][j] = b[i][j];

- Rather than recompute i*M+j for each array in each iteration, share the induction variable between the arrays and increment it at the end of the loop body.
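A hedged sketch of what the transformed code looks like, treating both arrays as flat storage; the wrapper function and the name k are illustrative:

    /* Induction-variable elimination: k replaces the recomputed i*M + j
       and is shared by both arrays. */
    void copy_2d(int *zbase, const int *bbase, int N, int M)
    {
        int k = 0;                       /* shared induction variable */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++) {
                zbase[k] = bbase[k];     /* instead of z[i*M+j] = b[i*M+j] */
                k++;                     /* incremented at end of the body */
            }
    }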

Cache analysis
- Loop nest: a set of loops, one inside the other.
- Perfect loop nest: no conditionals in the nest.
- Because loops use large quantities of data, cache conflicts are common.

Array conflicts in cache
[Figure: a[0,0] and b[0,0] reside at different main-memory addresses but map to the same cache location.]

Array conflicts, cont'd.
- Array elements conflict when they map to the same cache line, even though they occupy different memory locations.
- Solutions:
  - move one array;
  - pad one array.
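A hedged sketch of the padding solution, assuming a direct-mapped cache with 32-byte lines whose size evenly divides the 4 KB array size; the struct and sizes are invented for illustration:

    /* Padding: the 8-int pad (one assumed cache line) shifts b[] so that
       a[i] and b[i] no longer map to the same cache line. */
    struct padded {
        int a[1024];
        int pad[8];      /* one cache line of padding */
        int b[1024];
    } arrays;

    int product_term(int i) { return arrays.a[i] * arrays.b[i]; }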

Performance optimization hints
- Use registers efficiently.
- Use page mode memory accesses.
- Analyze cache behavior:
  - instruction conflicts can be handled by rewriting code or rescheduling;
  - conflicting scalar data can easily be moved;
  - conflicting array data can be moved or padded.

Energy/power optimization
- Energy: ability to do work.
  - Most important in battery-powered systems.
- Power: energy per unit time.
  - Important even in wall-plug systems: power becomes heat.

Measuring energy consumption
- Execute a small loop and measure the current drawn:

      while (TRUE)
          a();

[Figure: the current I supplied to the CPU is measured while the loop runs.]
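A hedged sketch of turning that measurement into per-call energy; V_supply, I_measured, t_loop, and calls_in_loop are assumed measurement values, not numbers from the slides:

    /* Energy from a current measurement: E = P * t, with P = V * I. */
    double energy_per_call(double V_supply, double I_measured,
                           double t_loop, double calls_in_loop)
    {
        double P = V_supply * I_measured;     /* average power in watts */
        return (P * t_loop) / calls_in_loop;  /* joules per call to a() */
    }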

Sources of energy consumption
- Relative energy per operation (Catthoor et al.):
  - memory transfer: 33
  - external I/O: 10
  - SRAM write: 9
  - SRAM read: 4.4
  - multiply: 3.6
  - add: 1

Cache behavior is important
- Energy consumption has a sweet spot as cache size changes:
  - cache too small: the program thrashes, burning energy on external memory accesses;
  - cache too large: the cache itself burns too much power.

Optimizing for energy
- First-order optimization: high performance = low energy.
- Not many instructions trade speed for energy.

Optimizing for energy, cont'd.
- Use registers efficiently.
- Identify and eliminate cache conflicts.
- Moderate loop unrolling eliminates some loop overhead instructions.
- Eliminate pipeline stalls.
- Inlining procedures may help: it reduces linkage overhead but may increase cache thrashing.
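A hedged sketch of moderate loop unrolling (factor 2, chosen only for illustration, and assuming N is even); it halves the number of loop tests and index updates in the earlier accumulation loop:

    /* Unroll by 2: two body copies per iteration, half the loop overhead. */
    int dot_product(const int *c, const int *x, int N)
    {
        int f = 0;
        for (int i = 0; i < N; i += 2) {
            f += c[i] * x[i];
            f += c[i + 1] * x[i + 1];
        }
        return f;
    }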

Optimizing for program size
- Goal:
  - reduce the hardware cost of memory;
  - reduce the power consumption of memory units.
- Two opportunities:
  - data;
  - instructions.

Data size minimization
- Reuse constants, variables, and data buffers in different parts of the code.
  - Requires careful verification of correctness.
- Generate data using instructions.
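As an invented illustration of generating data with instructions: a hypothetical 256-entry lookup table of squares (not from the slides) can be replaced by a small function, trading a few instructions for the stored data:

    /* Generate data with instructions instead of storing a table. */
    unsigned short square8(unsigned char i)
    {
        return (unsigned short)(i * i);   /* computed on demand, no table */
    }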

Reducing code size
- Avoid function inlining.
- Choose a CPU with compact instructions.
- Use specialized instructions where possible.

Code compression
- Use statistical compression to reduce code size and decompress on-the-fly.
[Figure: block diagram with CPU, decompressor and its table, cache, and main memory; the CPU fetches an instruction such as LDR r0,[r4].]