EC6703 EMBEDDED AND REAL TIME SYSTEMS
UNIT II EMBEDDED COMPUTING PLATFORM DESIGN
The CPU Bus – Memory devices and systems – Designing with computing platforms – Consumer electronics architecture – Platform-level performance analysis – Components for embedded programs – Models of programs – Assembly, linking and loading – Compilation techniques – Program-level performance analysis – Software performance optimization – Program-level energy and power analysis and optimization – Analysis and optimization of program size – Program validation and testing.
In this topic, we concentrate on bus-based computer systems created using microprocessors, I/O devices, and memory components. The microprocessor is an important element of the embedded computing system, but it cannot do its job without memories and I/O devices. We need to understand how to interconnect microprocessors and devices using the CPU bus.
CPU BUS
The CPU bus forms the backbone of the hardware system. A computer system encompasses much more than the CPU; it also includes memory and I/O devices. The bus is the mechanism by which the CPU communicates with memory and devices. A bus is, at a minimum, a collection of wires, but it also defines a protocol by which the CPU, memory, and devices communicate. One of the major roles of the bus is to provide an interface to memory and I/O devices.
Bus Protocols
The basic building block of most bus protocols is the four-cycle handshake. The four cycles are described below.
1. Device 1 raises its output to signal an enquiry, which tells device 2 that it should get ready to listen for data.
2. When device 2 is ready to receive, it raises its output to signal an acknowledgment. At this point, devices 1 and 2 can transmit or receive.
3. Once the data transfer is complete, device 2 lowers its output, signaling that it has received the data.
4. After seeing that ack has been released, device 1 lowers its output. At the end of the handshake, both handshaking signals are low, just as they were at the start of the handshake.
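As a concrete illustration, here is a minimal C sketch of the sending side of this handshake, assuming hypothetical memory-mapped GPIO registers (the addresses and register names below are placeholders, not any real device's map):

#include <stdint.h>

#define ENQ_OUT  (*(volatile uint8_t *)0x40000000)  /* device 1's enq line (assumed address) */
#define ACK_IN   (*(volatile uint8_t *)0x40000004)  /* device 2's ack line (assumed address) */
#define DATA_OUT (*(volatile uint8_t *)0x40000008)  /* shared data lines (assumed address)   */

void handshake_send(uint8_t value)
{
    DATA_OUT = value;        /* drive the data lines                         */
    ENQ_OUT = 1;             /* 1. raise enq: tell device 2 to listen        */
    while (ACK_IN == 0) ;    /* 2. device 2 raises ack when ready; the data
                                is transferred while both lines are high     */
    while (ACK_IN == 1) ;    /* 3. device 2 drops ack once it has the data   */
    ENQ_OUT = 0;             /* 4. seeing ack released, device 1 drops enq   */
}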
A typical microprocessor bus.
The term bus is used in two ways. The most basic use is as a set of related wires, such as address wires. However, the term may also mean a protocol for communicating between components. The fundamental bus operations are reading and writing.
The behavior of a bus is most often specified as a timing diagram.
A timing diagram shows how the signals on a bus vary over time.
Bus multiplexing
To reduce pin count, the address and data lines can share the same wires: the CPU drives the address under an address-enable signal, then the device drives the data under a data-enable signal.
DMA
Standard bus transactions require the CPU to be in the middle of every read and write transaction. However, there are certain types of data transfers in which the CPU does not need to be involved. Direct memory access (DMA) is a bus operation that allows reads and writes not controlled by the CPU. A DMA transfer is controlled by a DMA controller, which requests control of the bus from the CPU.
Bus mastership
Direct memory access (DMA) performs data transfers without executing instructions: the CPU sets up the transfer, and the DMA engine fetches and writes the data. The DMA controller is a separate unit. By default, the CPU is bus master and initiates transfers; DMA must become bus master to perform its work, and the CPU can't use the bus while DMA operates. The bus mastership protocol uses two signals: bus request and bus grant.
DMA operation
The CPU sets the DMA registers for the start address and length; the DMA status register controls the unit. Once the DMA controller is bus master, it transfers data automatically. It may run continuously until complete, or it may use every nth bus cycle. A sketch of this register setup appears below.
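This is a minimal sketch in C, assuming a hypothetical memory-mapped DMA controller; the register addresses and bit definitions below are invented for illustration, and a real part's data sheet defines the actual map:

#include <stdint.h>

#define DMA_START_ADDR (*(volatile uint32_t *)0x80000000)  /* start address register */
#define DMA_LENGTH     (*(volatile uint32_t *)0x80000004)  /* transfer length        */
#define DMA_STATUS     (*(volatile uint32_t *)0x80000008)  /* status/control         */
#define DMA_GO   0x1   /* assumed: start bit       */
#define DMA_DONE 0x2   /* assumed: completion flag */

void dma_transfer(uint32_t start, uint32_t nwords)
{
    DMA_START_ADDR = start;     /* CPU sets up start address and length */
    DMA_LENGTH = nwords;
    DMA_STATUS = DMA_GO;        /* status register starts the unit      */
    while (!(DMA_STATUS & DMA_DONE))
        ;                       /* the CPU could do useful work (or take
                                   an interrupt) instead of spinning    */
}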
Bus transfer sequence diagram
System bus configurations
Multiple busses allow parallelism: slow devices go on one bus and fast devices on a separate bus, with a bridge connecting the two. Typically the CPU and memory sit on the high-speed bus, while slow devices sit behind the bridge on the low-speed bus.
ARM AMBA bus
Since the ARM CPU is manufactured by many different vendors, the bus provided off-chip can vary from chip to chip. ARM has created a separate bus specification for single-chip systems. The AMBA bus [ARM99A] supports CPUs, memories, and peripherals integrated in a system-on-silicon. It comes in two varieties: the AHB (AMBA high-performance bus) supports pipelining, burst transfers, split transactions, and multiple bus masters; the APB (AMBA peripherals bus) is simpler, lower-speed, and lower-cost, and all devices on the APB are slaves.
Memory components
There are several types of memory: DRAM, SRAM, and flash. Each type of memory comes in varying capacities and widths.
Random-access memory
Random-access memories can be both read and written. They are called random access because, unlike magnetic disks, addresses can be read in any order. Dynamic RAM (DRAM) is dense but requires refresh. Synchronous DRAM (SDRAM) is the dominant type; it uses a clock to improve performance and pipeline memory accesses, and DDR (double data rate) SDRAM transfers data on both clock edges. Static RAM is faster and consumes more power but is less dense. For PCs, SIMMs (single in-line memory modules) and DIMMs (dual in-line memory modules) package the memory chips.
SDRAM operation
Read-only memory
ROM may be programmed at the factory. Flash is the dominant form of field-programmable ROM: it is electrically erasable but must be erased in blocks, and it supports random-access reads, although write/erase is much slower than read. NOR flash is more flexible; NAND flash is more dense.
Flash memory
Flash is non-volatile memory that can be programmed in-circuit. Reads are random access. To write, a block is first erased to all 1s, then selected bits are written to 0.
Flash writing
Writes are much slower than reads: roughly 1.6 ms to write vs. 70 ns to read. Blocks are large (approx. 1 Mb). Writing causes wear that eventually destroys the device; modern parts last approximately 1 million write cycles.
Types of flash
NOR: word-accessible reads; erased by blocks. NAND: read by pages (512-4K bytes). NAND is cheaper and has faster erase and sequential access times.
I/O DEVICES
Timers and counters, ADCs/DACs, keyboards/keypads, LEDs, displays, touchscreens.
Designing with computing platforms (microprocessors)
In this topic we see how to create an initial working embedded system and how to ensure that the system works properly: we consider possible architectures for embedded computing systems, study techniques for designing the hardware components of embedded systems, and describe the use of the PC as an embedded computing platform.
System architectures
The architecture of an embedded computing system is the blueprint for implementing that system: it tells you what components you need and how to put them together. The architecture comprises both software and hardware components, and some software is very hardware-dependent.
Hardware platform architecture
The hardware platform contains several elements: the CPU, the bus, memory, and I/O devices (networking, sensors, actuators, etc.). How big/fast must each one be?
Software architecture
The functional description must be broken into pieces, for reasons of division among people, conceptual organization, performance, testability, and maintenance. Mixing different types of functionality into a single code module leads to spaghetti code, which has poorly structured control flow, excessive use of global data, and generally unreliable programs.
Hardware and software architectures
Hardware and software are intimately related: software doesn't run without hardware, and how much hardware you need is determined by the software's requirements for speed and memory.
Evaluation boards
Evaluation boards are designed by the CPU manufacturer or others. A board includes the CPU, memory, and some I/O devices, and may include a prototyping section. The CPU manufacturer often gives out the evaluation board netlist, which can be used as a starting point for your custom board design.
Adding logic to a board
Programmable logic devices (PLDs) provide low/medium density logic. Field-programmable gate arrays (FPGAs) provide more logic and multi-level logic. Application-specific integrated circuits (ASICs) are manufactured for a single purpose.
The PC as a platform
Advantages: cheap and easy to obtain; rich and familiar software environment. Disadvantages: requires a lot of hardware resources; not well-adapted to real-time; larger, more power-hungry, and more expensive than a custom hardware platform would be.
Typical PC hardware platform
A typical PC couples the CPU over the CPU bus to memory, a DMA controller, timers, and a bus interface that bridges to high-speed and low-speed I/O busses, with an interrupt controller handling device interrupts.
■ ROM holds the boot program.
■ RAM is used for program storage.
■ PCI is the standard for high-speed interfacing (33 or 66 MHz), extended by PCI Express.
■ USB (Universal Serial Bus) and FireWire (IEEE 1394) are relatively low-cost serial interfaces with high speed.
Software elements
The IBM PC uses the BIOS (Basic I/O System) to implement low-level functions: boot-up and minimal device drivers. BIOS has become a generic term for the lowest-level system software.
Example of a single-chip system: StrongARM (SA-1100)
A StrongARM system includes: the CPU chip (3.686 MHz clock) and a system control module (32.768 kHz clock) containing a real-time clock, an operating system timer, general-purpose I/O, an interrupt controller, a power manager controller, and a reset controller.
Debugging embedded systems
Challenges: the target system may be hard to observe; the target may be hard to control; it may be hard to generate realistic inputs; the setup sequence may be complex.
Host/target design
Use a host system to prepare software for the target system; the host and target are typically connected by a serial line.
Host-based tools
Cross compiler: compiles code on the host for the target system (i.e., a cross-compiler is a compiler that runs on one type of machine but generates code for another). Cross debugger: displays target state and allows the target system to be controlled.
Software debuggers
A monitor program residing on the target provides basic debugger functions. The debugger should have a minimal footprint in memory. The user program must be careful not to destroy the debugger program, but the debugger should be able to recover from some damage caused by user code.
Breakpoints
A breakpoint allows the user to stop execution, examine system state, and change state. It is implemented by replacing the breakpointed instruction with a subroutine call to the monitor program.
ARM breakpoints

Uninstrumented code:
0x400 MUL r4,r6,r6
0x404 ADD r2,r2,r4
0x408 ADD r0,r0,#1
0x40c B loop

Code with breakpoint:
0x400 MUL r4,r6,r6
0x404 ADD r2,r2,r4
0x408 ADD r0,r0,#1
0x40c BL bkpoint
Breakpoint handler actions
Save the registers, allow the user to examine the machine, and restore system state before returning. The safest way to execute the breakpointed instruction is to replace it and execute it in place, putting another breakpoint after the replaced instruction so the original breakpoint can be restored.
In-circuit emulators
A microprocessor in-circuit emulator is a specially-instrumented microprocessor that allows you to stop execution, examine CPU state, and modify registers.
Boundary scan
Boundary scan simplifies testing of multiple chips on a board: the registers on the pins can be configured as a scan chain, which is used by debuggers and in-circuit emulators.
How to exercise code
Run on the host system. Run on the target system. Run in an instruction-level simulator. Run on a cycle-accurate simulator. Run in a hardware/software co-simulation environment.
Debugging real-time code
Bugs in drivers can cause non-deterministic behavior in the foreground program, and bugs may be timing-dependent.
CONSUMER ELECTRONICS ARCHITECTURE

Logic analyzers
A logic analyzer is an array of low-grade oscilloscopes.
Logic analyzer architecture
The analyzer can sample many different signals simultaneously (tens to hundreds) but can display only 0, 1, or changing values for each.
The logic analyzer records the values on the signals into an internal memory and then displays the results on a display once the memory is full or the run is aborted. A typical logic analyzer can acquire data in either of two modes that are typically called state and timing modes. State and timing mode represent different ways of sampling the values. Timing mode uses an internal clock that is fast enough to take several samples per clock period in a typical system. State mode, on the other hand, uses the system’s own clock to control sampling, so it samples each signal only once per clock cycle. As a result, timing mode requires more memory to store a given number of system clock cycles.
Logic analyzer - Operation
The system's data signals are sampled at a latch within the logic analyzer; the latch is controlled by either the system clock or the internal logic analyzer sampling clock, depending on whether the analyzer is being used in state or timing mode. Each sample is copied into a vector memory under the control of a state machine. The latch, timing circuitry, sample memory, and controller must be designed to run at high speed, since several samples per system clock cycle may be required in timing mode. After the sampling is complete, an embedded microprocessor takes over to control the display of the data captured in the sample memory. Logic analyzers typically provide a number of formats for viewing data; one format is the timing diagram.
PLATFORM-LEVEL PERFORMANCE ANALYSIS
System-level performance analysis
Performance depends on all the elements of the system: the CPU, the cache, the bus, the main memory, and the I/O devices. In this section, we develop some basic techniques for analyzing the performance of bus-based systems.
We want to move data from memory to the CPU to process it. To get the data from memory to the CPU we must:
■ read from the memory;
■ transfer over the bus to the cache; and
■ transfer from the cache to the CPU.
The time required to transfer from the cache to the CPU is included in the instruction execution time, but the other two times are not. The most basic measure of performance we are interested in is bandwidth, and it is set by these system-level data flows.
Bandwidth as performance
Bandwidth (the rate at which we can move data) applies to several components: memory, the bus, and CPU fetches. Different parts of the system run at different clock rates, and different components may have different widths (bus, memory). We have to make sure we apply the right clock rate to each part of the performance estimate when we convert from clock cycles to seconds. Bandwidth questions often come up when we are transferring large blocks of data, as in the following example.
Bandwidth and data transfers
Consider the bandwidth provided by only one system component, the bus, for an image of 320 x 240 pixels with each pixel composed of 3 bytes of data. One video frame is 320 x 240 x 3 = 230,400 bytes, and it must be transferred in 1/30 sec. At a transfer rate of 1 byte/us, a frame takes 230,400 us = 0.23 sec, far more than 1/30 sec: too slow. We have to increase the transfer rate by about 7 times. We can increase bandwidth in two ways: increase the clock rate of the bus (to 2 MHz) or increase the amount of data transferred per clock cycle (4 bytes); together these give an 8x improvement, which meets the requirement.
Bus bandwidth
Let T be the number of bus cycles, P the time per bus cycle, D the data payload length of one transfer, W the bus width, and O = O1 + O2 the overhead cycles before and after the payload. The total time for a transfer is t = TP, and the cycles needed to move N data units are

Tbasic(N) = (D + O) N / W
Bus burst transfer bandwidth
With the same notation, a burst transaction moves B payloads of length D per overhead O, so

Tburst(N) = (BD + O) N / (BW)
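The two formulas translate directly into code. A small sketch (the function and parameter names are ours, not from the slides):

/* cycles to move N data units with basic transfers: Tbasic(N) = (D+O)N/W */
double t_basic(double N, double D, double O, double W)
{
    return (D + O) * (N / W);
}

/* cycles with bursts of B payloads: Tburst(N) = (BD+O)N/(BW) */
double t_burst(double N, double D, double O, double B, double W)
{
    return (B * D + O) * (N / (B * W));
}

Multiply either result by P, the time per bus cycle, to get the total time t = TP in seconds.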
Memory aspect ratios
A memory of a given capacity comes in different aspect ratios; for example, a 64-Mbit part may be organized as 64M x 1, 16M x 4, or 8M x 8.
Memory access times
Memory component access times come from the chip data sheet. Page modes allow faster access for successive transfers on the same page. If data doesn't fit naturally into physical words, the number of accesses per element is A = [(E/w) mod W] + 1.
Bus performance bottlenecks
Consider transferring 320 x 240 video at 30 frames/sec = 612,000 bytes/sec between the CPU and memory. Is the performance bottleneck the bus or the memory?
Bus performance bottlenecks, cont'd.
Bus: assume a 1 MHz bus with D = 1, O = 3, W = 2: Tbasic = (1+3) x 612,000 / 2 = 1,224,000 cycles = 1.224 sec of bus time per second of video, so the bus cannot keep up. Memory: try burst mode with B = 4 and width w = 0.5: Tmem = (4x1+4) x 612,000 / (4x0.5) = 2,448,000 cycles; the corresponding time depends on the memory clock rate.
Performance spreadsheet
Parallelism
Speed things up by running several units at once. DMA provides parallelism if the CPU doesn't need the bus: the DMA engine and the bus carry out the transfer while the CPU keeps computing.
Components for embedded programs
In this section, we study in detail the process of programming embedded processors. The creation of embedded programs is at the heart of embedded system design. Embedded code must not only provide rich functionality, it must also often run at a required rate to meet system deadlines, fit into the allowed amount of memory, and meet power consumption requirements. Designing code that simultaneously meets multiple design constraints is a considerable challenge, but luckily there are techniques and tools that we can use to help us through the design process. We consider code for three structures or components that are commonly used in embedded software: the state machine, the circular buffer, and the queue. State machines are well suited to reactive systems such as user interfaces; circular buffers and queues are useful in digital signal processing.
Software state machine
When inputs appear intermittently rather than as periodic samples, it is often convenient to think of the system as reacting to those inputs. The reaction of most systems can be characterized in terms of the input received and the current state of the system, which leads naturally to a finite-state machine. A state machine keeps its internal state as a variable and changes state based on inputs. Uses: control-dominated code; reactive systems.
State machine example (seat belt controller)
The controller's job is to turn on a buzzer if a person sits in a seat and does not fasten the seat belt within a fixed amount of time. This system has three inputs and one output. The inputs are a sensor for the seat to know when a person has sat down, a seat belt sensor that tells when the belt is fastened, and a timer that goes off when the required time interval has elapsed. The output is the buzzer. The state diagram has four states (idle, seated, belted, buzzer); for example, seat/timer on moves idle to seated, belt/- moves seated to belted, no belt and timer/buzzer on moves seated to buzzer, and belt/buzzer off moves buzzer back to belted.
C implementation

#define IDLE 0
#define SEATED 1
#define BELTED 2
#define BUZZER 3

switch (state) {
case IDLE:
    if (seat) {
        state = SEATED;
        timer_on = TRUE;
    }
    break;
case SEATED:
    if (belt)
        state = BELTED;
    else if (timer)
        state = BUZZER;
    break;
…
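The slide elides the remaining cases; one plausible completion, following the state diagram (buzzer_on is an assumed output flag, analogous to timer_on):

case BELTED:
    if (!seat) state = IDLE;          /* person left the seat       */
    else if (!belt) {                 /* belt was unfastened again  */
        state = SEATED;
        timer_on = TRUE;
    }
    break;
case BUZZER:
    if (belt) {
        state = BELTED;               /* belt fastened: stop buzzer */
        buzzer_on = FALSE;
    } else if (!seat) {
        state = IDLE;                 /* seat vacated: stop buzzer  */
        buzzer_on = FALSE;
    }
    break;
}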
Circular buffer
The circular buffer is a data structure that lets us handle streaming data in an efficient way. It is commonly used in signal processing, where new data constantly arrives and each datum has a limited lifetime. Use a circular buffer to hold the data stream. Example: an FIR filter, which for each sample must emit one output that depends on the values of the last n inputs.
As the data stream x1, x2, x3, … arrives, an n-entry circular buffer holds a window of the most recent samples at each time t1, t2, t3, with each new sample overwriting the oldest.
Indexes locate the currently used data and the current input position: between time t1 and time t1+1, the input index advances around the buffer and the new datum replaces the oldest one.
Circular buffer implementation: FIR filter

int circ_buffer[N], circ_buffer_head = 0;
int c[N]; /* coefficients */
…
int f, ibuf, ic;
/* sum over the last N samples, starting at the head and wrapping */
for (f = 0, ibuf = circ_buffer_head, ic = 0;
     ic < N;
     ibuf = (ibuf == N-1 ? 0 : ibuf + 1), ic++)
    f = f + c[ic] * circ_buffer[ibuf];
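The filter loop only reads the buffer; adding each new sample is a one-line update. A sketch using the same names as above (the function itself is ours, not from the slides):

/* overwrite the oldest sample and advance the head, wrapping at N */
void circ_buffer_add(int new_sample)
{
    circ_buffer[circ_buffer_head] = new_sample;
    circ_buffer_head = (circ_buffer_head == N-1 ? 0 : circ_buffer_head + 1);
}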
Queues
Queues are also used in signal processing and event processing. Queues are used whenever data may arrive and depart at somewhat unpredictable times, or when variable amounts of data may arrive. A queue is often referred to as an elastic buffer, which holds data that arrives irregularly. One way to build a queue is with a linked list, which allows the queue to grow to an arbitrary size. Another way is to use an array to hold all the data.
Buffer-based queues (to manage interrupt-driven data)

#define Q_SIZE 32
#define Q_MAX (Q_SIZE-1)
int q[Q_SIZE], head, tail;   /* Q_SIZE entries: indices 0..Q_MAX */

void initialize_queue() { head = tail = 0; }

void enqueue(int val) {
    if (((tail+1) % Q_SIZE) == head) error();   /* queue full */
    q[tail] = val;
    if (tail == Q_MAX) tail = 0; else tail++;
}

int dequeue() {
    int returnval;
    if (head == tail) error();                  /* queue empty */
    returnval = q[head];
    if (head == Q_MAX) head = 0; else head++;
    return returnval;
}
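A sketch of the interrupt-driven usage the title refers to: the interrupt handler produces into the queue and the foreground loop consumes from it (input_isr, read_device, and process are hypothetical names, not part of the slides):

void input_isr(void)                /* runs when the device has data */
{
    enqueue(read_device());         /* hypothetical device read      */
}

void foreground(void)
{
    for (;;) {
        while (head != tail)        /* queue not empty               */
            process(dequeue());     /* hypothetical consumer         */
    }
}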
Models of programs
In this section, we develop models for programs that are more general than source code such as assembly language or C. Source code is not a good representation for programs: it is clumsy and leaves much information implicit. Compilers derive intermediate representations to manipulate and optimize the program. Our fundamental model for programs is the control/data flow graph (CDFG).
Data flow graph
A DFG (data flow graph) does not represent control. It models a basic block: straight-line code with a single entry and a single exit. It describes the minimal ordering requirements on operations.
Single assignment form

Original basic block:
x = a + b;
y = c - d;
z = x * y;
y = b + d;

Single assignment form:
x = a + b;
y = c - d;
z = x * y;
y1 = b + d;
Data flow graph
The single-assignment code above maps to a DFG whose nodes are the operators (+, -, *): the values a, b, c, d flow into x = a+b, y = c-d, and y1 = b+d, and x and y flow into z = x*y.
DFGs and partial orders
The DFG defines only a partial order: a+b, c-d, and b+d can be done in any order (or in parallel), but x*y must wait until both x and y have been computed.
Control-data flow graph
A CDFG represents both control and data. It uses data flow graphs as components and has two types of nodes: decision nodes and data flow nodes.
Data flow node
A data flow node encapsulates a data flow graph; we write its operations in basic block form for simplicity:

x = a + b;
y = c + d;
Control node
A decision node tests a condition: a binary test with T/F branches and a multiway test that branches on a value (v1 … v4) are equivalent forms.
CDFG example

if (cond1)
    bb1();
else
    bb2();
bb3();
switch (test1) {
case c1: bb4(); break;
case c2: bb5(); break;
case c3: bb6(); break;
}

In the CDFG, a decision node on cond1 selects bb1() or bb2(), both paths flow into bb3(), and a decision node on test1 then selects bb4(), bb5(), or bb6().
for loop

for (i = 0; i < N; i++)
    loop_body();

is equivalent to

i = 0;
while (i < N) {
    loop_body();
    i++;
}

The CDFG initializes i = 0 and tests i < N at a decision node: on T it executes loop_body() and increments i before retesting; on F it exits the loop.
Assembly, linking, and loading
Assembly and linking are the last steps in the compilation process: they turn a list of instructions into an image of the program's bits in memory. Loading actually puts the program in memory so that it can be executed. The tool chain flows: HLL -> compiler -> assembly -> assembler -> object code -> linker -> executable binary -> loader.
As this tool flow shows, most compilers do not directly generate machine code, but instead create the instruction-level program in the form of human-readable assembly language. The assembler's job is to translate symbolic assembly language statements into bit-level representations of instructions known as object code. A linker allows a program to be stitched together out of several smaller pieces: it operates on the object files created by the assembler and modifies the assembled code to make the necessary links between files, producing an executable binary file. That file may not necessarily be located in the CPU's memory, however, unless the linker happens to create the executable directly in RAM. The program that brings the program into memory for execution is called a loader.
Assemblers
Major tasks: generate binary for symbolic instructions; translate labels into addresses; handle pseudo-ops (data, etc.). Assembly is generally a one-to-one translation. Assembly labels:

        ORG 100
label1  ADR r4,c
Pseudo-operations
Pseudo-ops do not generate instructions: ORG sets the program location; EQU generates a symbol table entry without advancing the PLC (program location counter); data statements define data blocks.
Linking
Linking combines several object modules into a single executable module. Its jobs: put modules in order; resolve labels across modules.
Dynamic linking
Some operating systems link modules dynamically at run time: this shares one copy of a library among all executing programs and allows programs to be updated with new versions of libraries.
COMPILATION TECHNIQUES
It is useful to understand how a high-level language program is translated into instructions. Understanding how the compiler works can help you know when you cannot rely on it, and, because many applications are performance sensitive, understanding how code is generated can help you meet your performance goals, either by writing high-level code that gets compiled into the instructions you want or by recognizing when you must write your own assembly code. Compilation combines translation and optimization.
Compilation
The compiler determines the quality of the code: use of CPU resources; memory access scheduling; code size.
Basic compilation phases
HLL -> parsing and symbol table generation -> machine-independent optimizations -> machine-dependent optimizations (instruction-level optimization and code generation) -> assembly. The high-level language program is parsed to break it into statements and expressions, and a symbol table is generated that includes all the named objects in the program. Simplifying arithmetic expressions is one example of a machine-independent optimization.
Statement translation and optimization
Source code is translated into an intermediate form such as a CDFG. The CDFG is transformed/optimized and then translated into instructions, with optimization decisions made along the way; the instructions are further optimized.
Compiling arithmetic expressions
Consider the expression a*b + 5*(c-d). Its DFG introduces the temporary variables W, X, Y, Z: W = a*b, X = c-d, Y = 5*X, Z = W+Y.
Compilation of arithmetic expressions, cont'd.
Walking the DFG node by node yields code such as:

ADR r4,a      ; node 1: W = a*b
MOV r1,[r4]
ADR r4,b
MOV r2,[r4]
MUL r3,r1,r2

ADR r4,c      ; node 2: X = c-d
MOV r1,[r4]
ADR r4,d
MOV r5,[r4]
SUB r6,r1,r5

MOV r2,#5     ; node 3: Y = 5*X (ARM MUL takes registers, not immediates)
MUL r7,r6,r2

ADD r8,r7,r3  ; node 4: Z = W+Y
Control code generation
Similarly, for control code:

if (a+b > 0)
    x = 5;
else
    x = 7;
Control code generation, cont'd.

        ADR r5,a        ; compute a+b and set the flags
        LDR r1,[r5]
        ADR r5,b
        LDR r2,[r5]
        ADDS r3,r1,r2
        BLE label3      ; a+b <= 0: take the else branch
        MOV r3,#5       ; true branch: x = 5
        ADR r5,x
        STR r3,[r5]
        B stmtent
label3  MOV r3,#7       ; false branch: x = 7
        ADR r5,x
        STR r3,[r5]
stmtent ...
Procedure linkage
Another major code generation problem is the creation of procedures. We need code to call and return, and to pass parameters and results. Parameters and return values are passed on the stack; procedures with few parameters may use registers instead.
Procedure stacks
The stack grows as procedures are called: proc1(int a) { proc2(5); } pushes a frame for proc2 containing the parameter 5. The frame pointer (FP) defines the end of the last frame and the stack pointer (SP) defines the end of the current frame, with the parameter accessed relative to SP. When a new procedure is called, the SP and FP are modified to push another frame onto the stack.
ARM procedure linkage
The APCS (ARM Procedure Call Standard): r0-r3 pass parameters into the procedure, with extra parameters put on the stack frame; r0 holds the return value; r4-r7 hold register variables; r11 is the frame pointer and r13 is the stack pointer; r10 holds the limiting address on stack size, used to check for stack overflows.
Data structures
The compiler must also translate references to data structures into references to raw memories; in general, this requires address computations. Different types of data structures use different data layouts. Some offsets into a data structure can be computed at compile time; others must be computed at run time. An array element address must in general be computed at run time, since the array index may change. Let us first consider one-dimensional arrays.
One-dimensional arrays
A C array name points to the 0th element, so a[1] is *(a + 1), a[2] is *(a + 2), and so on.
Two-dimensional arrays
In C's row-major layout, an N x M array stores row 0 (a[0,0], a[0,1], …) followed by row 1 (a[1,0], a[1,1], …), so element a[i,j] is found at a[i*M + j]. A worked example of both address computations appears below.
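A small runnable illustration of both address computations (the example values are ours):

#include <stdio.h>

int main(void)
{
    int a[4] = {10, 11, 12, 13};
    int m[2][3] = {{0, 1, 2}, {3, 4, 5}};      /* N = 2 rows, M = 3 columns     */
    int *flat = &m[0][0];

    printf("%d %d\n", a[1], *(a + 1));         /* 1-D: a[1] == *(a+1) -> 11 11  */
    printf("%d %d\n", m[1][2], flat[1*3 + 2]); /* 2-D: m[i][j] at i*M+j -> 5 5  */
    return 0;
}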
Structures
Fields within structures are accessed at static offsets:

struct mystruct {
    int field1;     /* 4 bytes */
    char field2;
};
struct mystruct a, *aptr = &a;

With 4-byte ints, field2 lives 4 bytes past the start of the structure, so the compiler can resolve aptr->field2 to the address (char *)aptr + 4.
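In portable C, the compiler's static offset is exposed by offsetof; a small sketch (padding makes the exact offset implementation-defined, though 4 is typical with 4-byte ints):

#include <stdio.h>
#include <stddef.h>

struct mystruct {
    int field1;     /* typically 4 bytes */
    char field2;
};

int main(void)
{
    struct mystruct a, *aptr = &a;
    size_t off = offsetof(struct mystruct, field2);

    *((char *)aptr + off) = 'x';    /* same cell the compiler uses for field2 */
    printf("offset=%zu field2=%c\n", off, aptr->field2);
    return 0;
}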
Using your compiler
Understand the various optimization levels (-O1, -O2, etc.) and look at the mixed compiler/assembler output. Modifying compiler output requires care: you must preserve correctness, and regenerating the output can lose hand-tweaked code.
Interpreters and JIT (just-in-time) compilers
Programs are not always compiled and then separately executed; in some cases, it may make sense to translate the program into instructions during execution. Two well-known techniques for on-the-fly translation are interpretation and just-in-time (JIT) compilation. An interpreter translates and executes program statements on the fly, one statement at a time, sitting between the program and the machine. The interpreter may or may not generate an explicit piece of code to represent the statement. Because it translates only a very small piece of the program at any given time, only a small amount of memory is needed to hold intermediate representations of the program.
A JIT compiler is somewhere between an interpreter and a stand-alone compiler: it compiles small sections of code into instructions during program execution, eliminating some translation overhead but often requiring more memory. It is best suited to environments such as Java. A JIT compiler produces executable code segments for pieces of the program, but compiles a section (such as a function) only when it knows it will be executed. Unlike an interpreter, it saves the compiled version of the code so that the code does not have to be retranslated the next time it is executed. The JIT compiler usually generates machine code directly rather than building intermediate program representation data structures such as the CDFG.
Program design and analysis
Topics: program-level performance analysis; optimizing for execution time, energy/power, and program size; program validation and testing.
Program-level performance analysis
We need to understand performance in detail: real-time behavior, not just typical behavior, on complex platforms. Program performance is not the same as CPU performance: the pipeline and cache are windows into the program, and we must analyze the entire program.
Complexities of analyzing program performance
The execution time of a program often varies with the input data values, because those values select different execution paths in the program (for example, through loops). There are cache effects: the cache's behavior depends in part on the data values input to the program. And there are instruction-level performance variations: pipeline interlocks and fetch times.
How to measure program performance
Simulate execution of the CPU with a simulator, which makes CPU state visible and also measures the program's execution time; but be careful, as some microprocessor performance simulators are not 100% accurate, and simulation of I/O-intensive code may be difficult. Measure on a real CPU using a timer: a timer connected to the microprocessor bus can measure the performance of executing sections of code, but this requires modifying the program to control the timer. Measure on a real CPU using a logic analyzer, by capturing the start and stop times of a code segment; this requires events visible on the pins.
Program performance metrics
Average-case execution time: typically used in application programming. Worst-case execution time: a component in deadline satisfaction. Best-case execution time: task-level interactions can cause best-case program behavior to result in worst-case system behavior.
Elements of program performance
The basic program execution time formula is: execution time = program path + instruction timing. Solving these two problems independently helps simplify analysis, and they are easier to separate on simpler CPUs. Accurate performance analysis requires the assembly/binary code and the execution platform.
Data-dependent paths in an if statement

if (a || b) {       /* T1 */
    if (c)          /* T2 */
        x = r*s+t;  /* A1 */
    else
        y = r+s;    /* A2 */
    z = r+s+u;      /* A3 */
} else {
    if (c)          /* T3 */
        y = r-t;    /* A4 */
}

Depending on the values of a, b, and c, the statement takes one of four paths: T1=F, T3=F: no assignments; T1=F, T3=T: A4; T1=T, T2=F: A2, A3; T1=T, T2=T: A1, A3.
Paths in a loop

for (i = 0, f = 0; i < N; i++)
    f = f + c[i] * x[i];

The flow graph initializes i = 0, then tests i < N: on Y it executes f = f + c[i] * x[i] and i = i + 1 before retesting; on N it exits the loop.
Instruction timing
Once we know the execution path of the program, we have to measure the execution time of the instructions executed along that path. However, even ignoring cache effects, this is not simply a matter of counting instructions, for the reasons summarized below. Not all instructions take the same amount of time: even RISC machines with fixed-length instructions have multi-cycle instructions, and fetch times vary. Execution times of instructions are not independent: pipeline interlocks and cache effects couple neighboring instructions, and many CPUs use register bypassing to speed up instruction sequences when the result of one instruction is used in the next. Execution times may vary with operand values: this is clearly true of floating-point instructions, in which a different number of iterations may be required to calculate the result, and of some multi-cycle integer operations.
Measurement-driven performance analysis
Measurement is not as easy as it sounds: you must actually have access to the CPU, know the data inputs that give worst/best case performance, and make the state visible. It is still an important method for performance analysis.
Feeding the program
We need to know the desired input values. We may need to write software scaffolding to generate the input values, and the scaffolding may also need to examine outputs to generate feedback-driven inputs.
Trace-driven measurement
Instrument the program to save information about the path it takes. This requires modifying the program, and trace files are large, but the technique is widely used for cache analysis.
Physical measurement
An in-circuit emulator allows tracing but affects execution timing. A logic analyzer can measure behavior at the pins: the address bus can be analyzed to look for events, and code can be modified to make events visible. Physical measurement is particularly important for real-world input streams.
CPU simulation
Some simulators are less accurate; a cycle-accurate simulator provides accurate clock-cycle timing. The simulator models the CPU internals, so the simulator writer must know how the CPU works.
SimpleScalar FIR filter simulation

int x[N] = {8, 17, … };
int c[N] = {1, 2, … };
main() {
    int i, k, f = 0;
    for (k = 0; k < COUNT; k++)
        for (i = 0; i < N; i++)
            f += c[i] * x[i];
}

N        total sim cycles    sim cycles per filter execution
100      25,854              259
1,000    155,759             156
10,000                       145
Performance optimization motivation
Embedded systems must often meet deadlines, and faster may not be fast enough. We need to be able to analyze execution time (worst-case, not typical), and we need techniques for reliably improving execution time.
Programs and performance analysis
Best results come from analyzing optimized instructions, not high-level language code: translations of HLL statements into instructions are non-obvious, code may move, and cache effects are hard to predict.
Software performance optimization
Loop optimizations
Loops are important targets for optimization because programs with loops tend to spend a lot of time executing those loops. There are three important techniques for optimizing loops: code motion, induction variable elimination, and strength reduction (e.g., replacing x*2 with x<<1).
Code motion
Code motion lets us move unnecessary code out of a loop. In

for (i = 0; i < N*M; i++)
    z[i] = a[i] + b[i];

the bound N*M is recomputed for every test of the loop condition. We can avoid N*M - 1 unnecessary executions of this computation by moving it before the loop, as in the sketch below.
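A before/after sketch of the transformation (the array sizes are illustrative):

#define N 8
#define M 4

void add_before(int z[], const int a[], const int b[])
{
    for (int i = 0; i < N*M; i++)   /* N*M evaluated on every test   */
        z[i] = a[i] + b[i];
}

void add_after(int z[], const int a[], const int b[])
{
    int nm = N*M;                   /* invariant hoisted out of loop */
    for (int i = 0; i < nm; i++)
        z[i] = a[i] + b[i];
}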
Induction variable elimination
An induction variable is a variable whose value is derived from the loop iteration variable's value; the compiler often introduces induction variables to help it implement the loop. A nested loop is a good example. Consider:

for (i = 0; i < N; i++)
    for (j = 0; j < M; j++)
        z[i,j] = b[i,j];

Rather than recompute i*M + j for each array in each iteration, we can share one induction variable between the arrays and increment it at the end of the loop body.
The compiler uses induction variables to help it address the arrays. Let us rewrite the loop in C using induction variables and pointers, as in the sketch below, where zptr and bptr are pointers to the heads of the z and b arrays and zbinduct is the shared induction variable.
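A sketch of that rewritten loop (N and M are illustrative sizes):

#define N 8
#define M 4

void copy_b_to_z(int z[N][M], int b[N][M])
{
    int *zptr = &z[0][0], *bptr = &b[0][0];   /* heads of the arrays       */
    int zbinduct = 0;                         /* shared induction variable */

    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++) {
            *(zptr + zbinduct) = *(bptr + zbinduct);
            zbinduct++;           /* replaces recomputing i*M + j twice */
        }
}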
Strength reduction
Strength reduction reduces the cost of operations in a loop body. Consider the assignment y = x * 2. In integer arithmetic, we can use a left shift rather than a multiplication by 2 (as long as we properly keep track of overflows); if the shift is faster than the multiply, we probably want to perform the substitution.
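In C, the substitution looks like this (using unsigned arithmetic to sidestep sign issues):

unsigned times_two(unsigned x)
{
    return x << 1;      /* strength-reduced form of x * 2 */
}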
Performance optimization hints
Use registers efficiently. Use page-mode memory accesses. Analyze cache behavior: instruction conflicts can be handled by rewriting and rescheduling code; conflicting scalar data can easily be moved; conflicting array data can be moved or padded.
PROGRAM-LEVEL ENERGY AND POWER ANALYSIS AND OPTIMIZATION
Energy/power optimization
Energy is the ability to do work; it matters most in battery-powered systems. Power is energy per unit time; it is important even in wall-plug systems, because power becomes heat.
Opportunities for saving power
■ We may be able to replace the algorithms with others that do things in clever ways that consume less power.
■ Memory accesses are a major component of power consumption in many applications; by optimizing memory accesses we may be able to significantly reduce power.
■ We may be able to turn off parts of the system, such as subsystems of the CPU or chips in the system, when we do not need them.
Measuring energy consumption
Execute a small loop and measure the current: the code under test runs over and over in a loop, and by measuring the current flowing into the CPU we measure the power consumption of the complete loop, including both the body and the loop overhead. By separately measuring the power consumption of a loop with no body, we can subtract the overhead and determine the power consumed by the code itself. A sketch of the two loops appears below.
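A sketch of the two measurement loops (test_code is a placeholder for the routine being measured):

extern void test_code(void);

void measure_loop(void)     /* loop 1: body + loop overhead */
{
    while (1)
        test_code();
}

void overhead_loop(void)    /* loop 2: loop overhead only   */
{
    while (1)
        ;
}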
Sources of energy consumption
Relative energy per operation (Catthoor et al.):

memory transfer:  33
external I/O:     10
SRAM write:       9
SRAM read:        4.4
multiply:         3.6
add:              1
Cache behavior is important
Energy consumption has a sweet spot as cache size changes: if the cache is too small, the program thrashes, burning energy on external memory accesses; if the cache is too large, the cache itself burns too much power.
Optimizing for energy
Use registers efficiently. Identify and eliminate cache conflicts. Moderate loop unrolling eliminates some loop-overhead instructions. Eliminate pipeline stalls. Inlining procedures may help: it reduces linkage overhead, but may increase cache thrashing.
Efficient loops
General rules: don't use function calls inside the loop; keep the loop body small to enable local repeat (only forward branches); use an unsigned integer for the loop counter; use <= to test the loop counter; make use of the compiler's global optimization and software pipelining.
Program validation and testing
But does it work? We concentrate here on functional verification. The major testing strategies are black box, which doesn't look at the source code, and clear box (white box), which does.
Clear-box testing
Examine the source code to determine whether it works: can you actually exercise a path, and do you get the value you expect along that path? The testing procedure relies on controllability (providing the program with inputs and executing it) and observability (examining the outputs).
How much testing is enough?
Exhaustive testing is impractical. One important measure of test quality is the number of bugs escaping into the field; good organizations can test software to achieve very low field bug-report rates. Error injection measures test quality: add known bugs, run your tests, and determine the percentage of injected bugs that are caught.