Chapter 3 CPUs Prof. Chung-Ta King (金仲達), Department of Computer Science, National Tsing Hua University (Slides are taken from the textbook slides)
CPUs-1 Outline Input and Output Mechanisms Supervisor Mode, Exceptions, Traps Memory Management Performance and Power Consumption
CPUs-2 I/O devices An embedded system usually includes some input/output devices. A typical I/O interface to the CPU: [Figure: the CPU reads and writes the device's status register and data register, which sit in front of the device mechanism.]
CPUs-3 Application: 8251 UART Universal asynchronous receiver transmitter (UART): provides serial communication 8251 functions are usually integrated into standard PC interface chip Allows many communication parameters Baud (bit) rate: e.g chars/sec Number of bits per character Parity/no parity Even/odd parity Length of stop bit (1, 1.5, 2 bits)
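CPUs-4 8251 CPU interface [Figure: the CPU accesses the 8251 through an 8-bit status register and an 8-bit data register; the 8251 drives the serial port (xmit/rcv). Timing of one character: no char, start bit, bit 0, bit 1, ..., bit n-1, stop bit.]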
CPUs-5 Programming I/O Two types of instructions can support I/O: special-purpose I/O instructions and memory-mapped load/store instructions. Intel x86 provides in and out instructions. Most other CPUs use memory-mapped I/O, which assigns memory addresses to the I/O registers.
CPUs-6 ARM memory-mapped I/O
Define location for device:
    DEV1 EQU 0x1000
Read/write code:
    LDR r1,=DEV1   ; set up device address
    LDR r0,[r1]    ; read DEV1
    MOV r0,#8      ; set up value to write
    STR r0,[r1]    ; write value to device
CPUs-7 SHARC memory mapped I/O Device must be in external memory space (above 0x400000). Use DM to control access: I0 = 0x400000; M0 = 0; R1 = DM(I0,M0);
CPUs-8 Peek and poke
Traditional C interfaces:
    int peek(char *location) {
        return *location;
    }
    void poke(char *location, char newval) {
        (*location) = newval;
    }
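A minimal sketch of how peek and poke might be written for 8-bit device registers. The volatile qualifier is an addition not shown on the slide: it tells the compiler that every access must actually reach the device. The names peek8 and poke8 are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Read one byte from a device register.
 * volatile keeps the compiler from caching or dropping the access. */
uint8_t peek8(const volatile uint8_t *location)
{
    assert(location != 0);
    return *location;
}

/* Write one byte to a device register. */
void poke8(volatile uint8_t *location, uint8_t newval)
{
    assert(location != 0);
    *location = newval;
}
```

On a host machine an ordinary uint8_t variable can stand in for the register when trying this out; on a target, the pointer would be a fixed device address cast to volatile uint8_t *.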
CPUs-9 Busy/wait output
Simplest way to program a device: use instructions to test when the device is ready.
    #define OUT_CHAR   ((char *)0x1000)   /* device data register */
    #define OUT_STATUS ((char *)0x1001)   /* device status register */

    current_char = mystring;
    while (*current_char != '\0') {
        poke(OUT_CHAR, *current_char);    /* send the character */
        while (peek(OUT_STATUS) != 0)
            ;                             /* busy waiting */
        current_char++;
    }
CPUs-10 Simultaneous busy/wait input and output
    while (TRUE) {
        /* read */
        while (peek(IN_STATUS) == 0)
            ;                             /* wait for a character */
        achar = (char)peek(IN_DATA);
        /* write */
        poke(OUT_DATA, achar);
        poke(OUT_STATUS, 1);
        while (peek(OUT_STATUS) != 0)
            ;                             /* wait for the write to finish */
    }
CPUs-11 Interrupt I/O Busy/wait is very inefficient: the CPU can’t do other work while testing the device, and it is hard to do simultaneous I/O. Interrupts allow a device to change the flow of control in the CPU, causing a subroutine call to handle the device. [Figure: the device's status and data registers connect to the CPU over data/address lines; interrupt request and interrupt acknowledge signals run between the device and the CPU, whose PC and IR are shown.]
CPUs-12 Interrupt behavior Based on subroutine call mechanism. Interrupt forces next instruction to be a subroutine call to a predetermined location. Return address is saved to resume executing foreground program.
CPUs-13 Interrupt physical interface CPU and device are connected by CPU bus CPU and device handshake: device asserts interrupt request CPU checks the interrupt request line at the beginning of each instruction cycle CPU asserts interrupt acknowledge when it can handle the interrupt CPU fetches the next instruction from the interrupt handler routine
CPUs-14 Example: character I/O handlers
    void input_handler() {
        achar = peek(IN_DATA);
        gotchar = TRUE;
        poke(IN_STATUS, 0);
    }
    void output_handler() {
    }
CPUs-15 Example: interrupt-driven main program
    main() {
        while (TRUE) {
            if (gotchar) {
                poke(OUT_DATA, achar);
                poke(OUT_STATUS, 1);
                gotchar = FALSE;
            }
        }
    }
CPUs-16 Example: interrupt I/O with buffers Queue for characters: [Figure: a character queue with head and tail pointers, shown empty and after the character a has been added.]
CPUs-17 Buffer-based input handler
    void input_handler() {
        char achar;
        if (full_buffer())
            error = 1;
        else {
            achar = peek(IN_DATA);
            add_char(achar);
        }
        poke(IN_STATUS, 0);
        if (nchars == 1) {
            poke(OUT_DATA, remove_char());
            poke(OUT_STATUS, 1);
        }
    }
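The handler relies on full_buffer(), add_char(), remove_char(), and nchars, which the slides do not define. One possible sketch is a fixed-size circular queue; the capacity BUF_SIZE, the array io_buf, the indices buf_head and buf_tail, and the helper empty_buffer() are all assumptions made for illustration.

```c
#include <assert.h>

#define BUF_SIZE 8              /* hypothetical queue capacity */

static char io_buf[BUF_SIZE];   /* circular character queue */
static int buf_head = 0;        /* index of the oldest character */
static int buf_tail = 0;        /* index of the next free slot */
int nchars = 0;                 /* number of queued characters */

int full_buffer(void)  { return nchars == BUF_SIZE; }
int empty_buffer(void) { return nchars == 0; }

/* Append one character at the tail; callers check full_buffer() first. */
void add_char(char achar)
{
    assert(!full_buffer());
    io_buf[buf_tail] = achar;
    buf_tail = (buf_tail + 1) % BUF_SIZE;   /* wrap around at the end */
    nchars++;
}

/* Remove and return the oldest character; callers check for empty first. */
char remove_char(void)
{
    char achar;
    assert(!empty_buffer());
    achar = io_buf[buf_head];
    buf_head = (buf_head + 1) % BUF_SIZE;
    nchars--;
    return achar;
}
```

In a real system the queue state is shared between the handlers and the foreground program, so it would also need volatile qualifiers or brief interrupt masking around updates; the sketch leaves that out.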
CPUs-18 Debugging interrupt code What if the handler forgets to save and restore the registers it changes? The foreground program can exhibit mysterious bugs. Bugs will be hard to repeat because they depend on interrupt timing.
CPUs-19 Priorities and Vectors Two mechanisms allow us to make interrupts more specific: Priorities determine what interrupt gets CPU first Vectors determine what code (handler routine) is called for each type of interrupt Mechanisms are orthogonal: most CPUs provide both
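CPUs-20 Prioritized interrupts [Figure: devices 1 through n each drive their own interrupt request line (L1, L2, ..., Ln) into the CPU; the CPU returns an interrupt acknowledge to the devices.]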
CPUs-21 Interrupt prioritization Masking: interrupt with priority lower than current priority is not recognized until pending interrupt is complete Non-maskable interrupt (NMI): highest-priority, never masked Often used for power-down
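CPUs-22 Interrupt Vectors Allow different devices to be handled by different code. Require an additional vector line from device to CPU. [Figure: the device sends an interrupt request to the CPU, the CPU answers with an acknowledge, and the device returns a vector; the vector indexes the interrupt vector table, whose entries point to handler 0, handler 1, handler 2, handler 3.]
CPUs-23 Generic interrupt mechanism Assume priority selection is handled before this point. [Flowchart: intr? If no, continue execution. If yes: intr priority > current priority? If no, ignore. If yes: ack, then vector? If yes, call table[vector]. If no: timeout? If yes, bus error; if no, keep waiting for the vector.]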
CPUs-24 Interrupt sequence CPU acknowledges request Device sends vector CPU calls handler Software processes request CPU restores state to foreground program
CPUs-25 Sources of interrupt overhead Handler execution time Interrupt mechanism overhead Register save/restore Pipeline-related penalties Cache-related penalties
CPUs-26 ARM interrupts ARM7 supports two types of interrupts: Fast interrupt requests (FIQs) Interrupt requests (IRQs) FIQ takes priority over IRQ Interrupt table starts at location 0
CPUs-27 ARM interrupt procedure CPU actions: Save PC; copy CPSR to SPSR. Force bits in CPSR to record interrupt. Force PC to vector. Handler responsibilities: Restore proper PC. Restore CPSR from SPSR. Clear interrupt disable flags.
CPUs-28 ARM interrupt latency Worst-case latency to respond to interrupt is 27 cycles: Two cycles to synchronize external request. Up to 20 cycles to complete current instruction. Three cycles for data abort. Two cycles to enter interrupt handling state.
CPUs-29 SHARC interrupt structure Interrupts are vectored and prioritized. Priorities are fixed: reset highest, user SW interrupt 3 lowest. Vectors are also fixed. Vector is offset in vector table. Table starts at 0x20000 in internal memory, 0x40000 in external memory.
CPUs-30 SHARC interrupt sequence Start: must be executing or in IDLE/IDLE16. 1. Output appropriate interrupt vector address. 2. Push PC value onto PC stack. 3. Set bit in interrupt latch register. 4. Set IMASKP to current nesting state.
CPUs-31 SHARC interrupt return Initiated by RTI instruction. 1. Return to address at top of PC stack. 2. Pop PC stack. 3. Pop status stack if appropriate. 4. Clear bits in interrupt latch register and IMASKP.
CPUs-32 SHARC interrupt performance Three stages of response: 1 cycle: synchronization and latching 1 cycle: recognition 2 cycles: branching to vector Total latency: 3 cycles. Multiprocessor vector interrupts have 6 cycle latency.
CPUs-33 Outline Input and Output Mechanisms Supervisor Mode, Exceptions, Traps Memory Management Performance and Power Consumption
CPUs-34 Supervisor mode May want to provide protective barriers between programs. e.g., avoid memory corruption Need supervisor mode to manage the various programs SHARC does not have a supervisor mode
CPUs-35 ARM supervisor mode Use SWI instruction to enter supervisor mode, similar to subroutine: SWI CODE_1 Sets PC to 0x08 Argument to SWI is passed to supervisor mode code to request various services Saves CPSR in SPSR
CPUs-36 Exception Exception: internally detected error. Exceptions are synchronous with instructions but unpredictable. Build the exception mechanism on top of the interrupt mechanism. Exceptions are usually prioritized and vectored. A single instruction may generate more than one exception.
CPUs-37 Trap Trap (software interrupt): an exception generated by an instruction Call supervisor mode ARM uses SWI instruction for traps SHARC offers three levels of software interrupts. Called by setting bits in IRPTL register
CPUs-38 Co-processor Co-processor: added function unit that is called by instruction e.g. floating-point operations A co-processor instruction can cause trap and be handled by software (if no such co-processor exists) ARM allows up to 16 co-processors
CPUs-39 Outline Input and Output Mechanisms Supervisor Mode, Exceptions, Traps Memory Management Performance and Power Consumption
CPUs-40 Caches and CPUs [Figure: the CPU sends addresses to a cache controller, which is connected to both the cache and main memory; data flows back to the CPU.] Memory access speed is falling further and further behind CPU speed. Cache: reduces the speed gap.
CPUs-41 Cache operation May have caches for: instructions data data + instructions (unified) Memory access time is no longer deterministic Cache hit: required location is in cache Cache miss: required location is not in cache Working set: set of locations used by program in a time interval.
CPUs-42 Types of cache misses Compulsory (cold): location has never been accessed. Capacity: working set is too large Conflict: multiple locations in working set map to same cache entry.
CPUs-43 Memory system performance Cache performance benefits: Keep frequently-accessed locations in fast cache. Cache retrieves more than one word at a time. Sequential accesses are faster after first access. $h$ = cache hit rate, $t_{cache}$ = cache access time, $t_{main}$ = main memory access time. Average memory access time: $t_{av} = h\,t_{cache} + (1-h)\,t_{main}$
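The average access time formula can be evaluated directly. The following is a small sketch; the function name avg_access_time and the sample numbers in the note are illustrative assumptions, not values from the slides.

```c
#include <assert.h>

/* Average memory access time for a single cache level.
 * h: hit rate in [0, 1]; t_cache and t_main: access times in the same unit. */
double avg_access_time(double h, double t_cache, double t_main)
{
    assert(h >= 0.0 && h <= 1.0);
    return h * t_cache + (1.0 - h) * t_main;
}
```

For example, with a 90% hit rate, a 1-cycle cache, and a 10-cycle main memory, the average is 0.9 * 1 + 0.1 * 10 = 1.9 cycles.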
CPUs-44 Multi-level caches [Figure: CPU, L1 cache, L2 cache.] $h_1$ = L1 cache hit rate. $h_2$ = rate for miss on L1, hit on L2. Average memory access time: $t_{av} = h_1 t_{L1} + h_2 t_{L2} + (1 - h_1 - h_2)\,t_{main}$
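Taking $h_2$ as defined on the slide (the fraction of all accesses that miss in L1 and hit in L2), the three weights have to sum to one, so the remaining fraction $1 - h_1 - h_2$ goes to main memory. The sketch below uses that form; the function name and the sample numbers in the note are assumptions for illustration.

```c
#include <assert.h>

/* Average memory access time with two cache levels.
 * h1: fraction of accesses that hit in L1.
 * h2: fraction of accesses that miss in L1 and hit in L2.
 * The remaining fraction (1 - h1 - h2) goes to main memory. */
double avg_access_time_2level(double h1, double h2,
                              double t_l1, double t_l2, double t_main)
{
    assert(h1 >= 0.0 && h2 >= 0.0 && h1 + h2 <= 1.0);
    return h1 * t_l1 + h2 * t_l2 + (1.0 - h1 - h2) * t_main;
}
```

With h1 = 0.9, h2 = 0.08, and access times of 1, 5, and 50 cycles, the average is 0.9 + 0.4 + 1.0 = 2.3 cycles. Setting h2 = 0 reduces the expression to the single-level formula.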
CPUs-45 Replacement policies Replacement policy: strategy for choosing which cache entry to throw out to make room for a new memory location. Two popular strategies: Random. Least-recently used (LRU).
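CPUs-46 Cache organizations Fully-associative: any memory location can be stored anywhere in the cache (almost never implemented). Direct-mapped: each memory location maps onto exactly one cache entry. N-way set-associative: each memory location can be stored in any one of n entries, one in each of n direct-mapped banks (the banks are labeled as sets on these slides).
CPUs-47 Direct-mapped cache [Figure: the address is split into tag, index, and offset fields. The index selects a cache block holding a valid bit (1), a tag (0xabcd), and data bytes. The stored tag is compared (=) with the address tag to produce hit, and the offset selects the byte returned as value.]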
CPUs-48 Direct-mapped cache locations Many locations map onto the same cache block. Conflict misses are easy to generate: Array a[] uses locations 0, 1, 2, … Array b[] uses locations 1024, 1025, 1026, … Operation a[i] + b[i] generates conflict misses if locations 0 and 1024 are mapped to the same block in the cache. Write operations: Write-through: immediately copy write to main memory Write-back: write to main memory only when location is removed from cache
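The conflict between locations 0 and 1024 can be checked by computing the index and tag of each address. This is a sketch under assumed parameters: the slide does not give the cache geometry, so the 4-byte block and 256-block cache used in the note are illustrative, as are the function names.

```c
#include <assert.h>
#include <stdint.h>

/* Cache block that a byte address maps to in a direct-mapped cache. */
uint32_t cache_index(uint32_t addr, uint32_t block_size, uint32_t num_blocks)
{
    assert(block_size > 0 && num_blocks > 0);
    return (addr / block_size) % num_blocks;
}

/* Tag stored with the block to tell apart addresses that share an index. */
uint32_t cache_tag(uint32_t addr, uint32_t block_size, uint32_t num_blocks)
{
    assert(block_size > 0 && num_blocks > 0);
    return addr / (block_size * num_blocks);
}

/* Two addresses conflict when they share an index but have different tags. */
int cache_conflict(uint32_t a, uint32_t b,
                   uint32_t block_size, uint32_t num_blocks)
{
    return cache_index(a, block_size, num_blocks) ==
               cache_index(b, block_size, num_blocks) &&
           cache_tag(a, block_size, num_blocks) !=
               cache_tag(b, block_size, num_blocks);
}
```

With 4-byte blocks and 256 blocks (a 1 KB cache), addresses 0 and 1024 both map to index 0 with tags 0 and 1, so alternating between a[i] and b[i] evicts the other array's block on each access.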
CPUs-49 Set-associative cache A set of direct-mapped caches: [Figure: Set 1, Set 2, ..., Set n are looked up in parallel and together produce hit and data.]
CPUs-50 Memory Management Units The memory management unit (MMU) translates addresses. MMUs are not common in embedded systems, which rarely have secondary storage. [Figure: the CPU issues a logical address to the MMU, which sends a physical address to main memory; data is swapped between main memory and secondary storage.]
CPUs-51 Memory management tasks Allows programs to move in physical memory during execution. Allows virtual memory: memory images kept in secondary storage; images returned to main memory on demand during execution. Page fault: request for location not resident in memory, which generates an exception
CPUs-52 Address translation Requires some sort of register/table to allow arbitrary mappings of logical to physical addresses Two basic schemes: segmented paged Segmentation and paging can be combined (x86)
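CPUs-53 Segment address translation [Figure: the logical address from the CPU is added (+) to the segment base address to form the physical address sent to memory. A range check compares the result with the segment lower bound and segment upper bound and signals a range error if it falls outside.]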
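Segment translation can be sketched as an add followed by a bounds check. The struct layout, field names, and return convention below are assumptions made for illustration; real MMUs keep this state in hardware registers.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t base;    /* segment base address */
    uint32_t lower;   /* lowest legal physical address in the segment */
    uint32_t upper;   /* highest legal physical address in the segment */
} segment_t;

/* Translate a logical address within a segment.
 * Returns 1 and stores the physical address on success,
 * or returns 0 to signal a range error. */
int segment_translate(const segment_t *seg, uint32_t logical,
                      uint32_t *physical)
{
    uint32_t pa;
    assert(seg != 0 && physical != 0);
    pa = seg->base + logical;                 /* base + logical address */
    if (pa < seg->lower || pa > seg->upper)
        return 0;                             /* range error */
    *physical = pa;
    return 1;
}
```

For a segment with base 0x4000 and bounds 0x4000 to 0x4FFF, logical address 0x10 translates to 0x4010, while logical address 0x1000 lands at 0x5000 and is rejected.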
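CPUs-54 Page address translation [Figure: the logical address from the CPU is split into page and offset fields. The page number indexes the page table to fetch the page i base, which is concatenated with the unchanged offset to form the physical address sent to memory.]
CPUs-55 Page table organizations Page tables can be organized as a flat table or as a tree; in both, each entry is a page descriptor.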
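Page translation replaces the page number with a base address from the page table and keeps the offset. The sketch below assumes 4 KB pages and a flat table of page-aligned base addresses; PAGE_BITS, the table layout, and the function name are illustrative choices, not taken from the slides.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BITS 12u                        /* assumed 4 KB pages */
#define PAGE_MASK ((1u << PAGE_BITS) - 1u)

/* page_table[i] holds the physical base address of page i,
 * aligned to the page size. */
uint32_t page_translate(const uint32_t *page_table, uint32_t num_pages,
                        uint32_t logical)
{
    uint32_t page   = logical >> PAGE_BITS;  /* upper bits: page number */
    uint32_t offset = logical & PAGE_MASK;   /* lower bits: kept as is */
    assert(page_table != 0 && page < num_pages);
    return page_table[page] | offset;        /* concatenate base and offset */
}
```

With page 1 mapped to base 0x00013000, logical address 0x1234 (page 1, offset 0x234) becomes physical address 0x00013234. A page fault would correspond to a missing entry, which this sketch does not model.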
CPUs-56 Caching address translations Large translation tables require main memory access. TLB: cache for address translation. Typically small.
CPUs-57 ARM & SHARC memory management Memory region types: section: 1 Mbyte block; large page: 64 kbytes; small page: 4 kbytes. An address is marked as section-mapped or page-mapped. Two-level translation scheme. SHARC does not have an MMU.
CPUs-58 ARM address translation [Figure: the virtual address is split into 1st index, 2nd index, and offset. The translation table base register locates the 1st-level table; the 1st index selects a descriptor that points to a 2nd-level table; the 2nd index selects a descriptor there, which is concatenated with the offset to form the physical address.]
CPUs-59 Outline Input and Output Mechanisms Supervisor Mode, Exceptions, Traps Memory Management Performance and Power Consumption
CPUs-60 Performance Acceleration Three factors can substantially improve system performance: pipelining, superscalar execution, and caching. We need to take advantage of them where possible, but they also make performance harder to analyze.
CPUs-61 Pipelining Several instructions are executed simultaneously at different stages of completion Various conditions can cause pipeline stalls that reduce utilization: branches memory system delays data hazards, etc. Both ARM and SHARC have 3-stage pipes: fetch instruction from memory; decode opcode and operands; execute.
CPUs-62 ARM pipeline execution
time            1       2        3         4         5
add r0,r1,#5    fetch   decode   execute
sub r2,r3,r6            fetch    decode    execute
cmp r2,#3                        fetch     decode    execute
CPUs-63 Pipeline performance Latency: time it takes for an instruction to get through the pipeline. Throughput: number of instructions executed per time period. Pipelining increases throughput without reducing latency. Pipeline stall: If a step cannot be completed in the same amount of time, pipeline stalls. Bubbles introduced by stall increase latency, reduce throughput.
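A simple cycle-count model makes the latency and throughput distinction concrete. It assumes an idealized pipeline that issues one instruction per cycle, with every stall adding one bubble cycle; the function name and parameters are hypothetical.

```c
#include <assert.h>

/* Cycles needed to run n instructions on a k-stage pipeline that issues
 * one instruction per cycle, plus stall_cycles bubbles from hazards. */
unsigned long pipeline_cycles(unsigned long n, unsigned k,
                              unsigned long stall_cycles)
{
    assert(k >= 1);
    if (n == 0)
        return 0;
    /* The first instruction takes k cycles; each later one adds a cycle. */
    return k + (n - 1) + stall_cycles;
}
```

Three instructions on a 3-stage pipeline finish in 5 cycles, while 100 instructions take 102 cycles, close to one instruction per cycle even though each instruction still needs 3 cycles to get through.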
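CPUs-64 Multi-cycle execution and data stall LDMIA: load multiple. [Figure: pipeline timing for ldmia r0,{r2,r3} / sub r2,r3,r6 / cmp r2,#3. The ldmia is fetched and decoded, then takes two execute cycles (ex ld r2, ex ld r3). The sub is fetched and decoded but its execute (ex sub) waits one cycle, a data stall, and the cmp is delayed behind it (ex cmp).]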
CPUs-65 Control stalls Branches often introduce stalls (branch penalty). Stall time may depend on whether branch is taken. May have to squash instructions that already started executing. Don’t know what to fetch until condition is evaluated.
CPUs-66 ARM pipelined branch [Figure: pipeline timing for bne foo / sub r2,r3,r6 / foo add r0,r1,r2. The taken bne is fetched and decoded, then occupies three execute cycles (ex bne). The sub is fetched and decoded but never executed. The add at foo is then fetched, decoded, and executed (ex add).]
CPUs-67 Delayed branch To increase pipeline efficiency, the delayed branch mechanism requires that the n instructions after a branch are always executed, whether or not the branch is taken. SHARC supports delayed and non-delayed branches, specified by a bit in the branch instruction, with a 2-instruction branch delay slot.
CPUs-68 Example: SHARC code scheduling L1=5; DM(I0,M1)=R1; L8=8; DM(I8,M9)=R2; CPU cannot use DAG on cycle just after loading DAG’s register, because both need the same internal bus. CPU performs NOP between register assign and DM.
CPUs-69 Rescheduled SHARC code L1=5; L8=8; DM(I0,M1)=R1; DM(I8,M9)=R2; Avoids two NOP cycles.
CPUs-70 Example: ARM execution time Determine execution time of FIR filter: for (i=0; i<N; i++) f = f + c[i]*x[i]; Only branch in loop test may take more than one cycle. BLT loop takes 1 cycle best case, 3 worst case.
CPUs-71 Superscalar execution Superscalar processor can execute several instructions per cycle Uses multiple pipelined data paths Programs execute faster, but it is harder to determine how much faster
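CPUs-72 Superscalar Processor [Figure: control logic issues instruction 1 and instruction 2 to two instruction units, which share the registers.]
CPUs-73 Data dependencies Execution time depends on operands, not just opcode. Superscalar CPU checks data dependencies dynamically: in add r2,r0,r1 followed by add r3,r2,r5, the second add reads r2, which the first add writes. [Figure: data dependency graph in which r0 and r1 produce r2, and r2 and r5 produce r3.]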
CPUs-74 Memory system performance Caches introduce indeterminacy in execution time. Depends on order of execution. Cache miss penalty: added time due to a cache miss. Several reasons for a miss: compulsory, conflict, capacity.
CPUs-75 CPU power consumption Most modern CPUs are designed with power consumption in mind to some degree Power vs. energy: Power is energy consumption per unit time heat depends on power consumption battery life depends on energy consumption
CPUs-76 CMOS power consumption Voltage drops: power consumption is proportional to $V^2$. Toggling: more activity means more power. Leakage: basic circuit characteristics; can be eliminated by disconnecting power.
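The quadratic dependence on voltage can be explored numerically. The sketch below uses the common first-order model in which dynamic power scales as $V^2 f$ with switching activity and capacitance held constant; the frequency term and the function name are additions beyond what the slide states.

```c
#include <assert.h>

/* Dynamic power at (v, f) relative to a reference point (v_ref, f_ref),
 * using the first-order model P ~ V^2 * f. Leakage is not modeled. */
double relative_dynamic_power(double v, double f, double v_ref, double f_ref)
{
    assert(v_ref > 0.0 && f_ref > 0.0);
    return (v * v * f) / (v_ref * v_ref * f_ref);
}
```

Halving the supply voltage at the same clock frequency cuts dynamic power to one quarter; halving the clock as well brings it to one eighth.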
CPUs-77 CPU power-saving strategies Reduce power supply voltage Run at lower clock frequency Disable function units with control signals when not in use Disconnect parts from power supply when not in use to eliminate leakage currents
CPUs-78 Power management styles Static power management: does not depend on CPU activity. Example: user-activated power-down mode. Dynamic power management: based on CPU activity. Example: turning off function units.
CPUs-79 Application: PowerPC 603 Provides doze, nap, sleep modes for static power management Dynamic power management features: Can shut down unused execution units Cache organized into subarrays to minimize amount of active circuitry
CPUs-80 PowerPC 603 activity
Percentage of time units are idle for SPEC integer/floating-point:
unit              SPECint92   SPECfp92
D cache           29%         28%
I cache           29%         17%
load/store        35%         17%
fixed-point       38%         76%
floating-point    99%         30%
system register   89%         97%
Idle units are turned off by switching off clocks. Pipeline stages can be turned on or off.
CPUs-81 Power-down costs Going into a power-down mode costs: time energy Must determine if going into mode is worthwhile Can model CPU power states with power state machine
CPUs-82 Application: StrongARM SA-1100 Processor takes two supplies: VDD is main 3.3V supply (on & off); VDDX is 1.5V (always remains on). Three power modes: Run: normal operation. Idle: stops CPU clock, with logic still powered. Sleep: shuts off most of chip activity; 3 steps, each about 30 µs; wakeup takes > 10 ms.
CPUs-83 SA-1100 power state machine [Figure: states run (P_run = 400 mW), idle (P_idle = 50 mW), and sleep (P_sleep = 0.16 mW). Transition times: 10 µs between run and idle, 90 µs into sleep from run or from idle, and 160 ms from sleep back to run.]
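The power state machine can be used to estimate when sleeping is worthwhile. The sketch below compares idling through a gap of t seconds with sleeping through it, taking the transition into sleep as 90 µs and the wakeup as 160 ms. It assumes, as a simplification not stated on the slides, that the chip draws run power during both transitions and ignores the 10 µs idle transitions; all names are hypothetical.

```c
#include <assert.h>

/* Values read from the power state machine; units are mW and seconds. */
#define P_RUN_MW      400.0
#define P_IDLE_MW      50.0
#define P_SLEEP_MW      0.16
#define T_TO_SLEEP_S   90e-6    /* run -> sleep, taken as 90 microseconds */
#define T_WAKE_S       160e-3   /* sleep -> run */

/* Energy (mJ) spent waiting t seconds in idle mode. */
double idle_energy_mj(double t)
{
    assert(t >= 0.0);
    return P_IDLE_MW * t;
}

/* Energy (mJ) spent sleeping through a gap of t seconds.
 * Assumption: run power is drawn during both transitions. */
double sleep_energy_mj(double t)
{
    double t_trans = T_TO_SLEEP_S + T_WAKE_S;
    assert(t >= t_trans);
    return P_RUN_MW * t_trans + P_SLEEP_MW * (t - t_trans);
}

/* Nonzero when sleeping through the gap costs less energy than idling. */
int sleep_is_worthwhile(double t)
{
    return sleep_energy_mj(t) < idle_energy_mj(t);
}
```

Under these assumptions a 1 s gap is cheaper to spend in idle (50 mJ against roughly 64 mJ), while a 2 s gap favors sleep (roughly 64 mJ against 100 mJ); the break-even point is near 1.3 s.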
CPUs-84 Outline Input and Output Mechanisms Supervisor Mode, Exceptions, Traps Memory Management Performance and Power Consumption Example Design: Data Compressor
CPUs-85 Goals Compress data transmitted over serial line. Receives byte-size input symbols. Produces output symbols packed into bytes. Will build software module only here.
CPUs-86 Collaboration diagram for compressor [Figure: :input sends input symbols (1..n) to :data compressor, which sends packed output symbols (1..m) to :output.]
CPUs-87 Huffman coding Early statistical text compression algorithm. Select non-uniform size codes. Use shorter codes for more common symbols. Use longer codes for less common symbols. To allow decoding, codes must have unique prefixes. No code can be a prefix of a longer valid code.
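CPUs-88 Huffman example
character   P
a           .45
b           .24
c           .11
d           .08
e           .07
f           .05
[Figure: Huffman tree built bottom-up by merging the two least probable nodes: e and f give P=.12, c and d give P=.19, those two give P=.31, adding b gives P=.55, and adding a gives the root, P=1.]
CPUs-89 Example Huffman code
Read code from root to leaves:
a   1
b   01
c   0000
d   0001
e   0010
f   0011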
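The example code table can be exercised with a small encoder. This is a sketch only: it writes the bits as '0' and '1' characters for readability rather than packing them, and the names huff_code, huff_encode, and huff_is_prefix_free are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Code table from the example, indexed by symbol: a, b, c, d, e, f. */
static const char *const huff_code[6] = {
    "1", "01", "0000", "0001", "0010", "0011"
};

/* Encode a string of symbols 'a'..'f' as a string of '0'/'1' characters.
 * Returns the number of bits written, or -1 on an unknown symbol or if
 * out (capacity out_size, including the terminator) is too small. */
int huff_encode(const char *in, char *out, size_t out_size)
{
    size_t used = 0;
    assert(in != 0 && out != 0 && out_size > 0);
    for (; *in != '\0'; in++) {
        const char *code;
        size_t n;
        if (*in < 'a' || *in > 'f')
            return -1;
        code = huff_code[*in - 'a'];
        n = strlen(code);
        if (used + n + 1 > out_size)
            return -1;
        memcpy(out + used, code, n);
        used += n;
    }
    out[used] = '\0';
    return (int)used;
}

/* Check the prefix property: no code is a prefix of another code. */
int huff_is_prefix_free(void)
{
    int i, j;
    for (i = 0; i < 6; i++)
        for (j = 0; j < 6; j++)
            if (i != j && strncmp(huff_code[i], huff_code[j],
                                  strlen(huff_code[i])) == 0)
                return 0;
    return 1;
}
```

Encoding "abc" yields 1010000, which is 7 bits instead of the 24 bits of three 8-bit characters. With the probabilities of the example, the average code length is .45(1) + .24(2) + .31(4) = 2.17 bits per symbol.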
CPUs-90 Huffman coder requirements table
CPUs-91 Building a specification Collaboration diagram shows only steady-state input/output. A real system must: Accept an encoding table. Allow a system reset that flushes the compression buffer.
CPUs-92 data-compressor class
data-compressor
    attributes: buffer: data-buffer; table: symbol-table; current-bit: integer
    operations: encode(): boolean, data-buffer; flush(); new-symbol-table()
CPUs-93 Data-compressor behaviors encode: takes one-byte input, generates packed encoded symbols and a Boolean indicating whether the buffer is full. new-symbol-table: installs new symbol table in object, throws away old table. flush: returns current state of buffer, including number of valid bits in buffer.
CPUs-94 Auxiliary classes
data-buffer
    attributes: databuf[databuflen]: character; len: integer
    operations: insert(); length(): integer
symbol-table
    attributes: symbols[nsymbols]: data-buffer; len: integer
    operations: value(): symbol; load()
CPUs-95 Auxiliary class roles data-buffer holds both packed and unpacked symbols. Longest Huffman code for 8-bit inputs is 256 bits. symbol-table indexes encoded version of each symbol. load() puts data in a new symbol table.
CPUs-96 Class relationships [Figure: class diagram relating data-compressor to symbol-table and data-buffer.]
CPUs-97 Encode behavior [State diagram: encode receives an input symbol and tests whether the buffer is filled. If true: create new buffer, add to buffers, return true. If false: add to buffer, return false.]
CPUs-98 Insert behavior [State diagram: insert tests whether the input symbol fills the buffer. If false: pack into this buffer. If true: pack bottom bits into this buffer, top bits into overflow buffer. Then update length.]
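The insert behavior can be sketched in C as a bit-packing routine. Everything here is an illustrative assumption: the 32-bit capacity, the bit_buffer type, passing the symbol as an integer with its bottom bit first, and storing bits from the least significant end of each byte.

```c
#include <assert.h>
#include <string.h>

#define BUF_BITS 32                     /* hypothetical capacity in bits */

typedef struct {
    unsigned char bits[BUF_BITS / 8];   /* packed bits, LSB of each byte first */
    int len;                            /* number of valid bits */
} bit_buffer;

void bit_buffer_init(bit_buffer *b)
{
    assert(b != 0);
    memset(b->bits, 0, sizeof b->bits);
    b->len = 0;
}

/* Append the low nbits of code to cur, bottom bit first. Bits that do not
 * fit go into overflow, which must be empty on entry.
 * Returns 1 if cur became full, 0 otherwise. */
int bit_buffer_insert(bit_buffer *cur, bit_buffer *overflow,
                      unsigned long code, int nbits)
{
    int i;
    assert(cur != 0 && overflow != 0);
    assert(nbits >= 0 && nbits <= BUF_BITS && overflow->len == 0);
    for (i = 0; i < nbits; i++) {
        /* Bottom bits fill this buffer; the rest spill into overflow. */
        bit_buffer *dst = (cur->len < BUF_BITS) ? cur : overflow;
        if ((code >> i) & 1ul)
            dst->bits[dst->len / 8] |= (unsigned char)(1u << (dst->len % 8));
        dst->len++;
    }
    return cur->len == BUF_BITS;
}
```

When the function returns 1, the caller would hand off the full buffer and continue with the overflow buffer as the new current buffer, mirroring the create-new-buffer step of encode.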
CPUs-99 Program design In an object-oriented language, we can reflect the UML specification in the code more directly. In a non-object-oriented language, we must either: add code to provide object-oriented features; diverge from the specification structure.
CPUs-100 C++ classes
    class data_buffer {
        char databuf[databuflen];
        int len;
        int length_in_chars() { return len/bitsperbyte; }
    public:
        void insert(data_buffer, data_buffer&);
        int length() { return len; }
        int length_in_bytes() { return (int)ceil(len/8.0); }
        int initialize();
        ...
CPUs-101 C++ classes, cont’d.
    class data_compressor {
        data_buffer buffer;
        int current_bit;
        symbol_table table;
    public:
        boolean encode(char, data_buffer&);
        void new_symbol_table(symbol_table);
        int flush(data_buffer&);
        data_compressor();
        ~data_compressor();
    };
CPUs-102 C code
    struct data_compressor_struct {
        data_buffer buffer;
        int current_bit;
        sym_table table;
    };
    typedef struct data_compressor_struct data_compressor, *data_compressor_ptr;

    boolean data_compressor_encode(data_compressor_ptr mycmptrs, char isymbol, data_buffer *fullbuf)...
CPUs-103 Testing Test by encoding, then decoding: [Figure: the input symbols and the symbol table feed the encoder; the encoder output goes to the decoder; the decoded symbols are compared with the input symbols to give the result.]
CPUs-104 Code inspection tests Look at the code for potential problems: Can we run past end of symbol table? What happens when the next symbol does not fill the buffer? Does fill it? Do very long encoded symbols work properly? Very short symbols? Does flush() work properly?