Presentation is loading. Please wait.

Presentation is loading. Please wait.

ECE-3056-B Quiz-2 Topic Areas John Copeland March 28, 2014.

Similar presentations


Presentation on theme: "ECE-3056-B Quiz-2 Topic Areas John Copeland March 28, 2014."— Presentation transcript:

1 ECE-3056-B Quiz-2 Topic Areas John Copeland March 28, 2014

2 2 05 Single-Cycle Datapath Asynchronous Logic –combinatorial logic Synchronous Logic – latches, registers Data Path – actual determined by control signals and mux'es Instruction Decoding – determines values of control signals 3 Types of Instruction Formats – how used to simplify decoding Opcode( bits 31-26) "funct" (bits (5-0) determines ALU operation PC Register – 3 possible inputs: pc+4, branch, jump Given a diagram like slide 05-28, or VHDL like 05-32; know how to figure which control lines are set (true) for: add, lw, sw, j, jr, beq See 05-33, 34, 35, or 37 Energy use depends on average number of gates that switch per cycle Which uses more: Register Set or ALU (during add). Given VHDL code, or combinatorial logic, for simple ALU, be able to modify to add another operation (e.g., shift, subtract, bit-wise logic, …)

3 Datapath With Jumps Added 3 Slide 05-37

4 4 06 Multi-Cycle Datapath Divide Datapath into five phases – Fetch Instr., Decode Instr., Execute (calculate), Memory R/W, Write Back - (IF, ID, EX, MEM, WB) Given a diagram like slide 06-8, 11, 17, 18, 20, 22, 23,24; know how to figure which control lines (ALUOp, ALUSrcA, MemWrite, …) are set (true) for: add, lw, sw, j, jr, beq, … Interpret a state diagram like 06-10 or 06-30 – how many clock cycles for br, add, lw. For given MIPS assembler code: how many cycles to execute. State diagram for decoding instruction - ways to implement: ROM Implementation – very inefficient PLA " – Inputs: 5 Opcode bits and 4 state bits (“funct” bits decoded apart) Outputs: 1 for each control line (control signal) Vertical Lines: Minimized by choosing similar bits for closely related functions (06-32). CISC (complex instruction set computer) – uses microcode, more ticks/instr Unexpected change in flow of control: Exception (error) vs. Interrupt (I/O) Two registers just for Exception Handling: cause and status

5 Pipelining Example Instruction memory Address 4 32 0 Add Add result Shift left 2 I n s t r u c t i o n IF/IDEX/MEMMEM/WB M u x 0 1 Add PC 0 Write data M u x 1 Registers Read data 1 Read data 2 Read register 1 Read register 2 16 Sign extend Write register Write data Read data 1 ALU result M u x ALU Zero ID/EX Data memory Address add $14, $5, $6 lw $13, 24($1) add $12, $3, $4 sub $11, $2, $3 lw $10, 20($1) RegDst 0 1 M u x Instruction [20–16] Instruction [15–11] Pipeline stage execution time Note what is happening in the register file Slide 08a-21 5

6 Pipelined Control Control signals derived from instruction – As in single-cycle implementation Pass control signals along like data Slide 08a-23

7 7 07 Performance Determined by: Algorithm & Language, Compiler, (hardware) Architecture Different Metrics, and different numbers for different "bench mark" programs SPEC2005 – performance doing a standard set of bench mark programs "Throughput" – work done per unit time "Response Time" (latency) – time to do a task (e.g., display a Web page) Trading throughput versus latency (for multitasking or parallel processing) Instructions per second = clock rate / average cycles per instruction Programs with heavy I/O or memory R/W's may depend less on CPU speed. Amdahl's Law – know definition, and do calculations like 07-18. Corollary – "make the common case fast" Increased clock rate  higher temperature and lower energy efficiency,

8 8 08a Pipelined Datapath Use clocked registers to divide Datapath's five stages – Fetch Instr., Decode Instr., Execute, Memory R/W, Write Back - (IF, ID, EX, MEM, WB) Use more clocked registers to delay control signals as instruction moves through stages (08a-23). Hazards: Structural, Control, Data Structural – need to separate Instruction and Data memory (or caches) Control – guess "to branch" or not – stall when wrong – how? Can branch be determined 2 or 3 instructions ahead - if so ? How many real br instructions are there? only bre (branch if equal) Jump/Br must have next instruction started before (kill if wrong) Data – add logic devices to "forward" data needed by next instruction. Forwarding unit – controls when to forward – from EXE or MEM stage Problem with" Load Word" and Use –requires a" nop" insertion unless order of following instructions can be swapped ("code scheduling"). MIPS can predict branch always taken (wrong when loop ends) More complicated CPU's have "dynamic" branch prediction. MIPS – Exception causes Jump to "Exception Handler” procedure in OS.

9 9 08b Pipelined – Thread Level Parallelism Flynn's Taxonomy - Single/Multiple Instructions Single/Multiple Data Multithreading – MIMD – parallel threads or LWPs (light-weight processes) Process – running program with state (register values, include PC, pipeline r's). Scheduled by OS –swapped out, restarted later. Threads have own state, stack memory (local data), but share static data Hardware assisted switching between threads – separate hardware registers instructions may be interleaved in pipeline ("Fine-Grain Multithreading") "Course-Grained Multithreading" – swap on long stall (like cache miss) Synchronization between threads – requires hardware "atomic commands" "Test and Set" without switching threads between the Test and the Set. When altering common data, thread must tell OS "start/end Critical Path" Simultaneous Multithreading on multiple cores (data paths). Amdahl's Law – increased multithreading reaches "point of diminishing return” Big time Parallelism – requires special many-core computers, high-level languages, shared and local memory. "Grid" has cores on 2-d grid of buses. Communication (assigning tasks, collecting results) – by passing messages.

10 Different levels – large, slow, and low cost  small, fast, and high cost. Hard disk (Terabyte) $ 0.25 per Gigabytemillisecond DRAM (Gigabyte) $ 20.00 per Gigabyteseveral nanoseconds Static Ram (Megabytes) $ 2000 per Gigabytefraction of nanosecond RegistersOn CPU – limited area and power Use multiple levels to get low average cost, high-speed for data being used. Multicore computers need a fast cache memory for each core Data movement takes more energy than computation What to put in cache? Temporal (recent) and Spatial (nearby) How to locate data. Direct Cache: address parts: [ tag ][ cache address or index][ byte location ] Size of cache: = 2^(no. of index bits) x 2^((no. of byte location bits) = number of blocks x block size Each block also has valid bit, dirty bit, tag. To read or write data, look at indexed block, match tag, select byte location. If not found (tag did not match), put contents in write queue if "dirty" before getting new data from memory (all byte-addresses that start with tag + index) "Write Through" immediately update memory, "Write Back" just write cache. 09a Memory 10

11 IndexVTagData 000N 001N 010Y11Mem[11010] 011N 100N 101N 110Y10Mem[10110] 111N Word addrBinary addrHit/missCache block 1610 000?000 300 011?011 1610 000?000 09a-20 09a-21 Previous State of Cache What is new State of Cache? Then This Happens Answer on 09a-21 11


Download ppt "ECE-3056-B Quiz-2 Topic Areas John Copeland March 28, 2014."

Similar presentations


Ads by Google