EE204 L12-Single Cycle DP Performance
Hina Anwar Khan
EE204 Computer Architecture
Single Cycle Datapath Performance

Hina Anwar Khan Spring EE204 L12-Single Cycle DP Performance

Performance of Single-Cycle Machines

Let's assume that the operation times for the following units are:
- Memory: 2 nanoseconds (ns)
- ALU and adders: 2 ns
- Register file: 1 ns

We will assume that MUXs, control, sign-extension, PC accesses, and wires have no delay.

Which implementation is faster?
1. Every instruction operates in 1 clock cycle of fixed length.
2. Every instruction operates in 1 clock cycle of varying length.

Let's look at the time needed by each instruction (all values in ns):

Inst.    Fetch  Reg. Rd  ALU op  Memory  Reg. Wr  Total
R-Type   2      1        2       -       1        6 ns
Load     2      1        2       2       1        8 ns
Store    2      1        2       2       -        7 ns
Branch   2      1        2       -       -        5 ns
Jump     2      -        -       -       -        2 ns
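
The per-instruction totals follow directly from the unit delays given above. A minimal sketch of that bookkeeping, assuming the usual MIPS breakdown of which units each instruction class uses (the `STEPS` mapping and all names here are illustrative, not taken from the slides):

```python
# Unit delays from the slide, in nanoseconds.
MEMORY_NS = 2
ALU_NS = 2
REGFILE_NS = 1

# Delay of each step; instruction fetch and data access both use memory.
DELAY_NS = {
    "fetch": MEMORY_NS,
    "reg_read": REGFILE_NS,
    "alu": ALU_NS,
    "mem": MEMORY_NS,
    "reg_write": REGFILE_NS,
}

# Assumed steps used by each instruction class (standard MIPS breakdown).
STEPS = {
    "R-type": ["fetch", "reg_read", "alu", "reg_write"],
    "Load": ["fetch", "reg_read", "alu", "mem", "reg_write"],
    "Store": ["fetch", "reg_read", "alu", "mem"],
    "Branch": ["fetch", "reg_read", "alu"],
    "Jump": ["fetch"],
}


def instruction_latency(kind):
    """Sum the delays of every unit the instruction class passes through."""
    return sum(DELAY_NS[step] for step in STEPS[kind])


latencies = {kind: instruction_latency(kind) for kind in STEPS}

# A fixed-length clock must be long enough for the slowest instruction.
fixed_cycle_ns = max(latencies.values())

for kind, ns in latencies.items():
    print(f"{kind:7s} {ns} ns")
print("fixed cycle:", fixed_cycle_ns, "ns")
```

With these delays the slowest class is the load, so it alone sets the fixed cycle length.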

Fixed vs. Variable Cycle Length

Let's assume a program has the following instruction mix: 24% loads, 12% stores, 44% R-type, 18% branches, 2% jumps.

For the fixed cycle length, the cycle time is 8 ns, long enough for the longest instruction (load). Thus each instruction takes 8 ns to execute.

For the variable cycle length, the average CPU clock cycle is:
8*24% + 7*12% + 6*44% + 5*18% + 2*2% ≈ 6.3 ns

The variable-clock implementation is 8/6.3 ≈ 1.27 times faster, but it is extremely hard to implement.

When instructions such as multiply and divide, which can take tens of cycles, are added, the single-cycle scheme becomes too slow.
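
The weighted average above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the variable names are illustrative. Note that the unrounded average is 6.34 ns, which the slide rounds to 6.3 ns before computing the speedup.

```python
# Instruction mix and per-instruction times (ns) from the slides.
MIX = {"Load": 0.24, "Store": 0.12, "R-type": 0.44, "Branch": 0.18, "Jump": 0.02}
LATENCY_NS = {"Load": 8, "Store": 7, "R-type": 6, "Branch": 5, "Jump": 2}


def average_cycle_ns(mix, latency):
    """Average clock cycle when each instruction takes only the time it needs."""
    return sum(mix[kind] * latency[kind] for kind in mix)


fixed_ns = max(LATENCY_NS.values())              # cycle set by the slowest instruction
variable_ns = average_cycle_ns(MIX, LATENCY_NS)  # weighted by the instruction mix
speedup = fixed_ns / variable_ns

print(f"fixed cycle:    {fixed_ns} ns")
print(f"variable cycle: {variable_ns:.2f} ns")
print(f"speedup:        {speedup:.2f}")
```

The speedup printed here uses the unrounded average, so it comes out slightly below the slide's 8/6.3 figure.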

Observations on the Single Cycle Design

The single-cycle datapath is straightforward, but...
- It has to use 3 separate ALUs
- It has separate instruction and data memories
- Cycle time is determined by the worst-case path

A multi-cycle datapath might be better:
- We can reuse some of the hardware
- We can combine the memories
- Cycle time is still constant, but instructions may take differing numbers of cycles

Multi-Cycle Implementation
- Each step in execution takes 1 clock cycle
- Different instructions take different numbers of clock cycles
- A functional unit can be used more than once per instruction, as long as it is used on different clock cycles
- This reduces the hardware, since units are shared
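
The sharing rule can be illustrated with a small check: a unit may appear several times in an instruction's schedule, but never twice in the same cycle. The sketch below assumes a five-step breakdown for a load word (lw); that schedule is a hypothetical illustration, not something stated on these slides.

```python
from collections import Counter

# Hypothetical per-cycle unit usage for lw (an assumed step breakdown).
LW_SCHEDULE = [
    ["memory", "alu"],   # cycle 1: fetch the instruction, ALU computes PC + 4
    ["regfile", "alu"],  # cycle 2: read registers, ALU computes a branch target
    ["alu"],             # cycle 3: ALU computes the memory address
    ["memory"],          # cycle 4: read the data word from memory
    ["regfile"],         # cycle 5: write the loaded value to the register file
]


def conflicts(schedule):
    """Return (cycle, unit) pairs where a unit is requested twice in one cycle."""
    found = []
    for cycle, units in enumerate(schedule, start=1):
        for unit, count in Counter(units).items():
            if count > 1:
                found.append((cycle, unit))
    return found


def usage_counts(schedule):
    """Count how many cycles each unit is used in across the whole instruction."""
    return Counter(unit for units in schedule for unit in units)


print("conflicts:", conflicts(LW_SCHEDULE))
print("usage:", dict(usage_counts(LW_SCHEDULE)))
```

In this schedule the ALU is used in three different cycles and the memory in two, with no conflicts, which is what lets one ALU and one memory serve the whole instruction.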

Multicycle Datapath
- Single instruction and data memory
- Single ALU
- Registers

Multicycle Execution
- Instruction Register (IR): holds the instruction until the end of execution
- Memory Data Register (MDR)
- A Register
- B Register
- ALUOut Register

Multicycle Datapath
[Figure labels: Inst/Data Memory; Instruction Address; Data Address; Register Block; ALU; Arithmetic/branch instruction; lw/sw instruction; PC = PC + 4; Branch target address]

Multicycle Datapath

Multicycle Datapath & Control Signals

One Single ALU

One single ALU is used to perform all of the necessary functions:
- Perform an arithmetic operation on two register operands
- Add a register to a sign-extended constant, for computing memory addresses in lw/sw instructions
- Compute PC+4 to increment the PC
- Add a sign-extended, shifted offset to (PC+4) for branches

Implications of Shared Functional Units

We need to add multiplexors or expand existing multiplexors, e.g.:
- The memory unit now contains both instructions (address in PC) and data (address in ALUOut)
- The ALU must now accommodate all inputs from the previous ALU and adders

Two extra multiplexers

To enable all the actions listed for the ALU, two extra multiplexers are needed:
- A 2-to-1 mux, ALUSrcA, selects whether the first ALU input is the PC or a register
- A 4-to-1 mux, ALUSrcB, selects the second ALU input from among:
  - the register file
  - a constant 4
  - a sign-extended constant
  - a sign-extended and shifted constant
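
The two ALU input muxes can be sketched as simple selection functions. The select encodings used here (0 = PC, 1 = register for ALUSrcA; 0 to 3 in the order listed above for ALUSrcB) and the function names are assumptions for illustration; the slides do not specify them.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def alu_src_a(select, pc, reg_a):
    """2-to-1 mux: assumed encoding 0 = PC, 1 = register operand."""
    return reg_a if select else pc


def alu_src_b(select, reg_b, imm16):
    """4-to-1 mux: assumed encoding 0 = register, 1 = constant 4,
    2 = sign-extended constant, 3 = sign-extended and shifted constant."""
    extended = sign_extend16(imm16)
    inputs = [reg_b, 4, extended, extended << 2]
    return inputs[select]


# PC + 4: first input is the PC, second input is the constant 4.
next_pc = alu_src_a(0, pc=100, reg_a=7) + alu_src_b(1, reg_b=9, imm16=0)

# lw/sw address: register plus sign-extended constant (0xFFFC is -4).
address = alu_src_a(1, pc=100, reg_a=7) + alu_src_b(2, reg_b=9, imm16=0xFFFC)

# Branch target: PC plus sign-extended, shifted offset.
target = alu_src_a(0, pc=104, reg_a=7) + alu_src_b(3, reg_b=9, imm16=0x0003)

print(next_pc, address, target)
```

Each of the ALU actions listed earlier corresponds to one pair of select values, which is why no extra adders are needed.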

One single memory

One single memory is used in both the instruction fetch and data access stages. The address for this memory may come from:
- the PC register, when fetching an instruction
- the ALU output, when executing a lw/sw instruction that needs the effective memory address

=> Add a 2-to-1 mux, IorD, to select whether the memory is being accessed for instructions or for data.
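
A minimal sketch of the IorD selection, assuming the encoding 0 = instruction (address from PC) and 1 = data (address from the ALU output). The memory contents and function name are hypothetical, chosen only to show one memory serving both kinds of access.

```python
def memory_address(i_or_d, pc, alu_out):
    """2-to-1 mux: assumed encoding 0 = PC (instruction), 1 = ALU output (data)."""
    return alu_out if i_or_d else pc


# Hypothetical unified memory holding one instruction and one data word.
memory = {
    0x000: 0x8C090100,  # encoding of lw $t1, 0x100($zero)
    0x100: 42,          # the data word that the lw reads
}

pc = 0x000
alu_out = 0x100  # effective address computed by the ALU for the lw

instruction = memory[memory_address(0, pc, alu_out)]  # fetch: address from PC
data = memory[memory_address(1, pc, alu_out)]         # load: address from ALU output

print(hex(instruction), data)
```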