331 W08.1 Spring 2005
14:332:331 Computer Architecture and Assembly Language, Fall 2003, Week 8
[Adapted from Dave Patterson’s UCB CS152 slides and Mary Jane Irwin’s PSU CSE331 slides]

Heads Up
- This week's material
  - CPU performance (reading assignment: PH 4)
  - Building a MIPS datapath (reading assignment: PH 5.1-5.2)
- Next week's material
  - Single cycle datapath implementation (reading assignment: PH 5.3 and C.1 through C.2)

Performance
- Purchasing perspective: given a collection of machines, which has the
  - best performance?
  - least cost?
  - best performance / cost?
- Design perspective: faced with design options, which has the
  - best performance improvement?
  - least cost?
  - best performance / cost?
- Both require
  - a basis for comparison
  - a metric for evaluation
- Our goal is to understand the cost and performance implications of architectural choices

Two notions of "performance"

Plane              Speed      DC to Paris   Passengers   Throughput (pmph)
Boeing 747         610 mph    6.5 hours     470          286,700
BAC/Sud Concorde   1350 mph   3 hours       132          178,200

Which has higher performance?
- Time to do the task (Execution Time): execution time, response time, latency
- Tasks per day, hour, week, sec, ns... (Performance): throughput, bandwidth
Response time and throughput are often in opposition.

Definitions
- Performance is in units of things-per-second
  - bigger is better
- If we are primarily concerned with response time:
  $\text{performance}(x) = \dfrac{1}{\text{execution\_time}(x)}$
- "X is n times faster than Y" means
  $n = \dfrac{\text{Performance}(X)}{\text{Performance}(Y)}$

Example
- Time of Concorde vs. Boeing 747?
  - Concorde is 1350 mph / 610 mph = 2.2 times faster (= 6.5 hours / 3 hours)
- Throughput of Concorde vs. Boeing 747?
  - Concorde is 178,200 pmph / 286,700 pmph = 0.62 "times faster"
  - Boeing is 286,700 pmph / 178,200 pmph = 1.6 "times faster"
- Boeing is 1.6 times ("60%") faster in terms of throughput
- Concorde is 2.2 times ("120%") faster in terms of flying time
We will focus primarily on execution time for a single job.
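The ratios on this slide can be checked with a short script. This is only a sketch: the dictionary layout and the helper name `throughput_pmph` are illustrative choices, and the numbers are the ones from the plane comparison table.

```python
# Figures taken from the Boeing 747 vs. Concorde comparison slide.
planes = {
    "Boeing 747": {"speed_mph": 610, "hours": 6.5, "passengers": 470},
    "Concorde": {"speed_mph": 1350, "hours": 3.0, "passengers": 132},
}

def throughput_pmph(plane):
    """Passenger-miles per hour: passengers carried times cruising speed."""
    return plane["passengers"] * plane["speed_mph"]

boeing = planes["Boeing 747"]
concorde = planes["Concorde"]

# Response-time view: ratio of execution times (Boeing time / Concorde time).
time_ratio = boeing["hours"] / concorde["hours"]

# Throughput view: ratio of passenger-miles per hour (Boeing / Concorde).
throughput_ratio = throughput_pmph(boeing) / throughput_pmph(concorde)

print(f"Concorde is {time_ratio:.1f}x faster in flying time")
print(f"Boeing is {throughput_ratio:.1f}x faster in throughput")
```

The same two machines rank differently depending on which metric is chosen, which is the point the slide is making.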

Basis of Evaluation
- Actual Target Workload
  - Pros: representative
  - Cons: very specific; non-portable; difficult to run or measure; hard to identify cause
- Full Application Benchmarks
  - Pros: portable; widely used; improvements useful in reality
  - Cons: less representative
- Small "Kernel" Benchmarks
  - Pros: easy to run, early in design cycle
  - Cons: easy to "fool"
- Microbenchmarks
  - Pros: identify peak capability and potential bottlenecks
  - Cons: "peak" may be a long way from application performance

SPEC95
- Eighteen application benchmarks (with inputs) reflecting a technical computing workload
- Eight integer
  - go, m88ksim, gcc, compress, li, ijpeg, perl, vortex
- Ten floating-point intensive
  - tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5
- Must run with standard compiler flags
  - eliminates special undocumented incantations that may not even generate working code for real programs

Metrics of performance
- System levels (figure): Application, Programming Language, Compiler, ISA, Datapath, Control, Function Units, Transistors / Wires / Pins
- Metrics shown alongside the levels:
  - Answers per month
  - Useful operations per second
  - (millions of) instructions per second: MIPS
  - (millions of) (F.P.) operations per second: MFLOP/s
  - Megabytes per second
  - Cycles per second (clock rate)
Each metric has a place and a purpose, and each can be misused.

Aspects of CPU Performance

$\text{CPU time} = \dfrac{\text{Seconds}}{\text{Program}} = \dfrac{\text{Instructions}}{\text{Program}} \times \dfrac{\text{Cycles}}{\text{Instruction}} \times \dfrac{\text{Seconds}}{\text{Cycle}}$

Table (figure): rows Program, Compiler, Instr. Set Arch., Organization, Technology; columns instr. count, CPI, clock rate. It relates each design level to the factors of CPU time that it affects.
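As a quick illustration of the CPU time equation, here is a minimal sketch. The function name `cpu_time_seconds` and the sample workload (one billion instructions, CPI of 2.2, 1 GHz clock) are hypothetical values chosen for the example, not figures from the slides.

```python
def cpu_time_seconds(instruction_count, cpi, clock_rate_hz):
    """CPU time = instructions x cycles/instruction x seconds/cycle.

    Seconds per cycle is the reciprocal of the clock rate.
    """
    seconds_per_cycle = 1.0 / clock_rate_hz
    return instruction_count * cpi * seconds_per_cycle

# Hypothetical workload: 1e9 instructions, CPI 2.2, 1 GHz clock.
baseline = cpu_time_seconds(1_000_000_000, 2.2, 1e9)

# Halving CPI or doubling the clock rate each halve CPU time,
# provided the other two factors stay fixed.
half_cpi = cpu_time_seconds(1_000_000_000, 1.1, 1e9)
double_clock = cpu_time_seconds(1_000_000_000, 2.2, 2e9)

print(baseline, half_cpi, double_clock)
```

In practice the three factors are not independent: a change that raises the clock rate may also raise CPI, so each factor has to be re-measured after a design change.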

CPI

"Average cycles per instruction":
$\text{CPI} = \dfrac{\text{CPU Time} \times \text{Clock Rate}}{\text{Instruction Count}} = \dfrac{\text{Clock Cycles}}{\text{Instruction Count}}$

$\text{CPU time} = \text{ClockCycleTime} \times \sum_{i=1}^{n} \text{CPI}_i \times I_i$

$\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i$, where $F_i = \dfrac{I_i}{\text{Instruction Count}}$ is the "instruction frequency"

Invest resources where time is spent!

Example (RISC processor)

Base Machine (Reg / Reg), typical mix:

Op       Freq   Cycles   CPI(i)   % Time
ALU      50%    1        0.5      23%
Load     20%    5        1.0      45%
Store    10%    3        0.3      14%
Branch   20%    2        0.4      18%
Total                    2.2

- How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
- How does this compare with using branch prediction to shave a cycle off the branch time?
- What if two ALU instructions could be executed at once?
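One way to work through the three questions is to recompute the weighted CPI for each variant. The sketch below assumes that instruction count and clock rate are unchanged, so speedup is just the ratio of CPIs. Modeling "two ALU instructions at once" as an effective 0.5 cycles per ALU instruction is an assumption, and the helper names are illustrative.

```python
# (frequency, cycles) per instruction class, from the typical-mix table.
base_mix = {
    "ALU": (0.50, 1),
    "Load": (0.20, 5),
    "Store": (0.10, 3),
    "Branch": (0.20, 2),
}

def average_cpi(mix):
    """Weighted CPI: sum over classes of frequency x cycles."""
    return sum(freq * cycles for freq, cycles in mix.values())

def with_cycles(mix, op, cycles):
    """Copy of the mix with one class's cycle count replaced."""
    changed = dict(mix)
    changed[op] = (mix[op][0], cycles)
    return changed

base_cpi = average_cpi(base_mix)

# Same instruction count and clock rate assumed, so speedup = old CPI / new CPI.
speedup_load = base_cpi / average_cpi(with_cycles(base_mix, "Load", 2))
speedup_branch = base_cpi / average_cpi(with_cycles(base_mix, "Branch", 1))
# Assumption: dual ALU issue modeled as 0.5 cycles per ALU instruction.
speedup_alu = base_cpi / average_cpi(with_cycles(base_mix, "ALU", 0.5))

for name, s in [("load", speedup_load), ("branch", speedup_branch), ("alu", speedup_alu)]:
    print(f"{name}: {s:.3f}x")
```

Under these assumptions the cache improvement gives the largest gain, because loads account for the largest share of execution time.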

Amdahl's Law

Speedup due to enhancement E:
$\text{Speedup}(E) = \dfrac{\text{ExTime w/o } E}{\text{ExTime w/ } E} = \dfrac{\text{Performance w/ } E}{\text{Performance w/o } E}$

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:
$\text{ExTime}(\text{with } E) = \left((1-F) + \dfrac{F}{S}\right) \times \text{ExTime}(\text{without } E)$
$\text{Speedup}(\text{with } E) = \dfrac{1}{(1-F) + F/S}$
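A minimal sketch of the formula follows. The function name `amdahl_speedup` is illustrative. The sample values reuse the load case from the RISC example slide: loads take 1.0 / 2.2 of execution time, and cutting load latency from 5 cycles to 2 speeds that fraction up by 5 / 2.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the time is sped up by `factor`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)

# Load case from the RISC example: F = 1.0 / 2.2, S = 5 / 2.
load_speedup = amdahl_speedup(1.0 / 2.2, 5 / 2)

# Upper bound: even an infinitely fast enhancement cannot beat 1 / (1 - F).
limit = 1.0 / (1.0 - 1.0 / 2.2)

print(f"speedup = {load_speedup:.3f}, limit = {limit:.3f}")
```

The result matches the speedup obtained by recomputing CPI directly (2.2 / 1.6), which is a useful cross-check on both methods.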

Summary: Evaluating Instruction Sets?
- Design-time metrics:
  - Can it be implemented, in how long, at what cost?
  - Can it be programmed? Ease of compilation?
- Static metrics:
  - How many bytes does the program occupy in memory?
- Dynamic metrics:
  - How many instructions are executed?
  - How many bytes does the processor fetch to execute the program?
  - How many clocks are required per instruction?
  - How "lean" a clock is practical?
- Best metric: time to execute the program!
NOTE: this depends on the instruction set, processor organization, and compilation techniques.
Figure labels: CPI, Inst. Count, Cycle Time

Review: Design Principles
- Simplicity favors regularity
  - fixed size instructions: 32 bits
  - only three instruction formats
- Good design demands good compromises
  - three instruction formats
- Smaller is faster
  - limited instruction set
  - limited number of registers in register file
  - limited number of addressing modes
- Make the common case fast
  - arithmetic operands from the register file (load-store machine)
  - allow instructions to contain immediate operands

The Processor: Datapath & Control
- We're ready to look at an implementation of the MIPS
- Simplified to contain only:
  - memory-reference instructions: lw, sw
  - arithmetic-logical instructions: add, sub, and, or, slt
  - control flow instructions: beq, j
- Generic implementation:
  - use the program counter (PC) to supply the instruction address and fetch the instruction from memory (and update the PC)
  - decode the instruction (and read registers)
  - execute the instruction
- All instructions (except j) use the ALU after reading the registers
  - Why? memory-reference? arithmetic? control flow?
Figure: Fetch (PC = PC+4) -> Decode -> Exec
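The fetch, decode, execute loop described above can be sketched as a tiny interpreter for the same instruction subset. This is a behavioral illustration only, not the hardware datapath: instructions are Python tuples rather than 32-bit encodings, memory is a dictionary keyed by byte address, and the function name `run` and the sample program are hypothetical.

```python
def run(program, regs, mem, max_steps=100):
    """Interpret (op, a, b, c) tuples; the PC is a byte address."""
    pc = 0
    for _ in range(max_steps):
        if pc // 4 >= len(program):
            break
        op, a, b, c = program[pc // 4]  # fetch: PC supplies the address
        pc += 4                         # update the PC to the next instruction
        # decode (select by op) and execute
        if op == "add":
            regs[a] = regs[b] + regs[c]
        elif op == "sub":
            regs[a] = regs[b] - regs[c]
        elif op == "and":
            regs[a] = regs[b] & regs[c]
        elif op == "or":
            regs[a] = regs[b] | regs[c]
        elif op == "slt":
            regs[a] = int(regs[b] < regs[c])
        elif op == "lw":                # lw a, c(b)
            regs[a] = mem[regs[b] + c]
        elif op == "sw":                # sw a, c(b)
            mem[regs[b] + c] = regs[a]
        elif op == "beq":               # c is a word offset from the updated PC
            if regs[a] == regs[b]:
                pc += c * 4
        elif op == "j":                 # a is a word index into the program
            pc = a * 4
        else:
            raise ValueError(f"unknown op {op!r}")
    return regs, mem

program = [
    ("lw", 1, 0, 0),    # r1 = mem[r0 + 0]
    ("lw", 2, 0, 4),    # r2 = mem[r0 + 4]
    ("add", 3, 1, 2),   # r3 = r1 + r2
    ("sw", 3, 0, 8),    # mem[r0 + 8] = r3
    ("beq", 0, 0, 1),   # always taken: skip the next instruction
    ("add", 3, 3, 3),   # skipped
    ("sub", 4, 2, 1),   # r4 = r2 - r1
]
regs, mem = run(program, [0] * 8, {0: 5, 4: 7})
print(regs, mem)
```

Note that lw, sw, the arithmetic instructions, and beq all perform an addition, subtraction, or comparison on register values, which is a hint toward the slide's question about why they use the ALU.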

Abstract Implementation View
- Two types of functional units:
  - elements that operate on data values (combinational)
  - elements that contain state (sequential)
- Single cycle operation
- Split memory (Harvard) model: one memory for instructions and one for data
Figure: PC, Instruction Memory (Address, Instruction), Register File (Reg Addr, Write Data, Read Data), ALU, Data Memory (Address, Write Data, Read Data)

Clocking Methodologies
- Clocking methodology defines when signals can be read and when they can be written
  - clock rate = 1 / (cycle time)
  - e.g., 10 nsec cycle time = 100 MHz clock rate; 1 nsec cycle time = 1 GHz clock rate
- State element design choices
  - level sensitive latch
  - master-slave and edge-triggered flipflops
Figure: clock waveform showing the falling (negative) edge, the rising (positive) edge, and the cycle time

Review: State Elements
- Set-reset latch

  R   S   Q(t+1)   !Q(t+1)
  1   0   0        1
  0   1   1        0
  0   0   Q(t)     !Q(t)
  1   1   0        0

- Level sensitive D latch
  - latch is transparent when clock is high (copies input to output)
Figure: set-reset latch (inputs R, S; outputs Q, !Q) and D latch (inputs clock, D; outputs Q, !Q)

Review: State Elements, cont'd
- Race problem with latch-based design
- Consider the case when D-latch0 holds a 0 and D-latch1 holds a 1 and you want to transfer the contents of D-latch0 to D-latch1 and vice versa
  - must have the clock high long enough for the transfer to take place
  - must not leave the clock high so long that the transferred data is copied back into the original latch
- Two-sided clock constraint
Figure: D-latch0 and D-latch1 cross-connected and driven by a common clock

Review: State Elements, cont'd
- Solution is to use flipflops that change state (Q) only on a clock edge (master-slave)
  - master (first D-latch) copies the input when the clock is high (the slave (second D-latch) is locked in its memory state and the output does not change)
  - slave copies the master when the clock goes low (the master is now locked in its memory state, so changes at the input are not loaded into the master D-latch)
- One-sided clock constraint
  - must have the clock cycle time long enough to accommodate the worst case delay path
Figure: master D-latch feeding slave D-latch; inputs D and clock, outputs Q and !Q
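The master-slave behavior can be illustrated with a small behavioral model. This is a sketch that ignores gate delays and setup/hold times; the class names `DLatch` and `MasterSlaveFlipFlop` and the helper `swap_one_cycle` are illustrative. It replays the swap scenario from the race-problem slide and shows that the exchange completes without the data being copied back.

```python
class DLatch:
    """Level-sensitive latch: transparent while enable is true."""
    def __init__(self, q=0):
        self.q = q

    def update(self, d, enable):
        if enable:
            self.q = d
        return self.q

class MasterSlaveFlipFlop:
    """Master is transparent while clock is high, slave while clock is low."""
    def __init__(self, q=0):
        self.master = DLatch(q)
        self.slave = DLatch(q)

    @property
    def q(self):
        return self.slave.q

    def update(self, d, clock):
        self.master.update(d, clock == 1)
        self.slave.update(self.master.q, clock == 0)
        return self.q

def swap_one_cycle(ff0, ff1):
    """Cross-connect two flip-flops and apply one high phase, then one low phase."""
    for clock in (1, 0):
        # Sample both outputs before updating, as the wires would carry them.
        d0, d1 = ff1.q, ff0.q
        ff0.update(d0, clock)
        ff1.update(d1, clock)
    return ff0.q, ff1.q

ff0 = MasterSlaveFlipFlop(0)
ff1 = MasterSlaveFlipFlop(1)
result = swap_one_cycle(ff0, ff1)
print(result)
```

Because the visible outputs only change once the clock goes low, each master sees a stable input during the whole high phase, so there is no upper limit on how long the clock may stay high.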

Our Implementation
- An edge-triggered methodology
- Typical execution
  - read contents of some state elements
  - send values through some combinational logic
  - write results to one or more state elements
- Assumes state elements are written on every clock cycle; if not, need an explicit write control signal
  - write occurs only when both the write control is asserted and the clock edge occurs
Figure: State element 1 -> Combinational logic -> State element 2, spanning one clock cycle

Fetching Instructions
- Fetching instructions involves
  - reading the instruction from the Instruction Memory
  - updating the PC to hold the address of the next instruction
- PC is updated every cycle, so it does not need an explicit write control signal
- Instruction Memory is read every cycle, so it doesn't need an explicit read control signal
Figure: PC drives the Instruction Memory Read Address; an adder computes PC + 4

Decoding Instructions
- Decoding instructions involves
  - sending the fetched instruction's opcode and function field bits to the control unit
  - reading two values from the Register File
    - Register File addresses are contained in the instruction
Figure: Instruction feeding the Control Unit and the Register File (Read Addr 1, Read Addr 2, Write Addr, Write Data; outputs Read Data 1, Read Data 2)

Executing R Format Operations
- R format operations (add, sub, slt, and, or)
  - perform the indicated (by op and funct) operation on values in rs and rt
  - store the result back into the Register File (into location rd)
- Note that the Register File is not written every cycle (e.g. sw), so we need an explicit write control signal for the Register File

R-type:
  Bits    31-26   25-21   20-16   15-11   10-6    5-0
  Field   op      rs      rt      rd      shamt   funct

Figure: Register File (Read Addr 1, Read Addr 2, Write Addr, Write Data; Read Data 1, Read Data 2) feeding the ALU (outputs: overflow, zero); control signals ALU control and RegWrite
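To make the R-type layout concrete, the sketch below extracts the six fields from a 32-bit instruction word with shifts and masks. The function name `decode_r_type` is illustrative. The sample word 0x02324020 is the standard encoding of `add $t0, $s1, $s2` (rs = 17, rt = 18, rd = 8, funct = 0x20).

```python
def decode_r_type(word):
    """Split a 32-bit MIPS R-type instruction word into its fields."""
    return {
        "op": (word >> 26) & 0x3F,     # bits 31-26
        "rs": (word >> 21) & 0x1F,     # bits 25-21
        "rt": (word >> 16) & 0x1F,     # bits 20-16
        "rd": (word >> 11) & 0x1F,     # bits 15-11
        "shamt": (word >> 6) & 0x1F,   # bits 10-6
        "funct": word & 0x3F,          # bits 5-0
    }

fields = decode_r_type(0x02324020)  # add $t0, $s1, $s2
print(fields)
```

In the datapath, rs and rt go to the Register File read address ports, rd goes to the write address port, and op and funct go to the control logic.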

Executing Load and Store Operations
- Load and store operations
  - compute a memory address by adding the base register (in rs) to the 16-bit signed offset field in the instruction
    - base register was read from the Register File during decode
    - offset value in the low order 16 bits of the instruction must be sign extended to create a 32-bit signed value
  - store value, read from the Register File during decode, must be written to the Data Memory
  - load value, read from the Data Memory, must be stored in the Register File

I-Type:
  Bits    31-26   25-21   20-16   15-0
  Field   op      rs      rt      address offset
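The address calculation can be sketched in a few lines. The names `sign_extend16` and `effective_address` are illustrative, and the base address used in the example is an arbitrary assumed value.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def effective_address(base, offset16):
    """Base register plus sign-extended offset, wrapped to 32 bits."""
    return (base + sign_extend16(offset16)) & 0xFFFFFFFF

base = 0x10010010                       # assumed base register contents
below = effective_address(base, 0xFFFC)  # offset field encodes -4
above = effective_address(base, 0x0008)  # offset field encodes +8
print(hex(below), hex(above))
```

Without sign extension, the offset 0xFFFC would be treated as +65532 and the access would land far from the intended location.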

Executing Load and Store Operations, cont'd
Figure: load/store datapath. Register File (Read Addr 1, Read Addr 2, Write Addr, Write Data; Read Data 1, Read Data 2), Sign Extend, ALU (outputs: overflow, zero), Data Memory (Address, Write Data, Read Data); control signals RegWrite, ALU control, MemWrite, MemRead

Executing Branch Operations
- Branch operations have to
  - compare the operands read from the Register File during decode (rs and rt values) for equality (zero ALU output)
  - compute the branch target address by adding the updated PC to the sign-extended 16-bit signed offset field in the instruction
    - "base register" is the updated PC
    - offset value in the low order 16 bits of the instruction must be sign extended to create a 32-bit signed value and then shifted left 2 bits to turn the word offset into a byte offset (so the target is a word-aligned address)

I-Type:
  Bits    31-26   25-21   20-16   15-0
  Field   op      rs      rt      address offset
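A short sketch of the branch target calculation follows. The function names and the sample PC value are illustrative assumptions; "updated PC" is taken to mean PC + 4, as on the instruction fetch slide.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def branch_target(pc, offset16):
    """Updated PC (PC + 4) plus the sign-extended offset shifted left by 2."""
    updated_pc = pc + 4
    return (updated_pc + (sign_extend16(offset16) << 2)) & 0xFFFFFFFF

def next_pc_for_beq(pc, rs_value, rt_value, offset16):
    """Take the branch only when the two register values are equal."""
    if rs_value == rt_value:
        return branch_target(pc, offset16)
    return pc + 4

pc = 0x00400000                       # assumed address of the beq instruction
forward = branch_target(pc, 0x0003)   # three words past the updated PC
backward = branch_target(pc, 0xFFFF)  # offset -1: back to the beq itself
print(hex(forward), hex(backward))
```

In hardware the equality test is done by subtracting the operands in the ALU and checking its zero output, while a separate adder computes the target in parallel.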

Executing Branch Operations, cont'd
Figure: branch datapath. Register File (Read Addr 1, Read Addr 2, Write Addr, Write Data; Read Data 1, Read Data 2), ALU (zero output, sent to branch control logic), ALU control, Sign Extend (16 to 32 bits), Shift left 2, and an adder that combines PC + 4 with the shifted offset to produce the branch target address

Executing Jump Operations
- Jump operations have to
  - replace the lower 28 bits of the PC with the lower 26 bits of the fetched instruction shifted left by 2 bits

J-Type:
  Bits    31-26   25-0
  Field   op      jump target address

Figure: PC, Instruction Memory (Read Address), adder (PC + 4), Shift left 2 (26 bits in, 28 bits out), combined with the upper 4 bits of the PC to form the jump address
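The jump address formation can be sketched with masks and shifts. The function name `jump_target` and the sample values are illustrative. The word 0x08100000 has opcode 2 (the MIPS `j` opcode) and a 26-bit target field of 0x0100000.

```python
def jump_target(pc, instruction):
    """Keep the upper 4 bits of pc; fill the lower 28 from the target field << 2."""
    target_field = instruction & 0x03FFFFFF  # low 26 bits of the instruction
    return (pc & 0xF0000000) | (target_field << 2)

instruction = 0x08100000               # opcode 2 (j), target field 0x0100000
low_region = jump_target(0x00400008, instruction)
high_region = jump_target(0x90000000, instruction)
print(hex(low_region), hex(high_region))
```

Because only 28 bits come from the instruction, a single jump can reach targets only within the current 256 MB region selected by the upper 4 PC bits.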

Our Simple Control Structure
- We wait for everything to settle down
  - ALU might not produce the "right answer" right away
  - we use write signals along with the clock edge to determine when to write (to the Register File and the Data Memory)
- Cycle time is determined by the length of the longest path
We are ignoring some details like register setup and hold times.
