
1 John Kubiatowicz (http.cs.berkeley.edu/~kubitron)
CS152 Computer Architecture and Engineering, Lecture 3: Performance, Technology & Delay Modeling. September 5, 2001. John Kubiatowicz (http.cs.berkeley.edu/~kubitron) Let's start today's lecture with a recap of what you learned last week: last Wednesday, Professor Dave Patterson gave you a lecture on the MIPS instruction set architecture, and here are some of the important things to keep in mind if you are going to design a new instruction set. 9/5/01 ©UCB Fall 2001

2 Review: Salient features of MIPS I
32-bit fixed-format instructions (3 formats). 32 32-bit GPRs (R0 contains zero) and 32 FP registers (plus HI and LO), partitioned by software convention. 3-address, reg-reg arithmetic instructions. A single addressing mode for load/store: base+displacement (no indirection, no scaling); 16-bit immediates plus LUI. Simple branch conditions: compare against zero, or two registers for equality; no integer condition codes. Support for 8-bit, 16-bit, and 32-bit integers. Support for 32-bit and 64-bit floating point.

3 Review: MIPS Addressing Modes/Instruction Formats
All instructions are 32 bits wide. Register (direct): op | rs | rt | rd, operand in a register. Immediate: op | rs | rt | immed, operand in the instruction. Base+displacement: op | rs | rt | immed, operand at Memory[register + immed]. PC-relative: op | rs | rt | immed, target at PC + immed.

4 Review: When does MIPS sign extend?
When a value is sign extended, the upper (sign) bit is copied through the full width. Examples of sign extending 8 bits to 16 bits: 0x80 becomes 0xFF80; 0x70 becomes 0x0070. When is an immediate value sign extended? Arithmetic instructions (add, sub, etc.) sign extend immediates, even for the unsigned versions of the instructions! Logical instructions do not sign extend. addi $r2, $r3, -1 has 0xFFFF in the immediate field, which is extended to 0xFFFFFFFF before adding. andi $r2, $r3, -1 has 0xFFFF in the immediate field, which is extended to 0x0000FFFF before ANDing. (Kinda weird to put negative numbers in logical instructions.)
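The two extension rules can be sketched in a few lines of Python (a hedged illustration; `sign_extend` and `zero_extend` are hypothetical helper names, not part of any MIPS tooling):

```python
def sign_extend(value, from_bits, to_bits=32):
    # Copy the upper (sign) bit of the field through the full width.
    mask = (1 << to_bits) - 1
    if value & (1 << (from_bits - 1)):           # sign bit set
        value |= mask & ~((1 << from_bits) - 1)  # fill upper bits with ones
    return value & mask

def zero_extend(value, from_bits, to_bits=32):
    # Logical immediates: the upper bits are simply zero.
    return value & ((1 << from_bits) - 1)

# addi-style extension of a 0xFFFF field gives 0xFFFFFFFF (i.e. -1);
# andi-style extension of the same field gives 0x0000FFFF.
```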

5 Review: Details of the MIPS instruction set
Register zero always has the value zero (even if you try to write it) Branch/jump and link put the return addr. PC+4 into the link register (R31), also called “ra” All instructions change all 32 bits of the destination register (including lui, lb, lh) and all read all 32 bits of sources (add, and, …) The difference between signed and unsigned versions: For add and subtract: signed causes exception on overflow No difference in sign-extension behavior! For multiply and divide, distinguishes type of operation Thus, overflow can occur in these arithmetic and logical instructions: add, sub, addi it cannot occur in addu, subu, addiu, and, or, xor, nor, shifts, mult, multu, div, divu Immediate arithmetic and logical instructions are extended as follows: logical immediates ops are zero extended to 32 bits arithmetic immediates ops are sign extended to 32 bits (including addu) The data loaded by the instructions lb and lh are extended as follows: lbu, lhu are zero extended lb, lh are sign extended 9/5/01 ©UCB Fall 2001

6 Calls: Why Are Stacks So Great?
Stacking of subroutine calls & returns and their environments: suppose A calls B (A: CALL B), B calls C (B: CALL C), and C returns (C: RET). The stack of active environments grows and shrinks accordingly: A; then A,B; then A,B,C; then back to A,B; then A. Some machines provide a memory stack as part of the architecture (e.g., VAX); sometimes stacks are implemented purely by software convention (e.g., MIPS).

7 Memory Stacks Useful for stacked environments/subroutine call & return, even if an operand stack is not part of the architecture. Stacks can grow up (toward big addresses) or grow down (toward address 0), and SP can point at the last full slot or the next empty slot. How is the empty stack represented? For a stack that grows from big addresses toward little ones: Last Full convention: PUSH: decrement SP, then write to Mem(SP); POP: read from Mem(SP), then increment SP. Next Empty convention: PUSH: write to Mem(SP), then decrement SP; POP: increment SP, then read from Mem(SP).
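A minimal sketch (hypothetical Python, not from the slide) of the "grows down, SP points at the last full word" convention described above:

```python
WORD = 4  # bytes per stack slot

class DownwardStack:
    """MIPS-style software stack: grows toward lower addresses,
    SP points at the last full slot ("Big -> Little: Last Full")."""
    def __init__(self, base):
        self.mem = {}
        self.sp = base            # empty stack: SP sits one word above the first slot

    def push(self, value):
        self.sp -= WORD           # decrement SP...
        self.mem[self.sp] = value # ...then write Mem(SP)

    def pop(self):
        value = self.mem[self.sp] # read Mem(SP)...
        self.sp += WORD           # ...then increment SP
        return value
```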

8 Call-Return Linkage: Stack Frames
High memory: ARGS, then callee-saved registers (old FP, RA), then local variables; reference args and local variables at fixed (positive) offsets from FP. Below that, down to SP, the frame grows and shrinks during expression evaluation. Low memory. Many variations on stacks are possible (up/down, last pushed / next empty). Compilers normally keep scalar variables in registers, not memory!

9 MIPS: Software conventions for Registers
 0     zero   constant 0
 1     at     reserved for assembler
 2-3   v0-v1  expression evaluation & function results
 4-7   a0-a3  arguments
 8-15  t0-t7  temporaries: caller saves (callee can clobber)
16-23  s0-s7  callee saves (callee must save)
24-25  t8-t9  temporaries (cont'd)
26-27  k0-k1  reserved for OS kernel
28     gp     pointer to global area
29     sp     stack pointer
30     fp     frame pointer
31     ra     return address (HW)

10 MIPS / GCC Calling Conventions
fact:   addiu $sp, $sp, -32     # allocate a 32-byte frame
        sw    $ra, 20($sp)      # save return address
        sw    $fp, 16($sp)      # save old frame pointer
        addiu $fp, $sp, 32      # set up the new FP
        . . .
        sw    $a0, 0($fp)
        . . .
        lw    $ra, 20($sp)      # restore return address
        lw    $fp, 16($sp)      # restore old FP
        addiu $sp, $sp, 32      # pop the frame
        jr    $ra
The first four arguments are passed in registers; the result is passed in $v0/$v1.

11 Delayed Branches
        li   r3, #7
        sub  r4, r4, 1
        bz   r4, LL
        addi r5, r3, 1          # branch delay slot
        subi r6, r6, 2
LL:     slt  r1, r3, r5
In the "raw" MIPS, the instruction after the branch is executed even when the branch is taken. This is hidden by the assembler for the MIPS "virtual machine," and it allows the compiler to better utilize the instruction pipeline.

12 Is this a violation of the ISA abstraction?
Branch & Pipelines: as the sequence li r3, #7 / sub r4, r4, 1 / bz r4, LL flows through the pipeline (each instruction is fetched while its predecessor executes), the instruction after the branch (addi r5, r3, 1) occupies the delay slot, and LL: slt r1, r3, r5 is the branch target. By the end of the branch instruction, the CPU knows whether or not the branch will take place. However, it will have fetched the next instruction by then, regardless of whether or not the branch is taken. Why not execute it? Is this a violation of the ISA abstraction?

13 Performance
Purchasing perspective: given a collection of machines, which has the best performance? the least cost? the best performance/cost? Design perspective: faced with design options, which has the best performance improvement? Both require a basis for comparison and a metric for evaluation. Our goal is to understand the cost & performance implications of architectural choices.

14 Two notions of “performance”
Plane               Boeing 747    BAC/Sud Concorde
DC to Paris         6.5 hours     3 hours
Speed               610 mph       1350 mph
Passengers          470           132
Throughput (pmph)   286,700       178,200
Which has higher performance? Time to do the task (execution time, response time, latency) versus tasks per day, hour, week, sec, ns (throughput, bandwidth). Response time and throughput are often in opposition.

15 Definitions
Performance is in units of things-per-second, so bigger is better. If we are primarily concerned with response time: performance(X) = 1 / execution_time(X). "X is n times faster than Y" means n = Performance(X) / Performance(Y) = Execution_time(Y) / Execution_time(X).
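The definition is mechanical enough to encode directly; this tiny Python sketch (with hypothetical helper names) captures "performance is the reciprocal of execution time":

```python
def performance(execution_time_s):
    # Things-per-second: bigger is better.
    return 1.0 / execution_time_s

def times_faster(x_time_s, y_time_s):
    # "X is n times faster than Y": n = Perf(X) / Perf(Y) = Time(Y) / Time(X).
    return performance(x_time_s) / performance(y_time_s)

# Concorde (3 h) vs. 747 (6.5 h): n = 6.5 / 3, about 2.2.
```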

16 Example
Flying time of Concorde vs. Boeing 747? Concorde is 1350 mph / 610 mph = 2.2 times faster (= 6.5 hours / 3 hours). Throughput of Concorde vs. Boeing 747? Concorde is 178,200 pmph / 286,700 pmph = 0.62 "times faster"; Boeing is 286,700 pmph / 178,200 pmph = 1.6 "times faster". So Boeing is 1.6 times ("60%") faster in terms of throughput, while Concorde is 2.2 times ("120%") faster in terms of flying time. We will focus primarily on execution time for a single job. Lots of instructions in a program => instruction throughput is important!

17 Basis of Evaluation
Actual target workload — Pros: representative. Cons: very specific, non-portable; difficult to run or measure; hard to identify the cause of a result.
Full application benchmarks — Pros: portable; widely used; improvements useful in reality. Cons: less representative.
Small "kernel" benchmarks — Pros: easy to run, early in the design cycle. Cons: easy to "fool."
Microbenchmarks — Pros: identify peak capability and potential bottlenecks. Cons: "peak" may be a long way from application performance.

18 SPEC95
Eighteen application benchmarks (with inputs) reflecting a technical computing workload. Eight integer: go, m88ksim, gcc, compress, li, ijpeg, perl, vortex. Ten floating-point intensive: tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5. Must be run with standard compiler flags, to eliminate special undocumented incantations that may not even generate working code for real programs.

19 Metrics of performance
Each layer of the system has a natural metric. Application: answers per month, useful operations per second. Programming language / compiler: (millions of) instructions per second (MIPS), (millions of) floating-point operations per second (MFLOP/s). ISA / datapath / control: megabytes per second, cycles per second (clock rate). Function units / transistors / wires / pins: physical measures. Each metric has a place and a purpose, and each can be misused.

20 Aspects of CPU Performance
CPU time = Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)
Which design layers affect which factor?
                inst count   CPI   clock rate
Program             X
Compiler            X         X
Instr. set          X         X        X
Organization                  X        X
Technology                             X

21 CPI
"Average cycles per instruction":
CPI = (CPU time x clock rate) / instruction count = clock cycles / instruction count
CPU time = ClockCycleTime x sum over i of (CPI_i x I_i)
CPI = sum over i of (CPI_i x F_i), where F_i = I_i / instruction count (the "instruction frequency")
Invest resources where time is spent!
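The weighted-CPI bookkeeping above is easy to mis-apply; this short Python sketch (function names are mine) makes it explicit:

```python
def weighted_cpi(mix):
    # mix: list of (F_i, CPI_i) pairs, where F_i = I_i / instruction count.
    return sum(f * cpi_i for f, cpi_i in mix)

def cpu_time_s(inst_count, cpi, clock_rate_hz):
    # CPU time = instructions x cycles/instruction x seconds/cycle
    return inst_count * cpi / clock_rate_hz
```

For example, a mix of 60% single-cycle and 40% two-cycle instructions gives CPI 1.4.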

22 Amdahl's Law
Speedup due to enhancement E:
Speedup(E) = ExTime(without E) / ExTime(with E) = Performance(with E) / Performance(without E)
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:
ExTime(with E) = ((1-F) + F/S) x ExTime(without E)
Speedup(with E) = 1 / ((1-F) + F/S)
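Amdahl's Law in executable form (a hedged sketch; the function name is mine):

```python
def amdahl_speedup(f, s):
    # Fraction f of the task is accelerated by factor s; the rest is unchanged.
    return 1.0 / ((1.0 - f) + f / s)

# Accelerating half the task by 2x yields only 1.33x overall, and even an
# infinite s caps the overall speedup at 1 / (1 - f).
```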

23 Example (RISC processor)
Base machine (reg/reg), typical mix:
Op       Freq   Cycles   CPI(i)   % Time
ALU      50%    1        0.5      23%
Load     20%    5        1.0      45%
Store    10%    3        0.3      14%
Branch   20%    2        0.4      18%
                Total CPI = 2.2
How much faster would the machine be if a better data cache reduced the average load time to 2 cycles? How does this compare with using branch prediction to shave a cycle off the branch time? What if two ALU instructions could be executed at once?
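Working the slide's questions numerically, assuming the classic cycle counts for this mix (ALU 1 cycle, Load 5, Store 3, Branch 2 — an assumption, but the only standard assignment consistent with the stated total CPI of 2.2):

```python
mix = {"ALU": (0.50, 1), "Load": (0.20, 5), "Store": (0.10, 3), "Branch": (0.20, 2)}

def cpi(m):
    # Weighted CPI: sum of frequency x cycles over all instruction classes.
    return sum(f * c for f, c in m.values())

base = cpi(mix)                            # 2.2

better_cache = dict(mix, Load=(0.20, 2))   # loads now take 2 cycles
branch_pred  = dict(mix, Branch=(0.20, 1)) # one cycle shaved off branches

cache_speedup  = base / cpi(better_cache)  # 2.2 / 1.6 = 1.375
branch_speedup = base / cpi(branch_pred)   # 2.2 / 2.0 = 1.10
```

The better data cache helps more, because loads account for the largest share of cycles — exactly "invest resources where time is spent."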

24 Evaluating Instruction Sets?
Design-time metrics: ° Can it be implemented, in how long, at what cost? ° Can it be programmed? Ease of compilation? Static metrics: ° How many bytes does the program occupy in memory? Dynamic metrics: ° How many instructions are executed? ° How many bytes does the processor fetch to execute the program? ° How many clocks are required per instruction? ° How "lean" a clock is practical? Best metric: time to execute the program! Time = Inst. Count x CPI x Cycle Time. NOTE: this depends on the instruction set, processor organization, and compilation techniques.

25 Administrative Matters
HW #2/Lab #2 out tonight. We will be using 117 as the primary lab, with 119/111 as backup. Get card-key access to the labs on the lower floor of Cory. Go see if you can log into the Windows 2000 machines! Your user name is cs152-XXX, where XXX is your UNIX "named" account; I posted the current list of accounts to the newsgroup. Use your SID as the password the first time. Sections start Monday: 11:00-2:00 in 320 Soda and 2:00-4:00 in 72 Evans. TA office hours are now posted on the information page; office hours are held in 117 Cory. Want announcements directly via email? See the information page to sign up for the "cs252-announce" mailing list. (This mailing list is automatically forwarded to the newsgroup, so you do not have to sign up for the mailing list as well.) The prerequisite quiz will be Friday 9/7 during class; review session tonight (9/5), 7:00-9:00 pm, here (306 Soda). Review Chapters 1-4 and Appendices A, B of COD, Second Edition. Turn in the survey form (with picture!). Homework #1 is also due Friday 9/7 at the beginning of lecture! No homework quiz this time (the prereq quiz may contain homework material, since this is supposed to be review).

26 Finite State Machines:
System state is explicit in the representation. Transitions between states are represented as arrows, with inputs labeling the arcs. Output may be either part of the state or on the arcs. Example: a "Mod 3 machine" with three states, Alpha (output 0), Beta (output 1), and Delta (output 2), reads its input MSB first and tracks the value seen so far mod 3; for input 106 it ends in state 1, since 106 mod 3 = 1.
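The mod-3 machine can be simulated directly: feeding bits MSB first, each new bit updates the remainder as r <- (2r + bit) mod 3 (a sketch; the function name is mine):

```python
def mod3_fsm(bits):
    # States: Alpha = 0, Beta = 1, Delta = 2 (the remainder so far).
    state = 0
    for b in bits:
        # Shifting in one more MSB-first bit doubles the value and adds b.
        state = (2 * state + b) % 3
    return state

# 106 = 0b1101010, and the machine ends in state 1 (106 mod 3 == 1).
```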

27 Implementation as Combinational logic + Latch
An FSM is implemented as combinational logic plus a latch: the latch holds the current state, and the combinational logic computes the next state and the output. In a "Moore machine" the output is a function of the state alone (outputs written on the states, e.g. Alpha/0, Beta/1, Delta/2); in a "Mealy machine" the output depends on both state and input (outputs written on the arcs, e.g. 0/0, 1/0, 1/1, 0/1).

28 Performance and Technology Trends
Technology power: 1.2 x 1.2 x 1.2 = ~1.7x / year. Feature size shrinks ~10% / yr => switching speed improves ~1.2x / yr; density improves ~1.2x / yr; die area grows ~1.2x / yr. (The chart plots relative performance, 0.1 to 1000 on a log scale, of microprocessors, minicomputers, mainframes, and supercomputers from 1965 to 2000.) The lesson of RISC is to keep the ISA as simple as possible: a shorter design cycle (~3 yr) lets you fully exploit the advancing technology, with advanced branch prediction and pipeline techniques and bigger, more sophisticated on-chip caches. Recall the performance chart from the first lecture: the performance of all computers advances at a rapid pace. Computer architects like to think this rapid increase is caused by their clever ideas, but deep down almost everyone agrees it is driven by the technology behind it. Some estimates of how rapidly the technology evolves: (a) the feature size, that is, the size of a transistor, shrinks about 10% a year, and smaller transistors switch faster; (b) technology advances also let us pack ~20% more components into the same area every year; (c) last but not least, we can manufacture chips that are ~20% bigger every year. Consequently, technology alone gives us 1.2 cubed, or about 1.7 times, more computing power every year.

29 Range of Design Styles
Design styles span a spectrum of performance versus design complexity (design time): custom design, standard cell, and gate array/FPGA/CPLD. A custom ALU, custom control logic, and custom register file give the highest performance and the most compact layout; a standard ALU and standard registers connected through routing channels trade some performance (longer wires) for much less design effort; arrays of prefabricated gates and routing channels are the cheapest and quickest. Now that you have the basic components, there are several design styles you can follow. The cheapest and lowest-performance style is gate array/FPGA/PLD: the chip consists of rows of logic gates and rows of routing channels, and CAD tools from the manufacturer map your design onto the available gates and route it for you. The manufacturer usually "prefabs" the chips to the point where all the gates are already made; on receiving your design, all they add is the routing wires that customize the chip for you. Consequently, gate array design is the cheapest and has the shortest turnaround time, but also the lowest performance. At the other extreme of the spectrum is custom design, where you design everything: not just the logic, but the layout of every transistor. This is the most time-consuming style, but it has the potential for the highest performance, since you take advantage of every aspect of the technology. Standard cell design is in the middle: the manufacturer has already designed common parts such as ALUs, register files, and MUXes and placed them in a library, and all you have to do is pick these components and connect them together. One thing to keep in mind when you pick a design style is design time: it does you no good if your design is slightly faster in full custom but arrives two years later than the gate-array version. And don't forget: time is money!

30 Basic Technology: CMOS
CMOS: Complementary Metal Oxide Semiconductor, built from NMOS (N-type) and PMOS (P-type) transistors. NMOS transistor: apply a HIGH (Vdd) to its gate and it turns into a "conductor"; apply a LOW (GND) to its gate and the conduction path shuts off. PMOS transistor: apply a HIGH (Vdd) to its gate and the conduction path shuts off; apply a LOW (GND) to its gate and it turns into a "conductor." (Here Vdd = 5 V, GND = 0 V.) Most of the memory and microprocessor chips on the market today are based on this relatively simple technology. The key word to remember is "complementary," because it implies two types of transistors: PMOS is ideal for passing a high voltage ("1") while NMOS is ideal for passing a low voltage ("0"), and these are all we need to build a binary digital computer. The NMOS transistor follows common sense: apply a high voltage to its gate and it becomes a conductor, allowing current to flow in either direction; apply a low voltage and it turns off, shutting the conduction path between its two terminals. The PMOS transistor acts exactly opposite, which is why it's called complementary :-) Apply a high voltage to its gate and it turns off, giving a good insulator between the terminals; apply a low voltage and it turns on. PMOS transistors are good at conducting at high voltage levels while NMOS transistors are good at conducting at low voltage levels; the complementary structure enables fast transitions at both high and low logic states.

31 Basic Components: CMOS Inverter
The simplest component you can build from these transistors is an inverter: a PMOS transistor on top and an NMOS transistor below (the circuit symbol shows In driving both gates, with Out taken between them). The gates of the two transistors are tied together to form the input, and one terminal of each is tied together to form the output. The other terminal of the PMOS transistor connects to the power supply (Vdd), an almost infinite supply of electrons; the other terminal of the NMOS transistor connects to GND, a bottomless sink for electrons. Recall that a low gate voltage shuts off the NMOS transistor while turning the PMOS transistor into a conductor. Consequently, if you imagine a bucket at the output collecting electrons, the bucket fills up (the output charges) and you have a HIGH output. As you slowly raise the input voltage, you slowly turn off the PMOS transistor, because the PMOS transistor has a "funny" habit of turning itself into an insulator as its gate voltage approaches HIGH; at the same time, the high input turns on the NMOS transistor. This is like pulling the plug in your kitchen sink: all the electrons collected in your charge bucket drain away (the output discharges), and you have a LOW output. Any questions so far? Good, now you know everything you need to know about electronics for this class :-)

32 Basic Components: CMOS Logic Gates
NAND gate: the output is low only when both inputs are high. NOR gate: the output is high only when both inputs are low. The inverter alone cannot perform any "magic" for you; what you need are logic gates. You build a NAND gate by connecting two NMOS transistors in series (pull-down) and two PMOS transistors in parallel (pull-up). The two series NMOS transistors require BOTH inputs to be HIGH in order to form a path from the gate's output to GND and pull the output low; the two parallel PMOS transistors require only one input to be LOW to form a path from the power supply to the output and drive it HIGH. The NOR gate is the "dual" of the NAND gate: two NMOS transistors in parallel and two PMOS transistors in series. The parallel NMOS transistors pull the output low if even one input is high, while the series PMOS transistors require BOTH inputs to be low in order to pull the output HIGH. With all transistor sizes equal, a NAND gate can pass a high signal faster than a NOR gate, and a NOR gate can pass a low signal faster than a NAND gate, because a parallel conduction path is usually faster than a serial one.
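The pull-up/pull-down reasoning above can be written out explicitly (a hypothetical Python sketch, modeling each transistor network as a boolean condition):

```python
def nand(a, b):
    # Pull-down: two NMOS in series -> conducts only when a AND b are high.
    # Pull-up: two PMOS in parallel -> conducts when either input is low.
    pulled_low = a and b
    return 0 if pulled_low else 1

def nor(a, b):
    # Pull-down: two NMOS in parallel -> conducts when either input is high.
    # Pull-up: two PMOS in series -> conducts only when both inputs are low.
    pulled_low = a or b
    return 0 if pulled_low else 1
```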

33 Gate Comparison
If PMOS transistors are faster: it is OK to have PMOS transistors in series, so the NOR gate is preferred; the NOR gate is also preferred if H -> L is more critical than L -> H. If NMOS transistors are faster: it is OK to have NMOS transistors in series, so the NAND gate is preferred; the NAND gate is also preferred if L -> H is more critical than H -> L. You have all taken the logic design class, so you know that by De Morgan's law you can implement any logic function using either the NAND gate or the NOR gate exclusively. So which gate should you use? It depends on the manufacturing process: some processes give you faster PMOS transistors and some faster NMOS transistors. For CMOS, the NMOS transistor is faster, due to the higher mobility of electrons in an N-type device. If PMOS transistors are faster, it is OK to have PMOS transistors in series and the NOR gate is preferred; also, if you need to drive a signal low as quickly as possible but can afford to take longer driving it high, the NOR gate is preferred. On the other hand, if NMOS transistors are faster, you are better off with NMOS transistors in series and the NAND gate is preferred; the NAND gate is likewise preferred if you need to drive a signal high as quickly as possible.

34 Ideal versus Reality When the input goes 0 -> 1, the output goes 1 -> 0, but NOT instantly: the output voltage falls from Vdd (5 V) to 0 V over time. When the input goes 1 -> 0, the output goes 0 -> 1, again not instantly: the output voltage rises from 0 V to Vdd. Voltage does not like to change instantaneously. (The plot shows Vin and Vout against time, with logic 1 at Vdd and logic 0 at GND.) You may ask yourself why you are sitting here listening to all this transistor stuff when you are registered for a computer science class. The reason is that we want to teach you to be a computer engineer, not just a scientist, and as an engineer you need to know the limitations the non-ideal world places on you. More specifically, consider an inverter: when its input goes from 0 to 1, its output goes from 1 to 0, but the voltage drops from 5 V to 0 gradually, as shown in the plot; similarly, when the input goes from 1 to 0, the output rises from 0 V to 5 V gradually. The bottom line is that voltage, like any other physical quantity, does not like to change instantaneously.

35 Fluid Timing Model Model the circuit as plumbing: Vdd is a huge reservoir, GND is a bottomless sea, the output capacitance Cout is a tank, and the two transistors (SW1, SW2) are valves. Water <-> electrical charge; tank capacity <-> capacitance (C); water level <-> voltage; water flow <-> charge flowing (current); size of the pipes <-> strength of the transistors (G). Time to fill up the tank is proportional to C / G. The water-flow analogy is the best way to think about delay: the power supply at Vdd (5 V) is a huge reservoir, so no matter how much water you draw, you do not affect its level; GND is a bottomless sea into which you can dump all your water without affecting its level either. The inverter's output, connected to a capacitor, is the tank: just as a tank collects water, the capacitor collects electrical charge. If the tank is empty, you fill it by closing Valve 2 and opening Valve 1; once it is full, you empty it by closing Valve 1 and opening Valve 2. The time to fill or empty the tank depends on two factors: (a) the tank capacity and (b) the size of the pipes. Similarly, the time to charge the capacitor to 5 V or discharge it back to 0 V depends on the capacitance and on the strength of the two transistors.

36 Series Connection Two inverters in cascade, with capacitance C1 between them and Cout at the final output. Total propagation delay = sum of the individual delays = d1 + d2. Capacitance C1 has two components: the capacitance of the wire connecting the two gates, and the input capacitance of the second inverter. So far we have looked at the delay of one gate, but a system with only one gate is not very interesting, so consider a system of two gates, drawn for simplicity as two inverters. If we apply a step low-to-high transition at the input of the first gate, its output falls gradually, and after a delay of d1 it has dropped to Vdd/2. At some point during this transition, the PMOS transistor of the second inverter starts conducting and begins driving its output toward 5 V. By convention we measure each delay from the point where the corresponding signal (V1, then Vout) reaches Vdd/2. The last thing to note is the capacitor between the two inverters: those EE folks are pretty smart sometimes, and they draw two parallel lines at the transistor's gate for a reason, to remind us there is capacitance there. C1 therefore comes from TWO sources: the capacitance of the wire connecting the gates, and the gate capacitance of the NMOS and PMOS transistors inside the second inverter.

37 Review: Calculating Delays
Sum delays along serial paths. Delay(Vin -> V2) != Delay(Vin -> V3). Delay(Vin -> V2) = Delay(Vin -> V1) + Delay(V1 -> V2); Delay(Vin -> V3) = Delay(Vin -> V1) + Delay(V1 -> V3). Critical path = the longest among the N parallel paths. C1 = wire C + Cin of Gate 2 + Cin of Gate 3. Now consider the parallel connection; once again, inverters stand in for general logic gates. In general, the delay from Vin to V2 will NOT be the same as the delay from Vin to V3, because gates 2 and 3 differ, so you must calculate each path's delay separately to find which is slower: the slower of the two is the one you have to worry about. In general, with N parallel paths, the longest is called the critical path, a term that will come up again later in this lecture. Finally, note that capacitor C1 has three components: (a) the capacitance of the wire, (b) the input capacitance of gate 2 (the gate capacitance of its two transistors), and (c) the input capacitance of gate 3. You must include all of these to model the delay accurately.
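The two composition rules — sum along a serial path, max over parallel paths — can be sketched as (hypothetical helper names):

```python
def serial_delay(stage_delays_ns):
    # Total propagation delay of a serial chain is the sum of its stage delays.
    return sum(stage_delays_ns)

def critical_path_ns(parallel_paths):
    # The critical path is the longest of the N parallel serial paths.
    return max(serial_delay(p) for p in parallel_paths)

# e.g. Vin->V2 via G1,G2 versus Vin->V3 via G1,G3: compare the two path sums.
```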

38 Review: General C/L Cell Delay Model
A combinational cell (symbol) is fully specified by: its functional (input -> output) behavior (truth table, logic equation, VHDL); the load factor of each input; and the critical propagation delay from each input to each output for each transition: THL(A, Out) = fixed internal delay + load-dependent delay x load. The linear model composes. So far we have talked about delay qualitatively; here is how to treat it quantitatively. Imagine a multiple-input combinational logic gate with inputs A through X. To quantify the delay from input A to the output: (a) first, set all the other inputs so that a change in A causes a change in the output; for example, if this is an AND gate, set inputs B through X to 1; (b) then connect a capacitor Cout to the output and measure the delay. As you increase the capacitance, the delay keeps increasing, and at some point the curve goes non-linear; you then put a note in your notebook reminding yourself, and everybody else, NEVER to use this gate to drive anything bigger than Ccritical. For any capacitance less than Ccritical, keep the model simple by drawing a straight line through your data points. Extrapolating the line gives the delay even at zero output capacitance (an impossible situation, since all wires have capacitance); this zero intercept is called the "internal delay," and the slope of the line is the "load-dependent delay." For any output capacitance below Ccritical, the delay from input A to the output is given by this linear equation.

39 Characterize a Gate
Specify the input capacitance of each input, and, for each input-to-output path and each output transition type (H->L, L->H, H->Z, L->Z, etc.), the internal delay (ns) and the load-dependent delay (ns/fF). Example: 2-input NAND gate, delay A -> Out, with Out going low -> high. For A and B: input load (I.L.) = 61 fF. For either A -> Out or B -> Out: Tlh = 0.5 ns with slope Tlhf = 0.0021 ns/fF; Thl = 0.1 ns with its own slope Thlf. The previous slide showed how to quantify the delay from one input to the output; that is only part of characterizing a gate. Besides the delay, you must also tell the user the input capacitance of each input, and for EACH input-to-output path you need the internal and load-dependent delay for EACH output transition. Low-to-high and high-to-low are the obvious transitions; the not-so-obvious ones (high to Z, low to Z) apply to gates whose outputs can enter the high-impedance (Z) state. For the NAND gate in the CS152 library: (a) for both inputs A and B, the input capacitance is 61 fF (1 fF = 10^-15 F); (b) the internal and load-dependent delays are the same for the A-to-output and B-to-output paths. For example, for the output going low to high, the linear equation has an internal delay of 0.5 ns and a slope of 0.0021 ns per femtofarad.

40 A Specific Example: 2 to 1 MUX
B S Gate 3 Gate 2 Gate 1 Wire 1 Wire 2 Wire 0 A B Y S 2 x 1 Mux Y = (A and !S) or (B and S). Input Load (I.L.): A, B: I.L. (NAND) = 61 fF; S: I.L. (INV) + I.L. (NAND) = 50 fF + 61 fF = 111 fF. Load Dependent Delay (L.D.D.): same as Gate 3: TAYlhf = 0.0021 ns / fF, TAYhlf = 0.0020 ns / fF, TBYlhf = 0.0021 ns / fF, TBYhlf = 0.0020 ns / fF, TSYlhf = 0.0021 ns / fF, TSYhlf = 0.0020 ns / fF. Let's look at a more complicated combinational logic block. Assume we build a 2-to-1 multiplexer using 3 NAND gates and an inverter. The input capacitances for the MUX's inputs A and B are pretty straightforward. They will be the same as the NAND gate's, that is, 61 fF. The input capacitance for input S, however, is slightly more complex. S has to go to the input of the inverter AS WELL AS the input of this NAND gate. I have not told you yet, but from the data sheet I have, I found that the input capacitance of an inverter is 50 fF. Consequently, the input capacitance for S is the sum of the input capacitance of the NAND gate and the input capacitance of the inverter, that is, 111 fF. Almost twice as much as the input capacitance of input A or input B. As far as the load-dependent delay is concerned, it is rather simple: the values will be the same as the numbers we have for Gate 3, because Gate 3 is responsible for driving the output. The hard part is to calculate this BLOCK's (point to the whole thing) internal delay, that is, the delay through this entire circuit when the output capacitance is zero. In real life, we build MUXes (up to 8 inputs) with pass gates, with proper control to ensure that at most 1 is enabled onto the common output. This pass-gate MUX optimizes SPEED and area! +2 = 42 min. (Y:22) 9/5/01 ©UCB Fall 2001
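The input-load bookkeeping is simple addition over fanout; a sketch using the slide's data-sheet numbers:

```python
IL_NAND2_fF = 61.0  # NAND input capacitance, from the slide
IL_INV_fF = 50.0    # inverter input capacitance, from the slide

# A and B each drive one NAND input; S fans out to the inverter AND one NAND
# input, so its load is the sum of the two.
mux_input_load_fF = {
    "A": IL_NAND2_fF,
    "B": IL_NAND2_fF,
    "S": IL_INV_fF + IL_NAND2_fF,  # 111 fF, almost double A or B
}
```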

41 2 to 1 MUX: Internal Delay Calculation
B S Gate 3 Gate 2 Gate 1 Wire 1 Wire 2 Wire 0 Y = (A and !S) or (B and S). Internal Delay (I.D.): A to Y: I.D. G1 + (Wire 1 C + G3 Input C) * L.D.D. G1 + I.D. G3. B to Y: I.D. G2 + (Wire 2 C + G3 Input C) * L.D.D. G2 + I.D. G3. S to Y (worst case): I.D. Inv + (Wire 0 C + G1 Input C) * L.D.D. Inv + Internal Delay A to Y. We can approximate the effect of "Wire 1 C" by assuming Wire 1 has the same C as all the gate C attached to it. Let's look at the internal delay from input A to the output. This delay consists of three parts: (a) the internal delay of G1, (b) the internal delay of G3, and last but not least, (c) the product of Gate 1's load-dependent delay and the total capacitance Gate 1 needs to drive, that is, the input capacitance of Gate 3 as well as the capacitance of Wire 1. The internal delay from input B to the output is similar. The internal delay from input S to the output is the worst. In the worst-case scenario, which we have to use, this delay has five components. (a) First, it has the 3 components coming from the path through the 2 NAND gates (A -> Y). (b) Then we have to add in the internal delay of the inverter. (c) Finally, we have the delay of the inverter driving the input capacitance of the NAND gate as well as the capacitance of Wire 0. We don't know the capacitance of the wires unless we examine the layout carefully. One good rule of thumb is to add up all the input capacitance that connects to the wire and use that as the wire capacitance. In other words, we assume the wire C is the same as the total gate C. For example, we can estimate the total capacitance Gate 1 needs to drive to be 2 times the input capacitance of Gate 3. +3 = 45 min. (Y:25) 9/5/01 ©UCB Fall 2001

42 2 to 1 MUX: Internal Delay Calculation (continue)
B S Gate 3 Gate 2 Gate 1 Wire 1 Wire 2 Wire 0 Y = (A and !S) or (B and S). Internal Delay (I.D.): A to Y: I.D. G1 + (Wire 1 C + G3 Input C) * L.D.D. G1 + I.D. G3. B to Y: I.D. G2 + (Wire 2 C + G3 Input C) * L.D.D. G2 + I.D. G3. S to Y (worst case): I.D. Inv + (Wire 0 C + G1 Input C) * L.D.D. Inv + Internal Delay A to Y. Specific example: TAYlh = TPhl G1 + (2.0 * 61 fF) * TPhlf G1 + TPlh G3 = 0.1 ns + 122 fF * 0.0020 ns/fF + 0.5 ns = 0.844 ns. Let's look at a specific example, TAYlh. TAYlh means the internal delay from input A to output Y with output Y making a Low-to-High transition. In order for input A to cause Y to make a Low-to-High transition, the output of Gate 1 must go from High to Low. That's why we use TPhl, the NAND gate's internal delay with the output making a High-to-Low transition, and TPhlf, the load-dependent delay, also with the output making a H -> L transition. The 2.0 factor here approximates the extra capacitance caused by Wire 1. This is a rule-of-thumb approximation: the wire will have the same capacitance as the amount of gate capacitance attached to it. This makes sense: more gate capacitance attached to a wire means the wire has to go to more gates, so the wire probably has to be longer and has more capacitance. In any case, after we substitute all the numbers from the data sheet, we come up with an internal delay, from input A to output Y, assuming output Y is making a Low-to-High transition, of 0.844 ns. +2 = 47 min. (Y:27) 9/5/01 ©UCB Fall 2001
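The slide's TAYlh arithmetic can be checked in a few lines (TPhlf = 0.0020 ns/fF is the value implied by the 0.844 ns result):

```python
TPHL_G1_ns = 0.1             # NAND internal delay, output H -> L
TPHLF_G1_ns_per_fF = 0.0020  # NAND load-dependent delay, output H -> L (implied)
TPLH_G3_ns = 0.5             # NAND internal delay, output L -> H
G3_INPUT_fF = 61.0

# Rule of thumb: Wire 1 carries as much C as the gate C hanging on it,
# so Gate 1 drives 2 x 61 fF in total.
load_on_g1_fF = 2.0 * G3_INPUT_fF

TAYlh_ns = TPHL_G1_ns + load_on_g1_fF * TPHLF_G1_ns_per_fF + TPLH_G3_ns
```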

43 Abstraction: 2 to 1 MUX
Y S 2 x 1 Mux Gate 1 Y Gate 3 B Gate 2 S Input Load: A = 61 fF, B = 61 fF, S = 111 fF. Load Dependent Delay: TAYlhf = 0.0021 ns / fF, TAYhlf = 0.0020 ns / fF, TBYlhf = 0.0021 ns / fF, TBYhlf = 0.0020 ns / fF, TSYlhf = 0.0021 ns / fF, TSYhlf = 0.0020 ns / fF. Internal Delay: TAYlh = TPhl G1 + (2.0 * 61 fF) * TPhlf G1 + TPlh G3 = 0.1 ns + 122 fF * 0.0020 ns/fF + 0.5 ns = 0.844 ns. Fun exercises: TAYhl, TBYlh, TSYlh, TSYhl. With all these calculations, we can now abstract the 2-to-1 MUX into this 3-input combinational block. This combinational logic block will have an input capacitance of 61 fF on its A and B inputs. The S input, however, will have a much higher input capacitance, 111 fF. The load-dependent delay numbers are shown here. Finally, when you go home tonight, if you have nothing better to do, you can finish calculating the internal delays for me. +1 = 48 min. (Y:28) 9/5/01 ©UCB Fall 2001

44 CS152 Logic Elements
NAND2, NAND3, NAND4; NOR2, NOR3, NOR4; INV1x (normal inverter); INV4x (inverter with large output drive); XOR2; XNOR2; PWR: source of 1's; GND: source of 0's; fast MUXes; Dff (negative-edge-triggered D flip-flop). Here is the list of the logic elements you will be using in this class. On the first row, you have the NAND gates with 2 inputs, 3 inputs, and 4 inputs. On the second row, you have the NOR gates with 2 inputs, 3 inputs, and 4 inputs. There are two different inverters: the normal one (INV1x) and the "beefed up" version. The "beefed up" version has approximately an order of magnitude more drive capability: its load-dependent delay (TPlhf) is ~1/10 that of INV1x. The way we build an inverter with this much higher drive is to use bigger transistors. The price we pay for using big transistors is that this inverter has a much bigger input capacitance (200 fF vs. 50 fF). +2 = 57 min. (Y:37) 9/5/01 ©UCB Fall 2001

45 Storage Element’s Timing Model
[Timing diagram: D must be stable (not "don't care") from Setup before to Hold after the Clk trigger edge; Q is unknown until Clock-to-Q after the edge.] Setup time: input must be stable BEFORE the trigger clock edge. Hold time: input must REMAIN stable after the trigger clock edge. Clock-to-Q time: output cannot change instantaneously at the trigger clock edge; similar to delay in logic gates, it has two components: internal Clock-to-Q and load-dependent Clock-to-Q. Typical for this class: 1 ns setup, 0.5 ns hold. So far we have been looking at combinational logic; let's look at the timing characteristics of a storage element. The storage element you will use is a D-type flip-flop triggered on the negative clock edge. In order for the data to latch into the flip-flop correctly, the input must be stable slightly before the falling edge of the clock. This time is called the setup time. After the clock edge has arrived, the data must remain stable for a short amount of time AFTER the trigger clock edge. This is called the hold time. The output cannot change instantaneously at the trigger clock edge. The time it takes for the output to change to its new value after the clock is called the Clock-to-Q time. Similar to delay in logic gates, the Clock-to-Q time has two components: (a) the internal Clock-to-Q time, the time it takes the output to change if the output load is zero, and (b) the load-dependent Clock-to-Q time. +2 = 50 min. (Y:30) 9/5/01 ©UCB Fall 2001
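These constraints can be phrased as one check on an input's waveform relative to the trigger edge. A sketch (the function is mine; the defaults are the class's typical 1 ns setup and 0.5 ns hold):

```python
def setup_hold_ok(data_settles_ns, data_next_change_ns, clk_edge_ns,
                  setup_ns=1.0, hold_ns=0.5):
    """True if the input is stable across the whole setup/hold window.

    data_settles_ns: when the input reached its final value before the edge.
    data_next_change_ns: when the input changes again after the edge.
    """
    setup_met = data_settles_ns <= clk_edge_ns - setup_ns
    hold_met = data_next_change_ns >= clk_edge_ns + hold_ns
    return setup_met and hold_met
```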

46 Clocking Methodology [Figure: a chain of registers clocked by Clk, with combination logic between them.] All storage elements are clocked by the same clock edge. The combination logic block's inputs are updated at each clock tick; all outputs MUST be stable before the next clock tick. All of you should have taken a logic design class, so you should know how to do a synchronous design using a clock, but let's have a brief review. In this class, all your designs should have only one clock in them. Furthermore, all storage elements are clocked by the same clock edge, namely the falling clock edge. You should NOT try to use both edges of the clock, nor try to use any flip-flops that are level sensitive instead of edge sensitive. (This is reserved for real-world designers!) If you follow this clocking methodology (all storage elements are clocked ....), then the inputs to your combinational logic blocks will come from the outputs of some registers or externally. Consequently, they are updated at each clock tick. On the other side of the combination logic block, the outputs will be saved in another register. Therefore, the outputs must be stable before the next clock tick. +2 = 63 min. (Y:43) 9/5/01 ©UCB Fall 2001

47 Critical Path & Cycle Time
[Figure: register -> combination logic -> register, all clocked by Clk.] Critical path: the slowest path between any two storage devices. Cycle time is a function of the critical path; it must be greater than: Clock-to-Q + Longest Path through Combination Logic + Setup. If you follow this simple clocking methodology, which uses the SAME clock edge for all storage devices, the critical path of your design is easy (well, at least in theory) to identify. More specifically, the critical path of your design is the slowest path from one storage device to another through the combination logic. The cycle time of your design is a function of this critical path; more specifically, the cycle time must be greater than the sum of: (a) the Clock-to-Q time of the input register, (b) the longest delay through the combination logic, and (c) the setup time of the next register. The key words here are "greater than," because if you set the clock cycle time to exactly this, chances are things will work most of the time but fail occasionally; usually it will fail when you have to run your demo for your customers. The additional thing you need to worry about is clock skew. That is, due to different delays on the clock distribution network, two storage devices may end up seeing two slightly different clocks. +3 = 65 min. (Y:45) 9/5/01 ©UCB Fall 2001
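Finding the critical path is a max over the register-to-register paths; a sketch of the resulting cycle-time bound:

```python
def min_cycle_time_ns(clk_to_q_ns, reg_to_reg_path_delays_ns, setup_ns):
    """Lower bound on cycle time: Clock-to-Q + slowest combination-logic
    path + Setup.  The actual clock period must exceed this value."""
    critical_path_ns = max(reg_to_reg_path_delays_ns)
    return clk_to_q_ns + critical_path_ns + setup_ns
```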

48 Clock Skew’s Effect on Cycle Time
[Timing diagram: Clk1 at the input register, Clk2 at the output register, offset by the clock skew.] Let's look at an example here. Consider the worst-case scenario where the input register sees the clock signal Clock One (CLK1). Due to the different delays through different parts of the clock distribution network, the output register sees the clock signal Clock Two (CLK2). Here (points to Clock Skew) I have shown that Clock Two arrives at the output register slightly earlier than Clock One arrives at the input register. Consequently, the minimum cycle time for this circuit to work is the sum of: (a) the Clock-to-Q time of the input register, (b) the longest delay path through the combination logic, (c) the setup time of the output register, and (d), the purpose of this slide, the clock skew of the clock distribution network. In your homework and lab assignments, you will probably be using a relatively slow clock, so clock skew is probably not a big problem. After you graduate, you may be lucky enough to find a job working on some very high-speed digital design; then clock skew can be a major problem. (Clock skew is usually kept < 10% of the cycle time in very high-speed systems.) In those high-speed designs, if you are not careful, the sum of the Clock-to-Q time, the setup time, and the clock skew can become a major part of your cycle time. Notice that if your flip-flops have lousy Clock-to-Q and setup times and your clock distribution is so poorly designed that the clock skew is big, then even if you have the fastest logic gates in the world, you still will not have a super fast design. You can slow down the clock to "fix" a setup violation; there is not a whole lot you can do about a hold-time problem! +3 = 68 min. (Y:48) The worst-case scenario for cycle time consideration: the input register sees CLK1; the output register sees CLK2. Cycle Time - Clock Skew ≥ CLK-to-Q + Longest Delay + Setup, so Cycle Time ≥ CLK-to-Q + Longest Delay + Setup + Clock Skew. 9/5/01 ©UCB Fall 2001
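With skew folded in, the bound simply grows by the skew term; a sketch:

```python
def min_cycle_time_with_skew_ns(clk_to_q_ns, longest_path_ns, setup_ns, skew_ns):
    # Worst case: the receiving register's clock edge arrives skew_ns EARLY,
    # stealing time from the path, so the skew adds directly to the bound.
    return clk_to_q_ns + longest_path_ns + setup_ns + skew_ns
```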

49 Tricks to Reduce Cycle Time
Reduce the number of gate levels. [Figure: two equivalent circuits on inputs A, B, C, D, one with fewer gate levels.] Review Karnaugh maps for prereq quiz! Use esoteric/dynamic timing methods. Pay attention to loading: one gate driving many gates is a bad idea; avoid using a small gate to drive a long wire; use multiple stages to drive a large load. Here are some common tricks you can use to reduce the cycle time. The most obvious way is to reduce the number of logic levels. Then you should also pay attention to loading. That is, you should: (a) avoid using one small gate to drive a large number of other gates; (b) also avoid using a small gate to drive a long wire. Whenever you have to drive a large capacitance, you should use multiple stages to drive it. (c) Take advantage of the difference between the type of gate and the choice of active-high or active-low signalling convention. (d) Use advanced circuit design techniques such as dynamic circuitry and precharging. (e) Use "cycle stealing." +1 = 69 min. (Y:49) [Figure: an INV4x chain driving a large load Clarge in stages.] 9/5/01 ©UCB Fall 2001
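The staged-driver advice pays off numerically. The gate parameters below are hypothetical, chosen only to illustrate the shape of the trade-off (a big buffer with roughly 1/10 the load-dependent delay but a larger input capacitance), not CS152 data-sheet values:

```python
# Hypothetical parameters for illustration only.
SMALL_INT_ns, SMALL_LDD = 0.5, 0.0100                 # small inverter
BIG_INT_ns, BIG_LDD, BIG_IN_fF = 0.5, 0.0010, 200.0   # big buffer, ~1/10 the slope
C_LARGE_fF = 2000.0                                   # the large load to drive

# Small gate drives the big load directly:
direct_ns = SMALL_INT_ns + SMALL_LDD * C_LARGE_fF

# Small gate drives only the buffer's input; the buffer drives the load:
staged_ns = (SMALL_INT_ns + SMALL_LDD * BIG_IN_fF) + (BIG_INT_ns + BIG_LDD * C_LARGE_fF)
```

With these made-up numbers the staged path comes out to 5.0 ns against 20.5 ns direct; where the crossover sits depends entirely on the real gate parameters.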

50 How to Avoid Hold Time Violation?
[Figure: registers and combination logic, all clocked by Clk.] Hold-time requirement: the input to a register must NOT change immediately after the clock tick. This is usually easy to meet in the "edge-triggered" clocking scheme. The hold time of most FFs is <= 0 ns. CLK-to-Q + Shortest Delay Path must be greater than the Hold Time. So far our cycle time consideration has pretty much been aimed at meeting the setup time requirement. That is, we want to make sure our cycle time is LONG enough that the signal, coming from the input registers, can propagate through the combination logic and arrive at the output register at least ONE setup time before the next clock tick. Now you may ask yourself: "How about the hold time requirement?" Recall the hold time requirement states that the input to a register (points to the output register) MUST NOT change until one hold time AFTER the clock tick. This is usually easy to meet in our clocking scheme, in which all storage devices are triggered on the SAME clock edge. More specifically, if you look at this diagram carefully, you will see that as long as the sum of (a) the Clock-to-Q time of the input register and (b) the SHORTEST delay path through the combination block is more than the hold time of the output registers, then NONE of these outputs will change BEFORE one hold time after the clock tick, and we will have no hold time problem. Since the Clock-to-Q time of our storage device is at least 1.5 ns, which is much bigger than the hold time (0.5 ns), we should NEVER have any hold time violation. Well, that is, we should NOT have any hold time violation as long as we don't have ANY clock skew. +3 = 72 min. (Y:52) 9/5/01 ©UCB Fall 2001

51 Clock Skew’s Effect on Hold Time
[Timing diagram: Clk2 at the input register arrives earlier than Clk1 at the output register, offset by the clock skew.] But in the real world, there will be some clock skew. How will clock skew affect your hold time consideration? Once again, let's look at the worst-case scenario. As far as hold time consideration is concerned, the worst-case scenario occurs when the input register sees the clock signal Clock Two (CLK2), and, due to the different delays through different parts of the clock distribution network, the output register sees the clock signal Clock One (CLK1). Here (points to Clock Skew) I have shown that Clock Two arrives at the input register slightly earlier than Clock One arrives at the output register. Consequently, we have to make sure that AFTER we subtract the clock skew from the sum of (a) the Clock-to-Q time of the input register and (b) the shortest delay path through the combination logic, we STILL have a time GREATER than the hold time requirement of the output registers. +2 = 74 min. (Y:54) The worst-case scenario for hold time consideration: the input register sees CLK2; the output register sees CLK1; the fast FF2 output must not change the input to FF1 for the same clock edge. (CLK-to-Q + Shortest Delay Path - Clock Skew) > Hold Time. 9/5/01 ©UCB Fall 2001
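The hold-time inequality, skew included, is one comparison; a sketch:

```python
def hold_time_ok(clk_to_q_ns, shortest_path_ns, skew_ns, hold_ns):
    """Safe if the fastest new data, launched by an early (skewed) clock,
    still arrives after the receiving register's hold window closes."""
    return clk_to_q_ns + shortest_path_ns - skew_ns > hold_ns
```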

52 Summary Total execution time is the most reliable measure of performance. Amdahl's law: the Law of Diminishing Returns. Performance and technology trends: keep the design simple (KISS rule) to take advantage of the latest technology. CMOS inverter and CMOS logic gates. Delay modeling and gate characterization: Delay = Internal Delay + (Load Dependent Delay x Output Load). Clocking methodology and timing considerations: simplest clocking methodology: all storage elements use the SAME clock edge; Cycle Time ≥ CLK-to-Q + Longest Delay Path + Setup + Clock Skew; (CLK-to-Q + Shortest Delay Path - Clock Skew) > Hold Time. Let me summarize today's lecture. The first topic we covered today is performance and technology trends. The big lesson there is that technology is advancing so fast that the best design may be the simplest design, because the simplest design may allow you to take advantage of the most advanced technology, which may not be feasible for a complex design. The most popular technology today is CMOS, and I showed you the basic operating principles behind the CMOS inverter and the NAND and NOR gates. The second topic we covered today is delay modeling and gate characterization. The most important lesson here is that we can model the delay of any combinational logic using a linear equation. This linear equation has two components: the internal delay and the load-dependent delay. The last topic we covered today is clocking methodology. Here, I STRONGLY recommend you use the simplest clocking scheme, where ALL storage elements in your design are triggered by the SAME clock edge. Based on this simple clocking scheme, your cycle time will be the sum of: (a) the Clock-to-Q time of the input register, (b) the longest delay path through the combination logic, (c) the setup time of the output register, and (d) any possible clock skew in the system. Finally, you can avoid hold time violations if Clock-to-Q + shortest logic delay - clock skew > hold time. +3 = 77 min. (Y:57) 9/5/01 ©UCB Fall 2001

