CS152 Computer Architecture and Engineering Lecture 4 Cost and Design


1 CS152 Computer Architecture and Engineering Lecture 4 Cost and Design
September 12, 2001 John Kubiatowicz (http.cs.berkeley.edu/~kubitron) lecture slides: 9/12/01 ©UCB Fall 2001

2 Review: Performance and Technology Trends
(Performance chart from the first lecture: microprocessors, minicomputers, mainframes, and supercomputers, 1965-2000.) Feature size shrinks about 10% per year => switching speed improves about 1.2x per year. Density improves about 1.2x per year. Die area grows about 1.2x per year. Technology power: 1.2 x 1.2 x 1.2 ≈ 1.7x per year. The RISC lesson is to keep the ISA as simple as possible: a shorter design cycle (~3 years) lets you fully exploit the advancing technology, along with advanced branch prediction and pipeline techniques and bigger, more sophisticated on-chip caches. Recall the performance chart from the first lecture: the performance of all classes of computers advances at a rapid pace. Computer architects like to think this rapid increase in performance is caused by their clever ideas, but deep down almost everyone agrees that this rapid performance growth is driven by the technology behind it. Here are some estimates of how rapidly the technology is evolving: (a) the feature size, that is, the size of a transistor, shrinks about 10% a year, and smaller transistors switch faster; (b) technology advances also let us pack about 20% more components into the same area every year; (c) last but not least, we can manufacture chips that are about 20% bigger every year. Consequently, advances in technology give us 1.2 to the power of 3, or about 1.7 times, more computing power every year.

3 Review: Characterize a Gate
Input capacitance for each input. For each input-to-output path, and for each output transition type (H->L, L->H, H->Z, L->Z, etc.): internal delay (ns) and load-dependent delay (ns/fF). Example: 2-input NAND gate, delay A -> Out, Out: Low -> High. On the last slide, I showed you how to quantify the delay from one input to the output. That is only part of the story as far as characterizing a logic gate is concerned. Besides the delay, you also need to tell the user the input capacitance of each input. As far as delay is concerned, we need to specify the delay from EACH input to the output. Furthermore, for EACH input-to-output path, we also need to specify the internal and load-dependent delay for EACH output transition. The output can go from low to high or high to low; those are the obvious ones. The not-so-obvious ones (high to Z, low to Z) are for logic gates whose outputs can go to the high-impedance (Z) state. Let's look at the NAND gates in the CS152 library: (a) for both inputs A and B, the input capacitance is 61 fF (1 fF = 10^-15 F); (b) the internal and load-dependent delays are the same for either the A-to-output or the B-to-output path. For example, when the output goes from low to high, the linear equation has an internal delay of 0.5 ns and a slope of 0.0021 ns per femtofarad. For A and B: Input Load (I.L.) = 61 fF. For either A -> Out or B -> Out: Tlh = 0.5 ns, Tlhf = 0.0021 ns/fF (the slope); Thl = 0.1 ns, Thlf = … ns/fF.

4 Review: General C/L Cell Delay Model
A combinational cell (symbol) is fully specified by: its functional (input -> output) behavior — truth table, logic equation, VHDL; the load factor of each input; and the critical propagation delay from each input to each output for each transition: THL(A, o) = fixed internal delay + load-dependent delay x load. The linear model composes. So far we have been talking about delay qualitatively; now I will show you how to look at delay quantitatively. Imagine you have a multiple-input combinational logic gate. To quantify the delay from input A to the output, this is what we do: (a) first, set all other inputs so that a change in A causes a change in the output — for example, if this is an AND gate, set inputs B through X to 1; (b) then connect a capacitor to the output of the gate and measure the delay. As you increase the capacitance of this capacitor, you will notice that the delay keeps increasing as well. At some point (where the curve starts going non-linear), you will say: "Well, if this gate has to drive any capacitance bigger than this, we are in trouble." You then put a note in your on-line notebook reminding yourself and everybody to NEVER try to use this gate to drive anything bigger than Ccritical. For any capacitance less than Ccritical, you can keep the model simple by drawing a straight line through your data points. By extrapolating the linear plot, you can find the delay even with zero capacitance at the output (an impossible situation, since all wires have capacitance); this zero intercept is called the "internal delay." The slope of the line is called the "load-dependent delay." For any output capacitance less than the infamous Ccritical, you can quantify the delay from input A to the output with this linear equation.
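The linear delay model above can be sketched in a few lines of Python. This is an illustration, not part of the CS152 tools; the NAND2 numbers (0.5 ns internal, 0.0021 ns/fF slope) come from the previous slide, and the `c_critical_ff` guard is an assumption reflecting the "never drive more than Ccritical" rule.

```python
def gate_delay(internal_ns, slope_ns_per_ff, c_load_ff, c_critical_ff=None):
    """Linear gate-delay model: delay = internal delay + load-dependent delay x load.
    The model is only valid for loads below the gate's critical capacitance."""
    if c_critical_ff is not None and c_load_ff > c_critical_ff:
        raise ValueError("load exceeds Ccritical; the linear model no longer applies")
    return internal_ns + slope_ns_per_ff * c_load_ff

# NAND2 low-to-high numbers from the lecture: Tlh = 0.5 ns, slope = 0.0021 ns/fF.
# Driving a 100 fF load gives 0.5 + 0.0021 * 100 = 0.71 ns.
d = gate_delay(0.5, 0.0021, 100.0)
```

With zero load the function returns the internal delay alone, which matches the "zero intercept" reading of the model.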

5 Review: More complicated gates
2-to-1 MUX (inputs A and B, select S, output Y). Three components: input load; load-dependent delay; and internal delays, one for each input-path/output-transition pair. Input load: A = 61 fF, B = 61 fF, S = 111 fF. Load-dependent delay: TAYlhf, TAYhlf, TBYlhf, TBYhlf, TSYlhf, TSYhlf = … ns/fF. Internal delay: TAYlh = 0.844 ns, TBYlh = 0.844 ns. Fun exercises: TAYhl, TBYhl, TSYlh, TSYhl. How do we compute these numbers? With all these calculations, we can now abstract the 2-to-1 MUX into this 3-input combinational block. This combinational logic block has an input capacitance of 61 fF on its A and B inputs. The S input, however, has a much higher input capacitance, 111 fF. The load-dependent delay numbers are shown here. Finally, when you go home tonight, if you have nothing better to do, you can finish calculating the internal delays for me.

6 2 to 1 MUX: Input Load and Load Dependent Delay
Implementation: Y = (A and !S) or (B and S), built from three NAND gates (Gate 1, Gate 2, Gate 3, joined by Wires 0-2) and an inverter. Input load (I.L.): A, B: I.L.(NAND) = 61 fF; S: I.L.(INV) + I.L.(NAND) = 50 fF + 61 fF = 111 fF. Load-dependent delay (L.D.D.): same as Gate 3 — TAYlhf, TAYhlf, TBYlhf, TBYhlf, TSYlhf, TSYhlf = … ns/fF. Let's look at a more complicated combinational logic block. Assume we build a 2-to-1 multiplexer using three NAND gates and an inverter. The input capacitances for the MUX's inputs A and B are straightforward: they are the same as the NAND gate's, that is, 61 fF. The input capacitance for input S, however, is slightly more complex. S has to go to the input of the inverter AS WELL AS the input of one NAND gate. I have not told you yet, but from the data sheet I have, the input capacitance of an inverter is 50 fF. Consequently, the input capacitance for S is the sum of the input capacitance of the NAND gate and the input capacitance of the inverter, that is, 111 fF — almost twice the input capacitance of inputs A and B. As far as the load-dependent delay is concerned, it is rather simple: it is the same as the numbers we have for Gate 3, because Gate 3 is responsible for driving the output. The hard part is calculating this whole BLOCK's internal delay — that is, the delay through the entire circuit when the output capacitance is zero. In real life, we build MUXes (up to 8 inputs) with pass gates, with proper control to ensure that at most one is enabled onto the common output; this pass-gate MUX optimizes speed and area!

7 2 to 1 MUX: Internal Delay Calculation
Y = (A and !S) or (B and S). Internal delay (I.D.): A to Y: I.D.(G1) + (Wire 1 C + G3 input C) x L.D.D.(G1) + I.D.(G3). B to Y: I.D.(G2) + (Wire 2 C + G3 input C) x L.D.D.(G2) + I.D.(G3). S to Y (worst case): I.D.(Inv) + (Wire 0 C + G1 input C) x L.D.D.(Inv) + internal delay A to Y. We can approximate the effect of Wire 1's capacitance by assuming Wire 1 has the same capacitance as all the gate capacitance attached to it. Specific example: TAYlh = TPhl(G1) + (2.0 x 61 fF) x TPhlf(G1) + TPlh(G3) = 0.1 ns + 122 fF x TPhlf(G1) + 0.5 ns = 0.844 ns. Let's look at the internal delay from input A to the output. This delay consists of three parts: (a) the internal delay of G1; (b) the internal delay of G3; and, last but not least, (c) the product of Gate 1's load-dependent delay and the total capacitance Gate 1 needs to drive — that is, the input capacitance of Gate 3 plus the capacitance of Wire 1. The internal delay from input B to the output is similar. The internal delay from input S to the output is the worst. In the worst-case scenario, which is the one we have to use, this delay has five components: (a) the three components of the path through the two NAND gates (A -> Y); (b) the internal delay of the inverter; and (c) the delay of the inverter driving the input capacitance of the NAND gate plus the capacitance of Wire 0. We don't know the capacitance of the wires unless we examine the layout carefully. One good rule of thumb is to add up all the input capacitance connected to the wire and then multiply by two — in other words, we assume the wire capacitance is the same as the total gate capacitance. For example, we can estimate the total capacitance Gate 1 needs to drive as two times the input capacitance of Gate 3.
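The A -> Y arithmetic above can be checked numerically. One caveat: the NAND high-to-low slope TPhlf(G1) did not survive in the slide, so the value below (0.002 ns/fF) is an assumption, chosen because it reproduces the TAYlh = 0.844 ns figure quoted on the previous slide; the other numbers (0.1 ns, 0.5 ns, 61 fF, the wire-C ≈ gate-C rule of thumb) are from the slides.

```python
# Internal delay A -> Y for the NAND-based 2-to-1 MUX:
#   I.D.(G1) + (Wire 1 C + G3 input C) * L.D.D.(G1) + I.D.(G3)
# The wire is assumed to have the same capacitance as the gate inputs it
# connects to, so Gate 1 drives roughly 2 * 61 fF.
THL_G1 = 0.1        # ns, NAND internal delay, high-to-low (from the slides)
TLH_G3 = 0.5        # ns, NAND internal delay, low-to-high (from the slides)
THLF_G1 = 0.002     # ns/fF, ASSUMED slope; picked to match the quoted 0.844 ns
G3_INPUT_C = 61.0   # fF, NAND input capacitance

load = 2.0 * G3_INPUT_C            # wire C ~= attached gate C rule of thumb
taylh = THL_G1 + load * THLF_G1 + TLH_G3
```

Running this gives 0.844 ns, consistent with the TAYlh value on the earlier slide.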

8 CS152 Logic Elements NAND2, NAND3, NAND4 NOR2, NOR3, NOR4
INV1x (normal inverter), INV4x (inverter with large output drive), XOR2, XNOR2, PWR (source of 1's), GND (source of 0's), fast MUXes, and a negative-edge-triggered D flip-flop. Here is the list of the logic elements you will be using in this class. On the first row, you have the NAND gates with 2, 3, and 4 inputs. On the second row, you have the NOR gates with 2, 3, and 4 inputs. There are two different inverters: the normal one (INV1x) and the "beefed up" version. The "beefed up" version has approximately an order of magnitude more drive capability: its load-dependent delay (TPlh) is about 1/10 that of INV1x. The way we build an inverter with this much higher drive is to use bigger transistors. The price we pay for using big transistors is that this inverter has a much, much bigger input capacitance (200 fF vs. 50 fF).

9 Storage Element’s Timing Model
Setup time: the input must be stable BEFORE the triggering clock edge. Hold time: the input must REMAIN stable after the triggering clock edge. Clock-to-Q time: the output cannot change instantaneously at the triggering clock edge. Similar to delay in logic gates, it has two components: internal Clock-to-Q and load-dependent Clock-to-Q. Typical for this class: 1 ns setup, 0.5 ns hold. So far we have been looking at combinational logic; let's look at the timing characteristics of a storage element. The storage element you will use is a D-type flip-flop triggered on the negative clock edge. In order for the data to latch into the flip-flop correctly, the input must be stable slightly before the falling edge of the clock. This time is called the setup time. After the clock edge has arrived, the data must remain stable for a short amount of time AFTER the triggering clock edge; this is called the hold time. The output cannot change instantaneously at the triggering clock edge. The time it takes for the output to change to its new value after the clock is called the Clock-to-Q time. Similar to delay in logic gates, the Clock-to-Q time has two components: (a) the internal Clock-to-Q time, the time it takes the output to change if the output load is zero; and (b) the load-dependent Clock-to-Q time.

10 Clocking Methodology All storage elements are clocked by the same clock edge. The combinational logic block's inputs are updated at each clock tick, and all its outputs MUST be stable before the next clock tick. All of you should have taken a logic design class, so you should know how to do a synchronous design using a clock, but let's have a brief review. In this class, all your designs should have only one clock in them. Furthermore, all storage elements are clocked by the same clock edge, namely the falling clock edge. You should NOT try to use both edges of the clock, nor try to use any flip-flops that are level-sensitive instead of edge-sensitive (that is reserved for real-world designers!). If you follow this clocking methodology (all storage elements are clocked by the same edge), then the inputs to your combinational logic blocks come from the outputs of registers or from external inputs; consequently, they are updated at each clock tick. On the other side of the combinational logic block, the outputs are saved in another register. Therefore, the outputs must be stable before the next clock tick.

11 Critical Path & Cycle Time
Critical path: the slowest path between any two storage devices. Cycle time is a function of the critical path; it must be greater than: Clock-to-Q + longest path through combinational logic + setup. If you follow this simple clocking methodology, which uses the SAME clock edge for all storage devices, the critical path of your design is easy (well, at least in theory) to identify. More specifically, the critical path of your design is the slowest path from one storage device to another through the combinational logic. The cycle time of your design is a function of this critical path; more specifically, the cycle time must be greater than the sum of: (a) the Clock-to-Q time of the input register; (b) the longest delay through the combinational logic; and (c) the setup time of the output register. The key words here are "greater than," because if you set the clock cycle time exactly to this, chances are things will work most of the time but fail occasionally — usually when you have to run your demo for your customers. The additional thing you need to worry about is clock skew: due to different delays in the clock distribution network, two storage devices may end up seeing two slightly different clocks.
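The cycle-time bound is just a sum, so it is easy to sanity-check in code. This is an illustrative sketch; the numbers in the example (1 ns setup from the storage-element slide, 1.5 ns Clock-to-Q from the hold-time slide, a 10 ns logic path) are plugged in for demonstration, and the optional skew term anticipates the next slide.

```python
def min_cycle_time(clk_to_q_ns, longest_path_ns, setup_ns, clock_skew_ns=0.0):
    """Minimum clock period for a single-edge clocked design:
    Clock-to-Q + longest combinational path + setup (+ clock skew)."""
    return clk_to_q_ns + longest_path_ns + setup_ns + clock_skew_ns

# Class-typical numbers: 1.5 ns Clk-to-Q, 10 ns worst-case logic, 1 ns setup.
t = min_cycle_time(1.5, 10.0, 1.0)        # 12.5 ns minimum period
t_skewed = min_cycle_time(1.5, 10.0, 1.0, 0.5)  # skew adds directly: 13.0 ns
```

Note how skew adds straight into the bound — the reason high-speed designs fight to keep it under roughly 10% of the cycle time.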

12 Clock Skew’s Effect on Cycle Time
The worst-case scenario for cycle-time consideration: the input register sees CLK1, and the output register sees CLK2. Cycle Time − Clock Skew ≥ CLK-to-Q + Longest Delay + Setup => Cycle Time ≥ CLK-to-Q + Longest Delay + Setup + Clock Skew. Let's look at an example. Consider the worst case, where the input register sees the clock signal CLK1 and, due to the different delays through different parts of the clock distribution network, the output register sees the clock signal CLK2. Here CLK2 arrives at the output register slightly earlier than CLK1 arrives at the input register. Consequently, the minimum cycle time for this circuit to work is the sum of: (a) the Clock-to-Q time of the input register; (b) the longest delay path through the combinational logic; (c) the setup time of the output register; and (d) — the purpose of this slide — the clock skew of the clock distribution network. In your homework and lab assignments, you will probably be using a relatively slow clock, so clock skew is probably not a big problem. After you graduate, you may be lucky enough to find a job working on very high-speed digital design, where clock skew can be a major problem (clock skew is usually kept under 10% of the cycle time in very high-speed systems). In those high-speed designs, if you are not careful, the sum of the Clock-to-Q time, the setup time, and the clock skew can become a major part of your cycle time. Notice that if your flip-flops have lousy Clock-to-Q and setup times and your clock distribution is so poorly designed that the clock skew is big, then even with the fastest logic gates in the world, you still will not have a super-fast design. You can slow down the clock to "fix" a setup violation; there is not a whole lot you can do about a hold-time problem!

13 Tricks to Reduce Cycle Time
Reduce the number of gate levels. Use esoteric/dynamic timing methods. Pay attention to loading: one gate driving many gates is a bad idea; avoid using a small gate to drive a long wire; use multiple stages (e.g. INV4x buffers) to drive a large load (Clarge). Here are some common tricks you can use to reduce the cycle time. The most obvious way is to reduce the number of logic levels. Then you should also pay attention to loading. That is, you should: (a) avoid using one small gate to drive a large number of other gates; (b) avoid using a small gate to drive a long wire — whenever you have to drive a large capacitance, use multiple stages to drive it; (c) take advantage of the differences between gate types and the choice of active-high or active-low signalling conventions; (d) use advanced circuit-design techniques such as dynamic circuitry and precharging; and (e) use "cycle stealing."

14 How to Avoid Hold Time Violation?
Hold-time requirement: the input to a register must NOT change immediately after the clock tick. This is usually easy to meet in an edge-triggered clocking scheme; the hold time of most flip-flops is <= 0 ns. CLK-to-Q + shortest delay path must be greater than the hold time. So far, our cycle-time consideration has pretty much aimed at meeting the setup-time requirement. That is, we want to make sure our cycle time is LONG enough that the signal coming from the input register can propagate through the combinational logic and arrive at the output register at least one setup time before the next clock tick. Now you may ask yourself: "What about the hold-time requirement?" Recall that the hold-time requirement states that the input to a register must NOT change until one hold time AFTER the clock tick. This is usually easy to meet in our clocking scheme, in which all storage devices are triggered on the SAME clock edge. More specifically, as long as the sum of (a) the Clock-to-Q time of the input register and (b) the SHORTEST delay path through the combinational block is more than the hold time of the output registers, then NONE of these outputs will change BEFORE one hold time after the clock tick, and we will have no hold-time problem. Since the Clock-to-Q time of our storage devices is at least 1.5 ns, which is much bigger than the hold time (0.5 ns), we should NEVER have any hold-time violation — that is, as long as we don't have ANY clock skew.

15 Clock Skew’s Effect on Hold Time
The worst-case scenario for hold-time consideration: the input register sees CLK2, and the output register sees CLK1; the fast FF2 output must not change the input to FF1 for the same clock edge: (CLK-to-Q + Shortest Delay Path − Clock Skew) > Hold Time. But in the real world, there will be some clock skew. How does clock skew affect your hold-time consideration? Once again, let's look at the worst-case scenario. As far as hold time is concerned, the worst case occurs when the input register sees the clock signal CLK2 and, due to the different delays through different parts of the clock distribution network, the output register sees the clock signal CLK1. Here CLK2 arrives at the input register slightly earlier than CLK1 arrives at the output register. Consequently, we have to make sure that AFTER we subtract the clock skew from the sum of (a) the Clock-to-Q time of the input register and (b) the shortest delay path through the combinational logic, we STILL have a time GREATER than the hold-time requirement of the output registers.

16 Integrated Circuit Costs
Die cost = Wafer cost / (Dies per wafer x Die yield)
Dies per wafer = π x (Wafer_diam / 2)² / Die_Area − π x Wafer_diam / sqrt(2 x Die_Area) − Test dies ≈ Wafer Area / Die Area, minus an edge-loss term
Die yield = Wafer yield x (1 + Defects_per_unit_area x Die_Area / α)^(−α), assuming defects are randomly distributed
Die cost goes roughly as (die area)³ or (die area)⁴.
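The three cost formulas above can be chained into one small calculator. This is a sketch of the slides' model, not a production yield tool; the example plugs in the "typical" process quoted on the next slide (α = 2, 90% wafer yield, 2 defects/cm², 4 test sites, ~$2000 for an 8-inch wafer) with an assumed 1 cm² die.

```python
import math

def dies_per_wafer(wafer_diam_cm, die_area_cm2, test_dies=0):
    """pi*(d/2)^2 / A  minus the pi*d / sqrt(2A) edge-loss term, minus test dies."""
    usable = math.pi * (wafer_diam_cm / 2) ** 2 / die_area_cm2
    edge_loss = math.pi * wafer_diam_cm / math.sqrt(2 * die_area_cm2)
    return int(usable - edge_loss) - test_dies

def die_yield(wafer_yield, defects_per_cm2, die_area_cm2, alpha=2.0):
    """Wafer_yield * (1 + D*A/alpha)^(-alpha): random-defect yield model."""
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)

def die_cost(wafer_cost, wafer_diam_cm, die_area_cm2,
             wafer_yield, defects_per_cm2, test_dies=4):
    good = (dies_per_wafer(wafer_diam_cm, die_area_cm2, test_dies)
            * die_yield(wafer_yield, defects_per_cm2, die_area_cm2))
    return wafer_cost / good

# 8" (20 cm) wafer at ~$2000, 1 cm^2 die, 90% wafer yield, 2 defects/cm^2:
cost = die_cost(2000.0, 20.0, 1.0, 0.90, 2.0)   # roughly $34 per good die
```

Because die yield falls and dies per wafer falls as the die grows, cost rises much faster than area — the "(die area)³ or (die area)⁴" rule of thumb.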

17 Good Dice Per Wafer (Before Testing!)
Raw dice per wafer and die yield, tabulated by wafer diameter (6"/15 cm, 8"/20 cm, 10"/25 cm) against die area (mm²); the die yields fall from 23% to 19%, 16%, 12%, 11%, and 10% as die area grows. Typical CMOS process: α = 2, wafer yield = 90%, defect density = 2/cm², 4 test sites per wafer. Typical cost of an 8", 4-metal-layer, 0.5 µm CMOS wafer: ~$2000.

18 Real World Examples. Columns: chip, metal layers, line width, wafer cost, defects/cm², area (mm²), dies/wafer, yield, die cost. Die costs: 386DX $4, 486DX $12, PowerPC 601 $53, HP PA 7100 $73, DEC Alpha $149, SuperSPARC $272, Pentium $417. From "Estimating IC Manufacturing Costs," by Linley Gwennap, Microprocessor Report, August 2, 1993, p. 15. (To do: redo the table for smaller line widths and with volume figures.)

19 Packaging Cost: depends on pins, heat dissipation
Other Costs. IC cost = (Die cost + Testing cost + Packaging cost) / Final test yield. Packaging cost depends on pin count and heat dissipation. Columns: chip, die cost, pins, package type, package cost, test & assembly, total cost. 386DX: $4 die, QFP, $1 package, $4 test & assembly, $9 total. 486DX2: $12, PGA, $11, $12, $35 total. PowerPC 601: $53, QFP, $3, $21, $77 total. HP PA 7100: $73, PGA, $35, $16, $124 total. DEC Alpha: $149, PGA, $30, $23, $202 total. SuperSPARC: $272, PGA, $20, $34, $326 total. Pentium: $417, PGA, $19, $37, $473 total.

20 Administrative Matters
Prerequisite exam results: fairly good! If you got ≤ 55, you must talk with me; everyone else is automatically in the class. Some problems with the robot state machine. Trick: for the left-wall algorithm, after each move forward: turn left; if unblocked, move forward. Else turn right; if unblocked, move forward. Else turn right again; if unblocked, move forward (guaranteed). Do the opposite for the right-wall algorithm. Read Chapter 4: ALU, multiply, divide, FP multiply. We hope to have the problems with the lab fixed; extension to Wednesday, 9/17 for Lab 2. Get going on your testing methodology! You will be graded on how thorough your testing technique is, in addition to whether or not you find the bugs. Sections are a bit unbalanced — can we get more people into the 2-4 section?

21 The Design Process "To Design Is To Represent"
Design activity yields a description/representation of an object. -- The traditional craftsman does not distinguish between the conceptualization and the artifact. -- The separation comes about because of complexity. -- The concept is captured in one or more representation languages. -- This process IS design. Design begins with requirements: -- Functional capabilities: what it will do. -- Performance characteristics: speed, power, area, cost, . . . One way to think about design is: to design is to represent. The result of your design activity is a representation of the object you are designing. Every design begins with a set of requirements. First is the functional requirement: a statement of WHAT it will do. Then there is a list of performance requirements: what speed it needs to run at, how much power it is allowed to consume, how big your design can be, how much it will cost, etc. — COST/PERFORMANCE.

22 Design is a "creative process," not a simple method
Design Process (cont.) Design finishes as assembly (CPU -> Datapath, Control -> ALU, Regs, Shifter -> NAND gate): -- The design is understood in terms of components and how they have been assembled. -- Top-down decomposition of complex functions (behaviors) into more primitive functions. -- Bottom-up composition of primitive building blocks into more complex assemblies. One of the fun parts of being a designer is that you get to be a little kid playing Lego again, except this time you will be designing the building blocks as well as putting the building blocks together. The two approaches you will use are top-down and bottom-up. You use the top-down approach to decompose a complex function into primitive functions. After the primitive functions are implemented, you then integrate them back together to implement the original complex function. For example, when you design a CPU, you use the top-down approach to break the CPU into these primitive blocks; once you have the blocks implemented, you put them together to form the CPU. That case is pretty clean-cut. In many other design problems, you cannot just apply top-down and then bottom-up once; you need to repeat the process several times, because design is a creative process, NOT a simple method. Top-down and bottom-up work together.

23 Design Refinement Informal System Requirement Initial Specification
Intermediate Specification -> Final Architectural Description -> Intermediate Specification of Implementation -> Final Internal Specification -> Physical Implementation. Refinement means an increasing level of detail. Unless you are a real genius like Mozart, who could do everything right the first time, the "creative" process means successive refinement. You start out with an informal system requirement and keep refining it until you have your final physical implementation. As you refine your design, you keep increasing the level of detail in the specification — from a high-level function (add) down to gates.

24 Design as Search Feasible (good) choices vs. Optimal choices Problem A
Problem A branches into Strategy 1 and Strategy 2, which branch into SubProblems 1-3, which are built from building blocks BB1, BB2, BB3, ... BBn. Design involves educated guesses and verification: -- Given the goals, how should they be prioritized? -- Given alternative design pieces, which should be selected? -- Given a design space of components and assemblies, which part will yield the best solution? Feasible (good) choices vs. optimal choices. One way to think about the design process is that it is a search for the proper solution through the design space. How do you know where to find the proper solution? Well, usually you don't. What you need to do is make educated guesses and then verify whether your guesses are correct. If you are correct, you congratulate yourself; if you are wrong, try again. You will have a set of design goals: some are given to you by your supervisors, and some you may set on your own. In any case, with a set of goals — some possibly contradictory — you must learn how to prioritize them. The thing to remember about design is that there are many ways to do the same thing; there is really no such thing as the absolute "right way" to do certain things. Ideally, we would like to always pick the best solution, but remember that your goal should be the best solution for the ORIGINAL problem (Problem A). For the sub-problems down here, you may not need to make the optimal choice each time; sometimes all you need is a reasonably good choice. It takes design time to make the best choice at every level, and running out of design time can jeopardize a project in a fast-moving technology. If you have a choice that is good enough for a sub-problem, you should be happy with it and move on to the other sub-problems that require your attention. Remember, even the world's fastest ALU will not do you any good unless you have an equally fast controller to control it.

25 Problem: Design a “fast” ALU for the MIPS ISA
Requirements? Must support the arithmetic/logic operations. Tradeoffs of cost and speed based on frequency of occurrence and hardware budget.

26 Add, AddU, Sub, SubU, AddI, AddIU And, Or, AndI, OrI, Xor, Xori, Nor
MIPS ALU requirements: Add, AddU, Sub, SubU, AddI, AddIU => 2's complement adder/subtractor with overflow detection. And, Or, AndI, OrI, Xor, XorI, Nor => logical AND, OR, XOR, NOR. SLTI, SLTIU (set less than) => 2's complement adder with inverter; check the sign bit of the result. The ALU from CS 150 / the P&H book, Chapter 4, supports these operations.

27 MIPS arithmetic instruction format
R-type: | op (31-26) | Rs (25-21) | Rt (20-16) | Rd (15-11) | ... | funct (5-0) |. I-type: | op | Rs | Rt | Immed (16 bits) |. Immediate ops (op in hex): ADDI 10, ADDIU 11, SLTI 12, SLTIU 13, ANDI 14, ORI 15, XORI 16, LUI 17. R-type ops (op 00, funct in hex): ADD 40, ADDU 41, SUB 42, SUBU 43, AND 44, OR 45, XOR 46, NOR 47; funct 50, 51; SLT 52, SLTU 53. Signed arithmetic generates overflow; there is no carry.

28 Design Trick: divide & conquer
Break the problem into simpler problems, solve them, and glue together the solution. Example: assume the immediates have been taken care of before the ALU — 10 operations, encoded in 4 bits: 00 add, 01 addU, 02 sub, 03 subU, 04 and, 05 or, 06 xor, 07 nor, 12 slt, 13 sltU.

29 Refined Requirements ALU (1) Functional Specification
inputs: two 32-bit operands A and B, plus a 4-bit mode m. outputs: 32-bit result S, 1-bit carry c, 1-bit overflow ovf. operations: add, addu, sub, subu, and, or, xor, nor, slt, sltU. (2) Block diagram (Powerview symbol, VHDL entity): 32-bit A and B in, 4-bit m in; 32-bit S out, plus c and ovf.

30 Behavioral Representation: VHDL
entity ALU is
  generic (c_delay: time := 20 ns;
           S_delay: time := 20 ns);
  port (signal A, B: in  vlbit_vector (0 to 31);
        signal m:    in  vlbit_vector (0 to 3);
        signal S:    out vlbit_vector (0 to 31);
        signal c:    out vlbit;
        signal ovf:  out vlbit);
end ALU;
. . .
S <= A + B;
c_delay is the carry delay; S_delay is the delay for the sum (S). Some signals are bit vectors (A, B, S, m); some are single bits (c, ovf).

31 Bit slice with carry look-ahead . . .
Design decisions for the ALU bit slice: implement it as a 7-to-2 combinational block, as a 3-to-2 combinational block plus a mux, as a PLD, or as discrete gates (CL0 ... CL6). A simple bit slice is one big combinational problem; alternatively, break it into many little combinational problems, or partition it into a 2-step problem. A bit slice with carry look-ahead . . .

32 Refined Diagram: bit-slice ALU
The 32-bit operands A and B fan out to bit slices ALU0 (a0, b0) through ALU31 (a31, b31). Each slice takes the 4-bit mode M and a carry-in cin, and produces a sum bit (s0 ... s31) and a carry-out co; the carries ripple from slice to slice, Ovflw comes out of the top slice, and the 32 sum bits form S.

33 7-to-2 Combinational Logic
Start turning the crank: write the function as a truth table (inputs M0, M1, M2, M3, A, B, Cin; outputs S, Cout), filling in a row such as add, then all 2^7 = 128 input combinations (indices 0-127) that you want, and minimize with K-maps.

34 Design trick 3: solve part of the problem and extend
Seven gates plus a MUX? Design trick 2: take pieces you know (or can imagine) and try to put them together. Design trick 3: solve part of the problem and extend. The cell: a 1-bit full adder (A, B, CarryIn in; CarryOut out) feeding a MUX that S-select chooses among add, and, or to produce Result. Now that I have shown you how to build a 1-bit full adder, we have all the major components needed for this 1-bit ALU. In order to build a 4-bit ALU, we simply connect four 1-bit ALUs in series, feeding the CarryOut of one ALU into the CarryIn of the next. Even though I called this an ALU, I actually lied a little: there is something missing. This ALU canNOT perform the subtract operation. Let's see how we can fix this problem.

35 Additional operations
A − B = A + (−B) = A + ~B + 1: form the two's complement by inverting B and adding one. The revised cell adds an invert control on B ahead of the 1-bit full adder; CarryIn, the S-select MUX (and, or, add -> Result), and CarryOut are as before. Set-less-than? Left as an exercise.
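The invert-and-add-one trick can be demonstrated with a tiny bit-level simulation of the ripple design. A sketch only — the slides describe hardware, not this code — and the `alu4` helper name and its operation strings are made up for the illustration; the full-adder equations and the "invert B, inject CarryIn = 1" subtraction are straight from the slides.

```python
def full_adder(a, b, cin):
    """One bit slice: sum = a ^ b ^ cin, carry-out by majority."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def alu4(a, b, op):
    """4-bit ripple ALU sketch, op in {'add', 'sub', 'and', 'or'}.
    Subtraction inverts B at every slice and injects CarryIn = 1,
    implementing A - B = A + ~B + 1."""
    result, carry = [], 1 if op == 'sub' else 0
    for i in range(4):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        if op == 'sub':
            bi ^= 1                       # the invert control on B
        if op in ('add', 'sub'):
            s, carry = full_adder(ai, bi, carry)
        elif op == 'and':
            s = ai & bi
        else:
            s = ai | bi
        result.append(s)
    return sum(bit << i for i, bit in enumerate(result)), carry

val, _ = alu4(6, 2, 'sub')   # 6 - 2 = 4
```

The carry rippling through `carry` is exactly the CarryOut-to-CarryIn chain between the four 1-bit slices.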

36 LSB and MSB need to do a little extra
Revised Diagram LSB and MSB need to do a little extra 32 A B 32 a0 b0 a31 b31 4 ALU0 ALU31 M ? co cin co cin s0 s31 C/L to produce select, comp, c-in Ovflw 32 S 9/12/01 ©UCB Fall 2001

37 Overflow Examples: 7 + 3 = 10 but ... – 4 – 5 = – 9 but ...
Decimal / Binary and 2’s Complement: 0 = 0000; 1 = 0001, –1 = 1111; 2 = 0010, –2 = 1110; 3 = 0011, –3 = 1101; 4 = 0100, –4 = 1100; 5 = 0101, –5 = 1011; 6 = 0110, –6 = 1010; 7 = 0111, –7 = 1001; –8 = 1000 Examples: 7 + 3 = 10 but ... – 4 – 5 = – 9 but ... Well so far so good, but life is not always perfect. Let’s consider the case 7 plus 3: you should get 10. But if you perform the binary arithmetic on our 4-bit adder you will get 1010, which is negative 6. Similarly, if you try to add negative 4 and negative 5 together, you should get negative 9. But the binary arithmetic will give you 0111, which is 7. So what went wrong? The problem is overflow. The numbers you get are simply too big, in the positive 10 case, and too small, in the negative 9 case, to be represented by four bits. +2 = 39 min. (Y:19) In binary: 0111 (7) + 0011 (3) = 1010 (–6); 1100 (–4) + 1011 (–5) = 0111 (7). 9/12/01 ©UCB Fall 2001

38 Overflow Detection Overflow: the result is too large (or too small) to represent properly Example: – 8 ≤ 4-bit binary number ≤ 7 When adding operands with different signs, overflow cannot occur! Overflow occurs when adding: 2 positive numbers and the sum is negative 2 negative numbers and the sum is positive On your own: Prove you can detect overflow by: Carry into MSB ≠ Carry out of MSB Recall from some earlier slides that the biggest positive number you can represent using 4 bits is 7 and the smallest negative number is negative 8. So any time your addition results in a number bigger than 7 or less than negative 8, you have an overflow. Keep in mind that whenever you add two numbers with different signs, that is, a negative number to a positive number, overflow can NOT occur. Overflow occurs when you add two positive numbers together and the sum has a negative sign, or when you add two negative numbers together and the sum has a positive sign. If you spend some time, you can convince yourself that if the Carry into the most significant bit is NOT the same as the Carry coming out of the MSB, you have an overflow. +2 = 41 min. (Y:21) In binary: 0111 (7) + 0011 (3) = 1010 (–6); 1100 (–4) + 1011 (–5) = 0111 (7). 9/12/01 ©UCB Fall 2001

39 Overflow Detection Logic
Carry into MSB ≠ Carry out of MSB For an N-bit ALU: Overflow = CarryIn[N – 1] XOR CarryOut[N – 1] CarryIn0 A0 1-bit ALU Result0 X Y X XOR Y B0 A1 B1 1-bit ALU Result1 CarryIn1 CarryOut1 CarryOut0 CarryIn2 A2 1-bit ALU Result2 Recall the XOR gate implements the not-equal function: that is, its output is 1 only if the inputs have different values. Therefore all we need to do is connect the carry into the most significant bit and the carry out of the most significant bit to the XOR gate. Then the output of the XOR gate will give us the Overflow signal. +1 = 42 min. (Y:22) B2 CarryIn3 Overflow A3 1-bit ALU Result3 B3 CarryOut3 9/12/01 ©UCB Fall 2001
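A small Python model of this detection rule, reporting overflow as the XOR of the carry into and the carry out of the MSB. The function name `add_with_flags` and the 4-bit width are my choices for illustration:

```python
N = 4  # datapath width, matching the slide examples

def add_with_flags(a, b):
    """N-bit add returning (sum, overflow). Overflow is detected as
    CarryIn[N-1] XOR CarryOut[N-1], exactly as on the slide."""
    cin, s, overflow = 0, 0, 0
    for i in range(N):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        si = ai ^ bi ^ cin
        cout = (ai & bi) | (ai & cin) | (bi & cin)
        if i == N - 1:
            overflow = cin ^ cout   # carry into MSB != carry out of MSB
        s |= si << i
        cin = cout
    return s, overflow
```

Running it on the slide's examples, 7 + 3 produces the bit pattern 1010 with the overflow flag set, and –4 + –5 (12 + 11 as unsigned patterns) produces 0111 with the flag set.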

40 LSB and MSB need to do a little extra
More Revised Diagram LSB and MSB need to do a little extra 32 A B 32 signed-arith and cin xor co a0 b0 a31 b31 4 ALU0 ALU31 M co cin co cin s0 s31 C/L to produce select, comp, c-in Ovflw 32 S 9/12/01 ©UCB Fall 2001

41 But What about Performance?
Critical Path of an n-bit ripple-carry adder is n*CP A0 B0 1-bit ALU Result0 CarryIn0 CarryOut0 A1 B1 Result1 CarryIn1 CarryOut1 A2 B2 Result2 CarryIn2 CarryOut2 A3 B3 Result3 CarryIn3 CarryOut3 Design Trick: Throw hardware at it 9/12/01 ©UCB Fall 2001
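The linear critical path is easy to see in a bit-serial Python sketch (the name `ripple_add` is mine): each loop iteration plays the role of one full adder and cannot start until the previous iteration produces its carry, so the path length is n full-adder delays.

```python
def ripple_add(a, b, n):
    """n-bit ripple-carry add. The loop mirrors the hardware carry
    chain: stage i needs stage i-1's carry, so delay grows as n * CP."""
    cin, s = 0, 0
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        s |= (ai ^ bi ^ cin) << i
        cin = (ai & bi) | (ai & cin) | (bi & cin)
    return s | (cin << n)   # append the final carry-out
```

Doubling n doubles the worst-case delay, which is exactly what carry lookahead, carry select, and carry skip on the next slides attack.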

42 Carry Look Ahead (Design trick: peek)
C0 = Cin A B C-out: 0 0 → 0 “kill”; 0 1 → C-in “propagate”; 1 0 → C-in “propagate”; 1 1 → 1 “generate” A0 B0 A1 B1 A2 B2 A3 B3 S G P G = A and B P = A xor B C1 = G0 + C0·P0 C2 = G1 + G0·P1 + C0·P0·P1 C3 = G2 + G1·P2 + G0·P1·P2 + C0·P0·P1·P2 C4 = . . . Names: suppose G0 is 1 => carry no matter what else => generates a carry suppose G0 = 0 and P0 = 1 => carry IFF C0 is a 1 => propagates a carry Like dominoes What about more than 4 bits? G P 9/12/01 ©UCB Fall 2001
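The lookahead equations can be checked with a short Python model (the name `cla4` is mine, not from the slides). The point is that every carry is a flat two-level expression in the Gi, Pi, and C0, so no carry waits on a ripple through earlier bits:

```python
def cla4(a, b, c0):
    """4-bit carry lookahead sketch: returns (4-bit sum, C4)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]     # Gi = Ai and Bi
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]   # Pi = Ai xor Bi
    c1 = g[0] | (c0 & p[0])
    c2 = g[1] | (g[0] & p[1]) | (c0 & p[0] & p[1])
    c3 = g[2] | (g[1] & p[2]) | (g[0] & p[1] & p[2]) | (c0 & p[0] & p[1] & p[2])
    c4 = (g[3] | (g[2] & p[3]) | (g[1] & p[2] & p[3])
          | (g[0] & p[1] & p[2] & p[3])
          | (c0 & p[0] & p[1] & p[2] & p[3]))
    carries = [c0, c1, c2, c3]
    s = sum((p[i] ^ carries[i]) << i for i in range(4)) # Si = Pi xor Ci
    return s, c4
```

For more than 4 bits, the same trick is applied hierarchically, which is exactly the cascaded 16-bit scheme a few slides later.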

43 Plumbing as Carry Lookahead Analogy
9/12/01 ©UCB Fall 2001

44 Cascaded Carry Look-ahead (16-bit): Abstraction
G0 P0 C1 = G0 + C0·P0 4-bit Adder C2 = G1 + G0·P1 + C0·P0·P1 4-bit Adder C3 = G2 + G1·P2 + G0·P1·P2 + C0·P0·P1·P2 4-bit Adder G P C4 = . . . 9/12/01 ©UCB Fall 2001

45 2nd level Carry, Propagate as Plumbing
9/12/01 ©UCB Fall 2001

46 Design Trick: Guess (or “Precompute”)
CP(2n) = 2*CP(n) n-bit adder n-bit adder CP(2n) = CP(n) + CP(mux) n-bit adder 1 n-bit adder n-bit adder Use multiplexor to save time: guess both ways and then select (assumes mux is faster than adder) Cout Carry-select adder 9/12/01 ©UCB Fall 2001
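One way to see the guess-and-select trick concretely is to model it: compute the upper half twice, once per carry guess, and let the lower half's real carry-out drive the mux. This is an illustrative sketch (the names `carry_select` and `add_k` are mine; `add_k` is just an arithmetic stand-in for an n/2-bit adder block):

```python
def add_k(a, b, cin, k):
    """Stand-in for a k-bit adder block: returns (k-bit sum, carry-out)."""
    t = a + b + cin
    return t & ((1 << k) - 1), t >> k

def carry_select(a, b, n):
    """Carry-select adder: both upper-half results are computed in
    parallel with the lower half; a mux picks when the real carry
    arrives, so CP(2n) = CP(n) + CP(mux)."""
    h = n // 2
    mask = (1 << h) - 1
    lo, c = add_k(a & mask, b & mask, 0, h)
    hi0, co0 = add_k(a >> h, b >> h, 0, h)      # guess CarryIn = 0
    hi1, co1 = add_k(a >> h, b >> h, 1, h)      # guess CarryIn = 1
    hi, cout = (hi1, co1) if c else (hi0, co0)  # the select mux
    return lo | (hi << h), cout
```

The price is one extra n/2-bit adder, the classic "throw hardware at it" trade.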

47 Carry Skip Adder: reduce worst case delay
B A4 B A0 4-bit Ripple Adder 4-bit Ripple Adder P3 S P3 S P2 P2 P1 P1 P0 P0 Just speed up the slowest case for each block Exercise: optimal design uses variable block sizes 9/12/01 ©UCB Fall 2001
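A behavioral sketch of the skip idea (the name `carry_skip` is mine): each block ANDs its propagate signals; when every bit propagates, the incoming carry is forwarded through a bypass mux instead of trickling through the block's ripple chain. The result is identical either way, since a propagate-all block cannot generate a carry; the gain is purely in worst-case delay.

```python
def carry_skip(a, b, n=8, blk=4):
    """Carry-skip adder model: returns (n-bit sum, carry-out)."""
    cin, s = 0, 0
    for base in range(0, n, blk):
        # block-level P: every bit position propagates (Ai xor Bi = 1)
        p_all = all(((a >> i) ^ (b >> i)) & 1 for i in range(base, base + blk))
        c = cin
        for i in range(base, base + blk):   # ripple inside the block
            ai, bi = (a >> i) & 1, (b >> i) & 1
            s |= (ai ^ bi ^ c) << i
            c = (ai & bi) | (ai & c) | (bi & c)
        # skip mux: a propagate-all block forwards cin on the fast path
        cin = cin if p_all else c
    return s, cin
```

The exercise on the slide (variable block sizes) amounts to balancing the ripple-in and skip-through paths across blocks.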

48 Additional MIPS ALU requirements
Mult, MultU, Div, DivU (next lecture) => Need 32-bit multiply and divide, signed and unsigned Sll, Srl, Sra (next lecture) => Need left shift, right shift, right shift arithmetic by 0 to 31 bits Nor (leave as exercise to reader) => logical NOR, or use 2 steps: (A OR B) XOR 1111...1111 9/12/01 ©UCB Fall 2001
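The two-step NOR is a one-liner to verify (the helper name `nor_two_step` is mine): XOR with all-ones is bitwise inversion, so (A OR B) XOR 1...1 equals not(A or B).

```python
MASK32 = 0xFFFFFFFF  # 32-bit datapath

def nor_two_step(a, b):
    """NOR built from ops the ALU already has: (A OR B) XOR all-ones."""
    return (a | b) ^ MASK32
```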

49 Elements of the Design Process
Divide and Conquer (e.g., ALU) Formulate a solution in terms of simpler components. Design each of the components (subproblems) Generate and Test (e.g., ALU) Given a collection of building blocks, look for ways of putting them together that meet the requirements Successive Refinement (e.g., carry lookahead) Solve "most" of the problem (i.e., ignore some constraints or special cases), examine and correct shortcomings. Formulate High-Level Alternatives (e.g., carry select) Articulate many strategies to "keep in mind" while pursuing any one approach. Work on the Things you Know How to Do The unknown will become “obvious” as you make progress. Here are some key elements of the design process. First is divide and conquer. (a) First you formulate a solution in terms of simpler components. (b) Then you concentrate on designing each component. Once you have the individual components built, you need to find a way to put them together to solve the original problem. Unless you are really good or really lucky, you probably won’t have a perfect solution the first time, so you will need to apply successive refinement to your design. While you are pursuing any one approach, you need to keep alternate strategies in mind in case what you are pursuing does not work out. One of the most important pieces of advice I can give you is to work on the things you know how to do first. As you make forward progress, a lot of the unknowns will become clear. If you sit around and wait until you know everything before you start, you will never get anything done. +2 = 15 min. (X:55) 9/12/01 ©UCB Fall 2001

50 Summary of the Design Process
Hierarchical Design to manage complexity Top Down vs. Bottom Up vs. Successive Refinement Importance of Design Representations: Block Diagrams Decomposition into Bit Slices Truth Tables, K-Maps Circuit Diagrams Other Descriptions: state diagrams, timing diagrams, reg xfer, . . . Optimization Criteria: Gate Count [Package Count] top down bottom up mux design meets at TT This slide summarizes some of the key points of the design process. Using a hierarchical design style is the best way to manage complexity because it allows you to ignore low-level details while concentrating on the big picture. However, you cannot ignore the details forever. That’s why you need to use both the top-down and bottom-up strategies as you make successive refinements to your design. Some of the optimization criteria are: chip or board area, pin count if you are designing a chip, delay, power, cost, and last but not least, the design time. As I pointed out in the last lecture, the computer market is so competitive that if your product is late by a year, you may fall way behind in the performance curve, so design time can be one of the most important considerations. +2 = 57 min. (X:57) Area Logic Levels Fan-in/Fan-out Delay Power Pin Out Cost Design time 9/12/01 ©UCB Fall 2001

51 Why should you keep a design notebook?
Keep track of the design decisions and the reasons behind them Otherwise, it will be hard to debug and/or refine the design Write it down so that you can remember it in a long project: 2 weeks -> 2 yrs Others can review the notebook to see what happened Record insights you have on certain aspects of the design as they come up Record of the different design & debug experiments Memory can fail when very tired Industry practice: learn from others' mistakes Well, the goal of this part of the lecture is to convince each of you that you should keep your OWN design notebook. Why? Well, first of all, you need to keep track of all the design decisions you made and, maybe more importantly, the reasons behind your design decisions. This may not be that important when your project life span is only a few weeks, but after you graduate you will work on projects that last for 2 to 3 years. And if you don’t write things down, you may not remember how you did certain things and why, and you may find it very hard to debug and refine your design. Also, sometimes when you are working on a certain part of the design, you may suddenly get some insights on another part of the design. You may not have time to follow up your insights immediately, and if you don’t write them down, you may never be able to reconstruct them later when you have time. Finally, it is very important for you to write down everything you see in the tests or experiments you run when you are debugging your design. +2 = 59 min. (Y:39) 9/12/01 ©UCB Fall 2001

52 Why do we keep it on-line?
You need to force yourself to take notes Open a window and leave an editor running while you work 1) Acts as reminder to take notes 2) Makes it easy to take notes 1) + 2) => will actually do it Take advantage of the window system’s “cut and paste” features It is much easier to read your typing than your writing Also, paper log books have problems Limited capacity => end up with many books May not have the right book with you at the time, vs. networked screens Can use the computer to search/index files to find what you're looking for The next question some of you may want to ask is: OK, I will keep a notebook, but why should I keep it on-line? Well, let’s be honest with ourselves. All of us need a little reminder to force ourselves to take notes while we work. One of the best reminders I have found is the window system of a modern workstation. By keeping an extra window open with an editor running, taking notes becomes very easy, and the editor also serves as a constant reminder for you to take notes. Also, by keeping your notebook on-line, you can take advantage of the window system’s cut and paste feature to drop important “print outs” into your notebook. Finally, although you may be able to read your own handwriting much better than anybody else, it is still easier to read your own typing than your own writing. +2 = 61 min. (Y:41) 9/12/01 ©UCB Fall 2001

53 Separate the entries by dates
How should you do it? Keep it simple DON’T make it so elaborate that you won’t use it (fonts, layout, ...) Separate the entries by dates type the “date” command in another window and cut&paste Start the day with the problems going to work on today Record output of simulation into log with cut&paste; add date May help sort out which version of simulation did what Record key email with cut&paste Record of what works & doesn’t helps team decide what went wrong after you left Index: write a one-line summary of what you did at end of each day How should you keep your on-line notebook? By all means, Keep It Simple. The on-line notebook should help you trace down and solve your problems. It should NOT become one of your problems. In order to keep the notebook easy to read, you should separate your entries by dates. Furthermore, before you sign off each day, you should write a one-line summary of what you did, and this will serve as the index to your notebook. Let me show you some examples. +2 = 63 min. (Y:43) 9/12/01 ©UCB Fall 2001

54 On-line Notebook Example
Refer to the handout “Example of On-Line Log Book” on cs 152 home page Spend 10 minutes on the notebook example: 6 minutes per page. +12 = 75 min. (Y:55) 9/12/01 ©UCB Fall 2001

55 1st page of On-line notebook (Index + Wed. 9/6/95)
Wed Sep 6 00:47:28 PDT 1995 Created the 32-bit comparator component Thu Sep 7 14:02:21 PDT 1995 Tested the comparator Mon Sep 11 12:01:45 PDT 1995 Investigated bug found by Bart in comp32 and fixed it + ==================================================================== Wed Sep 6 00:47:28 PDT 1995 Goal: Lay out the schematic for a 32-bit comparator I've laid out the schematics and made a symbol for the comparator. I named it comp32. The files are ~/wv/proj1/sch/comp32.sch ~/wv/proj1/sch/comp32.sym Wed Sep 6 02:29:22 PDT 1995 - ==================================================================== Add 1-line index at front of log file at end of each session: date+summary Start with date, time of day + goal Make comments during day, summary of work End with date, time of day (and add 1-line summary at front of file) 9/12/01 ©UCB Fall 2001

56 2nd page of On-line notebook (Thursday 9/7/95)
+ ==================================================================== Thu Sep 7 14:02:21 PDT 1995 Goal: Test the comparator component I've written a command file to test comp32. I've placed it in ~/wv/proj1/diagnostics/comp32.cmd. I ran the command file in viewsim and it looks like the comparator is working fine. I saved the output into a log file called ~/wv/proj1/diagnostics/comp32.log Notified the rest of the group that the comparator is done. Thu Sep 7 16:15:32 PDT 1995 - ==================================================================== 9/12/01 ©UCB Fall 2001

57 3rd page of On-line notebook (Monday 9/11/95)
+ ==================================================================== Mon Sep 11 12:01:45 PDT 1995 Goal: Investigate bug discovered in comp32 and hopefully fix it Bart found a bug in my comparator component. He left the following email. From Sun Sep 10 01:47: Received: by wayne.manor (NX5.67e/NX3.0S) id AA00334; Sun, 10 Sep 95 01:47: Date: Wed, 10 Sep 95 01:47: From: Bart Simpson To: Subject: [cs152] bug in comp32 Status: R Hey Bruce, I think there's a bug in your comparator. The comparator seems to think that ffffffff and fffffff7 are equal. Can you take a look at this? Bart 9/12/01 ©UCB Fall 2001

58 4th page of On-line notebook (9/11/95 contd)
I verified the bug. here's a viewsim of the bug as it appeared.. (equal should be 0 instead of 1) SIM>stepsize 10ns SIM>v a_in A[31:0] SIM>v b_in B[31:0] SIM>w a_in b_in equal SIM>a a_in ffffffff\h SIM>a b_in fffffff7\h SIM>sim time = ns A_IN=FFFFFFFF\H B_IN=FFFFFFF7\H EQUAL=1 Simulation stopped at 10.0ns. Ah. I've discovered the bug. I mislabeled the 4th net in the comp32 schematic. I corrected the mistake and re-checked all the other labels, just in case. I re-ran the old diagnostic test file and tested it against the bug Bart found. It seems to be working fine. hopefully there aren’t any more bugs:) 9/12/01 ©UCB Fall 2001

59 5th page of On-line notebook (9/11/95 contd)
On second inspection of the whole layout, I think I can remove one level of gates in the design and make it go faster. But who cares! The comparator is not in the critical path right now. The delay through the ALU is dominating the critical path, so unless the ALU gets a lot faster, we can live with a less-than-optimal comparator. I emailed the group that the bug has been fixed Mon Sep 11 14:03:41 PDT 1995 - ==================================================================== Perhaps later the critical path changes; what was the idea to make the comparator faster? Check the log book! 9/12/01 ©UCB Fall 2001

60 Added benefit: cool post-design statistics
Sample graph from the Alewife project: For the Communications and Memory Management Unit (CMMU) These statistics came from the on-line record of bugs 9/12/01 ©UCB Fall 2001

61 An Overview of the Design Process
Lecture Summary Cost and Price Die size determines chip cost: cost ∝ die size^(α + 1) Cost v. Price: business model of company, pay for engineers R&D must return $8 to $14 for every $1 invested An Overview of the Design Process Design is an iterative process, multiple approaches to get started Do NOT wait until you know everything before you start Example: Instruction Set drives the ALU design On-line Design Notebook Open a window and keep an editor running while you work; cut&paste Refer to the handout as an example Former CS 152 students (and TAs) say they use an on-line notebook for programming as well as hardware design; one of the most valuable skills 9/12/01 ©UCB Fall 2001

